Skip to content

Add BENCHMARK_RANK_TRUNCATION: compile-method-independent fixed-limit truncation detector#275

Draft
nikhilbarhate99 wants to merge 1 commit into
gpu-mode:mainfrom
nikhilbarhate99:feat/benchmark-rank-truncation-detector
Draft

Add BENCHMARK_RANK_TRUNCATION: compile-method-independent fixed-limit truncation detector#275
nikhilbarhate99 wants to merge 1 commit into
gpu-mode:mainfrom
nikhilbarhate99:feat/benchmark-rank-truncation-detector

Conversation

@nikhilbarhate99

Copy link
Copy Markdown

Problem

Two existing fixture-specialization detectors can be evaded, letting submissions that classify the input's numerical structure and then truncate the computation (factor fewer columns/rows than the full problem) pass as clean:

  • BENCHMARK_UNSAFE_ALGO_DISPATCH requires raw_native_cache, which needs a literal nvcc build and explicitly excludes load_inline. A submission compiled via torch.utils.cpp_extension.load_inline has no nvcc string, so this rule can't fire even when it classify-then-truncates.
  • INPUT_STRUCTURE_TRUNCATION_DISPATCH requires subset_set >= 2 (gather/scatter), so a submission that classifies and truncates the whole batch uniformly (no per-element gather) stays under the threshold.

Observed in the wild on the qr_v2 (linalg QR) leaderboard: submissions probe rank structure (classify_512/1024) and truncate the QR (_cqr_blocked_limit(384), tau[:, limit:]=0, loop range(0, limit) with limit<n), skipping columns that are ~zero for the benchmark inputs — and pass KernelGuard.

Fix

New detector BENCHMARK_RANK_TRUNCATION (auto_filter, fixture_specialization family) that is:

  • Compile-method-independent — does not depend on the nvcc/load_inline distinction.
  • Value-independent / problem-agnostic — catches arbitrary fixed-limit truncation via fingerprints that a legitimate full factorization never has: output tail-zeroing (X[...k:] = 0), limit < n, and a loop bounded by limit (range(0, limit, ...)) — plus classifier / data-family / per-matrix-limit / hardcoded-rank signals. It is not restricted to the qr_v2 ranks {256,384,768}.
  • Precision-preserving — exempts a legitimate rank-revealing QR, whose truncation bound is derived adaptively at runtime (matrix_rank, (diag>tol).sum(), count_nonzero); the value-independent fixed-limit rule is suppressed when an adaptive-rank computation is present.

Validation

  • Flips known load_inline truncators to hacked / should_filter=True.
  • Zero false positives on confirmed-clean builds: a full blocked Householder QR, an adaptive rank-revealing QR (including one that tail-zeros with a data-derived rank), and a clean reference submission.

Marking as draft for discussion — happy to adjust thresholds or move it to telemetry-first + enrichment per the precision-first policy in the blog.

…rary fixed-limit truncation detector

Catches input-structure classification + truncation regardless of compile method (closes the
load_inline evasion of BENCHMARK_UNSAFE_ALGO_DISPATCH) and ARBITRARY fixed-limit truncation via
value-independent fingerprints (output tail-zeroing + limit<n + loop-bounded-by-limit), not just
the qr_v2 ranks {256,384,768}. Exempts legitimate adaptive rank-revealing QR (runtime data-derived
bound). Validated zero-FP on clean full-QR + adaptive-RRQR; flips known load_inline truncators to
auto_filter (suki/tenzin/porco/cholopt).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant