Add BENCHMARK_RANK_TRUNCATION: compile-method-independent fixed-limit truncation detector by nikhilbarhate99 · Pull Request #275 · gpu-mode/kernelguard

nikhilbarhate99 · 2026-06-22T19:52:45Z

Problem

Two existing fixture-specialization detectors can be evaded, letting submissions that classify the input's numerical structure and then truncate the computation (factor fewer columns/rows than the full problem) pass as clean:

BENCHMARK_UNSAFE_ALGO_DISPATCH requires raw_native_cache, which needs a literal nvcc build and explicitly excludes load_inline. A submission compiled via torch.utils.cpp_extension.load_inline has no nvcc string, so this rule cannot fire even when the submission classifies its input and then truncates.
INPUT_STRUCTURE_TRUNCATION_DISPATCH requires subset_set >= 2 (gather/scatter), so a submission that classifies and truncates the whole batch uniformly (no per-element gather) stays under the threshold.

Observed in the wild on the qr_v2 (linalg QR) leaderboard: submissions probe rank structure (classify_512/1024) and truncate the QR (_cqr_blocked_limit(384), tau[:, limit:]=0, loop range(0, limit) with limit<n), skipping columns that are ~zero for the benchmark inputs — and pass KernelGuard.
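To illustrate why this pattern slips through on benchmark inputs, here is a minimal NumPy sketch. It is not taken from any submission: `householder_qr`, `rel_residual`, and the sizes (32x32, rank 8, limit 12) are hypothetical stand-ins for the `range(0, limit)` loop described above. When the input's rank is at most `limit`, the truncated factorization still reconstructs the input; on a full-rank input it does not.

```python
import numpy as np


def householder_qr(a, limit=None):
    """Unblocked Householder QR. With limit < n, only the first `limit`
    columns are factored (the fixed-limit truncation pattern)."""
    m, n = a.shape
    steps = n if limit is None else min(limit, n)
    r = a.astype(np.float64)  # astype returns a copy, so `a` is not modified
    q = np.eye(m)
    for k in range(0, steps):
        x = r[k:, k]
        norm = np.linalg.norm(x)
        if norm == 0.0:
            continue  # column is already zero below the diagonal
        v = x.copy()
        v[0] += np.copysign(norm, x[0])
        v /= np.linalg.norm(v)
        # Apply the reflector H = I - 2 v v^T to R (left) and accumulate it into Q (right).
        r[k:, :] -= 2.0 * np.outer(v, v @ r[k:, :])
        q[:, k:] -= 2.0 * np.outer(q[:, k:] @ v, v)
    # Anything left below the diagonal in the unprocessed columns is discarded.
    return q, np.triu(r)


def rel_residual(a, q, r):
    """Relative reconstruction error ||QR - A|| / ||A||."""
    return np.linalg.norm(q @ r - a) / np.linalg.norm(a)


rng = np.random.default_rng(0)
n, rank, limit = 32, 8, 12  # hypothetical sizes, chosen so that rank <= limit < n

low_rank = rng.standard_normal((n, rank)) @ rng.standard_normal((rank, n))
full_rank = rng.standard_normal((n, n))

err_low_truncated = rel_residual(low_rank, *householder_qr(low_rank, limit))
err_full_truncated = rel_residual(full_rank, *householder_qr(full_rank, limit))
err_full_complete = rel_residual(full_rank, *householder_qr(full_rank))

print(f"low-rank input, truncated:  {err_low_truncated:.2e}")
print(f"full-rank input, truncated: {err_full_truncated:.2e}")
print(f"full-rank input, full QR:   {err_full_complete:.2e}")
```

The truncated run is only correct because the skipped columns carry no information for that particular input family, which is the fixture-specialization behaviour the detector targets.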

Fix

New detector BENCHMARK_RANK_TRUNCATION (auto_filter, fixture_specialization family) that is:

Compile-method-independent — does not depend on the nvcc/load_inline distinction.
Value-independent / problem-agnostic — catches arbitrary fixed-limit truncation via fingerprints that a legitimate full factorization never has: output tail-zeroing (X[...k:] = 0), limit < n, and a loop bounded by limit (range(0, limit, ...)) — plus classifier / data-family / per-matrix-limit / hardcoded-rank signals. It is not restricted to the qr_v2 ranks {256,384,768}.
Precision-preserving — exempts a legitimate rank-revealing QR, whose truncation bound is derived adaptively at runtime (matrix_rank, (diag>tol).sum(), count_nonzero); the value-independent fixed-limit rule is suppressed when an adaptive-rank computation is present.
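The three properties above can be sketched as a small source-text check. This is a rough illustration only, not the detector added in this PR: the regular expressions, the function name `looks_like_fixed_limit_truncation`, and the "at least two fingerprints" threshold are all assumptions, and the classifier / data-family / hardcoded-rank signals are omitted.

```python
import re

# Hypothetical fingerprints, loosely modelled on the patterns named in the description.
# Tail-zeroing of an output slice, e.g. `tau[:, limit:] = 0`.
TAIL_ZERO = re.compile(r"\[[^\]\n]*\w+\s*:\s*\]\s*=\s*0(?:\.0*)?(?![\w.])")
# Loop bounded by a limit variable, e.g. `range(0, limit, 32)`.
LIMIT_LOOP = re.compile(r"range\(\s*0\s*,\s*\w*limit\w*\b")
# Explicit comparison of the limit against the problem size, e.g. `limit < n`.
LIMIT_LT_N = re.compile(r"\b\w*limit\w*\s*<\s*n\b")
# Adaptive, data-derived rank: matrix_rank, count_nonzero, or `(diag > tol).sum()`.
ADAPTIVE_RANK = re.compile(
    r"matrix_rank|count_nonzero|\(\s*\w+\s*>\s*\w+\s*\)\s*\.sum\("
)


def looks_like_fixed_limit_truncation(source: str) -> bool:
    """Return True if `source` shows fixed-limit truncation fingerprints
    and no adaptive-rank computation (assumed threshold: 2 of 3)."""
    if ADAPTIVE_RANK.search(source):
        # Exemption: the truncation bound is derived from the data at runtime.
        return False
    hits = sum(
        pattern.search(source) is not None
        for pattern in (TAIL_ZERO, LIMIT_LOOP, LIMIT_LT_N)
    )
    return hits >= 2


# Made-up snippets for illustration; they are not real submissions.
truncator_src = (
    "limit = 384\n"
    "if limit < n:\n"
    "    tau[:, limit:] = 0\n"
    "for j in range(0, limit, 32):\n"
    "    factor_panel(j)\n"
)
adaptive_src = (
    "rank = int((diag > tol).sum())\n"
    "R[rank:] = 0\n"
    "for j in range(0, n, 32):\n"
    "    factor_panel(j)\n"
)
full_qr_src = "for j in range(0, n, 32):\n    factor_panel(j)\n"

print(looks_like_fixed_limit_truncation(truncator_src))
print(looks_like_fixed_limit_truncation(adaptive_src))
print(looks_like_fixed_limit_truncation(full_qr_src))
```

Note that none of the patterns mention a specific rank value or a compile path, which is the point of the value-independent and compile-method-independent design; a text-level check like this would still need the precision safeguards discussed in the PR before being used for filtering.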

Validation

Flips known load_inline truncators to hacked / should_filter=True.
Zero false positives on confirmed-clean builds: a full blocked Householder QR, an adaptive rank-revealing QR (including one that tail-zeros with a data-derived rank), and a clean reference submission.

Marking as draft for discussion — happy to adjust thresholds or move it to telemetry-first + enrichment per the precision-first policy in the blog.

…rary fixed-limit truncation detector Catches input-structure classification + truncation regardless of compile method (closes the load_inline evasion of BENCHMARK_UNSAFE_ALGO_DISPATCH) and ARBITRARY fixed-limit truncation via value-independent fingerprints (output tail-zeroing + limit<n + loop-bounded-by-limit), not just the qr_v2 ranks {256,384,768}. Exempts legitimate adaptive rank-revealing QR (runtime data-derived bound). Validated zero-FP on clean full-QR + adaptive-RRQR; flips known load_inline truncators to auto_filter (suki/tenzin/porco/cholopt).

nikhilbarhate99 wants to merge 1 commit into gpu-mode:main from nikhilbarhate99:feat/benchmark-rank-truncation-detector
