Skip to content

fix: critical bugs, architecture improvements, and test infrastructure#3

Merged
LessUp merged 5 commits into
masterfrom
fix/bugs-and-architecture-improvements
Jul 1, 2026
Merged

fix: critical bugs, architecture improvements, and test infrastructure#3
LessUp merged 5 commits into
masterfrom
fix/bugs-and-architecture-improvements

Conversation

@LessUp

@LessUp LessUp commented Jul 1, 2026

Copy link
Copy Markdown
Owner

Summary

  • CRITICAL: Fix getTheoreticalPeakGflops — Remove spurious * 1000 that made peak GFLOPS 1000x too large, corrupting all efficiency calculations and breaking performance test bounds
  • CRITICAL: Fix CPU test compilation — Rename test_device_info_cpu.cpp.cu because it includes kernel headers with CUDA <<<>>> launch syntax unparseable by g++
  • CRITICAL: Fix duplicate main() — Remove redundant main() from CPU test files that conflicted with GTest::gtest_main
  • HIGH: Fix integer overflow in tensor_core_benchmark.cuh (M * K int multiplication)
  • HIGH: Fix division by zero in measureGpuTime when benchmark_runs <= 0
  • HIGH: Fix negative dimension handling in KernelConstraints::isSatisfied
  • MEDIUM: Fix NaN/Inf reference handling in verification comparison logic
  • MEDIUM: Fix exception safety in printSkipReason (diagnostic function should not throw)
  • MEDIUM: Fix file write error checking in exportRooflineData
  • MEDIUM: Extract shared overflow-check utilities (safeGridSize, checkMatrixElementCount) to eliminate DRY violation
  • MEDIUM: Adjust performance test bounds for corrected peak calculation
  • Infrastructure: Add CPU test execution to CI (ctest -L cpu)
  • Infrastructure: Add CMake test helpers with standardized target creation and CTest labels
  • Infrastructure: Refactor test files — migrate CPU tests to .cpp, split device_info_seam into CPU/CUDA test files

Test plan

  • CI format check passes
  • CI CUDA build passes (all targets compile)
  • CI CPU tests pass (ctest -L cpu)
  • GPU runtime tests pass on local CUDA machine (requires GPU)

Generated with Devin

LessUp and others added 5 commits July 1, 2026 18:44
## Critical Bug Fixes

- **getTheoreticalPeakGflops**: Remove spurious `* 1000` that made peak
  GFLOPS 1000x too large, causing all efficiency calculations to report
  0.1% instead of 100%. This also broke the PeakPerformanceReference test
  upper bound check.

- **CPU test compilation**: Rename test_device_info_cpu.cpp to .cu because
  it includes tensor_core_sgemm.cuh which contains CUDA `<<<>>>` kernel
  launch syntax that g++ cannot parse. Add CUDA compile options to the
  sgemm_add_cpu_test CMake helper for .cu sources.

- **Duplicate main()**: Remove redundant `main()` from CPU-only test files
  (test_benchmark_settings.cpp, test_device_info_cpu.cu) that conflicted
  with GTest::gtest_main linkage.

## High-Severity Fixes

- **Integer overflow**: Fix `M * K` int multiplication overflow in
  tensor_core_benchmark.cuh by using `static_cast<size_t>` consistently
  for all matrix size calculations.

- **Division by zero**: Add validation in measureGpuTime() to reject
  benchmark_runs <= 0, preventing undefined behavior.

- **Negative dimensions**: Add M > 0 && K > 0 && N > 0 check in
  KernelConstraints::isSatisfied() to prevent undefined modulo behavior
  with non-positive dimensions.

## Medium-Severity Fixes

- **NaN/Inf verification**: Handle NaN/Inf in reference values in
  compareMatricesImpl() instead of producing NaN comparison results.

- **Exception safety in diagnostics**: Replace CUDA_CHECK (throwing)
  with non-throwing error handling in printSkipReason().

- **File write error checking**: Add file.fail() check after writing
  roofline export data.

- **DRY violation**: Extract safeGridSize() and checkMatrixElementCount()
  to cuda_utils.cuh, eliminating duplicated overflow-check code between
  tensor_core_sgemm.cuh and tensor_core_benchmark.cuh.

- **Performance test bounds**: Adjust PeakPerformanceReference bounds
  (100-200000 GFLOPS) to accommodate the corrected peak calculation and
  future GPU architectures.

## Infrastructure Improvements

- **CI**: Add CPU test execution step (`ctest -L cpu`) to CI workflow.
- **CMake helpers**: Add SgemmTestHelpers.cmake with standardized test
  target creation functions (sgemm_add_cpu_test, sgemm_add_cuda_test,
  sgemm_add_cuda_perf_test) with proper CTest labels.
- **Test refactoring**: Migrate CPU-only tests from .cu to .cpp, split
  device_info_seam.cu into separate CPU and CUDA test files.
- **Kernel catalog**: Add KernelConstraints struct with isSatisfied()
  method, refactor BenchmarkRunner to use catalog-driven dispatch.

Generated with [Devin](https://devin.ai)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
- Add missing `include(GoogleTest)` before SgemmTestHelpers.cmake to
  resolve `gtest_discover_tests` unknown command error
- Apply clang-format to test_utils.cu, test_sgemm.cu, and
  test_device_info_cuda.cu to fix format check failures

Generated with [Devin](https://devin.ai)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
The refactored verify.cuh removed the static shouldFlagAsIncorrect
method from SGEMMVerifier, breaking test_sgemm.cu which calls
SGEMMVerifier::shouldFlagAsIncorrect(result). Restore it as a
deprecated compatibility method.

Generated with [Devin](https://devin.ai)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
The test uses tensorCoresAvailable() which is defined in
tensor_core_sgemm.cuh, but the header was not included.

Generated with [Devin](https://devin.ai)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
The test expected 19481.0 GFLOPS but the correct calculation is
108 * 64 * 2 * 1.41 = 19491.84 GFLOPS. Fix the expected value and
remove the stale * 1000 from the comment.

Generated with [Devin](https://devin.ai)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
@LessUp LessUp merged commit 4b39e85 into master Jul 1, 2026
2 checks passed
@LessUp LessUp deleted the fix/bugs-and-architecture-improvements branch July 1, 2026 11:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant