fix: critical bugs, architecture improvements, and test infrastructure#3
Merged
Merged
Conversation
## Critical Bug Fixes - **getTheoreticalPeakGflops**: Remove spurious `* 1000` that made peak GFLOPS 1000x too large, causing all efficiency calculations to report 0.1% instead of 100%. This also broke the PeakPerformanceReference test upper bound check. - **CPU test compilation**: Rename test_device_info_cpu.cpp to .cu because it includes tensor_core_sgemm.cuh which contains CUDA `<<<>>>` kernel launch syntax that g++ cannot parse. Add CUDA compile options to the sgemm_add_cpu_test CMake helper for .cu sources. - **Duplicate main()**: Remove redundant `main()` from CPU-only test files (test_benchmark_settings.cpp, test_device_info_cpu.cu) that conflicted with GTest::gtest_main linkage. ## High-Severity Fixes - **Integer overflow**: Fix `M * K` int multiplication overflow in tensor_core_benchmark.cuh by using `static_cast<size_t>` consistently for all matrix size calculations. - **Division by zero**: Add validation in measureGpuTime() to reject benchmark_runs <= 0, preventing undefined behavior. - **Negative dimensions**: Add M > 0 && K > 0 && N > 0 check in KernelConstraints::isSatisfied() to prevent undefined modulo behavior with non-positive dimensions. ## Medium-Severity Fixes - **NaN/Inf verification**: Handle NaN/Inf in reference values in compareMatricesImpl() instead of producing NaN comparison results. - **Exception safety in diagnostics**: Replace CUDA_CHECK (throwing) with non-throwing error handling in printSkipReason(). - **File write error checking**: Add file.fail() check after writing roofline export data. - **DRY violation**: Extract safeGridSize() and checkMatrixElementCount() to cuda_utils.cuh, eliminating duplicated overflow-check code between tensor_core_sgemm.cuh and tensor_core_benchmark.cuh. - **Performance test bounds**: Adjust PeakPerformanceReference bounds (100-200000 GFLOPS) to accommodate the corrected peak calculation and future GPU architectures. ## Infrastructure Improvements - **CI**: Add CPU test execution step (`ctest -L cpu`) to CI workflow. - **CMake helpers**: Add SgemmTestHelpers.cmake with standardized test target creation functions (sgemm_add_cpu_test, sgemm_add_cuda_test, sgemm_add_cuda_perf_test) with proper CTest labels. - **Test refactoring**: Migrate CPU-only tests from .cu to .cpp, split device_info_seam.cu into separate CPU and CUDA test files. - **Kernel catalog**: Add KernelConstraints struct with isSatisfied() method, refactor BenchmarkRunner to use catalog-driven dispatch. Generated with [Devin](https://devin.ai) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
- Add missing `include(GoogleTest)` before SgemmTestHelpers.cmake to resolve `gtest_discover_tests` unknown command error - Apply clang-format to test_utils.cu, test_sgemm.cu, and test_device_info_cuda.cu to fix format check failures Generated with [Devin](https://devin.ai) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
The refactored verify.cuh removed the static shouldFlagAsIncorrect method from SGEMMVerifier, breaking test_sgemm.cu which calls SGEMMVerifier::shouldFlagAsIncorrect(result). Restore it as a deprecated compatibility method. Generated with [Devin](https://devin.ai) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
The test uses tensorCoresAvailable() which is defined in tensor_core_sgemm.cuh, but the header was not included. Generated with [Devin](https://devin.ai) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
The test expected 19481.0 GFLOPS but the correct calculation is 108 * 64 * 2 * 1.41 = 19491.84 GFLOPS. Fix the expected value and remove the stale * 1000 from the comment. Generated with [Devin](https://devin.ai) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
* 1000that made peak GFLOPS 1000x too large, corrupting all efficiency calculations and breaking performance test boundstest_device_info_cpu.cpp→.cubecause it includes kernel headers with CUDA<<<>>>launch syntax unparseable by g++main()from CPU test files that conflicted withGTest::gtest_maintensor_core_benchmark.cuh(M * Kint multiplication)measureGpuTimewhenbenchmark_runs <= 0KernelConstraints::isSatisfiedprintSkipReason(diagnostic function should not throw)exportRooflineDatasafeGridSize,checkMatrixElementCount) to eliminate DRY violationctest -L cpu)Test plan
ctest -L cpu)Generated with Devin