Skip to content

test: clean up gRPC server process groups#1492

Merged
rapids-bot[bot] merged 1 commit into
NVIDIA:mainfrom
mlubin:test/grpc-process-group-cleanup
Jul 1, 2026
Merged

test: clean up gRPC server process groups#1492
rapids-bot[bot] merged 1 commit into
NVIDIA:mainfrom
mlubin:test/grpc-process-group-cleanup

Conversation

@mlubin

@mlubin mlubin commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

Makes the gRPC integration test's ServerProcess helper clean up reliably. Each spawned cuopt_grpc_server is now put in its own process group (setpgid), and teardown signals the whole group, so the GPU worker child can no longer be orphaned when a test ends or a server is restarted.

stop() now returns success/failure and is idempotent, the process group is verified before it's ever used as a kill target, and teardown paths assert the group is gone. Adds ServerProcessLifecycleTests covering idle stop, idempotent stop, and crash-during-solve cleanup.

Test-only change; no library or API changes.

The change is motivated by CI failures like https://github.com/NVIDIA/cuopt/actions/runs/28380812204/job/84090251116?pr=1481. The Codex-diagnosed issue here is:

  1. ErrorRecoveryTests.ClientHandlesServerCrashDuringSolve starts a 120-second MIP in a forked gRPC worker.
  2. ServerProcess::stop() signals only the server parent.
  3. After five seconds it SIGKILLs that parent, but not the worker forked in spawn_worker().
  4. The test reports success while the orphan worker continues using the GPU.
  5. INCUMBENT_CALLBACK_TEST starts next and competes with that worker for VRAM

@copy-pr-bot

copy-pr-bot Bot commented Jun 30, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@mlubin mlubin added non-breaking Introduces a non-breaking change improvement Improves an existing functionality labels Jun 30, 2026
@mlubin mlubin force-pushed the test/grpc-process-group-cleanup branch 2 times, most recently from 1bbc8d7 to df1c0de Compare June 30, 2026 18:23
@mlubin

mlubin commented Jun 30, 2026

Copy link
Copy Markdown
Contributor Author

/ok to test df1c0de

@mlubin mlubin marked this pull request as ready for review June 30, 2026 18:28
@mlubin mlubin requested a review from a team as a code owner June 30, 2026 18:28
@mlubin mlubin requested review from nguidotti and rg20 June 30, 2026 18:28
@coderabbitai

coderabbitai Bot commented Jun 30, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e0456766-19e7-499c-a053-15103c9535a2

📥 Commits

Reviewing files that changed from the base of the PR and between fd1e773 and d4ec23e.

📒 Files selected for processing (1)
  • cpp/tests/linear_programming/grpc/grpc_integration_test.cpp

📝 Walkthrough

Walkthrough

Refactors ServerProcess in the gRPC integration test file to manage a process group, add group shutdown helpers, and make stop() return bool. Fixture teardown and restart paths now assert successful shutdowns.

Changes

ServerProcess process-group lifecycle refactor

Layer / File(s) Summary
ServerProcess struct, start(), and pgid_ member
cpp/tests/linear_programming/grpc/grpc_integration_test.cpp
Adds the new includes, introduces pgid_ and destructor cleanup reporting, and hardens start() with reuse checks, binary-path validation, fork handling, and child process-group setup.
stop(), shutdown helpers, and wait_for_ready crash detection
cpp/tests/linear_programming/grpc/grpc_integration_test.cpp
Replaces stop() with process-group SIGTERM and optional SIGKILL shutdown, adds lifecycle reset and group reaping helpers, and changes readiness failure detection to check for server death with WNOHANG.
Fixture TearDown stop() assertions
cpp/tests/linear_programming/grpc/grpc_integration_test.cpp
Updates teardown and restart paths across the gRPC integration fixtures to assert successful stop() calls before cleanup or restart.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 4.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title is concise and accurately summarizes the main test-only change to gRPC server process-group cleanup.
Description check ✅ Passed The description clearly matches the changeset by explaining process-group cleanup, idempotent stop behavior, and the related lifecycle tests.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cpp/tests/linear_programming/grpc/grpc_integration_test.cpp`:
- Around line 162-181: The stop() logic in grpc_integration_test.cpp only sends
SIGKILL inside the branch where reap_parent() fails, so a lingering GPU worker
can survive when the server parent exits promptly. Update stop() to escalate to
kill(-pgid_, SIGKILL) whenever wait_for_group_exit() shows the process group has
not exited, not just when reap_parent() returns false, and keep the cleanup
around clear_lifecycle_state() only after the full group is confirmed gone.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f3c4bb1d-39d0-4334-81f9-b3f48630ebc2

📥 Commits

Reviewing files that changed from the base of the PR and between 2ca56e7 and df1c0de.

📒 Files selected for processing (1)
  • cpp/tests/linear_programming/grpc/grpc_integration_test.cpp

Comment thread cpp/tests/linear_programming/grpc/grpc_integration_test.cpp
@mlubin mlubin force-pushed the test/grpc-process-group-cleanup branch from df1c0de to fd1e773 Compare June 30, 2026 18:38
@mlubin

mlubin commented Jun 30, 2026

Copy link
Copy Markdown
Contributor Author

/ok to test fd1e773

@github-actions

github-actions Bot commented Jun 30, 2026

Copy link
Copy Markdown

CI Test Summary

✅ All 31 test job(s) passed.

Signed-off-by: Miles Lubin <mlubin@nvidia.com>
@mlubin mlubin force-pushed the test/grpc-process-group-cleanup branch from fd1e773 to d4ec23e Compare June 30, 2026 20:37
@mlubin

mlubin commented Jun 30, 2026

Copy link
Copy Markdown
Contributor Author

/ok to test d4ec23e

@mlubin

mlubin commented Jul 1, 2026

Copy link
Copy Markdown
Contributor Author

/merge

@rapids-bot rapids-bot Bot merged commit a1adfcc into NVIDIA:main Jul 1, 2026
91 checks passed
@mlubin mlubin deleted the test/grpc-process-group-cleanup branch July 1, 2026 13:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

improvement Improves an existing functionality non-breaking Introduces a non-breaking change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants