Skip to content

Add status and public/secret scores to /user/submissions list#502

Open
robobryce wants to merge 3 commits into
gpu-mode:mainfrom
robobryce:feat/user-submissions-scores-status
Open

Add status and public/secret scores to /user/submissions list#502
robobryce wants to merge 3 commits into
gpu-mode:mainfrom
robobryce:feat/user-submissions-scores-status

Conversation

@robobryce

@robobryce robobryce commented Jun 23, 2026

Copy link
Copy Markdown

Problem

GET /user/submissions (the list endpoint, backed by LeaderboardDB.get_user_submissions) doesn't expose two things that only the per-submission detail endpoint carries:

  1. Secret leaderboard score — the secret-run score (the actual ranking metric) is never in the list payload; the list query only selects public runs (NOT secret).
  2. Failure status — the list returns no per-run passed flag and no status. A finished-but-failed submission is indistinguishable from a clean one: both show done. (Depending on which run failed, the row either keeps its passing public runs with score: null — failed public run — or comes back with runs: [] — failed secret run, via the existing anti-cheat filter. Neither carries a failure signal.)

So a CLI/webapp listing can't show the secret score or tell "done" from "failed" without one detail request per row (the /user/submissions/{id} endpoint does carry secret/passed), which is an N+1 against a ~480 KB-per-row endpoint.

Note the public score is not a gap — it's already exposed per-run in runs[].score.

Change

Add two fields to each submission in the list payload, via one extra aggregate query over the same runs rows:

  • status: "pending" (not done), "failed" (done with any failed run), or "done". Mirrors the canonical done/failed derivation in get_hf_export_rows.sql.
  • secret_score: the secret leaderboard run's geomean score (the ranking metric; MIN across GPUs, matching how the public score is summarized). Visible to the owner, as the detail endpoint already exposes it.

The existing runs, done, and the other fields are untouched, so current clients keep working (additive JSON keys only — verified popcorn-cli's by-key parser ignores them).

Performance

The added aggregate runs at ~0.25 ms for a full 100-submission page (600 runs) on the test Postgres — ~1% of the ~21 ms get_user_submissions call. Same WHERE submission_id = ANY(%s) access pattern as the existing runs query, over the same rows; no new join.

Tests

Added to tests/test_leaderboard_db.py:

  • test_get_user_submissions_status_and_secret_score_on_success
  • test_get_user_submissions_status_failed_when_secret_run_failed (failed secret run → runs: [])
  • test_get_user_submissions_status_failed_keeps_runs_when_public_run_failed (failed public run → passing runs retained)
  • test_get_user_submissions_status_pending_when_not_done

Full tests/test_leaderboard_db.py: 54 passed (docker-compose test Postgres). ruff check . --exclude examples/ --line-length 120: clean.

🤖 Generated with Claude Code

The list endpoint (`db.get_user_submissions`) returned only the public,
ranking-filtered run score per submission — it dropped the secret
leaderboard score and any failure signal entirely. So a CLI/webapp
listing could not show the secret (ranking) score, and a submission with
a failed run was indistinguishable from a successful one (both showed as
"done" with whatever public score survived the anti-cheat filter).

Surfacing these required an extra detail fetch per row client-side. Add
them to the list payload directly:

- `status`: "pending" (not done), "failed" (done with any failed run),
  or "done". Mirrors the canonical done/failed logic in
  get_hf_export_rows.sql.
- `public_score` / `secret_score`: the geomean leaderboard scores. The
  public score keeps the existing ranking-eligibility filter; the secret
  score is the owner's passing secret leaderboard run (owners already see
  this via the detail endpoint).

The existing `runs`/`done` fields are unchanged for backward compat.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@brycelelbach

Copy link
Copy Markdown

A submission with a failed run comes back as done with runs: [], indistinguishable from a successful submission with no surviving public score.

I'm not sure this is true. If you do popcorn submission show on a submission that had a failed run, you will see the other runs. Can you confirm that a failed run comes back with runs: []? I believe it comes back with actual run data, it's just that (at least) one of them is marked as a failed.

Comment on lines +1339 to +1346
SELECT submission_id,
MIN(score) FILTER (
WHERE mode = 'leaderboard' AND secret AND passed
) AS secret_score,
bool_or(NOT passed) AS has_failed_run
FROM leaderboard.runs
WHERE submission_id = ANY(%s)
GROUP BY submission_id

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you benchmark how long this sequel query will take?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. On the docker-compose test Postgres, seeded with 100 submissions × 6 runs (600 rows), the added aggregate query runs at ~0.25 ms/call for a full 100-submission page, versus ~21 ms for the whole get_user_submissions call — about 1% overhead. It's the same access pattern as the existing runs query in this method (WHERE submission_id = ANY(%s), grouped) over the same rows, so no new join or table. And it replaces what a client otherwise needs up to 100 detail round-trips to compute.

Comment on lines +1340 to +1342
MIN(score) FILTER (
WHERE mode = 'leaderboard' AND secret AND passed
) AS secret_score,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this here. Can you explain this to me? Why MIN(score)?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A submission can have a secret leaderboard run per GPU type, so there can be more than one secret score. MIN takes the best (lowest = fastest) one, matching how the existing public score is summarized for the row (the runs are likewise per-GPU and the caller/CLI takes the min). For single-GPU leaderboards like qr_v2 there's exactly one, so MIN is just that value. Happy to switch to per-GPU secret scores instead if you'd prefer symmetry with runs, but a single ranking number seemed more useful for the list view.

Comment thread src/libkernelbot/leaderboard_db.py Outdated

# The public leaderboard score (lowest across GPUs), already
# ranking-eligible by construction of `runs_query`.
public_scores = [r["score"] for r in public_runs if r["score"] is not None]

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused. I thought the public scores were already reported by the submissions endpoint. So why do we need new code to report it?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right — the public score was already exposed (in runs[].score), so the separate public_score field was redundant. I've dropped it in 08d2604. The list payload now only adds the two things that genuinely weren't derivable from the existing runs: status and secret_score (the secret runs are never selected by the list query, so the secret score and the per-run passed flags don't leave the server today).

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lol bro come on I expected better.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that one's on me — I added a field that already existed in runs[].score without checking. Dropped it; the patch now adds only status and secret_score.

@brycelelbach

Copy link
Copy Markdown

Did you validate that this change will not break existing clients like popcorn-cli?

Per review feedback:
- Drop the new `public_score` field. The public leaderboard score is
  already exposed per-run in `runs[].score`; a separate top-level field
  was redundant. The list payload now only adds what was genuinely
  missing: `status` and `secret_score`.
- Expand the comment to explain why these two (and not the public score)
  need new code, and why MIN(score) (best secret run across GPUs, to
  match how the public score is summarized).
- Tests: assert the public score via runs[].score; add a case showing a
  failed *public* run keeps its passing runs (status 'failed', runs
  non-empty), distinct from a failed *secret* run (full anti-cheat hide,
  runs []).

Aggregate query measured at ~0.25 ms for a full 100-submission page
(600 runs), ~1% of the ~21 ms get_user_submissions call.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@robobryce

Copy link
Copy Markdown
Author

Good catch — my description was imprecise, and you're right that it depends on which run failed. I checked against production:

  • Failed public run (e.g. public leaderboard run passed=false, secret passed): the list row keeps its passing runs — e.g. submission 831151 returns runs: [{score:null},{score:null}], not empty. (show shows all 6, as you said.)
  • Failed secret run: the anti-cheat filter fully hides it — e.g. 831149 returns runs: [].

So "comes back with runs: []" is only true for the secret-failure case. But the point the PR addresses holds either way: neither carries a failure signal, so a finished-but-failed submission is indistinguishable from a clean one in the list (both done, and the surviving public score may be null for unrelated reasons). That's what status fixes. I've corrected the PR description and added a test (..._keeps_runs_when_public_run_failed) pinning down the public-vs-secret distinction.

@robobryce

Copy link
Copy Markdown
Author

Yes. The change is purely additive — it adds status and secret_score keys to each list entry and removes nothing (id, leaderboard_name, file_name, submission_time, done, runs are unchanged).

For popcorn-cli specifically: its list parser reads fields by key off a serde_json::Value (sub["id"], sub["runs"], r["gpu_type"], r["score"], …) and ignores any keys it doesn't look up, so extra fields can't break deserialization. I confirmed this against both the released client and current popcorn-cli main. The webapp (kernelboard) deserializes the same way. So no existing client breaks; they simply ignore the new fields until updated to use them.

@brycelelbach

Copy link
Copy Markdown

Yes. The change is purely additive — it adds status and secret_score keys to each list entry and removes nothing (id, leaderboard_name, file_name, submission_time, done, runs are unchanged).

For popcorn-cli specifically: its list parser reads fields by key off a serde_json::Value (sub["id"], sub["runs"], r["gpu_type"], r["score"], …) and ignores any keys it doesn't look up, so extra fields can't break deserialization. I confirmed this against both the released client and current popcorn-cli main. The webapp (kernelboard) deserializes the same way. So no existing client breaks; they simply ignore the new fields until updated to use them.

Okay, but did you test it?

@robobryce

Copy link
Copy Markdown
Author

Fair — I tested it now instead of arguing from the code.

I generated the exact new-shape list payload my patched get_user_submissions produces (FastAPI-serialized, so Decimal→string and the new status/secret_score keys present), served it from a mock at POPCORN_API_URL, and ran the real released popcorn-cli 1.3.13 binary (the existing client in the wild) against it:

$ POPCORN_API_URL=... popcorn-cli --no-tui submissions list --leaderboard qr_v2
ID       Leaderboard   File      Time                 GPU(s)   Status   Score
1        qr_v2         ok.py     2026-06-23T22:56...   B200     done       -
2        qr_v2         bad.py    2026-06-23T22:56...   -        done       -
list_rc=0
$ popcorn-cli --no-tui submissions show 1   # exit 0, renders all runs

Exit 0, both list and show render every row, no crash — it just ignores status/secret_score. (The - scores there are 1.3.13's pre-existing string-parse bug, fixed separately in #62, not this change.) I also ran the current main build (with #62): same payload, renders fine, full-precision scores, new fields ignored. So: existing clients don't break.

Pin the backward-compat guarantee in a test: the pre-existing top-level
and per-run keys must remain present, so a future change can't silently
drop a field existing clients (popcorn-cli, kernelboard) rely on. New
fields are added alongside, never replacing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@robobryce

Copy link
Copy Markdown
Author

Also pinned the additive guarantee in the test itself (93e566f): test_get_user_submissions_status_and_secret_score_on_success now asserts the pre-existing top-level keys (id, leaderboard_name, file_name, submission_time, done, runs) and per-run keys (gpu_type, score) are still present, so a future change can't silently drop a field existing clients depend on.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants