AIR CLI Integration: Implement the air get command #5600
riddhibhagwat-db wants to merge 7 commits
Implement the read-only run-details command (renamed from `status` to `get`).
It fetches a job run via the Jobs API and renders the run's status, start time,
duration, retries, experiment, accelerators, dashboard URL, MLflow deep-link,
and a foreach/sweep summary. Output is the air-style {v, ts, data} JSON envelope
under -o json, or a text view.
Renames the command-level identifiers (status -> get) while keeping the run's
"status" field/label. Adds format/mlflow/sweep/output helpers with unit tests
and an acceptance test, and drops `get` from the not-implemented stub coverage.
Co-authored-by: Isaac
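As an illustration of the `{v, ts, data}` envelope described above, here is a minimal sketch. The type name `envelope`, the helpers `wrap` and `render`, the integer version `1`, and the string timestamp are all assumptions made for the example; the PR's actual types may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// envelope is a hypothetical Go shape for the {v, ts, data} wrapper.
// The field types (integer version, string timestamp) are assumptions.
type envelope struct {
	V    int    `json:"v"`
	TS   string `json:"ts"`
	Data any    `json:"data"`
}

// wrap builds the envelope around an arbitrary payload.
func wrap(data any, ts string) envelope {
	return envelope{V: 1, TS: ts, Data: data}
}

// render marshals the envelope to a single JSON line.
func render(e envelope) (string, error) {
	b, err := json.Marshal(e)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	// The payload and timestamp here are illustrative values only.
	s, err := render(wrap(map[string]any{"status": "RUNNING"}, "2024-01-01T00:00:00+00:00"))
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
}
```

Keeping the payload as `any` lets the same wrapper serve every command that emits the envelope, at the cost of compile-time checking of the `data` shape.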
Approval status: pending
air status command → air get command
Integration test report
Commit: 8bfb38b
22 interesting tests: 15 SKIP, 7 KNOWN
cmdio.LogString(ctx, "Training Configuration:")
cmdio.LogString(ctx, string(content))
cmdio.LogString(ctx, "")
The LogString helper writes to stderr, whereas the original Python code wrote to stdout: https://github.com/databricks/cli/blob/main/libs/cmdio/log.go#L14-L18
fixed, thanks for the catch!
runID, err := strconv.ParseInt(args[0], 10, 64)
if err != nil || runID <= 0 {
    return fmt.Errorf("invalid RUN_ID %q: must be a positive integer", args[0])
In JSON mode, does this return a plain Go error instead of the JSON envelope?
This and a few other places should return JSON when the --json flag is passed.
Fixed this, thanks for the catch!
// Accelerators describes the run's GPUs, e.g. "8x H100".
Accelerators string `json:"-"`
// User is the run's creator. Text-only; JSON omits it, matching `air get --json`.
User string `json:"-"`
I think this is one area we can update: the user can be included in the JSON output.
@@ -0,0 +1,36 @@
=== get (text)
>>> [CLI] experimental air get 123
This looks different from the output on the wheel side, right? Can you add a before/after screenshot to the PR description for easy review?
It's OK if matching the format is "coming next"; I just want to understand exactly how big the diff is.
pardis-beikzadeh-db left a comment
Independent review against the Python air source (handle_status + cli_display/json_output). The success JSON envelope shape, MLflow URL construction, sweep table, and YAML panel all port faithfully — nice work. A few divergences inline below (#1, the -o json error path, is the one I'd treat as the most important; the rest are retry/rounding correctness).
Two more are easiest to review visually — could you add before/after side-by-sides to the PR description showing the old air vs the new databricks experimental air get output for: (a) the text view of a run, and (b) -o json of a run, plus a not-found case? That lets us confirm at a glance:
- text field ordering (Retries/Duration order and MLflow/User placement differ from the Python table at cli_display.py:249), and
- the JSON `started_at` format: Python emits `…+00:00` via `.isoformat()` (cli_entrypoint.py:1931), while the Go side emits `…Z` via RFC3339 (format.go:44), which is a value change for strict consumers.
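To make the `started_at` difference concrete, here is a small sketch comparing the two renderings of the same instant. The function names are hypothetical, and the handling of fractional seconds assumes the Python side calls `.isoformat()` on a timezone-aware UTC datetime (which omits the fraction when it is zero and otherwise prints six digits); how the Python CLI actually builds that datetime is not shown in this thread.

```go
package main

import (
	"fmt"
	"time"
)

// rfc3339UTC formats epoch milliseconds with time.RFC3339,
// which renders a UTC offset as "Z".
func rfc3339UTC(ms int64) string {
	return time.UnixMilli(ms).UTC().Format(time.RFC3339)
}

// isoformatUTC approximates Python's datetime.isoformat() for an aware
// UTC datetime: "+00:00" instead of "Z", and a six-digit fraction only
// when the sub-second part is non-zero.
func isoformatUTC(ms int64) string {
	t := time.UnixMilli(ms).UTC()
	if t.Nanosecond() == 0 {
		return t.Format("2006-01-02T15:04:05-07:00")
	}
	return t.Format("2006-01-02T15:04:05.000000-07:00")
}

func main() {
	for _, ms := range []int64{0, 1500} {
		fmt.Println(rfc3339UTC(ms), "vs", isoformatUTC(ms))
	}
}
```

Both strings denote the same instant, so only consumers that compare or pattern-match the raw text are affected.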
if err != nil {
    // The backend returns this when the run ID is unknown to the user.
    if errors.Is(err, apierr.ErrResourceDoesNotExist) {
        return fmt.Errorf("run %d not found: check the run ID and that it is a job run ID", runID)
In -o json mode the Python CLI emits a structured error envelope and exits 1 — print_json_error("NOT_FOUND"/"INTERNAL_ERROR", kind, msg, retryable) → {v, ts, error:{...}} (cli_entrypoint.py:2017-2022). Here RunE returns a bare error regardless of output mode, so the framework prints a plain Error: … string. A consumer parsing the JSON error envelope from air get --json would break. Consider rendering the error envelope when output is JSON. (This JSON not-found branch is also currently untested.)
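A rough sketch of what rendering the error envelope could look like follows. The field names `code`, `kind`, `message`, and `retryable` come from this thread; the Go type names, the `renderError` helper, the integer version, and the `"user"` kind value are assumptions for illustration and are not taken from the PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// errorBody mirrors the error fields named in the review; the Go types
// are assumptions.
type errorBody struct {
	Code      string `json:"code"`
	Kind      string `json:"kind"`
	Message   string `json:"message"`
	Retryable bool   `json:"retryable"`
}

// errorEnvelope is a hypothetical shape for {v, ts, error:{...}}.
type errorEnvelope struct {
	V     int       `json:"v"`
	TS    string    `json:"ts"`
	Error errorBody `json:"error"`
}

// renderError returns the JSON envelope in JSON mode and a plain
// "Error: ..." line otherwise.
func renderError(jsonMode bool, ts string, body errorBody) (string, error) {
	if !jsonMode {
		return "Error: " + body.Message, nil
	}
	b, err := json.Marshal(errorEnvelope{V: 1, TS: ts, Error: body})
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	body := errorBody{Code: "NOT_FOUND", Kind: "user", Message: "run 123 not found"}
	for _, jsonMode := range []bool{false, true} {
		s, err := renderError(jsonMode, "2024-01-01T00:00:00+00:00", body)
		if err != nil {
			panic(err)
		}
		fmt.Println(s)
	}
}
```

Branching on the output mode in one helper keeps every error path (invalid ID, not found, backend failure) consistent, which is the behavior a JSON consumer depends on.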
    endMillis = time.Now().UnixMilli()
}
d := (endMillis - run.StartTime) / 1000
nit: Python rounds to the nearest second: round((end - started_ms) / 1000) (cli_entrypoint.py:1934). Integer / 1000 truncates here, so e.g. an 11,500 ms run reports 11 vs Python's 12. Suggest rounding, e.g. (endMillis - run.StartTime + 500) / 1000.
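The three rounding behaviors in play can be compared side by side. This is a sketch with hypothetical function names, assuming non-negative durations. Note that `(ms + 500) / 1000` rounds exact half-second ties up, whereas Python 3's `round()` rounds ties to the nearest even integer, so the two agree except on exact `.5` values.

```go
package main

import (
	"fmt"
	"math"
)

// truncSeconds is plain integer division: it drops the fraction.
func truncSeconds(ms int64) int64 { return ms / 1000 }

// halfUpSeconds rounds to the nearest second, ties up.
// Valid for non-negative ms only.
func halfUpSeconds(ms int64) int64 { return (ms + 500) / 1000 }

// halfEvenSeconds rounds to the nearest second, ties to even,
// which is how Python 3's round() treats a float.
func halfEvenSeconds(ms int64) int64 {
	return int64(math.RoundToEven(float64(ms) / 1000))
}

func main() {
	for _, ms := range []int64{11499, 11500, 12500} {
		fmt.Printf("%d ms: trunc=%d halfUp=%d halfEven=%d\n",
			ms, truncSeconds(ms), halfUpSeconds(ms), halfEvenSeconds(ms))
	}
}
```

For the 11,500 ms example both rounding variants give 12; they diverge on 12,500 ms, where ties-up gives 13 and ties-to-even gives 12.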
    return nil
}
// The MLflow output is attached to the task run, not the parent job run.
taskRunID := run.Tasks[0].RunId
Python resolves the task for runs/get-output as tasks[-1] (latest attempt, to handle retries; jobs_api_client.py:68). Using Tasks[0] here links a retried run to its stale first attempt's MLflow output. Suggest run.Tasks[len(run.Tasks)-1].RunId.
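A minimal sketch of the last-task selection, with a guard for an empty task list. The `task` and `run` types here are hypothetical stand-ins for the SDK types, and the sketch assumes, as the comment above does, that the last element is the latest attempt.

```go
package main

import "fmt"

// task and run are simplified stand-ins for the SDK's run types.
type task struct{ RunID int64 }
type run struct{ Tasks []task }

// latestTaskRunID returns the run ID of the last task, mirroring
// Python's tasks[-1]. The boolean is false when there are no tasks.
func latestTaskRunID(r run) (int64, bool) {
	if len(r.Tasks) == 0 {
		return 0, false
	}
	return r.Tasks[len(r.Tasks)-1].RunID, true
}

func main() {
	// Two attempts: the first (stale) and the retry.
	r := run{Tasks: []task{{RunID: 101}, {RunID: 102}}}
	id, ok := latestTaskRunID(r)
	fmt.Println(id, ok)
}
```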
The training-config block is command result data, but it was emitted via cmdio.LogString, which targets stderr. Write it to cmd.OutOrStdout() instead so it lands on stdout, matching the Python `air get`. Download/read failures stay on stderr as warnings. Co-authored-by: Isaac
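The stdout/stderr split described in this commit can be sketched with plain `io.Writer` parameters, which is what a value such as `cmd.OutOrStdout()` provides. The function names and the warning text are hypothetical; only the three config lines are taken from the diff above.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
)

// printConfig writes the training-config block to out (result data)
// and any read failure to warn, so the two streams stay separate.
func printConfig(out, warn io.Writer, content []byte, readErr error) {
	if readErr != nil {
		// Hypothetical warning wording.
		fmt.Fprintf(warn, "Warning: could not read training config: %v\n", readErr)
		return
	}
	fmt.Fprintln(out, "Training Configuration:")
	fmt.Fprintln(out, string(content))
	fmt.Fprintln(out)
}

// capture runs printConfig against in-memory writers and returns
// what would have gone to stdout and stderr.
func capture(content []byte, readErr error) (stdout, stderr string) {
	var out, warn strings.Builder
	printConfig(&out, &warn, content, readErr)
	return out.String(), warn.String()
}

func main() {
	o, e := capture([]byte("epochs: 3"), nil)
	fmt.Printf("stdout=%q stderr=%q\n", o, e)
	o, e = capture(nil, errors.New("download failed"))
	fmt.Printf("stdout=%q stderr=%q\n", o, e)
}
```

Passing the writers in, rather than calling a logger directly, also makes the destination of each line straightforward to assert in a unit test.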
`air get` derived Submitted and Duration from run-level start/end and truncated milliseconds to seconds. Port Python's _reported_attempt_timing so a retried run reports its latest attempt, and round to the nearest second to match Python's round(). Drops the run-level RunDuration shortcut, which diverged on retries. Co-authored-by: Isaac
mlflowURL resolved runs/get-output against Tasks[0], linking a retried run to its stale first attempt. Use the last task (latest attempt) to match Python (jobs_api_client.py:68). Co-authored-by: Isaac
…N with Python
In -o json mode, error paths now emit the structured error envelope
({v, ts, error:{code, kind, message, retryable}}) and exit non-zero, matching
the Python air CLI's print_json_error instead of letting the framework print a
bare "Error: ..." string. Covers invalid RUN_ID, run-not-found, backend
failures, and client/auth failures (wrapped PreRunE).
Also align the success envelope with the Python CLI:
- dashboard_url: construct {host}/jobs/runs/{id}?o={workspace_id} (via
CurrentWorkspaceID) instead of using the API's run_page_url
- started_at: datetime.isoformat() form ("+00:00" with microseconds), not
RFC3339 "Z"
- duration_seconds: rounded half-to-even to match Python's round()
- use run-level start/end times for started_at and duration_seconds, dropping
the last-attempt preference, which had no Python equivalent
Co-authored-by: Isaac
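The `dashboard_url` construction described in this commit message could look roughly like the following. The function name, the `int64` type for the workspace ID, and the example host are assumptions; only the `{host}/jobs/runs/{id}?o={workspace_id}` pattern comes from the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// dashboardURL builds {host}/jobs/runs/{id}?o={workspace_id}.
// A trailing slash on host is trimmed to avoid a double slash.
func dashboardURL(host string, runID, workspaceID int64) string {
	return fmt.Sprintf("%s/jobs/runs/%d?o=%d",
		strings.TrimRight(host, "/"), runID, workspaceID)
}

func main() {
	// example.test is a placeholder host, not a real workspace.
	fmt.Println(dashboardURL("https://example.test/", 123, 456))
}
```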
adb8fb8 to 8e76b86
Changes
Implements `databricks experimental air get RUN_ID`, the Go port of the Python `air get` command. It fetches the run via `Jobs.GetRun` and renders:
- […] `User`), and the run's dashboard URL.
- […] `jobs/runs/get-output` (the `gen_ai_compute_output` field is not modeled by the typed SDK, so it's fetched via a direct REST call).

Why
`get` is the first real command integrated from the air CLI, and it sets the conventions the rest of the CLI will follow. The `{v, ts, data}` envelope mirrors the Python CLI so existing machine consumers keep working. The implementation is a faithful port of `handle_status` + the `cli_display` helpers, verified field-by-field against the Python source:
- […] (`_display_foreach_sweep_status`) and the training-config panel (`_fetch_and_display_yaml_config`); JSON output omits both, exactly matching `air get <run> --json`.
- […] `gen_ai_compute_output` field (direct REST call), and the MLflow link / YAML fetch are best-effort (logic matches the Python CLI).

Tests
- […] `buildGetData`, and all template branches (single-run minimal/all-fields, sweep, sweep-with-no-tasks).
- […] `unittest.mock` suite) cover `buildSweepInfo`, `printConfigYAML`, `mlflowURL` (over `httptest`, since it bypasses the typed SDK), and the `RunE` invalid-id / not-found branches.
- […] (`acceptance/experimental/air/get`) runs the command end-to-end against a stubbed Jobs API: text output, `-o json`, and an invalid run ID.