Evaluation metrics for chest X-ray radiology report generation (RRG)
radscore is a lightweight, pip-installable toolkit for evaluating
AI-generated radiology reports against ground-truth references. It bundles the
metrics most used in the RRG literature — lexical overlap, contextual
similarity, and clinically-aware factuality scores — behind a single CLI and a
small Python API, so you can score a model's predictions with one command.
- One-command evaluation of a JSON file of prediction/reference pairs.
- Seven metrics out of the box, including clinically-aware ones (CheXbert, F1-RadGraph), not just n-gram overlap.
- Bootstrap confidence intervals for robust reporting (--bootstrap-ci).
- Two output modes: one CSV per file, or an aggregated CSV per experiment/dataset for easy leaderboard-style comparison.
- Incremental / resumable: --skip-existing reuses already-computed metrics.
- Optional GREEN scorer (Stanford AIMI) via a git submodule, imported lazily.
- Reproducible (fixed random seed) and HPC-friendly.
| Metric | What it measures | Backend |
|---|---|---|
| BLEU-1 / BLEU-4 | N-gram precision overlap | sacrebleu / evaluate |
| ROUGE-L / ROUGE-2 | Longest-common-subsequence / bigram recall | rouge-score |
| BERTScore | Contextual embedding similarity | bert-score |
| F1-RadGraph | Clinical entity & relation factuality | RadGraph (F1RadGraphv2, partial reward) |
| CheXbert F1 | Agreement on 14-/5-class findings (Micro/Macro) | CheXbert labeler |
| GREEN (optional) | LLM-based clinical error grading | Stanford AIMI GREEN |
Requires Python 3.10 or 3.11 (the pinned torch / numpy versions have
wheels for these; 3.11 recommended).
git clone --recurse-submodules https://github.com/fruffini/radscore.git radscore
cd radscore
# create and activate an isolated environment (use a Python 3.11 interpreter)
python3.11 -m venv radscore-env
source radscore-env/bin/activate # Windows: radscore-env\Scripts\activate
python -m pip install -U pip # upgrade pip first
python -m pip install -e . # core install
python -m pip install -e ".[wandb]" # optional: Weights & Biases logging
python -m pip install -e ".[dev]" # optional: test tooling
GREEN is kept as a git submodule under third_party/GREEN and is only needed
for --compute-green:
# 1. check out the submodule (skipped if you cloned with --recurse-submodules)
git submodule update --init --recursive
# 2. install it. GREEN's setup.py pins python_requires "==3.12.1", so on any
# other Python pass --ignore-requires-python (its real deps work on 3.10/3.11)
python -m pip install -e third_party/GREEN --ignore-requires-python
If GREEN is not installed, every other metric still works: the CLI imports it
lazily, only when --compute-green is passed. Because the import is lazy and
falls back to the submodule path, --compute-green also works if you only
install GREEN's dependencies without the editable install above.
If pip install -e third_party/GREEN reports "not a valid editable requirement", the submodule directory is empty; run step 1 first.
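The lazy import with a submodule-path fallback described above can be pictured with a small sketch. It is not radscore's actual code: the helper names `lazy_import` and `score` are hypothetical, and a stdlib module stands in for the optional scorer so the example runs anywhere.

```python
import importlib
import sys
from pathlib import Path
from types import ModuleType
from typing import Optional


def lazy_import(module_name: str, fallback_dir: Optional[str] = None) -> ModuleType:
    """Import module_name on demand, retrying with fallback_dir on sys.path.

    Hypothetical helper: it illustrates the pattern, not radscore's internals.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError:
        if fallback_dir is None or not Path(fallback_dir).is_dir():
            raise
        # Make the checked-out source tree importable, then try once more.
        sys.path.insert(0, str(Path(fallback_dir).resolve()))
        importlib.invalidate_caches()
        return importlib.import_module(module_name)


def score(compute_optional: bool) -> str:
    # The optional dependency is only touched when the caller asks for it,
    # so a missing package cannot break the default code path.
    if not compute_optional:
        return "core metrics only"
    module = lazy_import("json", fallback_dir="third_party/GREEN")  # stand-in module
    return f"optional scorer loaded from {module.__name__}"
```

Because the import happens inside the function, calling `score(False)` never attempts to load the optional package at all.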
A ready-to-run example lives in examples/template.json:
radscore --filepath examples/template.json --output-mode per-file
This computes the default scorers (CheXbert, F1-RadGraph, BLEU-1, BLEU-4,
ROUGE-L) and writes a CSV under results/.
Prefer a guided walkthrough? See the tutorial notebook:
examples/radscore_tutorial.ipynb.
First runs download model weights (CheXbert, RadGraph, BERTScore) from the Hugging Face Hub. Set HF_HOME to control the cache location.
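When radscore is driven from Python rather than the shell, the cache location can be set in code. A minimal sketch, assuming the variable is set before any Hugging Face library is imported; `configure_hf_cache` is a hypothetical helper, not part of radscore.

```python
import os
from pathlib import Path


def configure_hf_cache(cache_dir: str) -> Path:
    """Point the Hugging Face cache at cache_dir, creating it if needed."""
    path = Path(cache_dir).expanduser().resolve()
    path.mkdir(parents=True, exist_ok=True)
    # Set this before importing huggingface_hub / transformers:
    # they read HF_HOME when they are first loaded.
    os.environ["HF_HOME"] = str(path)
    return path
```

On a cluster this would typically be called once at the top of the job script, with a directory on scratch storage that has room for the model weights.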
A JSON list of records. Only prediction and reference are required by
the core radscore command; the other fields are optional and used by the
per-category tools and planned label-based metrics.
| Field | Type | Required | Used by |
|---|---|---|---|
| prediction | string | yes | all commands (the generated report) |
| reference | string | yes | all commands (the ground-truth report) |
| label | list of 14 ints (0/1) | optional | ground-truth multilabel vector (see below) |
| target_category | string | optional | the two per-category commands (grouping key) |
| image_path | string | optional | metadata / identifier only; never read by metrics |
[
{
"image_path": "patient_001.png",
"prediction": "Small right pleural effusion with basal atelectasis. No pneumothorax.",
"reference": "Increased right basal density related to atelectasis and pleural effusion.",
"label": [0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1],
"target_category": "Atelectasis"
}
]
label is the ground-truth presence/absence of each of the 14 CheXbert
conditions for that study, in CheXbert head order:
0 Enlarged Cardiomediastinum 7 Atelectasis
1 Cardiomegaly 8 Pneumothorax
2 Lung Opacity 9 Pleural Effusion
3 Lung Lesion 10 Pleural Other
4 Edema 11 Fracture
5 Consolidation 12 Support Devices
6 Pneumonia 13 No Finding
A single report is multilabel: it can be positive for several conditions at
once. The vector above, for example, has five 1s, marking Lung Opacity,
Atelectasis, Pleural Effusion, Support Devices and No Finding. This vector is
the basis for the micro / macro multilabel F1 analysis described under
Per-category evaluation. See
examples/template.json for a runnable example.
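To make the schema and the label ordering concrete, the sketch below checks one record and decodes its label vector into condition names. The helpers `validate_record` and `positive_conditions` are illustrative and not part of radscore; the record is the example shown above.

```python
import json

# CheXbert head order, as listed above.
CONDITIONS = [
    "Enlarged Cardiomediastinum", "Cardiomegaly", "Lung Opacity", "Lung Lesion",
    "Edema", "Consolidation", "Pneumonia", "Atelectasis", "Pneumothorax",
    "Pleural Effusion", "Pleural Other", "Fracture", "Support Devices", "No Finding",
]


def validate_record(record: dict) -> None:
    """Raise ValueError if a record breaks the input schema described above."""
    for field in ("prediction", "reference"):
        if not isinstance(record.get(field), str):
            raise ValueError(f"missing or non-string required field: {field}")
    label = record.get("label")
    if label is not None:
        ok = isinstance(label, list) and len(label) == len(CONDITIONS)
        if not ok or any(value not in (0, 1) for value in label):
            raise ValueError("label must be a list of 14 ints, each 0 or 1")


def positive_conditions(label: list) -> list:
    """Return the names of the conditions marked 1 in a label vector."""
    return [name for name, flag in zip(CONDITIONS, label) if flag == 1]


records = json.loads("""[{
  "image_path": "patient_001.png",
  "prediction": "Small right pleural effusion with basal atelectasis. No pneumothorax.",
  "reference": "Increased right basal density related to atelectasis and pleural effusion.",
  "label": [0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1],
  "target_category": "Atelectasis"
}]""")

for record in records:
    validate_record(record)
    print(positive_conditions(record["label"]))
# ['Lung Opacity', 'Atelectasis', 'Pleural Effusion', 'Support Devices', 'No Finding']
```

Running a check like this before a long evaluation catches malformed records early, before any model weights are loaded.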
# Default scorers, one CSV per input file
radscore --filepath preds.json --output-mode per-file
# Pick scorers, add bootstrap CIs, aggregate per experiment/dataset
radscore --filepath preds.json \
--scorers CheXbert,F1-RadGraph,ROUGE-L \
--bootstrap-ci \
--output-mode per-experiment
# Add the GREEN metric (requires the submodule installed)
radscore --filepath preds.json --compute-green
# Reuse already-computed metrics, only fill in what's missing
radscore --filepath preds.json --skip-existing
# Weighted average across target categories
radscore-weighted --filepath preds.json
# Per-target-category CheXbert breakdown
radscore-chexbert-class --filepath preds.json
Key flags (radscore --help for the full list):
| Flag | Description |
|---|---|
| --filepath | Path to the predictions JSON (required). |
| --scorers | Comma-separated subset of metrics to run. |
| --bootstrap-ci | Report median + 95% bootstrap confidence intervals. |
| --output-mode | per-file or per-experiment (aggregated CSV). |
| --compute-green | Also compute the GREEN score. |
| --skip-existing | Skip metrics already present in the output CSV. |
| --save-breakdowns | Save per-condition CheXbert breakdown CSVs. |
Output path convention: an input under
outputs/<experiment>/<dataset>/foo.json is written to
results/<experiment>/<dataset>/....
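The "median + 95% bootstrap confidence intervals" that --bootstrap-ci reports can be illustrated with a percentile bootstrap over per-sample scores. This is a sketch under assumptions: the number of resamples, the seed, and the use of a mean over per-sample scores are illustrative choices, not necessarily what radscore does for each metric (corpus-level metrics such as BLEU are not simple per-sample means).

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=1000, confidence=0.95, seed=42):
    """Percentile bootstrap over per-sample scores.

    Returns (median, lower, upper) of the resampled means.
    n_resamples and seed are illustrative defaults, not radscore's.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(scores)
    # Resample with replacement, recompute the statistic, and sort the results.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    lower = means[int(alpha * (n_resamples - 1))]
    upper = means[int((1.0 - alpha) * (n_resamples - 1))]
    return statistics.median(means), lower, upper


per_sample_f1 = [0.42, 0.55, 0.61, 0.38, 0.70, 0.49, 0.58, 0.66]  # made-up scores
median, low, high = bootstrap_ci(per_sample_f1)
print(f"median={median:.3f}  95% CI=[{low:.3f}, {high:.3f}]")
```

With a fixed seed the same input always yields the same interval, which is what makes reported numbers reproducible across runs.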
Beyond scoring a whole file, radscore ships two helpers that report metrics
per target condition rather than over the full set. They are built for
robustness / shortcut studies (e.g. occlusion experiments) but apply to any
analysis where each prediction is associated with one condition of interest.
Data layout. Still a single flat JSON list — no nesting. The grouping key
is the per-record target_category (a CheXbert condition name). The important
subtlety is how occlusion-style data is laid out:
The same study appears once per target condition. For a given image, the input contains several records that share the same image_path, reference and ground-truth label, but differ in their target_category and in their prediction, because each record is the report the model produced when that condition's region was occluded. Each record therefore isolates the single condition on which to quantify a label-specific F1.
[
{ "image_path": "studyA.png", "prediction": "...occluded Atelectasis...", "reference": "...", "label": [0,0,1,0,0,0,0,1,0,1,0,0,1,1], "target_category": "Atelectasis" },
{ "image_path": "studyA.png", "prediction": "...occluded Lung Opacity...", "reference": "...", "label": [0,0,1,0,0,0,0,1,0,1,0,0,1,1], "target_category": "Lung_Opacity" },
{ "image_path": "studyA.png", "prediction": "...occluded Pleural Effusion...", "reference": "...", "label": [0,0,1,0,0,0,0,1,0,1,0,0,1,1], "target_category": "Pleural_Effusion" }
]
Note the three records above are the same study (label and reference
identical) repeated for three different target conditions. target_category
values may be written with spaces or underscores (Pleural Effusion /
Pleural_Effusion) — both are normalized to the canonical CheXbert condition.
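The normalization of target_category values can be sketched in a few lines. This assumes only the behaviour stated above (underscores and spaces are interchangeable); whether radscore also tolerates case differences is not specified here, so the sketch does not. `normalize_category` is a hypothetical name.

```python
CONDITIONS = (
    "Enlarged Cardiomediastinum", "Cardiomegaly", "Lung Opacity", "Lung Lesion",
    "Edema", "Consolidation", "Pneumonia", "Atelectasis", "Pneumothorax",
    "Pleural Effusion", "Pleural Other", "Fracture", "Support Devices", "No Finding",
)


def normalize_category(raw: str) -> str:
    """Map 'Pleural_Effusion' or 'Pleural Effusion' to the canonical name."""
    # Treat underscores as spaces and collapse any repeated whitespace.
    candidate = " ".join(raw.replace("_", " ").split())
    if candidate not in CONDITIONS:
        raise ValueError(f"not a CheXbert condition: {raw!r}")
    return candidate


print(normalize_category("Pleural_Effusion"))  # Pleural Effusion
print(normalize_category("Lung Opacity"))      # Lung Opacity
```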
The two commands then differ in what they compute:
| Command | target_category values | What it reports |
|---|---|---|
| radscore-weighted | any string | the full metric suite per group + a weighted_avg row |
| radscore-chexbert-class | must be a CheXbert condition | label-specific CheXbert F1 per condition; each record counts only toward its own class |
Micro vs. macro. Over the 14-class multilabel label system, two report-level
aggregations are standard: micro-F1 pools the true/false positives/negatives
across all conditions before computing F1 (dominated by frequent findings),
while macro-F1 computes F1 per condition and averages them (treats rare and
frequent findings equally). The dedicated radscore-label-f1 command (below)
computes both — and the per-condition and target-specific breakdowns — directly
against the ground-truth label vector.
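A small worked example shows how the two aggregations diverge. It uses three classes instead of 14 to stay readable, and it scores a class with no true or predicted positives as 0.0, which is one common convention and an assumption here.

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """Binary F1 from confusion counts; 0.0 when nothing is positive."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """Return (micro_f1, macro_f1, per_class_f1) for 0/1 multilabel rows."""
    n_classes = len(y_true[0])
    counts = [[0, 0, 0] for _ in range(n_classes)]  # [tp, fp, fn] per class
    for truth_row, pred_row in zip(y_true, y_pred):
        for c, (truth, pred) in enumerate(zip(truth_row, pred_row)):
            if truth and pred:
                counts[c][0] += 1
            elif pred:
                counts[c][1] += 1
            elif truth:
                counts[c][2] += 1
    per_class = [f1(*c) for c in counts]
    # Micro: pool the counts over all classes, then compute a single F1.
    micro = f1(*(sum(c[i] for c in counts) for i in range(3)))
    # Macro: unweighted mean of the per-class F1 scores.
    macro = sum(per_class) / n_classes
    return micro, macro, per_class


# Class 0 is frequent and always right; classes 1 and 2 are rare and missed.
y_true = [[1, 0, 1], [1, 0, 0], [1, 1, 0], [1, 0, 0]]
y_pred = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 1, 0]]

micro, macro, per_class = micro_macro_f1(y_true, y_pred)
print(per_class)                                # [1.0, 0.0, 0.0]
print(f"micro={micro:.3f} macro={macro:.3f}")   # micro=0.727 macro=0.333
```

The frequent class carries the micro score to 0.727, while the two missed rare classes pull the macro score down to 0.333.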
radscore-weighted — groups samples by target_category, computes the full
set of metrics independently for each group, and adds a final weighted_avg row
where each group is weighted by its sample count. Any string is accepted as a
category; missing values fall back to "unknown". Output is one CSV with a row
per category plus the weighted average.
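The grouping and weighting just described can be sketched as follows. The scorer is a deliberately trivial stand-in (exact-match rate) so the example is self-contained; `toy_metric` and `weighted_eval` are hypothetical names, not radscore functions.

```python
from collections import defaultdict


def toy_metric(predictions, references):
    """Stand-in scorer: share of records whose prediction equals its reference."""
    matches = sum(p == r for p, r in zip(predictions, references))
    return matches / len(predictions)


def weighted_eval(records, metric=toy_metric):
    """Score each target_category group, then add a size-weighted average row."""
    groups = defaultdict(list)
    for record in records:
        # Records without a category fall back to "unknown".
        groups[record.get("target_category") or "unknown"].append(record)

    rows = {}
    for category, members in groups.items():
        score = metric(
            [m["prediction"] for m in members],
            [m["reference"] for m in members],
        )
        rows[category] = {"n": len(members), "score": score}

    total = sum(row["n"] for row in rows.values())
    weighted = sum(row["n"] * row["score"] for row in rows.values()) / total
    rows["weighted_avg"] = {"n": total, "score": weighted}
    return rows


records = [
    {"prediction": "a", "reference": "a", "target_category": "Atelectasis"},
    {"prediction": "b", "reference": "b", "target_category": "Atelectasis"},
    {"prediction": "c", "reference": "x", "target_category": "Atelectasis"},
    {"prediction": "d", "reference": "d"},  # no category: grouped as "unknown"
]
for name, row in weighted_eval(records).items():
    print(name, row["n"], round(row["score"], 3))
```

Here the Atelectasis group scores 2/3 over three records and the unknown group scores 1.0 over one, so the weighted average is (3 × 2/3 + 1 × 1.0) / 4 = 0.75.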
radscore-weighted --filepath examples/template.json \
--scorers CheXbert,F1-RadGraph,BLEU-1,BLEU-4,ROUGE-L
radscore-chexbert-class: runs the CheXbert labeler on all predictions
and references, then for each condition C computes the binary F1 of that one
condition (predicted-report label vs reference-report label) over only the
records whose target_category == C. So a record tagged Cardiomegaly
contributes solely to the Cardiomegaly F1. The category must map to a
CheXbert condition
(Cardiomegaly, Edema, Consolidation, Atelectasis, Pleural Effusion,
Lung Opacity, Pneumonia, Pneumothorax, Fracture, Support Devices,
Enlarged Cardiomediastinum, Lung Lesion, Pleural Other, No Finding);
samples missing it raise an error. Supports --bootstrap-ci.
radscore-chexbert-class --filepath examples/template.json
Both commands are entirely optional: the core radscore command never needs target_category.
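The label-specific F1 that radscore-chexbert-class reports can be illustrated without running the labeler. In this sketch the fields `pred_labels` and `ref_labels` are hypothetical stand-ins for the CheXbert output on the prediction and reference texts, and `target_specific_f1` is an illustrative name.

```python
CONDITIONS = [
    "Enlarged Cardiomediastinum", "Cardiomegaly", "Lung Opacity", "Lung Lesion",
    "Edema", "Consolidation", "Pneumonia", "Atelectasis", "Pneumothorax",
    "Pleural Effusion", "Pleural Other", "Fracture", "Support Devices", "No Finding",
]


def one_hot(*names):
    """Build a 14-length 0/1 vector with the named conditions set to 1."""
    return [1 if condition in names else 0 for condition in CONDITIONS]


def target_specific_f1(records):
    """Binary F1 of each condition over only the records tagged with it."""
    counts = {}  # condition -> [tp, fp, fn]
    for record in records:
        condition = record["target_category"]
        if condition not in CONDITIONS:
            raise ValueError(f"not a CheXbert condition: {condition!r}")
        idx = CONDITIONS.index(condition)
        # Only the target condition's entry is compared; the rest is ignored.
        predicted = record["pred_labels"][idx]
        actual = record["ref_labels"][idx]
        tally = counts.setdefault(condition, [0, 0, 0])
        if predicted and actual:
            tally[0] += 1
        elif predicted:
            tally[1] += 1
        elif actual:
            tally[2] += 1
    return {
        condition: (2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0)
        for condition, (tp, fp, fn) in counts.items()
    }


records = [
    # The Edema mismatch below is ignored: the record is tagged Cardiomegaly.
    {"target_category": "Cardiomegaly", "pred_labels": one_hot("Cardiomegaly"),
     "ref_labels": one_hot("Cardiomegaly", "Edema")},
    {"target_category": "Cardiomegaly", "pred_labels": one_hot("Edema"),
     "ref_labels": one_hot("Cardiomegaly")},
    {"target_category": "Edema", "pred_labels": one_hot("Edema"), "ref_labels": one_hot()},
    {"target_category": "Edema", "pred_labels": one_hot("Edema"), "ref_labels": one_hot("Edema")},
    {"target_category": "Edema", "pred_labels": one_hot("Edema"), "ref_labels": one_hot("Edema")},
]
result = target_specific_f1(records)
print({name: round(value, 3) for name, value in result.items()})
# {'Cardiomegaly': 0.667, 'Edema': 0.8}
```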
Scores the model's predicted findings — the CheXbert labels extracted from
the prediction text — against the curated ground-truth label vector on each
record (rather than against the reference text). Requires a 14-length binary
label on every record. It produces:
- Multilabel micro / macro F1 over the 14 conditions and the 5-class subset (Micro-F1-14, Macro-F1-14, Micro-F1-5, Macro-F1-5), saved to <name>_label_f1.csv, plus per-condition F1 in <name>_label_f1_per_condition.csv.
- Target-specific F1 (only if records carry target_category): the binary F1 of each target condition over just its tagged records, saved to <name>_label_target_f1.csv.
# multilabel + (if present) target-specific F1
radscore-label-f1 --filepath examples/template.json
# uncertain findings as positive, with bootstrap CIs
radscore-label-f1 --filepath preds.json --uncertain-mode rrg+ --bootstrap-ci
Note: radscore's built-in Micro-F1-14 / Macro-F1-14 (from the CheXbert scorer) compare prediction text vs reference text; radscore-label-f1 instead uses the curated label vector as ground truth.
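The effect of counting uncertain findings as positive can be shown on a toy example. This is a sketch under assumptions: the labeler output is represented as plain strings, the other --uncertain-mode values are not listed here so a boolean stands in for the mode, and `binarize` and `binary_f1` are hypothetical helpers.

```python
def binarize(statuses, uncertain_as_positive):
    """Turn per-condition labeler statuses into a 0/1 vector."""
    positive = {"positive", "uncertain"} if uncertain_as_positive else {"positive"}
    return [1 if status in positive else 0 for status in statuses]


def binary_f1(truth, pred):
    """F1 over paired 0/1 values; 0.0 when nothing is positive."""
    tp = sum(1 for t, p in zip(truth, pred) if t and p)
    fp = sum(1 for t, p in zip(truth, pred) if p and not t)
    fn = sum(1 for t, p in zip(truth, pred) if t and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Labeler statuses extracted from one predicted report (four conditions shown).
statuses = ["positive", "uncertain", "negative", "blank"]
label = [1, 1, 0, 0]  # curated ground truth for the same four conditions

strict = binarize(statuses, uncertain_as_positive=False)
lenient = binarize(statuses, uncertain_as_positive=True)
print(strict, round(binary_f1(label, strict), 3))    # [1, 0, 0, 0] 0.667
print(lenient, round(binary_f1(label, lenient), 3))  # [1, 1, 0, 0] 1.0
```

The choice matters whenever the model hedges ("possible effusion") on a finding the ground truth marks as present.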
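from radscore.cli import run, ReportGenerationEvaluator
# Example inputs, assumed to be parallel lists of report strings
predictions = ["Small right pleural effusion with basal atelectasis. No pneumothorax."]
references = ["Increased right basal density related to atelectasis and pleural effusion."]
# Programmatic scoring
evaluator = ReportGenerationEvaluator(scorers=["BLEU-4", "ROUGE-L", "F1-RadGraph"])
scores = evaluator.evaluate(predictions, references)
# Or drive the full pipeline (CSV output, CIs, GREEN, ...)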
run(filepath="preds.json", scorers=["CheXbert", "F1-RadGraph"], bootstrap_ci=True)
radscore/
├── src/radscore/
│ ├── cli.py # main CLI + evaluator (radscore)
│ ├── weighted.py # per-category weighted eval (radscore-weighted)
│ ├── chexbert_class.py # per-category CheXbert breakdown (radscore-chexbert-class)
│ ├── label_f1.py # label-grounded multilabel/target F1 (radscore-label-f1)
│ ├── chexbert.py # CheXbert labeler + F1
│ ├── f1radgraph.py # F1-RadGraph (v2, with CIs)
│ ├── rouge.py # ROUGE with bootstrap aggregation
│ └── factuality_*.py # shared CheXbert/condition utilities
├── examples/template.json # runnable input example
├── branding/ # logo, banner, image-gen prompts
├── third_party/GREEN/ # optional GREEN submodule
└── pyproject.toml
See TODO.md. Planned for the next version: optional label-based
F1-score (classification performance against a ground-truth 14-class CheXbert
label vector).
python -m pip install -e ".[dev]" # editable install + pytest
pytest # offline, ~20s; metric maths verified vs scikit-learn
Contributions welcome: see CONTRIBUTING.md.
radscore stands on excellent prior work:
RadGraph and
GREEN (Stanford AIMI),
CheXbert (Stanford ML Group),
BERTScore,
sacreBLEU, and Hugging Face
evaluate.
- Filippo Ruffini — @fruffini
- Marco Salmè — @marcosal30
If you use radscore in your research, please cite this repository:
@software{radscore,
title = {radscore: Evaluation metrics for chest X-ray radiology report generation},
author = {Ruffini, Filippo and Salmè, Marco},
year = {2026},
url = {https://github.com/fruffini/radscore}
}
Released under the PolyForm Noncommercial License 1.0.0. Noncommercial use only: research, teaching, personal, and use by academic / nonprofit / government organizations is permitted; commercial use is not. See the LICENSE for the full terms.

