radscore

Evaluation metrics for chest X-ray radiology report generation (RRG)

radscore is a lightweight, pip-installable toolkit for evaluating AI-generated radiology reports against ground-truth references. It bundles the metrics most used in the RRG literature — lexical overlap, contextual similarity, and clinically-aware factuality scores — behind a single CLI and a small Python API, so you can score a model's predictions with one command.

Features

One-command evaluation of a JSON file of prediction / reference pairs.
Seven metrics out of the box, including clinically-aware ones (CheXbert, F1-RadGraph) — not just n-gram overlap.
Bootstrap confidence intervals for robust reporting (--bootstrap-ci).
Two output modes: one CSV per file, or an aggregated CSV per experiment/dataset for easy leaderboard-style comparison.
Incremental / resumable: --skip-existing reuses already-computed metrics.
Optional GREEN scorer (Stanford AIMI) via a git submodule, imported lazily.
Reproducible (fixed random seed) and HPC-friendly.

Metrics

Metric	What it measures	Backend
BLEU-1 / BLEU-4	N-gram precision overlap	`sacrebleu` / `evaluate`
ROUGE-L / ROUGE-2	Longest-common-subsequence / bigram recall	`rouge-score`
BERTScore	Contextual embedding similarity	`bert-score`
F1-RadGraph	Clinical entity & relation factuality	RadGraph (`F1RadGraphv2`, partial reward)
CheXbert F1	Agreement on 14-/5-class findings (Micro/Macro)	CheXbert labeler
GREEN (optional)	LLM-based clinical error grading	Stanford AIMI GREEN

Installation

Requires Python 3.10 or 3.11 (the pinned torch / numpy versions have wheels for these; 3.11 recommended).

git clone --recurse-submodules https://github.com/fruffini/radscore.git radscore
cd radscore

# create and activate an isolated environment (use a Python 3.11 interpreter)
python3.11 -m venv radscore-env
source radscore-env/bin/activate      # Windows: radscore-env\Scripts\activate
python -m pip install -U pip          # upgrade pip first

python -m pip install -e .            # core install

python -m pip install -e ".[wandb]"   # optional: Weights & Biases logging
python -m pip install -e ".[dev]"     # optional: test tooling

GREEN (optional)

GREEN is kept as a git submodule under third_party/GREEN and is only needed for --compute-green:

# 1. check out the submodule (skipped if you cloned with --recurse-submodules)
git submodule update --init --recursive

# 2. install it. GREEN's setup.py pins python_requires "==3.12.1", so on any
#    other Python pass --ignore-requires-python (its real deps work on 3.10/3.11)
python -m pip install -e third_party/GREEN --ignore-requires-python

If GREEN is not installed, every other metric still works — the CLI imports it lazily, only when --compute-green is passed. Because the import is lazy and falls back to the submodule path, --compute-green also works if you only install GREEN's dependencies without the editable install above.
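The lazy import with a fallback path described above follows a common pattern, sketched below. This is an illustration only, not radscore's actual code: the helper name `lazy_import` is hypothetical, and the module name `green_score` mentioned in the comment is an assumption about how the GREEN package is laid out.

```python
import importlib
import sys
from pathlib import Path
from typing import Optional


def lazy_import(module_name: str, fallback_dir: Optional[str] = None):
    """Import a module on demand, retrying once from a fallback directory.

    Hypothetical helper for illustration; not part of radscore's API.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError:
        if fallback_dir is None or not Path(fallback_dir).is_dir():
            raise
        # Make the submodule checkout importable, then retry once.
        sys.path.insert(0, str(Path(fallback_dir).resolve()))
        return importlib.import_module(module_name)


# Assumed usage, called only when GREEN is actually requested:
# green = lazy_import("green_score", "third_party/GREEN")
```

Because nothing is imported until the helper is called, a missing optional dependency only causes an error on the code path that needs it.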

If pip install -e third_party/GREEN reports "not a valid editable requirement", the submodule directory is empty — run step 1 first.

Quickstart

A ready-to-run example lives in examples/template.json:

radscore --filepath examples/template.json --output-mode per-file

This computes the default scorers (CheXbert, F1-RadGraph, BLEU-1, BLEU-4, ROUGE-L) and writes a CSV under results/.

Prefer a guided walkthrough? See the tutorial notebook: examples/radscore_tutorial.ipynb.

The first run downloads model weights (CheXbert, RadGraph, BERTScore) from the Hugging Face Hub. Set HF_HOME to control the cache location.
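When driving radscore from Python rather than the shell, HF_HOME can also be set programmatically. The sketch below assumes the variable is set before any Hugging Face library is imported, since the cache location is typically resolved at import time. The directory name is a placeholder.

```python
import os
from pathlib import Path

# Placeholder cache directory; any writable location works.
cache_dir = Path.home() / ".cache" / "radscore-hf"

# setdefault keeps a value already exported in the shell and only
# fills in the placeholder when HF_HOME is unset.
os.environ.setdefault("HF_HOME", str(cache_dir))

print(os.environ["HF_HOME"])
```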

Input format

A JSON list of records. Only prediction and reference are required by the core radscore command; the other fields are optional and used by the per-category tools and the label-based metrics (radscore-label-f1).

Field	Type	Required	Used by
`prediction`	string	yes	all commands — the generated report
`reference`	string	yes	all commands — the ground-truth report
`label`	list of 14 ints (0/1)	optional	ground-truth multilabel vector, used by `radscore-label-f1` (see below)
`target_category`	string	optional	the two per-category commands (grouping key)
`image_path`	string	optional	metadata / identifier only; never read by metrics

[
  {
    "image_path": "patient_001.png",
    "prediction": "Small right pleural effusion with basal atelectasis. No pneumothorax.",
    "reference": "Increased right basal density related to atelectasis and pleural effusion.",
    "label": [0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1],
    "target_category": "Atelectasis"
  }
]
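Before launching a long evaluation it can help to check the records against the field table above. The following is a minimal sketch of such a check; `validate_records` is a hypothetical helper written for this README, not a function shipped with radscore.

```python
REQUIRED_FIELDS = ("prediction", "reference")
N_CONDITIONS = 14


def validate_records(records):
    """Return a list of problems; an empty list means the input looks valid."""
    if not isinstance(records, list):
        return ["top-level JSON value must be a list of records"]
    problems = []
    for i, rec in enumerate(records):
        if not isinstance(rec, dict):
            problems.append(f"record {i}: expected a JSON object")
            continue
        for field in REQUIRED_FIELDS:
            if not isinstance(rec.get(field), str):
                problems.append(f"record {i}: '{field}' must be a string")
        label = rec.get("label")
        if label is not None and not (
            isinstance(label, list)
            and len(label) == N_CONDITIONS
            and all(v in (0, 1) for v in label)
        ):
            problems.append(f"record {i}: 'label' must be a list of 14 ints (0/1)")
    return problems


records = [
    {"prediction": "No pneumothorax.", "reference": "No pneumothorax is seen.",
     "label": [0] * 13 + [1]},
    {"prediction": "Cardiomegaly.", "label": [1, 0]},  # missing reference, short label
]
for problem in validate_records(records):
    print(problem)
```

In practice `records` would come from `json.load` on the predictions file; the inline list here only keeps the example self-contained.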

The `label` multilabel vector

label is the ground-truth presence/absence of each of the 14 CheXbert conditions for that study, in CheXbert head order:

0  Enlarged Cardiomediastinum   7  Atelectasis
1  Cardiomegaly                 8  Pneumothorax
2  Lung Opacity                 9  Pleural Effusion
3  Lung Lesion                 10  Pleural Other
4  Edema                       11  Fracture
5  Consolidation               12  Support Devices
6  Pneumonia                   13  No Finding

A single report is multilabel: it can be positive for several conditions at once. The vector above, for example, marks Lung Opacity, Atelectasis, Pleural Effusion, Support Devices and No Finding. This vector is the basis for the micro / macro multilabel F1 analysis described under Per-category evaluation. See examples/template.json for a runnable example.
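To read a label vector back as condition names, the index table above can be turned into a lookup. This is a small sketch; `decode_label` is a hypothetical helper, and the list simply restates the head order given above.

```python
# CheXbert head order, as listed in the table above (index 0 to 13).
CHEXBERT_CONDITIONS = [
    "Enlarged Cardiomediastinum", "Cardiomegaly", "Lung Opacity", "Lung Lesion",
    "Edema", "Consolidation", "Pneumonia", "Atelectasis", "Pneumothorax",
    "Pleural Effusion", "Pleural Other", "Fracture", "Support Devices", "No Finding",
]


def decode_label(label):
    """Return the names of the conditions marked 1 in a 14-element label vector."""
    if len(label) != len(CHEXBERT_CONDITIONS):
        raise ValueError(f"expected {len(CHEXBERT_CONDITIONS)} entries, got {len(label)}")
    return [name for name, flag in zip(CHEXBERT_CONDITIONS, label) if flag == 1]


label = [0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1]
print(decode_label(label))
# ['Lung Opacity', 'Atelectasis', 'Pleural Effusion', 'Support Devices', 'No Finding']
```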

Usage

CLI

# Default scorers, one CSV per input file
radscore --filepath preds.json --output-mode per-file

# Pick scorers, add bootstrap CIs, aggregate per experiment/dataset
radscore --filepath preds.json \
         --scorers CheXbert,F1-RadGraph,ROUGE-L \
         --bootstrap-ci \
         --output-mode per-experiment

# Add the GREEN metric (requires the submodule installed)
radscore --filepath preds.json --compute-green

# Reuse already-computed metrics, only fill in what's missing
radscore --filepath preds.json --skip-existing

# Weighted average across target categories
radscore-weighted --filepath preds.json

# Per-target-category CheXbert breakdown
radscore-chexbert-class --filepath preds.json

Key flags (radscore --help for the full list):

Flag	Description
`--filepath`	Path to the predictions JSON (required).
`--scorers`	Comma-separated subset of metrics to run.
`--bootstrap-ci`	Report median + 95% bootstrap confidence intervals.
`--output-mode`	`per-file` or `per-experiment` (aggregated CSV).
`--compute-green`	Also compute the GREEN score.
`--skip-existing`	Skip metrics already present in the output CSV.
`--save-breakdowns`	Save per-condition CheXbert breakdown CSVs.
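For readers unfamiliar with what `--bootstrap-ci` reports, the sketch below shows the general percentile-bootstrap technique on a list of per-sample scores. It is an illustration under stated assumptions: the number of resamples, the seed, and the choice of resampling the mean are placeholders, and radscore's own resampling scheme may differ.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap over the mean of per-sample scores.

    Returns (median, lower, upper) of the resampled means.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    n = len(scores)
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lower = means[round((alpha / 2) * n_resamples)]
    upper = means[round((1 - alpha / 2) * n_resamples) - 1]
    return statistics.median(means), lower, upper


# Illustrative per-sample scores, not real results.
median, lower, upper = bootstrap_ci([0.2, 0.4, 0.6, 0.8])
print(f"median={median:.3f}  95% CI=[{lower:.3f}, {upper:.3f}]")
```

Each resample draws as many samples as the original set, with replacement; the interval bounds are the 2.5th and 97.5th percentiles of the resampled statistic.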

Output path convention: an input under outputs/<experiment>/<dataset>/foo.json is written to results/<experiment>/<dataset>/....
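The mapping from input path to results directory can be sketched with pathlib. `results_dir_for` is a hypothetical helper, and it only derives the directory: the output file name inside that directory is not specified above, so the sketch does not guess it.

```python
from pathlib import Path


def results_dir_for(input_path: str) -> Path:
    """Map outputs/<experiment>/<dataset>/<file> to results/<experiment>/<dataset>."""
    parts = Path(input_path).parts
    if "outputs" not in parts:
        raise ValueError("expected a path under outputs/<experiment>/<dataset>/")
    idx = parts.index("outputs")
    if len(parts) < idx + 4:
        raise ValueError("path is missing the experiment, dataset or file component")
    experiment, dataset = parts[idx + 1], parts[idx + 2]
    return Path("results") / experiment / dataset


# "exp1" and "mimic" are made-up names for illustration.
print(results_dir_for("outputs/exp1/mimic/foo.json").as_posix())
# results/exp1/mimic
```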

Per-category evaluation (optional)

Beyond scoring a whole file, radscore ships two helpers that report metrics per target condition rather than over the full set. They are built for robustness / shortcut studies (e.g. occlusion experiments) but apply to any analysis where each prediction is associated with one condition of interest.

Data layout. Still a single flat JSON list — no nesting. The grouping key is the per-record target_category (a CheXbert condition name). The important subtlety is how occlusion-style data is laid out:

The same study appears once per target condition. For a given image, the input contains several records that share the same image_path, reference and ground-truth label, but differ in their target_category — and in their prediction, because each record is the report the model produced when that condition's region was occluded. Each record therefore isolates the single condition on which to quantify a label-specific F1.

[
  { "image_path": "studyA.png", "prediction": "...occluded Atelectasis...",      "reference": "...", "label": [0,0,1,0,0,0,0,1,0,1,0,0,1,1], "target_category": "Atelectasis" },
  { "image_path": "studyA.png", "prediction": "...occluded Lung Opacity...",     "reference": "...", "label": [0,0,1,0,0,0,0,1,0,1,0,0,1,1], "target_category": "Lung_Opacity" },
  { "image_path": "studyA.png", "prediction": "...occluded Pleural Effusion...", "reference": "...", "label": [0,0,1,0,0,0,0,1,0,1,0,0,1,1], "target_category": "Pleural_Effusion" }
]

Note the three records above are the same study (label and reference identical) repeated for three different target conditions. target_category values may be written with spaces or underscores (Pleural Effusion / Pleural_Effusion) — both are normalized to the canonical CheXbert condition.
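The space/underscore normalization mentioned above could look roughly like this. `normalize_category` is a hypothetical helper; in particular, the case-insensitive matching is an assumption made for the sketch, not documented radscore behaviour.

```python
CHEXBERT_CONDITIONS = [
    "Enlarged Cardiomediastinum", "Cardiomegaly", "Lung Opacity", "Lung Lesion",
    "Edema", "Consolidation", "Pneumonia", "Atelectasis", "Pneumothorax",
    "Pleural Effusion", "Pleural Other", "Fracture", "Support Devices", "No Finding",
]

# Lookup from lower-cased name to canonical spelling.
_CANONICAL = {name.lower(): name for name in CHEXBERT_CONDITIONS}


def normalize_category(raw: str) -> str:
    """Map 'Pleural_Effusion' or 'Pleural Effusion' to the canonical condition name."""
    key = " ".join(raw.replace("_", " ").split()).lower()
    if key not in _CANONICAL:
        raise ValueError(f"not a CheXbert condition: {raw!r}")
    return _CANONICAL[key]


print(normalize_category("Pleural_Effusion"))  # Pleural Effusion
print(normalize_category("Lung Opacity"))      # Lung Opacity
```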

The two commands then differ in what they compute:

Command	`target_category` values	What it reports
`radscore-weighted`	any string	the full metric suite per group + a `weighted_avg` row
`radscore-chexbert-class`	must be a CheXbert condition	label-specific CheXbert F1 per condition; each record counts only toward its own class

Micro vs. macro. Over the 14-condition multilabel vector, two report-level aggregations are standard: micro-F1 pools the true positives, false positives and false negatives across all conditions before computing F1 (so it is dominated by frequent findings), while macro-F1 computes F1 per condition and averages the results (so rare and frequent findings count equally). The dedicated radscore-label-f1 command (below) computes both, along with the per-condition and target-specific breakdowns, directly against the ground-truth label vector.
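A toy example makes the difference between the two aggregations concrete. The sketch below uses three conditions instead of 14 to stay readable, with made-up labels; scoring a condition with no positives as F1 = 0 is an assumption of this sketch.

```python
def f1(tp, fp, fn):
    """Binary F1 from counts; defined as 0.0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """y_true / y_pred: lists of equal-length 0/1 vectors, one per report."""
    n_classes = len(y_true[0])
    counts = []
    for c in range(n_classes):
        tp = sum(t[c] == 1 and p[c] == 1 for t, p in zip(y_true, y_pred))
        fp = sum(t[c] == 0 and p[c] == 1 for t, p in zip(y_true, y_pred))
        fn = sum(t[c] == 1 and p[c] == 0 for t, p in zip(y_true, y_pred))
        counts.append((tp, fp, fn))
    # Micro: pool the counts over all conditions, then compute one F1.
    micro = f1(*(sum(col) for col in zip(*counts)))
    # Macro: one F1 per condition, then an unweighted mean.
    macro = sum(f1(*c) for c in counts) / n_classes
    return micro, macro


# Condition 0 is frequent and always right; condition 2 is rare and missed.
y_true = [[1, 0, 0], [1, 0, 0], [1, 1, 0], [1, 0, 1]]
y_pred = [[1, 0, 0], [1, 0, 0], [1, 1, 0], [1, 0, 0]]
micro, macro = micro_macro_f1(y_true, y_pred)
print(f"micro={micro:.3f} macro={macro:.3f}")  # micro=0.909 macro=0.667
```

The single missed rare finding costs one false negative out of six pooled decisions in the micro score, but pulls a whole condition to zero in the macro score.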

radscore-weighted — groups samples by target_category, computes the full set of metrics independently for each group, and adds a final weighted_avg row where each group is weighted by its sample count. Any string is accepted as a category; missing values fall back to "unknown". Output is one CSV with a row per category plus the weighted average.

radscore-weighted --filepath examples/template.json \
                  --scorers CheXbert,F1-RadGraph,BLEU-1,BLEU-4,ROUGE-L
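The `weighted_avg` row described above is a sample-count-weighted mean of the per-group values. A minimal sketch follows, with a hypothetical helper name and made-up numbers that are not real results.

```python
def weighted_average(group_scores):
    """group_scores maps category -> (metric value, number of samples)."""
    total = sum(n for _, n in group_scores.values())
    if total == 0:
        raise ValueError("no samples to average over")
    return sum(score * n for score, n in group_scores.values()) / total


# Illustrative values only.
groups = {
    "Atelectasis": (0.50, 30),   # 30 samples
    "Cardiomegaly": (0.80, 10),  # 10 samples
}
print(round(weighted_average(groups), 3))  # 0.575
```

The larger group dominates: the unweighted mean of the two values would be 0.65, while the weighted average is (0.50 * 30 + 0.80 * 10) / 40 = 0.575.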

radscore-chexbert-class — runs the CheXbert labeler on all predictions and references, then for each condition C computes the binary F1 of that one condition (predicted-report label vs reference-report label) over only the records whose target_category == C. So a record tagged Cardiomegaly contributes solely to the Cardiomegaly F1. The category must map to a CheXbert condition (Cardiomegaly, Edema, Consolidation, Atelectasis, Pleural Effusion, Lung Opacity, Pneumonia, Pneumothorax, Fracture, Support Devices, Enlarged Cardiomediastinum, Lung Lesion, Pleural Other, No Finding); samples missing it raise an error. Supports --bootstrap-ci.

radscore-chexbert-class --filepath examples/template.json
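The per-condition computation described above can be sketched as follows. The input shape is an assumption made for illustration: each row holds the record's target condition plus the labeler's binary decision for that one condition in the prediction and in the reference. The helper name and the rows are hypothetical.

```python
from collections import defaultdict


def target_specific_f1(rows):
    """rows: (target_category, predicted 0/1, reference 0/1) for the target condition."""
    counts = defaultdict(lambda: [0, 0, 0])  # tp, fp, fn per condition
    for target, pred, ref in rows:
        c = counts[target]
        if pred and ref:
            c[0] += 1
        elif pred and not ref:
            c[1] += 1
        elif ref and not pred:
            c[2] += 1
    result = {}
    for target, (tp, fp, fn) in counts.items():
        denom = 2 * tp + fp + fn
        result[target] = 2 * tp / denom if denom else 0.0
    return result


# Each record contributes only to the F1 of its own target condition.
rows = [
    ("Cardiomegaly", 1, 1),
    ("Cardiomegaly", 0, 1),
    ("Atelectasis", 1, 1),
    ("Atelectasis", 1, 0),
    ("Atelectasis", 1, 1),
]
for target, score in target_specific_f1(rows).items():
    print(f"{target}: {score:.3f}")
# Cardiomegaly: 0.667
# Atelectasis: 0.800
```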

Both commands are entirely optional — the core radscore command never needs target_category.

Label-grounded F1 (`radscore-label-f1`)

Scores the model's predicted findings (the CheXbert labels extracted from the prediction text) against the curated ground-truth label vector on each record, rather than against the reference text. Requires a length-14 binary label on every record. It produces:

Multilabel micro / macro F1 over the 14 conditions and the 5-class subset (Micro-F1-14, Macro-F1-14, Micro-F1-5, Macro-F1-5), saved to <name>_label_f1.csv, plus per-condition F1 in <name>_label_f1_per_condition.csv.
Target-specific F1 (only if records carry target_category): the binary F1 of each target condition over just its tagged records, saved to <name>_label_target_f1.csv.

# multilabel + (if present) target-specific F1
radscore-label-f1 --filepath examples/template.json

# uncertain findings as positive, with bootstrap CIs
radscore-label-f1 --filepath preds.json --uncertain-mode rrg+ --bootstrap-ci

Note: radscore's built-in Micro-F1-14 / Macro-F1-14 (from the CheXbert scorer) compare prediction text vs reference text; radscore-label-f1 instead uses the curated label vector as ground truth.
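Treating uncertain findings as positive, as in the `--uncertain-mode rrg+` example above, amounts to a choice made when the labeler's per-condition status is reduced to 0/1. The sketch below is purely illustrative: the status strings, the helper name and the boolean switch are assumptions, and radscore's internal encoding and other mode names are not documented here.

```python
def binarize(status: str, uncertain_as_positive: bool) -> int:
    """Reduce a per-condition labeler status to a binary label.

    Assumed statuses: 'positive', 'negative', 'uncertain', 'blank'.
    """
    if status == "positive":
        return 1
    if status == "uncertain":
        return 1 if uncertain_as_positive else 0
    if status in ("negative", "blank"):
        return 0
    raise ValueError(f"unknown status: {status!r}")


statuses = ["positive", "uncertain", "negative", "blank"]
print([binarize(s, uncertain_as_positive=True) for s in statuses])   # [1, 1, 0, 0]
print([binarize(s, uncertain_as_positive=False) for s in statuses])  # [1, 0, 0, 0]
```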

Python API

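from radscore.cli import run, ReportGenerationEvaluator

# Example inputs, assumed to be parallel lists of report strings:
# predictions[i] is scored against references[i]
predictions = ["Small right pleural effusion with basal atelectasis. No pneumothorax."]
references = ["Increased right basal density related to atelectasis and pleural effusion."]

# Programmatic scoring
evaluator = ReportGenerationEvaluator(scorers=["BLEU-4", "ROUGE-L", "F1-RadGraph"])
scores = evaluator.evaluate(predictions, references)

# Or drive the full pipeline (CSV output, CIs, GREEN, ...)
run(filepath="preds.json", scorers=["CheXbert", "F1-RadGraph"], bootstrap_ci=True)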

Project structure

radscore/
├── src/radscore/
│   ├── cli.py             # main CLI + evaluator (radscore)
│   ├── weighted.py        # per-category weighted eval (radscore-weighted)
│   ├── chexbert_class.py  # per-category CheXbert breakdown (radscore-chexbert-class)
│   ├── label_f1.py        # label-grounded multilabel/target F1 (radscore-label-f1)
│   ├── chexbert.py        # CheXbert labeler + F1
│   ├── f1radgraph.py      # F1-RadGraph (v2, with CIs)
│   ├── rouge.py           # ROUGE with bootstrap aggregation
│   └── factuality_*.py    # shared CheXbert/condition utilities
├── examples/template.json # runnable input example
├── branding/              # logo, banner, image-gen prompts
├── third_party/GREEN/     # optional GREEN submodule
└── pyproject.toml

Roadmap

See TODO.md. Planned for the next version: optional label-based F1-score (classification performance against a ground-truth 14-class CheXbert label vector).

Development

python -m pip install -e ".[dev]"   # editable install + pytest
pytest                              # offline, ~20s; metric maths verified vs scikit-learn

Contributions welcome — see CONTRIBUTING.md.

Acknowledgements

radscore stands on excellent prior work: RadGraph and GREEN (Stanford AIMI), CheXbert (Stanford ML Group), BERTScore, sacreBLEU, and Hugging Face evaluate.

Authors

Filippo Ruffini — @fruffini
Marco Salmè — @marcosal30

Citation

If you use radscore in your research, please cite this repository:

@software{radscore,
  title  = {radscore: Evaluation metrics for chest X-ray radiology report generation},
  author = {Ruffini, Filippo and Salmè, Marco},
  year   = {2026},
  url    = {https://github.com/fruffini/radscore}
}

License

Released under the PolyForm Noncommercial License 1.0.0. Noncommercial use only: research, teaching, and personal use, as well as use by academic, nonprofit, and government organizations, is permitted; commercial use is not. See the LICENSE file for the full terms.
