radscore

Evaluation metrics for chest X-ray radiology report generation (RRG)


radscore is a lightweight, pip-installable toolkit for evaluating AI-generated radiology reports against ground-truth references. It bundles the metrics most used in the RRG literature — lexical overlap, contextual similarity, and clinically-aware factuality scores — behind a single CLI and a small Python API, so you can score a model's predictions with one command.

Features

  • One-command evaluation of a JSON file of prediction / reference pairs.
  • Seven metrics out of the box, including clinically-aware ones (CheXbert, F1-RadGraph) — not just n-gram overlap.
  • Bootstrap confidence intervals for robust reporting (--bootstrap-ci).
  • Two output modes: one CSV per file, or an aggregated CSV per experiment/dataset for easy leaderboard-style comparison.
  • Incremental / resumable: --skip-existing reuses already-computed metrics.
  • Optional GREEN scorer (Stanford AIMI) via a git submodule, imported lazily.
  • Reproducible (fixed random seed) and HPC-friendly.

Metrics

| Metric | What it measures | Backend |
| --- | --- | --- |
| BLEU-1 / BLEU-4 | N-gram precision overlap | sacrebleu / evaluate |
| ROUGE-L / ROUGE-2 | Longest-common-subsequence / bigram recall | rouge-score |
| BERTScore | Contextual embedding similarity | bert-score |
| F1-RadGraph | Clinical entity & relation factuality | RadGraph (F1RadGraphv2, partial reward) |
| CheXbert F1 | Agreement on 14-/5-class findings (Micro/Macro) | CheXbert labeler |
| GREEN (optional) | LLM-based clinical error grading | Stanford AIMI GREEN |
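
To make the "longest-common-subsequence" row concrete, here is a minimal sketch of an LCS-based overlap score. It is an illustration only: the whitespace tokenisation and the function names are assumptions, and the rouge-score backend applies its own tokenisation and options, so the numbers will not necessarily match.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the diagonal on a match, else carry the best neighbour.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_sketch(prediction, reference):
    """Return (precision, recall, f1) from the LCS of lowercased, whitespace-split tokens."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    if not pred or not ref:
        return 0.0, 0.0, 0.0
    lcs = lcs_length(pred, ref)
    precision, recall = lcs / len(pred), lcs / len(ref)
    f1 = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return precision, recall, f1


print(rouge_l_sketch("no pleural effusion or pneumothorax",
                     "no pneumothorax or pleural effusion"))
```

Both reports contain the same five words, but only three of them ("no pleural effusion") appear in the same order, which is what an LCS-based score rewards.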

Installation

Requires Python 3.10 or 3.11 (the pinned torch / numpy versions have wheels for these; 3.11 recommended).

git clone --recurse-submodules https://github.com/fruffini/radscore.git radscore
cd radscore

# create and activate an isolated environment (use a Python 3.11 interpreter)
python3.11 -m venv radscore-env
source radscore-env/bin/activate      # Windows: radscore-env\Scripts\activate
python -m pip install -U pip          # upgrade pip first

python -m pip install -e .            # core install

python -m pip install -e ".[wandb]"   # optional: Weights & Biases logging
python -m pip install -e ".[dev]"     # optional: test tooling

GREEN (optional)

GREEN is kept as a git submodule under third_party/GREEN and is only needed for --compute-green:

# 1. check out the submodule (skipped if you cloned with --recurse-submodules)
git submodule update --init --recursive

# 2. install it. GREEN's setup.py pins python_requires "==3.12.1", so on any
#    other Python pass --ignore-requires-python (its real deps work on 3.10/3.11)
python -m pip install -e third_party/GREEN --ignore-requires-python

If GREEN is not installed, every other metric still works — the CLI imports it lazily, only when --compute-green is passed. Because the import is lazy and falls back to the submodule path, --compute-green also works if you only install GREEN's dependencies without the editable install above.

If pip install -e third_party/GREEN reports "not a valid editable requirement", the submodule directory is empty — run step 1 first.

Quickstart

A ready-to-run example lives in examples/template.json:

radscore --filepath examples/template.json --output-mode per-file

This computes the default scorers (CheXbert, F1-RadGraph, BLEU-1, BLEU-4, ROUGE-L) and writes a CSV under results/.

Prefer a guided walkthrough? See the tutorial notebook: examples/radscore_tutorial.ipynb.

The first run downloads model weights (CheXbert, RadGraph, BERTScore) from the Hugging Face Hub. Set HF_HOME to control the cache location.
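
When driving radscore from Python rather than from a shell, the cache location can be set in-process, provided this happens before any Hugging Face library is imported. A minimal sketch; the directory name is a placeholder, not something radscore defines:

```python
import os
from pathlib import Path

# Hypothetical cache directory; choose any location with enough disk space.
cache_dir = Path.home() / "radscore-hf-cache"

# The Hugging Face libraries read HF_HOME at import time, so set it
# before importing anything that pulls them in.
os.environ["HF_HOME"] = str(cache_dir)

print(os.environ["HF_HOME"])
```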

Input format

A JSON list of records. Only prediction and reference are required by the core radscore command; the other fields are optional and are used by the per-category tools and by the label-grounded F1 command (radscore-label-f1).

| Field | Type | Required | Used by |
| --- | --- | --- | --- |
| prediction | string | yes | all commands: the generated report |
| reference | string | yes | all commands: the ground-truth report |
| label | list of 14 ints (0/1) | optional | ground-truth multilabel vector (see below) |
| target_category | string | optional | the two per-category commands (grouping key) |
| image_path | string | optional | metadata / identifier only; never read by metrics |

[
  {
    "image_path": "patient_001.png",
    "prediction": "Small right pleural effusion with basal atelectasis. No pneumothorax.",
    "reference": "Increased right basal density related to atelectasis and pleural effusion.",
    "label": [0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1],
    "target_category": "Atelectasis"
  }
]
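
Before launching a long evaluation it can be worth checking a file against the field rules above. The following is a hedged sketch of such a check, not part of radscore: `validate_records` is a hypothetical helper written for this example.

```python
import json

REQUIRED = ("prediction", "reference")


def validate_records(records):
    """Return a list of problems; an empty list means the input looks fine."""
    if not isinstance(records, list):
        return ["top-level JSON value must be a list of records"]
    problems = []
    for i, rec in enumerate(records):
        if not isinstance(rec, dict):
            problems.append(f"record {i}: not an object")
            continue
        for key in REQUIRED:
            if not isinstance(rec.get(key), str):
                problems.append(f"record {i}: '{key}' must be a string")
        label = rec.get("label")  # optional field
        if label is not None:
            ok = (isinstance(label, list) and len(label) == 14
                  and all(type(v) is int and v in (0, 1) for v in label))
            if not ok:
                problems.append(f"record {i}: 'label' must be 14 ints in {{0, 1}}")
    return problems


sample = json.loads("""
[{"image_path": "patient_001.png",
  "prediction": "Small right pleural effusion with basal atelectasis. No pneumothorax.",
  "reference": "Increased right basal density related to atelectasis and pleural effusion.",
  "label": [0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1],
  "target_category": "Atelectasis"}]
""")
print(validate_records(sample))  # []
```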

The label multilabel vector

label is the ground-truth presence/absence of each of the 14 CheXbert conditions for that study, in CheXbert head order:

0  Enlarged Cardiomediastinum   7  Atelectasis
1  Cardiomegaly                 8  Pneumothorax
2  Lung Opacity                 9  Pleural Effusion
3  Lung Lesion                 10  Pleural Other
4  Edema                       11  Fracture
5  Consolidation               12  Support Devices
6  Pneumonia                   13  No Finding

A single report is multilabel: it can be positive for several conditions at once. The example vector above has 1s at indices 2, 7, 9, 12 and 13, i.e. Lung Opacity, Atelectasis, Pleural Effusion, Support Devices and No Finding. This vector is the basis for the micro / macro multilabel F1 analysis described under Per-category evaluation. See examples/template.json for a runnable example.
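
To read a label vector at a glance, it can be decoded back into condition names using the head order listed above. A small sketch; `positive_conditions` is a hypothetical helper, not a radscore function:

```python
# CheXbert head order, as listed in this README.
CHEXBERT_CONDITIONS = [
    "Enlarged Cardiomediastinum", "Cardiomegaly", "Lung Opacity", "Lung Lesion",
    "Edema", "Consolidation", "Pneumonia", "Atelectasis", "Pneumothorax",
    "Pleural Effusion", "Pleural Other", "Fracture", "Support Devices", "No Finding",
]


def positive_conditions(label):
    """Return the names of the conditions marked 1 in a 14-entry label vector."""
    if len(label) != len(CHEXBERT_CONDITIONS):
        raise ValueError("label must have 14 entries")
    return [name for name, flag in zip(CHEXBERT_CONDITIONS, label) if flag == 1]


print(positive_conditions([0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1]))
# ['Lung Opacity', 'Atelectasis', 'Pleural Effusion', 'Support Devices', 'No Finding']
```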

Usage

CLI

# Default scorers, one CSV per input file
radscore --filepath preds.json --output-mode per-file

# Pick scorers, add bootstrap CIs, aggregate per experiment/dataset
radscore --filepath preds.json \
         --scorers CheXbert,F1-RadGraph,ROUGE-L \
         --bootstrap-ci \
         --output-mode per-experiment

# Add the GREEN metric (requires the submodule installed)
radscore --filepath preds.json --compute-green

# Reuse already-computed metrics, only fill in what's missing
radscore --filepath preds.json --skip-existing

# Weighted average across target categories
radscore-weighted --filepath preds.json

# Per-target-category CheXbert breakdown
radscore-chexbert-class --filepath preds.json

Key flags (radscore --help for the full list):

| Flag | Description |
| --- | --- |
| --filepath | Path to the predictions JSON (required). |
| --scorers | Comma-separated subset of metrics to run. |
| --bootstrap-ci | Report median + 95% bootstrap confidence intervals. |
| --output-mode | per-file or per-experiment (aggregated CSV). |
| --compute-green | Also compute the GREEN score. |
| --skip-existing | Skip metrics already present in the output CSV. |
| --save-breakdowns | Save per-condition CheXbert breakdown CSVs. |
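
For readers unfamiliar with what "median + 95% bootstrap confidence intervals" means, here is a generic percentile-bootstrap sketch. It is not radscore's implementation: the resampled statistic (the mean of per-sample scores), the number of resamples, and the seed are all assumptions chosen for illustration.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap: return (median, lower, upper) of the resampled means."""
    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    n = len(scores)
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n))  # resample with replacement
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return statistics.median(means), lower, upper


# Toy per-sample scores, made up for the example.
median, lower, upper = bootstrap_ci([0.2, 0.4, 0.4, 0.5, 0.7, 0.9])
print(f"median={median:.3f}  95% CI=[{lower:.3f}, {upper:.3f}]")
```

The interval widens as the per-sample scores become more variable or the sample gets smaller, which is why it is more informative than a single point estimate on small test sets.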

Output path convention: an input under outputs/<experiment>/<dataset>/foo.json is written to results/<experiment>/<dataset>/....
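
The mapping in this convention can be sketched as follows. This is an assumption-laden illustration, not radscore's code: `results_dir_for` is a hypothetical helper, the experiment and dataset names are placeholders, and only the results directory is derived because the output file name is not specified here.

```python
from pathlib import Path


def results_dir_for(input_path, results_root="results"):
    """Mirror outputs/<experiment>/<dataset>/ under results/ (hypothetical helper)."""
    parts = Path(input_path).parts
    i = parts.index("outputs")  # raises ValueError if the path is not under outputs/
    experiment, dataset = parts[i + 1], parts[i + 2]
    return Path(results_root) / experiment / dataset


# Placeholder experiment / dataset names.
print(results_dir_for("outputs/my_experiment/my_dataset/foo.json"))
```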

Per-category evaluation (optional)

Beyond scoring a whole file, radscore ships two helpers that report metrics per target condition rather than over the full set. They are built for robustness / shortcut studies (e.g. occlusion experiments) but apply to any analysis where each prediction is associated with one condition of interest.

Data layout. Still a single flat JSON list — no nesting. The grouping key is the per-record target_category (a CheXbert condition name). The important subtlety is how occlusion-style data is laid out:

The same study appears once per target condition. For a given image, the input contains several records that share the same image_path, reference and ground-truth label, but differ in their target_category — and in their prediction, because each record is the report the model produced when that condition's region was occluded. Each record therefore isolates the single condition on which to quantify a label-specific F1.

[
  { "image_path": "studyA.png", "prediction": "...occluded Atelectasis...",      "reference": "...", "label": [0,0,1,0,0,0,0,1,0,1,0,0,1,1], "target_category": "Atelectasis" },
  { "image_path": "studyA.png", "prediction": "...occluded Lung Opacity...",     "reference": "...", "label": [0,0,1,0,0,0,0,1,0,1,0,0,1,1], "target_category": "Lung_Opacity" },
  { "image_path": "studyA.png", "prediction": "...occluded Pleural Effusion...", "reference": "...", "label": [0,0,1,0,0,0,0,1,0,1,0,0,1,1], "target_category": "Pleural_Effusion" }
]

Note the three records above are the same study (label and reference identical) repeated for three different target conditions. target_category values may be written with spaces or underscores (Pleural Effusion / Pleural_Effusion) — both are normalized to the canonical CheXbert condition.
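
The space/underscore normalization can be pictured with a sketch like the one below. It only illustrates the idea: `normalize_category` is a hypothetical helper, and the case-insensitive matching is an extra assumption that this README does not state.

```python
CHEXBERT_CONDITIONS = [
    "Enlarged Cardiomediastinum", "Cardiomegaly", "Lung Opacity", "Lung Lesion",
    "Edema", "Consolidation", "Pneumonia", "Atelectasis", "Pneumothorax",
    "Pleural Effusion", "Pleural Other", "Fracture", "Support Devices", "No Finding",
]
_CANONICAL = {name.lower(): name for name in CHEXBERT_CONDITIONS}


def normalize_category(value):
    """Map 'Pleural_Effusion' / 'Pleural Effusion' to the canonical condition name."""
    key = " ".join(value.replace("_", " ").split()).lower()
    try:
        return _CANONICAL[key]
    except KeyError:
        raise ValueError(f"not a CheXbert condition: {value!r}") from None


print(normalize_category("Pleural_Effusion"))  # Pleural Effusion
print(normalize_category("Lung Opacity"))      # Lung Opacity
```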

The two commands then differ in what they compute:

| Command | target_category values | What it reports |
| --- | --- | --- |
| radscore-weighted | any string | the full metric suite per group + a weighted_avg row |
| radscore-chexbert-class | must be a CheXbert condition | label-specific CheXbert F1 per condition; each record counts only toward its own class |

Micro vs. macro. Over the 14-class multilabel label system, two report-level aggregations are standard: micro-F1 pools the true/false positives/negatives across all conditions before computing F1 (dominated by frequent findings), while macro-F1 computes F1 per condition and averages them (treats rare and frequent findings equally). The dedicated radscore-label-f1 command (below) computes both — and the per-condition and target-specific breakdowns — directly against the ground-truth label vector.
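
The difference between the two aggregations is easiest to see on a toy case. The sketch below uses two classes instead of 14 for brevity; the data are made up, and scoring an empty class as 0.0 is an assumed convention rather than a documented radscore behaviour.

```python
def f1(tp, fp, fn):
    """Binary F1 from counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """y_true / y_pred: lists of equal-length 0/1 vectors, one per report."""
    n_classes = len(y_true[0])
    counts = []
    for c in range(n_classes):
        tp = sum(t[c] == 1 and p[c] == 1 for t, p in zip(y_true, y_pred))
        fp = sum(t[c] == 0 and p[c] == 1 for t, p in zip(y_true, y_pred))
        fn = sum(t[c] == 1 and p[c] == 0 for t, p in zip(y_true, y_pred))
        counts.append((tp, fp, fn))
    micro = f1(*(sum(col) for col in zip(*counts)))  # pool counts, then F1
    macro = sum(f1(*c) for c in counts) / n_classes  # F1 per class, then mean
    return micro, macro


# Class 0 is frequent and always right; class 1 is rare and always wrong.
y_true = [[1, 0], [1, 0], [1, 1], [1, 0]]
y_pred = [[1, 0], [1, 0], [1, 0], [1, 1]]
micro, macro = micro_macro_f1(y_true, y_pred)
print(f"micro={micro:.2f} macro={macro:.2f}")  # micro=0.80 macro=0.50
```

The frequent class pulls micro-F1 up to 0.80, while macro-F1 drops to 0.50 because the rare class counts as much as the frequent one.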

radscore-weighted — groups samples by target_category, computes the full set of metrics independently for each group, and adds a final weighted_avg row where each group is weighted by its sample count. Any string is accepted as a category; missing values fall back to "unknown". Output is one CSV with a row per category plus the weighted average.

radscore-weighted --filepath examples/template.json \
                  --scorers CheXbert,F1-RadGraph,BLEU-1,BLEU-4,ROUGE-L
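
The weighted_avg row amounts to a sample-count-weighted mean of the per-group scores. A minimal sketch of that arithmetic, with a hypothetical `weighted_average` helper and toy numbers that are not real results:

```python
def weighted_average(per_group):
    """per_group maps category -> (n_samples, {metric: value})."""
    total = sum(n for n, _ in per_group.values())
    metrics = next(iter(per_group.values()))[1].keys()
    return {
        m: sum(n * scores[m] for n, scores in per_group.values()) / total
        for m in metrics
    }


# Toy values for illustration only.
groups = {
    "Atelectasis": (30, {"BLEU-4": 0.10, "ROUGE-L": 0.30}),
    "Edema": (10, {"BLEU-4": 0.20, "ROUGE-L": 0.50}),
}
avg = weighted_average(groups)
print({m: round(v, 3) for m, v in avg.items()})
```

With 30 and 10 samples, the larger group contributes three quarters of the weight, so the averages (0.125 and 0.35) sit closer to its scores.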

radscore-chexbert-class — runs the CheXbert labeler on all predictions and references, then for each condition C computes the binary F1 of that one condition (predicted-report label vs reference-report label) over only the records whose target_category == C. So a record tagged Cardiomegaly contributes solely to the Cardiomegaly F1. The category must map to a CheXbert condition (Cardiomegaly, Edema, Consolidation, Atelectasis, Pleural Effusion, Lung Opacity, Pneumonia, Pneumothorax, Fracture, Support Devices, Enlarged Cardiomediastinum, Lung Lesion, Pleural Other, No Finding); samples missing it raise an error. Supports --bootstrap-ci.

radscore-chexbert-class --filepath examples/template.json
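
The "each record counts only toward its own class" rule can be sketched as below. This is an illustration under stated assumptions, not radscore's code: labeler outputs are replaced by hand-written 0/1 flags, `target_specific_f1` is a hypothetical helper, and an empty class is scored 0.0 by convention.

```python
from collections import defaultdict


def target_specific_f1(records):
    """records: (target_category, pred_flag, ref_flag) triples, where the flags
    say whether the target condition is positive in the prediction / reference."""
    counts = defaultdict(lambda: [0, 0, 0])  # tp, fp, fn per condition
    for target, pred, ref in records:
        c = counts[target]  # a record only touches its own condition
        if pred and ref:
            c[0] += 1
        elif pred:
            c[1] += 1
        elif ref:
            c[2] += 1
    return {
        t: (2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0)
        for t, (tp, fp, fn) in counts.items()
    }


# Toy flags for illustration only.
records = [
    ("Cardiomegaly", 1, 1), ("Cardiomegaly", 0, 1),
    ("Edema", 1, 1), ("Edema", 1, 0),
]
print(target_specific_f1(records))
```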

Both commands are entirely optional — the core radscore command never needs target_category.

Label-grounded F1 (radscore-label-f1)

Scores the model's predicted findings — the CheXbert labels extracted from the prediction text — against the curated ground-truth label vector on each record (rather than against the reference text). Requires a 14-length binary label on every record. It produces:

  • Multilabel micro / macro F1 over the 14 conditions and the 5-class subset (Micro-F1-14, Macro-F1-14, Micro-F1-5, Macro-F1-5), saved to <name>_label_f1.csv, plus per-condition F1 in <name>_label_f1_per_condition.csv.
  • Target-specific F1 (only if records carry target_category): the binary F1 of each target condition over just its tagged records, saved to <name>_label_target_f1.csv.

# multilabel + (if present) target-specific F1
radscore-label-f1 --filepath examples/template.json

# uncertain findings as positive, with bootstrap CIs
radscore-label-f1 --filepath preds.json --uncertain-mode rrg+ --bootstrap-ci

Note: radscore's built-in Micro-F1-14 / Macro-F1-14 (from the CheXbert scorer) compare prediction text vs reference text; radscore-label-f1 instead uses the curated label vector as ground truth.

Python API

from radscore.cli import run, ReportGenerationEvaluator

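# Programmatic scoring
# predictions / references are assumed here to be parallel lists of report strings
predictions = ["Small right pleural effusion with basal atelectasis. No pneumothorax."]
references = ["Increased right basal density related to atelectasis and pleural effusion."]
evaluator = ReportGenerationEvaluator(scorers=["BLEU-4", "ROUGE-L", "F1-RadGraph"])
scores = evaluator.evaluate(predictions, references)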

# Or drive the full pipeline (CSV output, CIs, GREEN, ...)
run(filepath="preds.json", scorers=["CheXbert", "F1-RadGraph"], bootstrap_ci=True)

Project structure

radscore/
├── src/radscore/
│   ├── cli.py             # main CLI + evaluator (radscore)
│   ├── weighted.py        # per-category weighted eval (radscore-weighted)
│   ├── chexbert_class.py  # per-category CheXbert breakdown (radscore-chexbert-class)
│   ├── label_f1.py        # label-grounded multilabel/target F1 (radscore-label-f1)
│   ├── chexbert.py        # CheXbert labeler + F1
│   ├── f1radgraph.py      # F1-RadGraph (v2, with CIs)
│   ├── rouge.py           # ROUGE with bootstrap aggregation
│   └── factuality_*.py    # shared CheXbert/condition utilities
├── examples/template.json # runnable input example
├── branding/              # logo, banner, image-gen prompts
├── third_party/GREEN/     # optional GREEN submodule
└── pyproject.toml

Roadmap

See TODO.md. Planned for the next version: optional label-based F1-score (classification performance against a ground-truth 14-class CheXbert label vector).

Development

python -m pip install -e ".[dev]"   # editable install + pytest
pytest                              # offline, ~20s; metric maths verified vs scikit-learn

Contributions welcome — see CONTRIBUTING.md.

Acknowledgements

radscore stands on excellent prior work: RadGraph and GREEN (Stanford AIMI), CheXbert (Stanford ML Group), BERTScore, sacreBLEU, and Hugging Face evaluate.

Authors

Filippo Ruffini, Marco Salmè

Citation

If you use radscore in your research, please cite this repository:

@software{radscore,
  title  = {radscore: Evaluation metrics for chest X-ray radiology report generation},
  author = {Ruffini, Filippo and Salmè, Marco},
  year   = {2026},
  url    = {https://github.com/fruffini/radscore}
}

License

Released under the PolyForm Noncommercial License 1.0.0. Noncommercial use only — research, teaching, personal, and use by academic / nonprofit / government organizations is permitted; commercial use is not. See the LICENSE for the full terms.
