Benchmarking

The cellarium-cas package ships benchmarking modules that let you evaluate annotation quality against a labelled reference dataset. Two metric families are supported:

F-measure — standard precision, recall, and F1 computed from a flat confusion matrix. Micro metrics treat the problem globally; macro metrics average F1 across classes with nonzero support.
Hierarchical F-measure — hierarchical precision, recall, and F1 following the Kiritchenko et al. approach. Predictions are evaluated using ontology ancestor sets rather than exact label matches, so partial credit is awarded for predictions that are close in the Cell Ontology tree.
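
To make the partial-credit idea concrete, the following is a minimal sketch of micro hierarchical precision, recall, and F1 on a toy ontology. The PARENTS mapping, the term names, and both helper functions are hypothetical and exist only for illustration. The sketch counts each term as a member of its own ancestor set and keeps the root term, which is one possible convention; it is not the cellarium-cas implementation.

```python
# Toy ontology: child -> parents (hypothetical terms, not real CL IDs).
PARENTS = {
    "cell": [],
    "lymphocyte": ["cell"],
    "T cell": ["lymphocyte"],
    "B cell": ["lymphocyte"],
    "CD4 T cell": ["T cell"],
}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = set()
    stack = [term]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(PARENTS[node])
    return seen


def hierarchical_prf(y_true, y_pred):
    """Micro hierarchical precision, recall, and F1 over paired labels."""
    overlap = pred_total = true_total = 0
    for truth, pred in zip(y_true, y_pred):
        t_anc, p_anc = ancestors(truth), ancestors(pred)
        overlap += len(t_anc & p_anc)   # shared ancestors earn credit
        pred_total += len(p_anc)
        true_total += len(t_anc)
    h_precision = overlap / pred_total if pred_total else 0.0
    h_recall = overlap / true_total if true_total else 0.0
    denom = h_precision + h_recall
    h_f1 = 2 * h_precision * h_recall / denom if denom else 0.0
    return h_precision, h_recall, h_f1


# Predicting "T cell" for a true "CD4 T cell" shares 3 of the 4 true ancestors.
hp, hr, hf = hierarchical_prf(["CD4 T cell"], ["T cell"])
print(hp, hr, round(hf, 3))  # 1.0 0.75 0.857
```

An exact-match metric would score this prediction as a plain miss, whereas the ancestor-set comparison gives full precision and three quarters recall.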

The CLI writes one cm_raw_k{n}/ directory for each k from 1 to the value of --f-measure-top-k. Hierarchical F-measure always uses cm_raw_k1/ (top-1 predictions). Flat F-measure computes metrics for every k and stores them in a single wide-format CSV with columns suffixed _k{n} (e.g. f1_micro_k1, f1_micro_k2).

Prerequisites

Install the benchmarking extras:

pip install "cellarium-cas[benchmark]"

Preparing annotate output directories

Run cellarium-cas annotate with --infer-labels and --save-ontology-resource to produce the files required for benchmarking:

cellarium-cas annotate \
    --input-path labelled_dataset.h5ad \
    --output-dir ./run_model_a_sample_1 \
    --cas-api-token $CAS_API_TOKEN \
    --infer-labels \
    --save-ontology-resource

Each annotate output directory must contain:

inferred_labels.csv — predicted cell type labels (cas_cell_type_name_1 holds the most granular CL term ID).
ontology_resource.json — the Cell Ontology resource used during annotation.
metadata.json — provenance (input_path, model_name, …).
ontology_response.json — the raw annotation response.
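
Before benchmarking many runs, it can help to confirm that each directory holds the four files listed above. The following is a small sketch using only the standard library; the function name missing_annotate_files is hypothetical and not part of cellarium-cas.

```python
from pathlib import Path

# File names taken from the list of required annotate outputs.
REQUIRED_FILES = (
    "inferred_labels.csv",
    "ontology_resource.json",
    "metadata.json",
    "ontology_response.json",
)


def missing_annotate_files(run_dir):
    """Return the required file names that are absent from run_dir."""
    run_dir = Path(run_dir)
    return [name for name in REQUIRED_FILES if not (run_dir / name).is_file()]
```

For example, missing_annotate_files("./run_model_a_sample_1") returns an empty list when the directory is complete.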

The ground-truth labels are read from the original .h5ad at metadata["input_path"], using the column named by --gt-label. That column must contain CL ontology term IDs (e.g. "CL:0000540") for hierarchical metrics. For the inferred label, use cas_cell_type_name_k when the ground-truth column contains CL term IDs; use cas_cell_type_label_k only if both the ground-truth column and the benchmark label universe use human-readable labels.
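
A quick format check on the ground-truth column can catch human-readable labels before a hierarchical run. The sketch below assumes the labels have already been read into a Python list (for instance from the obs table of the .h5ad); the helper name non_cl_labels is hypothetical.

```python
import re

# CL term IDs have the form "CL:" followed by seven digits, e.g. "CL:0000540".
CL_ID_PATTERN = re.compile(r"^CL:\d{7}$")


def non_cl_labels(labels):
    """Return the sorted distinct labels that are not CL term IDs."""
    return sorted({str(label) for label in labels if not CL_ID_PATTERN.match(str(label))})


labels = ["CL:0000540", "CL:0000084", "neuron"]
print(non_cl_labels(labels))  # ['neuron']
```

An empty result means every label matches the CL ID format; it does not verify that the IDs exist in the ontology resource.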

Running the full pipeline via CLI

The cellarium-cas benchmark all command runs all four pipeline steps in one call:

cellarium-cas benchmark all \
    --annotate-dirs ./annotate_outputs \
    --output-dir    ./benchmark_results \
    --gt-label      cell_type_ontology_term_id \
    --inferred-label cas_cell_type_name_1 \
    --f-measure-top-k 3

--annotate-dirs may be a directory whose immediate subdirectories are annotate output dirs, or a .txt file listing one path per line.
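
The two accepted forms of --annotate-dirs can be illustrated with a short sketch. This is not the CLI's own code: the function name resolve_annotate_dirs is hypothetical, and skipping blank lines in the .txt listing is an assumption.

```python
from pathlib import Path


def resolve_annotate_dirs(annotate_dirs):
    """Expand a directory or a .txt listing into a list of run directories."""
    source = Path(annotate_dirs)
    if source.is_dir():
        # Immediate subdirectories only; nested directories are not searched.
        return sorted(p for p in source.iterdir() if p.is_dir())
    if source.suffix == ".txt":
        lines = source.read_text().splitlines()
        # Assumption: blank lines in the listing are ignored.
        return [Path(line.strip()) for line in lines if line.strip()]
    raise ValueError(f"Expected a directory or a .txt file, got: {annotate_dirs}")
```

The .txt form is convenient when the annotate outputs live in unrelated locations and cannot share one parent directory.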

--f-measure-top-k computes flat F-measure for every k from 1 to the given value. All k results appear as separate column sets (suffixed _k{n}) in a single wide-format CSV — no separate output files per k.
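
Because every k shares one wide-format file, downstream analysis often starts by splitting the _k{n} suffix off each metric column. The sketch below does this with the standard library; the two CSV rows and their values are made up for illustration, and wide_to_long is a hypothetical helper.

```python
import csv
import io
import re

# Hypothetical excerpt that reuses column names from this page; values are invented.
WIDE_CSV = """model_name,test_sample,f1_micro_k1,f1_micro_k2
model_a,sample_1,0.80,0.90
model_a,sample_2,0.70,0.85
"""

# A metric column is any name that ends in _k followed by digits.
SUFFIX = re.compile(r"^(?P<metric>.+)_k(?P<k>\d+)$")


def wide_to_long(rows):
    """Turn each suffixed metric column into one record with metric, k, and value."""
    records = []
    for row in rows:
        identity = {c: v for c, v in row.items() if not SUFFIX.match(c)}
        for column, value in row.items():
            match = SUFFIX.match(column)
            if match:
                records.append({
                    **identity,
                    "metric": match["metric"],
                    "k": int(match["k"]),
                    "value": float(value),
                })
    return records


long_rows = wide_to_long(csv.DictReader(io.StringIO(WIDE_CSV)))
print(len(long_rows))  # 4
```

The long form makes it straightforward to plot a metric against k for each model or sample.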

After the command completes, benchmark_results/ contains:

cm_raw_k1/                       # per-sample sparse confusion matrices (top-1 predictions)
cm_raw_k2/                       # per-sample sparse confusion matrices (top-2 hit predictions)
...                              # one directory per k up to --f-measure-top-k
cm_aggregate_k1/                 # per-model aggregated confusion matrices (top-1)
...                              # one directory per k
f_measure_per_sample.csv         # F-measure per annotate run (wide format; columns suffixed _k{n})
f_measure_per_group.csv          # F-measure per model (aggregated; wide format)
hierarchical_f_measure_per_sample.csv
hierarchical_f_measure_per_group.csv
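
As a rough illustration of what "top-k hit" can mean, the sketch below counts a cell as a hit when its ground-truth label appears among its k highest-ranked predictions. This definition is an assumption, the function name is hypothetical, and the sketch does not reproduce how the package fills the cm_raw_k{n}/ matrices.

```python
def top_k_hit_rate(y_true, ranked_preds, k):
    """Fraction of cells whose true label is among the first k ranked predictions."""
    if not y_true:
        return 0.0
    hits = sum(1 for truth, preds in zip(y_true, ranked_preds) if truth in preds[:k])
    return hits / len(y_true)


y_true = ["CL:0000540", "CL:0000084", "CL:0000236"]
ranked = [
    ["CL:0000540", "CL:0000084"],  # hit at k=1
    ["CL:0000236", "CL:0000084"],  # hit only at k=2
    ["CL:0000540", "CL:0000084"],  # miss at k=1 and k=2
]
print(top_k_hit_rate(y_true, ranked, 1))  # 1 of 3 cells
print(top_k_hit_rate(y_true, ranked, 2))  # 2 of 3 cells
```

Under this definition the hit rate can only stay equal or rise as k grows.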

Running steps individually

The pipeline can also be run step by step, which is useful when you want to re-compute only the metric CSVs after adding new runs:

# Step 1 — build per-sample confusion matrices
cellarium-cas benchmark confusion-matrix \
    --annotate-dirs ./annotate_outputs \
    --output-dir    ./benchmark_results \
    --gt-label      cell_type_ontology_term_id \
    --inferred-label cas_cell_type_name_1 \
    --f-measure-top-k 3

# Step 2 — aggregate by model name
cellarium-cas benchmark aggregate \
    --output-dir ./benchmark_results

# Step 3 — F-measure CSVs
cellarium-cas benchmark f-measure \
    --output-dir ./benchmark_results

# Step 4 — hierarchical F-measure CSVs
cellarium-cas benchmark hierarchical \
    --output-dir ./benchmark_results
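
When only steps 2 to 4 need to be repeated, the three commands can be driven from a short script. The sketch below uses only the subcommands and the --output-dir flag shown above; the helper names are hypothetical, and the default dry-run mode prints the commands without executing them.

```python
import subprocess


def metric_commands(output_dir):
    """Build the commands for the aggregate, f-measure, and hierarchical steps."""
    return [
        ["cellarium-cas", "benchmark", step, "--output-dir", output_dir]
        for step in ("aggregate", "f-measure", "hierarchical")
    ]


def rerun_metrics(output_dir, dry_run=True):
    """Print each command and, unless dry_run is set, execute it in order."""
    for cmd in metric_commands(output_dir):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing step


rerun_metrics("./benchmark_results")
```

Passing dry_run=False runs the steps for real and raises subprocess.CalledProcessError if any of them exits with a nonzero status.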

Output columns

f_measure_per_sample.csv and f_measure_per_group.csv:

Metric columns are repeated for each k value and suffixed with _k{n} (e.g. tp_k1, f1_micro_k2). Identity columns have no suffix.

Column	Description
`model_name` / `group_name`	Identity columns: the per-sample CSV has `model_name` and `test_sample`; the per-group CSV has `group_name`.
`tp_k{n}`	Global true positives (trace of the confusion matrix) for top-k hit predictions.
`fp_k{n}`	Global false positives (total − trace).
`fn_k{n}`	Global false negatives (total − trace).
`precision_micro_k{n}`	Micro precision (= accuracy for single-label multiclass).
`recall_micro_k{n}`	Micro recall.
`f1_micro_k{n}`	Micro F1.
`f1_macro_k{n}`	Macro F1 averaged over classes with nonzero support.
`precision_macro_k{n}`	Macro precision averaged over classes with nonzero support.
`recall_macro_k{n}`	Macro recall averaged over classes with nonzero support.
`precision_weighted_k{n}`	Support-weighted precision (weighted by per-class true support).
`recall_weighted_k{n}`	Support-weighted recall.
`f1_weighted_k{n}`	Support-weighted F1.
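
The relationships between these columns can be checked on a small example. The sketch below is written in plain Python and assumes rows hold true classes and columns hold predicted classes; that orientation, the function name, and the toy matrix are assumptions, and this is not the package's compute_f_measure_from_cm.

```python
def f_measure_from_cm(cm):
    """Micro, macro, and weighted F1 from a square confusion matrix.

    cm[i][j] counts cells with true class i that were predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    tp = sum(cm[i][i] for i in range(n))  # trace of the matrix
    fp = fn = total - tp                  # each miss is one FP and one FN
    micro = tp / total if total else 0.0  # precision = recall = F1 = accuracy

    f1_sum = weighted_sum = 0.0
    classes = 0
    for i in range(n):
        support = sum(cm[i])              # number of true cells in class i
        if support == 0:
            continue                      # classes without support are skipped
        predicted = sum(cm[r][i] for r in range(n))
        precision = cm[i][i] / predicted if predicted else 0.0
        recall = cm[i][i] / support
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        f1_sum += f1
        weighted_sum += f1 * support
        classes += 1
    return {
        "tp": tp,
        "fp": fp,
        "fn": fn,
        "f1_micro": micro,
        "f1_macro": f1_sum / classes if classes else 0.0,
        "f1_weighted": weighted_sum / total if total else 0.0,
    }


cm = [
    [3, 1, 0],
    [1, 2, 0],
    [0, 0, 0],  # zero-support class, excluded from the macro average
]
metrics = f_measure_from_cm(cm)
print(metrics["tp"], metrics["fp"], metrics["fn"])  # 5 2 2
```

Here micro F1 is 5/7, while macro F1 averages the two supported classes (0.75 and 2/3) and ignores the empty third class.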

hierarchical_f_measure_per_sample.csv and hierarchical_f_measure_per_group.csv:

Column	Description
`h_tp`	Global hierarchical true positives.
`h_fp`	Global hierarchical false positives.
`h_fn`	Global hierarchical false negatives.
`h_precision_micro`	Micro hierarchical precision.
`h_recall_micro`	Micro hierarchical recall.
`h_f1_micro`	Micro hierarchical F1.
`h_f1_macro`	Macro hierarchical F1 (per-class, averaged over true classes with nonzero support).
`h_precision_macro`	Macro hierarchical precision (per-node, averaged over nodes with nonzero true support).
`h_recall_macro`	Macro hierarchical recall (per-node, averaged over nodes with nonzero true support).
`h_precision_weighted`	Support-weighted hierarchical precision (weighted by per-node hierarchical true support).
`h_recall_weighted`	Support-weighted hierarchical recall.
`h_f1_weighted`	Support-weighted hierarchical F1.

Using the Python API directly

Low-level functions operate on arrays and confusion matrices:

import json

import scipy.sparse
from cellarium.cas.benchmarking import (
    build_confusion_matrix,
    compute_f_measure_from_cm,
    compute_hierarchical_f_measure_from_cm,
)
from cellarium.cas.postprocessing.cell_ontology.cell_ontology_cache import CellOntologyCache

# Load ontology cache (provides ancestor mapping)
with open("ontology_resource.json") as f:
    resource = json.load(f)
cache = CellOntologyCache(resource)

# Build a confusion matrix from label lists.
# y_true and y_pred are the per-cell ground-truth and predicted labels, supplied by the caller.
label_order = resource["cl_names"]
cm = build_confusion_matrix(y_true, y_pred, label_order)

# Standard F-measure
f_metrics = compute_f_measure_from_cm(cm)

# Hierarchical F-measure
h_metrics = compute_hierarchical_f_measure_from_cm(
    cm, label_order, cache
)

The high-level pipeline functions mirror the CLI steps:

from cellarium.cas.cli._benchmark_impl import (
    run_confusion_matrix_step,
    run_aggregate_step,
    run_f_measure_step,
    run_hierarchical_f_measure_step,
)

run_confusion_matrix_step("./annotate_outputs", "./bench", "cell_type_ontology_term_id", "cas_cell_type_name_1")
run_aggregate_step("./bench")
run_f_measure_step("./bench")
run_hierarchical_f_measure_step("./bench")

Aggregating custom groups

The CLI aggregate command groups confusion matrices by model_name automatically. For custom groupings (e.g. by assay type or tissue), use aggregate_confusion_matrices directly in a notebook:

from cellarium.cas.benchmarking.confusion_matrix import (
    load_confusion_matrix,
    aggregate_confusion_matrices,
    save_confusion_matrix,
)

# my_custom_group_dirs is a caller-defined list of directories that make up one group.
cms = []
for run_dir in my_custom_group_dirs:
    cm, _ = load_confusion_matrix(run_dir)
    cms.append(cm)

agg = aggregate_confusion_matrices(cms)
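
The grouping logic itself can be sketched independently of the package. The example below assumes that aggregation means adding counts cell by cell, represents each sparse confusion matrix as a Counter keyed by (true label, predicted label), and groups runs by a hypothetical "tissue" metadata key. None of these names come from cellarium-cas.

```python
from collections import Counter, defaultdict


def aggregate_by_group(runs, group_key):
    """Sum sparse confusion matrices within each group.

    runs: iterable of (metadata dict, Counter keyed by (true_label, pred_label)).
    """
    grouped = defaultdict(Counter)
    for metadata, cm in runs:
        grouped[metadata[group_key]].update(cm)  # Counter.update adds counts
    return dict(grouped)


runs = [
    ({"tissue": "lung"}, Counter({("CL:0000084", "CL:0000084"): 5})),
    ({"tissue": "lung"}, Counter({("CL:0000084", "CL:0000084"): 2,
                                  ("CL:0000084", "CL:0000236"): 1})),
    ({"tissue": "blood"}, Counter({("CL:0000236", "CL:0000236"): 4})),
]
by_tissue = aggregate_by_group(runs, "tissue")
print(by_tissue["lung"][("CL:0000084", "CL:0000084")])  # 7
```

Summing counts before computing metrics weights every cell equally, so large runs dominate the group score; averaging per-run metrics instead would weight every run equally.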

Azimuth integration

The cellarium.cas.benchmarking.azimuth helpers convert Azimuth annotation outputs into CAS-compatible annotate directories so they can be evaluated with the same pipeline. See cellarium.cas.benchmarking.azimuth.helpers.azimuth_to_cas_annotation for usage.

Important: when level_specs is auto-detected, Azimuth levels are ordered most granular first (rank 1 = finest level), matching CAS convention. If you pass explicit level_specs, list them most-granular first.