:tocdepth: 3

Benchmarking
############

The ``cellarium-cas`` package ships benchmarking modules that let you evaluate
annotation quality against a labelled reference dataset. Two metric families are
supported:

* **F-measure**: standard precision, recall, and F1 computed from a flat confusion
  matrix. Micro metrics treat the problem globally; macro metrics average F1 across
  classes with nonzero support.
* **Hierarchical F-measure**: hierarchical precision, recall, and F1 following the
  Kiritchenko et al. approach. Predictions are evaluated using ontology ancestor sets
  rather than exact label matches, so partial credit is awarded for predictions that
  are close in the Cell Ontology tree.

The CLI writes one ``cm_raw_k{n}/`` directory per k value (1 … ``--f-measure-top-k``).
Hierarchical F-measure always uses ``cm_raw_k1/`` (top-1 predictions). Flat F-measure
computes metrics for every k and stores them in a single wide-format CSV with columns
suffixed ``_k{n}`` (e.g. ``f1_micro_k1``, ``f1_micro_k2``).

Prerequisites
+++++++++++++

Install the benchmarking extras::

    pip install "cellarium-cas[benchmark]"

Preparing annotate output directories
+++++++++++++++++++++++++++++++++++++

Run ``cellarium-cas annotate`` with ``--infer-labels`` and ``--save-ontology-resource``
to produce the files required for benchmarking::

    cellarium-cas annotate \
        --input-path labelled_dataset.h5ad \
        --output-dir ./run_model_a_sample_1 \
        --cas-api-token $CAS_API_TOKEN \
        --infer-labels \
        --save-ontology-resource

Each annotate output directory must contain:

* ``inferred_labels.csv``: predicted cell type labels (``cas_cell_type_name_1`` holds
  the most granular CL term ID).
* ``ontology_resource.json``: the Cell Ontology resource used during annotation.
* ``metadata.json``: provenance (``input_path``, ``model_name``, …).
* ``ontology_response.json``: the raw annotation response.

The ground-truth labels are read from the original ``.h5ad`` at ``metadata["input_path"]``.
The ground-truth column selected with ``--gt-label`` must contain CL ontology term IDs
(e.g. ``"CL:0000540"``) for hierarchical metrics. For ``--inferred-label``, use
``cas_cell_type_name_k`` when the ground-truth column contains CL term IDs; use
``cas_cell_type_label_k`` only if the ground-truth column and the benchmark label
universe are human-readable labels too.

Running the full pipeline via CLI
+++++++++++++++++++++++++++++++++

The ``cellarium-cas benchmark all`` command runs all four pipeline steps in one call::

    cellarium-cas benchmark all \
        --annotate-dirs ./annotate_outputs \
        --output-dir ./benchmark_results \
        --gt-label cell_type_ontology_term_id \
        --inferred-label cas_cell_type_name_1 \
        --f-measure-top-k 3

``--annotate-dirs`` may be a directory whose immediate subdirectories are annotate
output dirs, or a ``.txt`` file listing one path per line.

``--f-measure-top-k`` computes flat F-measure for every k from 1 to the given value.
All k results appear as separate column sets (suffixed ``_k{n}``) in a single
wide-format CSV; there are no separate output files per k.

After the command completes, ``benchmark_results/`` contains::

    cm_raw_k1/                             # per-sample sparse confusion matrices (top-1 predictions)
    cm_raw_k2/                             # per-sample sparse confusion matrices (top-2 hit predictions)
    ...                                    # one directory per k up to --f-measure-top-k
    cm_aggregate_k1/                       # per-model aggregated confusion matrices (top-1)
    ...                                    # one directory per k
    f_measure_per_sample.csv               # F-measure per annotate run (wide format; columns suffixed _k{n})
    f_measure_per_group.csv                # F-measure per model (aggregated; wide format)
    hierarchical_f_measure_per_sample.csv
    hierarchical_f_measure_per_group.csv

Running steps individually
++++++++++++++++++++++++++

The pipeline can also be run step by step, which is useful when you want to re-compute
only the metric CSVs after adding new runs::

    # Step 1: build per-sample confusion matrices
    cellarium-cas benchmark confusion-matrix \
        --annotate-dirs ./annotate_outputs \
        --output-dir ./benchmark_results \
        --gt-label cell_type_ontology_term_id \
        --inferred-label cas_cell_type_name_1 \
        --f-measure-top-k 3

    # Step 2: aggregate by model name
    cellarium-cas benchmark aggregate \
        --output-dir ./benchmark_results

    # Step 3: F-measure CSVs
    cellarium-cas benchmark f-measure \
        --output-dir ./benchmark_results

    # Step 4: hierarchical F-measure CSVs
    cellarium-cas benchmark hierarchical \
        --output-dir ./benchmark_results

Output columns
++++++++++++++

**f_measure_per_sample.csv** and **f_measure_per_group.csv**:

Metric columns are repeated for each k value and suffixed with ``_k{n}``
(e.g. ``tp_k1``, ``f1_micro_k2``). Identity columns have no suffix.

.. list-table::
   :header-rows: 1

   * - Column
     - Description
   * - ``model_name`` / ``group_name``
     - Identity columns: the per-sample CSV has ``model_name`` and ``test_sample``;
       the per-group CSV has ``group_name``.
   * - ``tp_k{n}``
     - Global true positives (trace of the confusion matrix) for top-k hit predictions.
   * - ``fp_k{n}``
     - Global false positives (total − trace).
   * - ``fn_k{n}``
     - Global false negatives (total − trace).
   * - ``precision_micro_k{n}``
     - Micro precision (= accuracy for single-label multiclass).
   * - ``recall_micro_k{n}``
     - Micro recall.
   * - ``f1_micro_k{n}``
     - Micro F1.
   * - ``f1_macro_k{n}``
     - Macro F1 averaged over classes with nonzero support.
   * - ``precision_macro_k{n}``
     - Macro precision averaged over classes with nonzero support.
   * - ``recall_macro_k{n}``
     - Macro recall averaged over classes with nonzero support.
   * - ``precision_weighted_k{n}``
     - Support-weighted precision (weighted by per-class true support).
   * - ``recall_weighted_k{n}``
     - Support-weighted recall.
   * - ``f1_weighted_k{n}``
     - Support-weighted F1.

**hierarchical_f_measure_per_sample.csv** and **hierarchical_f_measure_per_group.csv**:

.. list-table::
   :header-rows: 1

   * - Column
     - Description
   * - ``h_tp``
     - Global hierarchical true positives.
   * - ``h_fp``
     - Global hierarchical false positives.
   * - ``h_fn``
     - Global hierarchical false negatives.
   * - ``h_precision_micro``
     - Micro hierarchical precision.
   * - ``h_recall_micro``
     - Micro hierarchical recall.
   * - ``h_f1_micro``
     - Micro hierarchical F1.
   * - ``h_f1_macro``
     - Macro hierarchical F1 (per-class, averaged over true classes with nonzero support).
   * - ``h_precision_macro``
     - Macro hierarchical precision (per-node, averaged over nodes with nonzero true support).
   * - ``h_recall_macro``
     - Macro hierarchical recall (per-node, averaged over nodes with nonzero true support).
   * - ``h_precision_weighted``
     - Support-weighted hierarchical precision (weighted by per-node hierarchical true support).
   * - ``h_recall_weighted``
     - Support-weighted hierarchical recall.
   * - ``h_f1_weighted``
     - Support-weighted hierarchical F1.

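
The column definitions above can be checked by hand on toy data. The following is a
minimal sketch, not the package's implementation: ``flat_metrics`` and
``hierarchical_prf`` are hypothetical helpers, the confusion matrix is assumed to have
true classes as rows and predicted classes as columns, the toy ``parents`` mapping
stands in for the Cell Ontology, and each ancestor set is assumed to include the term
itself. Only top-1 predictions are considered::

    import numpy as np


    def flat_metrics(cm):
        """Flat metrics from a square confusion matrix (assumed rows=true, cols=predicted)."""
        cm = np.asarray(cm, dtype=float)
        diag = np.diag(cm)
        support = cm.sum(axis=1)    # number of true cells per class
        predicted = cm.sum(axis=0)  # number of predicted cells per class
        total = cm.sum()

        tp = diag.sum()             # trace of the confusion matrix
        fp = fn = total - tp        # single-label: each miss is one FP and one FN

        # Per-class scores, with 0/0 defined as 0.
        prec = np.divide(diag, predicted, out=np.zeros_like(diag), where=predicted > 0)
        rec = np.divide(diag, support, out=np.zeros_like(diag), where=support > 0)
        denom = prec + rec
        f1 = np.divide(2 * prec * rec, denom, out=np.zeros_like(diag), where=denom > 0)

        seen = support > 0          # classes with nonzero support
        return {
            "tp": tp,
            "fp": fp,
            "fn": fn,
            "f1_micro": tp / total,  # equals precision, recall, and accuracy here
            "f1_macro": f1[seen].mean(),
            "f1_weighted": np.average(f1, weights=support),
        }


    def hierarchical_prf(y_true, y_pred, parents):
        """Micro hierarchical precision/recall/F1 from ancestor-set overlap."""

        def ancestors(term):
            # The term itself plus everything reachable through `parents`.
            seen, stack = set(), [term]
            while stack:
                t = stack.pop()
                if t not in seen:
                    seen.add(t)
                    stack.extend(parents.get(t, ()))
            return seen

        def ratio(num, den):
            return num / den if den else 0.0

        h_tp = h_fp = h_fn = 0
        for t, p in zip(y_true, y_pred):
            a_true, a_pred = ancestors(t), ancestors(p)
            overlap = len(a_true & a_pred)
            h_tp += overlap
            h_fp += len(a_pred) - overlap
            h_fn += len(a_true) - overlap
        return {
            "h_tp": h_tp,
            "h_fp": h_fp,
            "h_fn": h_fn,
            "h_precision_micro": ratio(h_tp, h_tp + h_fp),
            "h_recall_micro": ratio(h_tp, h_tp + h_fn),
            "h_f1_micro": ratio(2 * h_tp, 2 * h_tp + h_fp + h_fn),
        }


    # Toy inputs (hypothetical labels, not real CL term IDs).
    cm = [[3, 1, 0],
          [1, 2, 0],
          [0, 0, 0]]  # the third class has zero support
    flat = flat_metrics(cm)

    parents = {
        "CD4 T cell": ["T cell"],
        "T cell": ["lymphocyte"],
        "B cell": ["lymphocyte"],
    }
    hier = hierarchical_prf(["CD4 T cell", "B cell"], ["T cell", "neuron"], parents)

On this toy input ``flat["f1_micro"]`` equals plain accuracy (5/7), and the class with
zero support is left out of the macro average. In the hierarchical part, predicting
``T cell`` for a true ``CD4 T cell`` still earns two shared ancestors, which is the
partial credit the hierarchical columns are meant to capture.
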
Using the Python API directly
+++++++++++++++++++++++++++++

Low-level functions operate on arrays and confusion matrices::

    import json

    from cellarium.cas.benchmarking import (
        build_confusion_matrix,
        compute_f_measure_from_cm,
        compute_hierarchical_f_measure_from_cm,
    )
    from cellarium.cas.postprocessing.cell_ontology.cell_ontology_cache import CellOntologyCache

    # Load ontology cache (provides ancestor mapping)
    with open("ontology_resource.json") as f:
        resource = json.load(f)
    cache = CellOntologyCache(resource)

    # Build a confusion matrix from label lists.
    # y_true and y_pred are your own ground-truth and predicted labels (not defined here).
    label_order = resource["cl_names"]
    cm = build_confusion_matrix(y_true, y_pred, label_order)

    # Standard F-measure
    f_metrics = compute_f_measure_from_cm(cm)

    # Hierarchical F-measure
    h_metrics = compute_hierarchical_f_measure_from_cm(cm, label_order, cache)

The high-level pipeline functions mirror the CLI steps::

    from cellarium.cas.cli._benchmark_impl import (
        run_confusion_matrix_step,
        run_aggregate_step,
        run_f_measure_step,
        run_hierarchical_f_measure_step,
    )

    run_confusion_matrix_step(
        "./annotate_outputs",
        "./bench",
        "cell_type_ontology_term_id",
        "cas_cell_type_name_1",
    )
    run_aggregate_step("./bench")
    run_f_measure_step("./bench")
    run_hierarchical_f_measure_step("./bench")

Aggregating custom groups
+++++++++++++++++++++++++

The CLI ``aggregate`` command groups confusion matrices by ``model_name`` automatically.
For custom groupings (e.g. by assay type or tissue), use ``aggregate_confusion_matrices``
directly in a notebook::

    from cellarium.cas.benchmarking.confusion_matrix import (
        load_confusion_matrix,
        aggregate_confusion_matrices,
        save_confusion_matrix,
    )

    # my_custom_group_dirs: the directories you want to pool (defined by you)
    cms = []
    for run_dir in my_custom_group_dirs:
        cm, _ = load_confusion_matrix(run_dir)
        cms.append(cm)
    agg = aggregate_confusion_matrices(cms)

Azimuth integration
+++++++++++++++++++

The ``cellarium.cas.benchmarking.azimuth`` helpers convert Azimuth annotation outputs
into CAS-compatible annotate directories so they can be evaluated with the same pipeline.
See ``cellarium.cas.benchmarking.azimuth.helpers.azimuth_to_cas_annotation`` for usage.

**Important:** when ``level_specs`` is auto-detected, Azimuth levels are ordered
**most granular first** (rank 1 = finest level), matching CAS convention. If you pass
explicit ``level_specs``, list them most-granular first.