Cellarium Cell Annotation Service (CAS) Quickstart Tutorial

This notebook is a short tutorial on using Cellarium CAS. Please read the instructions and run each cell in the order presented. Once you have finished the tutorial, feel free to go back and modify it as needed to annotate your own datasets.

Installing Cellarium CAS client library

As a first step, we need to install the Cellarium CAS client library, cellarium-cas, along with all dependencies needed for visualizations. To do so, run the next cell.

Note: If you have already installed cellarium-cas without the visualization dependencies, you should still run the next cell.

[ ]:

!pip install --no-cache-dir cellarium-cas[vis]

Load the AnnData file

In this tutorial, we will annotate a peripheral blood mononuclear cell (PBMC) scRNA-seq dataset from 10x Genomics.

Note: The original dataset, “10k PBMCs from a Healthy Donor (v3 chemistry)”, can be found here.

For the purpose of this tutorial, we have selected 4,000 cells at random from the full dataset. We have additionally precomputed UMAP embeddings of these cells using a standard scanpy workflow and performed unsupervised Leiden clustering.

Note: For a quick tutorial on scRNA-seq data quality control, preprocessing, embedding, and clustering using scanpy, we recommend this tutorial.

Note: We emphasize that CAS requires raw integer mRNA counts. If you are adapting this tutorial to your own dataset and your data is already normalized and/or restricted to a small gene set (e.g. highly variable genes), it is not suitable for CAS. If you have the raw counts in an AnnData layer or stored in the .raw attribute, please make sure that the .X attribute of your AnnData file is populated with the raw counts.
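Before submitting your own data, a quick check along the following lines can help confirm that .X holds raw counts. This is a minimal sketch and not part of the CAS client: the helper name looks_like_raw_counts is hypothetical, and it assumes .X is either a NumPy array or a SciPy sparse matrix.

```python
import numpy as np


def looks_like_raw_counts(X):
    """Return True if every stored value in X is a non-negative whole number.

    Hypothetical helper (not part of cellarium-cas). X may be a NumPy array
    or a SciPy sparse matrix such as the .X attribute of an AnnData object.
    """
    # SciPy sparse matrices expose toarray(); for those, inspect only the
    # explicitly stored values instead of densifying the whole matrix.
    if hasattr(X, "toarray"):
        values = X.tocsr().data
    else:
        values = np.asarray(X)
    # Raw counts are non-negative and have no fractional part.
    return bool(np.all(values >= 0) and np.all(np.mod(values, 1) == 0))


# Toy matrices standing in for adata.X
raw = np.array([[0, 3, 1], [2, 0, 7]])
normalized = np.log1p(raw / raw.sum(axis=1, keepdims=True) * 1e4)

print(looks_like_raw_counts(raw))
print(looks_like_raw_counts(normalized))
```

With a real dataset, you would call looks_like_raw_counts(adata.X). If the check fails and the raw counts live in a layer, adata.X = adata.layers["counts"].copy() restores them (the layer name "counts" is only an example); if they are stored in .raw, adata = adata.raw.to_adata() is one way to recover them.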

[ ]:

import scanpy as sc
import warnings

# suppressing some of the informational warnings
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=UserWarning)

# set default figure resolution and size
sc.set_figure_params(dpi=80)

[ ]:

# Download the sample AnnData file
!curl -O https://storage.googleapis.com/cellarium-file-system-public/cellarium-cas-tutorial/pbmc_10x_v3_4k.h5ad

[ ]:

# Read the sample AnnData object
adata = sc.read('pbmc_10x_v3_4k.h5ad')

Let us inspect the loaded AnnData file:

[ ]:

adata

The AnnData file contains 4000 (cells) x 33538 (genes), a cluster_label attribute (under .obs), and PCA and UMAP embeddings (under .obsm).

[ ]:

adata.obs

Let us inspect the UMAP embedding already available in the AnnData file:

[ ]:

sc.pl.umap(adata)

Also, let us inspect the unsupervised Leiden clustering of the PCA embeddings for a sanity check:

[ ]:

sc.pl.umap(adata, color='cluster_label')

Note: The UMAP embeddings and unsupervised clustering of the data are both optional and are not needed by CAS. However, these attributes are required for visualizing and inspecting the CAS output using our visualization tools.

Finally, let us inspect the .var attribute of the loaded AnnData file:

[ ]:

adata.var

We notice that Gene Symbols (names) serve as the index of the .var DataFrame, and Ensembl Gene IDs are provided under the gene_ids column. We take note of both for the next steps.

Note: CAS requires both Gene Symbols and Ensembl Gene IDs. If either is missing from your AnnData file, please update the file before proceeding to the next steps. We recommend using BioMart for converting Gene Symbols to Ensembl Gene IDs or vice versa.
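If you are unsure whether a column really holds Ensembl Gene IDs, a simple format check can catch obvious mix-ups, such as gene symbols stored in the ID column. The sketch below is an illustration only: find_malformed_gene_ids is a hypothetical helper, and the pattern encodes the usual Ensembl stable ID layout (ENS, an optional species prefix, G, eleven digits, and an optional version suffix). It does not tell you which ID versions CAS accepts.

```python
import re

# Ensembl gene IDs look like "ENSG00000141510" (human) or
# "ENSMUSG00000059552" (mouse), optionally followed by a version
# suffix such as ".12".
ENSEMBL_GENE_ID = re.compile(r"^ENS[A-Z]*G\d{11}(\.\d+)?$")


def find_malformed_gene_ids(gene_ids):
    """Return the entries that do not look like Ensembl gene IDs.

    Hypothetical helper (not part of cellarium-cas).
    """
    return [g for g in gene_ids if not ENSEMBL_GENE_ID.match(str(g))]


# Toy values standing in for adata.var["gene_ids"]
example_ids = ["ENSG00000141510", "ENSG00000012048.23", "TP53", ""]
print(find_malformed_gene_ids(example_ids))
```

An empty result suggests the column is at least well formed; any entries that are returned deserve a closer look before submission.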

Submit the loaded AnnData file to Cellarium CAS for annotation

As a first step, please populate your CAS API token in the next cell:

[ ]:

api_token = "<your-cellarium-cas-api-key>"

You can now connect to the Cellarium CAS backend and authenticate the session with your API token:

[ ]:

from cellarium.cas.client import CASClient

cas = CASClient(api_token=api_token)

The response will contain a list of annotation models and their brief descriptions. You need to choose the model that is suitable for your dataset. For this tutorial, we set cas_model_name to None, which implies choosing the default model. The default model is suitable for annotating human scRNA-seq datasets.

[ ]:

# Select the annotation model; 'None' for choosing the default model
cas_model_name = None

At this point, we are ready to submit our AnnData file to CAS for annotation.

Note: Before you proceed, you may need to modify the next cell for your dataset. CAS must be pointed to the appropriate columns in the .var DataFrame for fetching Gene Symbols and Ensembl Gene IDs. This is done by setting the feature_names_column_name and feature_ids_column_name arguments accordingly. If either appears as the index of the .var DataFrame, use index as the argument value. Otherwise, use the appropriate column name.
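A small pre-flight check can catch a misspelled column name before any data is sent. The following is a sketch under stated assumptions: check_feature_columns is a hypothetical helper that is not part of the CAS client, and it only mirrors the convention described above, where the value index refers to the index of the .var DataFrame.

```python
def check_feature_columns(var_columns, feature_ids_column_name, feature_names_column_name):
    """Raise ValueError if a requested .var column does not exist.

    Hypothetical pre-flight helper (not part of cellarium-cas). The special
    value "index" refers to the index of the .var DataFrame.
    """
    requested = {
        "feature_ids_column_name": feature_ids_column_name,
        "feature_names_column_name": feature_names_column_name,
    }
    for argument, column in requested.items():
        if column != "index" and column not in var_columns:
            raise ValueError(
                f"{argument}={column!r} is neither 'index' nor one of the "
                f".var columns {sorted(var_columns)}"
            )
    return requested


# Toy column list standing in for list(adata.var.columns)
var_columns = ["gene_ids", "feature_types"]
print(check_feature_columns(var_columns, "gene_ids", "index"))
```

With a real dataset, you would pass list(adata.var.columns) as the first argument and reuse the same two values in the annotation call.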

[ ]:

# Submit AnnData to CAS for ontology-aware cell type query
cas_ontology_aware_response = cas.annotate_matrix_cell_type_ontology_aware_strategy(
    matrix=adata,
    chunk_size=500,
    feature_ids_column_name='gene_ids',
    feature_names_column_name='index',
    cas_model_name=cas_model_name)

Let us take a quick look at the anatomy of the CAS ontology-aware cell type query response. In brief, the response is a Python object of type CellTypeOntologyAwareResults whose data attribute is a list with one element per cell in the queried AnnData file:

[ ]:

type(cas_ontology_aware_response)

[ ]:

len(cas_ontology_aware_response.data)

The list entry at position i is a dictionary that contains a number of cell type ontology terms and their relevance scores for the i’th cell:

[ ]:

cas_ontology_aware_response.data[2425]

Glancing at the above response, we may infer that cell number 2425 is a natural killer cell.

Exploring the Cellarium CAS response

We recommend exploring the CAS response using the provided CASCircularTreePlotUMAPDashApp Dash app, which offers a more streamlined and holistic visualization.

Tooltip: The visualization displays various cell type ontology terms as colored circles in a circular dendrogram. The relationships underlying this dendrogram correspond to “is_a” relationships from Cell Ontology (CL). Since these relationships are not mutually exclusive, a term can have multiple parent terms, meaning the same term can appear along different branches of the tree representation. The radius of each circle (whether it is a clade or a leaf node) signifies the occurrence of the term in the entire dataset, regardless of its relevance score. The color of the circle indicates the relevance score of the term in cells where it was found to have non-vanishing relevance.
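To make the two visual encodings concrete, the sketch below aggregates toy per-cell scores into an occurrence and a mean score per term. It is an illustration and not the app's code: it assumes that occurrence means the fraction of cells in which a term has a non-zero score and that the color reflects the mean score over those cells, and it uses a simplified term-to-score dictionary per cell rather than the actual CAS response structure.

```python
from collections import defaultdict


def summarize_terms(per_cell_scores):
    """Aggregate per-cell term scores into occurrence and mean score per term.

    per_cell_scores: list of dicts mapping term name -> relevance score,
    a simplified stand-in for the CAS response (the real structure differs).
    """
    counts = defaultdict(int)
    totals = defaultdict(float)
    for scores in per_cell_scores:
        for term, score in scores.items():
            if score > 0:  # only cells with non-vanishing relevance count
                counts[term] += 1
                totals[term] += score
    n_cells = len(per_cell_scores)
    return {
        term: {
            "occurrence": counts[term] / n_cells,       # would drive the circle radius
            "mean_score": totals[term] / counts[term],  # would drive the circle color
        }
        for term in counts
    }


# Toy scores for four cells
cells = [
    {"T cell": 0.9, "B cell": 0.0},
    {"T cell": 0.7, "B cell": 0.1},
    {"T cell": 0.0, "B cell": 0.8},
    {"T cell": 0.8},
]
for term, stats in summarize_terms(cells).items():
    print(term, stats)
```

In this toy example, a term can occur in many cells yet carry a modest mean score, which is why the radius and the color of a circle should be read separately.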

Here are some of the interactive capabilities of the visualization app:

Cell selection: By default, all cells are selected, and the cell type ontology dendrogram shows an aggregated summary over all cells. You can restrict the aggregation to a subset of cells by selecting your desired subset on the UMAP scatter plot, either by clicking a single cell or by using the rectangular select or lasso select tool. The dendrogram will react to your custom cell selection. If your input AnnData file includes clustering, you can restrict score aggregation to each cluster by selecting your cluster in the Settings panel (accessible via the gear icon in the upper right of the app).

Highlighting ontology term relevance scores: You can highlight cell type ontology term relevance scores over the UMAP scatter plot by clicking on the circles in the dendrogram. Only the selected cells will be scored, and the rest will be grayed out. You can revert to selecting all cells from the Settings panel or by using the rectangular select tool to select all cells.

Studying the ontology term relevance scores for a single cell: You can display the term relevance scores for individual cells by clicking on a single cell in the UMAP scatter plot.

Advanced settings: By default, only terms above a specified relevance threshold with occurrence above another threshold over the selected cells are shown. You can modify these thresholds in the Settings panel (accessible via the gear icon in the upper right of the app).

Note: The number of cells displayed should be limited to roughly 50K; beyond that, performance may suffer. If you need to visualize more cells, please downsample them first.
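One way to downsample is to draw a random subset of row indices and use it to slice the AnnData object. The helper below is a minimal sketch, not part of the CAS client: the name downsample_indices, the default cap of 50,000 cells, and the fixed seed are all assumptions made for illustration.

```python
import random


def downsample_indices(n_cells, max_cells=50_000, seed=0):
    """Return sorted row indices of a random subset of at most max_cells cells.

    Hypothetical helper (not part of cellarium-cas).
    """
    if n_cells <= max_cells:
        return list(range(n_cells))
    rng = random.Random(seed)  # fixed seed for a reproducible subset
    return sorted(rng.sample(range(n_cells), max_cells))


# Toy example: keep 5 out of 12 cells
keep = downsample_indices(n_cells=12, max_cells=5)
print(keep)
```

With a real dataset, adata_small = adata[downsample_indices(adata.n_obs)].copy() would produce the reduced object. scanpy also offers sc.pp.subsample for the same purpose.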

[ ]:

from cellarium.cas._io import suppress_stderr
from cellarium.cas.visualization import CASCircularTreePlotUMAPDashApp
from cellarium.cas.postprocessing import insert_cas_ontology_aware_response_into_adata

# Insert the CAS ontology-aware cell type query response into the AnnData object for the visualization application
insert_cas_ontology_aware_response_into_adata(cas_ontology_aware_response, adata)

DASH_SERVER_PORT = 8050

with suppress_stderr():
    CASCircularTreePlotUMAPDashApp(
        adata=adata,  # the AnnData file
        root_node="CL_0000255",  # set the CL root node to "eukaryotic cell"
        cluster_label_obs_column="cluster_label",  # (optional) The .obs column name containing cluster labels
    ).run(port=DASH_SERVER_PORT, debug=False, jupyter_width="100%")

Best cell type label assignment

Oftentimes, we just want the best cell type labels and not a scored cell type ontology graph. Such a call can be made, with the understanding that the best cell type call for a given cell is not well defined in general. Coarse ontology terms (e.g. T cell, B cell) often have higher relevance scores, whereas more granular labels (e.g. CD8-positive, alpha-beta T cell, IgG-negative class switched memory B cell) often have lower relevance scores. If the best call is construed as the most confident call, then such a call will naturally be too coarse and uninformative. Therefore, there is an inherent trade-off between accuracy and cell type call granularity.

We have implemented some basic functionalities to help users navigate the scored ontology graph and make cell type calls. Our current notion of the best cell type call is one that is furthest away from the root node (here, eukaryotic cell) while having a relevance score above a user-provided threshold. This definition allows us to sort the cell type ontology terms and report the top-k calls for each cell. We show top-3 calls for each cell and each cluster for demonstration.
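The ranking rule described above can be illustrated on a toy ontology. This is a sketch of one simple reading of the rule and not the library's implementation: it measures distance from the root as the smallest number of is_a hops, breaks ties by score, and uses made-up scores. The function names are hypothetical.

```python
from collections import deque


def depths_from_root(children, root):
    """Breadth-first search giving each term's hop distance from the root."""
    depth = {root: 0}
    queue = deque([root])
    while queue:
        term = queue.popleft()
        for child in children.get(term, []):
            if child not in depth:  # keep the first (shortest) distance found
                depth[child] = depth[term] + 1
                queue.append(child)
    return depth


def most_granular_top_k(scores, depth, min_acceptable_score, top_k):
    """Rank terms at or above the threshold, deepest first, ties by score."""
    accepted = [t for t, s in scores.items() if s >= min_acceptable_score and t in depth]
    accepted.sort(key=lambda t: (-depth[t], -scores[t]))
    return accepted[:top_k]


# Toy ontology (parent -> children) and toy relevance scores for one cell
children = {
    "eukaryotic cell": ["lymphocyte"],
    "lymphocyte": ["T cell", "B cell"],
    "T cell": ["CD8-positive, alpha-beta T cell"],
}
scores = {
    "eukaryotic cell": 1.0,
    "lymphocyte": 0.95,
    "T cell": 0.8,
    "CD8-positive, alpha-beta T cell": 0.3,
    "B cell": 0.05,
}
depth = depths_from_root(children, root="eukaryotic cell")
print(most_granular_top_k(scores, depth, min_acceptable_score=0.2, top_k=3))
```

With a threshold of 0.2, the most granular term that clears the threshold comes first. Raising the threshold to 0.5 in this toy example drops the CD8 term and shifts the calls toward coarser labels, which is the trade-off between confidence and granularity in miniature.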

[ ]:

import cellarium.cas.postprocessing.ontology_aware as pp
from cellarium.cas.postprocessing.cell_ontology import CellOntologyCache

with suppress_stderr():
    cl = CellOntologyCache()

Assign cell type calls to individual cells

[ ]:

pp.compute_most_granular_top_k_calls_single(
    adata=adata,
    cl=cl,
    min_acceptable_score=0.2,  # minimum acceptable evidence score for a cell type call
    top_k=3,  # how many top calls to make?
    obs_prefix="cas_cell_type"  # prefix of the .obs columns to write the top-k calls to
)

[ ]:

adata.obs

[ ]:

sc.pl.umap(adata, color='cas_cell_type_label_1')
sc.pl.umap(adata, color='cas_cell_type_label_2')
sc.pl.umap(adata, color='cas_cell_type_label_3')

Assign cell type calls to predefined cell clusters

[ ]:

pp.compute_most_granular_top_k_calls_cluster(
    adata=adata,
    cl=cl,
    min_acceptable_score=0.2,  # minimum acceptable evidence score for a cell type call
    cluster_label_obs_column='cluster_label',  # .obs column containing cluster labels
    top_k=3,  # how many top calls to make?
    obs_prefix='cas_cell_type_cluster'  # prefix of the .obs columns to write the top-k calls to
)

[ ]:

adata.obs

[ ]:

sc.pl.umap(adata, color='cas_cell_type_cluster_label_1')
sc.pl.umap(adata, color='cas_cell_type_cluster_label_2')
sc.pl.umap(adata, color='cas_cell_type_cluster_label_3')