Cell Type Prediction: A Review of Current Approaches and Emerging Challenges - OmnibusX

Introduction

Recent advances in single-cell RNA sequencing (scRNAseq) technologies have revolutionized the study of cellular heterogeneity, characteristics, and activity in various environments. Yet accurate, reproducible, and automated cell type annotation remains a significant challenge. In this blog post, we explore the major approaches to automated cell type annotation and the challenges each one faces.

Approaches

The problem of automated cell type annotation in scRNAseq data has given rise to a variety of methods, each employing distinct strategies to address the complexities inherent in cellular data. These approaches generally fall into four broad categories:

Marker-based approaches

Marker-based methods for automated cell type annotation involve two primary steps. The first step is the construction of a comprehensive cell type marker database. Publicly available resources such as PanglaoDB and CellMarker, as well as proprietary databases like the OmnibusX marker repository, form the backbone of these approaches. Once the marker database is established, the second step entails applying an algorithm to classify cell types based on these predefined markers. Various algorithms may be employed, including prior-knowledge classifiers such as SCINA, DigitalCellSorter, and Garnett CV, or enrichment-based classifiers like those integrated within OmnibusX.

While effective, this approach is fundamentally dependent on the quality and completeness of the cell type marker database. It is important to note that many markers within these databases are derived from disparate technologies, which may introduce inconsistencies when applied to scRNAseq data due to limitations such as capture efficiency, dropout rates, and other technical challenges specific to single-cell sequencing.
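The two-step marker-based workflow can be illustrated with a minimal Python sketch. It is not the algorithm used by SCINA, Garnett, or OmnibusX: the tiny marker table, the mean-expression score, and the `min_score` cutoff are all assumptions chosen for illustration.

```python
# Hypothetical toy marker database: cell type -> marker genes.
# A real database (PanglaoDB, CellMarker, ...) holds far more entries.
MARKERS = {
    "T cell": ["CD3D", "CD3E", "CD2"],
    "B cell": ["CD79A", "MS4A1", "CD19"],
}

def score_cell(expression, markers):
    """Mean expression of each marker set in one cell.

    `expression` maps gene -> normalized expression; genes that are
    absent from the cell (for example because of dropout) count as 0.
    """
    return {
        cell_type: sum(expression.get(g, 0.0) for g in genes) / len(genes)
        for cell_type, genes in markers.items()
    }

def annotate(expression, markers, min_score=0.5):
    """Assign the best-scoring cell type, or "Unassigned" below min_score."""
    scores = score_cell(expression, markers)
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_score else "Unassigned"

cell_a = {"CD3D": 2.1, "CD3E": 1.7, "MS4A1": 0.0}  # expresses T cell markers
cell_b = {"ACTB": 3.0}                             # expresses no marker at all

print(annotate(cell_a, MARKERS))
print(annotate(cell_b, MARKERS))
```

Because the score depends entirely on which genes appear in `MARKERS`, an incomplete or inaccurate marker list changes the labels directly, which is the dependency described above.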

Reference-based approaches

The reference-based approach to automated cell type prediction follows two key steps. The first step involves constructing a reference dataset, and the second leverages this reference to predict the cell types in new data. There are multiple strategies for building the reference:

  • Single Dataset Reference: In this method, a single scRNAseq dataset is utilized as the reference, such as the Tabula Sapiens atlas. This approach can mitigate technical variability between different datasets, but it comes with limitations. The number of cell types represented is constrained, and the diversity of conditions and tissues is often limited to those included within the single dataset.
  • Combined Dataset Reference: Alternatively, multiple scRNAseq datasets can be combined to create a more comprehensive reference, as seen with resources like the Human Cell Landscape Atlas (HCLA). This strategy ensures greater variety in terms of cell types, conditions, and tissues. However, it introduces technical variability stemming from differences in sequencing platforms (e.g., 10X Chromium, Smart-seq2, Drop-seq), sequencing depths, and quantification methods. This variability adds noise to the reference and can reduce the accuracy of cell type prediction.

The effectiveness of the reference-based approach is highly contingent upon the quality and scope of the reference dataset, particularly in terms of the diversity of cell types, conditions, and tissues it encompasses. A notable drawback is that many reference-based methods assign every query cell to its closest reference label and cannot mark a cell as "Unassigned" or "Unknown" when its cell type is not represented in the reference, which can lead to misannotations.
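The misannotation risk can be made concrete with a small sketch. It assumes a hypothetical reference made of one mean expression profile per cell type and uses Pearson correlation as the similarity measure; the gene list, the profiles, and the optional `min_corr` rejection cutoff are illustrative assumptions, not the behavior of any specific tool.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length expression vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

# Hypothetical reference: mean profile per cell type, in the gene order below.
GENES = ["CD3D", "CD79A", "LYZ", "NKG7"]
REFERENCE = {
    "T cell":   [5.0, 0.1, 0.2, 1.0],
    "B cell":   [0.1, 5.0, 0.3, 0.1],
    "Monocyte": [0.1, 0.2, 6.0, 0.2],
}

def predict(profile, reference, min_corr=None):
    """Return (label, correlation) for the most similar reference profile.

    With min_corr=None the closest label is always returned, even when the
    match is poor. With a cutoff, weak matches become "Unassigned".
    """
    corrs = {ct: pearson(profile, ref) for ct, ref in reference.items()}
    best = max(corrs, key=corrs.get)
    if min_corr is not None and corrs[best] < min_corr:
        return "Unassigned", corrs[best]
    return best, corrs[best]

query_t = [4.2, 0.0, 0.5, 0.8]      # resembles the T cell profile
query_novel = [2.0, 2.1, 1.9, 2.2]  # resembles nothing in the reference

print(predict(query_t, REFERENCE))
print(predict(query_novel, REFERENCE))                # forced onto a label
print(predict(query_novel, REFERENCE, min_corr=0.6))  # rejected instead
```

Without a cutoff, `query_novel` still receives a label despite correlating weakly with every profile. A rejection threshold is one simple way to surface such cells, though a suitable value has to be tuned per dataset.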

Machine learning classification approaches

The classification-based machine learning approach to cell type prediction comprises two primary steps. The first step involves assembling a training set composed of multiple scRNAseq datasets. The second step is to train a machine learning classifier—such as k-Nearest Neighbors (kNN), Random Forest, or Neural Networks—using this curated training set. Once trained, the model is employed to predict the cell types in new datasets.

While promising, this approach is highly dependent on the quality and diversity of the training set. Additionally, it faces challenges in mitigating technical variability between datasets, such as differences in sequencing platforms or experimental conditions. Moreover, many of these classifiers are not inherently designed for scRNAseq data, which can impact their performance when applied to single-cell datasets with unique complexities like high dimensionality, dropout, and batch effects.
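As a minimal sketch of the train-then-predict pattern, the following pure-Python k-Nearest Neighbors classifier labels a query cell by majority vote among its closest training cells. The two-gene expression vectors and labels are made-up toy data, and Euclidean distance on raw values is an assumption; real pipelines typically work on normalized, dimension-reduced data.

```python
import math
from collections import Counter

def knn_predict(query, train_X, train_y, k=3):
    """Label `query` by majority vote among its k nearest training cells."""
    nearest = sorted(
        (math.dist(query, x), label) for x, label in zip(train_X, train_y)
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy training set: each cell is [CD3D expression, CD79A expression].
train_X = [
    [5.0, 0.2], [4.5, 0.1], [4.8, 0.4],   # T cells
    [0.3, 5.1], [0.1, 4.7], [0.2, 5.5],   # B cells
]
train_y = ["T cell"] * 3 + ["B cell"] * 3

print(knn_predict([4.6, 0.3], train_X, train_y))
print(knn_predict([0.4, 5.0], train_X, train_y))
```

The classifier can only return labels present in `train_y`, and a systematic shift in expression scale between the training data and a new dataset changes every distance. This is why the composition of the training set and technical variability matter so much for this family of methods.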

Foundation model approaches

Inspired by the success of large language models such as ChatGPT, Mistral, and Gemini, foundation model approaches in scRNAseq data analysis represent a novel and promising direction. The process begins with training a model on vast amounts of scRNAseq data, creating a generalized model capable of serving various purposes, including multi-batch integration, multi-omics integration, perturbation response prediction, gene network inference, and cell type prediction. A prominent example of this approach is scGPT.

Despite its potential, several challenges remain. The quality and diversity of the training data are critical factors influencing the model's performance. Additionally, controlling what the model learns from the data poses a significant challenge, particularly in ensuring that the model captures biologically relevant features without introducing biases. Moreover, the computational demands of building foundation models are substantial, requiring extensive resources for both training and fine-tuning.

OmnibusX Cell Type Prediction

Cell Type Marker database curation

To address the challenges of cell type annotation in scRNA-seq data, OmnibusX employs a rigorously curated and highly specific map of cell type and subtype marker genes. While marker gene studies have been conducted extensively over the past decade, inconsistencies in naming conventions and a lack of cross-study validation have often led to ambiguous and non-reproducible results. To overcome these issues, we undertook a meticulous curation process that involved refining and validating marker gene sets from over 280 publications, ensuring their accuracy and utility in identifying cell types across diverse studies and tissues.

Our process began with a comprehensive literature review to compile validated marker genes associated with specific cell types and subtypes. We enhanced this list by conducting differential gene expression (DGE) analyses, ensuring that markers reflected accurate cell type distinctions. For cell subtype markers, we focused on sibling cell type contexts to ensure precision.

We then verified the consistency of these markers across multiple datasets, excluding genes influenced by external factors like age, sex, or technology. This step also included harmonizing disparate naming conventions, aligning them under a unified cell population ontology. In cases where cross-study agreement was low, we revisited and refined the marker selection process, ensuring robust cross-dataset reproducibility.
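One simple way to picture a cross-dataset consistency check is to keep only the genes that appear as markers in a minimum fraction of datasets. The sketch below is an illustration under that assumption and not the OmnibusX curation procedure: the per-dataset marker lists and the `min_fraction` value are hypothetical.

```python
from collections import Counter

def consistent_markers(per_dataset_markers, min_fraction=0.75):
    """Keep genes reported as markers in at least min_fraction of datasets."""
    counts = Counter(
        gene for markers in per_dataset_markers for gene in set(markers)
    )
    needed = min_fraction * len(per_dataset_markers)
    return sorted(gene for gene, n in counts.items() if n >= needed)

# Hypothetical T cell marker candidates from four independent datasets.
t_cell_markers_by_dataset = [
    ["CD3D", "CD3E", "CD2", "XIST"],  # XIST reflects donor sex, not cell type
    ["CD3D", "CD3E", "CD2"],
    ["CD3D", "CD3E", "IL7R"],
    ["CD3D", "CD2", "CD3E"],
]

print(consistent_markers(t_cell_markers_by_dataset))
```

Genes that show up in only one dataset, such as the sex-linked `XIST` in this toy input, fall below the cutoff and are dropped, while markers that recur across datasets are retained.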

Through this rigorous approach, OmnibusX developed a comprehensive map of 166 cell type and subtype-specific marker gene sets, significantly improving the accuracy and reproducibility of cell type annotations in scRNA-seq studies.

A customized prediction algorithm

To automate the assignment of cell labels based on our curated marker gene sets, OmnibusX employs a specialized algorithm built upon the AUCell framework. This algorithm operates through three main steps to ensure accurate cell type prediction:

  1. Noise Reduction: In scRNA-seq experiments, low-level gene expression can result from contamination or alignment errors. To address this, our algorithm first establishes an expression threshold for each gene, filtering out lowly and randomly expressed genes, effectively removing background noise.
  2. AUCell Enrichment and Label Assignment: Using the AUCell algorithm, we calculate the enrichment scores of each marker gene set within the transcriptomic profiles of individual cells. The cell is then assigned a label corresponding to the marker gene set with the highest enrichment, thereby classifying cells based on their gene expression patterns.
  3. Smoothing for Unassigned Cells: To mitigate the effects of dropout events, which are common in single-cell datasets, our algorithm employs a smoothing step. If a cell remains unassigned due to missing marker gene expression, we reassign its label based on the consensus of its 15 nearest neighboring cells, ensuring a more accurate annotation.
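The three steps above can be sketched end to end in a few lines of Python. This is a simplified illustration, not the OmnibusX implementation and not the AUCell package: the recovery-curve score is a reduced stand-in for AUCell's enrichment score, and the thresholds, marker sets, cells, and 2-D coordinates are all hypothetical toy values. The toy run uses `k=2` neighbors only because it has four cells; the text above specifies 15.

```python
import math
from collections import Counter

def denoise(cell, thresholds):
    """Step 1: drop genes at or below their per-gene expression threshold."""
    return {g: v for g, v in cell.items() if v > thresholds.get(g, 0.0)}

def recovery_auc(cell, gene_set, top_n):
    """Step 2 (simplified AUCell-style score): normalized area under the
    recovery curve of gene_set across the top_n highest-ranked genes."""
    ranked = sorted(cell, key=cell.get, reverse=True)[:top_n]
    hits, area = 0, 0
    for gene in ranked:
        hits += gene in gene_set
        area += hits
    area += hits * (top_n - len(ranked))  # curve stays flat past the last gene
    m = min(len(gene_set), top_n)
    max_area = sum(min(rank, m) for rank in range(1, top_n + 1))
    return area / max_area if max_area else 0.0

def assign(cell, marker_sets, thresholds, top_n=3):
    """Label a cell with its most enriched marker set, else "Unassigned"."""
    clean = denoise(cell, thresholds)
    scores = {ct: recovery_auc(clean, gs, top_n) for ct, gs in marker_sets.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "Unassigned"

def smooth(labels, coords, k=15):
    """Step 3: give unassigned cells the majority label of their k nearest
    neighbors, ignoring neighbors that are themselves unassigned."""
    out = list(labels)
    for i, label in enumerate(labels):
        if label != "Unassigned":
            continue
        neighbors = sorted(
            (j for j in range(len(labels)) if j != i),
            key=lambda j: math.dist(coords[i], coords[j]),
        )[:k]
        votes = Counter(labels[j] for j in neighbors if labels[j] != "Unassigned")
        if votes:
            out[i] = votes.most_common(1)[0][0]
    return out

# Hypothetical inputs.
MARKER_SETS = {"T cell": {"CD3D", "CD3E"}, "B cell": {"CD79A", "MS4A1"}}
THRESHOLDS = {g: 0.5 for g in ("CD3D", "CD3E", "CD79A", "MS4A1", "ACTB")}
cells = [
    {"CD3D": 3.0, "CD3E": 2.0, "ACTB": 5.0, "MS4A1": 0.2},
    {"CD3D": 2.5, "CD3E": 1.0, "ACTB": 4.0},
    {"CD79A": 3.0, "MS4A1": 2.0, "ACTB": 5.0},
    {"ACTB": 4.5, "CD3D": 0.3},  # marker signal lost to dropout
]
coords = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (0.2, 0.1)]  # e.g. an embedding

labels = [assign(c, MARKER_SETS, THRESHOLDS) for c in cells]
smoothed = smooth(labels, coords, k=2)
print(labels)
print(smoothed)
```

In this toy run the fourth cell has only sub-threshold `CD3D`, so it is left unassigned after step 2 and then inherits the label of its two nearest neighbors in step 3.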

The OmnibusX prediction algorithm is integrated into a user-friendly application, offering researchers an intuitive platform for scRNA-seq data annotation. The application also provides interactive visualizations, quality control metrics, and additional downstream analysis capabilities. Together, the OmnibusX algorithm and application deliver an accessible and robust solution for precise cell type annotation in scRNA-seq studies.

Conclusion

Cell type prediction remains a critical challenge in single-cell omics, but advances in methodology and the integration of curated marker gene sets have significantly improved the accuracy and reproducibility of cell annotations. OmnibusX offers a powerful, user-friendly solution for researchers working with scRNAseq, scATACseq, and 10X Visium HD data. Our highly curated marker gene sets and advanced prediction algorithm support robust cell type prediction for both human and mouse datasets, making it an invaluable tool for diverse applications in single-cell research.

We invite you to experience the precision and ease of OmnibusX firsthand. Download the OmnibusX desktop application today at https://omnibusx.com/apps and take advantage of our two-month trial. During this period, you can explore the full capabilities of our cell type prediction feature, applying it across scRNAseq, scATACseq, and 10X Visium HD datasets. Whether you’re working with human or mouse data, OmnibusX empowers you with the tools to streamline and enhance your single-cell analyses.