Single Cell RNAseq Analysis

Kerry Cobb

Steps

  1. Quality control
  2. Normalization
  3. Feature selection
  4. Dimensionality reduction
  5. Integration
  6. Clustering
  7. Annotation

Steps

  1. Quality control
  2. Normalization
  3. Feature selection
  4. Dimensionality reduction
  5. Integration
  6. Clustering
  7. Annotation
  • Not necessarily a linear process

Quality Control

  • Filter empty droplets
  • Filter droplets with multiple cells
  • Remove low quality cells
    • Cells with high mitochondrial gene counts
      • Suggests cell death or damage
  • Filter cells with low gene counts
  • Filter genes expressed in few cells

Quality Control

  • Filter empty droplets
  • Filter droplets with multiple cells
  • Remove low quality cells
    • Cells with high mitochondrial gene counts
      • Suggests cell death or damage
  • Filter cells with low gene counts
  • Filter genes expressed in few cells

Quality Control

  • Filter empty droplets
  • Filter droplets with multiple cells
  • Remove low quality cells
    • Cells with high mitochondrial gene counts
      • Suggests cell death or damage
  • Filter cells with low gene counts
  • Filter genes expressed in few cells

Quality Control

  • Filter empty droplets
  • Filter droplets with multiple cells
  • Remove low quality cells
    • Cells with high mitochondrial gene counts
      • Suggests cell death or damage
  • Filter cells with low gene counts
  • Filter genes expressed in few cells

Quality Control

  • Can set hard filtering thresholds
    • These end up being arbitrary as there is rarely a clear boundary
    • Time intensive
      • Doesn’t scale well with large datasets
  • Median Absolute Deviation (MAD) based filtering
    • Uses the distribution of QC metrics to set thresholds
    • More robust to outliers
    • Can be automated
    • MAD = median(|x - median(x)|), where x is the QC metric

Quality Control

  • Best to be as permissive as possible
    • Don’t want to filter out real cells
  • Can always filter more later
  • Iterative and data driven approach is best
    • But beware of The garden of forking paths
      • Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking”
      • Be careful with your conclusions

Doublet Detection

  • Can be filtered out using tools like the scDblFinder R package or Scrublet Python package

Correction of ambient RNA

  • RNA in the environment of the cell
    • Can be from dead cells, lysed cells, or other sources
  • Potentially a significant source of noise in single cell RNAseq data
  • Can be corrected for using tools like SoupX
    • These tools use the expression profiles of empty droplets to estimate the ambient RNA profile
    • They then subtract this profile from the expression data of each cell

Normalization

  • Each step of single cell workflow introduces a degree of variablity
    • Capture of cells and mRNA molecules
    • Reverse transcription
    • Amplification
    • Sequencing
  • Count matrix contains widely varying variance terms
  • Statistical methods assume uniform variance

Normalization - Depth bias

  • Two genes: A & B
  • Two cell types: Blue & Green
  • Normalize by dividing UMI counts for each gene by total

Cambridge Institute

  • There is not differential expression, only a difference in sequencing depth

Normalization

  • Normalization adjusts the raw counts by scaling to a specified range.
  • Reduces technical differences so that differences between are primarily biological
  • See Ahlmann-Eltze et al. 2020 for review & benchmarking of normalization methods
  • Different techniques are better suited to different downstream analyses

Normalization - Techniques

  • Shifted logarithm
    • Works well for dimensionality reduction and differential expression
  • Scran’s pooling-based size factor estimation method
    • Works well or batch correction
  • Analytic Pearson residuals

Feature Selection

  • Many genes are not informative for downstream analysis
  • We want to:
    • Select gene that captue biologically meaningful variation
    • Reduce the number of genes that only contribute noise

Luecken & Theis, 2019

Dimensionality Reduction

  • scRNAseq data suffers from the “curse of dimensionality”
    • The data have a high number of dimensions (genes)
    • Data contains more noise and redundancy
      • Added dimensions do not add more information

Dimensionality Reduction

  • We already reduced the dimensionality of the data by selecting a subset of genes
  • We can reduce it further with dimensionality reduction algorithms

Luecken & Theis, 2019

  • The most popular:
    • Principal Component Analysis (PCA)
    • t-distributed Stochastic Neighbor Embedding (t-SNE)
    • Uniform Manifold Approximation and Projection (UMAP)

Principal Component Analysis (PCA)

  • PCA creates new set of uncorrelated variables (principal components) that capture the most variance in the data
  • The first principal component captures the most variance, the second captures the second most, and so on
  • We can select the top principal components to use for downstream analysis
  • Highly efficient and easy to interpret
  • Not ideal for visualization

t-distributed Stochastic Neighbor Embedding (t-SNE)

  • Graph based, non-linear dimensionality reduction technique
  • Only local distances are preserved
  • Stochastic, so results can vary between runs
  • Good for visualizing clusters in high-dimensional data
  • See StatQuest Video for a great explanation

Uniform Manifold Approximation and Projection (UMAP)

  • Graph based, non-linear dimensionality reduction technique
  • Similar in principle to t-SNE
  • Uses principal components to create a graph
  • Preserves both local and global structure

Integration

  • Batch effects are a major challenge in scRNA-seq data analysis
    • Arise from processing cells in separate batches
      • Such as individual sample
    • Obscure biological realities
  • Removing batch effects is essential for analyses utlizing multiple batches

10X Genomics

Integration

Luecken & Theis, 2019

Clustering

  • Clustering is the process of grouping cells based on their expression profiles
  • Common clustering algorithms:
    • K-means
    • Hierarchical clustering
    • Louvain clustering
    • Leiden clustering
  • Louvain clustering was very popular for scRNA-seq data
  • Leiden is now the preferred algorithm
    • More robust to noise and outliers
    • Better at detecting small clusters
  • Clusters can be visualized with UMAP or t-SNE

Clustering

  • Keep in mind:
    • Clustering is an approximation of the underlying biological reality
    • Different levels can be appropriate for different questions
      • There may not be a single “correct” clustering
    • Clustering algorithm will create as many cluster as you ask it to
    • Don’t overlook continuous variation

Annotation

  • Assigning cell types to clusters
  • Can be done using:
    • Marker genes
    • Reference datasets
    • Automated annotation tools
  • Marker genes are genes that are specifically expressed in a particular cell type
  • Reference datasets are datasets that have already been annotated

Steps

  1. Quality control
  2. Normalization
  3. Dimensionality reduction
  4. Clustering
  5. Integration
  6. Annotation