Clustering Overview

The clustering module performs Leiden clustering and differential expression analysis to identify distinct cell populations in your data.

Pipeline Context

Clustering is part of Stage H. For the complete pipeline, see Annotation Pipeline.

Pipeline Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│ CLUSTERING PIPELINE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Expression Matrix (n_cells × n_markers) │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ 1. SCALE │ │
│ │ • Zero-center each marker │ │
│ │ • Clip values at ±10 standard deviations │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ 2. PCA │ │
│ │ • n_components = 30 (default) │ │
│ │ • Stores in adata.obsm["X_pca"] │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ 3. NEIGHBORS │ │
│ │ • k = 15 (default) │ │
│ │ • Uses PCA coordinates │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ 4. LEIDEN │ │
│ │ • resolution = 0.6 (default) │ │
│ │ • GPU: rapids_singlecell / CPU: igraph │ │
│ │ • Stores in adata.obs["cluster_lvl0"] │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ 5. UMAP (optional) │ │
│ │ • 2D embedding for visualization │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
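The scaling step (1) can be illustrated with the standard library alone. This is a minimal sketch, not the module's implementation. It assumes that "zero-center" is combined with unit-variance scaling, so that a clip value of 10 corresponds to ±10 standard deviations, and that the population standard deviation is used. The function name `scale_and_clip` is hypothetical.

```python
from statistics import fmean, pstdev


def scale_and_clip(column, max_value=10.0):
    """Z-score one marker column, then clip to [-max_value, max_value]."""
    mean = fmean(column)
    std = pstdev(column)  # assumption: population standard deviation
    if std == 0:
        # Constant marker: centering gives all zeros; avoid dividing by zero.
        return [0.0 for _ in column]
    return [max(-max_value, min(max_value, (x - mean) / std)) for x in column]


# One marker measured in 200 cells: 199 background cells and one extreme outlier.
marker = [0.0] * 199 + [1000.0]
scaled = scale_and_clip(marker)

print(max(scaled))  # the outlier is clipped to the upper bound
```

Without clipping, the single outlier would sit about 14 standard deviations from the mean and could dominate the leading principal components. In scanpy, comparable behaviour is available through `sc.pp.scale(adata, max_value=10)`; whether the module calls it exactly this way is not stated here.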

Preprocessing

Before clustering begins, several preprocessing steps prepare the data:

Layer Selection

The pipeline automatically selects the best available data layer in this priority order:

  1. batchcorr - Batch-corrected expression values (preferred)
  2. aligned - Aligned expression values
  3. X - Raw expression matrix (fallback)
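The priority order above amounts to a first-match lookup. The sketch below is illustrative only: `select_layer` and `LAYER_PRIORITY` are hypothetical names, and the function takes a plain collection of layer names (with AnnData, that would typically be `adata.layers.keys()`).

```python
# Preferred layers, best first. "X" is the main matrix, not a named layer,
# so it is handled as the fallback rather than listed here.
LAYER_PRIORITY = ("batchcorr", "aligned")


def select_layer(available_layers):
    """Return the first preferred layer name that exists, or "X" as fallback."""
    for name in LAYER_PRIORITY:
        if name in available_layers:
            return name
    return "X"


print(select_layer({"aligned", "batchcorr"}))  # batchcorr
print(select_layer({"aligned"}))               # aligned
print(select_layer(set()))                     # X
```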

Technical Marker Exclusion

Technical markers that should not influence clustering are automatically moved to .obs:

  • DAPI - Nuclear stain
  • Collagen IV - Structural marker
  • Beta-actin - Housekeeping gene

These markers remain available for visualization but do not affect cluster assignments.
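As a rough sketch of the idea, the split can be expressed as a partition of the marker columns. The code below assumes exact, case-sensitive name matching and uses a plain dictionary in place of an AnnData object. `split_technical` is a hypothetical helper, and the CD3 and CD20 columns are made-up example data.

```python
TECHNICAL_MARKERS = {"DAPI", "Collagen IV", "Beta-actin"}


def split_technical(expression):
    """Split {marker: per-cell values} into clustering features and obs columns."""
    features = {m: v for m, v in expression.items() if m not in TECHNICAL_MARKERS}
    obs = {m: v for m, v in expression.items() if m in TECHNICAL_MARKERS}
    return features, obs


# Hypothetical three-cell example.
expression = {
    "CD3": [0.1, 2.3, 1.8],
    "DAPI": [5.0, 5.1, 4.9],
    "CD20": [1.9, 0.2, 0.1],
}
features, obs = split_technical(expression)

print(list(features))  # ['CD3', 'CD20']
print(list(obs))       # ['DAPI']
```

Only `features` would feed scaling, PCA, and clustering. The `obs` columns stay attached to each cell, so they can still be plotted.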

Low-Variance Filtering

Markers with very low variance (std < 1e-3) are excluded from clustering to prevent numerical instability and improve cluster quality.
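One likely reason for the instability is that scaling divides each marker by its standard deviation, so a near-constant marker turns measurement noise into large values. The filter itself is a simple threshold test. The sketch below assumes the population standard deviation and uses hypothetical names (`low_variance_markers`, `marker_a`, `marker_b`).

```python
from statistics import pstdev

STD_THRESHOLD = 1e-3  # threshold quoted in the text


def low_variance_markers(expression, threshold=STD_THRESHOLD):
    """Return names of markers whose standard deviation is below threshold."""
    return [m for m, values in expression.items() if pstdev(values) < threshold]


# Hypothetical four-cell example.
expression = {
    "marker_a": [0.50, 1.75, 0.10, 2.40],          # varies across cells: kept
    "marker_b": [1.0000, 1.0001, 1.0000, 1.0001],  # nearly constant: dropped
}
dropped = low_variance_markers(expression)
kept = [m for m in expression if m not in dropped]

print(dropped)  # ['marker_b']
print(kept)     # ['marker_a']
```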

GPU Non-Determinism

GPU-accelerated Leiden is not deterministic: the number of clusters can differ between runs on the same data (by roughly 5 clusters). For reproducible results, either run with --no-gpu, or run clustering once and use --annotation-only for subsequent iterations.

Key Parameters

Parameter     Default  Description                      When to Adjust
resolution    0.6      Leiden clustering resolution     Increase for more clusters, decrease for fewer
n_pcs         30       Number of principal components   Lower if markers < 30; increase for complex datasets
neighbors_k   15       k for k-NN graph construction    Increase for smoother clusters; decrease to preserve rare populations
use_gpu       false    Enable GPU acceleration          Set true for large datasets (>100k cells)
min_cells     10       Minimum cells per cluster        Increase to filter noise clusters
random_state  42       Random seed for reproducibility  Change to test stability
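The defaults can be collected in a small configuration object. This is a sketch rather than the module's actual configuration class: `ClusteringParams` and `effective_n_pcs` are hypothetical names. Capping the component count at `n_markers - 1` is an assumption, based on the usual constraint that PCA cannot return more components than the panel has markers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusteringParams:
    """Defaults as documented in the parameter table."""

    resolution: float = 0.6
    n_pcs: int = 30
    neighbors_k: int = 15
    use_gpu: bool = False
    min_cells: int = 10
    random_state: int = 42

    def effective_n_pcs(self, n_markers: int) -> int:
        """Cap n_pcs for small panels (assumed cap: n_markers - 1, at least 1)."""
        return max(1, min(self.n_pcs, n_markers - 1))


params = ClusteringParams()
small_panel = params.effective_n_pcs(n_markers=20)
large_panel = params.effective_n_pcs(n_markers=50)

print(small_panel, large_panel)  # 19 30
```

With a 20-marker panel, the default of 30 components cannot be satisfied, which is why the table suggests lowering `n_pcs` when there are fewer than 30 markers.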

GPU Acceleration

The clustering module supports GPU acceleration via RAPIDS and cuGraph for significantly faster processing on large datasets.

When GPU is Available

  • rapids_singlecell handles PCA and neighbor computation
  • cuGraph performs GPU-accelerated Leiden clustering
  • Typical speedup: 10-50x for datasets >100k cells

Automatic Fallback

If GPU is unavailable or --no-gpu is specified:

  • scanpy handles PCA and neighbor computation
  • igraph performs CPU-based Leiden clustering
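The fallback can be pictured as a check made before any heavy computation starts. The sketch below is an assumption about how such a check could look, not the module's code. `pick_backend` is a hypothetical function, and it tests only whether the RAPIDS packages are importable, not whether a CUDA device is actually present.

```python
from importlib.util import find_spec

GPU_MODULES = ("rapids_singlecell", "cugraph")


def pick_backend(use_gpu, has_module=lambda name: find_spec(name) is not None):
    """Return "gpu" only if requested and both RAPIDS packages are importable."""
    if use_gpu and all(has_module(m) for m in GPU_MODULES):
        return "gpu"
    return "cpu"  # scanpy + igraph path


print(pick_backend(use_gpu=False))  # cpu
```

The `has_module` argument exists so the decision can be tested without a GPU installed, for example `pick_backend(True, has_module=lambda name: False)` returns `"cpu"`.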

Hardware Requirements

  • NVIDIA GPU with CUDA support
  • RAPIDS libraries installed (rapids-singlecell, cugraph)

CLI Usage

Basic Usage

celltype-refinery cluster \
--input merged.h5ad \
--resolution 0.6 \
--n-pcs 30 \
--out output/clustered

High-Resolution Clustering

celltype-refinery cluster \
--input merged.h5ad \
--resolution 1.2 \
--neighbors-k 10 \
--out output/high_res

GPU-Accelerated Processing

celltype-refinery cluster \
--input large_dataset.h5ad \
--use-gpu \
--resolution 0.6 \
--out output/gpu_clustered

Reproducible CPU-Only Run

celltype-refinery cluster \
--input merged.h5ad \
--no-gpu \
--random-state 42 \
--out output/reproducible

Annotation-Only Mode (Skip Reclustering)

celltype-refinery cluster \
--input already_clustered.h5ad \
--annotation-only \
--out output/reannotated

See Also