First Annotation Tutorial

A complete walkthrough of annotating your first dataset with CellType-Refinery.

Overview

In this tutorial, you'll:

Prepare your input data
Create a marker map
Run clustering
Annotate cell types
Review and refine results
Run composition and spatial analysis

Prerequisites

CellType-Refinery installed
A merged AnnData file with normalized expression data
Knowledge of expected cell types in your tissue

Step 1: Prepare Input Data

Your input should be an AnnData (.h5ad) file with:

X matrix: Normalized expression values
obs: Cell metadata (sample_id, region, etc.)
var: Marker/antibody names

import scanpy as sc

# Load your data
adata = sc.read_h5ad("merged_data.h5ad")

# Check structure
print(f"Cells: {adata.n_obs}")
print(f"Markers: {adata.n_vars}")
print(f"Layers: {list(adata.layers.keys())}")

Step 2: Create Marker Map

Define your cell-type hierarchy in JSON:

{
  "_marker_map_metadata": {
    "version": "1.0",
    "tissue": "example_tissue",
    "description": "Example marker map"
  },
  "Epithelium": {
    "markers": ["EpCAM", "Pan-Cytokeratin"],
    "anti_markers": ["CD45", "Vimentin"],
    "subtypes": {
      "Type A": {
        "markers": ["Marker1", "Marker2"]
      },
      "Type B": {
        "markers": ["Marker3", "Marker4"]
      }
    }
  },
  "Immune Cells": {
    "markers": ["CD45"],
    "anti_markers": ["EpCAM"],
    "subtypes": {
      "T Cells": {
        "markers": ["CD3"]
      },
      "Macrophages": {
        "markers": ["CD68", "CD163"]
      }
    }
  },
  "Stromal": {
    "markers": ["Vimentin"],
    "anti_markers": ["CD45", "EpCAM"]
  }
}

Save as marker_map.json.
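Before running annotation, it can help to check the marker map for structural mistakes. The sketch below is not part of CellType-Refinery; `load_marker_map`, `walk_cell_types`, and `check_marker_map` are hypothetical helper names. It assumes that keys starting with an underscore hold metadata and that nested types live under "subtypes", as in the example above.

```python
import json


def load_marker_map(path):
    """Read a marker map JSON file into a dict."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def walk_cell_types(node, path=()):
    """Yield (path, entry) for every cell type, depth first.

    Assumption: keys starting with "_" (such as "_marker_map_metadata")
    hold metadata, and nested types live under "subtypes".
    """
    for name, entry in node.items():
        if name.startswith("_"):
            continue
        current = path + (name,)
        yield current, entry
        yield from walk_cell_types(entry.get("subtypes", {}), current)


def check_marker_map(marker_map):
    """Return a list of problems; an empty list means none were detected."""
    problems = []
    for path, entry in walk_cell_types(marker_map):
        label = " / ".join(path)
        markers = entry.get("markers", [])
        anti = entry.get("anti_markers", [])
        if not markers:
            problems.append(f"{label}: no markers defined")
        overlap = sorted(set(markers) & set(anti))
        if overlap:
            problems.append(f"{label}: marker and anti-marker at once: {overlap}")
    return problems


# Trimmed copy of the example map, plus one deliberate mistake in "Stromal".
example = {
    "_marker_map_metadata": {"version": "1.0"},
    "Immune Cells": {
        "markers": ["CD45"],
        "anti_markers": ["EpCAM"],
        "subtypes": {"T Cells": {"markers": ["CD3"]}},
    },
    "Stromal": {"markers": ["Vimentin"], "anti_markers": ["Vimentin", "CD45"]},
}

for problem in check_marker_map(example):
    print(problem)
```

For a real file, `check_marker_map(load_marker_map("marker_map.json"))` runs the same checks on the map saved in this step.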

Step 3: Run Clustering

Cluster cells using the Leiden algorithm:

celltype-refinery cluster \
  --input merged_data.h5ad \
  --resolution 0.6 \
  --n-pcs 30 \
  --out output/clustering

Or in Python:

from celltype_refinery.core.clustering import ClusteringEngine

engine = ClusteringEngine(resolution=0.6, n_pcs=30)
result = engine.run(adata, output_dir="output/clustering")

print(f"Found {result.n_clusters} clusters")

Step 4: Annotate Cell Types

Run annotation with your marker map:

celltype-refinery annotate \
  --input output/clustering/clustered.h5ad \
  --marker-map marker_map.json \
  --out output/annotation

Or in Python:

from celltype_refinery.core.annotation import AnnotationEngine

engine = AnnotationEngine(marker_map_path="marker_map.json")
result = engine.run(adata, output_dir="output/annotation")

# Review results
print(result.summary())

Step 5: Review Results

Check the annotation outputs:

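import pandas as pd
import scanpy as sc

# Load cluster annotations
annotations = pd.read_csv("output/annotation/cluster_annotations.csv")
print(annotations[["cluster_id", "assigned_label", "score", "confidence"]])

# Load the annotated data written by the annotate step, so the check
# also works when the previous steps were run from the command line
adata = sc.read_h5ad("output/annotation/annotated.h5ad")

# Check cell-type distribution
print(adata.obs["cell_type_curated"].value_counts())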
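To decide whether refinement is worth running, one option is to list only the clusters whose confidence is LOW or VERY_LOW. The sketch below uses the standard library `csv` module and the four column names shown above; `clusters_needing_review` is a hypothetical helper, and the rows in `sample` are made-up values for illustration, not real output.

```python
import csv
import io

# Confidence labels that this sketch treats as candidates for refinement.
NEEDS_REVIEW = {"LOW", "VERY_LOW"}


def clusters_needing_review(csv_file):
    """Return (cluster_id, assigned_label, score) tuples, lowest score first."""
    reader = csv.DictReader(csv_file)
    flagged = [
        (row["cluster_id"], row["assigned_label"], float(row["score"]))
        for row in reader
        if row["confidence"].strip().upper() in NEEDS_REVIEW
    ]
    return sorted(flagged, key=lambda item: item[2])


# Made-up rows standing in for cluster_annotations.csv.
sample = io.StringIO(
    "cluster_id,assigned_label,score,confidence\n"
    "0,Epithelium,2.4,HIGH\n"
    "1,Immune Cells,0.7,LOW\n"
    "2,Stromal,0.3,VERY_LOW\n"
)

flagged = clusters_needing_review(sample)
for cluster_id, label, score in flagged:
    print(f"cluster {cluster_id}: {label} (score {score:.2f})")
```

With real output, pass an open file instead of `sample`, for example `open("output/annotation/cluster_annotations.csv", newline="")`.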

Understanding Scores

Score Range	Confidence	Interpretation
> 2.0	HIGH	Strong marker expression
1.0 - 2.0	MEDIUM	Moderate expression
0.5 - 1.0	LOW	Weak expression
< 0.5	VERY_LOW	Consider refinement
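The table can be read as a simple threshold lookup. The function below is an illustrative sketch, not the library's own code. The table does not say which band the exact boundaries 2.0, 1.0, and 0.5 fall into, so the sketch assumes each lower bound is inclusive (apart from the strict "> 2.0" for HIGH).

```python
def confidence_label(score: float) -> str:
    """Map a score to the confidence bands from the table.

    Assumption: boundary values belong to the band they open,
    so 1.0 is MEDIUM and 0.5 is LOW; 2.0 itself is MEDIUM
    because HIGH requires a score strictly above 2.0.
    """
    if score > 2.0:
        return "HIGH"
    if score >= 1.0:
        return "MEDIUM"
    if score >= 0.5:
        return "LOW"
    return "VERY_LOW"


for value in (2.5, 1.4, 0.6, 0.2):
    print(value, confidence_label(value))
```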

Step 6: Refine Annotations

If some clusters have low confidence, refine them:

# Diagnostic mode first (no changes)
celltype-refinery refine \
  --input output/annotation/annotated.h5ad \
  --auto \
  --out output/refinement

# Review diagnostic report
cat output/refinement/diagnostic_report.csv

# Execute refinement
celltype-refinery refine \
  --input output/annotation/annotated.h5ad \
  --auto \
  --execute \
  --out output/refinement

Step 7: Run Analysis

Generate composition and spatial analysis:

celltype-refinery analyze \
  --input output/refinement/refined.h5ad \
  --out output/analysis
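Each command reads the file written by the previous one, so the whole sequence can be scripted. The sketch below chains the commands from Steps 3 to 7 using only the flags shown in this tutorial; `build_pipeline` and `run_pipeline` are hypothetical helper names. It goes straight to `--execute` for refinement, so for a real dataset it is safer to run the diagnostic pass by hand first. By default it only prints the commands.

```python
import shlex
import subprocess


def build_pipeline(input_h5ad, marker_map, out_root="output"):
    """Return the four CLI invocations as argument lists, in run order."""
    tool = "celltype-refinery"
    return [
        [tool, "cluster", "--input", input_h5ad,
         "--resolution", "0.6", "--n-pcs", "30",
         "--out", f"{out_root}/clustering"],
        [tool, "annotate", "--input", f"{out_root}/clustering/clustered.h5ad",
         "--marker-map", marker_map,
         "--out", f"{out_root}/annotation"],
        [tool, "refine", "--input", f"{out_root}/annotation/annotated.h5ad",
         "--auto", "--execute",
         "--out", f"{out_root}/refinement"],
        [tool, "analyze", "--input", f"{out_root}/refinement/refined.h5ad",
         "--out", f"{out_root}/analysis"],
    ]


def run_pipeline(commands, dry_run=True):
    """Print each command; when dry_run is False, also run it."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)


commands = build_pipeline("merged_data.h5ad", "marker_map.json")
run_pipeline(commands)  # dry run: prints the commands without executing them
```

Calling `run_pipeline(commands, dry_run=False)` executes the steps in order and raises `subprocess.CalledProcessError` if any of them exits with an error.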

Output Structure

After completing all steps:

output/
├── clustering/
│   ├── clustered.h5ad
│   └── cluster_stats.csv
├── annotation/
│   ├── annotated.h5ad
│   ├── cluster_annotations.csv
│   ├── marker_scores.csv
│   └── mapping_table.csv
├── refinement/
│   ├── refined.h5ad
│   ├── diagnostic_report.csv
│   └── curation_log.json
└── analysis/
    ├── composition/
    ├── spatial/
    └── review/
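A quick way to confirm that every step finished is to compare the directory against the tree above. The sketch below is a hypothetical helper built on `pathlib`; the list of paths is copied from the tree, and the contents of the analysis subdirectories are not checked.

```python
from pathlib import Path

# Paths copied from the output tree, relative to the output root.
EXPECTED_OUTPUTS = [
    "clustering/clustered.h5ad",
    "clustering/cluster_stats.csv",
    "annotation/annotated.h5ad",
    "annotation/cluster_annotations.csv",
    "annotation/marker_scores.csv",
    "annotation/mapping_table.csv",
    "refinement/refined.h5ad",
    "refinement/diagnostic_report.csv",
    "refinement/curation_log.json",
    "analysis/composition",
    "analysis/spatial",
    "analysis/review",
]


def missing_outputs(root="output"):
    """Return the expected paths that do not exist under root."""
    base = Path(root)
    return [rel for rel in EXPECTED_OUTPUTS if not (base / rel).exists()]


for rel in missing_outputs("output"):
    print(f"missing: output/{rel}")
```

An empty result means all expected files and directories are present; anything printed points to the step that needs to be rerun.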

Troubleshooting

High Unassigned Rate

If many cells are unassigned:

Check marker names match between data and marker map
Lower the gating thresholds (for example, --min-coverage 0.2)
Review marker expression with sc.pl.dotplot()
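The first check, matching marker names, can be automated. The sketch below is not a CellType-Refinery function; `find_missing_markers` and its helpers are hypothetical names. It collects every marker and anti-marker from the map, reports those absent from the data, and suggests near matches that differ only in case or punctuation. It assumes the same map layout as in Step 2.

```python
import re


def iter_marker_names(node):
    """Yield every marker and anti-marker name in a nested marker map."""
    for name, entry in node.items():
        if name.startswith("_"):  # assumption: "_" keys hold metadata
            continue
        yield from entry.get("markers", [])
        yield from entry.get("anti_markers", [])
        yield from iter_marker_names(entry.get("subtypes", {}))


def normalize(name):
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def find_missing_markers(marker_map, var_names):
    """Return (missing, suggestions) for markers not found in var_names."""
    available = set(var_names)
    missing = sorted(set(iter_marker_names(marker_map)) - available)
    lookup = {normalize(v): v for v in available}
    suggestions = {
        m: lookup[normalize(m)] for m in missing if normalize(m) in lookup
    }
    return missing, suggestions


# Illustrative inputs; with real data use list(adata.var_names).
marker_map = {
    "Epithelium": {
        "markers": ["EpCAM", "Pan-Cytokeratin"],
        "anti_markers": ["CD45", "Vimentin"],
    }
}
var_names = ["EpCAM", "PanCytokeratin", "CD45"]

missing, suggestions = find_missing_markers(marker_map, var_names)
print("missing:", missing)
print("possible matches:", suggestions)
```

A marker with a suggestion is usually a spelling difference to fix in the map; a marker without one is probably not in the panel at all.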

Incorrect Annotations

If cell types seem wrong:

Check marker specificity in your data
Add anti-markers to distinguish similar types
Use manual overrides for known misassignments

Low Confidence Scores

If scores are consistently low:

Verify data normalization
Check for batch effects
Consider fewer, more specific markers

Next Steps

Core Workflows - Learn advanced patterns
Refinement Guide - Deep dive into refinement
Configuration Reference - Full parameter docs
