Refinement Overview

The refinement module improves cell type annotations through automatic and manual policies. It provides a flexible framework for correcting, merging, splitting, and relabeling clusters based on statistical evidence and expert knowledge.

Deep Dive

For the decision logic behind refinement, see Refinement Decision Logic.

Architecture

The refinement system uses a two-policy architecture that allows automatic recommendations to be combined with manual overrides:

┌─────────────────────────────────────────────────────────────┐
│                   REFINEMENT ARCHITECTURE                   │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   ┌─────────────────┐           ┌─────────────────┐         │
│   │   AutoPolicy    │           │  ManualPolicy   │         │
│   │  (--auto flag)  │           │ (--config YAML) │         │
│   └────────┬────────┘           └────────┬────────┘         │
│            │                             │                  │
│            ▼                             ▼                  │
│        base_plan                   overlay_plan             │
│            │                             │                  │
│            └──────────────┬──────────────┘                  │
│                           ▼                                 │
│                ┌─────────────────────┐                      │
│                │    merge_plans()    │                      │
│                │  (overlay wins on   │                      │
│                │     conflicts)      │                      │
│                └──────────┬──────────┘                      │
│                           ▼                                 │
│                      merged_plan                            │
│                           │                                 │
│                           ▼                                 │
│                ┌─────────────────────┐                      │
│                │  RefinementEngine   │                      │
│                │   (execute plan)    │                      │
│                └──────────┬──────────┘                      │
│                           ▼                                 │
│                     refined.h5ad                            │
└─────────────────────────────────────────────────────────────┘

Key Components

  • AutoPolicy: Analyzes scoring metrics and automatically generates refinement recommendations based on configurable thresholds
  • ManualPolicy: Reads user-defined YAML configuration files specifying exact operations to perform
  • merge_plans(): Combines the base plan (from AutoPolicy) with the overlay plan (from ManualPolicy); the overlay takes precedence on conflicts
  • RefinementEngine: Executes the merged plan and produces the refined output
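
The "overlay wins on conflicts" rule can be illustrated with a minimal Python sketch. It is not the library's implementation: the plan shape (a dictionary mapping a cluster ID to one planned operation), the field names, and the function name merge_plans_sketch are all assumptions made for illustration, and a conflict is assumed to mean that both plans contain an entry for the same cluster.

```python
from typing import Dict

# Hypothetical plan shape: cluster ID -> planned operation.
# This is an illustrative assumption, not the library's real data model.
Plan = Dict[str, dict]


def merge_plans_sketch(base_plan: Plan, overlay_plan: Plan) -> Plan:
    """Return a new plan where overlay entries replace base entries for the same cluster."""
    merged = dict(base_plan)     # start from the automatic recommendations
    merged.update(overlay_plan)  # manual entries win wherever both plans name a cluster
    return merged


# Base plan as AutoPolicy might produce it (made-up example values).
base_plan = {
    "3": {"op": "subcluster"},
    "7": {"op": "relabel", "label": "T cell"},
}

# Overlay plan as ManualPolicy might read it from YAML (made-up example values).
overlay_plan = {
    "7": {"op": "override", "label": "NK cell"},
    "9": {"op": "merge", "into": "2"},
}

merged_plan = merge_plans_sketch(base_plan, overlay_plan)
```

In this sketch cluster 7 ends up with the manual override, cluster 3 keeps its automatic recommendation, and cluster 9 is added from the overlay. The inputs are left unmodified, which keeps the base plan available for the diagnostic report.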

Execution Modes

The refinement module supports two primary execution modes that control whether changes are applied to the data:

Diagnostic Mode (Default)

When run without the --execute flag, the refinement module operates in diagnostic mode:

  • Analyzes the input data and generates recommendations
  • Produces a diagnostic_report.csv with suggested refinements
  • Does not modify the input data
  • Useful for reviewing proposed changes before committing

# Diagnostic mode - review recommendations first
celltype-refinery refine input.h5ad --auto

Execution Mode

When the --execute flag is provided, refinements are applied:

  • Executes all planned operations on the data
  • Outputs a new refined.h5ad file with updated annotations
  • Original input file remains unchanged
  • Generates an execution log for audit purposes

# Execution mode - apply refinements
celltype-refinery refine input.h5ad --auto --execute

Policy Combinations

The refinement module supports three policy configurations:

Auto-only Mode

Uses only the AutoPolicy to generate and apply refinements based on statistical analysis:

celltype-refinery refine input.h5ad --auto --execute

Best for:

  • Initial annotation passes
  • Large datasets requiring automated curation
  • When manual review is not feasible

Manual-only Mode

Uses only the ManualPolicy with a user-defined YAML configuration:

celltype-refinery refine input.h5ad --config curation.yaml --execute

Best for:

  • Expert-driven curation workflows
  • Applying known corrections from literature
  • Reproducible, version-controlled refinements

Hybrid Mode

Combines AutoPolicy recommendations with manual overrides. Manual configurations take precedence when conflicts occur:

celltype-refinery refine input.h5ad --auto --config overrides.yaml --execute

Best for:

  • Leveraging automation while maintaining expert control
  • Correcting specific auto-generated recommendations
  • Iterative refinement workflows

Operation Types

The refinement module supports five operation types, each serving a distinct purpose:

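| Operation  | Description                        | Source                     |
|------------|------------------------------------|----------------------------|
| Override   | Direct label assignment            | ManualPolicy               |
| Merge      | Combine similar clusters           | ManualPolicy               |
| Subcluster | Re-cluster at finer resolution     | AutoPolicy or ManualPolicy |
| Relabel    | Update label without re-clustering | AutoPolicy                 |
| Rescore    | Recompute scores                   | AutoPolicy (automatic)     |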

Operation Details

Override: Directly assigns a new cell type label to a cluster, bypassing all scoring logic. Used when expert knowledge contradicts automated predictions.

Merge: Combines two or more clusters into a single cluster. Typically used when clusters represent the same cell type but were over-split during initial clustering.

Subcluster: Re-clusters cells within a cluster at a finer resolution. Used when a cluster contains heterogeneous cell populations that should be separated.

Relabel: Updates the assigned label based on re-evaluation of scores without modifying cluster membership. Used when the top-scoring label changes after threshold adjustments.

Rescore: Recomputes confidence scores for all clusters. Automatically triggered after structural changes (merge, subcluster) to ensure score consistency.

Execution Order

Operations are executed in a specific order to ensure consistency and avoid conflicts:

override → merge → subcluster → relabel → rescore

This ordering ensures that:

  1. Direct overrides are applied first, preventing unnecessary computation
  2. Merges consolidate clusters before subclustering decisions
  3. Subclustering creates new clusters that can then be relabeled
  4. Relabeling uses the final cluster structure
  5. Rescoring reflects all structural changes
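
As a rough illustration of this ordering, the sketch below sorts a mixed list of planned operations into the documented sequence. The operation dictionaries and the helper name order_operations are hypothetical. Only the order itself comes from this page.

```python
# Documented execution order; everything else in this sketch is illustrative.
EXECUTION_ORDER = ["override", "merge", "subcluster", "relabel", "rescore"]
RANK = {name: position for position, name in enumerate(EXECUTION_ORDER)}


def order_operations(operations):
    """Sort operations by type; sorted() is stable, so ties keep their input order."""
    return sorted(operations, key=lambda op: RANK[op["op"]])


# Made-up operations in arbitrary order.
planned = [
    {"op": "relabel", "cluster": "4"},
    {"op": "subcluster", "cluster": "3"},
    {"op": "override", "cluster": "7"},
    {"op": "rescore", "cluster": "*"},
    {"op": "merge", "cluster": "1"},
    {"op": "override", "cluster": "2"},
]

ordered = order_operations(planned)
ordered_types = [op["op"] for op in ordered]
```

Because the sort is stable, the two overrides stay in the order they were listed (cluster 7, then cluster 2), which matters if a real plan relies on declaration order within one operation type.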

AutoPolicy Selection Criteria

The AutoPolicy uses a rules-based system to determine which operations to recommend for each cluster. Key decision factors include:

  • Confidence Score: Primary metric for label reliability
  • Delta Score: Difference between top two candidate labels
  • Cluster Size: Number of cells in the cluster
  • Marker Expression: Presence of canonical markers for predicted types
  • Entropy: Distribution of cells across candidate labels
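
Two of these factors are easy to state concretely. The sketch below computes a delta score as the gap between the two highest candidate scores, and an entropy as the Shannon entropy (in bits) of per-label cell counts. These formulas are assumptions based on the short descriptions above; the module's actual definitions may differ, and the function names are hypothetical.

```python
import math


def delta_score(scores):
    """Gap between the best and second-best candidate label scores."""
    if len(scores) < 2:
        raise ValueError("delta needs at least two candidate labels")
    ranked = sorted(scores.values(), reverse=True)
    return ranked[0] - ranked[1]


def label_entropy(cell_counts):
    """Shannon entropy (bits) of how cells are spread across candidate labels."""
    total = sum(cell_counts.values())
    if total <= 0:
        raise ValueError("cell counts must sum to a positive number")
    # Labels with zero cells contribute nothing to the entropy.
    probabilities = [count / total for count in cell_counts.values() if count > 0]
    return -sum(p * math.log2(p) for p in probabilities)


# Made-up example values for a single cluster.
example_delta = delta_score({"T cell": 0.8, "NK cell": 0.5, "B cell": 0.1})
even_split = label_entropy({"T cell": 50, "NK cell": 50})
single_label = label_entropy({"T cell": 100, "NK cell": 0})
```

Under these definitions, a small delta or a high entropy both indicate that the top label is poorly separated from its alternatives.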

Decision Thresholds

| Metric     | Low   | Medium    | High  |
|------------|-------|-----------|-------|
| Confidence | < 0.3 | 0.3 - 0.7 | > 0.7 |
| Delta      | < 0.1 | 0.1 - 0.3 | > 0.3 |
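
The bands in this table can be expressed as a small helper. This is a sketch only: the function names are hypothetical, and placing the boundary values (for example exactly 0.3 or 0.7) in the medium band is an assumption read off the strict inequalities in the table.

```python
def band(value, low_cutoff, high_cutoff):
    """Classify a metric as low, medium, or high; boundary values count as medium."""
    if value < low_cutoff:
        return "low"
    if value > high_cutoff:
        return "high"
    return "medium"


def confidence_band(confidence):
    # Cutoffs from the Confidence row of the table above.
    return band(confidence, 0.3, 0.7)


def delta_band(delta):
    # Cutoffs from the Delta row of the table above.
    return band(delta, 0.1, 0.3)


# Made-up example values.
example_bands = (confidence_band(0.82), delta_band(0.05))
```

Here a cluster with confidence 0.82 and delta 0.05 lands in the high confidence band but the low delta band, the kind of mixed signal that the decision logic documentation covers in more detail.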

For detailed decision trees and threshold tuning, see the Tuning Guide and Refinement Decision Logic documentation.

CLI Examples

Basic Usage

# View help for refine command
celltype-refinery refine --help

# Diagnostic run with auto policy
celltype-refinery refine input.h5ad --auto

# Execute auto refinements
celltype-refinery refine input.h5ad --auto --execute

Manual Configuration

# Diagnostic run with manual config
celltype-refinery refine input.h5ad --config curation.yaml

# Execute manual refinements
celltype-refinery refine input.h5ad --config curation.yaml --execute

# Hybrid mode with overrides
celltype-refinery refine input.h5ad --auto --config overrides.yaml --execute

Output Control

# Specify output path
celltype-refinery refine input.h5ad --auto --execute --output refined_output.h5ad

# Generate detailed diagnostic report
celltype-refinery refine input.h5ad --auto --report detailed_report.csv

# Verbose logging
celltype-refinery refine input.h5ad --auto --execute --verbose

Advanced Options

# Custom confidence threshold
celltype-refinery refine input.h5ad --auto --execute --min-confidence 0.5

# Limit operations to specific clusters
celltype-refinery refine input.h5ad --auto --execute --clusters 0,1,5,12

# Dry run to preview execution plan
celltype-refinery refine input.h5ad --auto --dry-run

See Also