Output Files Reference
This page describes all output files generated by Stage H (clustering and annotation) and Stage I (refinement).
Stage H Outputs
Stage H produces the initial clustering and hierarchical cell type annotation. All outputs are written to the specified output directory.
coarse_clusters.h5ad
The primary output AnnData file containing clustered and annotated cells.
Added .obs columns:
| Column | Type | Description |
|---|---|---|
| cluster_lvl0 | string | Leiden cluster ID |
| cell_type_auto | string | Assigned cell type label |
| cell_type_auto_root | string | Root category (Epithelium, Immune, etc.) |
| cell_type_auto_conf | float | Confidence score (minimum margin along the hierarchy path) |
| cell_type_auto_path | string | Full hierarchy path |
| DAPI_intensity | float | Technical marker intensity |
| Collagen_IV_intensity | float | Technical marker intensity |
| Beta_actin_intensity | float | Technical marker intensity |
Added .uns keys:
| Key | Description |
|---|---|
| de_wilcoxon_{layer} | Differential expression results per layer |
| stage_h_params | Run parameters including resolution and thresholds |
| stage_h_gating_params | Gating thresholds used for marker positivity |
| stage_h_marker_map | Compiled marker map used for annotation |
| stage_h_timestamp | ISO timestamp of when Stage H was run |
cluster_annotations.csv
Per-cluster summary with assignment details. This file provides a human-readable overview of all cluster assignments.
| Column | Type | Description |
|---|---|---|
| cluster_id | string | Cluster identifier |
| assigned_label | string | Final cell type label |
| assigned_path | string | Full hierarchy path (e.g., "Immune Cells / Lymphoids / T Cells") |
| assigned_level | int | Hierarchy depth reached (0 = root) |
| assigned_score | float | Score at assigned level |
| root_label | string | Root category |
| confidence | float | Minimum margin along the hierarchy path |
| stop_reason | string | Why hierarchy descent stopped (see the stop reason values below) |
| n_cells | int | Number of cells in cluster |
| coverage | float | Marker coverage (0-1) |
| resolved_markers | string | Comma-separated list of markers used |
| is_ambiguous_root | bool | True if root was ambiguous |
| composition | JSON | Per-cell vote composition (for ambiguous clusters) |
Stop reason values:
| Value | Meaning |
|---|---|
| leaf_reached | Reached a terminal node in hierarchy |
| no_children_pass | No child types passed gating threshold |
| low_margin | Winner margin below confidence threshold |
| insufficient_coverage | Too few markers available at next level |
| max_depth | Reached configured maximum depth |
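A quick way to review a run is to tally clusters and cells by stop_reason. The sketch below is illustrative only: it uses the standard library, and the rows in SAMPLE are hypothetical stand-ins for cluster_annotations.csv that include just the columns the function reads. The helper name summarize_stop_reasons is not part of the pipeline.

```python
import csv
import io
from collections import Counter

# Hypothetical excerpt of cluster_annotations.csv (illustrative values only).
SAMPLE = """cluster_id,assigned_label,stop_reason,n_cells
0,T Cells,leaf_reached,1200
1,Immune Cells,low_margin,300
2,Epithelium,no_children_pass,850
3,B Cells,leaf_reached,410
"""


def summarize_stop_reasons(handle):
    """Count clusters and cells for each stop_reason value."""
    clusters = Counter()
    cells = Counter()
    for row in csv.DictReader(handle):
        clusters[row["stop_reason"]] += 1
        cells[row["stop_reason"]] += int(row["n_cells"])
    return clusters, cells


clusters_by_reason, cells_by_reason = summarize_stop_reasons(io.StringIO(SAMPLE))
for reason, count in clusters_by_reason.most_common():
    print(f"{reason}: {count} clusters, {cells_by_reason[reason]} cells")
```

To run this against a real file, pass an open file handle (for example `open("cluster_annotations.csv", newline="", encoding="utf-8")`) instead of the in-memory sample.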
marker_scores.csv
Full scoring matrix showing scores for all (cluster, cell_type) pairs. Useful for understanding why specific assignments were made.
| Column | Type | Description |
|---|---|---|
| cluster_id | string | Cluster identifier |
| label | string | Cell type label |
| path | string | Hierarchy path |
| level | int | Hierarchy level |
| score | float | Total score |
| coverage | float | Marker coverage |
| mean_enrichment | float | Average Z-score |
| mean_positive_fraction | float | Average positive fraction |
| frac_markers_on | float | Fraction of markers above threshold |
| de_component | float | DE bonus contribution |
| anti_penalty | float | Anti-marker penalty |
| resolved_markers | string | Markers used |
| missing_markers | string | Markers not in panel |
| n_cells | int | Cluster size |
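To see why one label beat another, it can help to rank the candidates for each cluster at a given hierarchy level. The following is a minimal sketch, not the pipeline's own logic: the sample rows and labels are hypothetical, the function names rank_candidates and top_gap are invented for illustration, and the gap computed here is simply the difference between the two highest score values, which may not match how the pipeline derives its margins.

```python
import csv
import io
from collections import defaultdict

# Hypothetical excerpt of marker_scores.csv (illustrative values only).
SAMPLE = """cluster_id,label,level,score
0,Immune Cells,0,2.5
0,Epithelium,0,0.5
0,Stroma,0,1.0
1,Epithelium,0,3.0
1,Immune Cells,0,2.75
"""


def rank_candidates(handle, level=0):
    """Return {cluster_id: [(label, score), ...]} sorted by descending score."""
    by_cluster = defaultdict(list)
    for row in csv.DictReader(handle):
        if int(row["level"]) == level:
            by_cluster[row["cluster_id"]].append((row["label"], float(row["score"])))
    return {
        cluster_id: sorted(candidates, key=lambda c: c[1], reverse=True)
        for cluster_id, candidates in by_cluster.items()
    }


def top_gap(candidates):
    """Score difference between the best and second-best candidate, if any."""
    if len(candidates) < 2:
        return None
    return candidates[0][1] - candidates[1][1]


ranked = rank_candidates(io.StringIO(SAMPLE), level=0)
for cluster_id, candidates in ranked.items():
    print(cluster_id, candidates[0][0], top_gap(candidates))
```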
decision_steps.csv
Step-by-step trace of hierarchical assignment decisions. Essential for debugging and understanding the decision process.
| Column | Type | Description |
|---|---|---|
| cluster_id | string | Cluster identifier |
| step_idx | int | Step number (0 = root selection) |
| parent_label | string | Parent node |
| child_label | string | Candidate being evaluated |
| child_passed_gate | bool | Whether candidate passed gating |
| child_score | float | Candidate's score |
| child_coverage | float | Candidate's coverage |
| child_pos_frac | float | Candidate's positive fraction |
| child_enrichment | float | Candidate's enrichment |
| child_anti_penalty | float | Candidate's anti-penalty |
| selected | bool | Whether this candidate was selected |
| margin_to_runner_up | float | Score gap to second place |
| fail_reason | string | Why candidate failed (if applicable) |
Fail reason values:
| Value | Meaning |
|---|---|
| below_gate | Score below gating threshold |
| low_coverage | Insufficient marker coverage |
| anti_markers | Penalized by anti-marker expression |
| outscored | Another candidate had higher score |
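The trace for a single cluster can be reconstructed by grouping rows on step_idx and separating the selected candidate from the rejected ones. This sketch makes two assumptions that are not confirmed by this page: booleans are serialized as the text "True"/"False", and fail_reason is empty for selected candidates. The sample rows and the helper names parse_bool and trace_cluster are hypothetical.

```python
import csv
import io

# Hypothetical excerpt of decision_steps.csv; only the columns used are shown.
SAMPLE = """cluster_id,step_idx,child_label,child_passed_gate,child_score,selected,fail_reason
7,0,Immune Cells,True,2.4,True,
7,0,Epithelium,False,0.3,False,below_gate
7,1,Lymphoids,True,1.9,True,
7,1,Myeloids,True,1.1,False,outscored
8,0,Epithelium,True,3.1,True,
"""


def parse_bool(text):
    """Interpret a CSV boolean cell (assumed to be 'True'/'False' or '1'/'0')."""
    return text.strip().lower() in {"true", "1"}


def trace_cluster(handle, cluster_id):
    """Return [(step_idx, selected_label, [(rejected_label, fail_reason), ...])]."""
    steps = {}
    for row in csv.DictReader(handle):
        if row["cluster_id"] != cluster_id:
            continue
        step = steps.setdefault(int(row["step_idx"]), {"selected": None, "rejected": []})
        if parse_bool(row["selected"]):
            step["selected"] = row["child_label"]
        else:
            step["rejected"].append((row["child_label"], row["fail_reason"]))
    return [(idx, steps[idx]["selected"], steps[idx]["rejected"]) for idx in sorted(steps)]


trace = trace_cluster(io.StringIO(SAMPLE), "7")
for step_idx, selected, rejected in trace:
    print(f"step {step_idx}: selected={selected}, rejected={rejected}")
```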
figures/
QC visualization plots for assessing annotation quality:
| File | Description |
|---|---|
| cluster_size_distribution.png | Histogram of cluster sizes |
| confidence_distribution.png | Histogram of confidence scores |
| depth_distribution.png | Histogram of hierarchy depths reached |
| stop_reason_summary.png | Bar chart of stop reasons |
| root_composition.png | Pie chart of root category distribution |
| marker_coverage_heatmap.png | Heatmap of marker coverage by cluster |
Stage I Outputs
Stage I performs iterative refinement of the initial annotations, including subclustering of heterogeneous clusters and relabeling of misannotated ones.
refined.h5ad
Refined AnnData with updated annotations after one or more refinement iterations.
Added/modified .obs columns:
| Column | Type | Description |
|---|---|---|
| cluster_lvl1 | string | Refined cluster ID (may include subclusters like "3:0", "3:1") |
| cell_type_lvl1 | string | Refined cell type label |
| cell_type_lvl0 | string | Original Stage H label (preserved) |
| refinement_action | string | Action taken (SUBCLUSTER, RELABEL, KEEP) |
| refinement_iteration | int | Iteration when this cell was last modified |
Added .uns keys:
| Key | Description |
|---|---|
| stage_refine_iteration_N | Provenance for iteration N |
| stage_i_params | Refinement parameters used |
| stage_i_summary | Summary statistics of refinement |
Provenance structure (stage_refine_iteration_N):
{
"timestamp": "2024-01-15T10:30:00",
"clusters_processed": 45,
"subclustered": ["3", "7", "12"],
"relabeled": {"5": "B Cells", "9": "Macrophages"},
"skipped": ["1", "2", "4"],
"parameters": {...}
}
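Because there is one provenance key per iteration, the records have to be collected by key pattern and ordered numerically (a plain string sort would place iteration 10 before iteration 2). The sketch below assumes N is a non-negative integer and uses a plain dict as a hypothetical stand-in for the .uns mapping; the record contents are illustrative and omit the parameters field, and the helper names are not part of the pipeline.

```python
import re

ITERATION_KEY = re.compile(r"^stage_refine_iteration_(\d+)$")


def iteration_records(uns):
    """Return (N, record) pairs for every stage_refine_iteration_N key, ordered by N."""
    found = []
    for key, record in uns.items():
        match = ITERATION_KEY.match(key)
        if match:
            found.append((int(match.group(1)), record))
    return sorted(found, key=lambda pair: pair[0])


def summarize(record):
    """Count how many clusters were subclustered, relabeled, and skipped."""
    return {
        "subclustered": len(record.get("subclustered", [])),
        "relabeled": len(record.get("relabeled", {})),
        "skipped": len(record.get("skipped", [])),
    }


# Hypothetical stand-in for the .uns mapping of refined.h5ad.
uns = {
    "stage_i_params": {},
    "stage_refine_iteration_10": {
        "subclustered": [],
        "relabeled": {"5": "B Cells"},
        "skipped": ["1", "2"],
    },
    "stage_refine_iteration_2": {
        "subclustered": ["3", "7"],
        "relabeled": {},
        "skipped": ["4"],
    },
}

records = iteration_records(uns)
for n, record in records:
    print(n, summarize(record))
```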
diagnostic_report.csv
Recommendations from diagnostic mode. This file is generated when Stage I is run with the --diagnostic flag.
| Column | Type | Description |
|---|---|---|
| cluster_id | string | Cluster identifier |
| current_label | string | Current assignment |
| recommendation | string | SUBCLUSTER, RELABEL, or SKIP |
| reason | string | Why this recommendation was made |
| criterion | string | Which AutoPolicy criterion matched |
| best_child | string | Best child type (for RELABEL) |
| child_score | float | Best child's score |
| heterogeneity_score | float | Cluster heterogeneity measure |
| n_cells | int | Number of cells in cluster |
| priority | int | Suggested processing order |
Recommendation values:
| Value | Meaning |
|---|---|
| SUBCLUSTER | Cluster is heterogeneous, should be split |
| RELABEL | Cluster can be assigned to a more specific type |
| SKIP | No refinement needed |
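A typical use of the report is to list only the clusters that need attention, in the suggested order. The sketch below assumes that a lower priority value means "process earlier"; this page does not state the direction, so check it against your own output. The sample rows and the helper name actionable are hypothetical.

```python
import csv
import io

# Hypothetical excerpt of diagnostic_report.csv; only the columns used are shown.
SAMPLE = """cluster_id,current_label,recommendation,n_cells,priority
3,Immune Cells,SUBCLUSTER,5000,1
5,Lymphoids,RELABEL,800,2
1,Epithelium,SKIP,1200,3
7,Immune Cells,SUBCLUSTER,2500,0
"""


def actionable(handle):
    """Return non-SKIP rows ordered by ascending priority (assumed: lower = sooner)."""
    rows = [row for row in csv.DictReader(handle) if row["recommendation"] != "SKIP"]
    return sorted(rows, key=lambda row: int(row["priority"]))


todo = actionable(io.StringIO(SAMPLE))
for row in todo:
    print(row["priority"], row["cluster_id"], row["recommendation"], row["current_label"])
```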
refinement_log.csv
Detailed log of all refinement actions taken across iterations.
| Column | Type | Description |
|---|---|---|
| iteration | int | Refinement iteration number |
| cluster_id | string | Original cluster identifier |
| action | string | Action taken |
| old_label | string | Label before refinement |
| new_label | string | Label after refinement |
| new_clusters | string | Comma-separated list of new cluster IDs (for SUBCLUSTER) |
| confidence_before | float | Confidence before action |
| confidence_after | float | Confidence after action |
| timestamp | string | ISO timestamp |
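Since new_clusters is itself a comma-separated list stored inside a CSV field, it arrives as one quoted string and needs a second split. The sketch below builds a parent-to-children map from SUBCLUSTER rows. The sample rows are hypothetical and include only the columns used, and the helper name subcluster_map is invented for illustration.

```python
import csv
import io

# Hypothetical excerpt of refinement_log.csv; only the columns used are shown.
SAMPLE = """iteration,cluster_id,action,new_clusters
1,3,SUBCLUSTER,"3:0,3:1"
1,5,RELABEL,
2,7,SUBCLUSTER,"7:0,7:1,7:2"
"""


def subcluster_map(handle):
    """Map each subclustered cluster ID to the list of cluster IDs it was split into."""
    children = {}
    for row in csv.DictReader(handle):
        if row["action"] == "SUBCLUSTER" and row["new_clusters"]:
            children[row["cluster_id"]] = [
                part.strip() for part in row["new_clusters"].split(",")
            ]
    return children


split_into = subcluster_map(io.StringIO(SAMPLE))
for parent, kids in split_into.items():
    print(f"{parent} -> {kids}")
```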
File Formats
AnnData (.h5ad)
All .h5ad files use the standard AnnData HDF5 format:
- Compatible with scanpy and anndata Python packages
- Can be read with anndata.read_h5ad()
- Stores expression matrices, cell/gene metadata, and analysis results
CSV Files
All CSV files use:
- UTF-8 encoding
- Comma delimiter
- First row as header
- Fields containing commas are enclosed in double quotes
JSON Fields
Some CSV columns contain JSON-encoded data:
- Parse with standard JSON libraries
- Always valid JSON (not Python dict syntax)
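JSON stored in a CSV field is wrapped in double quotes with its inner quotes doubled, so it should be read with a CSV parser first and only then passed to a JSON parser. The round trip below is a minimal sketch using the standard library. The shape of the composition value (a label-to-fraction mapping) and the empty cell for non-ambiguous clusters are assumptions made for illustration, and load_compositions is a hypothetical helper.

```python
import csv
import io
import json

# Write a small in-memory CSV so the quoting is produced by the csv module itself.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["cluster_id", "is_ambiguous_root", "composition"])
# Assumed shape: label -> fraction of per-cell votes (illustrative values only).
writer.writerow(["4", "True", json.dumps({"Immune": 0.6, "Epithelium": 0.4})])
writer.writerow(["5", "False", ""])  # assumed: empty when the root is not ambiguous


def load_compositions(handle):
    """Return {cluster_id: parsed composition or None} from a CSV handle."""
    result = {}
    for row in csv.DictReader(handle):
        raw = row["composition"]
        result[row["cluster_id"]] = json.loads(raw) if raw else None
    return result


raw_text = buffer.getvalue()
compositions = load_compositions(io.StringIO(raw_text))
print(raw_text)
print(compositions)
```

Splitting such a line on commas by hand would cut the JSON value apart, which is why the CSV parser has to run first.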
See Also
- Annotation Pipeline - Pipeline overview
- CLI Reference - Command options
- Marker Maps Configuration - Input marker map format