Output Files Reference

This page describes all output files generated by Stage H (clustering and annotation) and Stage I (refinement).

Stage H Outputs

Stage H produces the initial clustering and hierarchical cell type annotation. All outputs are written to the specified output directory.

coarse_clusters.h5ad

The primary output AnnData file containing clustered and annotated cells.

Added .obs columns:

Column	Type	Description
`cluster_lvl0`	string	Leiden cluster ID
`cell_type_auto`	string	Assigned cell type label
`cell_type_auto_root`	string	Root category (Epithelium, Immune, etc.)
`cell_type_auto_conf`	float	Confidence score (minimum winner margin along the assigned path)
`cell_type_auto_path`	string	Full hierarchy path
`DAPI_intensity`	float	Technical marker intensity
`Collagen_IV_intensity`	float	Technical marker intensity
`Beta_actin_intensity`	float	Technical marker intensity

Added .uns keys:

Key	Description
`de_wilcoxon_{layer}`	Differential expression results per layer
`stage_h_params`	Run parameters including resolution and thresholds
`stage_h_gating_params`	Gating thresholds used for marker positivity
`stage_h_marker_map`	Compiled marker map used for annotation
`stage_h_timestamp`	ISO timestamp of when Stage H was run
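
As a quick sanity check after a run, the sketch below verifies that the documented annotation columns are present. The helper `missing_obs_columns` is a hypothetical name, not part of the pipeline, and the file path in the comment is a placeholder. The technical marker intensity columns are left out of the check on the assumption that they depend on the panel.

```python
# Annotation columns documented above for coarse_clusters.h5ad (.obs).
EXPECTED_OBS_COLUMNS = [
    "cluster_lvl0",
    "cell_type_auto",
    "cell_type_auto_root",
    "cell_type_auto_conf",
    "cell_type_auto_path",
]


def missing_obs_columns(obs_columns):
    """Return the documented columns that are absent from obs_columns."""
    present = set(obs_columns)
    return [name for name in EXPECTED_OBS_COLUMNS if name not in present]


# With anndata installed and a real output directory (placeholder path):
#   import anndata
#   adata = anndata.read_h5ad("output_dir/coarse_clusters.h5ad")
#   print(missing_obs_columns(adata.obs.columns))
#   print(adata.uns["stage_h_params"])

# Stand-alone demonstration with a made-up column list.
demo_columns = ["cluster_lvl0", "cell_type_auto", "cell_type_auto_conf"]
print(missing_obs_columns(demo_columns))
```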

cluster_annotations.csv

Per-cluster summary of the final assignments, intended as a human-readable overview of how each cluster was labeled and why descent stopped.

Column	Type	Description
`cluster_id`	string	Cluster identifier
`assigned_label`	string	Final cell type label
`assigned_path`	string	Full hierarchy path (e.g., "Immune Cells / Lymphoids / T Cells")
`assigned_level`	int	Hierarchy depth reached (0 = root)
`assigned_score`	float	Score at assigned level
`root_label`	string	Root category
`confidence`	float	Minimum winner margin along the assigned path
`stop_reason`	string	Why descent stopped
`n_cells`	int	Number of cells in cluster
`coverage`	float	Marker coverage (0-1)
`resolved_markers`	string	Comma-separated list of markers used
`is_ambiguous_root`	bool	True if root was ambiguous
`composition`	JSON	Per-cell vote composition (for ambiguous clusters)

Stop reason values:

Value	Meaning
`leaf_reached`	Reached a terminal node in hierarchy
`no_children_pass`	No child types passed gating threshold
`low_margin`	Winner margin below confidence threshold
`insufficient_coverage`	Too few markers available at next level
`max_depth`	Reached configured maximum depth
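
A common first look at this file is a tally of stop reasons, which shows how often descent ended at a leaf versus stalling earlier. The following is a minimal sketch using only the standard library. The sample rows are made up and include only a subset of the documented columns, and `stop_reason_counts` is a hypothetical helper.

```python
import csv
import io
from collections import Counter

# Made-up rows using a subset of the documented columns.
SAMPLE_CSV = """cluster_id,assigned_label,assigned_path,stop_reason,n_cells
0,T Cells,Immune Cells / Lymphoids / T Cells,leaf_reached,1200
1,Immune Cells,Immune Cells,low_margin,300
2,Epithelium,Epithelium,no_children_pass,850
3,T Cells,Immune Cells / Lymphoids / T Cells,leaf_reached,400
"""


def stop_reason_counts(handle):
    """Count clusters and cells per stop_reason value."""
    clusters = Counter()
    cells = Counter()
    for row in csv.DictReader(handle):
        clusters[row["stop_reason"]] += 1
        cells[row["stop_reason"]] += int(row["n_cells"])
    return clusters, cells


clusters, cells = stop_reason_counts(io.StringIO(SAMPLE_CSV))
for reason, count in clusters.most_common():
    print(f"{reason}: {count} clusters, {cells[reason]} cells")
```

For a real run, replace the `io.StringIO` object with an open file handle on `cluster_annotations.csv`.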

marker_scores.csv

Full scoring matrix showing scores for all (cluster, cell_type) pairs. Useful for understanding why specific assignments were made.

Column	Type	Description
`cluster_id`	string	Cluster identifier
`label`	string	Cell type label
`path`	string	Hierarchy path
`level`	int	Hierarchy level
`score`	float	Total score
`coverage`	float	Marker coverage
`mean_enrichment`	float	Average Z-score
`mean_positive_fraction`	float	Average positive fraction
`frac_markers_on`	float	Fraction of markers above threshold
`de_component`	float	DE bonus contribution
`anti_penalty`	float	Anti-marker penalty
`resolved_markers`	string	Markers used
`missing_markers`	string	Markers not in panel
`n_cells`	int	Cluster size
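
To see why one label won over another, it can help to rank the candidates for each cluster at a given hierarchy level. The sketch below does this with the standard library. The scores are invented, only a few of the documented columns are shown, and `top_candidates` is a hypothetical helper rather than part of the pipeline.

```python
import csv
import io
from collections import defaultdict

# Made-up rows using a subset of the documented columns.
SAMPLE_CSV = """cluster_id,label,path,level,score
0,Immune Cells,Immune Cells,0,2.4
0,Epithelium,Epithelium,0,0.3
0,T Cells,Immune Cells / Lymphoids / T Cells,2,1.5
0,B Cells,Immune Cells / Lymphoids / B Cells,2,0.9
1,Epithelium,Epithelium,0,1.9
1,Immune Cells,Immune Cells,0,1.7
"""


def top_candidates(handle, level=0, k=2):
    """Return the k highest-scoring (label, score) pairs per cluster at one level."""
    by_cluster = defaultdict(list)
    for row in csv.DictReader(handle):
        if int(row["level"]) == level:
            by_cluster[row["cluster_id"]].append((row["label"], float(row["score"])))
    return {
        cluster_id: sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
        for cluster_id, candidates in by_cluster.items()
    }


ranked = top_candidates(io.StringIO(SAMPLE_CSV), level=0)
for cluster_id, candidates in ranked.items():
    print(cluster_id, candidates)
```

A small gap between the first and second candidate, as in cluster 1 of the sample, is the kind of case where `low_margin` would be a plausible stop reason.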

decision_steps.csv

Step-by-step trace of hierarchical assignment decisions. Essential for debugging and understanding the decision process.

Column	Type	Description
`cluster_id`	string	Cluster identifier
`step_idx`	int	Step number (0 = root selection)
`parent_label`	string	Parent node
`child_label`	string	Candidate being evaluated
`child_passed_gate`	bool	Whether candidate passed gating
`child_score`	float	Candidate's score
`child_coverage`	float	Candidate's coverage
`child_pos_frac`	float	Candidate's positive fraction
`child_enrichment`	float	Candidate's enrichment
`child_anti_penalty`	float	Candidate's anti-penalty
`selected`	bool	Whether this candidate was selected
`margin_to_runner_up`	float	Score gap to second place
`fail_reason`	string	Why candidate failed (if applicable)

Fail reason values:

Value	Meaning
`below_gate`	Score below gating threshold
`low_coverage`	Insufficient marker coverage
`anti_markers`	Penalized by anti-marker expression
`outscored`	Another candidate had higher score
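
The trace can be replayed for a single cluster to recover the selected path and the candidates rejected along the way. The sketch below assumes that booleans are serialized as `True`/`False` and that `parent_label` is empty at the root step. Neither detail is stated on this page, so check a real file first. The rows are made up and `trace_cluster` is a hypothetical helper.

```python
import csv
import io

# Made-up rows using a subset of the documented columns.
# Assumption: booleans appear as "True"/"False"; parent_label is empty at step 0.
SAMPLE_CSV = """cluster_id,step_idx,parent_label,child_label,child_passed_gate,child_score,selected,fail_reason
5,0,,Immune Cells,True,2.1,True,
5,0,,Epithelium,False,0.2,False,below_gate
5,1,Immune Cells,Lymphoids,True,1.8,True,
5,2,Lymphoids,T Cells,True,1.1,False,outscored
5,2,Lymphoids,B Cells,True,1.6,True,
"""


def trace_cluster(handle, cluster_id):
    """Return (selected path, rejected candidates) for one cluster."""
    path = {}
    rejections = []
    for row in csv.DictReader(handle):
        if row["cluster_id"] != cluster_id:
            continue
        step = int(row["step_idx"])
        if row["selected"] == "True":
            path[step] = row["child_label"]
        else:
            rejections.append((step, row["child_label"], row["fail_reason"]))
    return [path[step] for step in sorted(path)], rejections


path, rejections = trace_cluster(io.StringIO(SAMPLE_CSV), "5")
print(" / ".join(path))
for step, label, reason in rejections:
    print(f"step {step}: {label} rejected ({reason})")
```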

figures/

QC visualization plots for assessing annotation quality:

File	Description
`cluster_size_distribution.png`	Histogram of cluster sizes
`confidence_distribution.png`	Histogram of confidence scores
`depth_distribution.png`	Histogram of hierarchy depths reached
`stop_reason_summary.png`	Bar chart of stop reasons
`root_composition.png`	Pie chart of root category distribution
`marker_coverage_heatmap.png`	Heatmap of marker coverage by cluster

Stage I Outputs

Stage I performs iterative refinement of the initial annotations, including subclustering of heterogeneous clusters and relabeling of misannotated ones.

refined.h5ad

Refined AnnData with updated annotations after one or more refinement iterations.

Added/modified .obs columns:

Column	Type	Description
`cluster_lvl1`	string	Refined cluster ID (may include subclusters like "3:0", "3:1")
`cell_type_lvl1`	string	Refined cell type label
`cell_type_lvl0`	string	Original Stage H label (preserved)
`refinement_action`	string	Action taken (SUBCLUSTER, RELABEL, KEEP)
`refinement_iteration`	int	Iteration when this cell was last modified

Added .uns keys:

Key	Description
`stage_refine_iteration_N`	Provenance for iteration N
`stage_i_params`	Refinement parameters used
`stage_i_summary`	Summary statistics of refinement

Provenance structure (stage_refine_iteration_N):

{
    "timestamp": "2024-01-15T10:30:00",
    "clusters_processed": 45,
    "subclustered": ["3", "7", "12"],
    "relabeled": {"5": "B Cells", "9": "Macrophages"},
    "skipped": ["1", "2", "4"],
    "parameters": {...}
}
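
Given a record shaped like the example above, a per-iteration summary is easy to derive. The sketch below also orders the iteration keys numerically, on the assumption that `N` is an integer suffix. Both helper names are hypothetical, and the `parameters` entry is omitted because its contents are not specified here.

```python
# Record copied from the example above, without the elided "parameters" entry.
provenance = {
    "timestamp": "2024-01-15T10:30:00",
    "clusters_processed": 45,
    "subclustered": ["3", "7", "12"],
    "relabeled": {"5": "B Cells", "9": "Macrophages"},
    "skipped": ["1", "2", "4"],
}

PREFIX = "stage_refine_iteration_"


def iteration_keys(uns_keys):
    """Return provenance keys sorted by iteration number (assumes integer N)."""
    keys = [key for key in uns_keys if key.startswith(PREFIX)]
    return sorted(keys, key=lambda key: int(key[len(PREFIX):]))


def summarize_iteration(record):
    """Count the clusters touched by each action in one provenance record."""
    return {
        "processed": record["clusters_processed"],
        "n_subclustered": len(record["subclustered"]),
        "n_relabeled": len(record["relabeled"]),
        "n_skipped": len(record["skipped"]),
    }


print(summarize_iteration(provenance))

# Made-up set of .uns keys to show the numeric ordering.
demo_keys = ["stage_refine_iteration_10", "stage_i_params", "stage_refine_iteration_2"]
print(iteration_keys(demo_keys))
```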

diagnostic_report.csv

Recommendations from diagnostic mode. Generated when Stage I is run with the `--diagnostic` flag.

Column	Type	Description
`cluster_id`	string	Cluster identifier
`current_label`	string	Current assignment
`recommendation`	string	SUBCLUSTER, RELABEL, or SKIP
`reason`	string	Why this recommendation
`criterion`	string	Which AutoPolicy criterion matched
`best_child`	string	Best child type (for RELABEL)
`child_score`	float	Best child's score
`heterogeneity_score`	float	Cluster heterogeneity measure
`n_cells`	int	Number of cells in cluster
`priority`	int	Suggested processing order

Recommendation values:

Value	Meaning
`SUBCLUSTER`	Cluster is heterogeneous, should be split
`RELABEL`	Cluster can be assigned to a more specific type
`SKIP`	No refinement needed
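
One way to turn the report into a work list is to drop the `SKIP` rows and order the rest by `priority`. The sketch below assumes that a lower `priority` value means "process earlier", which this page does not state. Ties are broken by cluster size as an arbitrary choice. The rows are made up and `actionable` is a hypothetical helper.

```python
import csv
import io

# Made-up rows using a subset of the documented columns.
SAMPLE_CSV = """cluster_id,current_label,recommendation,n_cells,priority
3,Immune Cells,SUBCLUSTER,5200,1
5,Lymphoids,RELABEL,800,2
1,Epithelium,SKIP,3000,3
7,Immune Cells,SUBCLUSTER,2100,1
"""


def actionable(handle):
    """Return non-SKIP rows sorted by priority, then by descending cluster size.

    Assumption: a lower priority value means the cluster should be handled first.
    """
    rows = [row for row in csv.DictReader(handle) if row["recommendation"] != "SKIP"]
    rows.sort(key=lambda row: (int(row["priority"]), -int(row["n_cells"])))
    return rows


work_list = actionable(io.StringIO(SAMPLE_CSV))
for row in work_list:
    print(row["cluster_id"], row["recommendation"], row["n_cells"])
```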

refinement_log.csv

Detailed log of all refinement actions taken across iterations.

Column	Type	Description
`iteration`	int	Refinement iteration number
`cluster_id`	string	Original cluster identifier
`action`	string	Action taken
`old_label`	string	Label before refinement
`new_label`	string	Label after refinement
`new_clusters`	string	Comma-separated list of new cluster IDs (for SUBCLUSTER)
`confidence_before`	float	Confidence before action
`confidence_after`	float	Confidence after action
`timestamp`	string	ISO timestamp
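
The log can be parsed into structured entries to see which actions improved confidence. In the sketch below, `new_clusters` is split on commas as described in the table, and an empty cell is assumed for actions other than SUBCLUSTER. The rows are made up, the `timestamp` column is omitted for brevity, and `parse_log` is a hypothetical helper.

```python
import csv
import io

# Made-up rows; new_clusters is quoted because it contains a comma.
SAMPLE_CSV = '''iteration,cluster_id,action,old_label,new_label,new_clusters,confidence_before,confidence_after
1,3,SUBCLUSTER,Immune Cells,Immune Cells,"3:0,3:1",0.25,0.5
1,5,RELABEL,Lymphoids,B Cells,,0.5,0.75
'''


def parse_log(handle):
    """Parse refinement log rows into dicts with typed fields."""
    entries = []
    for row in csv.DictReader(handle):
        # Assumption: new_clusters is empty unless the action was SUBCLUSTER.
        new_clusters = [cid for cid in row["new_clusters"].split(",") if cid]
        gain = float(row["confidence_after"]) - float(row["confidence_before"])
        entries.append({
            "iteration": int(row["iteration"]),
            "cluster_id": row["cluster_id"],
            "action": row["action"],
            "new_clusters": new_clusters,
            "confidence_gain": gain,
        })
    return entries


entries = parse_log(io.StringIO(SAMPLE_CSV))
for entry in entries:
    print(entry["cluster_id"], entry["action"], entry["new_clusters"], entry["confidence_gain"])
```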

File Formats

AnnData (.h5ad)

All .h5ad files use the standard AnnData HDF5 format:

Compatible with scanpy and anndata Python packages
Can be read with anndata.read_h5ad()
Stores expression matrices, cell/gene metadata, and analysis results

CSV Files

All CSV files use:

UTF-8 encoding
Comma delimiter
First row as header
Double-quote escaping for fields containing commas

JSON Fields

Some CSV columns contain JSON-encoded data:

Parse with standard JSON libraries
Always valid JSON (not Python dict syntax)
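
The CSV and JSON conventions above interact: a JSON value contains commas and double quotes, so it is stored as a quoted CSV field with the inner quotes doubled. The round trip below illustrates this with the standard `csv` and `json` modules. The `composition` value is invented, since its actual schema is not described on this page, and `parse_json_cell` is a hypothetical helper that treats an empty cell as missing.

```python
import csv
import io
import json

# Made-up value; the real structure of the composition column may differ.
composition = {"T Cells": 0.6, "B Cells": 0.4}

# Write one row the way a CSV writer would, with the JSON text as a single field.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["cluster_id", "composition"])
writer.writerow(["4", json.dumps(composition)])
writer.writerow(["5", ""])  # Assumption: non-ambiguous clusters leave the cell empty.
raw_text = buffer.getvalue()


def parse_json_cell(text):
    """Decode a JSON-encoded CSV cell, returning None for an empty cell."""
    return json.loads(text) if text.strip() else None


rows = list(csv.DictReader(io.StringIO(raw_text)))
decoded = [parse_json_cell(row["composition"]) for row in rows]
print(raw_text)
print(decoded)
```

Reading the field through a CSV parser first and only then passing it to `json.loads` avoids having to undo the quote doubling by hand.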
