CellTreeQM: Reconstructing Cell Lineage Trees from Phenotypic Features via Metric Learning

Abstract

How a single fertilized cell gives rise to a complex array of specialized cell types in development is a central question in biology. The cells replicate to generate cell lineages and acquire differentiated characteristics through poorly understood molecular processes. A key approach to studying developmental processes is to infer the tree graph of cell lineage histories, which provides an analytical framework for dissecting individual cells' molecular decisions during replication and differentiation (i.e., acquisition of specialized traits). Although genetically engineered lineage-tracing methods have advanced the field, they are either infeasible or ethically constrained in many organisms. By contrast, modern single-cell technologies can measure high-content molecular profiles (e.g., transcriptomes) in a wide range of biological systems. Here, we introduce CellTreeQM, a novel deep learning method based on transformer architectures that learns an embedding space with geometric properties optimized for tree-graph inference. By formulating the lineage reconstruction problem as tree-metric learning, we systematically explore weakly supervised training settings at different levels of information and present the Cell Lineage Reconstruction Benchmark to facilitate comprehensive evaluation. This benchmark includes (1) synthetic data modeled via Brownian motion with independent noise and spurious signals; (2) lineage-resolved single-cell RNA sequencing datasets. Experimental results show that CellTreeQM recovers lineage structures with minimal supervision and limited data, offering a scalable framework for uncovering cell lineage relationships. To our knowledge, this is the first method to cast cell lineage inference explicitly as a metric learning task, paving the way for future computational models aimed at uncovering the molecular dynamics of cell lineage.

Overview of the CellTreeQM workflow

When the full tree is known as prior knowledge, this is a supervised setting. When no prior information about the tree is available, the setting is unsupervised. In between, we highlight two weakly supervised settings: the High-level Partitioning Setting, where only high-level groupings are available, and the Partially Leaf-labeled Setting, where topological labels are provided for a subset of leaves.

Directional Weight Score

Four-Point Condition and Quartet Loss

Buneman’s theorem (1971) states that, in a tree metric, for any four leaves A, B, C, D the three distance sums \(S_1=d_{AB}+d_{CD}\), \(S_2=d_{AC}+d_{BD}\), \(S_3=d_{AD}+d_{BC}\) satisfy the four-point condition: the two largest sums are equal and are at least as large as the third (right diagram). This condition both detects additivity and determines the unrooted quartet topology: the pairing with the strictly smallest sum groups the leaves that lie on the same side of the internal edge.
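The four-point condition can be checked directly on a distance matrix. The sketch below is illustrative only and is not taken from the CellTreeQM code: the function name `quartet_topology`, the tolerance `tol`, and the toy matrix `D_example` are assumptions made for this example. It also uses the standard fact that, for an additive metric, half the gap between the largest and smallest sums equals the internal edge length.

```python
import numpy as np

# Pairings in the same order as S_1, S_2, S_3 in the text.
PAIRINGS = ("AB|CD", "AC|BD", "AD|BC")

def quartet_topology(D, a, b, c, d, tol=1e-9):
    """Return (topology, is_additive, internal_edge) for leaves a, b, c, d.

    D is a symmetric pairwise-distance matrix. The topology is the pairing
    whose distance sum is smallest. `internal_edge` is only meaningful
    when `is_additive` is True.
    """
    sums = np.array([
        D[a, b] + D[c, d],  # S_1
        D[a, c] + D[b, d],  # S_2
        D[a, d] + D[b, c],  # S_3
    ])
    smallest, middle, largest = np.argsort(sums)
    # Four-point condition: the two largest sums coincide.
    is_additive = bool(abs(sums[largest] - sums[middle]) <= tol)
    internal_edge = float(sums[largest] - sums[smallest]) / 2.0
    return PAIRINGS[smallest], is_additive, internal_edge

# Toy tree (hypothetical): A and B are siblings with pendant edges 1 and 2,
# C and D are siblings with pendant edges 4 and 5, internal edge length 3.
D_example = np.array([
    [0.0,  3.0,  8.0,  9.0],
    [3.0,  0.0,  9.0, 10.0],
    [8.0,  9.0,  0.0,  9.0],
    [9.0, 10.0,  9.0,  0.0],
])

topology, is_additive, internal_edge = quartet_topology(D_example, 0, 1, 2, 3)
print(topology, is_additive, internal_edge)
```

For the toy matrix the sums are 12, 18, and 18, so the pairing AB|CD is selected and the recovered internal edge length is 3.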

Sizes of model trees

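In CellTreeQM, we turn the theorem into a loss. Here the sums are relabeled so that \(S_1\) and \(S_2\) are the two that should be largest and \(S_3\) is the one that should be smallest. The term \(L_\text{close}=|S_1-S_2|\) forces the two largest sums to coincide, while \(L_\text{push}=\bigl[S_3-\tfrac{S_1+S_2}{2}+m_0\bigr]_+\), where \(m_0\) is a margin and \([x]_+=\max(x,0)\), keeps the smallest sum sufficiently lower. Minimizing \(L_\text{additivity}=L_\text{close}+L_\text{push}\) encourages the latent space to respect tree geometry even when the full lineage is unknown.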
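As a minimal sketch of the two terms, the NumPy function below evaluates the loss for one quartet or a batch of quartets. It assumes the three sums have already been ordered so that the first two are the ones expected to be largest and the third the one expected to be smallest. The function name `quartet_additivity_loss` and the default margin of 1.0 are assumptions for illustration, not values from the paper.

```python
import numpy as np

def quartet_additivity_loss(s1, s2, s3, margin=1.0):
    """Evaluate L_close + L_push elementwise.

    s1, s2: the two distance sums that should be largest (and equal).
    s3:     the distance sum that should be smallest.
    margin: the hinge margin m_0 (hypothetical default).
    """
    s1, s2, s3 = (np.asarray(x, dtype=float) for x in (s1, s2, s3))
    l_close = np.abs(s1 - s2)                                # pull the top two together
    l_push = np.maximum(s3 - 0.5 * (s1 + s2) + margin, 0.0)  # hinge on the gap
    return l_close + l_push

# A quartet that already satisfies the condition with a gap above the margin.
print(quartet_additivity_loss(18.0, 18.0, 12.0))

# A quartet that violates it: the top two differ and the third is not smaller.
print(quartet_additivity_loss(18.0, 16.0, 17.0))

# Batched use: average over many sampled quartets.
batch = quartet_additivity_loss([18.0, 18.0], [18.0, 16.0], [12.0, 17.0])
print(batch.mean())
```

In a training loop the same expression would be written with a differentiable array library so that gradients flow back to the embedding. The NumPy version here only shows the arithmetic.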

This figure illustrates the geometric intuition of the loss. When additivity is violated, the structure of a quartet can be imagined as a “box” with an extra edge. The \(L_\text{close}\) term encourages the two largest distance sums to become more similar, thereby reducing the imbalance that creates the box-like distortion. In effect, it ensures that the box is not “fat”. Meanwhile, the \(L_\text{push}\) term increases the gap between the smallest sum and the average of the two largest, effectively “widening the bridge.” This widening makes the tree model more robust to noise and to deviations from the ideal tree structure.

High-Level Partitioning Results

In the high-level partitioning setting, we assume that only coarse clade groupings are known (e.g., the first few root-level splits), without full tree labels. CellTreeQM uses this partial supervision to infer quartet relationships. Our model significantly outperforms contrastive baselines (Triplet, Quadruplet) at every level of prior knowledge. Notably, CellTreeQM recovers known tree structures with near-perfect accuracy and generalizes well to unseen quartets.
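One way coarse groupings can yield quartet labels is sketched below. It rests on an assumption that may not match the paper's exact procedure: every known group is a clade. Under that assumption, any quartet with exactly two leaves in one group has a determined topology, because the edge above that clade separates the pair from the other two leaves. The function name `resolved_quartets` and the leaf and clade labels are hypothetical.

```python
from collections import Counter
from itertools import combinations

def resolved_quartets(clade_of):
    """List quartets whose topology follows from a flat clade partition.

    clade_of maps each leaf name to its clade label. Each result is a tuple
    (pair_inside, pair_outside), meaning the topology pair_inside | pair_outside.
    Quartets with 4, 3+1, or 1+1+1+1 leaves per clade stay unresolved.
    """
    resolved = []
    for quartet in combinations(sorted(clade_of), 4):
        counts = Counter(clade_of[leaf] for leaf in quartet)
        pair_clades = [g for g, n in counts.items() if n == 2]
        if not pair_clades:
            continue  # the partition alone does not determine this quartet
        g = pair_clades[0]
        inside = tuple(leaf for leaf in quartet if clade_of[leaf] == g)
        outside = tuple(leaf for leaf in quartet if clade_of[leaf] != g)
        resolved.append((inside, outside))
    return resolved

# Hypothetical partition of five leaves into three clades.
example_partition = {"a1": "X", "a2": "X", "b1": "Y", "b2": "Y", "c1": "Z"}
for inside, outside in resolved_quartets(example_partition):
    print(inside, "|", outside)
```

Enumerating all quartets grows as the fourth power of the number of leaves, so a practical pipeline would sample quartets rather than list them exhaustively.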