Squidiff: Conditional Diffusion for Single-Cell Development and Perturbation Prediction

```mermaid
flowchart LR
    A["Baseline expr. x0"] --> B["Log-normalize, HVGs"]
    B --> C["Encoder: z_sem"]
    D["Metadata: type, time, drug"] --> C
    C --> E["Edit latent"]
    E --> F["Noise x_T"]
    F --> G["Denoiser"]
    G --> H["DDIM reverse"]
    H --> I["Predicted expr."]
```
Introduction
Single-cell RNA sequencing has made it possible to profile transcriptomes at massive scale, yet measuring responses across every combination of cell state, drug, genetic perturbation, and time point remains expensive. Squidiff is a conditional diffusion generative model aimed at predicting single-cell transcriptomic states along developmental trajectories and under perturbations (for example drugs, gene knockouts, or physical stimuli such as radiation). A central ambition is to learn continuous, high-resolution state landscapes, not only mean shifts between conditions.
This post synthesizes the model’s design, how it relates to earlier perturbation predictors, what the published evaluations suggest, and what to watch for when reproducing or extending the method.
Executive summary
Squidiff conditions a reverse diffusion process on a semantic latent \(z_{\mathrm{sem}}\) from a semantic encoder (described in the paper as VAE-derived). The motivation, borrowed from diffusion autoencoder ideas, is that the “low-level” diffusion latent \(x_T\) alone is often not semantically smooth for interpolation; adding \(z_{\mathrm{sem}}\) steers sampling toward biologically meaningful directions.
Reported results include stronger performance than scGen on an iPSC differentiation prediction task (for example \(R^2\) across days), and stronger performance than scGen and GEARS on a two-gene perturbation example (ZBTB25+PTPN12); in that setting, metrics are computed on 203 genes using Pearson correlation, \(R^2\), and MMD, with very small p-values in the paper’s reporting. For out-of-distribution stress testing, the authors describe a masking experiment in blood vessel organoids: some cell types are withheld from training under irradiation, yet the model still generates irradiated-state transcriptomes for those held-out types with high accuracy according to their analysis. Complementary readouts such as ELISA and immunofluorescence are used alongside transcriptomic analyses in the organoid work.
On the engineering side, the authors provide a public codebase, a PyPI package (Squidiff), and a reproducibility repository that claims checkpoints on figshare (access may vary by network or institution). One practical caveat: the GitHub LICENSE file has been reported to contain unresolved merge conflict markers, which is worth fixing for anyone redistributing or building on the code.
Problem framing and datasets
The paper frames single-cell perturbation response prediction as a bottleneck because exhaustively profiling transcriptomes across chemical, genetic, and physical conditions is costly. Squidiff is positioned as an in silico perturbation screen and as a way to model developmental “future” or “past” states under specified stimuli.
Evaluations and illustrations in the manuscript span several settings:
- Differentiation (iPSC toward mesendoderm / definitive endoderm), using a differentiating iPSC scRNA-seq dataset (Cuomo et al., 2020). Expression is log-normalized; highly variable genes (for example top 500) are used for computational tractability.
- Gene perturbation, with emphasis on predicting a two-gene combination (ZBTB25+PTPN12) compared against scGen and GEARS (the paper reports results on 203 genes for that comparison).
- Drug response in human tumor tissue (Zhao et al., 2021), with analysis across multiple cell types (for example UMAP structure and correlation of mean expression).
- Blood vessel organoids (BVO) for developmental interpolation and neutron/proton irradiation perturbations, with pathway-level interpretation and supporting ELISA and immunofluorescence where reported.
- Simulated data (including references to Splatter) to visualize aspects of the diffusion process; full simulation details may require the supplement.
Exact GEO accessions are not consistently recoverable from short excerpts alone; for a publication-grade bibliography, verify accession numbers against the final PDF and supplementary tables.
Model architecture
At a high level, Squidiff is a conditional diffusion model that predicts transcriptome vectors through a reverse diffusion process conditioned on a semantic embedding \(z_{\mathrm{sem}}\).
Reverse diffusion with semantic conditioning
The conditional reverse process is written as
\[ p_\theta(x_{0:T} \mid z_{\mathrm{sem}}) = p(x_T)\prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t, z_{\mathrm{sem}}), \]
with an \(\epsilon_\theta(x_t, t, z_{\mathrm{sem}})\)-parameterized deterministic update in the DDIM style to step from noise toward data-like vectors.
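Concretely, the deterministic DDIM update with \(\eta = 0\) can be sketched in a few lines of NumPy. This is an illustrative reconstruction of the standard DDIM step, not the repository's exact code:

```python
import numpy as np

def ddim_step(x_t, eps_hat, ab_t, ab_prev):
    """One deterministic DDIM update (eta = 0).

    x_t:     current noisy vector
    eps_hat: model noise prediction eps_theta(x_t, t, z_sem)
    ab_t:    alpha_bar at step t; ab_prev: alpha_bar at step t-1
    """
    # Predict x0 from the noise estimate, then re-noise to level t-1.
    x0_pred = (x_t - np.sqrt(1.0 - ab_t) * eps_hat) / np.sqrt(ab_t)
    return np.sqrt(ab_prev) * x0_pred + np.sqrt(1.0 - ab_prev) * eps_hat
```

A useful sanity check: if `eps_hat` equals the true noise used to corrupt \(x_0\), one step lands exactly on the \(t-1\) point of the forward trajectory.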
Why \(z_{\mathrm{sem}}\)? The paper argues that the diffusion latent \(x_T\) “typically does not contain high-level semantics,” so interpolation purely in \(x_T\) is not semantically meaningful. Conditioning on \(z_{\mathrm{sem}}\) is intended to make latent manipulation align with biological factors such as perturbation direction or developmental stage.
Implementation sketch (from the released code)
The public package uses an MLP denoiser: a main MLPModel maps gene expression vectors through stacked blocks with a time embedding (SiLU MLP). Conditioning is additive in each block: hidden activations are shifted by learned linear maps of the timestep embedding and, when present, of \(z_{\mathrm{sem}}\); this is closer to additive conditioning than to the cross-attention used in U-Net image diffusion models. A semantic encoder (EncoderMLPModel) outputs a latent (default size 60) and can concatenate label embeddings or drug-structure / dose features when enabled.
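The additive-conditioning pattern can be sketched as follows. Names, sizes, and initialization here are illustrative assumptions for exposition, not the package's actual classes:

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

class AdditiveCondBlock:
    """One MLP block: a linear layer whose hidden activations are
    shifted additively by the timestep embedding and z_sem (sketch)."""

    def __init__(self, d_in, d_hidden, d_temb, d_zsem, rng):
        self.W = rng.normal(scale=0.02, size=(d_in, d_hidden))
        self.b = np.zeros(d_hidden)
        self.A_t = rng.normal(scale=0.02, size=(d_temb, d_hidden))
        self.A_z = rng.normal(scale=0.02, size=(d_zsem, d_hidden))

    def __call__(self, x, t_emb, z_sem):
        h = x @ self.W + self.b
        # Additive conditioning: learned linear maps of the time
        # embedding and semantic latent shift the hidden state.
        h = h + t_emb @ self.A_t + z_sem @ self.A_z
        return silu(h)
```

Stacking several such blocks, with an output projection back to gene space, yields a denoiser of the shape described above.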
Defaults and the drug-structure hook
From the released code and training scripts, several defaults are easy to miss but matter for reproduction:
- Hidden width on the order of 2048 in the MLP denoiser, with semantic latent dimension 60 unless overridden.
- Diffusion steps commonly set to 1000 (standard DDPM-style schedule), with DDIM as the default sampler when `use_ddim=True`.
- Drug conditioning: when `use_drug_structure` is on, the encoder concatenates control features, drug dose, and a drug-structure vector; the codebase expects a fixed drug embedding width (on the order of 1024 dimensions). How that vector is produced (learned end-to-end versus a precomputed molecular fingerprint or neural fingerprint) should be checked against the paper, supplement, and config you run; treat “structure-aware generalization” claims as paper claims until you trace the exact pipeline.
- PyPI: a package named `Squidiff` has been listed on PyPI (for example version 1.0.8 with a February 21, 2025 release date in the index metadata; confirm the current version on PyPI before pinning dependencies). Python ≥3.8 is typical for such releases.
Training objective
Training follows the standard noise-prediction objective:
\[ \mathcal{L}_{\mathrm{MSE}} = \sum_{t=1}^{T} \mathbb{E}_{x_0,\epsilon_t}\left[ \left\lVert \epsilon_\theta(x_t, t, z_{\mathrm{sem}}) - \epsilon_t \right\rVert_2^2 \right], \quad \epsilon_t \sim \mathcal{N}(0, I). \]
When the implementation learns per-timestep variance (the “improved diffusion” pattern), the objective often adds a variational bound term \(\mathcal{L}_{\mathrm{VB}}\) alongside \(\mathcal{L}_{\mathrm{MSE}}\), so the total loss can be written schematically as \(\mathcal{L} = \mathcal{L}_{\mathrm{MSE}} + \mathcal{L}_{\mathrm{VB}}\) when learn_sigma or equivalent flags are enabled.
In code, \(x_t\) is sampled with a q_sample-style forward diffusion, the model is called on \((x_t, t)\) with access to the baseline \(x_0\) for encoding \(z_{\mathrm{sem}}\), and MSE is computed against the sampled noise \(\epsilon\) when using \(\epsilon\)-prediction; optional VB terms appear when variance is learned rather than fixed.
Conditioning and generation
Conceptually, the model learns perturbation or developmental “direction” in semantic space and uses that direction to generate intermediate or perturbed transcriptomes. For differentiation from day 0 to day 3, the paper describes extracting semantic latents \(z^{(1)}_{\mathrm{sem}}, z^{(2)}_{\mathrm{sem}}\) and using \(\Delta z_{\mathrm{sem}}\) as a direction vector.
In practice, the sampler defaults to DDIM (ddim_sample_loop when use_ddim=True). The API can implement interpolation via something like
\[ z_{\mathrm{interp}} = \mathrm{mean}(z_{\mathrm{origin}}) + (\mathrm{direction}) \cdot \mathrm{scale}, \]
optionally adding noise for per-cell diversity, then generating expression with DDIM conditioned on \(z_{\mathrm{interp}}\).
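Under the assumption that the direction vector is the difference of condition means (as in the day-0 to day-3 example), the latent edit reduces to a few lines. This is a sketch of the idea, not the package's API:

```python
import numpy as np

def edit_latent(z_origin, z_target_cond, scale=1.0, noise_std=0.0, rng=None):
    """Shift the mean origin latent along the condition-difference
    direction; optional Gaussian noise restores per-cell diversity."""
    direction = z_target_cond.mean(axis=0) - z_origin.mean(axis=0)
    z = z_origin.mean(axis=0) + direction * scale
    if noise_std > 0:
        rng = rng or np.random.default_rng()
        z = z + rng.normal(scale=noise_std, size=z.shape)
    return z
```

With `scale=1.0` and no noise this lands on the target-condition mean; intermediate scales trace the interpolation path used for developmental in-betweens.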
Why diffusion fits single-cell perturbation prediction
Forward and reverse process
In DDPM-style diffusion, a forward process gradually adds Gaussian noise:
\[ q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\, \sqrt{1-\beta_t}\, x_{t-1},\, \beta_t I\right), \]
so that
\[ x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I). \]
The model learns to approximate the reverse process by predicting noise (or equivalent reparameterizations). DDIM enables faster, near-deterministic sampling (\(\eta = 0\)), which matches Squidiff’s emphasis on controlled interpolation.
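The closed-form forward step follows directly from a chosen beta schedule via \(\bar{\alpha}_t = \prod_{s \le t}(1-\beta_s)\). A sketch with a standard linear schedule (the specific endpoints are conventional DDPM values, not taken from the repository):

```python
import numpy as np

def linear_schedule(T=1000, beta_min=1e-4, beta_max=0.02):
    """Linear beta schedule and the cumulative product alpha_bar_t."""
    betas = np.linspace(beta_min, beta_max, T)
    alpha_bar = np.cumprod(1.0 - betas)
    return betas, alpha_bar

def q_sample(x0, alpha_bar_t, eps):
    """Closed-form forward diffusion: jump straight to noise level t."""
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
```

Because \(\bar{\alpha}_t\) decreases monotonically toward zero, early timesteps stay close to the data while late timesteps approach pure Gaussian noise.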
From a score-matching perspective, learning to denoise at many noise levels acts as a strong regularizer in high-dimensional gene space—useful when populations are heterogeneous and responses are multi-modal.
Compared with VAEs, GANs, and flows
- VAE family (scGen, trVAE, CPA, scVIDR, and related work) optimizes an ELBO with a KL term that can encourage overly smooth latents; scGen uses VAE latent arithmetic for perturbation effects, which is powerful but still bounded by decoder capacity and the latent arithmetic assumption.
- GAN-based scRNA-seq generators (for example scPreGAN-style approaches) can suffer from mode collapse and training instability in high dimensions—problematic when perturbation prediction must preserve both bulk shifts and rare subpopulations.
- Diffusion avoids adversarial training, often achieves better mode coverage than GANs in practice, and uses a simple noise-prediction loss that scales with model capacity.
- Normalizing flows give exact likelihoods but impose architectural constraints that can be tight for very high-dimensional transcriptomes; diffusion trades slower sampling for flexibility, mitigated in part by DDIM.
Squidiff treats preprocessed expression as continuous vectors and applies Gaussian diffusion, with an MLP denoiser rather than a convolutional U-Net. The biological hook is semantic conditioning and semantic vector arithmetic for trajectories and perturbations.
Training and generation algorithms (skeleton)
The following pseudocode aligns the paper’s \(\epsilon\)-prediction objective with typical code paths (q_sample, DDIM sampling, and semantic latent editing).
Training (per minibatch):

```text
z_sem   <- Encode(x0, metadata)
t       <- sample timestep
eps     <- Normal(0, I)
x_t     <- sqrt(alpha_bar_t)*x0 + sqrt(1-alpha_bar_t)*eps
eps_hat <- Denoiser(x_t, t, z_sem)
loss    <- ||eps - eps_hat||^2
# if learn_sigma: loss <- loss + L_VB (variational bound on variance)
update parameters
```

Generation for a target condition:

```text
z_base   <- Encode(x0_base, metadata_base)
z_target <- mean(z_base) + direction * scale   (or interpolate / add noise)
x_T      ~  Normal(0, I)
for t from T down to 1:
    x_{t-1} <- DDIM_step(x_t, Denoiser(x_t, t, z_target), t)
return x_0_hat
```
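The generation skeleton can be made concrete with a toy run: an oracle denoiser that knows the target \(x_0\) lets us verify that the \(\eta = 0\) DDIM loop reproduces it exactly. This is a sanity check on the update algebra, not the trained model:

```python
import numpy as np

def ddim_loop(x_T, denoiser, alpha_bar):
    """Deterministic DDIM reverse pass from t = T-1 down to 0."""
    x = x_T
    for t in range(len(alpha_bar) - 1, -1, -1):
        ab_t = alpha_bar[t]
        ab_prev = alpha_bar[t - 1] if t > 0 else 1.0
        eps_hat = denoiser(x, t)
        x0_pred = (x - np.sqrt(1 - ab_t) * eps_hat) / np.sqrt(ab_t)
        x = np.sqrt(ab_prev) * x0_pred + np.sqrt(1 - ab_prev) * eps_hat
    return x

# Oracle: predicts exactly the noise separating x_t from the target.
target = np.linspace(-1, 1, 8)
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 50))
oracle = lambda x, t: (x - np.sqrt(alpha_bar[t]) * target) / np.sqrt(1 - alpha_bar[t])
x0_hat = ddim_loop(np.random.default_rng(0).normal(size=8), oracle, alpha_bar)
```

In a real run the oracle is replaced by the trained \(\epsilon_\theta(x_t, t, z_{\mathrm{target}})\), and fidelity depends on how well the model learned the conditional score.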
Evidence and empirical comparisons
Summary of reported evaluations
| Task / setting | Baselines | Metrics (reported) | Claimed outcome | Caveats |
|---|---|---|---|---|
| iPSC differentiation | scGen | Pearson, \(R^2\) across days; pseudotime Spearman; Jaccard on DE genes | Squidiff outperforms scGen on \(R^2\); trajectory alignment | Limited excerpt vs CPA / CellOT / biolord / foundation-model hybrids |
| Two-gene perturbation | scGen, GEARS | Pearson, \(R^2\), MMD on 203 genes | Strong wins; small p-values | One highlighted combination; check supplement for breadth |
| Tumor drug response | (not always explicit) | UMAP; mean-expression correlation | Separation and correlation | Baseline comparisons not always spelled out in captions |
| BVO development | (varies) | UMAP trajectories; markers | Continuous trajectories | Fewer quantitative trajectory metrics than iPSC case |
| Irradiation + masking | internal stress test | Pearson / \(R^2\); DE; ELISA; immunofluorescence | Held-out cell types under irradiation | Distinct from unseen perturbation identity in general |
Strong evidence versus suggestive claims
Relatively well supported in the main text and figures: metric-based improvements over scGen on differentiation; metric-based improvements over scGen and GEARS on the named two-gene case; and the irradiation masking experiment as a meaningful test of generalization across cell types within a perturbation setting.
Worth verifying in supplements, code, or independent runs: how drug structure embeddings are formed when use_drug_structure is enabled; how many gene combinations were evaluated beyond the showcase pair; and any headline claims about millions of cells or entirely new compounds against the actual training data and preprocessing.
Practical applicability, assumptions, and reproducibility
Good fit when you need distribution-aware generation of post-perturbation cells (sampling heterogeneity, not only the mean), continuous trajectories (development, dose, time), or modeling of non-linear gene programs where multi-scale denoising may help.
Modeling assumptions include Gaussian diffusion on normalized expression: zero-inflated counts are handled only indirectly through preprocessing, not through a discrete count likelihood.
Compute: diffusion training and sampling can be slower than single-pass VAE decoders. The paper notes sensitivity to noise schedules (the authors describe tuning how noise is added and selecting a strong configuration) and to hardware; training on a single NVIDIA H100 (80 GB) in at least one reported setup signals non-trivial GPU requirements.
Generalization has multiple axes: unseen cell type (partially tested via masking), unseen perturbation identity, unseen combination, unseen dose or time. The blog reader should map claims carefully onto these axes.
Reproducibility checklist
- Code on GitHub; install via `pip install Squidiff` (as of early 2025 the index showed 1.0.8 released 2025-02-21; re-check PyPI before pinning).
- Example training commands with optional drug-structure features; diffusion steps often default around 1000; DDIM sampling commonly default-on.
- Checkpoints described as on figshare in the reproducibility materials—confirm download before relying on them.
- Resolve LICENSE merge markers if you need clear redistribution terms.
Suggested ablations and local validation
When adapting Squidiff beyond the paper’s splits, a practical validation loop includes:
- Define the OOD axis (unseen cell type, unseen perturbation identity, unseen combination, unseen dose or time) and mirror the paper’s masking or holdout strategy before adding new stress tests.
- Metrics beyond means: DE gene ranking stability (top-\(k\) overlap), shifts in subtype proportions, and calibration using multiple stochastic samples from the same conditional model.
- Ablate semantic conditioning: remove or shuffle \(z_{\mathrm{sem}}\) and check whether developmental or perturbation interpolation collapses—this tests the core diffusion-autoencoder motivation.
- Conditioning geometry: compare pure additive shifts in hidden states to FiLM-style feature-wise scaling and shifting; additive bias is simple, but scaling can sometimes improve generalization on shifted covariates.
- Sampler choice: contrast DDIM with \(\eta = 0\) versus stochastic sampling (\(\eta > 0\)) to quantify the diversity–fidelity tradeoff for synthetic cells.
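For the "metrics beyond means" point, top-\(k\) overlap between differential-expression rankings of real and generated cells takes only a few lines. The gene score below is a placeholder mean-difference standing in for a proper DE test:

```python
import numpy as np

def topk_jaccard(scores_real, scores_gen, k=50):
    """Jaccard overlap of the top-k genes ranked by two score vectors."""
    top_real = set(np.argsort(scores_real)[::-1][:k])
    top_gen = set(np.argsort(scores_gen)[::-1][:k])
    return len(top_real & top_gen) / len(top_real | top_gen)

def mean_diff_scores(expr_pert, expr_ctrl):
    """Placeholder DE score: absolute mean shift per gene."""
    return np.abs(expr_pert.mean(axis=0) - expr_ctrl.mean(axis=0))
```

A score near 1 means the generated cells reproduce the real perturbation's top DE genes; tracking this across stochastic samples also probes calibration.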
Potential extensions
Directions that sit naturally “next to” Squidiff in the research landscape:
- Discrete or tokenized diffusion over genes or tokens to better respect count sparsity than Gaussian vectors on normalized expression alone.
- Classifier-free or multi-condition guidance to strengthen control when multiple perturbations or dosages must be composed at generation time.
- Hybrid with foundation models: use a pretrained single-cell embedding model to produce \(z_{\mathrm{sem}}\), keeping Squidiff (or a lighter denoiser) as the conditional generator on top of frozen or fine-tuned representations.
- Joint uncertainty: diffusion ensembles or explicit variance heads to complement point predictions in screening settings.
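The classifier-free guidance idea mentioned above combines conditional and unconditional noise predictions at sampling time. A one-line sketch of the standard CFG formula, offered as an extension direction rather than something Squidiff currently implements:

```python
import numpy as np

def cfg_eps(eps_uncond, eps_cond, w):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward (and past, for w > 1) the conditional one."""
    return eps_uncond + w * (eps_cond - eps_uncond)
```

At \(w = 1\) this recovers ordinary conditional sampling; \(w > 1\) sharpens adherence to the condition, which could strengthen control when composing multiple perturbations or dosages.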
Landscape: from VAE latent shifts to diffusion
The timeline diagram type is fragile across Mermaid versions bundled with static site generators; the same information is shown below as a flowchart (widely supported in Mermaid 9–10).
```mermaid
flowchart TB
    y2019["2019 · scGen"]
    y2020["2020 · trVAE"]
    y2021["2021 · CPA"]
    y2023a["2023 · CellOT"]
    y2023b["2023 · scVIDR"]
    y2024a["2024 · biolord"]
    y2024b["2024 · GEARS"]
    y2025["2025 · Squidiff"]
    y2026["2026+ · diffusion-style models"]
    y2019 --> y2020 --> y2021 --> y2023a
    y2023a --> y2023b --> y2024a --> y2024b --> y2025 --> y2026
```
A compact narrative:
- The VAE era emphasized interpretable latent shifts and compositionality but could smooth or mean-bias generations under strong shift.
- Optimal transport (CellOT) emphasized mapping whole distributions and subpopulations.
- GEARS brought gene–gene knowledge graphs to multigene perturbation prediction.
- Diffusion (Squidiff and successors) emphasizes iterative denoising for fidelity and coverage, with semantic conditioning and DDIM-friendly interpolation.
| Model | Era | Core idea | Strengths | Caveats |
|---|---|---|---|---|
| scGen | 2019 | VAE + latent arithmetic | Simple, widely used baseline | Latent geometry assumptions; decoder limits |
| trVAE | ~2020 | Conditional VAE + MMD | Distribution alignment for transfer | Hyperparameters; still VAE-backed |
| CPA | 2021+ | Compositional perturbation autoencoder | Drugs, doses, combinations | Encoder–decoder bottleneck |
| CellOT | 2023 | Neural optimal transport | Heterogeneous populations | OT assumptions in complex mixtures |
| scVIDR | 2023 | VAE dose–response | Dose trajectories | Flexibility under multi-perturb regimes |
| biolord | 2024 | Disentangled generative shifts | Inaccessible states | Disentanglement fragility |
| GEARS | 2024 | GNN + gene graph | Multigene; unseen combos | Depends on graph quality; not focused on long developmental trajectories |
| Squidiff | 2025 | Conditional diffusion + semantic latent | Continuous landscapes; strong reported metrics | Cost; schedule sensitivity; count model not explicit |
```mermaid
sequenceDiagram
    autonumber
    participant X0 as Baseline cells
    participant Enc as Semantic encoder
    participant Z as Semantic latent
    participant Sam as DDIM sampler
    participant Den as Denoiser
    participant Out as Predicted expr.
    X0->>Enc: Encode with metadata
    Enc-->>Z: z_base
    Note over Z: Set z_target for condition
    Sam->>Sam: Draw initial noise x_T
    loop Each reverse step
        Sam->>Den: Noise prediction at step t
        Den-->>Sam: Epsilon hat
        Sam->>Sam: Apply DDIM update
    end
    Sam-->>Out: Final x0 hat
```
Conclusion
Squidiff connects two threads that matter for generative genomics: diffusion as a flexible generative prior over high-dimensional expression vectors, and semantic latents to make interpolation and perturbation directions align with biology rather than with arbitrary noise-space geometry. The published comparisons to scGen and GEARS, together with the organoid irradiation masking experiment, make a serious case—but the field should still press on which generalization axes are covered, how drug and structure conditioning is instantiated, and whether findings hold across many held-out perturbations and datasets.
For practitioners, the open codebase and package lower the barrier to try the model on new screens; treating the LICENSE hygiene and checkpoint availability as first-class engineering checks will pay off before building production pipelines on top.
This article is a synthesis for educational purposes and does not substitute for reading the primary paper and supplements. Metrics and claims are described as reported in the manuscript; independent reproduction is encouraged.