Squidiff: Conditional Diffusion for Single-Cell Development and Perturbation Prediction

```mermaid
flowchart LR
    A["Baseline expr. x0"] --> B["Log-normalize, HVGs"]
    B --> C["Encoder: z_sem"]
    D["Metadata: type, time, drug"] --> C
    C --> E["Edit latent"]
    E --> F["Noise x_T"]
    F --> G["Denoiser"]
    G --> H["DDIM reverse"]
    H --> I["Predicted expr."]
```
Introduction
Single-cell RNA sequencing has made it possible to profile transcriptomes at massive scale, yet measuring responses across every combination of cell state, drug, genetic perturbation, and time point remains expensive. Squidiff is a conditional diffusion generative model aimed at predicting single-cell transcriptomic states along developmental trajectories and under perturbations (for example drugs, gene knockouts, or physical stimuli such as radiation). A central ambition is to learn continuous, high-resolution state landscapes, not only mean shifts between conditions.
This post synthesizes the model’s design, how it relates to earlier perturbation predictors, what the published evaluations suggest, and what to watch for when reproducing or extending the method.
Executive summary
Squidiff conditions a reverse diffusion process on a semantic latent \(z_{\mathrm{sem}}\) from a semantic encoder (described in the paper as VAE-derived). The motivation, borrowed from diffusion autoencoder ideas, is that the “low-level” diffusion latent \(x_T\) alone is often not semantically smooth for interpolation; adding \(z_{\mathrm{sem}}\) steers sampling toward biologically meaningful directions.
Reported results include stronger performance than scGen on an iPSC differentiation prediction task (for example \(R^2\) across days), and stronger performance than scGen and GEARS on a two-gene perturbation example (ZBTB25+PTPN12); in that setting, metrics are computed on 203 genes using Pearson correlation, \(R^2\), and MMD, with very small p-values in the paper’s reporting. For out-of-distribution stress testing, the authors describe a masking experiment in blood vessel organoids: some cell types are withheld from training under irradiation, yet the model still generates irradiated-state transcriptomes for those held-out types with high accuracy according to their analysis. Complementary readouts such as ELISA and immunofluorescence are used alongside transcriptomic analyses in the organoid work.
On the engineering side, the authors provide a public codebase, a PyPI package (Squidiff), and a reproducibility repository that claims checkpoints on figshare (access may vary by network or institution). One practical caveat: the GitHub LICENSE file has been reported to contain unresolved merge conflict markers, which is worth fixing for anyone redistributing or building on the code.
Problem framing and datasets
The paper frames single-cell perturbation response prediction as a bottleneck because exhaustively profiling transcriptomes across chemical, genetic, and physical conditions is costly. Squidiff is positioned as an in silico perturbation screen and as a way to model developmental “future” or “past” states under specified stimuli.
Evaluations and illustrations in the manuscript span several settings:
- Differentiation (iPSC toward mesendoderm / definitive endoderm), using a differentiating iPSC scRNA-seq dataset (Cuomo et al., 2020). Expression is log-normalized; highly variable genes (for example top 500) are used for computational tractability.
- Gene perturbation, with emphasis on predicting a two-gene combination (ZBTB25+PTPN12) compared against scGen and GEARS (the paper reports results on 203 genes for that comparison).
- Drug response in human tumor tissue (Zhao et al., 2021), with analysis across multiple cell types (for example UMAP structure and correlation of mean expression).
- Blood vessel organoids (BVO) for developmental interpolation and neutron/proton irradiation perturbations, with pathway-level interpretation and supporting ELISA and immunofluorescence where reported.
- Simulated data (including references to Splatter) to visualize aspects of the diffusion process; full simulation details may require the supplement.
Exact GEO accessions are not consistently recoverable from short excerpts alone; for a publication-grade bibliography, verify accession numbers against the final PDF and supplementary tables.
Model architecture
At a high level, Squidiff is a conditional diffusion model that predicts transcriptome vectors through a reverse diffusion process conditioned on a semantic embedding \(z_{\mathrm{sem}}\).
Reverse diffusion with semantic conditioning
The conditional reverse process is written as
\[ p_\theta(x_{0:T} \mid z_{\mathrm{sem}}) = p(x_T)\prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t, z_{\mathrm{sem}}), \]
with an \(\epsilon_\theta(x_t, t, z_{\mathrm{sem}})\)-parameterized deterministic update in the DDIM style to step from noise toward data-like vectors.
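Concretely, the deterministic DDIM update with \(\eta = 0\) can be sketched in a few lines of NumPy. This is an illustrative reconstruction of the standard DDIM step, not the repository's exact code:

```python
import numpy as np

def ddim_step(x_t, eps_hat, ab_t, ab_prev):
    """One deterministic DDIM update (eta = 0).

    x_t:     current noisy vector
    eps_hat: model noise prediction eps_theta(x_t, t, z_sem)
    ab_t:    alpha_bar at step t; ab_prev: alpha_bar at step t-1
    """
    # Predict x0 from the noise estimate, then re-noise to level t-1.
    x0_pred = (x_t - np.sqrt(1.0 - ab_t) * eps_hat) / np.sqrt(ab_t)
    return np.sqrt(ab_prev) * x0_pred + np.sqrt(1.0 - ab_prev) * eps_hat
```

A useful sanity check: if `eps_hat` equals the true noise used to corrupt \(x_0\), one step lands exactly on the \(t-1\) point of the forward trajectory.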
Why \(z_{\mathrm{sem}}\)? The paper argues that the diffusion latent \(x_T\) “typically does not contain high-level semantics,” so interpolation purely in \(x_T\) is not semantically meaningful. Conditioning on \(z_{\mathrm{sem}}\) is intended to make latent manipulation align with biological factors such as perturbation direction or developmental stage.
Implementation sketch (from the released code)
The public package uses an MLP denoiser: a main MLPModel maps gene expression vectors through stacked blocks with a time embedding (SiLU MLP). Conditioning is additive in each block: hidden activations are shifted by learned linear maps of the timestep embedding and, when present, of \(z_{\mathrm{sem}}\); this is closer to additive conditioning than to the cross-attention used in U-Net image diffusion models. A semantic encoder (EncoderMLPModel) outputs a latent (default size 60) and can concatenate label embeddings or drug-structure / dose features when enabled.
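The additive-conditioning pattern can be sketched as follows. Names, sizes, and initialization here are illustrative assumptions for exposition, not the package's actual classes:

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

class AdditiveCondBlock:
    """One MLP block: a linear layer whose hidden activations are
    shifted additively by the timestep embedding and z_sem (sketch)."""

    def __init__(self, d_in, d_hidden, d_temb, d_zsem, rng):
        self.W = rng.normal(scale=0.02, size=(d_in, d_hidden))
        self.b = np.zeros(d_hidden)
        self.A_t = rng.normal(scale=0.02, size=(d_temb, d_hidden))
        self.A_z = rng.normal(scale=0.02, size=(d_zsem, d_hidden))

    def __call__(self, x, t_emb, z_sem):
        h = x @ self.W + self.b
        # Additive conditioning: learned linear maps of the time
        # embedding and semantic latent shift the hidden state.
        h = h + t_emb @ self.A_t + z_sem @ self.A_z
        return silu(h)
```

Stacking several such blocks, with an output projection back to gene space, yields a denoiser of the shape described above.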
Defaults and the drug-structure hook
From the released code and training scripts, several defaults are easy to miss but matter for reproduction:
- Hidden width on the order of 2048 in the MLP denoiser, with semantic latent dimension 60 unless overridden.
- Diffusion steps commonly set to 1000 (standard DDPM-style schedule), with DDIM as the default sampler when `use_ddim=True`.
- Drug conditioning: when `use_drug_structure` is on, the encoder concatenates control features, drug dose, and a drug-structure vector; the codebase expects a fixed drug embedding width (on the order of 1024 dimensions). How that vector is produced (learned end-to-end versus a precomputed molecular fingerprint or neural fingerprint) should be checked against the paper, supplement, and config you run; treat “structure-aware generalization” claims as paper claims until you trace the exact pipeline.
- PyPI: a package named `Squidiff` has been listed on PyPI (for example version 1.0.8 with a February 21, 2025 release date in the index metadata; confirm the current version on PyPI before pinning dependencies). Python ≥3.8 is typical for such releases.
Training objective
Training follows the standard noise-prediction objective:
\[ \mathcal{L}_{\mathrm{MSE}} = \sum_{t=1}^{T} \mathbb{E}_{x_0,\epsilon_t}\left[ \left\lVert \epsilon_\theta(x_t, t, z_{\mathrm{sem}}) - \epsilon_t \right\rVert_2^2 \right], \quad \epsilon_t \sim \mathcal{N}(0, I). \]
When the implementation learns per-timestep variance (the “improved diffusion” pattern), the objective often adds a variational bound term \(\mathcal{L}_{\mathrm{VB}}\) alongside \(\mathcal{L}_{\mathrm{MSE}}\), so the total loss can be written schematically as \(\mathcal{L} = \mathcal{L}_{\mathrm{MSE}} + \mathcal{L}_{\mathrm{VB}}\) when learn_sigma or equivalent flags are enabled.
In code, \(x_t\) is sampled with a q_sample-style forward diffusion, the model is called on \((x_t, t)\) with access to the baseline \(x_0\) for encoding \(z_{\mathrm{sem}}\), and MSE is computed against the sampled noise \(\epsilon\) when using \(\epsilon\)-prediction; optional VB terms appear when variance is learned rather than fixed.
Conditioning and generation
Conceptually, the model learns perturbation or developmental “direction” in semantic space and uses that direction to generate intermediate or perturbed transcriptomes. For differentiation from day 0 to day 3, the paper describes extracting semantic latents \(z^{(1)}_{\mathrm{sem}}, z^{(2)}_{\mathrm{sem}}\) and using \(\Delta z_{\mathrm{sem}}\) as a direction vector.
In practice, the sampler defaults to DDIM (ddim_sample_loop when use_ddim=True). The API can implement interpolation via something like
\[ z_{\mathrm{interp}} = \mathrm{mean}(z_{\mathrm{origin}}) + (\mathrm{direction}) \cdot \mathrm{scale}, \]
optionally adding noise for per-cell diversity, then generating expression with DDIM conditioned on \(z_{\mathrm{interp}}\).
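Under the assumption that the direction vector is the difference of condition means (as in the day-0 to day-3 example), the latent edit reduces to a few lines. This is a sketch of the idea, not the package's API:

```python
import numpy as np

def edit_latent(z_origin, z_target_cond, scale=1.0, noise_std=0.0, rng=None):
    """Shift the mean origin latent along the condition-difference
    direction; optional Gaussian noise restores per-cell diversity."""
    direction = z_target_cond.mean(axis=0) - z_origin.mean(axis=0)
    z = z_origin.mean(axis=0) + direction * scale
    if noise_std > 0:
        rng = rng or np.random.default_rng()
        z = z + rng.normal(scale=noise_std, size=z.shape)
    return z
```

With `scale=1.0` and no noise this lands on the target-condition mean; intermediate scales trace the interpolation path used for developmental in-betweens.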
Why diffusion fits single-cell perturbation prediction
Forward and reverse process
In DDPM-style diffusion, a forward process gradually adds Gaussian noise:
\[ q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\, \sqrt{1-\beta_t}\, x_{t-1},\, \beta_t I\right), \]
so that
\[ x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I). \]
The model learns to approximate the reverse process by predicting noise (or equivalent reparameterizations). DDIM enables faster, near-deterministic sampling (\(\eta = 0\)), which matches Squidiff’s emphasis on controlled interpolation.
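The closed-form forward step follows directly from a chosen beta schedule via \(\bar{\alpha}_t = \prod_{s \le t}(1-\beta_s)\). A sketch with a standard linear schedule (the specific endpoints are conventional DDPM values, not taken from the repository):

```python
import numpy as np

def linear_schedule(T=1000, beta_min=1e-4, beta_max=0.02):
    """Linear beta schedule and the cumulative product alpha_bar_t."""
    betas = np.linspace(beta_min, beta_max, T)
    alpha_bar = np.cumprod(1.0 - betas)
    return betas, alpha_bar

def q_sample(x0, alpha_bar_t, eps):
    """Closed-form forward diffusion: jump straight to noise level t."""
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
```

Because \(\bar{\alpha}_t\) decreases monotonically toward zero, early timesteps stay close to the data while late timesteps approach pure Gaussian noise.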
From a score-matching perspective, learning to denoise at many noise levels acts as a strong regularizer in high-dimensional gene space—useful when populations are heterogeneous and responses are multi-modal.
Compared with VAEs, GANs, and flows
- VAE family (scGen, trVAE, CPA, scVIDR, and related work) optimizes an ELBO with a KL term that can encourage overly smooth latents; scGen uses VAE latent arithmetic for perturbation effects, which is powerful but still bounded by decoder capacity and the latent arithmetic assumption.
- GAN-based scRNA-seq generators (for example scPreGAN-style approaches) can suffer from mode collapse and training instability in high dimensions—problematic when perturbation prediction must preserve both bulk shifts and rare subpopulations.
- Diffusion avoids adversarial training, often achieves better mode coverage than GANs in practice, and uses a simple noise-prediction loss that scales with model capacity.
- Normalizing flows give exact likelihoods but impose architectural constraints that can be tight for very high-dimensional transcriptomes; diffusion trades slower sampling for flexibility, mitigated in part by DDIM.
Squidiff treats preprocessed expression as continuous vectors and applies Gaussian diffusion, with an MLP denoiser rather than a convolutional U-Net. The biological hook is semantic conditioning and semantic vector arithmetic for trajectories and perturbations.
Training and generation algorithms (skeleton)
The following pseudocode aligns the paper’s \(\epsilon\)-prediction objective with typical code paths (q_sample, DDIM sampling, and semantic latent editing).
Training (per minibatch):

```text
z_sem   <- Encode(x0, metadata)
t       <- sample timestep
eps     <- Normal(0, I)
x_t     <- sqrt(alpha_bar_t)*x0 + sqrt(1-alpha_bar_t)*eps
eps_hat <- Denoiser(x_t, t, z_sem)
loss    <- ||eps - eps_hat||^2
# if learn_sigma: loss <- loss + L_VB (variational bound on variance)
update parameters
```

Generation for a target condition:

```text
z_base   <- Encode(x0_base, metadata_base)
z_target <- mean(z_base) + direction * scale   (or interpolate / add noise)
x_T      ~  Normal(0, I)
for t from T down to 1:
    x_{t-1} <- DDIM_step(x_t, Denoiser(x_t, t, z_target), t)
return x_0_hat
```
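The generation skeleton can be made concrete with a toy run: an oracle denoiser that knows the target \(x_0\) lets us verify that the \(\eta = 0\) DDIM loop reproduces it exactly. This is a sanity check on the update algebra, not the trained model:

```python
import numpy as np

def ddim_loop(x_T, denoiser, alpha_bar):
    """Deterministic DDIM reverse pass from t = T-1 down to 0."""
    x = x_T
    for t in range(len(alpha_bar) - 1, -1, -1):
        ab_t = alpha_bar[t]
        ab_prev = alpha_bar[t - 1] if t > 0 else 1.0
        eps_hat = denoiser(x, t)
        x0_pred = (x - np.sqrt(1 - ab_t) * eps_hat) / np.sqrt(ab_t)
        x = np.sqrt(ab_prev) * x0_pred + np.sqrt(1 - ab_prev) * eps_hat
    return x

# Oracle: predicts exactly the noise separating x_t from the target.
target = np.linspace(-1, 1, 8)
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 50))
oracle = lambda x, t: (x - np.sqrt(alpha_bar[t]) * target) / np.sqrt(1 - alpha_bar[t])
x0_hat = ddim_loop(np.random.default_rng(0).normal(size=8), oracle, alpha_bar)
```

In a real run the oracle is replaced by the trained \(\epsilon_\theta(x_t, t, z_{\mathrm{target}})\), and fidelity depends on how well the model learned the conditional score.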
Evidence and empirical comparisons
Summary of reported evaluations
| Task / setting | Baselines | Metrics (reported) | Claimed outcome | Caveats |
|---|---|---|---|---|
| iPSC differentiation | scGen | Pearson, \(R^2\) across days; pseudotime Spearman; Jaccard on DE genes | Squidiff outperforms scGen on \(R^2\); trajectory alignment | Limited excerpt vs CPA / CellOT / biolord / foundation-model hybrids |
| Two-gene perturbation | scGen, GEARS | Pearson, \(R^2\), MMD on 203 genes | Strong wins; small p-values | One highlighted combination; check supplement for breadth |
| Tumor drug response | (not always explicit) | UMAP; mean-expression correlation | Separation and correlation | Baseline comparisons not always spelled out in captions |
| BVO development | (varies) | UMAP trajectories; markers | Continuous trajectories | Fewer quantitative trajectory metrics than iPSC case |
| Irradiation + masking | internal stress test | Pearson / \(R^2\); DE; ELISA; immunofluorescence | Held-out cell types under irradiation | Distinct from unseen perturbation identity in general |
Strong evidence versus suggestive claims
Relatively well supported in the main text and figures: metric-based improvements over scGen on differentiation; metric-based improvements over scGen and GEARS on the named two-gene case; and the irradiation masking experiment as a meaningful test of generalization across cell types within a perturbation setting.
Worth verifying in supplements, code, or independent runs: how drug structure embeddings are formed when use_drug_structure is enabled; how many gene combinations were evaluated beyond the showcase pair; and any headline claims about millions of cells or entirely new compounds against the actual training data and preprocessing.
Practical applicability, assumptions, and reproducibility
Good fit when you need distribution-aware generation of post-perturbation cells (sampling heterogeneity, not only the mean), continuous trajectories (development, dose, time), or modeling of non-linear gene programs where multi-scale denoising may help.
Modeling assumptions include Gaussian diffusion on normalized expression: zero-inflated counts are handled only indirectly through preprocessing, not through a discrete count likelihood.
Compute: diffusion training and sampling can be slower than single-pass VAE decoders. The paper notes sensitivity to noise schedules (the authors describe tuning how noise is added and selecting a strong configuration) and to hardware; training on a single NVIDIA H100 (80 GB) in at least one reported setup signals non-trivial GPU requirements.
Generalization has multiple axes: unseen cell type (partially tested via masking), unseen perturbation identity, unseen combination, unseen dose or time. The blog reader should map claims carefully onto these axes.
Reproducibility checklist
- Code on GitHub; install via `pip install Squidiff` (as of early 2025 the index showed 1.0.8 released 2025-02-21; re-check PyPI before pinning).
- Example training commands with optional drug-structure features; diffusion steps often default around 1000; DDIM sampling commonly default-on.
- Checkpoints described as on figshare in the reproducibility materials—confirm download before relying on them.
- Resolve LICENSE merge markers if you need clear redistribution terms.
Suggested ablations and local validation
When adapting Squidiff beyond the paper’s splits, a practical validation loop includes:
- Define the OOD axis (unseen cell type, unseen perturbation identity, unseen combination, unseen dose or time) and mirror the paper’s masking or holdout strategy before adding new stress tests.
- Metrics beyond means: DE gene ranking stability (top-\(k\) overlap), shifts in subtype proportions, and calibration using multiple stochastic samples from the same conditional model.
- Ablate semantic conditioning: remove or shuffle \(z_{\mathrm{sem}}\) and check whether developmental or perturbation interpolation collapses—this tests the core diffusion-autoencoder motivation.
- Conditioning geometry: compare pure additive shifts in hidden states to FiLM-style feature-wise scaling and shifting; additive bias is simple, but scaling can sometimes improve generalization on shifted covariates.
- Sampler choice: contrast DDIM with \(\eta = 0\) versus stochastic sampling (\(\eta > 0\)) to quantify the diversity–fidelity tradeoff for synthetic cells.
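For the "metrics beyond means" point, top-\(k\) overlap between differential-expression rankings of real and generated cells takes only a few lines. The gene score below is a placeholder mean-difference standing in for a proper DE test:

```python
import numpy as np

def topk_jaccard(scores_real, scores_gen, k=50):
    """Jaccard overlap of the top-k genes ranked by two score vectors."""
    top_real = set(np.argsort(scores_real)[::-1][:k])
    top_gen = set(np.argsort(scores_gen)[::-1][:k])
    return len(top_real & top_gen) / len(top_real | top_gen)

def mean_diff_scores(expr_pert, expr_ctrl):
    """Placeholder DE score: absolute mean shift per gene."""
    return np.abs(expr_pert.mean(axis=0) - expr_ctrl.mean(axis=0))
```

A score near 1 means the generated cells reproduce the real perturbation's top DE genes; tracking this across stochastic samples also probes calibration.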
Potential extensions
Directions that sit naturally “next to” Squidiff in the research landscape:
- Discrete or tokenized diffusion over genes or tokens to better respect count sparsity than Gaussian vectors on normalized expression alone.
- Classifier-free or multi-condition guidance to strengthen control when multiple perturbations or dosages must be composed at generation time.
- Hybrid with foundation models: use a pretrained single-cell embedding model to produce \(z_{\mathrm{sem}}\), keeping Squidiff (or a lighter denoiser) as the conditional generator on top of frozen or fine-tuned representations.
- Joint uncertainty: diffusion ensembles or explicit variance heads to complement point predictions in screening settings.
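The classifier-free guidance idea mentioned above combines conditional and unconditional noise predictions at sampling time. A one-line sketch of the standard CFG formula, offered as an extension direction rather than something Squidiff currently implements:

```python
import numpy as np

def cfg_eps(eps_uncond, eps_cond, w):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward (and past, for w > 1) the conditional one."""
    return eps_uncond + w * (eps_cond - eps_uncond)
```

At \(w = 1\) this recovers ordinary conditional sampling; \(w > 1\) sharpens adherence to the condition, which could strengthen control when composing multiple perturbations or dosages.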
Landscape: from VAE latent shifts to diffusion
The timeline diagram type is fragile across Mermaid versions bundled with static site generators; the same information is shown below as a flowchart (widely supported in Mermaid 9–10).
```mermaid
flowchart TB
    y2019["2019 · scGen"]
    y2020["2020 · trVAE"]
    y2021["2021 · CPA"]
    y2023a["2023 · CellOT"]
    y2023b["2023 · scVIDR"]
    y2024a["2024 · biolord"]
    y2024b["2024 · GEARS"]
    y2025["2025 · Squidiff"]
    y2026["2026+ · diffusion-style models"]
    y2019 --> y2020 --> y2021 --> y2023a
    y2023a --> y2023b --> y2024a --> y2024b --> y2025 --> y2026
```
A compact narrative:
- The VAE era emphasized interpretable latent shifts and compositionality but could smooth or mean-bias generations under strong shift.
- Optimal transport (CellOT) emphasized mapping whole distributions and subpopulations.
- GEARS brought gene–gene knowledge graphs to multigene perturbation prediction.
- Diffusion (Squidiff and successors) emphasizes iterative denoising for fidelity and coverage, with semantic conditioning and DDIM-friendly interpolation.
| Model | Era | Core idea | Strengths | Caveats |
|---|---|---|---|---|
| scGen | 2019 | VAE + latent arithmetic | Simple, widely used baseline | Latent geometry assumptions; decoder limits |
| trVAE | ~2020 | Conditional VAE + MMD | Distribution alignment for transfer | Hyperparameters; still VAE-backed |
| CPA | 2021+ | Compositional perturbation autoencoder | Drugs, doses, combinations | Encoder–decoder bottleneck |
| CellOT | 2023 | Neural optimal transport | Heterogeneous populations | OT assumptions in complex mixtures |
| scVIDR | 2023 | VAE dose–response | Dose trajectories | Flexibility under multi-perturb regimes |
| biolord | 2024 | Disentangled generative shifts | Inaccessible states | Disentanglement fragility |
| GEARS | 2024 | GNN + gene graph | Multigene; unseen combos | Depends on graph quality; not focused on long developmental trajectories |
| Squidiff | 2025 | Conditional diffusion + semantic latent | Continuous landscapes; strong reported metrics | Cost; schedule sensitivity; count model not explicit |
```mermaid
sequenceDiagram
    autonumber
    participant X0 as Baseline cells
    participant Enc as Semantic encoder
    participant Z as Semantic latent
    participant Sam as DDIM sampler
    participant Den as Denoiser
    participant Out as Predicted expr.
    X0->>Enc: Encode with metadata
    Enc-->>Z: z_base
    Note over Z: Set z_target for condition
    Sam->>Sam: Draw initial noise x_T
    loop Each reverse step
        Sam->>Den: Noise prediction at step t
        Den-->>Sam: Epsilon hat
        Sam->>Sam: Apply DDIM update
    end
    Sam-->>Out: Final x0 hat
```
Conclusion
Squidiff connects two threads that matter for generative genomics: diffusion as a flexible generative prior over high-dimensional expression vectors, and semantic latents to make interpolation and perturbation directions align with biology rather than with arbitrary noise-space geometry. The published comparisons to scGen and GEARS, together with the organoid irradiation masking experiment, make a serious case—but the field should still press on which generalization axes are covered, how drug and structure conditioning is instantiated, and whether findings hold across many held-out perturbations and datasets.
For practitioners, the open codebase and package lower the barrier to try the model on new screens; treating the LICENSE hygiene and checkpoint availability as first-class engineering checks will pay off before building production pipelines on top.
This article is a synthesis for educational purposes and does not substitute for reading the primary paper and supplements. Metrics and claims are described as reported in the manuscript; independent reproduction is encouraged.