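Generative Model Distillation: From Multi-Step Sampling to Real-Time Generation

The overall problem: compressing a slow teacher into a fast student

A strong teacher generative model produces high-quality samples but is slow to sample from. Distillation compresses the teacher's distribution, trajectories, direction field, or preference behavior into a student that needs a lower number of function evaluations (NFE).

0. One-sentence intuition

Classification distillation usually matches logits. Generative model distillation first has to decide what to match: sampling transitions, ODE endpoints, the final sample cloud, adversarial features, or the state distribution the student encounters during its own rollout.

1. Unified objective

If the teacher sampler is written as \(\Phi_T:(z,c)\mapsto x\) and the student sampler as \(\Phi_\theta:(z,c)\mapsto x\), the abstract objective of generative model distillation can be written as:

\[ \min_\theta \; \mathcal{D}\!\left(p_\theta(x|c),p_T(x|c)\right) \]

Methods differ in which supervisable object they choose as a proxy when \(\mathcal{D}\) cannot be computed directly.

2. Visualization: distillation target selector

Selecting a different target shows the supervision arrow originating from logits, transitions, endpoints, distributions, rewards, or the self-rollout distribution.

Notation

First, bring distributions, samplers, scores, critics, NFE, and video context under a single notation.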

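Symbol	Meaning	Computable interface
\(p_{\mathrm{data}}(x|c)\)	True conditional data distribution.	Observable only through samples.
\(p_T(x|c),p_\theta(x|c)\)	Teacher and student generative distributions.	Usually available only implicitly, through sampling.
\(\Phi_T,\Phi_\theta\)	Teacher / student sampler or transition map.	Callable functions along the sampling trajectory.
\(s_T,s_\phi\)	Teacher score and fake score / fake critic.	Estimate the direction (score) of the noised marginal distributions.
\(G_\theta,D_\psi\)	One-step/few-step student generator and adversarial critic.	Both are updated during training; inference usually keeps only \(G_\theta\).
\(c,\tau,\mathrm{NFE}\)	Condition, sampling trajectory, number of function evaluations.	Core variables in the engineering trade-off between latency and quality.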

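Comprehension check

Can explain why the density of \(p_\theta(x|c)\) usually cannot be written down directly.
Can distinguish matching transitions, matching endpoints, and matching the final distribution.
Can explain why the training objective also has to change as NFE decreases.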

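Why generative model distillation differs from classification distillation

Classic KD learns the teacher's output probabilities; generative distillation has to learn a computation graph that can be sampled from.

1. One-sentence intuition

In classification KD the teacher provides \(q_T(y|x)\); a generative teacher instead provides a distribution, trajectories, a score field, conditional control behavior, and preference signals.

2. Mathematical objects and training objective

\[ q_T(y|x)=\operatorname{softmax}(z_T/T_{\mathrm{temp}}),\qquad q_\theta(y|x)=\operatorname{softmax}(z_\theta/T_{\mathrm{temp}}),\qquad \mathcal{L}_{KD}=T_{\mathrm{temp}}^2D_{KL}(q_T\Vert q_\theta) \]

Here the subscript \(T\) marks the teacher, \(z_T\) and \(z_\theta\) are the teacher and student logits, and \(T_{\mathrm{temp}}\) is the softmax temperature, written separately so it is not confused with the teacher subscript.

Generative distillation is closer to \(\Phi_T:z\mapsto x\Rightarrow\Phi_\theta:z\mapsto x\). If the endpoint is tied to the same \(z\), it is trajectory distillation; if only the sample clouds have to agree, it is distribution matching.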
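
As a point of comparison for the generative methods that follow, the classic KD loss can be sketched in a few lines. This is a minimal NumPy illustration, not a training recipe; the function names `softmax` and `kd_loss` and the example logits are assumptions made here for illustration.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """temperature^2 * KL(q_teacher || q_student), averaged over the batch."""
    q_t = softmax(teacher_logits, temperature)
    q_s = softmax(student_logits, temperature)
    kl = np.sum(q_t * (np.log(q_t) - np.log(q_s)), axis=-1)
    return float(temperature ** 2 * kl.mean())

teacher_logits = np.array([[2.0, 0.5, -1.0]])
loss_same = kd_loss(teacher_logits, teacher_logits)             # student equals teacher
loss_diff = kd_loss(teacher_logits, np.array([[0.0, 0.0, 0.0]]))  # uniform student
```

The loss is zero only when the student reproduces the teacher's softened probabilities, which is exactly the kind of directly computable target that a generative teacher does not offer.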

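3. Training and inference procedure

4. Characteristics and issues

Strengths

KD provides the basic paradigm of "matching the teacher's computable behavior".

Issues

The computable behavior of a generative model is not a logits vector but an entire sampling computation graph.

5. Comprehension check

Can write the classic KD KL loss.
Can explain why generative distillation must choose a matching target.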

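Trajectory distillation: many teacher steps, few student steps

Progressive Distillation trains one student step to reproduce two teacher steps, halving NFE in each round.

1. One-sentence intuition

The teacher takes many small steps; the student learns to merge several steps into one.

2. Mathematical objects and derivation

\[ x_{t-2\Delta}^{S}=\Phi_\theta(x_t,t,t-2\Delta,c) \approx \Phi_T(\Phi_T(x_t,t,t-\Delta,c),t-\Delta,t-2\Delta,c) \]

The student does not learn the clean image itself; it learns the local composite transition of the teacher sampler. Over multiple rounds, with each round's student serving as the next round's teacher, the step count goes \(K\rightarrow K/2\rightarrow K/4\rightarrow\cdots\).

3. Training and inference procedure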
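
One round of this procedure, followed by the distilled inference loop, can be sketched on a toy problem. The sketch assumes a linear toy ODE whose Euler transition does not depend on `t`, and a one-parameter student `x -> a * x`; `teacher_step`, `fit_student_scale`, and `sample` are hypothetical names, not part of any published implementation.

```python
import numpy as np

def teacher_step(x, t, s):
    """Toy teacher transition: one Euler step of dx/dt = x from time t to s < t."""
    return x + (s - t) * x

def two_step_target(x, t, delta):
    """Compose two teacher steps: t -> t - delta -> t - 2 * delta."""
    mid = teacher_step(x, t, t - delta)
    return teacher_step(mid, t - delta, t - 2 * delta)

def fit_student_scale(x, target):
    """Least-squares fit of the one-parameter student x -> a * x."""
    return float(np.dot(x, target) / np.dot(x, x))

def sample(step_fn, x, num_steps):
    """Generic sampling loop over a uniform schedule from t = 1 to t = 0."""
    ts = np.linspace(1.0, 0.0, num_steps + 1)
    for t, s in zip(ts[:-1], ts[1:]):
        x = step_fn(x, t, s)
    return x

rng = np.random.default_rng(0)
x_start = rng.normal(size=256)
K = 8                      # teacher NFE
delta = 1.0 / K

# "Training": regress the student's single step onto the teacher's two steps.
a = fit_student_scale(x_start, two_step_target(x_start, 1.0, delta))

def student_step(x, t, s):
    # Valid only for the doubled step size 2 * delta used during distillation.
    return a * x

x_teacher = sample(teacher_step, x_start, K)        # 8 function evaluations
x_student = sample(student_step, x_start, K // 2)   # 4 function evaluations
```

In this toy the student matches the teacher exactly because the composite transition is linear; with a real network the regression leaves a residual, which is where the accumulated error across rounds comes from.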

4. Visualization: trajectory compression lab

Sliders control teacher curvature, student step size, and distillation round, showing the resulting NFE reduction and toy error.

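5. Characteristics and issues

Strengths

Clear supervision, straightforward implementation, and reusable teacher transitions.

Failure modes

Errors accumulate across distillation rounds, and the student inherits the biases of the teacher path.

6. Comprehension check

Can explain why two teacher steps can serve as the target for one student step.
Can write the inference loop for a distilled schedule.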

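Guided Distillation: folding CFG into a single student

CFG sampling needs two forward passes, one conditional and one unconditional; guided distillation lets the student output the CFG-combined direction directly.

1. One-sentence intuition

First make \(\epsilon_{cfg}\) a single student output, then distill the guided student down to few steps.

2. Mathematical objects and training objective

\[ \epsilon_\theta(x_t,t,c)\approx \epsilon_{uncond}^{T}(x_t,t)+w(\epsilon_{cond}^{T}(x_t,t,c)-\epsilon_{uncond}^{T}(x_t,t)) \]

3. Training and inference procedure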
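
The first stage, regressing a single student output onto the CFG-combined teacher prediction, can be sketched as follows. The arrays stand in for network outputs, and `cfg_target` and `guided_distill_loss` are hypothetical helper names; a squared-error regression is assumed for illustration.

```python
import numpy as np

def cfg_target(eps_cond, eps_uncond, w):
    """CFG combination of two teacher predictions at guidance scale w."""
    return eps_uncond + w * (eps_cond - eps_uncond)

def guided_distill_loss(student_eps, eps_cond, eps_uncond, w):
    """Squared error between one student output and the CFG-combined target."""
    target = cfg_target(eps_cond, eps_uncond, w)
    return float(np.mean((student_eps - target) ** 2))

eps_cond = np.array([1.0, 2.0])     # stand-in for the conditional teacher pass
eps_uncond = np.array([0.0, 1.0])   # stand-in for the unconditional teacher pass
target = cfg_target(eps_cond, eps_uncond, w=3.0)
loss_at_target = guided_distill_loss(target, eps_cond, eps_uncond, w=3.0)
```

Training needs both teacher passes to build the target, but inference with the distilled student needs only one forward pass per step. Because the target depends on `w`, a student trained at one scale has no reason to be accurate at another.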

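4. Pitfalls worth explaining

The student is often adapted only to the guidance scale used during training.
One-step models tend to oversaturate at high guidance scales.
CFG distillation removes the cost of the double forward pass but does not by itself fix trajectory compression error.

5. Comprehension check

Can state the boundary between CFG and guided distillation.
Can explain why generalization across guidance scales is limited.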

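Consistency distillation: learning the trajectory endpoint invariant

Consistency Models / LCM learn a function that maps different noisy points on the same ODE trajectory to the same endpoint.

1. One-sentence intuition

Different noisy points on the same probability flow ODE trajectory should return to the same clean sample.

2. Mathematical objects and loss

\[ \mathcal{L}_{CM} = \mathbb{E}_{x_t,t,s,c} d\!\left(f_\theta(x_t,t,c),\operatorname{sg}[f_{\theta^-}(x_s,s,c)]\right) \]

Here \(s<t\), \(x_s\) is obtained by integrating the teacher ODE from \(x_t\) to \(s\), \(d\) is a distance function, \(\operatorname{sg}\) denotes stop-gradient, and \(\theta^-\) is the EMA target network.

3. Training and inference procedure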
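
The loss, the EMA target update, and few-step inference can be sketched on a toy trajectory family. The sketch assumes a point-mass data distribution at `MU` (so a noisy point is `MU + t * z`), a one-parameter consistency function, and squared error for \(d\); it ignores the boundary-condition parameterization that real consistency models use. All names are illustrative.

```python
import numpy as np

MU = 2.0  # toy data: a point mass at MU, so a noisy point is x_t = MU + t * z

def teacher_ode_step(x_t, t, s):
    """One Euler step of the toy PF-ODE dx/dt = (x - MU) / t, from t down to s."""
    return x_t + (s - t) * (x_t - MU) / t

def f(theta, x, t):
    """Toy consistency function (t is unused here); theta = 1 returns the endpoint MU."""
    return x - theta * (x - MU)

def consistency_loss(theta, theta_ema, x_t, t, s):
    x_s = teacher_ode_step(x_t, t, s)
    target = f(theta_ema, x_s, s)  # sg[...]: treated as a constant target
    return float(np.mean((f(theta, x_t, t) - target) ** 2))

def ema_update(theta_ema, theta, decay=0.95):
    """Move the target parameters slowly toward the online parameters."""
    return decay * theta_ema + (1.0 - decay) * theta

def few_step_sample(theta, timesteps, rng, n):
    """Predict the endpoint, re-noise to the next level, and predict again."""
    x0 = f(theta, timesteps[0] * rng.normal(size=n), timesteps[0])
    for t in timesteps[1:]:
        x_t = x0 + t * rng.normal(size=n)
        x0 = f(theta, x_t, t)
    return x0

rng = np.random.default_rng(0)
x_t = MU + 1.0 * rng.normal(size=512)
loss_good = consistency_loss(1.0, 1.0, x_t, t=1.0, s=0.8)
loss_bad = consistency_loss(0.5, 0.5, x_t, t=1.0, s=0.8)
samples = few_step_sample(1.0, [5.0, 1.0, 0.2], rng, n=16)
```

One-step inference is the first line of `few_step_sample` alone; each extra entry in `timesteps` adds one re-noise and one more function evaluation.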

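4. Visualization: consistency endpoint lab

Multiple noisy points on the same trajectory all point to the same endpoint. The higher the noise level, the harder the consistency error is to control.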

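5. Characteristics and issues

Strengths

One-step sampling comes naturally, and additional steps can progressively improve quality.

Failure modes

Endpoint consistency is not the same as full distribution matching; fine detail can still be weak at low step counts.

6. Comprehension check

Can write the consistency loss with an EMA target.
Can distinguish one-step from few-step consistency inference.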

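DMD: from trajectory matching to distribution matching

DMD does not require the student to reproduce each teacher path; it only requires the student's sample distribution to resemble the teacher's.

1. One-sentence intuition

Trajectory matching ties the endpoint to the same \(z\); DMD matches noised marginal distributions, so different paths are allowed to reach the same distribution.

2. Mathematical objects and the score gap

\[ \nabla_\theta \mathcal{L}_{DMD} \approx \mathbb{E}_{z,t,\epsilon,c} \left[ w(t)(s_\phi(x_t,t,c)-s_T(x_t,t,c)) \frac{\partial x_t}{\partial \theta} \right] \]

\(s_T\) is the frozen teacher score, and \(s_\phi\) is the fake score that tracks the student's noised distribution. Here \(x_t\) is the student sample \(G_\theta(z,c)\) noised to level \(t\) with noise \(\epsilon\).

3. Training and inference procedure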
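
The generator update can be checked on a one-dimensional Gaussian toy where both scores are known in closed form. The sketch assumes a teacher distribution \(N(\mu_T,1)\), a student generator `theta + z`, a single noise level, and \(w(t)=1\); in practice the fake score is a learned network updated alongside the generator, not the exact expression used here.

```python
import numpy as np

MU_T = 3.0    # toy teacher distribution: N(MU_T, 1)
SIGMA = 1.0   # single noise level; w(t) is taken to be 1

def teacher_score(x_t):
    """Score of the noised teacher distribution N(MU_T, 1 + SIGMA^2)."""
    return -(x_t - MU_T) / (1.0 + SIGMA ** 2)

def fake_score(x_t, theta):
    """Exact score of the noised student distribution N(theta, 1 + SIGMA^2).
    Assumption: exact here, whereas real DMD learns it with a network."""
    return -(x_t - theta) / (1.0 + SIGMA ** 2)

def dmd_grad(theta, rng, batch=1024):
    z = rng.normal(size=batch)
    eps = rng.normal(size=batch)
    x_t = (theta + z) + SIGMA * eps   # G_theta(z) = theta + z, then add noise
    dx_dtheta = 1.0                   # derivative of x_t with respect to theta
    gap = fake_score(x_t, theta) - teacher_score(x_t)
    return float(np.mean(gap * dx_dtheta))

rng = np.random.default_rng(0)
theta = -2.0
for _ in range(50):
    theta -= 0.5 * dmd_grad(theta, rng)   # plain gradient descent on theta
```

In this toy the score gap reduces to `(theta - MU_T) / (1 + SIGMA**2)`, so the update pulls the student mean toward the teacher mean without ever pairing a student sample with a teacher sample.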

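4. Visualization: DMD vs trajectory matching

Trajectory mode ties the student to teacher paths; distribution mode only looks at sample-cloud coverage. Fake critic accuracy and GAN weight change the sharpness and the coverage risk.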

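5. Issues worth explaining

DMD's reverse-KL-like behavior can make it more mode-seeking.
An inaccurate fake score gives the generator biased gradients.
Adding paired regression back in ties the student to the teacher path again.

6. Comprehension check

Can explain what \(s_\phi-s_T\) means.
Can explain why DMD is not the same as trajectory matching.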

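DMD2: more stable, more scalable distribution matching distillation

DMD2 keeps the advantages of distribution matching while stabilizing training with four changes: removing the regression loss, a two-time-scale fake critic update, a GAN loss, and multi-step backward simulation.

1. One-sentence intuition

DMD2 no longer relies on large-scale paired teacher samples; instead it updates the fake critic and the student on the input distribution the generator currently produces.

Remove the regression loss
Two-time-scale fake critic update
GAN loss with real data
Multi-step backward simulation

2. Total loss

\[ \mathcal{L}_{DMD2}=\mathcal{L}_{DMD}+\lambda_{adv}\mathcal{L}_{GAN} \]

When training a multi-step student, the generator inputs seen at inference time must also be simulated. This avoids the mismatch where training sees noised real data while inference sees the previously generated sample.

3. Training and inference procedure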
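
Backward simulation for a multi-step student can be sketched as a short loop: to build the training input for step `k`, run the current student through steps `0..k-1` and re-noise its output, instead of noising a real image. The sketch assumes a simple additive noise convention `x0 + t * eps` and a stand-in generator that always predicts the constant `MU`; both are illustrative choices, not the configuration of the paper.

```python
import numpy as np

MU = 1.5  # hypothetical clean value predicted by the toy student

def toy_generator(x_t, t):
    """Stand-in for the few-step student G_theta: noisy input -> clean estimate."""
    return np.full_like(x_t, MU)

def backward_simulated_input(generator, timesteps, k, rng, shape):
    """Generator input at step k, produced by the student's own earlier steps."""
    x_t = timesteps[0] * rng.normal(size=shape)      # start from pure noise
    for j in range(k):
        x0_hat = generator(x_t, timesteps[j])        # student prediction at step j
        x_t = x0_hat + timesteps[j + 1] * rng.normal(size=shape)  # re-noise
    return x_t

timesteps = [10.0, 4.0, 1.0, 0.3]   # assumed 4-step schedule, high to low noise
rng = np.random.default_rng(0)
x_in_step2 = backward_simulated_input(toy_generator, timesteps, k=2, rng=rng, shape=(20000,))
x_in_step0 = backward_simulated_input(toy_generator, timesteps, k=0, rng=rng, shape=(20000,))
```

The input at step 2 is centered on what the student itself produced, which is the distribution it will face at inference time; the DMD and GAN losses are then applied to the student's output on that input.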

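4. Characteristics and issues

Strengths

Not tightly bound to the teacher path; supports both one-step and multi-step students.

Issues

Coordinating the update schedules of the fake critic, the GAN head, and backward simulation makes the system complex.

5. Comprehension check

Can list the four improvements in DMD2.
Can explain why backward simulation serves the multi-step student.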

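ADD: Score Distillation + Adversarial Loss

The score teacher keeps the direction broadly correct, while the GAN discriminator restores the high-frequency detail and realism that low-step generation loses most easily.

1. One-sentence intuition

ADD is not a plain GAN; the diffusion teacher still provides a strong distribution prior, and the adversarial loss mainly supplies a perceptual manifold signal.

2. Training objective

\[ \mathcal{L}_{ADD}=\lambda_{score}\mathcal{L}_{score}+\lambda_{adv}\mathcal{L}_{adv} \]

3. Training and inference procedure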
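
The weighted objective above can be written out as a small helper. The sketch assumes a hinge-style generator term, `-mean(D(fake))`, and treats the score distillation loss as an already computed scalar; the weights and discriminator outputs are made-up illustration values.

```python
import numpy as np

def hinge_generator_loss(disc_logits_fake):
    """Generator side of a hinge GAN loss: push discriminator outputs on fakes up."""
    return float(-np.mean(disc_logits_fake))

def add_loss(score_loss, disc_logits_fake, lam_score=1.0, lam_adv=0.5):
    """lam_score * L_score + lam_adv * L_adv for the student generator."""
    return lam_score * score_loss + lam_adv * hinge_generator_loss(disc_logits_fake)

disc_logits_fake = np.array([-1.0, 1.0, -2.0])   # stand-in discriminator outputs
total = add_loss(score_loss=0.8, disc_logits_fake=disc_logits_fake,
                 lam_score=2.0, lam_adv=0.5)
```

Raising `lam_adv` shifts the balance toward sharpness and realism at the cost of the stability risks of adversarial training; raising `lam_score` keeps the student closer to the teacher's distribution prior.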

4. Visualization: loss balance lab

Adjust the score, DMD, adversarial, and reward weights and observe toy sharpness, coverage, and temporal consistency.

5. Comprehension check

Can explain what the score loss and the GAN loss each address in ADD.
Can name the stability risks of the adversarial loss.

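LADD: Latent Adversarial Diffusion Distillation

LADD applies adversarial feedback to the generative features of the latent diffusion teacher, which reduces the cost of decoding to pixel space.

1. One-sentence intuition

ADD goes latent → RGB → external discriminator; LADD goes latent → teacher feature blocks → discriminator heads.

2. Key changes

Moves from pixel-space adversarial feedback to latent-space adversarial feedback.
Uses the generative features of the pretrained latent diffusion teacher rather than only external, fixed discriminative features.
Re-noises both the generated latent and the target latent, then extracts teacher block features at the chosen noise level.
The noise level controls the type of feedback: high noise favors global structure, low noise favors local texture.
The main training loop avoids frequent decoding to RGB, which suits high-resolution and multi-aspect-ratio training.

\[ h_S^{(\ell)}=\operatorname{Feat}_T^{(\ell)}(z_{S,t},t,c),\qquad \mathcal{L}_{G,LADD}=-\sum_\ell \mathbb{E}\log D_\psi^{(\ell)}(h_S^{(\ell)},t,c) \]

Here \(z_{S,t}\) is the student latent re-noised to level \(t\), and \(\ell\) indexes the teacher blocks whose features feed the discriminator heads \(D_\psi^{(\ell)}\).

3. Training and inference procedure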
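
The generator-side loss can be sketched with toy stand-ins for the teacher blocks and the discriminator heads. Everything here is an assumption made for illustration: the blocks are random linear maps with `tanh`, the heads are random linear probes that ignore `t` and `c`, and re-noising uses the linear interpolation `(1 - t) * z + t * eps`.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8
# Hypothetical frozen teacher blocks and discriminator heads (random linear maps).
teacher_blocks = [rng.normal(size=(DIM, DIM)) / np.sqrt(DIM) for _ in range(3)]
disc_heads = [rng.normal(size=DIM) / np.sqrt(DIM) for _ in range(3)]

def renoise(z, t, rng):
    """Re-noise a latent to level t in [0, 1] (assumed interpolation form)."""
    return (1.0 - t) * z + t * rng.normal(size=z.shape)

def teacher_features(z_t, t):
    """Collect one feature tensor per teacher block; t enters as a crude bias."""
    feats, h = [], z_t
    for w in teacher_blocks:
        h = np.tanh(h @ w + t)
        feats.append(h)
    return feats

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def ladd_generator_loss(z_student, t, rng):
    """-sum over blocks of E[log D(h)], evaluated on re-noised student latents."""
    z_t = renoise(z_student, t, rng)
    loss = 0.0
    for h, head in zip(teacher_features(z_t, t), disc_heads):
        d = sigmoid(h @ head)            # per-sample "real" probability
        loss += -np.mean(np.log(d))
    return float(loss)

z_student = rng.normal(size=(16, DIM))   # batch of student latents
loss_high_noise = ladd_generator_loss(z_student, t=0.9, rng=rng)
loss_low_noise = ladd_generator_loss(z_student, t=0.1, rng=rng)
```

No decode to RGB appears anywhere in the loop, and the choice of `t` decides which noise level the teacher features, and therefore the feedback, come from.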

4. Visualization: LADD feature feedback

Drag the noise level to see the feedback shift between global structure (high noise) and local texture (low noise).

5. Comprehension check

Can explain why LADD is more than a latent-space GAN.
Can explain how the noise level affects the type of feedback.

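Few-step sampling for Flow / Rectified paths

Reflow, Shortcut, and MeanFlow do not simply compress a diffusion chain; they make the path straighter or directly learn the average velocity over a large step.

1. One-sentence intuition

If the sampling bottleneck is ODE integration, either make the path straighter or have the model directly predict the average velocity over an interval.

2. Mathematical objects

\[ \frac{dx_t}{dt}=v_\theta(x_t,t,c),\qquad x_t=(1-t)x_0+t x_1,\qquad \bar v_\theta(x_t,r,t,c)\approx\frac{x_t-x_r}{t-r} \]

The first expression is the instantaneous velocity field, the second is the linear interpolation path, and the third is the average velocity over the interval \([r,t]\).

3. Training and inference procedure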
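
The difference between integrating an instantaneous velocity and jumping with an average velocity shows up clearly on a deliberately curved toy path. The path `x0 + (x1 - x0) * t**2` and all function names are assumptions made for this sketch; the velocities are given in closed form instead of being learned.

```python
import numpy as np

x0 = np.array([0.0, 0.0])
x1 = np.array([2.0, -1.0])

def curved_path(t):
    """Toy curved trajectory from x0 (t = 0) to x1 (t = 1)."""
    return x0 + (x1 - x0) * t ** 2

def instantaneous_velocity(t):
    """dx/dt along the curved path."""
    return 2.0 * t * (x1 - x0)

def average_velocity(r, t):
    """(x_t - x_r) / (t - r): the displacement per unit time over [r, t]."""
    return (curved_path(t) - curved_path(r)) / (t - r)

def euler_sample(num_steps):
    """Integrate the instantaneous velocity with forward Euler from t = 0 to 1."""
    x = x0.copy()
    ts = np.linspace(0.0, 1.0, num_steps + 1)
    for r, t in zip(ts[:-1], ts[1:]):
        x = x + (t - r) * instantaneous_velocity(r)
    return x

x_euler4 = euler_sample(4)                                   # 4 function evaluations
x_euler64 = euler_sample(64)                                 # 64 function evaluations
x_one_step = x0 + (1.0 - 0.0) * average_velocity(0.0, 1.0)   # 1 function evaluation
```

Four Euler steps cover only three quarters of the displacement on this curved path, while a single average-velocity jump lands on `x1`. On the straight path \(x_t=(1-t)x_0+t x_1\) the instantaneous velocity is the constant \(x_1-x_0\), so it already equals the average velocity and one Euler step suffices, which is the motivation for straightening paths with Reflow.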

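4. Characteristics and issues

Strengths

Moves the few-step objective up front, into the definition of the path or the velocity.

Issues

Not every task has sufficiently straight paths; one-step quality should be judged from paper reports and verified by reproduction.

5. Comprehension check

Can distinguish instantaneous velocity from average velocity.
Can explain why Reflow may reduce NFE.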

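Self Forcing: letting the model consume its own outputs during training

Self Forcing changes the training distribution of autoregressive (AR) video diffusion to the self-generated context the student will encounter at inference time.

1. One-sentence intuition

Teacher forcing consumes ground-truth context during training but self-generated context at inference; Self Forcing performs the self-rollout during training itself.

2. Holistic distribution matching

\[ \mathcal{L}_{SF}=\mathcal{D}(p_\theta(x^{1:N}|c),p_{\mathrm{data}}(x^{1:N}|c)) \]

\(\mathcal{D}\) can be approximated with DMD, SiD, GAN, or a video-level reward/critic.

3. Training and inference procedure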
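
The contrast between the two context sources, together with a rolling cache of bounded size, can be sketched with a toy chunk generator. This is only a schematic: the cache stores whole chunks instead of attention keys and values, `generate_chunk` is a hypothetical stand-in for a few-step video denoiser, and no loss or gradient truncation is shown.

```python
from collections import deque
import numpy as np

def generate_chunk(context, noise):
    """Hypothetical chunk generator: blends the cached context with fresh noise."""
    ctx = np.mean(context, axis=0) if len(context) else np.zeros_like(noise)
    return 0.9 * ctx + 0.1 * noise

def rollout(num_chunks, cache_size, rng, ground_truth=None, dim=4):
    """Autoregressive rollout with a rolling cache of the last cache_size chunks.

    ground_truth given  -> teacher forcing: the cache holds ground-truth chunks.
    ground_truth absent -> self forcing: the cache holds the model's own outputs.
    """
    cache = deque(maxlen=cache_size)   # oldest entry is evicted automatically
    outputs = []
    for i in range(num_chunks):
        chunk = generate_chunk(list(cache), rng.normal(size=dim))
        outputs.append(chunk)
        cache.append(ground_truth[i] if ground_truth is not None else chunk)
    return np.stack(outputs), cache

rng = np.random.default_rng(0)
video_sf, cache_sf = rollout(num_chunks=9, cache_size=4, rng=rng)
gt = np.ones((9, 4))
video_tf, cache_tf = rollout(num_chunks=9, cache_size=4, rng=rng, ground_truth=gt)
```

Running the self-forcing branch at training time means the loss is computed on `video_sf`, a sequence conditioned on exactly the kind of context the model will hold in its cache at inference.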

4. Visualization: rolling KV cache lab

Compares Teacher Forcing with Self Forcing. The latter uses self-generated context and a rolling KV cache in both training and inference.

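5. Key points

Self Forcing is a post-training method and does not necessarily train from scratch.
The core idea is that the train-time rollout mirrors the inference-time rollout.
The rolling KV cache is both a speed optimization and a distribution alignment mechanism.
Stochastic gradient truncation makes long AR rollouts trainable.

6. Comprehension check

Can explain the difference between self-rollout and teacher forcing.
Can state the role of the KV cache in training and in inference.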

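Self-Forcing++: from a short-video teacher to long-video self-correction

Self-Forcing++ has the student first generate a long video that exceeds the teacher's horizon, then extracts short windows for the teacher to correct.

1. One-sentence intuition

No long-video teacher is needed; the short-video teacher only corrects locally degraded windows within the student's self-generated long video.

2. Extended DMD objective

\[ \nabla_\theta \mathcal{L}_{extended\text{-}DMD} \approx \mathbb{E}_{i,t,\epsilon} \left[w(t)(s_S(W_{\theta,t},t,c)-s_T(W_{\theta,t},t,c)) \frac{\partial W_{\theta,t}}{\partial\theta}\right] \]

Here \(W_\theta\) is a short window sampled at index \(i\) from the student's long rollout, \(W_{\theta,t}\) is that window noised to level \(t\), and the teacher evaluates only this window. \(s_S\) is the student-side score, corresponding to the fake score \(s_\phi\) in the DMD objective.

3. Training and inference procedure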
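
The window sampling and the score gap on that window can be sketched as follows. The long rollout is a synthetic array with a slow drift, the two scores are closed-form Gaussian stand-ins, and the window is noised with plain additive noise; backward noise initialization and the actual networks are not modeled here.

```python
import numpy as np

def sample_window(long_rollout, horizon, rng):
    """Pick a start index uniformly and return a teacher-sized window."""
    n = long_rollout.shape[0]
    i = int(rng.integers(0, n - horizon + 1))   # upper bound is exclusive
    return i, long_rollout[i:i + horizon]

def noised_window(window, sigma_t, rng):
    """Noise the window to level sigma_t (assumed additive convention)."""
    return window + sigma_t * rng.normal(size=window.shape)

def score_gap(window_t, student_score, teacher_score):
    """s_S(W_t) - s_T(W_t): the direction used in the extended DMD gradient."""
    return student_score(window_t) - teacher_score(window_t)

rng = np.random.default_rng(0)
N, H, DIM, SIGMA = 60, 8, 4, 0.5
# Synthetic long rollout whose values drift from 1 to 4 (a stand-in for color drift).
long_rollout = np.linspace(1.0, 4.0, N)[:, None] * np.ones((N, DIM))

i, window = sample_window(long_rollout, H, rng)
window_t = noised_window(window, SIGMA, rng)

# Hypothetical scores: the teacher prefers values near 0, while the student-side
# score is centered on the window's own mean.
teacher_score = lambda w: -(w - 0.0) / (1.0 + SIGMA ** 2)
student_score = lambda w: -(w - w.mean()) / (1.0 + SIGMA ** 2)
gap = score_gap(window_t, student_score, teacher_score)
```

The teacher only ever sees `H` frames at a time, yet because windows are drawn from anywhere in the `N`-frame rollout, its correction reaches states that occur far beyond its own horizon.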

4. Characteristics and issues

Strengths

Uses short-horizon teacher correction on the degraded states that actually occur in long rollouts.

Failure modes

Motion freeze, over-exposure, subject drift, color drift, scene collapse, and cache contamination.

5. Comprehension check

Can explain the role of backward noise initialization.
Can explain why Self-Forcing++ needs no long-video teacher.

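Unified view: evolution of techniques and method comparison

From logits KD to self-rollout correction, the central shift is that the matching target moves ever closer to the true distribution the student sees at inference time.

Classic KD → Trajectory → Consistency → DMD / DMD2 → ADD / LADD → Self-rollout → Long-horizon correction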

Method	Matching target	Binding to teacher path	Typical NFE	Training signal	Strengths	Failure modes	Suitable scenarios
Progressive Distillation	local transition	Strong	4-16	teacher two-step target	Simple and stable	Inherits the teacher path	Few-step diffusion
Guided Distillation	CFG-composed denoiser	Medium	1-8	guided teacher vector	Removes the CFG double call	Limited scale generalization	Fast sampling at a fixed CFG scale
Consistency / LCM	endpoint invariant	Medium	1-4	EMA target consistency	Natural one-step sampling	Detail depends on the schedule	latent diffusion acceleration
DMD	noisy marginal distribution	Weak	1	real / fake score gap	No path correspondence required	Fake score is hard to track	one-step generator
DMD2	DMD + GAN + backward simulation	Weak	1-4	fake critic, real data GAN	No paired data needed	System complexity	High-quality fast T2I
ADD	score + adversarial	Medium	1-4	score distillation + discriminator	Strong sharpness	Training stability	turbo image generation
LADD	latent generative features	Medium	1-4	teacher feature discriminator	High-resolution friendly	Depends on a strong latent teacher	SD3/latent turbo
Self Forcing	self-rollout video distribution	Medium	few-step/chunk	holistic video loss	Train-inference state alignment	Limited by the training horizon	AR video diffusion
Self-Forcing++	long rollout windows	Weak/Medium	few-step/chunk	extended DMD	Long-video self-correction	High training cost	minute-scale video

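Common misconceptions and final check

Generative model distillation is not just about making the model smaller; it compresses the sampling computation graph and the inference-time state distribution.

Common misconceptions

Distillation is not only model compression; generative distillation mainly compresses the sampling computation graph, and the parameter count need not shrink.
One-step sampling has the lowest latency, but four-step sampling is often more robust in detail, text alignment, and diversity.
DMD does not require alignment with the teacher path, but it still needs a teacher score or a target distribution signal.
GAN loss does more than sharpen images; it can also cause mode collapse or artifacts.
The key to Self Forcing is that training really does use self-generated context and the KV cache.
Self-Forcing++ needs no long-video teacher; it uses a short-horizon teacher to correct short windows of a long rollout.

Comprehension check

Can explain the differences among trajectory, consistency, DMD, ADD/LADD, and Self Forcing in terms of the matching target.
Can write out the four-part training recipe of DMD2.
Can explain why the core of LADD is the teacher's generative features, not an ordinary latent GAN.
Can explain how Self-Forcing++ uses a short teacher to correct long videos.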

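Further reading

Hinton et al. 2015; Salimans & Ho 2022; Song et al. 2023; Luo et al. 2023; Sauer et al. 2023/2024; Yin et al. 2024; Huang et al. 2025; Cui et al. 2025; Lu et al. 2025. Low-step results mentioned in the text should be read as paper reports, not as general guarantees for arbitrary models and data.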

Source verification table

Training and inference procedures were checked against the papers or project pages; the table records only the evidence directly related to the algorithm steps in this article.

Method	Verified procedure points	Source
Classic KD	Temperature soft targets and KL matching.	Distilling the Knowledge in a Neural Network
Progressive Distillation	Two small DDIM-like teacher steps are merged into one large student step, halving NFE each round.	Progressive Distillation for Fast Sampling
Guided Distillation	First match the CFG-composed output, then apply progressive distillation down to few steps.	On Distillation of Guided Diffusion Models
Consistency / LCM / TCD / CTM	States on the same PF-ODE trajectory map to a consistent endpoint; LCM transfers this to the latent guided PF-ODE, and TCD/CTM extend it to trajectory consistency.	Consistency Models · LCM · TCD · CTM
DMD / DMD2	DMD updates the generator with the difference between the real (teacher) score and the fake score; DMD2 drops paired regression and adds a two-time-scale fake critic, a GAN loss, and multi-step backward simulation.	DMD project · DMD2
ADD / LADD	ADD combines score distillation with an adversarial loss; LADD uses the teacher's generative features and noise-level feedback in latent space.	ADD · LADD
Flow / Shortcut / MeanFlow / SFD	Flow Matching learns the instantaneous velocity; Rectified Flow/Reflow makes the path straighter; Shortcut/MeanFlow/SFD learn skippable steps or interval-average velocity.	Flow Matching · Rectified Flow · MeanFlow · SFD
Self Forcing / Self-Forcing++	Self Forcing performs self-rollout with a rolling KV cache during training; Self-Forcing++ generates a long rollout, extracts local windows, and aligns them to the short teacher with backward noise initialization and extended DMD.	Self Forcing · Self Forcing paper · Self-Forcing++