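Principles of Generative Model Acceleration

The overall question: find where it is slow before deciding where to make it fast

Accelerating a generative model is not just model compression. The core question is how to obtain the same generative distribution, exactly or approximately, with less compute, less memory, less data movement, and less time spent waiting on synchronization.

Intuition in one sentence

Parameter count is only the surface. Many real bottlenecks come from serial generation, HBM data movement, KV cache growth, irregular request lengths, and GPU scheduling bubbles.

Unified cost model

The latency of one step can be approximated by a roofline term (compute or bandwidth, whichever is larger) plus communication, synchronization, and scheduling overheads:

\[ T_{\text{step}}\approx \max\left( \frac{\text{FLOPs}}{\text{compute throughput}}, \frac{\text{bytes moved}}{\text{memory bandwidth}} \right) +T_{\text{comm}}+T_{\text{sync}}+T_{\text{schedule}}. \]

Each acceleration technique essentially shrinks one of these quantities: NFE (number of function evaluations), token count, bytes per element, KV cache size, HBM traffic, batch bubbles, or communication wait.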

Fewer steps: DDIM, DPM-Solver, UniPC, ODE solvers

Less data movement: FlashAttention, kernel fusion, activation recomputation

Smaller KV: MQA, GQA, MLA, PagedAttention, prefix cache

Lower precision: FP8, INT8, INT4, GPTQ, AWQ, SmoothQuant, KV quantization

Higher utilization: continuous batching, chunked prefill, request scheduling

Sparser activation: MoE, sparse attention, token pruning, dynamic halting

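Visualization: bottleneck map

Adjust the workload type and watch the dominant bottleneck shift from compute to memory, serial depth, or scheduling.

Decode is often limited by KV cache bandwidth, token-by-token serial execution, and batch utilization.

Pseudocode: Unified Latency Estimate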
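
A minimal Python sketch of the cost model above. The function name `step_latency`, its parameters, and the example numbers are illustrative assumptions; the overhead terms in particular have to come from profiling, not from a formula.

```python
def step_latency(flops, bytes_moved, peak_flops, bandwidth,
                 t_comm=0.0, t_sync=0.0, t_schedule=0.0):
    """Estimate one step's latency in seconds and name the roofline term that dominates."""
    t_compute = flops / peak_flops        # time if purely compute-limited
    t_memory = bytes_moved / bandwidth    # time if purely bandwidth-limited
    bound = "compute" if t_compute >= t_memory else "memory"
    total = max(t_compute, t_memory) + t_comm + t_sync + t_schedule
    return total, bound


# Illustrative numbers only: 160 TFLOP and 220 GB per step on a device with
# 900 TFLOP/s peak compute and 3.4 TB/s memory bandwidth, no overheads.
total, bound = step_latency(160e12, 220e9, 900e12, 3.4e12)
print(f"{total * 1e3:.1f} ms, {bound}-bound")
```

With these example numbers the compute term (about 178 ms) exceeds the memory term (about 65 ms), so the step is compute-bound.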

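Roofline view: compute-bound or bandwidth-bound

GPUs compute fast, but moving data from device memory to the compute units is comparatively slow. Many fast algorithms do not compute less; they move less.

Mathematical object / cost model

Define arithmetic intensity:

\[ I=\frac{\text{FLOPs}}{\text{bytes moved}},\qquad T\ge \max\left(\frac{\text{FLOPs}}{P_{\max}},\frac{\text{bytes moved}}{B_{\max}}\right). \]

Here \(P_{\max}\) is the peak compute throughput and \(B_{\max}\) the peak memory bandwidth. When \(I\) is low, the workload behaves as memory-bound; when \(I\) is high, it is more likely limited by compute throughput. The crossover is the ridge point \(I^\ast=P_{\max}/B_{\max}\), where the two terms of the bound are equal.

Roofline Calculator

Enter the FLOPs per step, the data moved, and the hardware peaks to estimate compute time, memory time, and the bottleneck type.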

Inputs (defaults): FLOPs per step 160 TFLOP; bytes moved 220 GB; peak throughput 900 TFLOP/s; bandwidth 3.4 TB/s; batch size 8; precision 2 bytes.

Output: latency lower bound (ms).

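Benefits, risks, diagnostic signals

Benefits

Identify the bottleneck first and avoid chasing FLOPs blindly.

Risks

Real runtime also includes kernel launch overhead, shape effects, cache misses, communication, and scheduling costs.

Diagnostic signals

Achieved TFLOP/s, HBM bandwidth, and kernel utilization, with prefill and decode measured separately.

Engineering actions

If memory-bound, look first at IO-aware kernels, quantization, and caching; if compute-bound, look at parallelism and low-precision tensor cores.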

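Attention Kernel: how FlashAttention works

FlashAttention still computes exact attention, but it uses tiling and an online softmax to avoid writing the full \(T\times T\) score matrix back to HBM.

Intuition in one sentence

Standard attention is slow not only because of the \(O(T^2)\) compute; the larger cost is reading and writing the huge intermediate matrix in device memory. FlashAttention keeps intermediate state in SRAM / shared memory as much as possible.

Mathematical object / online softmax

\[ \operatorname{Attn}(Q,K,V)=\operatorname{softmax}\left(\frac{QK^\top}{\sqrt d}\right)V,\qquad m_{\text{new}}=\max(m_{\text{old}},m_{\text{block}}) \] \[ l_{\text{new}}=e^{m_{\text{old}}-m_{\text{new}}}l_{\text{old}}+ e^{m_{\text{block}}-m_{\text{new}}}l_{\text{block}}. \]

Here \(m\) is the running row maximum and \(l\) the running softmax denominator, with \(l_{\text{block}}\) computed relative to \(m_{\text{block}}\). Maintaining both by recurrence for every row removes the need to store the full attention matrix explicitly.

Attention Memory Visualizer

Compare standard attention and FlashAttention in terms of score matrix materialization and HBM traffic.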

Inputs (defaults): sequence length T = 2048; head dim d = 128; mode.

Output: attention matrix size (GB).

Pseudocode: Blocked Exact Attention
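
A minimal single-head sketch of the online-softmax recurrence in plain Python, compared against a reference that materializes each full score row. It illustrates only the arithmetic; it is not the FlashAttention kernel, which also tiles the queries and manages SRAM explicitly. Function names and the block size are assumptions.

```python
import math
import random


def reference_attention(Q, K, V):
    """Standard attention: builds the full score row for every query."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        denom = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / denom
                    for j in range(len(V[0]))])
    return out


def blocked_attention(Q, K, V, block=2):
    """Same result, but K/V are consumed block by block with running statistics."""
    d, dv = len(Q[0]), len(V[0])
    out = []
    for q in Q:
        m, l, acc = -math.inf, 0.0, [0.0] * dv   # running max, denominator, numerator
        for start in range(0, len(K), block):
            k_blk, v_blk = K[start:start + block], V[start:start + block]
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in k_blk]
            m_new = max(m, max(scores))
            scale = math.exp(m - m_new)           # rescales the old statistics
            # p is taken relative to m_new, so sum(p) = e^(m_block - m_new) * l_block.
            p = [math.exp(s - m_new) for s in scores]
            l = scale * l + sum(p)
            acc = [scale * a + sum(pi * v[j] for pi, v in zip(p, v_blk))
                   for j, a in enumerate(acc)]
            m = m_new
        out.append([a / l for a in acc])
    return out


random.seed(0)
Q = [[random.gauss(0, 1) for _ in range(4)] for _ in range(3)]
K = [[random.gauss(0, 1) for _ in range(4)] for _ in range(5)]
V = [[random.gauss(0, 1) for _ in range(4)] for _ in range(5)]
ref, blk = reference_attention(Q, K, V), blocked_attention(Q, K, V, block=2)
print(max(abs(a - b) for r, s in zip(ref, blk) for a, b in zip(r, s)))
```

The blocked version never holds more than one block of scores per query, which is the property that lets a real kernel keep the working set on chip.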

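Risks and diagnostics

Kernels differ in their support for masks, dropout, head dimensions, causal attention, and hardware generations.
Low-precision attention needs a numerical parity test, especially for long contexts and high-temperature sampling.
Benchmark prefill and decode separately; decode usually also needs KV cache and serving optimizations.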

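KV Cache: from recomputation to remembering history

An autoregressive model attends to the full history for every token it generates. The KV cache stores each layer's past keys and values so that the whole context is not recomputed at every step.

KV cache memory formula

Let \(B\) be the batch size, \(L\) the number of layers, \(T\) the context length, \(H_{kv}\) the number of KV heads, \(D\) the head dim, and \(b\) the bytes per element. Then:

\[ M_{KV}=2\cdot B\cdot L\cdot T\cdot H_{kv}\cdot D\cdot b. \]

The leading factor 2 counts keys and values. MQA sets \(H_{kv}=1\) and GQA sets \(1<H_{kv}<H_q\), so the KV cache shrinks linearly with \(H_{kv}\).

KV Cache Memory Calculator

Compare the KV memory footprint of MHA, GQA, and MQA at the same batch size and context length.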

Inputs (defaults): layers L = 32; context T = 8192; batch B = 16; query heads H_q = 32; KV heads H_kv = 8; head dim D = 128.

Output: GQA cache size (GB).
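
As a worked instance of the formula, the sketch below plugs in the calculator inputs listed above (32 layers, context 8192, batch 16, head dim 128) and compares 32, 8, and 1 KV heads. The 2 bytes per element (FP16/BF16) and the function name `kv_cache_bytes` are assumptions; the page does not state the KV precision.

```python
def kv_cache_bytes(batch, layers, context, kv_heads, head_dim, bytes_per_elem):
    # Factor 2: one tensor for keys and one for values.
    return 2 * batch * layers * context * kv_heads * head_dim * bytes_per_elem


GIB = 1024 ** 3
common = dict(batch=16, layers=32, context=8192, head_dim=128, bytes_per_elem=2)
for name, kv_heads in [("MHA", 32), ("GQA", 8), ("MQA", 1)]:
    print(name, kv_cache_bytes(kv_heads=kv_heads, **common) / GIB, "GiB")
```

Under these assumptions the cache is 64 GiB for MHA, 16 GiB for GQA, and 2 GiB for MQA.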

Pseudocode: KV Cache Decode Loop
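
A toy single-layer, single-head decode loop, sketched to show what the cache saves: with the cache, each step projects only the new token; without it, the whole prefix is re-projected at every step. The `project` helper is a hypothetical stand-in for learned W_q / W_k / W_v projections, and the token vectors are given up front instead of being sampled.

```python
import math


def attend(q, keys, values):
    """Single-head softmax attention of one query over all stored keys/values."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [sum(wi * v[j] for wi, v in zip(w, values)) / z
            for j in range(len(values[0]))]


def project(x, scale):
    # Hypothetical stand-in for a learned projection matrix.
    return [scale * xi for xi in x]


def decode_with_cache(token_vectors):
    k_cache, v_cache, outputs = [], [], []
    for x in token_vectors:                 # one decode step per token
        k_cache.append(project(x, 0.5))     # K/V computed for the new token only
        v_cache.append(project(x, 2.0))
        outputs.append(attend(project(x, 1.0), k_cache, v_cache))
    return outputs


def decode_without_cache(token_vectors):
    outputs = []
    for t in range(1, len(token_vectors) + 1):
        prefix = token_vectors[:t]          # the whole history is re-projected each step
        keys = [project(x, 0.5) for x in prefix]
        values = [project(x, 2.0) for x in prefix]
        outputs.append(attend(project(prefix[-1], 1.0), keys, values))
    return outputs


tokens = [[1.0, 0.0], [0.5, -1.0], [0.2, 0.3], [-0.7, 0.9]]
print(decode_with_cache(tokens)[-1])
```

Both loops produce the same outputs; the cached loop does O(T) projections in total while the uncached one does O(T^2), at the price of keeping `k_cache` and `v_cache` in memory.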

Pseudocode: Paged KV Allocation + Prefix Sharing
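
A toy block allocator that sketches the two ideas together: a per-sequence block table over fixed-size physical blocks, and reference-counted sharing of a common prefix. The class name, method names, and the rule of sharing only full blocks are assumptions made for illustration; this is not the vLLM implementation.

```python
class PagedKVAllocator:
    """Toy allocator: each sequence maps to physical blocks through a block table."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # free physical block ids
        self.refcount = {}                    # block id -> number of sequences using it
        self.tables = {}                      # seq id -> list of physical block ids
        self.lengths = {}                     # seq id -> number of tokens stored

    def _alloc(self):
        if not self.free:
            raise MemoryError("out of KV blocks")
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def add_sequence(self, seq_id, num_tokens, parent=None, shared_tokens=0):
        table, length = [], 0
        if parent is not None:
            # Share only full blocks; a partially filled block would need copy-on-write.
            full = min(shared_tokens, self.lengths[parent]) // self.block_size
            for block in self.tables[parent][:full]:
                self.refcount[block] += 1
                table.append(block)
            length = full * self.block_size
        self.tables[seq_id], self.lengths[seq_id] = table, length
        for _ in range(num_tokens - length):
            self.append_token(seq_id)

    def append_token(self, seq_id):
        if self.lengths[seq_id] % self.block_size == 0:   # last block is full, or none yet
            self.tables[seq_id].append(self._alloc())
        self.lengths[seq_id] += 1

    def free_sequence(self, seq_id):
        for block in self.tables.pop(seq_id):
            self.refcount[block] -= 1
            if self.refcount[block] == 0:                 # last user gone: recycle block
                del self.refcount[block]
                self.free.append(block)
        del self.lengths[seq_id]


alloc = PagedKVAllocator(num_blocks=8, block_size=4)
alloc.add_sequence("a", num_tokens=10)
alloc.add_sequence("b", num_tokens=12, parent="a", shared_tokens=8)
print(alloc.tables, len(alloc.free))
```

Here sequence "b" reuses the two full blocks that hold the shared 8-token prefix and allocates only one new block for its own tokens.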

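PagedAttention and Prefix Cache

PagedAttention

Split the KV cache into fixed-size blocks and use a block table to map logical sequence positions to physical device memory, which reduces fragmentation.

Prefix cache

When multiple requests share a system prompt, few-shot examples, or a document prefix, prefill it once and reuse the KV.

Risks

A low prefix hit rate, blocks that are too coarse or too fine, and insufficient multi-tenant isolation all erode the gains.

Diagnostic signals

KV hit rate, fragmentation, active/free blocks, tokens/sec, TTFT (time to first token), and TPOT (time per output token).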

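Serving & Continuous Batching: keep the GPU from waiting on slow requests

A static batch is held up by its longest request. Continuous batching lets new requests join while older requests are still decoding, which fills GPU bubbles.

Intuition in one sentence

Ordinary batching is like waiting for the whole table to finish eating before seating the next party; with continuous batching, whoever finishes leaves and a new request takes the seat immediately.

Continuous Batching Timeline

Compare static and continuous batching in terms of empty slots, average latency, and GPU utilization.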

Inputs (defaults): max batch slots 6; request variance 58; arrival rate 5.

Output: continuous batching utilization (%).

Pseudocode: Serving Scheduler
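
A small simulation of the two policies, counting decode steps. It assumes all requests are queued at time zero, each needs at least one decode step, and every step costs the same regardless of batch occupancy; real schedulers also handle arrivals, prefill, and memory limits. The request lengths are made up.

```python
from collections import deque


def static_batching(lengths, slots):
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])   # the whole batch waits for its longest request
    return steps


def continuous_batching(lengths, slots):
    queue, running, steps = deque(lengths), [], 0
    while queue or running:
        while queue and len(running) < slots:    # admit new requests into free slots
            running.append(queue.popleft())
        running = [r - 1 for r in running]       # one decode step for every active request
        running = [r for r in running if r > 0]  # finished requests leave immediately
        steps += 1
    return steps


lengths = [8, 2, 2, 2, 8, 2, 2, 2]   # output tokens per request (illustrative)
slots = 4
for name, fn in [("static", static_batching), ("continuous", continuous_batching)]:
    steps = fn(lengths, slots)
    print(name, steps, f"utilization={sum(lengths) / (steps * slots):.1%}")
```

For this workload the static policy needs 16 steps (about 44% slot utilization) and the continuous policy needs 10 (70%), because short requests stop occupying slots while the long ones finish.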

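Quantization: representing weights, activations, and KV with fewer bits

Quantization is not arbitrary truncation of floats. A good scheme has to answer how the scale is chosen, how outliers are handled, how sensitive the output is to the error, and whether hardware kernels actually run faster.

Basic formula

\[ s=\frac{\max |x|}{2^{b-1}-1},\qquad q=\operatorname{clip}\left(\operatorname{round}(x/s),-2^{b-1},2^{b-1}-1\right),\qquad \hat{x}=sq. \]

In this section \(b\) is the bit width, \(s\) the scale, \(q\) the integer code, and \(\hat{x}\) the dequantized value.

SmoothQuant uses \(Y=XW=(XS)(S^{-1}W)\), with a diagonal per-channel scaling \(S\), to migrate the difficulty of activation outliers into the weights; GPTQ uses approximate second-order information to compensate for quantization error column by column; AWQ uses activation-aware scaling to protect salient channels.

Quantization Error Lab

Adjust the bit width, scale type, outlier strength, and method to see the reconstruction error and the wasted dynamic range.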

Inputs (defaults): bits; scale type; outlier strength 45; method.

Output: relative error (%).

Pseudocode: PTQ Calibration + Quantized Linear
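
A minimal sketch of post-training quantization with the absmax formula above: one static activation scale chosen from calibration samples, one scale per weight row, and a linear layer that accumulates in integers before rescaling. Function names, the toy weights, and the calibration data are assumptions; this is plain absmax PTQ, not GPTQ, AWQ, or SmoothQuant.

```python
def absmax_scale(values, bits=8):
    """Symmetric scale s = max|x| / (2^(b-1) - 1); falls back to 1.0 for all-zero input."""
    return max(abs(v) for v in values) / (2 ** (bits - 1) - 1) or 1.0


def quantize(values, scale, bits=8):
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return [min(hi, max(lo, round(v / scale))) for v in values]


def calibrate_activation_scale(calibration_batches, bits=8):
    # PTQ calibration: pick one static activation scale from sample inputs.
    return absmax_scale([v for batch in calibration_batches for v in batch], bits)


def quantized_linear(x, w_q, w_scales, x_scale, bits=8):
    """y_i ~= (sum_j q_x[j] * q_w[i][j]) * s_x * s_w[i], with an integer accumulator."""
    x_q = quantize(x, x_scale, bits)
    return [sum(a * b for a, b in zip(x_q, row)) * x_scale * s
            for row, s in zip(w_q, w_scales)]


W = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.3]]            # toy float weights
w_scales = [absmax_scale(row) for row in W]          # per-output-row scales
w_q = [quantize(row, s) for row, s in zip(W, w_scales)]
x_scale = calibrate_activation_scale([[0.5, -2.0, 1.0], [1.5, 0.2, -0.7]])

x = [1.0, -2.0, 0.5]
y_float = [sum(w * xi for w, xi in zip(row, x)) for row in W]
y_quant = quantized_linear(x, w_q, w_scales, x_scale)
print(y_float, y_quant)
```

A static activation scale clips any input larger than what calibration saw, which is one reason the calibration set and outlier handling matter.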

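Risks and diagnostics

Weight quantization usually reduces memory footprint and weight bandwidth, but it does not guarantee lower end-to-end latency.
KV quantization can hurt long-context retrieval; math, code, and exact-copy tasks are more sensitive.
Diagnose separately: perplexity, exact match, pass@k, long-context accuracy, prefill latency, and decode latency.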

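Speculative Decoding: a small model drafts, the large model verifies

A small model first guesses several tokens and the target model verifies them in one pass, which reduces the number of serial forward passes of the large model. With correct rejection sampling, the target distribution is preserved.

Acceptance probability

Let the target model be \(p\) and the draft model be \(q\). A candidate token \(\tilde{x}\) sampled from \(q(\cdot\mid h)\) given history \(h\) is accepted with probability:

\[ a=\min\left(1,\frac{p(\tilde{x}\mid h)}{q(\tilde{x}\mid h)}\right). \]

On rejection, a token is sampled from the residual distribution, proportional to \(\max(0,\,p(x\mid h)-q(x\mid h))\), so the final output is still distributed according to the target model \(p\).

Speculative Acceptance Simulator

Adjust draft quality, proposal length, draft cost, and temperature to estimate the accepted length and speedup.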

Inputs (defaults): draft quality 72; proposal length K = 5; draft cost 18%; temperature 0.65.

Output: expected speedup (x).

Pseudocode: Draft / Verify Loop
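
A sketch of one draft / verify round using the acceptance rule above. `target` and `draft` are hypothetical callables that map a context to a probability list over a tiny vocabulary; a real system would score all proposed positions with a single batched forward pass of the target model instead of calling it once per token.

```python
import random


def sample(probs, rng):
    # Weights do not need to be normalized for random.choices.
    return rng.choices(range(len(probs)), weights=probs)[0]


def speculative_step(target, draft, context, k, rng):
    """Propose up to k draft tokens, verify them, and return the tokens to emit."""
    proposed, ctx = [], list(context)
    for _ in range(k):                       # cheap serial drafting
        tok = sample(draft(ctx), rng)
        proposed.append(tok)
        ctx.append(tok)
    emitted, ctx = [], list(context)
    for tok in proposed:                     # verification, position by position
        p, q = target(ctx), draft(ctx)
        if rng.random() < min(1.0, p[tok] / q[tok]):
            emitted.append(tok)              # accepted draft token
            ctx.append(tok)
        else:
            # Rejected: resample from the residual, proportional to max(0, p - q).
            residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
            emitted.append(sample(residual, rng))
            return emitted
    emitted.append(sample(target(ctx), rng))  # all accepted: one extra token from p
    return emitted


# Toy context-independent distributions over a 3-token vocabulary (assumed).
target = lambda ctx: [0.6, 0.3, 0.1]
draft = lambda ctx: [0.3, 0.3, 0.4]
print(speculative_step(target, draft, context=[], k=3, rng=random.Random(0)))
```

Each round emits between 1 and k + 1 tokens, so the speedup depends on how often the draft agrees with the target and on how cheap drafting is.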

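Diffusion / Flow Samplers: from small first-order steps to large high-order steps

A fast sampler aims to recover a similar reverse trajectory with as few NFE as possible, without retraining the model.

DDIM and the ODE solver view

\[ \hat{x}_0=\frac{x_t-\sqrt{1-\bar{\alpha}_t}\epsilon_\theta(x_t,t)}{\sqrt{\bar{\alpha}_t}},\qquad x_s=\sqrt{\bar{\alpha}_s}\hat{x}_0+\sqrt{1-\bar{\alpha}_s}\epsilon_\theta(x_t,t). \] \[ \frac{dx}{dt}=f_\theta(x,t),\qquad \text{error}=O(\Delta t^p). \]

Here \(s<t\) is the next (less noisy) timestep, and \(p\) is the order of the solver: a higher-order method keeps the error small at larger step sizes \(\Delta t\).

DPM-Solver, DPM-Solver++, UniPC, DEIS, and related methods exploit the structure of the diffusion ODE to reach higher-order approximations with few denoiser evaluations.

NFE vs Quality Simulator

Compare the speed and error trends of DDPM, DDIM, DPM-Solver, and UniPC across different NFE, guidance scales, and solver orders.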

Inputs (defaults): sampler; num steps 24; guidance scale 7.0; solver order 2.

Output: estimated latency (ms).

Pseudocode: Fast Diffusion Scheduler Loop
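
A scalar sketch of the deterministic DDIM update above, run over a strided subset of the training timesteps. The linear beta schedule, the epsilon-prediction convention, and the function names are assumptions. The toy "model" is the exact noise predictor for data concentrated at a single value `c`, which makes the result easy to check; it is not a trained network.

```python
import math


def make_alpha_bar(num_train_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bar, prod = [], 1.0
    for t in range(num_train_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_train_steps - 1)
        prod *= 1.0 - beta
        alpha_bar.append(prod)
    return alpha_bar


def ddim_sample(eps_model, x_T, alpha_bar, num_steps):
    """Deterministic DDIM (eta = 0) with num_steps >= 2 model evaluations."""
    T = len(alpha_bar)
    timesteps = [round(i * (T - 1) / (num_steps - 1)) for i in range(num_steps)][::-1]
    x = x_T
    for i, t in enumerate(timesteps):
        a_t = alpha_bar[t]
        # The step after the last listed timestep lands on clean data (alpha_bar = 1).
        a_s = alpha_bar[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
        eps = eps_model(x, t)                                    # one NFE per step
        x0_hat = (x - math.sqrt(1 - a_t) * eps) / math.sqrt(a_t)
        x = math.sqrt(a_s) * x0_hat + math.sqrt(1 - a_s) * eps
    return x


alpha_bar = make_alpha_bar()
c = 3.0   # toy data distribution: a point mass at c
exact_eps = lambda x, t: (x - math.sqrt(alpha_bar[t]) * c) / math.sqrt(1 - alpha_bar[t])
print(ddim_sample(exact_eps, x_T=0.7, alpha_bar=alpha_bar, num_steps=10))
```

With an exact predictor the step count does not matter; with a learned model, fewer steps mean larger discretization error, which is what higher-order solvers try to reduce.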

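Risks and diagnostics

Few-step sampling tends to lose detail; high-order solvers can overshoot under strong CFG (classifier-free guidance).
Models differ in prediction type; the scheduler formulas for epsilon, x0, and v prediction are not interchangeable.
For video generation, track visual quality, FVD, motion score, and temporal consistency together.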

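Diffusion / DiT / Video Cache: reusing features across timesteps

The latents and intermediate features of adjacent denoising steps are often redundant. Cache methods reuse the parts that change little and skip part of the network.

The approximation that makes caching valid

\[ h_l(x_t,t)\approx h_l(x_{t-1},t-1),\qquad h_{t,\text{frame}=i}\approx h_{t,\text{frame}=i+1}. \]

A U-Net can refresh low-level (shallow) features often and high-level (deep) features rarely; a DiT can refresh dynamically per block and per denoising stage; video models must also avoid smeared, sticky-looking frames and the loss of subtle motion.

Diffusion Cache Timeline

The horizontal axis is the denoising step, the vertical axis is the network block, and the color shows compute, reuse, or forced refresh.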

Inputs (defaults): cache interval 2; reuse threshold 62; late refresh 70.

Output: estimated speedup (x).

Pseudocode: Diffusion Feature Cache Loop
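
A toy loop that sketches interval-based feature caching: the cheap block runs at every step, while the expensive block is recomputed only every `cache_interval` steps and its last output is reused in between. The split into `shallow`, `deep`, and `combine` and the scalar stand-in functions are assumptions for illustration; real methods also use similarity thresholds and forced late-stage refreshes.

```python
def run_with_cache(x, num_steps, cache_interval, shallow, deep, combine):
    """Denoising-style loop that refreshes the expensive block only every few steps."""
    cached_deep, deep_calls = None, 0
    for step in range(num_steps):
        h = shallow(x, step)                       # cheap block: always recomputed
        if cached_deep is None or step % cache_interval == 0:
            cached_deep = deep(h, step)            # refresh the cached feature
            deep_calls += 1
        x = combine(x, h, cached_deep)             # skipped steps reuse cached_deep
    return x, deep_calls


# Scalar stand-ins for network blocks (assumed, not a real U-Net or DiT).
shallow = lambda x, step: 0.9 * x
deep = lambda h, step: 0.1 * h
combine = lambda x, h, d: h - d

exact, exact_calls = run_with_cache(1.0, 10, 1, shallow, deep, combine)
cached, cached_calls = run_with_cache(1.0, 10, 2, shallow, deep, combine)
print(exact, exact_calls, cached, cached_calls)
```

With an interval of 2 the expensive block runs 5 times instead of 10, and the output drifts slightly from the uncached result; that drift is the quality cost the cache trades against speed.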

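Token Reduction: process fewer tokens / patches

Images, video, and long text contain many redundant tokens. Token reduction lowers the attention and MLP load by merging, pruning, or deferring computation.

Token merging and pruning

\[ s_{ij}=\frac{h_i^\top h_j}{\|h_i\|\|h_j\|},\qquad \tilde{h}_{ij}=\frac{w_i h_i+w_j h_j}{w_i+w_j},\qquad O(T^2d)\rightarrow O(T'^2d). \]

Here \(s_{ij}\) is the cosine similarity between tokens \(i\) and \(j\), \(w_i\) is a merge weight (for example, the number of original tokens already represented by token \(i\)), and \(T'<T\) is the token count after reduction.

Patch Merge Lab

Adjust the merge ratio to see which image patches are merged and how the attention token count drops.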

Inputs (defaults): merge ratio 28%; importance sharpness 50; mode.

Output: attention cost reduction (%).
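
The merge rule above can be sketched as a single greedy step: find the most similar pair by cosine similarity and replace it with a weighted average. This is a simplification chosen for readability; published token-merging methods match many pairs at once (for example with bipartite matching). Function names and the toy vectors are assumptions.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def merge_most_similar(tokens, weights):
    """Merge the single most similar token pair into a weighted average."""
    pairs = [(cosine(tokens[i], tokens[j]), i, j)
             for i in range(len(tokens)) for j in range(i + 1, len(tokens))]
    _, i, j = max(pairs)
    wi, wj = weights[i], weights[j]
    merged = [(wi * a + wj * b) / (wi + wj) for a, b in zip(tokens[i], tokens[j])]
    keep = [k for k in range(len(tokens)) if k not in (i, j)]
    # Surviving tokens keep their order; the merged token is appended with summed weight.
    return [tokens[k] for k in keep] + [merged], [weights[k] for k in keep] + [wi + wj]


tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
weights = [1, 1, 1]
new_tokens, new_weights = merge_most_similar(tokens, weights)
print(new_tokens, new_weights)
```

Carrying the weights forward lets later merges and the attention computation account for how many original tokens each merged token stands for.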

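Risks and diagnostics

Enabling token reduction during training lets the model adapt to it; plug-and-play use at inference carries a higher quality risk.
Dropping history tokens in a text LLM can break fact retrieval and long-range dependencies, which makes it riskier than visual patch merging.
Diagnostics should cover speed, loss of detail, text consistency, and long-context retrieval together.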

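Training Acceleration: fit training in memory and keep the cluster busy

Training acceleration has to handle parameters, gradients, optimizer state, activations, temporary buffers, and communication together. With a poor communication design, adding GPUs may only add waiting.

Training memory breakdown and ZeRO / FSDP

\[ M_{\text{per GPU}}\approx \frac{M_{\text{parameters}}+M_{\text{gradients}}+M_{\text{optimizer}}}{N_{\text{gpu}}} +M_{\text{activation}}+M_{\text{communication buffer}}. \]

This form assumes parameters, gradients, and optimizer state are all sharded across \(N_{\text{gpu}}\) devices, as in ZeRO-3 or fully sharded FSDP.

Activation checkpointing trades recomputation for memory; ZeRO/FSDP shard optimizer states, gradients, and parameters; tensor, pipeline, sequence, and expert parallelism split individual layer matrices, layers, the sequence, and experts, respectively.

Training Memory Simulator

Estimate the per-GPU memory of parameters, gradients, optimizer state, and activations under different GPU counts, precisions, and checkpointing strategies.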

Inputs (defaults): parameters 32 B; precision 2 bytes; optimizer; num GPUs 8; sequence length 4096; checkpoint.

Output: per-GPU memory (GB).

Pseudocode: Activation Checkpointing
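
A scalar sketch of the recompute-for-memory trade: the forward pass keeps only every `every`-th activation, and the backward pass recomputes each segment from its checkpoint before applying the chain rule. The toy layer `tanh(w * x)` and all names are assumptions; frameworks implement the same idea at the level of tensors and autograd.

```python
import math


def layer(w, x):            # toy layer: y = tanh(w * x)
    return math.tanh(w * x)


def layer_grad_x(w, x):     # dy/dx for the toy layer
    return w * (1.0 - math.tanh(w * x) ** 2)


def grad_with_checkpoints(weights, x, every):
    """Return d(output)/d(input) and the number of activations kept in forward."""
    n = len(weights)
    checkpoints, h = {0: x}, x
    for i, w in enumerate(weights):              # forward: store a sparse set only
        h = layer(w, h)
        if (i + 1) % every == 0 and i + 1 < n:
            checkpoints[i + 1] = h
    grad = 1.0
    for start in reversed(range(0, n, every)):   # backward: one segment at a time
        end = min(start + every, n)
        acts = [checkpoints[start]]
        for i in range(start, end - 1):          # recompute this segment's layer inputs
            acts.append(layer(weights[i], acts[-1]))
        for i in reversed(range(start, end)):
            grad *= layer_grad_x(weights[i], acts[i - start])
    return grad, len(checkpoints)


weights = [0.5, -1.2, 0.8, 1.1, 0.3, -0.7]
g_full, kept_full = grad_with_checkpoints(weights, 0.4, every=1)
g_ckpt, kept_ckpt = grad_with_checkpoints(weights, 0.4, every=3)
print(g_full, kept_full, g_ckpt, kept_ckpt)
```

Both calls return the same gradient; the checkpointed run keeps 2 activations instead of 6 and pays for it with extra forward computation during backward.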

Pseudocode: FSDP / ZeRO-3 Conceptual Step
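
A single-process simulation of the data flow in one fully sharded step: all-gather the parameters, compute local gradients, reduce-scatter the averaged gradients, and update only the owned shard. Ranks are simulated with Python lists, the loss is a toy squared error, and the optimizer is plain SGD; none of this uses a real distributed API.

```python
def fsdp_step(shards, local_batches, lr):
    """One conceptual ZeRO-3 / FSDP step for the toy loss 0.5 * (w . x - y)^2.

    shards[r] is rank r's slice of the flat parameter vector;
    local_batches[r] is rank r's (x, y) example.
    """
    world = len(shards)
    sizes = [len(s) for s in shards]
    # 1) all-gather: every rank temporarily materializes the full parameters.
    full = [p for shard in shards for p in shard]
    # 2) each rank computes the gradient of its own example w.r.t. the full vector.
    grads = []
    for x, y in local_batches:
        err = sum(w * xi for w, xi in zip(full, x)) - y
        grads.append([err * xi for xi in x])
    # 3) reduce-scatter: average gradients; each rank keeps only its own slice.
    avg = [sum(g[i] for g in grads) / world for i in range(len(full))]
    new_shards, offset = [], 0
    for size, shard in zip(sizes, shards):
        g_slice = avg[offset:offset + size]
        # 4) local optimizer update on the owned shard only (plain SGD here).
        new_shards.append([p - lr * g for p, g in zip(shard, g_slice)])
        offset += size
    return new_shards


shards = [[1.0, 2.0], [3.0, 4.0]]                      # 2 simulated ranks
batches = [([1.0, 0.0, 1.0, 0.0], 1.0), ([0.0, 1.0, 0.0, 1.0], 2.0)]
print(fsdp_step(shards, batches, lr=0.1))
```

In a real system the gathered parameters are freed right after use, layer by layer, so the persistent per-rank state stays at roughly 1/N of the model.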

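LoRA / QLoRA and Runtime Engines: cheaper training is not inference magic

LoRA accelerates fine-tuning: fewer trainable parameters and less optimizer state to store. Whether inference gets faster depends on whether the adapter is merged, whether kernels are fused, and whether dynamic multi-adapter serving adds scheduling overhead.

LoRA math and graph compilation

\[ W'=W+\Delta W,\qquad \Delta W=BA,\qquad A\in\mathbb{R}^{r\times d_{in}},\;B\in\mathbb{R}^{d_{out}\times r}. \]

With rank \(r\ll\min(d_{in},d_{out})\), the adapter holds \(r(d_{in}+d_{out})\) trainable parameters instead of \(d_{in}d_{out}\).

A runtime engine uses graph capture, operator fusion, layout transformation, shape specialization, kernel autotuning, and CUDA graph replay to reduce temporary tensors, kernel launches, and Python dispatch overhead.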

Pseudocode: LoRA / QLoRA Training and Merge
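
A small sketch of the LoRA path in plain Python: a forward pass with a frozen base weight plus a low-rank branch, one SGD step that touches only `A` and `B`, and a merge that folds the adapter into the weight. The matrices, learning rate, and function names are assumptions. The QLoRA part (storing `W` in 4-bit and dequantizing on the fly) is not shown.

```python
def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def lora_forward(W, A, B, x):
    """y = W x + B (A x): frozen base path plus the low-rank update path."""
    base, low_rank = matvec(W, x), matvec(B, matvec(A, x))
    return [b + l for b, l in zip(base, low_rank)]


def lora_sgd_step(W, A, B, x, target, lr):
    """One SGD step on 0.5 * ||y - target||^2 that updates A and B but never W."""
    r = len(A)
    ax = matvec(A, x)
    err = [y - t for y, t in zip(lora_forward(W, A, B, x), target)]
    bt_err = [sum(B[i][k] * err[i] for i in range(len(B))) for k in range(r)]
    new_B = [[B[i][k] - lr * err[i] * ax[k] for k in range(r)] for i in range(len(B))]
    new_A = [[A[k][j] - lr * bt_err[k] * x[j] for j in range(len(x))] for k in range(r)]
    return new_A, new_B


def merge(W, A, B):
    """Fold the adapter into the base weight: W' = W + B A."""
    return [[W[i][j] + sum(B[i][k] * A[k][j] for k in range(len(A)))
             for j in range(len(W[0]))] for i in range(len(W))]


def loss(W, A, B, x, target):
    return 0.5 * sum((y - t) ** 2 for y, t in zip(lora_forward(W, A, B, x), target))


W = [[0.2, -0.1, 0.0, 0.3], [0.5, 0.1, -0.2, 0.0], [-0.3, 0.4, 0.1, 0.2]]  # frozen, 3x4
A = [[0.1, 0.2, 0.0, 0.1], [0.0, -0.1, 0.3, 0.2]]                          # r=2, 2x4
B = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]                                   # zero init, 3x2
x, target = [1.0, -0.5, 0.25, 2.0], [0.5, -1.0, 2.0]

A1, B1 = lora_sgd_step(W, A, B, x, target, lr=0.1)
W_merged = merge(W, A1, B1)
print(loss(W, A, B, x, target), loss(W, A1, B1, x, target))
```

After merging, inference uses a single matrix of the original shape, so the adapter adds no extra matmul; serving many unmerged adapters at once is where extra scheduling cost can appear.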

Pseudocode: Graph Capture / Kernel Fusion
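
A toy illustration of the idea in plain Python: operations are first recorded into a graph, then either replayed one by one (one pass over the data and one temporary list per op) or fused into a single per-element function (one pass, no intermediates). The `Graph` class is hypothetical and stands in for what a compiler or runtime engine does with real kernels.

```python
class Graph:
    """Toy 'captured graph': a recorded list of elementwise ops."""

    def __init__(self):
        self.ops = []

    def record(self, fn):
        self.ops.append(fn)
        return self

    def run_unfused(self, xs):
        passes = 0
        for fn in self.ops:              # one full pass and one temporary list per op
            xs = [fn(x) for x in xs]
            passes += 1
        return xs, passes

    def run_fused(self, xs):
        def fused(x):                    # all recorded ops composed into one "kernel"
            for fn in self.ops:
                x = fn(x)
            return x
        return [fused(x) for x in xs], 1  # single pass, no intermediate lists


graph = (Graph()
         .record(lambda x: x * 2.0)      # scale
         .record(lambda x: x + 1.0)      # bias
         .record(lambda x: max(x, 0.0))) # ReLU
data = [-2.0, -0.5, 0.0, 1.5]
print(graph.run_unfused(data), graph.run_fused(data))
```

On a GPU the unfused version corresponds to three kernel launches with two round trips through device memory; fusion removes both the launches and the intermediate reads and writes.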

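Architecture Acceleration: from dense to conditional compute

MoE, sparse attention, dynamic depth, and similar methods send each token through only part of the compute path, but communication, the router, and load balancing determine the real speed.

MoE and sparse attention

\[ g(x)=\operatorname{TopK}(\operatorname{softmax}(W_rx),k),\qquad y=\sum_{i\in \operatorname{TopK}}g_i(x)E_i(x). \] \[ \text{local attention: }O(T^2)\rightarrow O(Tw). \]

Here \(W_r\) is the router weight, \(E_i\) the \(i\)-th expert, \(k\) the number of experts selected per token, and \(w\) the local attention window.

MoE Routing Lab

Adjust the number of experts, top-k, batch size, router imbalance, and capacity to see active parameters, dropped tokens, and expert utilization.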

Inputs (defaults): num experts 12; top-k 2; batch tokens 1024; router imbalance 35; capacity factor 1.20.

Output: dropped tokens (%).

Pseudocode: MoE Routing Forward
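
A minimal sketch of top-k routing with a per-expert capacity, following the gating formula above (softmax gates, no renormalization after top-k). The first-come capacity rule, the toy experts, and the function names are assumptions; production routers differ in how they handle overflow and balance load.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def moe_forward(tokens, router_w, experts, k, capacity):
    """Top-k routing with a per-expert capacity; overflow assignments are dropped."""
    load = [0] * len(experts)
    outputs, dropped = [], 0
    for x in tokens:
        logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_w]
        gates = softmax(logits)
        top = sorted(range(len(experts)), key=lambda i: gates[i], reverse=True)[:k]
        y = [0.0] * len(x)
        for i in top:
            if load[i] >= capacity:      # expert is full: this assignment is dropped
                dropped += 1
                continue
            load[i] += 1
            y = [a + gates[i] * b for a, b in zip(y, experts[i](x))]
        outputs.append(y)
    return outputs, load, dropped


router_w = [[1.0, 0.0], [0.0, 1.0]]                   # one router row per expert
experts = [lambda x: [2.0 * v for v in x],            # toy expert 0
           lambda x: [-v for v in x]]                 # toy expert 1
tokens = [[2.0, 0.0], [3.0, 0.0], [0.0, 1.0]]
print(moe_forward(tokens, router_w, experts, k=1, capacity=1))

# A common capacity rule (assumed): ceil(capacity_factor * tokens * k / num_experts).
print(math.ceil(1.20 * 1024 * 2 / 12))
```

In the toy run, the second token prefers an expert that is already full, so its assignment is dropped and its output is zero; this is the effect that router imbalance and a tight capacity factor produce at scale.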

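How generative model acceleration evolved

Do not just memorize paper names; what matters more is which bottleneck each milestone addressed.

2017: Transformer

Removed RNN recurrence so that training runs in parallel over the sequence.

2019: MQA / Megatron / ZeRO

Reduced decode-time KV bandwidth; addressed memory and parallelism for large-model training.

2020: DDIM

Enabled step-skipping sampling without retraining a DDPM, bringing diffusion down to tens of steps.

2022: FlashAttention / GPTQ / SmoothQuant / DPM-Solver

IO-aware attention, PTQ, and high-order solvers matured at the same time.

2023: vLLM / PagedAttention / AWQ / QLoRA / UniPC

LLM serving moved to dynamic batching and KV management; low-bit deployment and low-memory fine-tuning became widespread.

2024: FlashAttention-3 / DeepCache / DiT Cache

Attention kernels were adapted to new hardware; diffusion began reusing cross-step features systematically.

2025-2026: KV compression / adaptive serving / FP4-FP8

Long-context and multimodal serving shift further toward the memory hierarchy, cross-request reuse, and hardware low precision.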

Decision map: when it is slow, first determine where

A workable acceleration process is not "turn on every optimization"; it is to break down the bottleneck first and then choose the smallest intervention.

Slow LLM inference

Slow prefill: FlashAttention, chunked prefill, prompt prefix cache, FP8/INT8, tensor parallelism.

Slow decode: KV cache, MQA/GQA/MLA, PagedAttention, continuous batching, speculative decoding, KV quantization.

Slow diffusion image generation

NFE too high: DDIM, DPM-Solver, DPM-Solver++, UniPC, a better timestep schedule.

Single step too slow: FlashAttention, token merging, DeepCache / DiT cache, lower latent resolution, compiled runtime.

Slow video generation

The bottleneck is usually the product frames × resolution × denoising steps × temporal attention. Look first at video DiT cache, temporal window attention, frame chunking, decoder slicing, and latent temporal compression.

Slow training / not enough memory

Memory: mixed precision, activation checkpointing, gradient accumulation, ZeRO/FSDP, QLoRA, sequence packing.

Throughput: FlashAttention, fused optimizers, data loader profiling, parallelism degrees, communication overlap.

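Common misconceptions

The most dangerous thing in acceleration work is lumping different bottlenecks under one word.

Fewer FLOPs always means faster: not necessarily. LLM decode is often memory-bandwidth bound, and irregular memory access and kernel launches can cancel out the FLOPs reduction.
Quantization always speeds things up: quantization reduces storage, but whether it lowers latency depends on the hardware's low-precision kernels, the batch size, whether dequantization is fused, and the real bottleneck.
FlashAttention is approximate attention: it is not. It is an IO-aware implementation of exact attention, not a low-rank or sparse approximation.
KV cache only ever helps: the KV cache avoids recomputation, but with long contexts and high concurrency it becomes a major memory and bandwidth bottleneck.
Diffusion acceleration means distillation only: distillation matters, but DDIM, DPM-Solver, UniPC, caching, token merging, schedulers, and attention kernels also speed things up.
MoE has many parameters, so it must be slow: MoE activates only a few experts per token; the real bottleneck is often router imbalance and all-to-all communication.

Scope

This page is a map of mechanisms, not a substitute for real profiling. Before deploying any acceleration scheme, measure separately: prefill / decode, single request / batch, TTFT / TPOT, peak memory, kernel utilization, and quality metrics.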