About this Article
- Authors: Jonathan Ho, Ajay Jain, Pieter Abbeel
- Venue: Advances in Neural Information Processing Systems (NeurIPS) 33
- Year: 2020
- Official Citation: Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems (Vol. 33, pp. 6840-6851).
Accomplishments
- Developed the denoising diffusion probabilistic model (DDPM), which is now widely used in image generation.
Key Points
1. Idea
A diffusion model consists of two processes.
- Forward process (or diffusion process): gradually add Gaussian noise to the data (an image), eventually destroying the original signal.
- Reverse process: learn to remove the noise step by step, recovering the original data from the noised data.
- The goal of training is to minimize the mismatch between the forward and reverse processes at each step.
2. Detailed Equations
1. Forward Process
$q(x_{1:T} \vert x_{0}) := \prod_{t=1}^{T}q(x_{t} \vert x_{t-1}), \quad q(x_{t} \vert x_{t-1}) := \mathcal{N}(x_{t};\sqrt{1-\beta_{t}}x_{t-1},\beta_{t}I)$
- $q(x_{1:T} \vert x_{0})$: the probability of the entire trajectory along which an image $x_0$ is gradually corrupted.
- $q(x_{t} \vert x_{t-1})$: a single step of the forward (noising) process; a minimal sketch of both follows below.
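As a concrete illustration, here is a minimal NumPy sketch of the forward process, assuming an illustrative linear $\beta_t$ schedule (the schedule values, array names, and helper functions are not taken from the paper). It also uses the standard closed-form marginal $x_t = \sqrt{\overline{\alpha}_t}\,x_0 + \sqrt{1-\overline{\alpha}_t}\,\epsilon$, the same expression that appears inside the loss further below.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # noise schedule beta_t (illustrative linear schedule)
alphas = 1.0 - betas                    # alpha_t = 1 - beta_t
alpha_bars = np.cumprod(alphas)         # alpha_bar_t = prod_{s <= t} alpha_s

def forward_step(x_prev, t, rng):
    """One step q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) x_{t-1}, beta_t I)."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - betas[t]) * x_prev + np.sqrt(betas[t]) * noise

def forward_marginal(x0, t, rng):
    """Sample x_t directly from x_0: x_t = sqrt(alpha_bar_t) x_0 + sqrt(1 - alpha_bar_t) eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps, eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((3, 32, 32))   # a dummy "image"
xt, eps = forward_marginal(x0, t=500, rng=rng)
```

Being able to sample $x_t$ directly from $x_0$ is what makes training efficient: any timestep can be reached in one draw instead of $t$ sequential noising steps.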
2. Reverse Process
$p_{\theta}(x_{0:T}) := p(x_{T})\prod_{t=1}^{T}p_{\theta}(x_{t-1} \vert x_{t}), \quad p_{\theta}(x_{t-1} \vert x_{t}) := \mathcal{N}(x_{t-1};\mu_{\theta}(x_{t},t),\Sigma_{\theta}(x_{t},t))$
- $p_{\theta}(x_{0:T})$: the probability of the entire trajectory along which a noisy image is recovered.
- $p_{\theta}(x_{t-1} \vert x_{t})$: a single reverse step; the model predicts the mean and variance of the previous step from the current noisy data (see the sampling sketch below).
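At generation time the reverse process is applied as an ancestral sampling loop: start from $x_T \sim \mathcal{N}(0, I)$ and repeatedly sample $x_{t-1} \sim p_{\theta}(x_{t-1} \vert x_t)$. The sketch below shows only this structure; `predict_mean` is a hypothetical stand-in for the learned network that outputs $\mu_{\theta}(x_t, t)$, and the variance is fixed to $\sigma_t^2 = \beta_t$, one of the untrained choices discussed in the paper.

```python
import numpy as np

def sample(predict_mean, betas, shape, rng):
    """Ancestral sampling: x_T ~ N(0, I), then x_{t-1} ~ N(mu_theta(x_t, t), sigma_t^2 I).

    `predict_mean(x_t, t)` is a hypothetical stand-in for the trained network;
    sigma_t^2 is fixed to beta_t here.
    """
    T = len(betas)
    x = rng.standard_normal(shape)      # x_T ~ p(x_T) = N(0, I)
    for t in reversed(range(T)):
        mean = predict_mean(x, t)       # mu_theta(x_t, t)
        if t > 0:
            x = mean + np.sqrt(betas[t]) * rng.standard_normal(shape)
        else:
            x = mean                    # final step: return the predicted mean as x_0
    return x
```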
3. Loss Function
\(\mathbb{E}\left[-\log p_{\theta}(x_{0})\right] \le \mathbb{E}_{q}\left[-\log\frac{p_{\theta}(x_{0:T})}{q(x_{1:T} \vert x_{0})}\right] = \mathbb{E}_{q}\left[-\log p(x_{T}) - \sum_{t \ge 1}\log\frac{p_{\theta}(x_{t-1} \vert x_{t})}{q(x_{t} \vert x_{t-1})}\right] =: L\)
Although the equation looks complex, its purpose is clear: minimize the mismatch between the forward process $q(x_{t} \vert x_{t-1})$ and the reverse process $p_{\theta}(x_{t-1} \vert x_{t})$. For computational convenience, however, the authors rewrite this bound (equation (3) in the paper) in the following form.
\[\mathbb{E}_{q}\left[ \underbrace{D_{KL}(q(x_T \vert x_0) \,\Vert\, p(x_T))}_{L_T} + \sum_{t>1} \underbrace{D_{KL}(q(x_{t-1} \vert x_t, x_0) \,\Vert\, p_{\theta}(x_{t-1} \vert x_t))}_{L_{t-1}} - \underbrace{\log p_{\theta}(x_0 \vert x_1)}_{L_0} \right]\]
- $L_T$: a constant with no trainable parameters; the divergence between the distribution of the fully noised data $q(x_T \vert x_0)$ and the prior noise distribution $p(x_T)$.
- $L_{t-1}$: the core of training; for each step, the divergence between the true forward posterior ($q$, the answer) and the model's prediction ($p_{\theta}$), summed over all steps (a closed-form sketch follows below).
- $L_0$: the reconstruction term for the final denoising step $x_1 \to x_0$.
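Because both $q(x_{t-1} \vert x_t, x_0)$ and $p_{\theta}(x_{t-1} \vert x_t)$ are Gaussians, each $L_{t-1}$ term can be evaluated in closed form rather than estimated by sampling. A minimal sketch of the KL divergence between two isotropic Gaussians (the helper name is illustrative):

```python
import numpy as np

def gaussian_kl(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, var_q I) || N(mu_p, var_p I) ), summed over dimensions.

    For L_{t-1}, q is the forward posterior q(x_{t-1} | x_t, x_0) and
    p is the model's prediction p_theta(x_{t-1} | x_t).
    """
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )
```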
3. Reverse Process and $L_{1:T-1}$
What should the reverse process predict? The authors fix $\Sigma_{\theta}$ in equation (1) to an untrained constant, so the only quantity the model must learn is the mean $\mu_{\theta}$. Through analysis of the loss and a reparameterization, they show that $\mu_{\theta}$ can be written as the following equation (equation (11) in the paper). It reveals that the only thing the model needs to predict is $\epsilon$ (a noise vector drawn from a Gaussian), because $x_t$ is already given as input.
\[\mu_{\theta}(x_{t},t) = \frac{1}{\sqrt{\alpha_{t}}}\left(x_{t} - \frac{\beta_{t}}{\sqrt{1-\overline{\alpha}_{t}}}\,\epsilon_{\theta}(x_{t},t)\right)\]
With this parameterization and further simplification, the loss term can be written as follows.
\[\mathbb{E}_{x_{0},\epsilon}\left[\frac{\beta_{t}^{2}}{2\sigma_{t}^{2}\alpha_{t}(1-\overline{\alpha}_{t})} \left\Vert \epsilon - \epsilon_{\theta}\left(\sqrt{\overline{\alpha}_{t}}\,x_{0}+\sqrt{1-\overline{\alpha}_{t}}\,\epsilon,\; t\right) \right\Vert^{2}\right]\]
What the reverse process needs to predict is therefore the noise $\epsilon$.
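In code, equation (11) is a direct translation: given the predicted noise, the reverse-step mean is recovered from $x_t$. A minimal sketch, where `eps_model` is a hypothetical stand-in for the trained network $\epsilon_{\theta}$ and the schedule arrays mirror those in the forward-process sketch above:

```python
import numpy as np

def mu_theta(eps_model, x_t, t, betas, alphas, alpha_bars):
    """Equation (11): mu_theta = (1/sqrt(alpha_t)) * (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps_theta(x_t, t))."""
    eps = eps_model(x_t, t)                          # predicted noise eps_theta(x_t, t)
    coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
    return (x_t - coef * eps) / np.sqrt(alphas[t])
```

This function could serve as the `predict_mean` stand-in in the sampling sketch earlier.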
4. Simplified Training Objective
Equation (12) contains the complicated weight $\frac{\beta_{t}^{2}}{2\sigma_{t}^{2}\alpha_{t}(1-\overline{\alpha}_{t})}$. As $t$ grows, $(1-\overline{\alpha}_{t})$ grows, so the weight becomes small for large $t$. This down-weights exactly the harder problems (the more heavily noised images), which is inefficient. The authors therefore drop the weight and simplify the loss function.
\[L_{\text{simple}}(\theta) := \mathbb{E}_{t,x_{0},\epsilon}\left[\left\Vert \epsilon-\epsilon_{\theta}\left(\sqrt{\overline{\alpha}_{t}}\,x_{0}+\sqrt{1-\overline{\alpha}_{t}}\,\epsilon,\; t\right)\right\Vert^{2}\right]\]
Removing the weight lets the model focus on the harder, noisier problems (large $t$).
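A single evaluation of $L_{\text{simple}}$ then reduces to a squared error between the true and predicted noise at a uniformly sampled timestep. A minimal sketch of one Monte Carlo sample of the objective (the network and the gradient step are omitted; names are illustrative):

```python
import numpy as np

def l_simple(eps_model, x0, alpha_bars, rng):
    """One Monte Carlo sample of L_simple:
    || eps - eps_theta(sqrt(alpha_bar_t) x0 + sqrt(1 - alpha_bar_t) eps, t) ||^2.
    """
    T = len(alpha_bars)
    t = rng.integers(T)                                 # t sampled uniformly (0-indexed here)
    eps = rng.standard_normal(x0.shape)                 # eps ~ N(0, I)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return np.sum((eps - eps_model(x_t, t)) ** 2)       # unweighted squared error
```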
Impact of research
- Positive: useful for data compression, making high-resolution data more accessible even as global internet traffic keeps increasing.
- Negative: can be used to produce fake images and videos.
- Negative: may reproduce biases present in its training datasets.