Fashion-VDM: Video Diffusion Model for Virtual Try-On

Johanna Karras1,2, Yingwei Li1, Nan Liu1, Luyang Zhu1,2,

Innfarn Yoo1, Andreas Lugmayr1, Chris Lee1, Ira Kemelmacher-Shlizerman1,2

1Google Research 2University of Washington

SIGGRAPH Asia 2024

We present Fashion-VDM, a video diffusion model (VDM) for generating virtual try-on videos. Given an input garment image and person video, our method aims to generate a high-quality try-on video of the person wearing the given garment, while preserving the person's identity and motion. Image-based virtual try-on has shown impressive results; however, existing video virtual try-on (VVT) methods still lack garment detail and temporal consistency. To address these issues, we propose a diffusion-based architecture for video virtual try-on, split classifier-free guidance for increased control over the conditioning inputs, and a progressive temporal training strategy for single-pass 64-frame, 512px video generation. We also demonstrate the effectiveness of joint image-video training for video try-on, especially when video data is limited. Our qualitative and quantitative experiments show that our approach sets a new state of the art for video virtual try-on.

Approach

$\text{Architecture.}$ Given a noisy video $z_t$ at diffusion timestep $t$, a forward pass of Fashion-VDM computes a single denoising step to get the denoised video $z'_{t-1}$. The input person video is preprocessed into person poses $J_p$ and clothing-agnostic frames $I_a$, while the garment image $I_g$ is preprocessed into the garment segmentation $S_g$ and garment poses $J_g$. The architecture follows [Zhu et al. 2024], except that the main UNet contains 3D-Conv and temporal attention blocks to maintain temporal consistency. Additionally, we inject temporal down/upsampling blocks during 64-frame temporal training. The noisy video $z_t$ is encoded by the main UNet, and the conditioning signals, $S_g$ and $I_a$, are encoded by separate UNet encoders. In the 8 DiT blocks at the lowest resolution of the UNet, the garment conditioning features are cross-attended with the noisy video features, while the spatially-aligned clothing-agnostic features $z_a$ are directly concatenated with the noisy video features. $J_g$ and $J_p$ are encoded by single linear layers, then concatenated with the noisy features in all 2D spatial layers of the UNet.
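As a rough illustration of this conditioning scheme, the sketch below shows a single fusion block in which garment features are injected via cross-attention and spatially-aligned clothing-agnostic features via channel concatenation. The module, token shapes, and dimensions are illustrative assumptions, not the exact Fashion-VDM implementation.

import torch
import torch.nn as nn

class ConditioningFusionBlock(nn.Module):
    """Illustrative sketch: cross-attend garment features, concatenate agnostic features."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.merge = nn.Linear(2 * dim, dim)  # fuse channel-concatenated agnostic features

    def forward(self, z, garment_feats, agnostic_feats):
        # z, agnostic_feats: (B, N, dim) noisy-video / clothing-agnostic tokens (spatially aligned)
        # garment_feats:     (B, M, dim) tokens from the garment encoder
        attn_out, _ = self.cross_attn(query=z, key=garment_feats, value=garment_feats)
        z = z + attn_out                                        # garment conditioning via cross-attention
        z = self.merge(torch.cat([z, agnostic_feats], dim=-1))  # agnostic conditioning via concatenation
        return z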

$\text{Progressive Temporal Training.}$ Fashion-VDM is trained in multiple phases of increasing frame length. We first pretrain an image model by training only the spatial layers on our image dataset. In subsequent phases, we train the temporal and spatial layers on increasingly long clips of consecutive frames from our video dataset.
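The schedule below is a minimal sketch of this training strategy. Only the overall pattern (spatial-only image pretraining followed by progressively longer video phases ending at 64 frames) comes from the text; the intermediate frame counts and the Phase/trainable_groups helpers are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Phase:
    frames: int           # clip length sampled per training example
    dataset: str          # "images" for pretraining, "videos" afterwards
    train_temporal: bool  # whether temporal layers are trained in this phase

# Intermediate frame counts are assumptions; the final 64-frame phase is stated in the paper.
SCHEDULE = [
    Phase(frames=1,  dataset="images", train_temporal=False),  # spatial-only image pretraining
    Phase(frames=8,  dataset="videos", train_temporal=True),
    Phase(frames=16, dataset="videos", train_temporal=True),
    Phase(frames=64, dataset="videos", train_temporal=True),   # single-pass 64-frame training
]

def trainable_groups(phase: Phase) -> list[str]:
    """Parameter groups to optimize in a given phase."""
    return ["spatial", "temporal"] if phase.train_temporal else ["spatial"]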

$\text{Split Classifier-Free Guidance.}$ We introduce split-CFG, a generalization of dual-CFG that allows independent control over multiple conditioning signals. Each setting of the guidance weights represents a different weighting over our conditional results: unconditional, garment conditioning, (person, garment) conditioning, and (person, garment, pose) conditioning, respectively. Precisely, let $\epsilon_\theta$ denote the trained Fashion-VDM model, $C = \{c_i\}$ the set of conditioning signal groups, $W = \{w_i\}$ the weights per conditioning signal group, and $z_t$ the noisy video at diffusion timestep $t$. In each inference step $i$, we add the next conditioning signal group $c_i$ to the cumulative set of conditioning signals $c$, predict the noise $\hat{\epsilon}_i = \epsilon_\theta(z_t, c)$, and add the weighted difference between the current prediction $\hat{\epsilon}_i$ and the previous prediction $\hat{\epsilon}_{i-1}$ to the accumulated noise estimate.
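A minimal sketch of this inference-time procedure is given below, assuming a denoiser callable eps_theta(z_t, t, conds) and conditioning groups passed as dicts; the names and the cumulative-sum form $\hat{\epsilon} = \hat{\epsilon}_0 + \sum_i w_i (\hat{\epsilon}_i - \hat{\epsilon}_{i-1})$ are our reading of the description above, not code from the paper.

def split_cfg(eps_theta, z_t, t, cond_groups, weights):
    """Split classifier-free guidance (illustrative sketch).

    eps_theta: callable eps_theta(z_t, t, conds) -> noise prediction (assumed interface)
    cond_groups: ordered conditioning groups, e.g. [garment, person, pose], each a dict
    weights: guidance weight w_i for each conditioning group
    """
    conds = {}                                  # cumulative set of conditioning signals c
    eps_prev = eps_theta(z_t, t, conds)         # unconditional prediction eps_hat_0
    eps_out = eps_prev
    for c_i, w_i in zip(cond_groups, weights):
        conds = {**conds, **c_i}                # add the next conditioning group to c
        eps_i = eps_theta(z_t, t, conds)        # prediction conditioned on groups 1..i
        eps_out = eps_out + w_i * (eps_i - eps_prev)  # weighted difference of successive predictions
        eps_prev = eps_i
    return eps_out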

Video Try-On Demo

Each demo example shows the input person video, the generated try-on video, and the input garment image.

Method Comparisons

For each input garment and input person, we show results from M&M VTO, Magic Animate, Animate Anyone, and our method.

Limitations

The main limitations of Fashion-VDM include inaccurate body shape, artifacts, and incorrect details in occluded garment regions. Body shape misrepresentation occurs because, while the person keypoints encode rough body shape, detailed body measurements are lost. Another limitation is artifacts around the body/cloth boundary, often caused by inaccuracies in the clothing-agnostic image, which may leak artifacts from the original garment. In our human evaluation, 10 of the 17 failed videos had agnostic errors. Lastly, improbable details may be hallucinated in unseen garment regions, because the input image shows only one view of the garment. Other errors include minor flickering on fine-grained patterns. Future work might consider multi-garment conditioning and individual person customization for improved garment and person fidelity.

Bibtex

@InProceedings{Karras_FashionVDM_2024,
  author    = {Karras, Johanna and Li, Yingwei and Liu, Nan and Zhu, Luyang and Yoo, Innfarn and Lugmayr, Andreas and Lee, Chris and Kemelmacher-Shlizerman, Ira},
  title     = {Fashion-VDM: Video Diffusion Model for Virtual Try-On},
  booktitle = {Proceedings of ACM SIGGRAPH Asia 2024},
  month     = {December},
  year      = {2024},
}

Acknowledgements

This work was done when all authors were at Google. We would like to thank Chunhui Gu, Alan Yang, Varsha Ramakrishnan, Tyler Zhu, Srivatsan Varadharajan, Yasamin Jafarian and Ricardo Martin-Brualla for their insightful discussions. We are grateful for the kind support of the whole Google ARML Commerce organization. We especially thank Hayes Helsper for his expertise in figure design.