FIT: A Large-Scale Dataset for Fit-Aware
Virtual Try-On

1University of Washington, 2Google Research
*Equal contribution, listed alphabetically

SIGGRAPH 2026

We present the FIT Dataset: a dataset and benchmark designed for fit-aware virtual try-on, featuring diverse garment fits (e.g., tight, loose) and precise size annotations. Left: Example dataset triplets consisting of (1) the conditioning garment image (top), (2) the conditioning person image (middle), and (3) the target try-on image (bottom). Right: Visualization of the corresponding person and garment measurement annotations. Backgrounds removed for clarity.

Abstract

Given a person and a garment image, virtual try-on (VTO) aims to synthesize a realistic image of the person wearing the garment while preserving their original pose and identity. Although recent VTO methods excel at visualizing garment appearance, they largely overlook a crucial aspect of the try-on experience: the accuracy of garment fit -- for example, depicting how an extra-large shirt looks on an extra-small person. A key obstacle is the absence of datasets that provide precise garment and body size information, particularly for "ill-fit" cases, where garments are significantly too large or too small. Consequently, current VTO methods default to generating well-fitted results regardless of the garment or person size.

In this paper, we take the first steps towards solving this open problem. We introduce FIT (Fit-Inclusive Try-on), a large-scale VTO dataset comprising over 1.13M try-on image triplets accompanied by precise body and garment measurements. We overcome the challenges of data collection via a scalable synthetic strategy: (1) We programmatically generate 3D garments using GarmentCode and drape them via physics simulation to capture realistic garment fit. (2) We employ a novel re-texturing framework to transform synthetic renderings into photorealistic images while strictly preserving geometry. (3) We introduce person identity preservation into our re-texturing model to generate paired person images (same person, different garments) for supervised training. Finally, we leverage our FIT dataset to train a baseline fit-aware virtual try-on model. Our data and results set a new state of the art for fit-aware virtual try-on and offer a robust benchmark for future research. We will make all data and code publicly available.

FIT Dataset

Browse the FIT Dataset

The FIT dataset contains 1,137,282 training and 1,000 test samples, each a tuple $(I_{\text{try-on}}, I_p, I_g, m_p, m_g)$: a target try-on image, a paired person image, a layflat garment image, and the corresponding person and garment measurements. FIT covers 168 distinct body shapes (82 men's, 86 women's) in sizes XS-3XL, 528 body poses, and 158,483 unique garment designs. The garments come in a diverse range of fits, from loose to tight. Refer to our paper for additional dataset statistics.
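For concreteness, the sketch below shows one way a sample tuple could be loaded. The directory layout, file names, and JSON keys are our own assumptions for illustration; they are not the official FIT release format.

```python
# Hypothetical loader for one FIT sample. Paths, file names, and JSON keys
# below are assumptions for illustration, not the official FIT layout.
import json
from dataclasses import dataclass
from pathlib import Path

from PIL import Image


@dataclass
class FITSample:
    try_on: Image.Image   # I_try-on: target try-on image
    person: Image.Image   # I_p: paired person image (same person, different garment)
    garment: Image.Image  # I_g: layflat garment image
    m_person: dict        # m_p: body measurements in cm (e.g., bust, waist, hips, height)
    m_garment: dict       # m_g: garment measurements in cm (e.g., width, length, sleeve)


def load_sample(root: Path, sample_id: str) -> FITSample:
    d = root / sample_id
    meta = json.loads((d / "measurements.json").read_text())
    return FITSample(
        try_on=Image.open(d / "try_on.png"),
        person=Image.open(d / "person.png"),
        garment=Image.open(d / "garment.png"),
        m_person=meta["person"],
        m_garment=meta["garment"],
    )
```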


Select a data sample to view the try-on image, paired-person image, layflat garment image, and measurements.

[Interactive browser: adjust Garment Fit (tighter to looser) and Person Size (smaller to larger) to explore samples.]

FIT Dataset Comparison

We compare FIT to related datasets. FIT is the first large-scale dataset for virtual try-on that provides photorealistic images, ill-fit examples, precise measurement information, and ground-truth paired-person images. For scale, we report the number of training images.

Dataset         Realism   Ill-Fit   Size Info   Triplet   Scale
SV-VTO             ✓         ✓          ✓          ✓        1,524
SIZER              ✓         ✓          ✓          ✗        2,000
DeepFashion3D      ✗         ✗          ✗          ✗        2,078
ViTON-HD           ✓         ✗          ✗          ✗       11,647
LAION-Garment      ✓         ✗          ✗          ✓          60K
SewFactory         ✓         ✗          ✓          ✗           1M
GCD                ✗         ✗          ✓          ✗         115K
Ours               ✓         ✓          ✓          ✓        1.13M

Garment Resizing Demo

Try on different sizes of the same garment!

[Interactive demo: choose a person size (XS to 3XL) and a garment size (XS to 3XL) to view the corresponding try-on result.]

Fit-Aware Virtual Try-On Demo

[Interactive gallery of person / garment / try-on result triplets. Shown: S person, XL garment.]

Comparisons to State-of-the-Art

Prior virtual try-on methods excel at garment appearance transfer and generate aesthetically pleasing images. However, they lack precise measurement conditioning and instead hallucinate garment fit. In contrast, Fit-VTO generates high-quality virtual try-on results while also being measurement-conditioned. Below, we show how Fit-VTO excels on synthetic FIT images with measurement conditioning, as well as on real-world VITON-HD images (without measurements).

FIT Dataset

[Qualitative comparisons on FIT test samples. Each row shows the inputs (person, garment, and measurements), followed by results from Any2AnyTryOn, Nano Banana Pro, COTTON, IDM-VTON, ours, and the ground truth.]

Cases shown:
- XS person, XL garment (person: Bust 84 cm, Height 167 cm, Hips 89 cm, Waist 63 cm; garment: Width 117 cm, Length 51 cm, Sleeve 23 cm)
- XL person, XS garment (person: Bust 107 cm, Height 176 cm, Hips 107 cm, Waist 93 cm; garment: Width 98 cm, Length 47 cm, Sleeve 22 cm)
- XS person, 3XL garment (person: Bust 86 cm, Height 161 cm, Hips 89 cm, Waist 69 cm; garment: Width 123 cm, Length 51 cm, Sleeve 22 cm)
- XS person, 3XL garment (person: Bust 85 cm, Height 163 cm, Hips 90 cm, Waist 63 cm; garment: Width 161 cm, Length 60 cm, Sleeve 28 cm)
- S person, XL garment (person: Bust 96 cm, Height 176 cm, Hips 103 cm, Waist 63 cm; garment: Width 133 cm, Length 57 cm, Sleeve 24 cm)


VITON-HD Dataset

[Qualitative comparisons on real-world VITON-HD images (no measurement conditioning). Each row shows the inputs, followed by results from Any2AnyTryOn, Nano Banana Pro, COTTON, IDM-VTON, ours, and the ground truth.]

Method

FIT Dataset Generation Pipeline

(a) Overall workflow: We start by simulating a 3D garment on a target body via GarmentCode to render a synthetic image $I_s$. We generate a text prompt $p$ (via VLM) and a composite normal map $I_n$ (stitching estimated normals with realistic head/feet details). These condition our re-texturing model $f_\text{texture}$ to produce the try-on image $I_{\text{try-on}}$. Finally, we use $f_\text{paired}$ to generate a paired person image $I_p$, and a VLM to synthesize a layflat garment $I_g$. (b) GarmentCode simulation: Given a sampled design template, we compute sewing patterns for a specific body size. Then, we cross-drape these patterns onto a different target body, using box-mesh realignment to prevent simulation failures, and extract ground-truth measurements. (c) Using source and target garments draped on the same body, we derive an identity map $I_\text{id}$ by masking the garment in $I_{\text{try-on}}$. Conditioned on $I_\text{id}$, a paired normal map $I_n'$, and a paired prompt $p'$, $f_\text{paired}$ generates the paired person image $I_p$.
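As a small illustration of step (c), the sketch below derives an identity map by masking out the garment region of the try-on image, so that only identity cues (head, limbs, background) condition the paired-person generator. The mask source and the gray fill value are our assumptions; any garment segmenter producing a binary mask would serve.

```python
# Sketch of deriving the identity map I_id used in (c): gray out the garment
# region of I_try-on so only identity cues remain. The binary garment mask is
# assumed to come from an off-the-shelf garment segmenter.
import numpy as np
from PIL import Image


def identity_map(try_on: Image.Image, garment_mask: np.ndarray) -> Image.Image:
    """garment_mask: HxW boolean array, True where the garment is visible."""
    img = np.asarray(try_on.convert("RGB")).copy()
    img[garment_mask] = 127  # neutral gray fill over the masked garment region
    return Image.fromarray(img)
```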

Fit-Aware Virtual Try-On

Fit-VTO is a flow-based model built on the Flux.1-dev MMDiT architecture and fine-tuned with LoRA. It generates a try-on image $I_{\text{try-on}}$ given a layflat garment image $I_g$, a paired person image $I_p$, and person-garment measurements $m = [m_p, m_g]$. The image inputs $I_g$ and $I_p$ are first encoded separately into latents by a pre-trained VAE encoder. We replace the text embeddings in Flux.1-dev with custom measurement embeddings $m_{\text{embed}}$ computed from $m$. The person latents are channel-concatenated with the noisy target latents $z_t$, while the layflat garment latents and $m_{\text{embed}}$ are concatenated sequence-wise with $z_t$. After processing through the diffusion transformer, the clean latents are decoded by the VAE decoder.
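To make the conditioning scheme concrete, here is a minimal PyTorch sketch of the latent bookkeeping described above. The latent shapes, patch size, token width, measurement count, and projection layers are illustrative assumptions, not the actual Fit-VTO implementation.

```python
# Illustrative sketch of Fit-VTO's conditioning (all shapes/modules assumed).
import torch
import torch.nn as nn
import torch.nn.functional as F

B, C, H, W = 1, 16, 64, 64           # assumed VAE latent shape
d_model, patch = 256, 2              # assumed transformer width and patch size

z_t      = torch.randn(B, C, H, W)   # noisy target latents at timestep t
z_person = torch.randn(B, C, H, W)   # VAE latents of the person image I_p
z_garm   = torch.randn(B, C, H, W)   # VAE latents of the layflat garment I_g
m        = torch.randn(B, 7)         # [m_p, m_g], e.g., bust/height/hips/waist + width/length/sleeve

# 1) Person latents are channel-concatenated with the noisy target latents.
z_in = torch.cat([z_t, z_person], dim=1)                      # (B, 2C, H, W)

def patchify(z: torch.Tensor, p: int = patch) -> torch.Tensor:
    # (B, C', H, W) -> (B, N, C'*p*p) token sequence
    return F.unfold(z, p, stride=p).transpose(1, 2)

# 2) Linear projections bring both token streams to the transformer width.
proj_x = nn.Linear(2 * C * patch**2, d_model)
proj_g = nn.Linear(C * patch**2, d_model)
x_tokens = proj_x(patchify(z_in))                             # (B, N, d)
g_tokens = proj_g(patchify(z_garm))                           # (B, N, d)

# 3) Measurement embeddings stand in for Flux.1-dev's text embeddings.
m_mlp = nn.Sequential(nn.Linear(7, d_model), nn.SiLU(),
                      nn.Linear(d_model, d_model))
m_embed = m_mlp(m).unsqueeze(1)                               # (B, 1, d)

# 4) Garment tokens and m_embed are sequence-wise concatenated with the
#    target tokens before the diffusion transformer.
seq = torch.cat([m_embed, g_tokens, x_tokens], dim=1)         # (B, 1 + 2N, d)
print(seq.shape)  # torch.Size([1, 2049, 256]) with these assumed shapes
```

The key design point this mirrors is that person identity enters through the channel dimension (dense spatial alignment with the target), while the garment and the measurements enter through the sequence dimension, where attention can route them to the relevant spatial tokens.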

BibTeX

@article{fitvto2026,
  author    = {Karras, Johanna and Wang, Yuanhao and Li, Yingwei and Kemelmacher-Shlizerman, Ira},
  title     = {FIT: A Large-Scale Dataset for Fit-Aware Virtual Try-On},
  journal   = {SIGGRAPH},
  year      = {2026},
}