Skip to content

cs-wywang/VITON-HD

Repository files navigation

The Chinese version of the README is located in README-zh.md

Notice

This virtual try-on task is based on VITON-HD: High-Resolution Virtual Try-On via Misalignment-Aware Normalization (published in CVPR 2021). Since the official paper does not provide training code, we did not train the model ourselves. Instead, we used the pretrained model to perform clothing replacement. We also implemented the preprocessing steps including clothing segmentation, human parsing, and pose estimation.

To simplify testing, the system only requires two input images: one clothing image and one person image (both with background). These inputs are automatically converted into the six required inputs of the original VITON-HD model, as shown below:

input Description
cloth Clothing image with background removed
cloth_mask Binary mask of clothing (white clothes on black)
person Person image with background removed
person_parse Semantic segmentation of human body parts
openpose_img Pose keypoints visualization (color-coded)
openpose_json Pose keypoints coordinates in JSON format

Installation

VITON-HD

Windows environment:

Windows 10
GPU RTX 3070
cudnn 7.6.5
Cuda 3.7.10

Conda virtual environment:

python 3.8.18
cudatoolkit 8.0
torch 1.13.0+cu117
torchaudio 0.13.0+cu117
torchvision 0.14.0+cu117

Preprocess

1. Clothing Segmentation

We use a pretrained model from U2-Net. No additional dependencies are required.

2. Human Parsing

We use the AILIA model library. Python 3.6 or newer is required.

pip3 install ailia

In addition to the steps in the Tutorial, you also need to apply for the ailia SDK from ailia. Alternatively, you can directly apply by clicking Download a free evaluation version of ailia SDK. You will need to provide a valid email address to receive the download link and the license file. After downloading, move the license file (AILIA.lic) to the directory where bootstrap.py is located (ailia_sdk/python). Then, download the requirements.txt file from this page (https://github.com/axinc-ai/ailia-models) into the same directory as bootstrap.py, and run the following command to install the required dependencies on Windows, Mac, and Linux:

pip install -r requirements.txt

Note: For Jetson and Raspberry Pi installation details, please refer to the Tutorial.

Note: The ailia model expires after 30 days. Once expired, you need to reapply for the SDK and repeat the above steps.

3. Pose Estimation

We use the OpenPose implementation from the Windows Portable Version. Follow the steps in the Windows Portable Version to download the release. Then download the OpenPose models from Kaggle: openpose_model | Kaggle.

After that, run:

models/getBaseModels.bat
models/getCOCO_and_MPII_optional.bat

These commands will automatically download the required models. After downloading, place the images to be processed into the ./examples/media/ directory, and run the following command in the root directory:

bin\OpenPoseDemo.exe --image_dir examples\media --hand --write_images output\ --write_json output/ --disable_blending

This will perform pose estimation on all files in the ./examples/media/ directory.

Note: It is recommended to clean the media directory beforehand to avoid insufficient GPU memory or long processing time when using CPU.

Usage

A batch script run.bat has already been provided. When running on different machines, simply modify the absolute paths accordingly. Then execute the following command in the terminal:

run person_00 cloth.jpg

Here, person_00 is the name of the person image (without the .jpg or .png extension), and cloth.jpg is the clothing image.

After execution, it will automatically update the person-clothing pairs in datasets\test_pairs.txt. The result of the virtual try-on will be saved as "results\test\person_cloth.jpg".

Note: Before running, make sure both input images are resized to 768×1024, otherwise dimension mismatch errors may occur.

Principle

VITON-HD

For training a neural network, the ideal dataset would consist of:

  • Input: a person wearing their own clothes and a target clothing image
  • Ground truth: the same person wearing the target clothing

However, such datasets are difficult to obtain. In most cases, we only have product images and photos of models wearing those products (i.e., inputs without corresponding labels). Therefore, we adopt the VITON approach, which removes clothing information from the person image.

Since real-world images are often high-resolution, we use VITON-HD, which performs well at higher resolutions. It reduces artifacts caused by misalignment between warped clothing and the target region, producing results at a resolution of 1024×768.

The process is as follows:

  1. First, obtain human semantic segmentation and pose maps using existing methods. Then combine them with the original image and remove the clothing and arm regions.
  2. Next, use the segmentation image (without clothing and arms) as ground truth. Take the segmentation map (with clothing and arms removed), pose map, and target clothing as input, and use a U-Net-based CGAN to predict the segmentation map of the person wearing the target clothing.
  3. Then, use the segmentation (without clothing and arms), pose map, and the predicted segmentation (with target clothing). Extract the clothing region and apply TPS transformation to warp the target clothing so that it better fits the person’s pose.
  4. Finally, combine all the above information to generate the final image. Since the warped clothing cannot perfectly align with the original clothing region, misalignment may occur. To address this, the ALIAS (ALIgnment-Aware Segment normalization) module is used to reduce artifacts caused by misalignment. It can also generate skin regions previously occluded by clothing, and accurately reconstruct clothing details such as patterns, styles, and textures. Meanwhile, the original face, hands, pants, and other details of the person are well preserved, resulting in high-quality outputs.

image-20240429194550431

Preprocess

1. Clothing Segmentation

1.1 Clothing Segmentation

To perform clothing segmentation, we use U2-Net. U2-Net

Since our target is clothing segmentation, we selected the dataset [iMaterialist (Fashion) 2019 at FGVC6]. This dataset provides detailed annotations for different parts of clothing, achieving a level of accuracy that exceeds our requirements. For convenience, we chose this dataset.

In the output segmentation map, different parts of the clothing are labeled with different colors. For implementation details, please refer to the official U2-Net repository. The model I trained on my laptop performs reasonably well, but the repository Clothes Segmentation using U2NET provides models trained for more iterations, which yield better results. You can download them if needed.

The original clothing image is shown below:

cloth

1.2 Convert All Non-Black Pixels to White

Since the segmentation output assigns different colors to different parts of the clothing, while VITON-HD requires a binary (black-and-white) segmentation mask, this conversion step is necessary.

The resulting output is shown below:

cloth

1.3 Remove the Background of the Clothing Image Using the Binary Mask

The resulting output is shown below:

cloth

At this point, we have obtained the cloth and cloth_mask inputs required by VITON-HD.

2. Human Parsing

In the VITON-HD paper, the authors use the segmentation model from ACGPN for human parsing. However, that dataset does not include a label for the neck, while VITON-HD’s segmentation results do include a neck label.

Therefore, we use the ATR training set from Self-Correction-Human-Parsing. In this dataset, the neck and face are labeled with the same color. However, images generated based on the LIP dataset do not include a neck label and only annotate the face. Thus, we subtract the face region obtained from LIP from the combined face+neck region obtained from ATR to isolate the neck region, and then assign it a different color for labeling.

Next, all non-black pixels are converted to white to obtain the segmentation mask, and a similar process as in Section 1.3 is applied to remove the background and obtain the person image.

Since Self-Correction-Human-Parsing only provides GPU-based inference, we instead use another library that integrates multiple AI models: AILIA. This library conveniently includes both ATR and LIP models and provides converted ONNX models, allowing inference on CPU.

After segmentation, we found that the color labels differ from those in the VITON-HD dataset. Therefore, we need to convert the colors to match the VITON-HD format.

The color mapping is shown as follows:

ATR Output Map Sample Map
Palette Index Color Body Part Palette Index Color Body Part
0 [0, 0, 0]: Black Background 0 [0, 0, 0]: Black Background
2 [0, 128, 0]: Green Hair 2 [254, 0, 0]: Red Hair
4 [0, 0, 128]: Blue Clothes 5 [254, 85, 0]: Orange Clothes
5 [128, 0, 128]: Purple Pants 9 [0, 85, 85]: Dark Green \
11 [192, 128, 0]: Brownish Yellow Face + Neck 10 [85, 51, 0]: Brown Neck
12 [64, 0, 128]: Dark Purple Right Leg 12 [0, 128, 0]: Green Pants
13 [192, 0, 128]: Pink Left Leg 13 [0, 0, 254]: Blue Face
14 [64, 128, 128]: Light Blue Right Arm 14 [51, 169, 220]: Light Blue Right Arm
15 [192, 128, 128]: Skin Color Left Arm 15 [0, 254, 254]: Bright Blue Left Arm
16 [85, 254, 169]: Light Green Right Leg
17 [169, 254, 85]: Bright Green Left Leg

For details about the color conversion process, please refer to: Charlie839242/An-implementation-of-preprocess-in-VITON-HD-

At this point, we have obtained the person and person_parse inputs required by VITON-HD.

person_00person_00

3. Pose Estimation

To generate the pose images and keypoint coordinates required by VITON-HD, we use OpenPose. On Windows, we implement it using the Windows Portable Version. The rendered skeleton images are saved in the output directory, while the keypoint coordinates and related information are stored in JSON files.

In this way, we obtain both the pose images and the corresponding keypoint data in JSON format.

person_00_rendered

After completing the above preprocessing steps, we obtain all the required inputs for VITON-HD. This enables virtual try-on by simply providing a person image and a target clothing image (both with background). The method achieves good performance even on high-resolution images and effectively reduces artifacts.

References

About

An implementation of VITON-HD

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors