CogVideoX-Unofficial

Unofficial PyTorch Reproduction of

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

[Video Generation / arXiv 2024]

Paper · PDF · Issues · Release

This is an unofficial implementation maintained by @StaryMoon. If this repository helps your reading, reproduction, or course project, please consider giving it a star and following my GitHub profile.

News

  • 2026-06-10: Initial public release with official-style project structure, citation metadata, configuration, PyTorch interfaces, and smoke test.

Overview

This repository organizes a PyTorch implementation of CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer. It focuses on the paper's two core components: a text-to-video diffusion transformer with expert adaptive LayerNorm, and a 3D VAE that compresses video along both the spatial and temporal axes. The codebase follows the layout of a standard research repository so that model components, configuration files, scripts, and evaluation utilities can be extended independently.
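To make the 3D VAE compression concrete, the sketch below computes the latent grid size for a given clip. It assumes the 4x temporal and 8x spatial compression reported in the paper, and a causal encoder that encodes the first frame on its own. `latent_shape` is a hypothetical helper written for illustration and is not part of this repository's package.

```python
from typing import Tuple


def latent_shape(
    num_frames: int,
    height: int,
    width: int,
    temporal_factor: int = 4,  # assumed from the paper, not read from this repo
    spatial_factor: int = 8,   # assumed from the paper, not read from this repo
) -> Tuple[int, int, int]:
    """Return (latent_frames, latent_height, latent_width).

    Assumes the first frame is encoded alone and the remaining frames are
    grouped in chunks of `temporal_factor`, so num_frames must be
    k * temporal_factor + 1.
    """
    if num_frames < 1 or (num_frames - 1) % temporal_factor != 0:
        raise ValueError("num_frames must equal k * temporal_factor + 1")
    if height % spatial_factor or width % spatial_factor:
        raise ValueError("height and width must be divisible by spatial_factor")
    latent_frames = (num_frames - 1) // temporal_factor + 1
    return latent_frames, height // spatial_factor, width // spatial_factor


print(latent_shape(49, 480, 720))  # (13, 60, 90)
```

Under these assumptions a 49-frame 480x720 clip maps to a 13x60x90 latent grid, which is the sequence the diffusion transformer would then operate on.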

Main goals:

  • provide a clean PyTorch module layout for the paper;
  • keep training, inference, evaluation, and configuration entry points explicit;
  • track paper-reported metrics separately from local experiment logs;
  • make it easy for contributors to inspect, compare, and extend the implementation.

Repository Structure

CogVideoX-Unofficial/
├── configs/
│   └── default.yaml
├── scripts/
│   └── smoke_test.py
├── src/cogvideox_unofficial/
│   ├── __init__.py
│   └── model.py
├── CITATION.cff
├── README.md
├── requirements.txt
└── pyproject.toml

Installation

git clone https://github.com/StaryMoon/CogVideoX-Unofficial.git
cd CogVideoX-Unofficial
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

For CUDA-enabled experiments, install the PyTorch build matching your CUDA version from the official PyTorch website before installing the rest of the dependencies.
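After installing, it can help to confirm which PyTorch build actually ended up in the environment. The following is a minimal sketch; `describe_torch_build` is a hypothetical helper, not a script shipped with this repository, and it only uses `torch.__version__`, `torch.version.cuda`, and `torch.cuda.is_available()`.

```python
import importlib.util


def describe_torch_build() -> str:
    """Summarize the installed PyTorch build, or report that it is missing."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed"
    import torch  # imported lazily so the helper also runs without PyTorch

    cuda_build = torch.version.cuda  # None for CPU-only builds
    return (
        f"torch {torch.__version__}, built for CUDA {cuda_build}, "
        f"GPU available: {torch.cuda.is_available()}"
    )


print(describe_torch_build())
```

If the reported CUDA version is `None` on a GPU machine, the CPU-only wheel was probably installed and PyTorch should be reinstalled before the remaining dependencies.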

Quick Check

python scripts/smoke_test.py

Expected output:

output: (...)
loss: ...

Data Preparation

mkdir -p data/train data/val data/test checkpoints outputs

Recommended layout:

data/
├── train/
├── val/
└── test/

Keep private datasets, downloaded checkpoints, and generated outputs out of git. Dataset-specific converters can be added under scripts/ while preserving the public repository structure.
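The layout and the keep-out-of-git advice above can be scripted. This is a sketch under assumptions: `prepare_workspace` is a hypothetical helper, and the ignore entries are one possible choice rather than the repository's actual `.gitignore`.

```python
from pathlib import Path

# Directories from the recommended layout, plus local artifact folders.
LOCAL_DIRS = ("data/train", "data/val", "data/test", "checkpoints", "outputs")
# Assumed ignore entries; adjust to match the project's real .gitignore.
IGNORE_ENTRIES = ("data/", "checkpoints/", "outputs/")


def prepare_workspace(root: Path) -> None:
    """Create the local directories and make sure git ignores them."""
    for rel in LOCAL_DIRS:
        (root / rel).mkdir(parents=True, exist_ok=True)

    gitignore = root / ".gitignore"
    text = gitignore.read_text() if gitignore.exists() else ""
    present = {line.strip() for line in text.splitlines()}
    missing = [entry for entry in IGNORE_ENTRIES if entry not in present]
    if missing:
        # Keep existing content and avoid gluing onto an unterminated last line.
        prefix = "" if text == "" or text.endswith("\n") else "\n"
        gitignore.write_text(text + prefix + "\n".join(missing) + "\n")
```

Calling `prepare_workspace(Path("."))` from the repository root would be equivalent to the `mkdir -p` command above, and running it twice does not duplicate ignore entries.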

Training

Minimal module usage:

import torch
from cogvideox_unofficial import ModelConfig, UnofficialModel, reconstruction_loss

# Small configuration for a quick check.
config = ModelConfig(task="video", hidden_dim=128, num_layers=2, num_heads=4)
model = UnofficialModel(config)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Random stand-ins for a visual batch and its tokenized prompts.
image = torch.randn(2, 3, 64, 64)
token_ids = torch.randint(0, config.vocab_size, (2, 16))

# One optimization step: clear stale gradients, forward, backward, update.
optimizer.zero_grad()
out = model(image, token_ids=token_ids)
loss = reconstruction_loss(out.primary)
loss.backward()
optimizer.step()

Inference

import torch
from cogvideox_unofficial import UnofficialModel

model = UnofficialModel().eval()
with torch.no_grad():
    image = torch.randn(1, 3, 64, 64)
    y = model(image).primary
print(y.shape)

Evaluation

Suggested entry points:

python scripts/smoke_test.py
# Planned entry point; scripts/evaluate.py is not listed under Repository Structure yet:
# python scripts/evaluate.py --config configs/default.yaml --ckpt checkpoints/model.pt

Paper-reported values and local run values should be kept in separate columns so readers can distinguish citation numbers from local experiment logs.
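One way to keep the two sources apart is to store them in separate fields from the start. The sketch below is illustrative only: `MetricRow` and `render_table` are hypothetical names, and the example row uses a placeholder metric and value rather than any real result.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MetricRow:
    metric: str
    paper_value: Optional[float]  # number cited from the paper, if any
    local_value: Optional[float]  # number measured in a local run, if any


def render_table(rows: List[MetricRow]) -> str:
    """Render rows with paper-reported and local values in separate columns."""

    def fmt(value: Optional[float]) -> str:
        return "n/a" if value is None else f"{value:.2f}"

    lines = ["Metric | Paper-reported | Local run"]
    for row in rows:
        lines.append(f"{row.metric} | {fmt(row.paper_value)} | {fmt(row.local_value)}")
    return "\n".join(lines)


# Placeholder entry: no paper number recorded, made-up local value.
print(render_table([MetricRow("example_metric", None, 0.5)]))
```

Leaving a column as `n/a` instead of copying a number across makes it obvious which values still need a citation or a local run.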

Paper Results

For copyright and license clarity, this repository links to the original paper figures and tables instead of redistributing screenshots copied from the PDF. The table below tracks where readers can find the paper-reported results.

| Result Type | Paper Location | Source |
|---|---|---|
| Main quantitative comparison | Main paper tables | arXiv paper |
| Ablation study | Experiment / ablation sections | arXiv paper |
| Qualitative examples | Main paper figures and appendix | arXiv PDF |

Reproduction Log

| Date | Config | Split | Metric | Value | Notes |
|---|---|---|---|---|---|
| 2026-06-10 | configs/default.yaml | smoke check | forward pass | ok | package interface validation |
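New entries can be appended programmatically so the log keeps the same columns. This is a minimal sketch, assuming the log is mirrored in a CSV file; `append_log_row` and the file path are hypothetical and not part of the repository.

```python
import csv
from datetime import date
from pathlib import Path
from typing import Optional

# Column names taken from the reproduction log above.
FIELDS = ["Date", "Config", "Split", "Metric", "Value", "Notes"]


def append_log_row(
    path: Path,
    config: str,
    split: str,
    metric: str,
    value: str,
    notes: str = "",
    day: Optional[date] = None,
) -> None:
    """Append one run to a CSV log, writing the header if the file is new."""
    is_new = not path.exists()
    with path.open("a", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(
            {
                "Date": (day or date.today()).isoformat(),
                "Config": config,
                "Split": split,
                "Metric": metric,
                "Value": value,
                "Notes": notes,
            }
        )
```

For example, `append_log_row(Path("outputs/reproduction_log.csv"), "configs/default.yaml", "smoke check", "forward pass", "ok")` would record a row like the one above with today's date, provided the `outputs/` directory exists.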

Implementation Status

  • Package layout and install metadata
  • Core PyTorch module interfaces
  • Default config and smoke test
  • Paper citation and result-location index
  • Dataset-specific preprocessing scripts
  • Paper-specific training recipe
  • Evaluation and visualization scripts
  • Public checkpoints and model zoo entries

Model Zoo

| Model | Checkpoint | Config | Notes |
|---|---|---|---|
| default | TBA | configs/default.yaml | compact implementation interface |

Citation

If you find this repository useful, please cite the original paper:

@article{CogVideoX_2024,
  title   = {CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
  author  = {Zhuoyi Yang and Jiayan Teng and Wendi Zheng and Ming Ding and Shiyu Huang and Jiazheng Xu and Yuanming Yang and Wenyi Hong and others},
  journal = {arXiv preprint arXiv:2408.06072},
  year    = {2024}
}

Acknowledgements

  • Thanks to the authors of CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer for the original research.
  • Thanks to arXiv for open access to the paper metadata and manuscript.
  • This repository is inspired by standard open-source PyTorch research codebases.
  • The implementation is unofficial and all paper names, datasets, and trademarks belong to their respective owners.

License

This repository is released under the MIT License. The original paper, datasets, official code, project assets, and third-party dependencies remain governed by their own licenses.

Keywords

pytorch, unofficial-implementation, reproduction, cogvideox, video-generation, text-to-video, diffusion-transformer
