Bohao Li | Zhicheng Cao | Huixian Li | Yangming Guo
State-of-the-art whole-body pose estimators often lack robustness, producing anatomically implausible predictions in challenging scenes. We posit this failure stems from spurious correlations learned from visual context, a problem we formalize using a Structural Causal Model (SCM). The SCM identifies visual context as a confounder that creates a non-causal backdoor path, corrupting the model's reasoning. We introduce the Causal Intervention Graph Pose (CIGPose) framework to address this by approximating the true causal effect of visual evidence on pose. The core of CIGPose is a novel Causal Intervention Module: it first identifies confounded keypoint representations via predictive uncertainty and then replaces them with learned, context-invariant canonical embeddings. These deconfounded embeddings are processed by a hierarchical graph neural network that reasons over the human skeleton at both local and global semantic levels to enforce anatomical plausibility. Extensive experiments show CIGPose achieves a new state-of-the-art on COCO-WholeBody. Notably, our CIGPose-x model achieves 67.0% AP, surpassing prior methods that rely on extra training data. With the additional UBody dataset, CIGPose-x is further boosted to 67.5% AP, demonstrating superior robustness and data efficiency.
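Concretely, with the visual context acting as the confounder between visual features and the predicted pose, the interventional distribution targeted above is identifiable through the standard backdoor adjustment (here \(F\) denotes visual features, \(Y\) the pose, and \(c\) ranges over context configurations; this is the textbook identity, not a formula taken from the paper's derivation):

```latex
P(Y \mid do(F)) \;=\; \sum_{c} P(Y \mid F, c)\, P(c)
```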
- 📖 Introduction
- 🖼️ Visualizations
- 📊 Model Zoo & Results
- 🛠️ Installation
- 🏃 Usage
- 🙏 Acknowledgements
- 📄 License
- 📚 Citation
CIGPose is a whole-body pose estimation framework that improves robustness in challenging scenes (e.g., occlusion, clutter, difficult lighting) by explicitly addressing visual confounding via a causal perspective.
It is implemented as an MMPose project under `mmpose/projects/cigpose`.
- Causal formulation: Model visual context as a confounder and target the interventional distribution P(Y|do(F)) instead of the observational P(Y|F).
- Causal Intervention Module (CIM): Use predictive uncertainty to identify confounded keypoint embeddings and replace them with learned, context-invariant canonical embeddings.
- Hierarchical graph reasoning: Perform local (intra-part) and global (inter-part) message passing on deconfounded embeddings to enforce anatomical plausibility.
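As a rough illustration of how these two stages compose, here is a minimal sketch in plain Python. All names (`intervene`, `hierarchical_pass`, `PARTS`, the threshold `tau`) are hypothetical stand-ins, not identifiers from the actual CIGPose code, and simple averaging replaces the learned graph message passing.

```python
# Toy skeleton: 5 keypoints grouped into two "parts" (e.g. left arm, right arm).
# Embeddings are plain lists of floats; the real model uses learned tensors.
PARTS = [[0, 1, 2], [3, 4]]

def intervene(embeddings, uncertainties, canonical, tau=0.5):
    """CIM-style intervention: replace high-uncertainty (confounded) keypoint
    embeddings with context-invariant canonical embeddings."""
    return [canonical[k] if uncertainties[k] > tau else embeddings[k]
            for k in range(len(embeddings))]

def average(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def hierarchical_pass(embeddings, parts):
    """One round of local (intra-part) then global (inter-part) smoothing,
    a crude stand-in for the hierarchical graph message passing."""
    # Local: mix each keypoint with the mean of its own part.
    local = list(embeddings)
    for part in parts:
        mean = average([embeddings[k] for k in part])
        for k in part:
            local[k] = [(e + m) / 2 for e, m in zip(embeddings[k], mean)]
    # Global: mix each part-smoothed embedding with the whole-body mean.
    body = average(local)
    return [[(e + b) / 2 for e, b in zip(v, body)] for v in local]
```

In this sketch the gating is a hard threshold on a scalar uncertainty; the actual module operates on learned embeddings and canonical codes, but the control flow (detect confounded keypoints, substitute, then propagate over the skeleton graph) is the same.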
- (a) Quantitative comparison on COCO-WholeBody val: CIGPose achieves strong accuracy while remaining data-efficient.
- (b) Qualitative comparison vs. RTMPose-x: baseline predictions may latch onto spurious background cues; CIGPose mitigates this by intervening on confounded keypoint representations, producing more anatomically plausible poses.
- From left to right: input image, RTMPose-x, CIGPose-x.
- CIGPose is more robust under common confounders (e.g., occlusion and clutter), yielding cleaner and more coherent whole-body structures.
- Additional examples further illustrating the robustness gains and improved anatomical consistency brought by causal intervention + hierarchical graph reasoning.
| Config | Input Size | FLOPs (G) | Body AP | Foot AP | Face AP | Hand AP | Whole AP | ckpt |
|---|---|---|---|---|---|---|---|---|
| CIGPose-m | 256x192 | 2.3 | 69.0 | 64.3 | 82.1 | 49.7 | 59.9 | pth |
| CIGPose-l | 256x192 | 4.6 | 71.2 | 69.0 | 83.3 | 54.0 | 62.6 | pth |
| CIGPose-l | 384x288 | 10.7 | 73.0 | 72.0 | 88.3 | 59.8 | 66.3 | pth |
| CIGPose-x | 384x288 | 18.7 | 73.5 | 72.3 | 88.1 | 60.2 | 67.0 | pth |
| CIGPose-l+UBody | 256x192 | 4.6 | 71.3 | 66.2 | 83.4 | 55.5 | 63.1 | pth |
| CIGPose-l+UBody | 384x288 | 10.7 | 73.1 | 72.3 | 88.0 | 61.2 | 66.9 | pth |
| CIGPose-x+UBody | 384x288 | 18.7 | 73.5 | 70.3 | 88.4 | 62.6 | 67.5 | pth |
| Config | Input Size | FLOPs (G) | Params (M) | AP | AR | ckpt |
|---|---|---|---|---|---|---|
| CIGPose-m | 256x192 | 1.9 | 14 | 76.6 | 79.3 | pth |
| CIGPose-l | 256x192 | 4.2 | 28 | 77.6 | 80.3 | pth |
| CIGPose-l | 384x288 | 9.4 | 29 | 78.5 | 81.1 | pth |
| Config | Input Size | Params (M) | AP | AP easy | AP medium | AP hard | ckpt |
|---|---|---|---|---|---|---|---|
| CIGPose-m | 256x192 | 14.4 | 71.4 | 81.0 | 72.7 | 58.9 | pth |
| CIGPose-l | 256x192 | 28.4 | 73.7 | 82.8 | 75.1 | 61.2 | pth |
| CIGPose-l | 384x288 | 28.8 | 74.2 | 82.9 | 75.6 | 62.5 | pth |
| CIGPose-x | 384x288 | 50.4 | 75.8 | 84.2 | 77.3 | 63.6 | pth |
Our code is based on MMPose.
# 1. Create a conda environment
conda create -n cigpose python=3.8 -y
conda activate cigpose
# 2. Install PyTorch (adjust CUDA version as needed)
pip install torch==2.0.1 torchvision==0.15.2 --index-url https://download.pytorch.org/whl/cu118
# 3. Install MMCV and MMDetection
pip install -U openmim
mim install mmengine
mim install "mmcv>=2.0.1"
mim install "mmdet>=3.1.0"
# 4. Clone this repository
git clone https://github.com/53mins/CIGPose.git
cd CIGPose/mmpose
pip install -v -e .
Please refer to the MMPose guidelines for dataset preparation. Training and testing use the standard MMPose entry points:
cd mmpose
bash tools/dist_train.sh mmpose/projects/cigpose/wholebody_2d_keypoint/cigpose-l_8xb32-420e_coco-wholebody-384x288.py 8
bash tools/dist_test.sh mmpose/projects/cigpose/wholebody_2d_keypoint/cigpose-l_8xb32-420e_coco-wholebody-384x288.py path/to/checkpoint.pth 8
- This project is built on top of MMPose, and follows its training/testing utilities and dataset conventions.
This project is released under the license in the LICENSE file.
If you find CIGPose useful in your research, please consider citing:
@article{li2026cigpose,
title={CIGPose: Causal Intervention Graph Neural Network for Whole-Body Pose Estimation},
author={Li, Bohao and Cao, Zhicheng and Li, Huixian and Guo, Yangming},
journal={arXiv preprint arXiv:2603.09418},
year={2026}
}

