This is the official implementation of our CVPR 2026 paper, AXG-Reasoner: Error Detection and Explanation in Long Task Videos with Vision–Language Models.
If our work is helpful to your research, please consider citing our paper:
@InProceedings{Lee_2026_CVPR,
author = {Lee, Shih-Po and Elhamifar, Ehsan},
title = {AXG-Reasoner: Error Detection and Explanation in Long Task Videos with Vision-Language Models},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2026},
pages = {3421-3431}
}
To download the EgoPER and CaptainCook4D datasets, please visit GTG2Vid.
To download the EgoPER and CaptainCook4D datasets, please send a request to lee.shih@northeastern.edu with the following information:
- Your Full Name
- Institution/Organization
- Advisor/Supervisor Name
- Current Position/Title
- Email Address (with institutional domain name)
- Purpose (e.g., to download the dataset, the pre-trained weights, or both; for research or other purposes)
Create a data/ folder with the following structure and move/name directories accordingly.
- vc_v_features_10fps/: pre-extracted frame-wise features
- labels_10fps/: frame-wise labels
- action_object_state/: frame-wise captions
- frames_10fps/: extracted video frames
- clean_action_dict.json: denoised action descriptions

- data
  - clean_action_dict.json
  - EgoPER
    - action2idx.json
    - coffee/
      - vc_v_features_10fps/
      - labels_10fps/
      - frames_10fps/
      - action_object_state/
      - training.txt
      - test.txt
    - oatmeal/
    - pinwheels/
    - tea/
    - quesadilla/
  - CaptainCook4D
    - action2idx.json
    - breakfastburritos/
      - vc_v_features_10fps/
      - labels_10fps/
      - frames_10fps/
      - action_object_state/
      - training.txt
      - test.txt
    - cucumberraita/
    - microwaveeggsandwich/
    - ramen/
    - spicedhotchocolate/
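Before running the pipeline, it can help to confirm that the data/ folder matches the layout listed above. The following is a minimal sketch, not part of the repository: the function name `missing_entries` is hypothetical, and it assumes that every task folder contains the same six entries shown for coffee/ and breakfastburritos/, with training.txt and test.txt inside each task folder.

```python
from pathlib import Path

# Entries expected inside every task folder. Assumption: all tasks mirror the
# entries listed for coffee/ and breakfastburritos/ in the layout above.
TASK_ENTRIES = [
    "vc_v_features_10fps",
    "labels_10fps",
    "frames_10fps",
    "action_object_state",
    "training.txt",
    "test.txt",
]

TASKS = {
    "EgoPER": ["coffee", "oatmeal", "pinwheels", "tea", "quesadilla"],
    "CaptainCook4D": [
        "breakfastburritos",
        "cucumberraita",
        "microwaveeggsandwich",
        "ramen",
        "spicedhotchocolate",
    ],
}


def missing_entries(data_root):
    """Return expected paths (relative to data_root, POSIX style) that do not exist."""
    root = Path(data_root)
    expected = [Path("clean_action_dict.json")]
    for dataset, tasks in TASKS.items():
        expected.append(Path(dataset) / "action2idx.json")
        for task in tasks:
            for entry in TASK_ENTRIES:
                expected.append(Path(dataset) / task / entry)
    return [p.as_posix() for p in expected if not (root / p).exists()]


if __name__ == "__main__":
    for path in missing_entries("data"):
        print("missing:", path)
```

If you only downloaded one dataset or a subset of tasks, trim `TASKS` accordingly so that the report lists only what you actually expect to have.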
- Please set up your own environment with the following requirements, according to your hardware and firmware version.
Ensure your environment can run the following imports:
from sentence_transformers import SentenceTransformer
from transformers import AutoModelForCausalLM, AutoTokenizer
We use Qwen2.5-32B-Instruct from Hugging Face.
Ensure your environment can run the following imports:
from transformers import AutoProcessor
from vllm import LLM, SamplingParams
For more information, please visit vllm.
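A quick way to see whether the packages named above are installed is to look them up without importing them. This is a sketch under assumptions: the helper `find_missing` and the two environment labels are hypothetical, and the check only confirms that a package can be located. It does not confirm that the import itself succeeds on your hardware, so running the import lines directly remains the real test.

```python
import importlib.util

# Top-level packages taken from the import lines above. The two group labels
# are only illustrative names for the two environments.
REQUIRED = {
    "LLM environment": ["sentence_transformers", "transformers"],
    "VLM environment": ["transformers", "vllm"],
}


def find_missing(packages):
    """Return the packages that cannot be located in the active environment."""
    return [name for name in packages if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    for env_name, packages in REQUIRED.items():
        missing = find_missing(packages)
        status = "ok" if not missing else "missing " + ", ".join(missing)
        print(f"{env_name}: {status}")
```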
- Activate your environment before running, and create output/.
- Ensure the feature path in the code matches your local path.
- Change --dataset and --task accordingly.
- Default number of frames: 3
- Default number of clusters: see clust_config.json
- Embed frame-wise actions and object states with SentenceBERT, together with their temporal embeddings.
- Use k-means to generate subaction and object clusters.
- Remove bad clusters for both subaction and object clusters.
- Results will be saved in output/, named subactions_{num_clusters}_clust/ and obj_state_{num_clusters}_clust
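The embed-then-cluster idea in the steps above can be illustrated with a small, dependency-free sketch. It is not the repository's code: the toy 2-D vectors stand in for SentenceBERT embeddings (which in practice would come from `SentenceTransformer(...).encode(...)`), appending a scaled, normalized frame position is only one assumed way to add temporal information, and the helper names and the `weight` parameter are hypothetical. The step that removes bad clusters is not shown.

```python
import random


def add_temporal_feature(embeddings, weight=1.0):
    """Append each frame's normalized position in [0, 1], scaled by weight."""
    denom = max(len(embeddings) - 1, 1)
    return [list(vec) + [weight * i / denom] for i, vec in enumerate(embeddings)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's k-means; returns one cluster index per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: nearest center for every point.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy stand-ins for six frame-wise text embeddings (two obvious groups).
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1],
                  [0.0, 1.0], [0.1, 0.9], [0.0, 0.9]]
points = add_temporal_feature(toy_embeddings, weight=0.5)
cluster_ids = kmeans(points, k=2)
print(cluster_ids)
```

A larger `weight` makes frames that are far apart in time less likely to share a cluster, even when their descriptions are similar.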
python generate_subactions_objects.py --dataset EgoPER --task quesadilla
- Use LLMs to summarize the actions in each cluster, together with object information, to obtain the subaction.
- Check the following:
  - Whether the subaction already exists.
  - Whether the subaction conflicts with the action.
  - Whether the subaction is appropriate within the action.
- Results will be saved in output/, named /summarized_subactions_{num_clusters_dict[task]}_clust/
python summarize_subactions.py --dataset EgoPER --task quesadilla
- Average the pre-extracted frame features over the frames that match each subaction description.
- Results will be saved in output/, named v_subaction_features_{num_clusters_dict[task]}_clust
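The averaging step amounts to taking the mean feature vector over the frames assigned to a subaction. A minimal sketch follows, assuming the features are available as a list of equal-length vectors indexed by frame; the function name `average_features` and the toy values are hypothetical.

```python
def average_features(frame_features, frame_indices):
    """Mean of the feature vectors at the given frame indices."""
    if not frame_indices:
        raise ValueError("no frames matched this subaction")
    selected = [frame_features[i] for i in frame_indices]
    return [sum(column) / len(selected) for column in zip(*selected)]


# Toy frame features: four frames, two dimensions each.
features = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [9.0, 9.0]]

# Suppose frames 0, 1, and 2 match the subaction description.
subaction_feature = average_features(features, [0, 1, 2])
print(subaction_feature)  # [3.0, 2.0]
```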
python generate_subaction_features.py --dataset EgoPER --task quesadilla
- Perform graph-to-video alignment to localize the subactions within an action segment.
- Record the matched execution paths during the process.
- Keep the frequently matched paths and construct the AXG.
- Results will be saved in output/, named axg_{num_clusters_dict[task]}_clust
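The idea of keeping frequently matched paths and turning them into a graph can be sketched as follows. This is an illustration under assumptions, not the repository's procedure: the alignment that produces the paths is not shown, the count threshold `min_count` and the function name are hypothetical, and the subaction names are made-up toy data.

```python
from collections import Counter


def build_graph_from_paths(matched_paths, min_count=2):
    """Keep paths seen at least min_count times; return (kept_paths, edges)."""
    counts = Counter(tuple(path) for path in matched_paths)
    kept = [path for path, n in counts.items() if n >= min_count]
    edges = set()
    for path in kept:
        # Consecutive subactions in a kept path become directed edges.
        edges.update(zip(path, path[1:]))
    return kept, edges


# Toy matched execution paths recorded over three videos.
paths = [
    ["pick up tortilla", "place tortilla", "fold tortilla"],
    ["pick up tortilla", "place tortilla", "fold tortilla"],
    ["pick up tortilla", "fold tortilla"],
]
kept, edges = build_graph_from_paths(paths, min_count=2)
print(kept)
print(sorted(edges))
```

In this toy run the path seen only once is discarded, so the edge from "pick up tortilla" directly to "fold tortilla" does not enter the graph.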
python build_axg.py --dataset EgoPER --task quesadilla
- (Optional) If you would like to use predicted temporal action segmentation (TAS) results, create a tas_output/ folder and put the downloaded TAS results, or your own, inside.
- Align subactions in the AXG with the action segment.
- You will obtain matched subactions and dropped ones.
- Matched subactions are considered as potentially correct.
- Dropped subactions are considered as potentially erroneous.
- Results will be saved in output/, named data_for_vlm_{num_clusters_dict[task]}_clust_{tas_backbone}
- --tas_backbone can be one of gt, gtg2vid, fact, or egoped
python axg2vid.py --dataset EgoPER --task quesadilla
- Activate your vllm environment
cd vllm
- Run VLMs with prompts (subactions + actions, whether the segment is dropped) and selected frames
- Results will be saved in output/, named error_reasoning_{num_clusters_dict[task]}_clust_{tas_backbone}_{num_sampling_frames}f
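The VLM step passes a small number of selected frames per segment (three by default). How the repository selects them is not described here, so the sketch below assumes evenly spaced sampling over a segment given by inclusive start and end frame indices; the function name `sample_frame_indices` is hypothetical.

```python
def sample_frame_indices(start, end, num_frames=3):
    """Evenly spaced frame indices from the inclusive range [start, end].

    Assumption: uniform sampling. Duplicates that arise on very short
    segments are removed, so fewer than num_frames indices may be returned.
    """
    if num_frames <= 0 or end < start:
        return []
    if num_frames == 1:
        return [(start + end) // 2]
    step = (end - start) / (num_frames - 1)
    return sorted({start + round(i * step) for i in range(num_frames)})


print(sample_frame_indices(100, 200, 3))  # [100, 150, 200]
print(sample_frame_indices(10, 12, 3))    # [10, 11, 12]
```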
python run_axg-reasoner.py --dataset EgoPER --task quesadilla --numf 3
- Evaluate the results on F1@10, F1@25, and F1@50 for correct and error segments.
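For readers unfamiliar with F1@k, the sketch below follows the segmental F1 commonly used in temporal action segmentation: a predicted segment counts as a true positive when its IoU with an unmatched ground-truth segment of the same label reaches the threshold (0.10, 0.25, or 0.50). It is an illustration only; the evaluation script in this repository may differ in details, and the frame-wise "correct" / "error" labels and function names are assumptions.

```python
def get_segments(frame_labels):
    """Collapse frame-wise labels into (label, start, end), end exclusive."""
    segments = []
    start = 0
    for i in range(1, len(frame_labels) + 1):
        if i == len(frame_labels) or frame_labels[i] != frame_labels[start]:
            segments.append((frame_labels[start], start, i))
            start = i
    return segments


def segmental_f1(pred_labels, gt_labels, iou_threshold):
    """Segmental F1 at a given IoU threshold."""
    pred = get_segments(pred_labels)
    gt = get_segments(gt_labels)
    used = [False] * len(gt)
    tp = fp = 0
    for label, p_start, p_end in pred:
        best_iou, best_j = 0.0, -1
        for j, (g_label, g_start, g_end) in enumerate(gt):
            if g_label != label:
                continue
            inter = min(p_end, g_end) - max(p_start, g_start)
            union = max(p_end, g_end) - min(p_start, g_start)
            iou = max(inter, 0) / union
            if iou > best_iou:
                best_iou, best_j = iou, j
        # Each ground-truth segment can be matched at most once.
        if best_j >= 0 and best_iou >= iou_threshold and not used[best_j]:
            tp += 1
            used[best_j] = True
        else:
            fp += 1
    fn = used.count(False)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


gt = ["correct"] * 10 + ["error"] * 10
pred = ["correct"] * 12 + ["error"] * 8
for k in (0.10, 0.25, 0.50):
    print(f"F1@{int(k * 100)}: {segmental_f1(pred, gt, k):.3f}")
```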
python evaluate_error_detection.py --dataset EgoPER --task quesadilla
- Evaluate the results with LLMs and NLP metrics.
python evaluate_error_explanation.py --dataset EgoPER --task quesadilla
python evaluate_error_explanation.py --dataset EgoPER --task quesadilla --eval



