
Minimal loop + DTW: update data_loader; add smoke test #17

Merged
AmitMY merged 23 commits into sign-language-processing:main from ImYangYun:feat/minimal-loop-dtw on Dec 31, 2025

Conversation

@ImYangYun
Contributor

This PR adds a minimal training/validation loop that uses masked MSE and a validation-time DTW sanity metric.

  • The loop keeps tensors in [B,T,J,C] for loss/metrics and only permutes to [B,J,C,T] at the model boundary.
  • The data loader now filters by split, uses a safe pivot in [1, total_frames-1] so past/future are non-empty, and returns masked tensors (a sketch follows this list).
  • A small smoke test (with a DummyModel) runs training_step and validation_step without heavy dependencies.
  • I ran the minimal loop locally for ~500–1000 steps; it trains and validates successfully, and the curves look stable. The smoke test also passes.
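
A hypothetical sketch of that safe-pivot split (function names are illustrative, not the actual data_loader code):

```python
# Choose a pivot strictly inside the sequence so that both the past and the
# future windows are non-empty.
import random
import torch

def split_past_future(poses: torch.Tensor):
    """poses: [T, J, C] -> (past [pivot, J, C], future [T - pivot, J, C])."""
    total_frames = poses.shape[0]
    assert total_frames >= 2, "need at least one past and one future frame"
    pivot = random.randint(1, total_frames - 1)  # inclusive range [1, total_frames - 1]
    return poses[:pivot], poses[pivot:]
```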

I also ran into some issues while implementing these tasks:
In the loop I convert BTJC → BJCT before calling the model, and then convert the model output BJCT → BTJC after the forward pass. The model was implemented around BJCT (time last), while our masks, zero_pad_collator, and loss/metrics (masked MSE + DTW) are written for BTJC (time at dim=1). So inputs/past are permuted to match the model, and predictions are permuted back so losses operate over [B, T, J, C].
I’m not sure if this is the best place to keep the conversions. Right now it works, but the double permute feels a bit clunky. Can you give some suggestions on this? Thanks!
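
For concreteness, roughly what the double permute + masked MSE path looks like in one step (function and argument names are illustrative, not the exact minimal_loop.py code):

```python
import torch

def step(model, past_btjc, future_btjc, mask_btjc):
    past_bjct = past_btjc.permute(0, 2, 3, 1)            # BTJC -> BJCT at the model boundary
    pred_bjct = model(past_bjct)                          # the model works in time-last layout
    pred_btjc = pred_bjct.permute(0, 3, 1, 2)             # BJCT -> back to BTJC for loss/metrics
    sq_err = (pred_btjc - future_btjc) ** 2 * mask_btjc   # masked MSE over [B, T, J, C]
    return sq_err.sum() / mask_btjc.sum().clamp(min=1)
```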

[figure: minimal_curves]


@AmitMY AmitMY left a comment


looks ok, left a few comments.

7 outdated comment threads on signwriting_animation/scripts/minimal_loop.py
@ImYangYun
Contributor Author

ImYangYun commented Oct 14, 2025

I looked into the predicted sequences, but they are still almost static: mean|Δpred| ≈ 0 while GT > 0, and free-run DTW is ~1.18–1.45. At inference I also see that the encoder output has time-std ≈ 0.0 (no temporal variation).

  • To address this, I added time conditioning for the future tokens (linear ramp → small MLP) and injected it into the future motion embeddings before the Transformer (sketched right after this list).
  • Switched the loop to full-sequence prediction (Tf inferred from the mask).
  • Trained with masked position loss + 0.5× masked velocity loss (see the loss sketch below).
  • Added small debug prints (time-std of the future/encoder outputs) plus a free-run visualization/CSV writer.
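
A minimal sketch of that time conditioning, assuming future embeddings of shape [B, Tf, D] (module name and dimensions are illustrative):

```python
# A linear ramp over the future steps, pushed through a small MLP and added to the
# future motion embeddings before the Transformer.
import torch
import torch.nn as nn

class TimeRampConditioning(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, future_emb: torch.Tensor) -> torch.Tensor:
        """future_emb: [B, Tf, D] -> same shape, with a per-step time signal added."""
        b, tf, _ = future_emb.shape
        ramp = torch.linspace(0.0, 1.0, tf, device=future_emb.device).view(1, tf, 1)
        return future_emb + self.mlp(ramp.expand(b, tf, 1))
```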

Current outcome: future embeddings now have non-zero time-std, but the encoder output over time remains ~0, and free-run predictions are still nearly static (mean|Δpred|≈0 vs GT≈0.03). So I’m unsure whether the time signal should be injected earlier/later (e.g., pre-pos-enc vs inside the Transformer), and whether we should keep next-frame/full-sequence + (later) autoregressive sampling, or move to a true diffusion sampling loop.
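
And a rough sketch of the masked position + 0.5× velocity loss from the list above (names are illustrative):

```python
# Masked MSE on positions plus a down-weighted masked MSE on frame-to-frame
# velocities. Tensors follow the BTJC convention used elsewhere in this PR.
import torch

def masked_pos_vel_loss(pred, target, mask, vel_weight: float = 0.5):
    """pred/target/mask: [B, T, J, C]."""
    pos = ((pred - target) ** 2 * mask).sum() / mask.sum().clamp(min=1)
    d_pred = pred[:, 1:] - pred[:, :-1]
    d_target = target[:, 1:] - target[:, :-1]
    v_mask = mask[:, 1:] * mask[:, :-1]        # a velocity term is valid only if both frames are
    vel = ((d_pred - d_target) ** 2 * v_mask).sum() / v_mask.sum().clamp(min=1)
    return pos + vel_weight * vel
```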

[figures: screenshot 2025-10-14 14:23:35, minimal_curves]

@ImYangYun
Contributor Author

[figure: free_run_anim]

@AmitMY
Contributor

AmitMY commented Oct 14, 2025

So what I understand is that you are unable to overfit to a small dataset.
For the visualization, it would be more useful if you unnormalize the data before visualizing, and visualize using pose-format, which has nice colors etc., so we can see how the body, face, and hands move.
See this video for example: https://rotem-shalev.github.io/ham-to-pose/static/videos/vid_1.mp4
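
A rough sketch of that suggestion, assuming pred_tjc, mean, and std are numpy arrays in [T, J, C] and header is the matching pose header; Pose, NumPyPoseBody, and PoseVisualizer come from the pose-format package, but the exact call pattern here is an assumption:

```python
# Unnormalize the prediction, wrap it in a pose-format Pose with the training
# header, and render a video with PoseVisualizer.
import numpy as np
from pose_format import Pose
from pose_format.numpy import NumPyPoseBody
from pose_format.pose_visualizer import PoseVisualizer

def visualize_prediction(pred_tjc, mean, std, header, fps=25.0, out_path="free_run.mp4"):
    data = pred_tjc * std + mean                          # undo normalization first
    data = np.ma.masked_array(data[:, np.newaxis])        # [T, J, C] -> [T, people=1, J, C]
    confidence = np.ones(data.shape[:3])                  # [T, people, J]
    body = NumPyPoseBody(fps=fps, data=data, confidence=confidence)
    pose = Pose(header, body)
    viz = PoseVisualizer(pose)
    viz.save_video(out_path, viz.draw())
```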

@ImYangYun
Contributor Author

In this update I focused on stabilizing the training–inference pipeline and diagnosing the long-standing visualization issues. Over the past month I repeatedly encountered inconsistencies between the 586-joint source format and the 178-joint reduced skeleton used for training: GT visualizations were sometimes shifted, flipped, or distorted depending on whether reduce_holistic() was applied, and several mapping mismatches (e.g., wrong component order, inconsistent mean/std shape, and mismatched header joints) caused predicted skeletons to appear stretched or collapsed.
After unifying preprocessing, re-computing the 178-joint (reduced-holistic) mean/std, and enforcing a consistent BTJC pipeline in the Lightning module, the reduced-skeleton GT now visualizes correctly, confirming that visualization must use the same 178-joint header as training.
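
A rough sketch of that unified preprocessing (reduce first, then compute mean/std on the same reduced skeleton); reduce_holistic is the pose-format helper mentioned above, while the surrounding stats code is illustrative:

```python
# Reduce every pose to the holistic subset first, then compute mean/std over that
# same reduced skeleton, so training, normalization, and visualization all share
# one header.
import numpy as np
from pose_format import Pose
from pose_format.utils.generic import reduce_holistic

def load_reduced(path: str) -> np.ndarray:
    with open(path, "rb") as f:
        pose = reduce_holistic(Pose.read(f.read()))       # 586-joint source -> reduced skeleton
    return np.asarray(pose.body.data)[:, 0]               # [T, J, C], first person only

def compute_stats(paths):
    frames = np.concatenate([load_reduced(p) for p in paths], axis=0)   # [sum_T, J, C]
    return frames.mean(axis=0), frames.std(axis=0) + 1e-8
```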

At this stage I still have several open questions regarding the remaining prediction distortions:

  1. Should all visualization/evaluation be done on the reduced 178-joint skeleton, or are there scenarios where full 586-joint output is preferred?
  2. Is my current mapping pipeline (computing mean/std after reduction and visualizing with the reduced header) set up properly?
  3. CAMDM includes GaussianDiffusion utilities such as timestep scaling and p_sample_loop. Am I expected to implement conditioning interfaces myself (SignWriting + past motion), or are there existing CAMDM components I can safely reuse here?

@ImYangYun
Contributor Author

Changes

  • Updated the model with PositionalEncoding + TimestepEmbedder (a sketch follows this list)
  • Achieved stable motion generation (disp_ratio: 0.30 → 1.02)
  • Fixed pylint warnings and improved code quality
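
A minimal sketch of what those two modules typically look like in MDM/CAMDM-style models; the actual implementation in this PR may differ in details:

```python
# Hypothetical versions of the two added modules: a sinusoidal PositionalEncoding,
# and a TimestepEmbedder that maps diffusion timesteps into the same embedding space.
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int, max_len: int = 5000):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe)

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: [B, T, D]
        return x + self.pe[: x.size(1)].unsqueeze(0)

class TimestepEmbedder(nn.Module):
    def __init__(self, d_model: int, pos_enc: PositionalEncoding):
        super().__init__()
        self.pos_enc = pos_enc
        self.mlp = nn.Sequential(nn.Linear(d_model, d_model), nn.SiLU(),
                                 nn.Linear(d_model, d_model))

    def forward(self, timesteps: torch.Tensor) -> torch.Tensor:   # timesteps: [B]
        return self.mlp(self.pos_enc.pe[timesteps])         # [B, D]
```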

Results (4-sample test; metric definitions sketched below the figure)

  • Displacement Ratio: 1.02
  • MPJPE: 0.019
  • PCK@0.1: 100%
[figure: train_curve]
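
For reference, rough definitions of the two reported metrics, assuming predictions and targets in [T, J, C] normalized coordinates:

```python
import torch

def mpjpe(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    return (pred - target).norm(dim=-1).mean()              # mean per-joint position error

def pck(pred: torch.Tensor, target: torch.Tensor, threshold: float = 0.1) -> torch.Tensor:
    return ((pred - target).norm(dim=-1) < threshold).float().mean()   # PCK@threshold
```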

@AmitMY
Contributor

AmitMY commented Dec 17, 2025

Great! Now it is time to run a full training for the entire dataset

@ImYangYun ImYangYun requested a review from AmitMY December 29, 2025 22:48
@ImYangYun
Contributor Author

Lint warnings are about "too many arguments" (R0913/R0917) in __init__ methods.
Updated the models with a freeze_clip option to allow unfreezing CLIP for SignWriting-specific fine-tuning.
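
A hypothetical sketch of what a freeze_clip option like this usually does (names are illustrative, not the actual model code):

```python
# When freeze_clip is True the CLIP encoder's weights are frozen; when False they
# stay trainable so CLIP can be fine-tuned on SignWriting.
import torch.nn as nn

def configure_clip(clip_model: nn.Module, freeze_clip: bool = True) -> nn.Module:
    for param in clip_model.parameters():
        param.requires_grad = not freeze_clip
    if freeze_clip:
        clip_model.eval()                                   # also freeze dropout/BatchNorm behavior
    return clip_model
```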

@AmitMY
Contributor

AmitMY commented Dec 30, 2025

Fix the lint issues, and I'd merge

************* Module signwriting_animation.translation.translate
signwriting_animation/translation/translate.py:7:0: E0401: Unable to import 'sockeye.inference' (import-error)
signwriting_animation/translation/translate.py:24:4: W0621: Redefining name 'args' from outer scope (line 73) (redefined-outer-name)
signwriting_animation/translation/translate.py:32:4: W0621: Redefining name 'translator' from outer scope (line 75) (redefined-outer-name)
signwriting_animation/translation/translate.py:18:8: C0415: Import outside toplevel (huggingface_hub.snapshot_download) (import-outside-toplevel)
signwriting_animation/translation/translate.py:21:4: E0401: Unable to import 'sockeye.translate' (import-error)
signwriting_animation/translation/translate.py:21:4: C0415: Import outside toplevel (sockeye.translate.parse_translation_arguments, sockeye.translate.load_translator_from_args) (import-outside-toplevel)
signwriting_animation/translation/translate.py:15:28: W0613: Unused argument 'temperature' (unused-argument)
signwriting_animation/translation/translate.py:46:14: W0621: Redefining name 'translator' from outer scope (line 75) (redefined-outer-name)
signwriting_animation/translation/translate.py:46:26: W0621: Redefining name 'texts' from outer scope (line 79) (redefined-outer-name)
signwriting_animation/translation/translate.py:58:4: W0621: Redefining name 'outputs' from outer scope (line 86) (redefined-outer-name)
signwriting_animation/translation/translate.py:47:4: E0401: Unable to import 'sockeye.inference' (import-error)
signwriting_animation/translation/translate.py:47:4: C0415: Import outside toplevel (sockeye.inference.make_input_from_factored_string) (import-outside-toplevel)
************* Module signwriting_animation.translation.pretraining.create_pretraining_data
signwriting_animation/translation/pretraining/create_pretraining_data.py:19:25: R1732: Consider using 'with' for resource-allocating operations (consider-using-with)
signwriting_animation/translation/pretraining/create_pretraining_data.py:32:37: R1732: Consider using 'with' for resource-allocating operations (consider-using-with)
************* Module signwriting_animation.diffusion.lightning_module
signwriting_animation/diffusion/lightning_module.py:201:4: R0913: Too many arguments (11/5) (too-many-arguments)
signwriting_animation/diffusion/lightning_module.py:201:4: R0917: Too many positional arguments (11/5) (too-many-positional-arguments)
************* Module signwriting_animation.data.data_loader
signwriting_animation/data/data_loader.py:59:4: R0917: Too many positional arguments (9/5) (too-many-positional-arguments)

You may suppress R0917 in the pyproject if you'd like.

@ImYangYun ImYangYun closed this Dec 30, 2025
@ImYangYun
Contributor Author

Fixed lint: suppressed R0913 and R0917 in pyproject.toml and added ignore-paths for translation (a sketch of the config is below).
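
Roughly what such a pyproject.toml section can look like (section names and paths here are an assumption, not the actual commit):

```toml
[tool.pylint.main]
# skip the sockeye-dependent translation code entirely
ignore-paths = ["signwriting_animation/translation"]

[tool.pylint."messages control"]
# too-many-arguments / too-many-positional-arguments
disable = ["R0913", "R0917"]
```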

@ImYangYun ImYangYun reopened this Dec 30, 2025
@AmitMY AmitMY merged commit d9c8384 into sign-language-processing:main Dec 31, 2025
4 checks passed