This repository was archived by the owner on May 10, 2026. It is now read-only.

text-to-pose: Scale up sequence size #2

@AmitMY

Description


Because our model is memory-intensive (a transformer, O(n^2) in sequence length), we cap all training data at a maximum sequence length and batch_size. (filter here)
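A minimal sketch of what that length filter looks like (the `pose_frames` field, the sample structure, and the 100-frame cap mentioned below are illustrative assumptions, not the repo's actual filter code):

```python
MAX_FRAMES = 100  # assumed cap, matching the 100-frame limit described in this issue


def filter_by_length(samples, max_frames=MAX_FRAMES):
    """Keep only samples whose pose sequence fits under the frame cap."""
    return [s for s in samples if len(s["pose_frames"]) <= max_frames]


# Hypothetical dataset with sequence lengths 80, 150, and 99 frames
samples = [{"id": i, "pose_frames": [0] * n} for i, n in enumerate([80, 150, 99])]
kept = filter_by_length(samples)  # the 150-frame sample is dropped
```

Any sample over the cap is silently discarded, which is exactly why a large fraction of the dataset never reaches the model.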

Currently, if memory serves me right, out of ~4000 videos in the dicta_sign dataset, the model trains on only ~2500 because of the 100-frame limit. (With more frames we get an out-of-memory error, which is perhaps related to more than just the transformer.)

The ideal backbone for the pose encoding, in my opinion, is an S4 model, while the text (usually much shorter) could still use a transformer.
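To make the memory argument concrete, a back-of-the-envelope sketch of how the two backbones scale with sequence length (the head count and state dimension are illustrative assumptions, and real implementations have additional overheads):

```python
def attention_memory_cells(seq_len, num_heads=8):
    # Self-attention materializes a (seq_len x seq_len) score matrix per head,
    # so memory grows quadratically with sequence length.
    return num_heads * seq_len * seq_len


def s4_state_cells(seq_len, state_dim=64):
    # An S4-style state-space layer keeps a fixed-size state per step,
    # so memory grows only linearly with sequence length.
    return seq_len * state_dim


# Doubling the sequence length quadruples attention memory but only doubles S4's
ratio_attn = attention_memory_cells(200) / attention_memory_cells(100)  # 4.0
ratio_s4 = s4_state_cells(200) / s4_state_cells(100)  # 2.0
```

Under this scaling, swapping the pose encoder to S4 would let sequence length grow far beyond the current 100-frame cap before memory becomes the bottleneck.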

We should experiment with ways to increase the model's input size.

Metadata


Labels

enhancement (New feature or request), help wanted (Extra attention is needed)
