This repository was archived by the owner on May 10, 2026. It is now read-only.

text-to-pose: Scale up sequence size #2

@AmitMY

Description


Because our model is memory-intensive (a transformer, O(n^2) in sequence length), we cap all training data at a maximum sequence length and batch_size. (filter here)
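A minimal sketch of what that length filter looks like (the `pose_frames` field, the sample structure, and the 100-frame cap mentioned below are illustrative assumptions, not the repo's actual filter code):

```python
MAX_FRAMES = 100  # assumed cap, matching the 100-frame limit described in this issue


def filter_by_length(samples, max_frames=MAX_FRAMES):
    """Keep only samples whose pose sequence fits under the frame cap."""
    return [s for s in samples if len(s["pose_frames"]) <= max_frames]


# Hypothetical dataset with sequence lengths 80, 150, and 99 frames
samples = [{"id": i, "pose_frames": [0] * n} for i, n in enumerate([80, 150, 99])]
kept = filter_by_length(samples)  # the 150-frame sample is dropped
```

Any sample over the cap is silently discarded, which is exactly why a large fraction of the dataset never reaches the model.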

Currently, if memory serves me right, out of ~4000 videos in the dicta_sign dataset, the model trains on only ~2500 because of the 100-frame limit. (With more frames we get an out-of-memory error, which is perhaps related to more than just the transformer.)

The ideal backbone for the pose encoding, in my opinion, is an S4 model, while the text (usually much shorter) could still use a transformer.
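To make the memory argument concrete, a back-of-the-envelope sketch of how the two backbones scale with sequence length (the head count and state dimension are illustrative assumptions, and real implementations have additional overheads):

```python
def attention_memory_cells(seq_len, num_heads=8):
    # Self-attention materializes a (seq_len x seq_len) score matrix per head,
    # so memory grows quadratically with sequence length.
    return num_heads * seq_len * seq_len


def s4_state_cells(seq_len, state_dim=64):
    # An S4-style state-space layer keeps a fixed-size state per step,
    # so memory grows only linearly with sequence length.
    return seq_len * state_dim


# Doubling the sequence length quadruples attention memory but only doubles S4's
ratio_attn = attention_memory_cells(200) / attention_memory_cells(100)  # 4.0
ratio_s4 = s4_state_cells(200) / s4_state_cells(100)  # 2.0
```

Under this scaling, swapping the pose encoder to S4 would let sequence length grow far beyond the current 100-frame cap before memory becomes the bottleneck.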

We should experiment with ways to increase the model's input size.

Metadata


Labels

enhancement (New feature or request), help wanted (Extra attention is needed)
