From 06cee90453ea7364426ef9b556b8a62e714a11f0 Mon Sep 17 00:00:00 2001
From: Amit Moryossef
Date: Tue, 28 Apr 2026 09:21:43 +0000
Subject: [PATCH 1/3] Add SEDA reference (Tan et al., 2024)

Adds a sentence to the Video-to-Text section describing SEDA's
sign-feature and spoken-text augmentations and its multi-task learning
extension of the sign language transformer, and appends the
corresponding BibTeX entry.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 src/index.md       |  2 ++
 src/references.bib | 21 +++++++++++++++++++++
 2 files changed, 23 insertions(+)

diff --git a/src/index.md b/src/index.md
index f0f8d3a..d99f0aa 100644
--- a/src/index.md
+++ b/src/index.md
@@ -744,6 +744,8 @@ On this encoding, they use a Connectionist Temporal Classification (CTC) [@grave
 Using the same encoding, they use a transformer decoder to decode the spoken language text one token at a time.
 They show that adding gloss supervision improves the model over not using it and that it outperforms previous video-to-gloss-to-text pipeline approaches [@cihan2018neural].
 
+@tan-etal-2024-seda extend this sign language transformer with SEDA, a simple and effective data augmentation framework that augments sign features by passing the same frames through multiple sign embeddings (the original spatial embedding plus a frozen spatial-temporal embedding distilled from a self-mutual knowledge distillation model) and augments spoken text via lemmatization and alphabet normalization, then trains both views jointly with multi-task learning followed by task-specific fine-tuning to achieve competitive WER, BLEU, and ROUGE on RWTH-PHOENIX-Weather-2014T.
+
 Following up, @camgoz2020multi propose a new architecture that does not require the supervision of glosses, named "Multi-channel Transformers for Multi-articulatory Sign Language Translation".
 In this approach, they crop the signing hand and the face and perform 3D pose estimation to obtain three separate data channels.
 They encode each data channel separately using a transformer, then encode all channels together and concatenate the separate channels for each frame.
diff --git a/src/references.bib b/src/references.bib
index 4371d2c..c7a5e4a 100644
--- a/src/references.bib
+++ b/src/references.bib
@@ -4816,3 +4816,24 @@ @inproceedings{dataset:reverdy-etal-2024-stk
     url = "https://aclanthology.org/2024.signlang-1.35/",
     pages = "315--322"
 }
+
+@inproceedings{tan-etal-2024-seda,
+    title = "{SEDA}: Simple and Effective Data Augmentation for Sign Language Understanding",
+    author = "Tan, Sihan and
+      Miyazaki, Taro and
+      Itoyama, Katsutoshi and
+      Nakadai, Kazuhiro",
+    editor = "Efthimiou, Eleni and
+      Fotinea, Stavroula-Evita and
+      Hanke, Thomas and
+      Hochgesang, Julie A. and
+      Mesch, Johanna and
+      Schulder, Marc",
+    booktitle = "Proceedings of the LREC-COLING 2024 11th Workshop on the Representation and Processing of Sign Languages: Evaluation of Sign Language Resources",
+    month = may,
+    year = "2024",
+    address = "Torino, Italia",
+    publisher = "ELRA and ICCL",
+    url = "https://aclanthology.org/2024.signlang-1.41/",
+    pages = "370--375"
+}

From caa90f22fdf142a3ba03e106ce1cc09f05558a11 Mon Sep 17 00:00:00 2001
From: AmitMY
Date: Tue, 28 Apr 2026 09:56:48 +0000
Subject: [PATCH 2/3] Trim tan-etal SEDA one-liner (review feedback: concise)

---
 src/index.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/index.md b/src/index.md
index d99f0aa..3385c12 100644
--- a/src/index.md
+++ b/src/index.md
@@ -744,7 +744,7 @@ On this encoding, they use a Connectionist Temporal Classification (CTC) [@grave
 Using the same encoding, they use a transformer decoder to decode the spoken language text one token at a time.
 They show that adding gloss supervision improves the model over not using it and that it outperforms previous video-to-gloss-to-text pipeline approaches [@cihan2018neural].
 
-@tan-etal-2024-seda extend this sign language transformer with SEDA, a simple and effective data augmentation framework that augments sign features by passing the same frames through multiple sign embeddings (the original spatial embedding plus a frozen spatial-temporal embedding distilled from a self-mutual knowledge distillation model) and augments spoken text via lemmatization and alphabet normalization, then trains both views jointly with multi-task learning followed by task-specific fine-tuning to achieve competitive WER, BLEU, and ROUGE on RWTH-PHOENIX-Weather-2014T.
+@tan-etal-2024-seda extend this sign language transformer with SEDA, a data augmentation framework that augments sign features through multiple sign embeddings and augments spoken text via lemmatization, achieving competitive WER, BLEU, and ROUGE on RWTH-PHOENIX-Weather-2014T.
 
 Following up, @camgoz2020multi propose a new architecture that does not require the supervision of glosses, named "Multi-channel Transformers for Multi-articulatory Sign Language Translation".
 In this approach, they crop the signing hand and the face and perform 3D pose estimation to obtain three separate data channels.

From 614d629a81d36164c346810fe234b05a5ceb324a Mon Sep 17 00:00:00 2001
From: AmitMY
Date: Tue, 28 Apr 2026 12:02:56 +0000
Subject: [PATCH 3/3] Move tan-etal SEDA to last paragraph in Video-to-Text
 section (review feedback)

---
 src/index.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/index.md b/src/index.md
index 3385c12..21ead94 100644
--- a/src/index.md
+++ b/src/index.md
@@ -744,8 +744,6 @@ On this encoding, they use a Connectionist Temporal Classification (CTC) [@grave
 Using the same encoding, they use a transformer decoder to decode the spoken language text one token at a time.
 They show that adding gloss supervision improves the model over not using it and that it outperforms previous video-to-gloss-to-text pipeline approaches [@cihan2018neural].
 
-@tan-etal-2024-seda extend this sign language transformer with SEDA, a data augmentation framework that augments sign features through multiple sign embeddings and augments spoken text via lemmatization, achieving competitive WER, BLEU, and ROUGE on RWTH-PHOENIX-Weather-2014T.
-
 Following up, @camgoz2020multi propose a new architecture that does not require the supervision of glosses, named "Multi-channel Transformers for Multi-articulatory Sign Language Translation".
 In this approach, they crop the signing hand and the face and perform 3D pose estimation to obtain three separate data channels.
 They encode each data channel separately using a transformer, then encode all channels together and concatenate the separate channels for each frame.
@@ -804,6 +802,8 @@ SSVP-SLT achieves state-of-the-art performance on How2Sign [@dataset:duarte2020h
 They conclude that SLT models can be pretrained in a privacy-aware manner without sacrificing too much performance.
 Additionally, the authors release DailyMoth-70h, a new 70-hour ASL dataset from [The Daily Moth](https://www.dailymoth.com/).
 
+@tan-etal-2024-seda extend this sign language transformer with SEDA, a data augmentation framework that augments sign features through multiple sign embeddings and augments spoken text via lemmatization, achieving competitive WER, BLEU, and ROUGE on RWTH-PHOENIX-Weather-2014T.
+ #### Text-to-Video Text-to-Video, also known as sign language production, is the task of producing a video that adequately represents a spoken language text in sign language.