Add Nucleotide Transformer v2 SeqTrainer tutorial for promoter tasks #6

Open
Gonza10V wants to merge 1 commit into refactored from codex/implement-notebook-for-nucleotide-transformer-v2

Conversation

Gonza10V (Owner) commented May 1, 2026

Motivation

  • Make it easy to use the InstaDeep Nucleotide Transformer v2 from within SeqTrainer by providing a package-native backbone spec and convenience accessors.
  • Demonstrate realistic promoter tasks (binary classification and regression) using SeqTrainer dataset abstractions together with Hugging Face Trainer to serve as a reproducible example for users.

Description

  • Add src/seqtrainer/torch/hf.py which defines NucleotideTransformerV2Spec and provides register_default_torch_backbones and get_nucleotide_transformer_v2_backbone to register/resolve a built-in backbone spec for InstaDeepAI/nucleotide-transformer-v2-500m-multi-species.
  • Export the new helpers from seqtrainer.torch by updating src/seqtrainer/torch/__init__.py so the backbone can be resolved via from seqtrainer.torch import get_nucleotide_transformer_v2_backbone.
  • Add tutorial notebook notebooks/tutorials/06_ntv2_promoter_tasks.ipynb that loads the example promoter dataset, builds SeqTrainer MaterializedDataset splits, tokenizes with the resolved tokenizer, and shows HF Trainer setup for both classification (median-thresholded) and regression (raw target); training/eval calls are commented to avoid accidental heavy runs.
  • Update README.md to include the new notebook in the tutorial list.
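
The two label setups named in the notebook description can be sketched without any model code. This is a minimal illustration only: the helper names `median_threshold_labels` and `regression_targets` and the sample values are hypothetical and not part of SeqTrainer. The PR does not say how the notebook treats values equal to the median, so the sketch assumes "strictly above the median" means positive.

```python
from statistics import median


def median_threshold_labels(values):
    """Binary labels: 1 if a value is strictly above the median, else 0.

    Assumption: ties with the median are labeled 0; the notebook may differ.
    """
    cutoff = median(values)
    return [1 if v > cutoff else 0 for v in values]


def regression_targets(values):
    """Regression uses the raw target unchanged, cast to float."""
    return [float(v) for v in values]


# Hypothetical promoter strength measurements, for illustration only.
strengths = [0.2, 1.5, 0.9, 3.1, 0.4]
labels = median_threshold_labels(strengths)
targets = regression_targets(strengths)
```

Thresholding at the median gives roughly balanced classes by construction, which keeps accuracy meaningful as a metric for the classification variant.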

Testing

  • Ran python -m compileall src/seqtrainer to confirm that the package files are syntactically valid; it succeeded.
  • Verified the backbone helper via PYTHONPATH=src python - <<'PY' ... get_nucleotide_transformer_v2_backbone() ... PY, which returned the expected BackboneSpec successfully.
  • Confirmed that importing the package in a plain interpreter fails without an editable install. This is an environment limitation, not a code defect, so the recommended usage is an editable install or PYTHONPATH=src.
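
For notebooks run from a source checkout, the same PYTHONPATH=src effect can be approximated at runtime. The sketch below assumes a `src/` layout like this repository's; `ensure_importable` is a hypothetical helper, not something this PR adds, and it only handles top-level package names.

```python
import importlib.util
import sys
from pathlib import Path


def ensure_importable(package, src_dir="src"):
    """Return True if top-level `package` can be located.

    If it is not already importable and `src_dir/package` exists,
    prepend `src_dir` to sys.path and check again.
    """
    if importlib.util.find_spec(package) is not None:
        return True
    candidate = Path(src_dir).resolve()
    if (candidate / package).is_dir():
        sys.path.insert(0, str(candidate))
        importlib.invalidate_caches()
        return importlib.util.find_spec(package) is not None
    return False


# Example: ensure_importable("seqtrainer") from the repository root.
```

An editable install remains the more robust option, since path manipulation like this depends on the working directory.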

Codex Task

chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ba4856a8a2

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

" per_device_train_batch_size=2,\n",
" per_device_eval_batch_size=2,\n",
" num_train_epochs=1,\n",
" eval_strategy=\"epoch\",\n",

P1: Use evaluation_strategy for Transformers 4.30 compatibility

This notebook passes eval_strategy to TrainingArguments, but that parameter is not accepted at the project's declared minimum transformers version (e.g., 4.30.x uses evaluation_strategy), so running the tutorial in a compliant environment raises a TypeError before training starts. The same incompatible argument appears in both the classification and regression argument blocks, so the tutorial fails for users who install seqtrainer[torch] at the documented dependency floor.
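
One possible way to support both spellings, instead of hard-coding either one, is to inspect the constructor and pick whichever keyword the installed version accepts. This is only a sketch: `eval_strategy_kwarg` is a hypothetical helper, not part of this PR or of transformers, and it assumes the arguments class exposes its fields as named `__init__` parameters (which holds for dataclasses such as TrainingArguments).

```python
import inspect


def eval_strategy_kwarg(arguments_cls, value="epoch"):
    """Return the evaluation-strategy kwarg under the name the class accepts.

    Prefers the newer `eval_strategy` spelling and falls back to
    `evaluation_strategy` when the newer name is not a constructor parameter.
    """
    params = inspect.signature(arguments_cls.__init__).parameters
    if "eval_strategy" in params:
        return {"eval_strategy": value}
    return {"evaluation_strategy": value}


# Intended use in the notebook (not executed here, requires transformers):
#   from transformers import TrainingArguments
#   args = TrainingArguments(
#       output_dir="out",
#       per_device_train_batch_size=2,
#       **eval_strategy_kwarg(TrainingArguments),
#   )
```

The simpler alternatives are to rename the argument as the review suggests, or to raise the dependency floor to a transformers release that accepts eval_strategy.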

Useful? React with 👍 / 👎.
