Add DNABERT-2 promoter finetuning tutorial notebook by Gonza10V · Pull Request #5 · Gonza10V/SeqTrainer

Gonza10V · 2026-04-30T04:17:02Z

Motivation

Provide an end-to-end example showing how to finetune DNABERT-2 for binary promoter classification using Hugging Face Transformers and the Trainer API.
Make it easy to reproduce and try locally by including dependency hints and a synthetic fallback dataset when a local CSV is not available.

Description

Add a new tutorial notebook at notebooks/tutorials/06_dnabert2_promoter_finetuning.ipynb containing setup, data loading, preprocessing, tokenization, training, evaluation, and inference sections.
Include a dependency-install cell, a reproducibility seed, and a model configuration targeting zhihan1996/DNABERT-2-117M, with trust_remote_code=True for both tokenizer and model loading.
Implement robust data handling with required-column validation, sequence sanitization ([^ACGTN] -> N), a DatasetDict train/validation/test split, and a DataCollatorWithPadding-based data pipeline.
Add compute_metrics (accuracy, F1, precision, recall), TrainingArguments tuned for finetuning, a Trainer instantiation, and example inference code for new sequences.
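
As a rough illustration of the required-column validation and the `[^ACGTN] -> N` sanitization described above, here is a minimal standard-library sketch. It is not the notebook's actual code: the `sequence` column name, the uppercasing step, and the helper names `sanitize_sequence` and `validate_rows` are assumptions made for this example (only `label` appears in the reviewed diff), and the notebook presumably works on a pandas DataFrame rather than a list of dicts.

```python
import re

# Assumed column names; "label" appears in the diff, "sequence" is a guess.
REQUIRED_COLUMNS = ("sequence", "label")

# Matches any character that is not one of A, C, G, T, N.
_INVALID_BASE = re.compile(r"[^ACGTN]")


def sanitize_sequence(seq: str) -> str:
    """Trim whitespace, uppercase, and replace every non-ACGTN character with N."""
    return _INVALID_BASE.sub("N", seq.strip().upper())


def validate_rows(rows):
    """Fail fast on rows missing a required column; return sanitized copies."""
    cleaned = []
    for i, row in enumerate(rows):
        missing = [c for c in REQUIRED_COLUMNS if c not in row]
        if missing:
            raise ValueError(f"row {i} is missing required columns: {missing}")
        cleaned.append({**row, "sequence": sanitize_sequence(row["sequence"])})
    return cleaned


rows = validate_rows([{"sequence": "acgtXn-", "label": 1}])
print(rows[0]["sequence"])
```

Uppercasing before the substitution keeps lowercase bases from being turned into N, which may or may not match what the notebook does.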

Testing

Validated the notebook JSON structure with python -m json.tool notebooks/tutorials/06_dnabert2_promoter_finetuning.ipynb, which completed successfully.

Codex Task

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d3b648021d

chatgpt-codex-connector · 2026-04-30T04:18:49Z

+        "    temp_df,\n",
+        "    test_size=0.5,\n",
+        "    random_state=42,\n",
+        "    stratify=temp_df[\"label\"] if temp_df[\"label\"].nunique() > 1 else None,\n",


Guard second split from invalid stratification

This split can raise a ValueError when temp_df has only one sample per class (which happens with the notebook’s built-in 6-row fallback dataset after the first 70/30 split), because train_test_split(..., stratify=...) requires at least 2 members per class. In that case the tutorial stops before training, so users without data/data_DNABERT/promoter_classification.csv cannot run the notebook end-to-end.
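
One possible way to guard the split is to decide up front whether stratification is feasible and pass `None` otherwise. The sketch below uses a hypothetical helper, `stratify_or_none`, that mirrors the conditions scikit-learn appears to enforce for a stratified shuffle split (at least two members per class, and at least one train and one test slot per class). It is an illustration under those assumptions, not the fix adopted in the PR.

```python
import math
from collections import Counter


def stratify_or_none(labels, test_size):
    """Return the labels if a stratified split looks feasible, else None.

    Hypothetical helper: the checks approximate scikit-learn's requirements
    for a float test_size and are not taken from the notebook.
    """
    labels = list(labels)
    counts = Counter(labels)
    n_classes = len(counts)
    n_test = math.ceil(test_size * len(labels))
    n_train = len(labels) - n_test
    feasible = (
        n_classes > 1                    # a single class needs no stratification
        and min(counts.values()) >= 2    # every class needs at least 2 members
        and n_test >= n_classes          # room for each class in the test part
        and n_train >= n_classes         # room for each class in the train part
    )
    return labels if feasible else None


# The 6-row fallback leaves 2 rows (one per class) after the 70/30 split,
# so the second split should fall back to an unstratified shuffle.
print(stratify_or_none([0, 1], test_size=0.5))
```

In the notebook this could be used as `stratify=stratify_or_none(temp_df["label"], 0.5)` in the second `train_test_split` call, replacing the `nunique() > 1` check.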

Add DNABERT2 promoter classification finetuning tutorial notebook

d3b6480

Gonza10V added the codex label Apr 30, 2026 — with ChatGPT Codex Connector

chatgpt-codex-connector Bot reviewed Apr 30, 2026

Gonza10V wants to merge 1 commit into refactored from codex/create-example-notebook-for-dnabert2-finetuning
