Add DNABERT-2 promoter finetuning tutorial notebook #5
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d3b648021d
    " temp_df,\n",
    " test_size=0.5,\n",
    " random_state=42,\n",
    " stratify=temp_df[\"label\"] if temp_df[\"label\"].nunique() > 1 else None,\n",
Guard second split from invalid stratification
This split can raise a `ValueError` when `temp_df` has only one sample per class (which happens with the notebook’s built-in 6-row fallback dataset after the first 70/30 split), because `train_test_split(..., stratify=...)` requires at least 2 members per class. In that case the tutorial stops before training, so users without `data/data_DNABERT/promoter_classification.csv` cannot run the notebook end-to-end.
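One possible shape for the guard, as a minimal sketch rather than the fix adopted in this PR: a hypothetical helper, `stratify_or_none`, that only returns labels for stratification when every class has at least 2 members and both halves of the split can hold one sample per class. The size arithmetic (`ceil(test_size * n)` for the test half) is an assumption about how scikit-learn rounds, and the helper itself uses only the standard library.

```python
import math
from collections import Counter


def stratify_or_none(labels, test_size):
    """Return labels if a stratified split looks feasible, else None.

    Hypothetical helper (not part of the notebook). The checks mirror the
    conditions assumed to be enforced by scikit-learn's stratified split:
    at least 2 members per class, and train/test halves that are each at
    least as large as the number of classes.
    """
    labels = list(labels)
    counts = Counter(labels)
    n_classes = len(counts)

    # Assumed rounding: the test half gets ceil(test_size * n) samples.
    n_test = math.ceil(test_size * len(labels))
    n_train = len(labels) - n_test

    feasible = (
        n_classes > 1                      # same condition as the original guard
        and min(counts.values()) >= 2      # the case the review comment flags
        and n_train >= n_classes
        and n_test >= n_classes
    )
    return labels if feasible else None


# The 6-row fallback case described above: the first 70/30 split is fine,
# but the 2-row remainder has one sample per class and falls back to None.
first = stratify_or_none([0, 1, 0, 1, 0, 1], test_size=0.3)
second = stratify_or_none([0, 1], test_size=0.5)
```

In the notebook's second split, this would be passed as `stratify=stratify_or_none(temp_df["label"], 0.5)`, so the tiny fallback dataset gets an unstratified split instead of raising.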
Motivation
- … `Trainer` API.
Description
- … `notebooks/tutorials/06_dnabert2_promoter_finetuning.ipynb` containing setup, data loading, preprocessing, tokenization, training, evaluation, and inference sections.
- … `zhihan1996/DNABERT-2-117M` with `trust_remote_code=True` for tokenizer and model loading.
- … (`[^ACGTN] -> N`), a `DatasetDict` train/validation/test split, and a `DataCollatorWithPadding`-based data pipeline.
- … `compute_metrics` (accuracy, F1, precision, recall), `TrainingArguments` tuned for finetuning, a `Trainer` instantiation, and example inference code for new sequences.
Testing
- … `python -m json.tool notebooks/tutorials/06_dnabert2_promoter_finetuning.ipynb`, which completed successfully.
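For reference, the sequence normalization rule listed in the description (`[^ACGTN] -> N`) could look roughly like the sketch below. The function name `clean_sequence` is hypothetical, and the whitespace stripping and uppercasing steps are assumptions added for illustration; the description only states the character replacement itself.

```python
import re

# Matches any character that is not one of A, C, G, T, or N.
_NON_ACGTN = re.compile(r"[^ACGTN]")


def clean_sequence(seq: str) -> str:
    """Normalize a DNA string so it contains only A/C/G/T/N.

    Stripping and uppercasing are assumed preprocessing steps; the
    replacement of every other character with N follows the stated rule.
    """
    return _NON_ACGTN.sub("N", seq.strip().upper())


example = clean_sequence("acgtRYn ")
```

Under these assumptions, ambiguity codes such as `R` and `Y`, gaps, and RNA bases like `U` all collapse to `N`.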