Add MG1655 genome next-token-prediction pipeline with MAC memory scaffold by Gonza10V · Pull Request #12 · Gonza10V/SeqTrainer

Gonza10V · 2026-05-12T22:23:21Z

Motivation

Provide a first-pass, extensible pipeline to train next-token-prediction (NTP) models on bacterial genomes using a Titans/MIRAS-style Memory-As-Context (MAC) architecture so long genomes can be handled via chunked streaming and long-term memory summaries.
Pin the initial dataset to Escherichia coli K-12 MG1655 (NC_000913.3) to ensure deterministic provenance and enable repeatable experiments for downstream promoter tasks.
Scaffold downstream transfer evaluation hooks so pretrained genome encoders can be compared against scratch and frozen-encoder baselines for promoter classification and activity prediction.

Description

Added a download utility scripts/download_genomes.py that fetches the pinned accession (falling back to a direct FASTA fetch), writes to data/raw/ecoli_mg1655/, and records provenance fields (accession, source, download_date, file, and sha256) in download_metadata.json.
Added preprocessing and dataset utilities in src/seqtrainer/genome_ntp/data.py plus scripts/preprocess_genome_ntp.py to parse FASTA, normalize DNA to A/C/G/T/N, optionally append the reverse complement, perform interval-based train/val/test splits with a configurable buffer_bp to avoid leakage, and export tokens.npy and a metadata.json with preprocessing parameters and splits to data/processed/ecoli_mg1655/.
Implemented a small, configurable MAC-style NTP model scaffold src/seqtrainer/genome_ntp/model.py (MACMemoryNTP + MacNTPConfig) that prepends long-term memory slots to chunk embeddings, applies local causal Transformer encoding, updates memory with a decay gate, and reports a surprise proxy and memory_norm statistics.
Added a simple dependency-free DNA tokenizer and helper in src/seqtrainer/genome_ntp/tokenizer.py with A/C/G/T/N tokens and reverse_complement utility, and a genome manifest format in src/seqtrainer/genome_ntp/manifest.py with an initial configs/genome_manifest.json entry for MG1655.
Added training and evaluation entrypoints: train_ntp.py (configurable CLI; checkpointing; writes metrics.json) and eval_ntp.py (held-out evaluation, confusion matrix, and Markov baseline stub), plus dataset/loaders and metrics helpers in src/seqtrainer/genome_ntp/train_eval.py.
Added downstream protocol stub src/seqtrainer/genome_ntp/downstream.py describing evaluation modes (scratch, finetune, frozen_encoder) and target metrics for classification/activity tasks, updated README.md with usage examples, and added minimal unit tests in tests/test_genome_ntp.py covering tokenizer, FASTA parsing, interval splitting, and windowing.
Made a small packaging change to src/seqtrainer/__init__.py to avoid import-time optional dependency failures in lightweight test runs.
Notes: the datasets CLI path downloads a zip for GCF_000005845.2 but automatic extraction/selection of FASTA from that archive is left as next-step scaffolding; Markov baseline is a first-pass stub while fuller baselines can be added later.
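
The interval-based split with a buffer described above can be illustrated with a minimal sketch. This is not the code in src/seqtrainer/genome_ntp/data.py: the function name interval_split_sketch, its parameters, and the choice of one contiguous interval per split are assumptions made for illustration. The idea is that buffer_bp positions between neighbouring splits are discarded, so a window ending in one split cannot overlap sequence from the next.

```python
from typing import Dict, Tuple


def interval_split_sketch(
    genome_len: int,
    val_frac: float = 0.1,
    test_frac: float = 0.1,
    buffer_bp: int = 1000,
) -> Dict[str, Tuple[int, int]]:
    """Hypothetical helper: carve [0, genome_len) into train/val/test.

    Intervals are half-open (start, end). A gap of buffer_bp positions is
    left unused between train and val, and between val and test.
    """
    if genome_len <= 0:
        raise ValueError("genome_len must be positive")

    val_len = int(genome_len * val_frac)
    test_len = int(genome_len * test_frac)

    # Train takes whatever remains after val, test, and the two buffers.
    train_end = genome_len - val_len - test_len - 2 * buffer_bp
    if train_end <= 0:
        raise ValueError("buffer_bp and split fractions leave no training data")

    val_start = train_end + buffer_bp
    val_end = val_start + val_len
    test_start = val_end + buffer_bp
    test_end = test_start + test_len

    return {
        "train": (0, train_end),
        "val": (val_start, val_end),
        "test": (test_start, test_end),
    }


splits = interval_split_sketch(10_000, val_frac=0.1, test_frac=0.1, buffer_bp=100)
```

With these toy numbers the train interval ends 100 positions before the validation interval starts, and the test interval ends exactly at the genome length.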

Testing

Ran python -m pytest -q tests/test_genome_ntp.py, which failed during import in this environment because runtime dependencies were missing (first requests, then pandas, then numpy), so the tests could not complete. The new test file tests/test_genome_ntp.py exercises DNATokenizer, parse_fasta, interval_split, and GenomeWindowDataset, but the dependencies must be installed before it can run.
Attempted a full preprocess/train/eval smoke flow with PYTHONPATH=src python scripts/preprocess_genome_ntp.py, PYTHONPATH=src python train_ntp.py, and PYTHONPATH=src python eval_ntp.py, but execution was blocked in this environment by missing packages such as numpy and torch (so training and evaluation were not executed to completion here).
Summary of automated outcomes: the test and smoke commands were run but all failed in this runtime because third-party dependencies were missing. The new code is structured so that, once the required dependencies (numpy, torch, etc.) are installed, the unit tests and smoke-training flow should run (see README.md for example commands).
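
Since the failures above surfaced one missing package at a time, a small preflight check can report all missing modules at once before running the tests. The sketch below is not part of this PR; the helper name missing_modules is hypothetical, and the module list is taken from the packages named in this Testing section.

```python
from importlib.util import find_spec
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


# Packages mentioned in the Testing notes above.
required = ["requests", "pandas", "numpy", "torch"]
missing = missing_modules(required)
if missing:
    print("Missing dependencies:", ", ".join(missing))
else:
    print("All listed dependencies are importable.")
```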

Codex Task

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1035ada49e


chatgpt-codex-connector · 2026-05-12T22:25:25Z

+    _, _, test_ds, _ = load_datasets(Path(args.processed), args.chunk_len, args.stride)
+    test_loader = make_loader(test_ds, 16)
+    model.reset_memory(); metrics = run_epoch(model, test_loader, None)
+    cm = confusion_matrix(model, test_loader, cfg.vocab_size)


Reset model memory before computing confusion matrix

The confusion-matrix pass reuses the same model instance immediately after run_epoch, which updates the MAC memory across the entire test set. Because confusion_matrix is then run without a reset_memory(), its predictions are conditioned on memory that already contains future test information, so the reported matrix is not an unbiased held-out evaluation.


@codex good catch, please implement a fix

Summary

Implemented the requested fix in eval_ntp.py by resetting MAC memory immediately before the confusion-matrix pass, so confusion-matrix predictions are not influenced by memory state accumulated during the earlier run_epoch test evaluation pass.

Committed the change on the current branch with commit message: Reset MAC memory before confusion matrix evaluation.
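
Why the reset matters can be shown with a toy stateful model. This is a sketch under stated assumptions, not the MACMemoryNTP code: ToyMemoryModel, step, and run_pass are hypothetical stand-ins, and the scalar "memory" only mimics the idea of a decay-gated state that persists across batches.

```python
class ToyMemoryModel:
    """Hypothetical stand-in for a model whose memory persists across batches."""

    def __init__(self) -> None:
        self.memory = 0.0

    def reset_memory(self) -> None:
        self.memory = 0.0

    def step(self, x: float) -> float:
        # The prediction depends on whatever memory has accumulated so far.
        pred = x + self.memory
        # Decay-gated update: keep 90% of the old state, mix in 10% of the input.
        self.memory = 0.9 * self.memory + 0.1 * x
        return pred


def run_pass(model: ToyMemoryModel, batches: list) -> list:
    """Run one full pass over the data without touching the memory reset."""
    return [model.step(x) for x in batches]


batches = [1.0, 2.0, 3.0]
model = ToyMemoryModel()

model.reset_memory()
first = run_pass(model, batches)   # analogous to the run_epoch pass

leaky = run_pass(model, batches)   # second pass WITHOUT a reset: sees old memory

model.reset_memory()
clean = run_pass(model, batches)   # second pass WITH a reset: matches the first
```

In the toy, the second pass without a reset produces different predictions from the first pass over identical data, while the pass after reset_memory() reproduces the first pass exactly.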

Testing

✅ python -m py_compile eval_ntp.py


chatgpt-codex-connector · 2026-05-12T22:25:25Z

+    source = "ncbi_nuccore"
+    fasta = out / f"{args.accession}.fna"
+    if datasets:
+        cmd = ["datasets", "download", "genome", "accession", "GCF_000005845.2", "--include", "genome,gff3,gbff", "--filename", str(out / "ncbi_dataset.zip")]


Use CLI accession in datasets download command

The datasets command is hardcoded to GCF_000005845.2 instead of using --accession, so any non-default accession request downloads the wrong genome via the CLI path while metadata still records the user-provided accession. This creates incorrect provenance and can silently preprocess the wrong sequence once zip extraction is added.
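
One way to address this is to build the command from the requested accession rather than a literal. The sketch below is illustrative only: build_datasets_cmd is a hypothetical helper, and it assumes the caller passes an accession that the datasets genome subcommand accepts (the hardcoded value in the diff is an assembly accession, whereas the pinned NC_000913.3 is a nucleotide accession, so the two may need to be tracked separately). The argument list itself mirrors the one in the diff above.

```python
from pathlib import Path
from typing import List


def build_datasets_cmd(accession: str, out_dir: Path) -> List[str]:
    """Hypothetical helper: build the datasets CLI call for a given accession.

    The flags are copied from the reviewed diff; only the accession is
    parameterized instead of hardcoded.
    """
    return [
        "datasets", "download", "genome", "accession", accession,
        "--include", "genome,gff3,gbff",
        "--filename", str(out_dir / "ncbi_dataset.zip"),
    ]


cmd = build_datasets_cmd("GCF_000005845.2", Path("data/raw/ecoli_mg1655"))
print(" ".join(cmd))
```

Recording the same accession value in the metadata that was passed to this helper keeps the provenance consistent with what was actually downloaded.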


chatgpt-codex-connector · 2026-05-12T22:25:25Z

+from pathlib import Path
+from urllib.request import urlretrieve
+
+from seqtrainer.genome_ntp.data import file_sha256


Remove torch dependency from download-only script

This downloader imports file_sha256 from seqtrainer.genome_ntp.data, and that module imports torch at import time. Since torch is an optional dependency in this repo, the README's data-setup flow can fail before any download starts on base installs, even though hashing a file does not require ML dependencies.
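
A standard-library-only hashing helper would avoid the torch import entirely. The following is a minimal sketch, assuming the helper only needs to return a hex digest for a file path; the actual signature of file_sha256 in seqtrainer.genome_ntp.data is not shown in this PR, and the chunk_size parameter is an assumption.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks.

    Uses only the standard library, so importing it pulls in no ML packages.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size blocks so large FASTA files are never fully in memory.
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


# Small self-check on a temporary file with FASTA-like content.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.fna"
    sample.write_bytes(b">seq\nACGTN\n")
    sample_digest = file_sha256(sample)
```

Placing such a helper in a module that imports neither torch nor numpy would let the download script run on a base install.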


Add first-pass bacterial genome NTP MAC pipeline scaffolding

1035ada

Gonza10V added the codex label May 12, 2026 — with ChatGPT Codex Connector

chatgpt-codex-connector Bot reviewed May 12, 2026

Gonza10V wants to merge 1 commit into refactored from codex/implement-bacterial-genome-prediction-pipeline
