Add MG1655 genome next-token-prediction pipeline with MAC memory scaffold #12

Open: Gonza10V wants to merge 1 commit into refactored from codex/implement-bacterial-genome-prediction-pipeline

@Gonza10V
Owner

Motivation

  • Provide a first-pass, extensible pipeline to train next-token-prediction (NTP) models on bacterial genomes using a Titans/MIRAS-style Memory-As-Context (MAC) architecture so long genomes can be handled via chunked streaming and long-term memory summaries.
  • Pin the initial dataset to Escherichia coli K-12 MG1655 (NC_000913.3) to ensure deterministic provenance and enable repeatable experiments for downstream promoter tasks.
  • Scaffold downstream transfer evaluation hooks so pretrained genome encoders can be compared against scratch and frozen-encoder baselines for promoter classification and activity prediction.

Description

  • Added a download utility, scripts/download_genomes.py, that fetches the pinned accession (falling back to a direct FASTA fetch), writes to data/raw/ecoli_mg1655/, and records the provenance fields accession, source, download_date, file, and sha256 in download_metadata.json.
  • Added preprocessing and dataset utilities in src/seqtrainer/genome_ntp/data.py plus scripts/preprocess_genome_ntp.py to parse FASTA, normalize DNA to A/C/G/T/N, optionally append the reverse complement, perform interval-based train/val/test splits with a configurable buffer_bp to avoid leakage, and export tokens.npy and a metadata.json (preprocessing parameters and splits) to data/processed/ecoli_mg1655/.
  • Implemented a small, configurable MAC-style NTP model scaffold src/seqtrainer/genome_ntp/model.py (MACMemoryNTP + MacNTPConfig) that prepends long-term memory slots to chunk embeddings, applies local causal Transformer encoding, updates memory with a decay gate, and reports a surprise proxy and memory_norm statistics.
  • Added a simple dependency-free DNA tokenizer and helper in src/seqtrainer/genome_ntp/tokenizer.py with A/C/G/T/N tokens and a reverse_complement utility, plus a genome manifest format in src/seqtrainer/genome_ntp/manifest.py with an initial configs/genome_manifest.json entry for MG1655.
  • Added training and evaluation entrypoints: train_ntp.py (configurable CLI; checkpointing; writes metrics.json) and eval_ntp.py (held-out evaluation, confusion matrix, and Markov baseline stub), plus dataset/loaders and metrics helpers in src/seqtrainer/genome_ntp/train_eval.py.
  • Added downstream protocol stub src/seqtrainer/genome_ntp/downstream.py describing evaluation modes (scratch, finetune, frozen_encoder) and target metrics for classification/activity tasks, updated README.md with usage examples, and added minimal unit tests in tests/test_genome_ntp.py covering tokenizer, FASTA parsing, interval splitting, and windowing.
  • Made a small packaging change to src/seqtrainer/__init__.py to avoid import-time optional dependency failures in lightweight test runs.
  • Notes: the datasets CLI path downloads a zip for GCF_000005845.2 but automatic extraction/selection of FASTA from that archive is left as next-step scaffolding; Markov baseline is a first-pass stub while fuller baselines can be added later.
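
As an illustration of the interval-based split with a leakage buffer described above, here is a minimal sketch. It is not the PR's interval_split: the function name, signature, the fixed train/val/test ordering, and the half-open interval convention are all assumptions made for the example.

```python
from typing import Dict, Tuple


def interval_split_sketch(
    genome_len: int,
    val_frac: float = 0.1,
    test_frac: float = 0.1,
    buffer_bp: int = 1000,
) -> Dict[str, Tuple[int, int]]:
    """Split [0, genome_len) into contiguous train/val/test intervals.

    Hypothetical helper: intervals are half-open (start, end), and
    buffer_bp bases are dropped between neighbouring splits so that
    windows near a boundary cannot overlap across splits.
    """
    if not 0 < val_frac + test_frac < 1:
        raise ValueError("val_frac + test_frac must be in (0, 1)")
    val_len = int(genome_len * val_frac)
    test_len = int(genome_len * test_frac)
    # Two buffers are needed: train|val and val|test.
    train_len = genome_len - val_len - test_len - 2 * buffer_bp
    if min(train_len, val_len, test_len) <= 0:
        raise ValueError("genome too short for the requested split")

    train = (0, train_len)
    val = (train[1] + buffer_bp, train[1] + buffer_bp + val_len)
    test = (val[1] + buffer_bp, val[1] + buffer_bp + test_len)
    return {"train": train, "val": val, "test": test}


splits = interval_split_sketch(1_000_000, 0.1, 0.1, buffer_bp=1000)
print(splits)
```

The sketch treats the sequence as linear. If the reverse complement is appended during preprocessing, the split presumably has to be applied so that both strands of a held-out region stay out of the training interval; how the PR handles that is not shown here.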

Testing

  • Ran python -m pytest -q tests/test_genome_ntp.py, which failed at import time in this environment because runtime dependencies were missing (first requests, then pandas, then numpy), so the tests could not complete. The new test file tests/test_genome_ntp.py exercises DNATokenizer, parse_fasta, interval_split, and GenomeWindowDataset, but dependencies must be installed before it can run.
  • Attempted a full preprocess/train/eval smoke flow with PYTHONPATH=src python scripts/preprocess_genome_ntp.py, PYTHONPATH=src python train_ntp.py, and PYTHONPATH=src python eval_ntp.py, but execution was blocked by missing packages such as numpy and torch, so training and evaluation did not run to completion here.
  • Summary: every test and smoke command failed in this environment because third-party dependencies were missing, so the new code has not yet been run here. After installing the required dependencies (numpy, torch, etc.), the unit tests and smoke-training flow are expected to run (see README.md for example commands).

chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1035ada49e

Comment thread eval_ntp.py
_, _, test_ds, _ = load_datasets(Path(args.processed), args.chunk_len, args.stride)
test_loader = make_loader(test_ds, 16)
model.reset_memory(); metrics = run_epoch(model, test_loader, None)
cm = confusion_matrix(model, test_loader, cfg.vocab_size)

P1: Reset model memory before computing confusion matrix

The confusion-matrix pass reuses the same model instance immediately after run_epoch, which updates the MAC memory across the entire test set. Because confusion_matrix is then run without a reset_memory(), its predictions are conditioned on memory that already contains future test information, so the reported matrix is not an unbiased held-out evaluation.
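
To make the failure mode concrete, here is a toy sketch of how carried-over state changes a second evaluation pass. ToyMemoryModel and run_pass are hypothetical stand-ins invented for this example; they are not the PR's MACMemoryNTP, run_epoch, or confusion_matrix, and only the reset_memory() name mirrors the code under review.

```python
from typing import List


class ToyMemoryModel:
    """Stateful stand-in: each prediction depends on accumulated memory."""

    def __init__(self, decay: float = 0.9) -> None:
        self.decay = decay
        self.memory = 0.0

    def reset_memory(self) -> None:
        self.memory = 0.0

    def predict(self, x: float) -> float:
        out = x + self.memory
        # Decayed running summary of everything seen so far.
        self.memory = self.decay * self.memory + (1 - self.decay) * x
        return out


def run_pass(model: ToyMemoryModel, data: List[float]) -> List[float]:
    """One sequential pass over the data, updating memory as it goes."""
    return [model.predict(x) for x in data]


data = [1.0, 2.0, 3.0]
model = ToyMemoryModel()

model.reset_memory()
first = run_pass(model, data)   # analogous to the metrics pass

leaky = run_pass(model, data)   # second pass WITHOUT a reset

model.reset_memory()
clean = run_pass(model, data)   # second pass WITH a reset

print(first == clean)  # True: reset reproduces the first pass
print(first == leaky)  # False: stale memory changed the predictions
```

In eval_ntp.py the equivalent change would be a model.reset_memory() call placed between the run_epoch pass and the confusion_matrix pass, assuming confusion_matrix does not reset memory internally.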

Owner Author

@codex good catch, please implement a fix


Summary

  • Implemented the requested fix in eval_ntp.py by resetting MAC memory immediately before the confusion-matrix pass, so confusion-matrix predictions are not influenced by memory state accumulated during the earlier run_epoch test evaluation pass.
  • Committed the change on the current branch with commit message: Reset MAC memory before confusion matrix evaluation.

Testing

  • python -m py_compile eval_ntp.py


source = "ncbi_nuccore"
fasta = out / f"{args.accession}.fna"
if datasets:
    cmd = ["datasets", "download", "genome", "accession", "GCF_000005845.2", "--include", "genome,gff3,gbff", "--filename", str(out / "ncbi_dataset.zip")]

P2: Use CLI accession in datasets download command

The datasets command is hardcoded to GCF_000005845.2 instead of using --accession, so any non-default accession request downloads the wrong genome via the CLI path while metadata still records the user-provided accession. This creates incorrect provenance and can silently preprocess the wrong sequence once zip extraction is added.
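
One possible shape for the fix is to build the command from a parameter rather than a literal. The sketch below uses a hypothetical helper, build_datasets_cmd, and reuses only the flags already present in the reviewed line. It assumes that the genome subcommand expects an assembly accession (GCF_/GCA_ prefix); if the script's --accession holds a nucleotide accession such as NC_000913.3, a separate assembly-accession argument or a manifest lookup would likely be needed.

```python
from pathlib import Path
from typing import List


def build_datasets_cmd(assembly_accession: str, out_dir: Path) -> List[str]:
    """Build the `datasets` CLI invocation for one assembly accession.

    Hypothetical helper: the prefix check encodes the assumption that
    only GCF_/GCA_ assembly accessions are valid for this subcommand.
    """
    if not assembly_accession.startswith(("GCF_", "GCA_")):
        raise ValueError(
            f"expected an assembly accession (GCF_/GCA_), got {assembly_accession!r}"
        )
    return [
        "datasets", "download", "genome", "accession", assembly_accession,
        "--include", "genome,gff3,gbff",
        "--filename", str(out_dir / "ncbi_dataset.zip"),
    ]


cmd = build_datasets_cmd("GCF_000005845.2", Path("data/raw/ecoli_mg1655"))
print(" ".join(cmd))
```

Whatever accession ends up in the command should be the same value written to download_metadata.json, so the recorded provenance matches the bytes on disk.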

from pathlib import Path
from urllib.request import urlretrieve

from seqtrainer.genome_ntp.data import file_sha256

P2: Remove torch dependency from download-only script

This downloader imports file_sha256 from seqtrainer.genome_ntp.data, and that module imports torch at import time. Since torch is an optional dependency in this repo, the README's data-setup flow can fail before any download starts on base installs, even though hashing a file does not require ML dependencies.
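
A standard-library-only hash helper would avoid the import chain entirely. The sketch below is one way to write it; the name file_sha256 matches the import in the reviewed snippet, but the signature and chunk size are assumptions, and where the helper should live (the script itself or a torch-free utility module) is left open.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks.

    Uses only hashlib, so it can be imported without numpy or torch.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns empty bytes.
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

Reading in fixed-size chunks keeps memory use flat regardless of file size, which matters little for a single bacterial genome but costs nothing.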
