Add MG1655 genome next-token-prediction pipeline with MAC memory scaffold #12
Gonza10V wants to merge 1 commit into
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1035ada49e
    _, _, test_ds, _ = load_datasets(Path(args.processed), args.chunk_len, args.stride)
    test_loader = make_loader(test_ds, 16)
    model.reset_memory(); metrics = run_epoch(model, test_loader, None)
    cm = confusion_matrix(model, test_loader, cfg.vocab_size)
Reset model memory before computing confusion matrix
The confusion-matrix pass reuses the same model instance immediately after run_epoch, which updates the MAC memory across the entire test set. Because confusion_matrix is then run without a reset_memory(), its predictions are conditioned on memory that already contains future test information, so the reported matrix is not an unbiased held-out evaluation.
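A minimal sketch of the effect described in this comment. It uses a hypothetical `ToyMemoryModel` with a trivial integer "memory", not the PR's `MACMemoryNTP`; the only assumption carried over from the PR is that the model exposes `reset_memory()` and that its predictions depend on state accumulated during earlier passes.

```python
class ToyMemoryModel:
    """Hypothetical stand-in for a model with persistent memory."""

    def __init__(self):
        self.memory = 0

    def reset_memory(self):
        # Return to the initial (empty) memory state.
        self.memory = 0

    def predict(self, token):
        # The prediction depends on memory, and memory is updated afterwards.
        pred = (token + self.memory) % 7
        self.memory += token
        return pred


def evaluate(model, tokens):
    """Run one pass over the tokens and collect predictions."""
    return [model.predict(t) for t in tokens]


test_tokens = [1, 2, 3, 4]
model = ToyMemoryModel()

model.reset_memory()
first_pass = evaluate(model, test_tokens)

# No reset here: this pass starts from memory that already saw the test set.
stale_pass = evaluate(model, test_tokens)

# Resetting first reproduces the clean held-out behaviour.
model.reset_memory()
clean_pass = evaluate(model, test_tokens)
```

In this toy setting `stale_pass` differs from `first_pass`, while `clean_pass` matches it, which is the behaviour a second `reset_memory()` call before `confusion_matrix(...)` is meant to restore.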
Summary
- Implemented the requested fix in `eval_ntp.py` by resetting MAC memory immediately before the confusion-matrix pass, so confusion-matrix predictions are not influenced by memory state accumulated during the earlier `run_epoch` test evaluation pass.
- Committed the change on the current branch with commit message: Reset MAC memory before confusion matrix evaluation.
Testing
- ✅ `python -m py_compile eval_ntp.py`
    source = "ncbi_nuccore"
    fasta = out / f"{args.accession}.fna"
    if datasets:
        cmd = ["datasets", "download", "genome", "accession", "GCF_000005845.2", "--include", "genome,gff3,gbff", "--filename", str(out / "ncbi_dataset.zip")]
Use CLI accession in datasets download command
The datasets command is hardcoded to GCF_000005845.2 instead of using --accession, so any non-default accession request downloads the wrong genome via the CLI path while metadata still records the user-provided accession. This creates incorrect provenance and can silently preprocess the wrong sequence once zip extraction is added.
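One way to address this, sketched under assumptions: the helper name `build_datasets_cmd` is hypothetical, and the sketch assumes the value passed via `--accession` is one the `datasets` CLI accepts for `download genome accession`. The argument list itself mirrors the command quoted in the diff above, with the hardcoded accession replaced by the parameter.

```python
from pathlib import Path
from typing import List


def build_datasets_cmd(accession: str, out_dir: Path) -> List[str]:
    """Build the `datasets` CLI invocation for the requested accession.

    Hypothetical helper: the accession comes from the caller (e.g. the
    parsed --accession value) instead of being hardcoded in the command.
    """
    return [
        "datasets", "download", "genome", "accession", accession,
        "--include", "genome,gff3,gbff",
        "--filename", str(out_dir / "ncbi_dataset.zip"),
    ]


# Same command as in the diff when the default assembly accession is used.
cmd = build_datasets_cmd("GCF_000005845.2", Path("data/raw/ecoli_mg1655"))
```

With this shape, the accession written to the metadata file and the accession actually downloaded come from the same variable, so they cannot drift apart.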
    from pathlib import Path
    from urllib.request import urlretrieve

    from seqtrainer.genome_ntp.data import file_sha256
Remove torch dependency from download-only script
This downloader imports file_sha256 from seqtrainer.genome_ntp.data, and that module imports torch at import time. Since torch is an optional dependency in this repo, the README's data-setup flow can fail before any download starts on base installs, even though hashing a file does not require ML dependencies.
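A possible fix is to give the downloader a standard-library-only hash helper, or to move the helper into a module that does not import `torch`. The sketch below assumes `file_sha256` takes a path and returns a hex digest; the actual signature in `seqtrainer.genome_ntp.data` is not shown in this PR excerpt, and the `chunk_size` parameter is an illustrative addition.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks.

    Uses only the standard library, so importing it never pulls in torch.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes at EOF.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Reading in chunks keeps memory use flat even for large genome archives.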
Motivation
- … `NC_000913.3`) to ensure deterministic provenance and enable repeatable experiments for downstream promoter tasks.

Description
- `scripts/download_genomes.py` that fetches the pinned accession (fallback to direct FASTA fetch) and records provenance fields including `accession`, `source`, `download_date`, `file`, and `sha256` in `data/raw/ecoli_mg1655/` and `download_metadata.json`.
- `src/seqtrainer/genome_ntp/data.py` plus `scripts/preprocess_genome_ntp.py` to parse FASTA, normalize DNA to `A/C/G/T/N`, optionally append the reverse complement, perform interval-based `train/val/test` splits with configurable `buffer_bp` to avoid leakage, export `tokens.npy` and a `metadata.json` with preprocessing parameters and splits in `data/processed/ecoli_mg1655/`.
- `src/seqtrainer/genome_ntp/model.py` (`MACMemoryNTP` + `MacNTPConfig`) that prepends long-term memory slots to chunk embeddings, applies local causal Transformer encoding, updates memory with a decay gate, and reports a `surprise` proxy and `memory_norm` statistics.
- `src/seqtrainer/genome_ntp/tokenizer.py` with `A/C/G/T/N` tokens and `reverse_complement` utility, and a genome manifest format in `src/seqtrainer/genome_ntp/manifest.py` with an initial `configs/genome_manifest.json` entry for MG1655.
- `train_ntp.py` (configurable CLI; checkpointing; writes `metrics.json`) and `eval_ntp.py` (held-out evaluation, confusion matrix, and Markov baseline stub), plus dataset/loaders and metrics helpers in `src/seqtrainer/genome_ntp/train_eval.py`.
- `src/seqtrainer/genome_ntp/downstream.py` describing evaluation modes (`scratch`, `finetune`, `frozen_encoder`) and target metrics for classification/activity tasks, updated `README.md` with usage examples, and added minimal unit tests in `tests/test_genome_ntp.py` covering tokenizer, FASTA parsing, interval splitting, and windowing.
- `src/seqtrainer/__init__.py` to avoid import-time optional dependency failures in lightweight test runs.
- `datasets` CLI path downloads a zip for `GCF_000005845.2` but automatic extraction/selection of FASTA from that archive is left as next-step scaffolding; Markov baseline is a first-pass stub while fuller baselines can be added later.

Testing
- `python -m pytest -q tests/test_genome_ntp.py` which failed during import in this environment due to missing runtime dependencies (initial failure: missing `requests`, then `pandas`, then `numpy`) so tests could not complete; the test files `tests/test_genome_ntp.py` were added and exercise `DNATokenizer`, `parse_fasta`, `interval_split`, and `GenomeWindowDataset` but require installing dependencies to run.
- `PYTHONPATH=src python scripts/preprocess_genome_ntp.py`, `PYTHONPATH=src python train_ntp.py`, and `PYTHONPATH=src python eval_ntp.py`, but execution was blocked in this environment by missing packages such as `numpy` and `torch` (so training and evaluation were not executed to completion here).
- … `numpy`, `torch`, etc.) the unit tests and smoke-training flow should run (see `README.md` for example commands).
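For readers unfamiliar with the interval-based `train/val/test` split with `buffer_bp` mentioned in the Description, here is a rough sketch of the idea. It is not the PR's `interval_split`: the function name `sketch_interval_split`, its parameters, the default fractions, and the contiguous train-then-val-then-test layout are all assumptions made for illustration.

```python
from typing import Dict, Tuple


def sketch_interval_split(
    length: int,
    val_frac: float = 0.1,
    test_frac: float = 0.1,
    buffer_bp: int = 1000,
) -> Dict[str, Tuple[int, int]]:
    """Split [0, length) into contiguous train/val/test half-open intervals.

    A gap of buffer_bp positions is left unused between neighbouring
    intervals, so windows near a boundary cannot overlap the next split.
    """
    val_len = int(length * val_frac)
    test_len = int(length * test_frac)
    train_end = length - val_len - test_len - 2 * buffer_bp
    if train_end <= 0 or val_len <= 0 or test_len <= 0:
        raise ValueError("sequence too short for the requested split and buffer")

    val_start = train_end + buffer_bp
    val_end = val_start + val_len
    test_start = val_end + buffer_bp
    test_end = test_start + test_len
    return {
        "train": (0, train_end),
        "val": (val_start, val_end),
        "test": (test_start, test_end),
    }


splits = sketch_interval_split(10_000, buffer_bp=100)
```

The buffer positions are simply dropped from all splits; that costs a small amount of data but keeps any window shorter than `buffer_bp` from straddling two splits.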