Refactor repo: framework‑neutral core, adapters, SPARQL/SBOL graph tooling, CLI, and tests#2
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d572efc75e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| sequences = get_sequences_from_sbol(file_paths) | ||
| df = pd.DataFrame({"sequence": sequences, "target": y_labels}) | ||
| return df No newline at end of file | ||
| return [] |
There was a problem hiding this comment.
Restore y-URI discovery in dataset_builder wrapper
find_possible_y_uris is now hardcoded to return an empty list, so any legacy workflow that relied on auto-detecting numeric label predicates will always fail to discover a usable y_uri. This is a functional regression in a module labeled as a compatibility wrapper, and it blocks dataset construction for users who do not already know the exact predicate URI.
Useful? React with 👍 / 👎.
| """Count k-mer occurrences using A/C/G/T vocabulary.""" | ||
| if k <= 0: | ||
| raise ValueError("k must be positive") | ||
| seq = normalize_sequence(sequence).replace("N", "") |
There was a problem hiding this comment.
Avoid creating k-mers across unknown bases
Removing all N characters before counting (normalize_sequence(sequence).replace("N", "")) stitches separated regions together and produces k-mers that never existed contiguously in the original sequence (e.g., ANCG with k=2 yields AC). That corrupts feature vectors whenever ambiguous bases are present and can bias downstream model training/evaluation.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Summary
- Fixed
kmer_countsto treat ambiguousNbases as segment boundaries instead of deleting them, which prevents generating artificial cross-boundary k-mers (e.g.,ACfromANCG). The function now counts windows per contiguous A/C/G/T segment and normalizes by the true number of valid windows. src/seqtrainer/transforms/dna.pyL56-L75 - Added a regression test to verify that
k=2onANCGdoes not countAC, while still counting valid in-segment k-mers likeCG. tests/test_transforms_dna.pyL30-L34 - Committed the fix on the current branch with commit
14916bdand created a follow-up PR viamake_prtitled “Follow-up: prevent k-mer stitching across ambiguous bases”.
Testing
- ✅
pytest -q tests/test_transforms_dna.py
| for i, seq in enumerate(processed): | ||
| for j, token in enumerate(seq): | ||
| token_ids[i, j] = vocab.get(token, vocab.get("N", 0)) | ||
| attention_mask[i, j] = 0 if token == pad_char else 1 |
There was a problem hiding this comment.
Do not mask real 'N' tokens as padding
The attention mask marks every token equal to pad_char as 0, and with the default pad_char="N" this masks genuine ambiguous nucleotide positions, not just padded positions. For tokenized sequence models this suppresses valid inputs whenever biological sequences contain N, leading to systematically incorrect attention behavior.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Summary
- Fixed
token_idsattention-mask generation so it no longer treats every token equal topad_charas padding. The mask now uses normalized pre-padding content length, which preserves real biologicalNtokens while still masking true padded positions. src/seqtrainer/data/tensorization.pyL42-L57 - Added a regression test that verifies a real
Nin sequence content ("AN") remains unmasked and only padded tail positions are masked ([[1, 1, 0, 0]]). tests/test_tensorization_pipeline.pyL40-L48 - Committed the fix on the current branch (
77cd665) and created a follow-up PR viamake_prwith a focused title/body describing this change.
Testing
- ✅
pytest -q tests/test_tensorization_pipeline.py
Motivation
Description
seqtrainer/and updated packaging inpyproject.tomlincluding optional extras and an entrypoint scriptseqtrainer.SynBioHubClientwith retry, pagination, JSON decoding and SBOL fetch helpers inseqtrainer.clients.synbiohub.MaterializedDataset,DatasetRecipe, snapshot manifest and helpers inseqtrainer.data(cache.py,materialized.py,recipes.py,sbol.py,tensorization.py).seqtrainer.transforms.dnaand backward-compatible wrappers inpreprocessing.pyanddataset_builder.py.seqtrainer.sparqland canonical recipes such assequence_query.seqtrainer.graph(rdf.py,config.py).seqtrainer.torch/*) and Keras (seqtrainer.keras/*) including tensorization adapters, HF/DNABERT backbones, heads, and model composition factories.models.registrywith default specs and a task-level application blueprintbuild_promoter_regression_blueprintinseqtrainer.applications.seqtrainer.cli.main) and README/docs:README.md,README_VISION.md,docs/architecture.md, anddocs/migration.mdthat document the new layout and migration guidance.gnn.py) and provided compatibility wrappers where helpful.tests/that exercise data abstractions, caching, graph config, SPARQL recipes/normalization, SynBioHub client behavior, tensorization and adapters, transforms, and import surface.Testing
pytest -qovertests/) covering data, cache, graph, sparql, client, tensorization, transforms, adapters, and import smoke tests; all tests passed locally in CI (38 tests).Codex Task