Refactor repo: framework‑neutral core, adapters, SPARQL/SBOL graph tooling, CLI, and tests by Gonza10V · Pull Request #2 · Gonza10V/SeqTrainer

Gonza10V · 2026-04-18T04:14:19Z

Motivation

Provide a clear, framework-neutral domain core that owns SBOL/SynBioHub semantics and dataset materialization.
Enable optional framework adapters for Keras and PyTorch so model infra is pluggable and the base install stays lightweight.
Stabilize graph/SPARQL utilities and add durable dataset caching/snapshot semantics for reproducible workflows.
Replace ad-hoc prototype scripts with packaged CLIs and application blueprints to simplify common workflows.

Description

Reorganized package layout under seqtrainer/ and updated packaging in pyproject.toml including optional extras and an entrypoint script seqtrainer.
Added a production-ready SynBioHubClient with retry, pagination, JSON decoding and SBOL fetch helpers in seqtrainer.clients.synbiohub.
Introduced dataset abstractions and cache: MaterializedDataset, DatasetRecipe, snapshot manifest and helpers in seqtrainer.data (cache.py, materialized.py, recipes.py, sbol.py, tensorization.py).
Implemented DNA transforms in seqtrainer.transforms.dna and backward-compatible wrappers in preprocessing.py and dataset_builder.py.
Added SPARQL builder/recipes/normalization/prefixes in seqtrainer.sparql and canonical recipes such as sequence_query.
Added graph utilities for RDF/SBOL schema/edge extraction and AutoRDF2GML-style config builders under seqtrainer.graph (rdf.py, config.py).
Added framework adapters and factories for PyTorch (seqtrainer.torch/*) and Keras (seqtrainer.keras/*) including tensorization adapters, HF/DNABERT backbones, heads, and model composition factories.
Added a small framework-neutral models.registry with default specs and a task-level application blueprint build_promoter_regression_blueprint in seqtrainer.applications.
Added CLI entrypoints (seqtrainer.cli.main) and README/docs: README.md, README_VISION.md, docs/architecture.md, and docs/migration.md that document the new layout and migration guidance.
Removed or deprecated prototype modules (e.g., legacy gnn.py) and provided compatibility wrappers where helpful.
Added unit tests under tests/ that exercise data abstractions, caching, graph config, SPARQL recipes/normalization, SynBioHub client behavior, tensorization and adapters, transforms, and import surface.
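
The retry and pagination behaviour attributed to SynBioHubClient above can be pictured with a small stand-alone sketch. This is not the client's actual code: fetch_all_rows, its parameters, and the fetch_page callback are hypothetical names chosen for illustration, and the sketch assumes a page shorter than page_size marks the end of the result set.

```python
import time
from typing import Callable, Iterator, List


def fetch_all_rows(
    fetch_page: Callable[[int, int], List[dict]],
    page_size: int = 100,
    max_retries: int = 3,
    backoff_seconds: float = 0.0,
) -> Iterator[dict]:
    """Yield rows from a paged source, retrying each failed page request.

    fetch_page(offset, limit) is a hypothetical callback that returns one
    page of rows, for example by running a query with LIMIT/OFFSET.
    """
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    if max_retries <= 0:
        raise ValueError("max_retries must be positive")

    offset = 0
    while True:
        for attempt in range(1, max_retries + 1):
            try:
                rows = fetch_page(offset, page_size)
                break
            except ConnectionError:
                # Give up after the last attempt, otherwise back off and retry.
                if attempt == max_retries:
                    raise
                time.sleep(backoff_seconds * attempt)

        yield from rows

        # Assumption: a short (or empty) page means there is nothing left.
        if len(rows) < page_size:
            return
        offset += page_size
```

Passing the page fetcher in as a callback keeps the loop testable with an in-memory fake, which mirrors the mocked-HTTP approach the PR describes for its client tests.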

Testing

Ran the full pytest suite (pytest -q over tests/), which covers data, cache, graph, sparql, client, tensorization, transforms, adapters, and import smoke tests; all 38 tests passed.
Exercised SPARQL code paths and SynBioHub client behavior via unit tests using mocked HTTP responses and pagination scenarios, which succeeded.
Verified tensorization and adapter error paths for missing optional dependencies via targeted unit tests, which succeeded.
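
One common way to produce the kind of error path tested here is a guarded import that turns a missing optional dependency into an actionable message. The sketch below is illustrative only: require_optional is a hypothetical helper, and the extra name passed to it is an assumption, since the PR description does not list the extras defined in pyproject.toml.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency or raise an ImportError with a hint.

    `extra` is the (assumed) name of the packaging extra that would
    install the missing module.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is required for this feature. "
            f"Install it with: pip install 'seqtrainer[{extra}]'"
        ) from exc
```

Calling the helper inside the adapter function, rather than at module import time, keeps the base install importable when the framework is absent.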

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d572efc75e

chatgpt-codex-connector · 2026-04-18T04:17:55Z

-    sequences = get_sequences_from_sbol(file_paths)
-    df = pd.DataFrame({"sequence": sequences, "target": y_labels})
-    return df
+    return []


Restore y-URI discovery in dataset_builder wrapper

find_possible_y_uris is now hardcoded to return an empty list, so any legacy workflow that relied on auto-detecting numeric label predicates will always fail to discover a usable y_uri. This is a functional regression in a module labeled as a compatibility wrapper, and it blocks dataset construction for users who do not already know the exact predicate URI.
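
As a rough picture of what discovering numeric label predicates involves, the sketch below scans triples and keeps predicates whose object values parse as numbers. It is not the original find_possible_y_uris: the function name, the plain (subject, predicate, object) string tuples, and the min_fraction threshold are all assumptions made so the example is self-contained.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object) as plain strings


def _is_number(text: str) -> bool:
    """Return True if the literal can be parsed as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def find_numeric_predicates(
    triples: Iterable[Triple], min_fraction: float = 1.0
) -> List[str]:
    """Return predicates whose objects are numeric in at least min_fraction of uses."""
    total = defaultdict(int)
    numeric = defaultdict(int)
    for _subject, predicate, obj in triples:
        total[predicate] += 1
        if _is_number(obj):
            numeric[predicate] += 1
    return sorted(p for p in total if numeric[p] / total[p] >= min_fraction)


# Hypothetical example data; the predicate names are made up.
example = [
    ("part1", "ex:strength", "0.42"),
    ("part2", "ex:strength", "1.7"),
    ("part1", "ex:name", "pTet"),
    ("part2", "ex:copies", "3"),
]
print(find_numeric_predicates(example))  # ['ex:copies', 'ex:strength']
```

A wrapper that returns candidates like these, instead of an empty list, would let legacy callers pick a y_uri without knowing the predicate in advance.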

chatgpt-codex-connector · 2026-04-18T04:17:55Z

+    """Count k-mer occurrences using A/C/G/T vocabulary."""
+    if k <= 0:
+        raise ValueError("k must be positive")
+    seq = normalize_sequence(sequence).replace("N", "")


Avoid creating k-mers across unknown bases

Removing all N characters before counting (normalize_sequence(sequence).replace("N", "")) stitches separated regions together and produces k-mers that never existed contiguously in the original sequence (e.g., ANCG with k=2 yields AC). That corrupts feature vectors whenever ambiguous bases are present and can bias downstream model training/evaluation.

@codex implement a fix for this

Summary

Fixed kmer_counts to treat ambiguous N bases as segment boundaries instead of deleting them, which prevents generating artificial cross-boundary k-mers (e.g., AC from ANCG). The function now counts windows per contiguous A/C/G/T segment and normalizes by the true number of valid windows. src/seqtrainer/transforms/dna.pyL56-L75

Added a regression test to verify that k=2 on ANCG does not count AC, while still counting valid in-segment k-mers like CG. tests/test_transforms_dna.pyL30-L34

Committed the fix on the current branch with commit 14916bd and created a follow-up PR via make_pr titled “Follow-up: prevent k-mer stitching across ambiguous bases”.
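
The segment-boundary counting described in this summary can be sketched as follows. This is an approximation, not the code in src/seqtrainer/transforms/dna.py: kmer_frequencies is a hypothetical name, upper-casing stands in for normalize_sequence, and any character outside A/C/G/T (not only N) is treated as a boundary.

```python
import re
from collections import Counter
from typing import Dict


def kmer_frequencies(sequence: str, k: int) -> Dict[str, float]:
    """Count k-mers per contiguous A/C/G/T segment and normalize the counts."""
    if k <= 0:
        raise ValueError("k must be positive")

    counts: Counter = Counter()
    # Each maximal run of A/C/G/T is an independent segment, so no window
    # can span an ambiguous base.
    for segment in re.findall(r"[ACGT]+", sequence.upper()):
        for i in range(len(segment) - k + 1):
            counts[segment[i:i + k]] += 1

    # Normalize by the number of valid windows actually counted.
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


print(kmer_frequencies("ANCG", 2))  # {'CG': 1.0}
```

With ANCG and k=2 the segments are A and CG, so only CG is counted and the stitched k-mer AC never appears.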

Testing

✅ pytest -q tests/test_transforms_dna.py

chatgpt-codex-connector · 2026-04-18T04:17:55Z

+    for i, seq in enumerate(processed):
+        for j, token in enumerate(seq):
+            token_ids[i, j] = vocab.get(token, vocab.get("N", 0))
+            attention_mask[i, j] = 0 if token == pad_char else 1


Do not mask real 'N' tokens as padding

The attention mask marks every token equal to pad_char as 0, and with the default pad_char="N" this masks genuine ambiguous nucleotide positions, not just padded positions. For tokenized sequence models this suppresses valid inputs whenever biological sequences contain N, leading to systematically incorrect attention behavior.

@codex implement a fix for this

Summary

Fixed token_ids attention-mask generation so it no longer treats every token equal to pad_char as padding. The mask now uses normalized pre-padding content length, which preserves real biological N tokens while still masking true padded positions. src/seqtrainer/data/tensorization.pyL42-L57

Added a regression test that verifies a real N in sequence content ("AN") remains unmasked and only padded tail positions are masked ([[1, 1, 0, 0]]). tests/test_tensorization_pipeline.pyL40-L48

Committed the fix on the current branch (77cd665) and created a follow-up PR via make_pr with a focused title/body describing this change.
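
A minimal sketch of the length-based masking described in this summary, using plain Python lists. It is not the code in src/seqtrainer/data/tensorization.py: tokenize_with_mask, the VOCAB mapping, and the truncation to max_length are assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical vocabulary; the real mapping may differ.
VOCAB: Dict[str, int] = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}


def tokenize_with_mask(
    sequences: Sequence[str],
    max_length: int,
    vocab: Dict[str, int] = VOCAB,
    pad_char: str = "N",
) -> Tuple[List[List[int]], List[List[int]]]:
    """Return (token_ids, attention_mask) with the mask based on content length."""
    token_ids: List[List[int]] = []
    attention_mask: List[List[int]] = []
    for raw in sequences:
        seq = raw.upper()[:max_length]
        content_length = len(seq)  # measured before any padding is added
        padded = seq + pad_char * (max_length - content_length)

        # Unknown characters fall back to the N id, as in the reviewed diff.
        token_ids.append([vocab.get(ch, vocab.get("N", 0)) for ch in padded])
        # Mask by position, not by character, so a real N stays visible.
        attention_mask.append(
            [1] * content_length + [0] * (max_length - content_length)
        )
    return token_ids, attention_mask


ids, mask = tokenize_with_mask(["AN"], max_length=4)
print(mask)  # [[1, 1, 0, 0]]
```

Because the mask depends only on how many characters the sequence had before padding, the choice of pad_char no longer affects which positions are attended to.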

Testing

✅ pytest -q tests/test_tensorization_pipeline.py

… commands

Stabilize graph module with schema extraction and config APIs

d572efc

Gonza10V added the codex label Apr 18, 2026 — with ChatGPT Codex Connector

chatgpt-codex-connector Bot reviewed Apr 18, 2026

Gonza10V added 6 commits April 17, 2026 22:42

Expand CLI with dataset recipes, cache flags, outputs and model build…

10cf0d6

… commands

Harden migration path with deprecation warnings and timeline docs

66b9b29

Add CI matrix with lint, typing, and packaging checks

2f11d0b

Fix token attention mask to preserve real N bases

a5da385

Fix k-mer counting across unknown DNA bases

e808221

Trim flanking N runs before k-mer segmentation

d59faa7

Gonza10V merged commit a1bd573 into main Apr 25, 2026
12 checks passed

Gonza10V deleted the codex/refactor-seqtrainer-repository-structure-rxqpfs branch April 25, 2026 05:30

Gonza10V merged 7 commits into main from codex/refactor-seqtrainer-repository-structure-rxqpfs
