Skip to content

Refactor repo: framework‑neutral core, adapters, SPARQL/SBOL graph tooling, CLI, and tests#2

Merged
Gonza10V merged 7 commits into
mainfrom
codex/refactor-seqtrainer-repository-structure-rxqpfs
Apr 25, 2026
Merged

Refactor repo: framework‑neutral core, adapters, SPARQL/SBOL graph tooling, CLI, and tests#2
Gonza10V merged 7 commits into
mainfrom
codex/refactor-seqtrainer-repository-structure-rxqpfs

Conversation

@Gonza10V
Copy link
Copy Markdown
Owner

Motivation

  • Provide a clear, framework-neutral domain core that owns SBOL/SynBioHub semantics and dataset materialization.
  • Enable optional framework adapters for Keras and PyTorch so model infra is pluggable and the base install stays lightweight.
  • Stabilize graph/SPARQL utilities and add durable dataset caching/snapshot semantics for reproducible workflows.
  • Replace ad-hoc prototype scripts with packaged CLIs and application blueprints to simplify common workflows.

Description

  • Reorganized package layout under seqtrainer/ and updated packaging in pyproject.toml including optional extras and an entrypoint script seqtrainer.
  • Added a production-ready SynBioHubClient with retry, pagination, JSON decoding and SBOL fetch helpers in seqtrainer.clients.synbiohub.
  • Introduced dataset abstractions and cache: MaterializedDataset, DatasetRecipe, snapshot manifest and helpers in seqtrainer.data (cache.py, materialized.py, recipes.py, sbol.py, tensorization.py).
  • Implemented DNA transforms in seqtrainer.transforms.dna and backward-compatible wrappers in preprocessing.py and dataset_builder.py.
  • Added SPARQL builder/recipes/normalization/prefixes in seqtrainer.sparql and canonical recipes such as sequence_query.
  • Added graph utilities for RDF/SBOL schema/edge extraction and AutoRDF2GML-style config builders under seqtrainer.graph (rdf.py, config.py).
  • Added framework adapters and factories for PyTorch (seqtrainer.torch/*) and Keras (seqtrainer.keras/*) including tensorization adapters, HF/DNABERT backbones, heads, and model composition factories.
  • Added a small framework-neutral models.registry with default specs and a task-level application blueprint build_promoter_regression_blueprint in seqtrainer.applications.
  • Added CLI entrypoints (seqtrainer.cli.main) and README/docs: README.md, README_VISION.md, docs/architecture.md, and docs/migration.md that document the new layout and migration guidance.
  • Removed or deprecated prototype modules (e.g., legacy gnn.py) and provided compatibility wrappers where helpful.
  • Added unit tests under tests/ that exercise data abstractions, caching, graph config, SPARQL recipes/normalization, SynBioHub client behavior, tensorization and adapters, transforms, and import surface.

Testing

  • Ran the full pytest suite (pytest -q over tests/) covering data, cache, graph, sparql, client, tensorization, transforms, adapters, and import smoke tests; all tests passed locally in CI (38 tests).
  • Exercised SPARQL code paths and SynBioHub client behavior via unit tests using mocked HTTP responses and pagination scenarios, which succeeded.
  • Verified tensorization and adapter error paths for missing optional dependencies via targeted unit tests, which succeeded.

Codex Task

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d572efc75e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

sequences = get_sequences_from_sbol(file_paths)
df = pd.DataFrame({"sequence": sequences, "target": y_labels})
return df No newline at end of file
return []
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Restore y-URI discovery in dataset_builder wrapper

find_possible_y_uris is now hardcoded to return an empty list, so any legacy workflow that relied on auto-detecting numeric label predicates will always fail to discover a usable y_uri. This is a functional regression in a module labeled as a compatibility wrapper, and it blocks dataset construction for users who do not already know the exact predicate URI.

Useful? React with 👍 / 👎.

Comment thread src/seqtrainer/transforms/dna.py Outdated
"""Count k-mer occurrences using A/C/G/T vocabulary."""
if k <= 0:
raise ValueError("k must be positive")
seq = normalize_sequence(sequence).replace("N", "")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid creating k-mers across unknown bases

Removing all N characters before counting (normalize_sequence(sequence).replace("N", "")) stitches separated regions together and produces k-mers that never existed contiguously in the original sequence (e.g., ANCG with k=2 yields AC). That corrupts feature vectors whenever ambiguous bases are present and can bias downstream model training/evaluation.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@codex implement a fix for this

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

  • Fixed kmer_counts to treat ambiguous N bases as segment boundaries instead of deleting them, which prevents generating artificial cross-boundary k-mers (e.g., AC from ANCG). The function now counts windows per contiguous A/C/G/T segment and normalizes by the true number of valid windows. src/seqtrainer/transforms/dna.pyL56-L75
  • Added a regression test to verify that k=2 on ANCG does not count AC, while still counting valid in-segment k-mers like CG. tests/test_transforms_dna.pyL30-L34
  • Committed the fix on the current branch with commit 14916bd and created a follow-up PR via make_pr titled “Follow-up: prevent k-mer stitching across ambiguous bases”.

Testing

  • pytest -q tests/test_transforms_dna.py

View task →

Comment thread src/seqtrainer/data/tensorization.py Outdated
for i, seq in enumerate(processed):
for j, token in enumerate(seq):
token_ids[i, j] = vocab.get(token, vocab.get("N", 0))
attention_mask[i, j] = 0 if token == pad_char else 1
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Do not mask real 'N' tokens as padding

The attention mask marks every token equal to pad_char as 0, and with the default pad_char="N" this masks genuine ambiguous nucleotide positions, not just padded positions. For tokenized sequence models this suppresses valid inputs whenever biological sequences contain N, leading to systematically incorrect attention behavior.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@codex implement a fix for this

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

  • Fixed token_ids attention-mask generation so it no longer treats every token equal to pad_char as padding. The mask now uses normalized pre-padding content length, which preserves real biological N tokens while still masking true padded positions. src/seqtrainer/data/tensorization.pyL42-L57
  • Added a regression test that verifies a real N in sequence content ("AN") remains unmasked and only padded tail positions are masked ([[1, 1, 0, 0]]). tests/test_tensorization_pipeline.pyL40-L48
  • Committed the fix on the current branch (77cd665) and created a follow-up PR via make_pr with a focused title/body describing this change.

Testing

  • pytest -q tests/test_tensorization_pipeline.py

View task →

@Gonza10V Gonza10V merged commit a1bd573 into main Apr 25, 2026
12 checks passed
@Gonza10V Gonza10V deleted the codex/refactor-seqtrainer-repository-structure-rxqpfs branch April 25, 2026 05:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant