Enhancement: improve agent-facing dataset reconnaissance and repository support #2

@youngfly93

Summary

After using biocli heavily for dataset reconnaissance and downloadability checks, I found it very effective for GEO/SRA scouting, but there are a few places where it could become much more useful for agent workflows.

This issue groups the highest-value improvements into one place.

What worked well

  • geo download --list-only is fast and very useful for checking whether a GSE has downloadable supplementary files.
  • sra search is enough to confirm that an SRA project resolves to real runs.
  • Structured output is agent-friendly and much faster than clicking through web pages.

Highest-value improvements

1. Built-in update discovery / self-update

Right now version updates are awkward to discover from inside the tool.

Concrete example:

  • biocli --version showed 0.3.8
  • npm view biocli version failed because the actual package name is scoped
  • I had to manually discover that the published package is @yangfei_93sky/biocli

Suggested improvements:

  • biocli doctor --check-update
  • biocli self-update
  • biocli version --latest

At minimum, the tool should expose its own package name and latest published version.
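A rough sketch of the comparison an update check would need, assuming plain `MAJOR.MINOR.PATCH` version strings. The function names are hypothetical and the registry lookup itself is left out; pre-release tags are not handled.

```python
def parse_version(text: str) -> tuple:
    """Turn '0.3.8' (or 'v0.3.8') into (0, 3, 8)."""
    return tuple(int(part) for part in text.strip().lstrip("v").split("."))


def update_available(installed: str, latest: str) -> bool:
    """True when the latest published version is newer than the installed one."""
    # Tuples compare numerically, so 0.3.10 correctly sorts after 0.3.8.
    return parse_version(latest) > parse_version(installed)
```

Comparing parsed tuples rather than raw strings avoids treating `0.3.10` as older than `0.3.8`.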

2. Native ArrayExpress / BioStudies support

biocli is strong on GEO, but some transcriptomics datasets live in ArrayExpress / BioStudies and currently require leaving the CLI.

Concrete example:

  • E-MTAB-373 required manual BioStudies API inspection
  • the study clearly exposes raw data files plus idf / sdrf, but this is not reachable through a first-class biocli command

Suggested commands:

  • biocli ae dataset E-MTAB-373
  • biocli ae download E-MTAB-373 --list-only
  • or a more general biocli biostudies study ...

3. GEO dataset metadata should expose raw availability directly

Today I often need both:

  • biocli geo dataset ...
  • biocli geo download ... --list-only

to answer basic questions such as:

  • does this accession have raw files?
  • does it only have processed matrices?
  • what kinds of supplementary files exist?

Suggested improvement:

  • add fields such as has_raw_archive, has_supplementary_files, supplementary_file_count, supplementary_types
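As an illustration, these fields could be derived from the file names that `--list-only` already returns. This is only a sketch: the function name is hypothetical, and it assumes a raw archive follows the `<accession>_RAW.tar` naming seen on GEO.

```python
def summarize_supplementary(file_names):
    """Derive availability fields from a list of supplementary file names."""
    types = set()
    for name in file_names:
        lowered = name.lower()
        # Drop a trailing compression suffix so 'x.txt.gz' is reported as 'txt'.
        for suffix in (".gz", ".bz2", ".zip"):
            if lowered.endswith(suffix):
                lowered = lowered[: -len(suffix)]
                break
        if "." in lowered:
            types.add(lowered.rsplit(".", 1)[1])
    return {
        "has_supplementary_files": len(file_names) > 0,
        "supplementary_file_count": len(file_names),
        # Assumption: a raw archive is named '<accession>_RAW.tar'.
        "has_raw_archive": any(n.upper().endswith("_RAW.TAR") for n in file_names),
        "supplementary_types": sorted(types),
    }
```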

4. Archive content summary from filelist.txt

This would be especially valuable for agents deciding which downstream pipeline to use.

Concrete examples from real use:

  • GSE75479_RAW.tar exists, but the useful detail is that the archive contains RCC.gz
  • GSE22433_RAW.tar exists, but the useful detail is that it includes BGX
  • many Agilent entries expose raw .txt.gz files

Right now I have to fetch filelist.txt separately to infer this.

Suggested improvement:

When filelist.txt exists, surface an archive summary such as:

  • contains: CEL
  • contains: RCC
  • contains: BGX
  • contains: TXT raw exports
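A minimal sketch of how such a summary could be computed. It assumes filelist.txt is tab-separated, with a `#` header row, a first column of `Archive` or `File`, and the file name in the second column; that layout should be verified against real files. The function name is hypothetical.

```python
from collections import Counter


def summarize_filelist(text):
    """Count archive members by file type, ignoring a trailing '.gz'."""
    counts = Counter()
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header row
        fields = line.split("\t")
        if len(fields) < 2 or fields[0] != "File":
            continue  # skip the archive row itself and malformed rows
        name = fields[1].strip()
        if name.lower().endswith(".gz"):
            name = name[:-3]
        if "." in name:
            counts[name.rsplit(".", 1)[1].upper()] += 1
    return dict(counts)
```

Each key of the result can then be rendered as one `contains: ...` line.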

5. Dedicated SRA project command

sra search SRP... works as a workaround, but a project-level command would be cleaner and easier for automation.

Suggested command:

  • biocli sra project SRP276412

Useful fields:

  • project accession
  • run count
  • SRR list
  • sample titles
  • platform
  • layout
  • read length if available
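One possible shape for the result, sketched as plain data classes. All class and field names here are hypothetical, not an existing biocli schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SraRun:
    accession: str                      # SRR accession
    sample_title: str = ""
    platform: str = ""
    layout: str = ""                    # e.g. "PAIRED" or "SINGLE"
    read_length: Optional[int] = None   # None when not reported


@dataclass
class SraProject:
    accession: str                      # SRP accession
    runs: List[SraRun] = field(default_factory=list)

    @property
    def run_count(self) -> int:
        return len(self.runs)

    @property
    def srr_list(self) -> List[str]:
        return [run.accession for run in self.runs]
```

Deriving `run_count` and `srr_list` from the run records keeps the summary fields from drifting out of sync with the per-run details.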

6. Clearer error classes

From an agent perspective, these failure modes are very different and should be distinguishable:

  • accession exists but has no supplementary files
  • network failure
  • upstream API failure
  • unsupported repository
  • malformed accession

Suggested improvement:

  • return explicit machine-readable error categories instead of collapsing multiple cases into generic fetch failures
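For example, the categories above could map to a fixed set of codes in the structured output. This is a sketch only: the category strings, payload layout, and function names are assumptions, not current biocli behavior.

```python
from enum import Enum


class ErrorCategory(str, Enum):
    NO_SUPPLEMENTARY_FILES = "no_supplementary_files"
    NETWORK_FAILURE = "network_failure"
    UPSTREAM_API_FAILURE = "upstream_api_failure"
    UNSUPPORTED_REPOSITORY = "unsupported_repository"
    MALFORMED_ACCESSION = "malformed_accession"


def categorize_fetch(status):
    """Map a fetch outcome to a category.

    `status` is the HTTP status code, or None when no response arrived at all.
    Returns None when the outcome is not one of the transport-level failures.
    """
    if status is None:
        return ErrorCategory.NETWORK_FAILURE
    if status >= 500:
        return ErrorCategory.UPSTREAM_API_FAILURE
    return None


def error_payload(category, accession, detail=""):
    """Build a machine-readable error object an agent can branch on."""
    return {
        "ok": False,
        "error": {
            "category": category.value,
            "accession": accession,
            "detail": detail,
        },
    }
```

An agent can then retry on `network_failure`, back off on `upstream_api_failure`, and stop immediately on `malformed_accession`.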

7. Optional platform / pipeline hints

This should stay lightweight, but even a small hint would help downstream orchestration a lot.

Examples:

  • Affymetrix CEL detected
  • Agilent raw TXT detected
  • Illumina BGX detected
  • NanoString RCC detected
  • RNA-seq counts matrix only
  • RNA-seq raw runs available via SRA

That would let agents choose the right preprocessing route much faster.
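A lightweight version could be a simple lookup from detected file types to hint strings. The table and function below are a hypothetical sketch; TXT is deliberately left out because a `.txt` extension alone does not prove an Agilent raw export.

```python
# Assumption: file types arrive as short tags such as "CEL" or "rcc".
PIPELINE_HINTS = {
    "CEL": "Affymetrix CEL detected",
    "BGX": "Illumina BGX detected",
    "RCC": "NanoString RCC detected",
}


def platform_hints(type_tags):
    """Return hint strings for recognized tags, in a stable sorted order."""
    tags = sorted({tag.upper() for tag in type_tags})
    return [PIPELINE_HINTS[tag] for tag in tags if tag in PIPELINE_HINTS]
```

Unrecognized tags produce no hint rather than a guess, which keeps the feature optional and low-risk.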

Why this matters

biocli is already very useful as a biological data discovery tool. The next big step is to make it not just a query tool, but also a better routing tool for downstream analysis workflows.

For agent usage, the difference is important:

  • query tool: "does data exist?"
  • routing tool: "what exact kind of data exists, and what pipeline should handle it?"

Suggested rollout

If this is too much for one release, my priority order would be:

  1. update discovery / self-update
  2. ArrayExpress / BioStudies support
  3. GEO archive content summary from filelist.txt
  4. dedicated SRA project command
  5. better machine-readable error classes
  6. optional platform hints
