Summary
After using biocli heavily for dataset reconnaissance and downloadability checks, I found it very effective for GEO/SRA scouting, but there are a few places where it could become much more useful for agent workflows.
This issue groups the highest-value improvements into one place.
What worked well
- geo download --list-only is fast and very useful for checking whether a GSE has downloadable supplementary files.
- sra search is enough to confirm that an SRA project resolves to real runs.
- Structured output is agent-friendly and much faster than clicking through web pages.
Highest-value improvements
1. Built-in update discovery / self-update
Right now version updates are awkward to discover from inside the tool.
Concrete example:
- biocli --version showed 0.3.8
- npm view biocli version failed because the actual package name is scoped
- I had to manually discover that the published package is @yangfei_93sky/biocli
Suggested improvements:
- biocli doctor --check-update
- biocli self-update
- biocli version --latest
At minimum, the tool should expose its own package name and latest published version.
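To make the request concrete, here is a minimal sketch of the payload an update check could emit. It is written in Python purely for illustration and is not biocli's code. The function names and output keys are hypothetical, the latest version is passed in rather than fetched from the npm registry, and only plain numeric versions like 0.3.8 are handled (no pre-release tags).

```python
import json

# Package name taken from the example above; everything else is illustrative.
PACKAGE_NAME = "@yangfei_93sky/biocli"


def parse_version(version: str) -> tuple:
    """Turn a plain 'major.minor.patch' string into a comparable tuple of ints."""
    return tuple(int(part) for part in version.strip().lstrip("v").split("."))


def update_report(current: str, latest: str) -> dict:
    """Build a machine-readable update report (hypothetical output shape)."""
    return {
        "package": PACKAGE_NAME,
        "current": current,
        "latest": latest,
        # Tuple comparison avoids the string-comparison trap where "0.3.10" < "0.3.8".
        "update_available": parse_version(latest) > parse_version(current),
        "install_command": f"npm install -g {PACKAGE_NAME}@{latest}",
    }


if __name__ == "__main__":
    # "0.3.9" is a made-up latest version used only to exercise the function.
    print(json.dumps(update_report("0.3.8", "0.3.9"), indent=2))
```

Exposing the scoped package name in this payload would have avoided the manual lookup described above.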
2. Native ArrayExpress / BioStudies support
biocli is strong on GEO, but some transcriptomics datasets live in ArrayExpress / BioStudies and currently require leaving the CLI.
Concrete example:
- E-MTAB-373 required manual BioStudies API inspection
- the study clearly exposes raw data files plus idf / sdrf, but this is not reachable through a first-class biocli command
Suggested commands:
- biocli ae dataset E-MTAB-373
- biocli ae download E-MTAB-373 --list-only
- or a more general biocli biostudies study ...
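One way an agent-facing entry point could decide which backend to call is to route on the accession shape. The sketch below is only an illustration in Python. The repository labels and the regular expressions are my assumptions based on the accessions mentioned in this issue (GSE..., E-MTAB-..., SRP...), not an exhaustive list of valid formats.

```python
import re
from typing import Optional

# Assumed accession shapes; real repositories accept more prefixes than these.
ACCESSION_PATTERNS = [
    ("geo", re.compile(r"^GSE\d+$")),
    ("arrayexpress", re.compile(r"^E-[A-Z]{4}-\d+$")),
    ("sra", re.compile(r"^[SED]RP\d+$")),
]


def route_accession(accession: str) -> Optional[str]:
    """Return a repository label for the accession, or None if unrecognized."""
    cleaned = accession.strip()
    for repository, pattern in ACCESSION_PATTERNS:
        if pattern.match(cleaned):
            return repository
    return None


if __name__ == "__main__":
    for acc in ["GSE75479", "E-MTAB-373", "SRP276412", "not-an-accession"]:
        print(acc, "->", route_accession(acc))
```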
3. GEO dataset metadata should expose raw availability directly
Today I often need both:
- biocli geo dataset ...
- biocli geo download ... --list-only

to answer basic questions like:
- does this accession have raw files?
- does it only have processed matrices?
- what kinds of supplementary files exist?
Suggested improvement:
- add fields such as has_raw_archive, has_supplementary_files, supplementary_file_count, supplementary_types
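As a rough sketch, these fields could be derived from the same file listing that --list-only already produces. The Python below is illustrative only. The function name is hypothetical, and it assumes the raw archive follows the GSE..._RAW.tar naming seen in the examples in this issue.

```python
from collections import Counter


def summarize_supplementary(file_names: list) -> dict:
    """Derive the proposed availability fields from supplementary file names."""
    types = Counter()
    for name in file_names:
        parts = name.lower().split(".")
        # Look past a trailing compression suffix so "x.txt.gz" counts as "txt".
        if len(parts) > 2 and parts[-1] in {"gz", "bz2", "zip"}:
            types[parts[-2]] += 1
        elif len(parts) > 1:
            types[parts[-1]] += 1
    return {
        "has_raw_archive": any(n.endswith("_RAW.tar") for n in file_names),
        "has_supplementary_files": bool(file_names),
        "supplementary_file_count": len(file_names),
        "supplementary_types": sorted(types),
    }


if __name__ == "__main__":
    # Made-up file names, used only to show the output shape.
    print(summarize_supplementary(["GSE00000_RAW.tar", "GSE00000_matrix.txt.gz"]))
```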
4. Archive content summary from filelist.txt
This would be especially valuable for agents deciding which downstream pipeline to use.
Concrete examples from real use:
- GSE75479_RAW.tar exists, but the useful detail is that the archive contains RCC.gz
- GSE22433_RAW.tar exists, but the useful detail is that it includes BGX
- many Agilent entries expose raw .txt.gz files
Right now I have to fetch filelist.txt separately to infer this.
Suggested improvement:
When filelist.txt exists, surface an archive summary such as:
- contains: CEL
- contains: RCC
- contains: BGX
- contains: TXT raw exports
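A minimal sketch of how such a summary could be computed, in Python for illustration. It assumes filelist.txt is tab-separated, with a leading "#" header line, a first column of "Archive" or "File", and the member name in the second column. That layout is my assumption and should be checked against real files. The suffix rules are illustrative, and both function names are hypothetical.

```python
import re

# Illustrative suffix rules; the labels mirror the examples above.
CONTENT_RULES = [
    ("CEL", re.compile(r"\.cel(\.gz)?$", re.IGNORECASE)),
    ("RCC", re.compile(r"\.rcc(\.gz)?$", re.IGNORECASE)),
    ("BGX", re.compile(r"\.bgx(\.gz)?$", re.IGNORECASE)),
    ("TXT raw exports", re.compile(r"\.txt(\.gz)?$", re.IGNORECASE)),
]


def member_names(filelist_text: str) -> list:
    """Extract archive member names from filelist.txt (assumed TSV layout)."""
    names = []
    for line in filelist_text.splitlines():
        fields = line.split("\t")
        if line.startswith("#") or len(fields) < 2:
            continue
        if fields[0] == "File":  # skip the "Archive" row describing the tar itself
            names.append(fields[1])
    return names


def archive_summary(names: list) -> list:
    """Return one 'contains: ...' line per content type found among the members."""
    return [
        f"contains: {label}"
        for label, pattern in CONTENT_RULES
        if any(pattern.search(name) for name in names)
    ]
```

Note that a .txt.gz suffix alone does not prove the files are raw exports, so the last label is only a hint.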
5. Dedicated SRA project command
sra search SRP... works as a workaround, but a project-level command would be cleaner and easier for automation.
Suggested command:
biocli sra project SRP276412
Useful fields:
- project accession
- run count
- SRR list
- sample titles
- platform
- layout
- read length if available
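To illustrate the output shape I have in mind, here is a small Python sketch that folds per-run records into one project-level summary. The input keys (run, sample_title, platform, layout, read_length) and the output keys are hypothetical names, not fields biocli is known to return.

```python
def summarize_project(project: str, runs: list) -> dict:
    """Aggregate per-run records (hypothetical dict keys) into a project summary."""

    def distinct(key: str) -> list:
        # Collect the distinct non-missing values for one field, in sorted order.
        return sorted({run[key] for run in runs if run.get(key) is not None})

    return {
        "project": project,
        "run_count": len(runs),
        "runs": [run["run"] for run in runs],
        "sample_titles": distinct("sample_title"),
        "platforms": distinct("platform"),
        "layouts": distinct("layout"),
        "read_lengths": distinct("read_length"),  # empty when not available
    }
```

Returning distinct values as lists keeps mixed-platform or mixed-layout projects visible instead of silently reporting only the first run.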
6. Clearer error classes
From an agent perspective, these failure modes are very different and should be distinguishable:
- accession exists but has no supplementary files
- network failure
- upstream API failure
- unsupported repository
- malformed accession
Suggested improvement:
- return explicit machine-readable error categories instead of collapsing multiple cases into generic fetch failures
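For example, the categories above could map onto a small enum plus a JSON error payload. This Python sketch is only an illustration of the idea. The category strings, the payload keys, and the retryable flag are all my assumptions.

```python
import json
from enum import Enum


class ErrorCategory(str, Enum):
    """One value per failure mode listed above (names are hypothetical)."""
    NO_SUPPLEMENTARY_FILES = "no_supplementary_files"
    NETWORK_FAILURE = "network_failure"
    UPSTREAM_API_FAILURE = "upstream_api_failure"
    UNSUPPORTED_REPOSITORY = "unsupported_repository"
    MALFORMED_ACCESSION = "malformed_accession"


# Assumption: only transient failures are worth retrying automatically.
RETRYABLE = {ErrorCategory.NETWORK_FAILURE, ErrorCategory.UPSTREAM_API_FAILURE}


def error_payload(category: ErrorCategory, accession: str, message: str) -> str:
    """Serialize an error so an agent can branch on category, not on message text."""
    return json.dumps({
        "ok": False,
        "error": {
            "category": category.value,
            "retryable": category in RETRYABLE,
            "accession": accession,
            "message": message,
        },
    })


if __name__ == "__main__":
    print(error_payload(ErrorCategory.MALFORMED_ACCESSION, "GSE-oops", "not a valid accession"))
```

With something like this, "no supplementary files" becomes a definitive answer rather than a failure an agent might retry.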
7. Optional platform / pipeline hints
This should stay lightweight, but even a small hint would help downstream orchestration a lot.
Examples:
- Affymetrix CEL detected
- Agilent raw TXT detected
- Illumina BGX detected
- NanoString RCC detected
- RNA-seq counts matrix only
- RNA-seq raw runs available via SRA
That would let agents choose the right preprocessing route much faster.
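A lightweight version could be little more than a lookup table. The Python sketch below is illustrative. The type labels and the function signature are hypothetical, and the TXT to Agilent mapping is a heuristic that would need a stronger signal (for example the GEO platform record) before being trusted.

```python
# Hint strings copied from the examples above; the keys are hypothetical labels.
HINTS = {
    "CEL": "Affymetrix CEL detected",
    "TXT": "Agilent raw TXT detected",  # heuristic: raw TXT is not always Agilent
    "BGX": "Illumina BGX detected",
    "RCC": "NanoString RCC detected",
}


def platform_hints(content_types: set,
                   has_counts_matrix: bool = False,
                   has_sra_runs: bool = False) -> list:
    """Map detected content types and availability flags to short hint strings."""
    hints = [HINTS[t] for t in sorted(content_types) if t in HINTS]
    if has_sra_runs:
        hints.append("RNA-seq raw runs available via SRA")
    elif has_counts_matrix:
        hints.append("RNA-seq counts matrix only")
    return hints
```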
Why this matters
biocli is already very useful as a biological data discovery tool. The next big step is to make it not just a query tool, but also a better routing tool for downstream analysis workflows.
For agent usage, the difference is important:
- query tool: "does data exist?"
- routing tool: "what exact kind of data exists, and what pipeline should handle it?"
Suggested rollout
If this is too much for one release, my priority order would be:
- update discovery / self-update
- ArrayExpress / BioStudies support
- GEO archive content summary from filelist.txt
- dedicated SRA project command
- better machine-readable error classes
- optional platform hints