Enhancement: improve agent-facing dataset reconnaissance and repository support

## Summary

After using `biocli` heavily for dataset reconnaissance and downloadability checks, I found it very effective for GEO/SRA scouting, but there are a few places where it could become much more useful for agent workflows.

This issue groups the highest-value improvements into one place.

## What worked well

- `geo download --list-only` is fast and very useful for checking whether a GSE has downloadable supplementary files.
- `sra search` is enough to confirm that an SRA project resolves to real runs.
- Structured output is agent-friendly and much faster than clicking through web pages.

## Highest-value improvements

### 1. Built-in update discovery / self-update

Right now, new versions are awkward to discover from inside the tool.

Concrete example:

- `biocli --version` showed `0.3.8`
- `npm view biocli version` failed because the actual package name is scoped
- I had to manually discover that the published package is `@yangfei_93sky/biocli`

Suggested improvements:

- `biocli doctor --check-update`
- `biocli self-update`
- `biocli version --latest`

At minimum, the tool should expose its own package name and latest published version.
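
As a rough illustration of what an update check could report, here is a minimal sketch. It assumes plain `MAJOR.MINOR.PATCH` version strings; the function names and JSON field names are hypothetical, not existing `biocli` output, and the `latest` value in the usage line is a placeholder rather than a real release.

```python
import json

# Scoped package name from the example above.
PACKAGE_NAME = "@yangfei_93sky/biocli"


def parse_version(version: str) -> tuple[int, ...]:
    """Turn '0.3.8' (or 'v0.3.8') into (0, 3, 8). Pre-release tags are not handled."""
    return tuple(int(part) for part in version.strip().lstrip("v").split("."))


def update_report(installed: str, latest: str) -> dict:
    """Build a machine-readable report; field names are hypothetical."""
    return {
        "package": PACKAGE_NAME,
        "installed": installed,
        "latest": latest,
        "update_available": parse_version(latest) > parse_version(installed),
    }


if __name__ == "__main__":
    # "0.4.0" is a placeholder for whatever the registry would report.
    print(json.dumps(update_report("0.3.8", "0.4.0"), indent=2))
```

Comparing integer tuples rather than raw strings avoids ranking `0.3.10` below `0.3.9`.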

### 2. Native ArrayExpress / BioStudies support

`biocli` is strong on GEO, but some transcriptomics datasets live in ArrayExpress / BioStudies and currently require leaving the CLI.

Concrete example:

- `E-MTAB-373` required manual BioStudies API inspection
- the study clearly exposes raw data files plus `idf` / `sdrf`, but this is not reachable through a first-class `biocli` command

Suggested commands:

- `biocli ae dataset E-MTAB-373`
- `biocli ae download E-MTAB-373 --list-only`
- or a more general `biocli biostudies study ...`

### 3. GEO dataset metadata should expose raw availability directly

Today I often need both:

- `biocli geo dataset ...`
- `biocli geo download ... --list-only`

to answer basic questions like:

- does this accession have raw files?
- does it only have processed matrices?
- what kinds of supplementary files exist?

Suggested improvement:

- add fields such as `has_raw_archive`, `has_supplementary_files`, `supplementary_file_count`, `supplementary_types`
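
A minimal sketch of how these fields could be derived from the supplementary file names that `--list-only` already returns. It assumes that a raw archive is any file whose name ends in `_RAW.tar`, as in the examples in this issue; the helper names are hypothetical and the second file name in the usage line is made up.

```python
COMPRESSION_SUFFIXES = {"gz", "bz2", "zip"}


def file_type(name: str) -> str:
    """Return the last extension that is not a compression suffix."""
    parts = [p for p in name.lower().split(".")[1:] if p not in COMPRESSION_SUFFIXES]
    return parts[-1] if parts else "unknown"


def summarize_supplementary(file_names: list[str]) -> dict:
    """Derive the proposed metadata fields from a list of file names."""
    return {
        # Assumption: raw archives follow the <accession>_RAW.tar naming pattern.
        "has_raw_archive": any(n.endswith("_RAW.tar") for n in file_names),
        "has_supplementary_files": bool(file_names),
        "supplementary_file_count": len(file_names),
        "supplementary_types": sorted({file_type(n) for n in file_names}),
    }


if __name__ == "__main__":
    # The second name is an invented example, not a real GEO file.
    print(summarize_supplementary(["GSE75479_RAW.tar", "example_matrix.txt.gz"]))
```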

### 4. Archive content summary from `filelist.txt`

This would be especially valuable for agents deciding which downstream pipeline to use.

Concrete examples from real use:

- `GSE75479_RAW.tar` exists, but the useful detail is that the archive contains `RCC.gz`
- `GSE22433_RAW.tar` exists, but the useful detail is that it includes `BGX`
- many Agilent entries expose raw `.txt.gz` files

Right now I have to fetch `filelist.txt` separately to infer this.

Suggested improvement:

When `filelist.txt` exists, surface an archive summary such as:

- `contains: CEL`
- `contains: RCC`
- `contains: BGX`
- `contains: TXT raw exports`
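
The summary above could be produced along these lines. This is a sketch that assumes `filelist.txt` is tab-separated, with the entry kind (`Archive` or `File`) in the first column and the file name in the second; that layout should be checked against real files. The function name is hypothetical and the sample rows are invented.

```python
# Maps an extension found in a member file name to a summary label.
LABELS = {
    "cel": "CEL",
    "rcc": "RCC",
    "bgx": "BGX",
    "txt": "TXT raw exports",  # crude: treats any .txt member as a raw export
}


def summarize_filelist(text: str) -> list[str]:
    """Return 'contains: ...' lines for the file types listed in filelist.txt."""
    found = set()
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header
        columns = line.split("\t")
        if len(columns) < 2 or columns[0] != "File":
            continue  # only archive members are of interest
        for part in columns[1].lower().split(".")[1:]:
            if part in LABELS:
                found.add(LABELS[part])
    return [f"contains: {label}" for label in sorted(found)]


if __name__ == "__main__":
    # Invented rows that follow the assumed layout.
    sample = "#Archive/File\tName\nArchive\tGSE75479_RAW.tar\nFile\tsample_01.RCC.gz\n"
    print(summarize_filelist(sample))
```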

### 5. Dedicated SRA project command

`sra search SRP...` works as a workaround, but a project-level command would be cleaner and easier for automation.

Suggested command:

- `biocli sra project SRP276412`

Useful fields:

- project accession
- run count
- SRR list
- sample titles
- platform
- layout
- read length if available
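
One possible shape for the project-level result, sketched as an aggregation over per-run records. The record keys (`run`, `sample_title`, `platform`, `layout`, `read_length`) and the class and function names are assumptions for illustration, not the current `sra search` output format.

```python
from dataclasses import dataclass


@dataclass
class SraProjectSummary:
    project_accession: str
    run_count: int
    runs: list
    sample_titles: list
    platforms: list
    layouts: list
    read_lengths: list  # empty when no run reports a read length


def unique(values) -> list:
    """Drop missing values and duplicates while keeping first-seen order."""
    return list(dict.fromkeys(v for v in values if v is not None))


def summarize_project(accession: str, run_records: list) -> SraProjectSummary:
    """Collapse per-run records (hypothetical keys) into one project summary."""
    return SraProjectSummary(
        project_accession=accession,
        run_count=len(run_records),
        runs=[r["run"] for r in run_records],
        sample_titles=unique(r.get("sample_title") for r in run_records),
        platforms=unique(r.get("platform") for r in run_records),
        layouts=unique(r.get("layout") for r in run_records),
        read_lengths=unique(r.get("read_length") for r in run_records),
    )
```

Returning lists for platform and layout keeps mixed projects representable instead of forcing a single value.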

### 6. Clearer error classes

From an agent perspective, these failure modes are very different and should be distinguishable:

- accession exists but has no supplementary files
- network failure
- upstream API failure
- unsupported repository
- malformed accession

Suggested improvement:

- return explicit machine-readable error categories instead of collapsing multiple cases into generic fetch failures
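
A sketch of what machine-readable categories might look like, with one category per failure mode listed above. The category strings, the payload layout, and the `retryable` flag are all hypothetical suggestions.

```python
import json
from enum import Enum


class ErrorCategory(str, Enum):
    NO_SUPPLEMENTARY_FILES = "no_supplementary_files"
    NETWORK_FAILURE = "network_failure"
    UPSTREAM_API_FAILURE = "upstream_api_failure"
    UNSUPPORTED_REPOSITORY = "unsupported_repository"
    MALFORMED_ACCESSION = "malformed_accession"


# Assumption: only transient failures are worth retrying automatically.
RETRYABLE = {ErrorCategory.NETWORK_FAILURE, ErrorCategory.UPSTREAM_API_FAILURE}


def error_payload(category: ErrorCategory, accession: str, message: str) -> dict:
    """Build a structured error that an agent can branch on."""
    return {
        "ok": False,
        "error": {
            "category": category.value,
            "retryable": category in RETRYABLE,
            "accession": accession,
            "message": message,
        },
    }


if __name__ == "__main__":
    payload = error_payload(
        ErrorCategory.NO_SUPPLEMENTARY_FILES, "GSE75479", "illustrative message"
    )
    print(json.dumps(payload, indent=2))
```

With a payload like this, an agent can retry on `network_failure` but move on immediately when the category is `no_supplementary_files`.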

### 7. Optional platform / pipeline hints

This should stay lightweight, but even a small hint would help downstream orchestration a lot.

Examples:

- `Affymetrix CEL detected`
- `Agilent raw TXT detected`
- `Illumina BGX detected`
- `NanoString RCC detected`
- `RNA-seq counts matrix only`
- `RNA-seq raw runs available via SRA`

That would let agents choose the right preprocessing route much faster.
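
To show how lightweight this can be, here is a sketch that maps already-detected file kinds to the hint strings above. The input keys (`cel`, `agilent_txt`, `bgx`, `rcc`, `counts_matrix`) and the function name are hypothetical; how those kinds get detected is out of scope here.

```python
# Hint strings are the ones proposed above; the lookup keys are hypothetical.
ARRAY_HINTS = {
    "cel": "Affymetrix CEL detected",
    "agilent_txt": "Agilent raw TXT detected",
    "bgx": "Illumina BGX detected",
    "rcc": "NanoString RCC detected",
}


def pipeline_hints(file_kinds: set, has_sra_runs: bool = False) -> list:
    """Translate detected file kinds into short routing hints."""
    hints = [ARRAY_HINTS[kind] for kind in sorted(file_kinds) if kind in ARRAY_HINTS]
    if has_sra_runs:
        hints.append("RNA-seq raw runs available via SRA")
    elif "counts_matrix" in file_kinds:
        # Only a processed matrix is available, no raw reads.
        hints.append("RNA-seq counts matrix only")
    return hints
```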

## Why this matters

`biocli` is already very useful as a biological data discovery tool. The next big step is to make it not just a query tool, but also a better routing tool for downstream analysis workflows.

For agent usage, the difference is important:

- query tool: "does data exist?"
- routing tool: "what exact kind of data exists, and what pipeline should handle it?"

## Suggested rollout

If this is too much for one release, my priority order would be:

1. update discovery / self-update
2. ArrayExpress / BioStudies support
3. GEO archive content summary from `filelist.txt`
4. dedicated SRA project command
5. better machine-readable error classes
6. optional platform hints


