104 commits
4b9f83b
Add option to show unique samples dimensions CLI
brinkdp Feb 4, 2026
a62d232
Add CLI to create metadata TSV from dimensions
brinkdp Feb 4, 2026
fc8a8b4
Update user guides with metadata template info
brinkdp Feb 4, 2026
db50f85
Enable range filtering for numerical TSV columns
brinkdp Feb 4, 2026
b36d09a
Enable inequality filtering on numberical columns
brinkdp Feb 4, 2026
be16423
Enable OR filtering for numerical metadata
brinkdp Feb 5, 2026
d9b3341
Merge remote-tracking branch 'origin/main' into sample-metadata-funct…
brinkdp Feb 5, 2026
4c5f15d
Propagate metadata query warnings to terminal
brinkdp Feb 6, 2026
678977e
Improve formatting of metadata query warnings
brinkdp Feb 6, 2026
f1e798f
Support semicolon-separated values in TSV columns
brinkdp Feb 6, 2026
c5b5a1c
Add text on TSV format requirements to user guide
brinkdp Feb 6, 2026
227d4c4
Add text on TSV query syntax
brinkdp Feb 6, 2026
cef5a38
Add unit tests for numerical filtering
brinkdp Feb 6, 2026
ffbbf56
Handle semicolon-separated numeric values
brinkdp Feb 6, 2026
b64ebdb
Add test to assert that mixed-type error is raised
brinkdp Feb 9, 2026
963acdd
Ensure that mixed-type error is propagated to user
brinkdp Feb 9, 2026
e58486c
Add tests for string value filters
brinkdp Feb 9, 2026
70ea81d
Add fixture column that only has single values
brinkdp Feb 9, 2026
2719611
Update fixture with SingleString and adapt tests
brinkdp Feb 9, 2026
3730a5c
Add edge-case fixture and tests
brinkdp Feb 9, 2026
628dea6
Drop support of commaa in TSV values
brinkdp Feb 9, 2026
88fdc26
Refactor run_query() into helper methods
brinkdp Feb 10, 2026
15e3205
Allow hyphens in str but not numeric TSV values
brinkdp Feb 10, 2026
746a3b4
Support NOT filters with !
brinkdp Feb 10, 2026
aaffc1f
Strip leading/trailing spaces when loading TSV
brinkdp Feb 10, 2026
a0ab837
Update metadata user guide after refactoring
brinkdp Feb 10, 2026
7bca287
Update fixture to use other floats than just X.0
brinkdp Feb 10, 2026
f3fddaf
Merge remote-tracking branch 'origin/use-typer-recommended-cli-config…
brinkdp Feb 10, 2026
94c223b
Update test assertions for sample metadata queries
brinkdp Feb 10, 2026
bc39329
Update metadata template CLI to use pathlib
brinkdp Feb 10, 2026
cc74df6
Allow custom name for sample metadata templates
brinkdp Feb 10, 2026
8d648ff
Add draft sample metadata validator command
brinkdp Feb 10, 2026
1cd9d66
Add fixture TSV with incorrect formatting
brinkdp Feb 11, 2026
a9bfa9f
Collect all mixed-type cols for a single error
brinkdp Feb 11, 2026
9d9848e
Add unit tests for the TSV validator
brinkdp Feb 11, 2026
6090e97
Add more tests to validator to cover more cases
brinkdp Feb 11, 2026
6b2b6cd
Add section on TSV validator to quick start guide
brinkdp Feb 11, 2026
d6795c3
Add section on validator to sidecar metadata docs
brinkdp Feb 11, 2026
82b8d21
Use classmethod in MetadataTSVValidator
brinkdp Feb 11, 2026
3e26c25
Add mkdocs autogen docs for dimensions CLI command
brinkdp Feb 11, 2026
1ba014b
Add text to migration dev docs on branch switching
brinkdp Feb 11, 2026
9c3c4da
Add Sample_ID exceptions in SidecarQueryManager
brinkdp Feb 11, 2026
7aeda1b
Add unit tests for the Sample_ID exceptions
brinkdp Feb 11, 2026
8b2a1e9
Update error/warning handling in validator
brinkdp Feb 11, 2026
8e9d29b
Support negative numbers
brinkdp Feb 11, 2026
6d45005
Update unit tests with negative number cases
brinkdp Feb 11, 2026
058aedc
Add intro to sidecar metadata user guide
brinkdp Feb 12, 2026
ea2bcc3
Move dimensions user guide up in the query tree
brinkdp Feb 12, 2026
9c7baec
Add CRUD, pydantic model, route for unique samples
brinkdp Feb 12, 2026
ae8928c
Update template and validator command for new CRUD
brinkdp Feb 12, 2026
cd09ec7
Use the samples endpoint in dimensions show CLI
brinkdp Feb 12, 2026
74c3aef
Drop duplicate sorting step, already done in CRUD
brinkdp Feb 12, 2026
d366456
Add separate endpoint also for --unique-scaffolds
brinkdp Feb 12, 2026
b3b9f5c
Clarify difference between API and worker crud
brinkdp Feb 12, 2026
e80f5b8
Fix mistake where property call was dropped
brinkdp Feb 12, 2026
f9f1269
Add e2e tests for updated dimensions CLI commands
brinkdp Feb 12, 2026
66ea161
Refactor query logic to relax constraints
brinkdp Feb 12, 2026
b86d73f
Update e2e and validator unit tests after refactor
brinkdp Feb 12, 2026
acfffd8
Update unit tests for SidecarQueryManager
brinkdp Feb 12, 2026
f82a6cc
Handle the case of filters like Area:>North
brinkdp Feb 12, 2026
b827b81
Print None instead in CLI of [] when no result
brinkdp Feb 12, 2026
57b82d8
Update metadata user guide after refactoring
brinkdp Feb 12, 2026
7c588c6
Ensure all errors from Celery task are propagaged
brinkdp Feb 13, 2026
667c56d
Add e2e test to assert latest errors in terminal
brinkdp Feb 13, 2026
a3c31e3
Add e2e tests for other metadata task exceptions
brinkdp Feb 13, 2026
cea2778
Harmonize test docstrings
brinkdp Feb 13, 2026
75d03ab
Revert dev doc comment on migrations
brinkdp Feb 13, 2026
f20fc43
Fix test assertions sensitive to linebreaks
brinkdp Feb 13, 2026
574d459
Refactor dimensions-bucket check to be more strict
brinkdp Feb 13, 2026
eb86f5e
Ensure DivBase results VCF not in updated check
brinkdp Feb 13, 2026
4775237
Have helper also consider VCF file version
brinkdp Feb 13, 2026
9304664
Update tests after helper refactoring
brinkdp Feb 13, 2026
77bfd33
Add test for VCF in bucket not index in dimensions
brinkdp Feb 13, 2026
6c2118d
Allow commas, but send warnings to user
brinkdp Feb 16, 2026
27769f0
Update user guide with info on special characters
brinkdp Feb 16, 2026
d76fc31
Add text to some TODOs in the metadata user guide
brinkdp Feb 16, 2026
98f9015
Polish the metadata user guide
brinkdp Feb 16, 2026
8c14483
Apply suggestions from code review
brinkdp Feb 17, 2026
b48d25f
Fix additional typos found by copilot review
brinkdp Feb 17, 2026
6fdd4a5
Fix verb conjugation in variable name
brinkdp Feb 17, 2026
562da4d
Prune validator clafication message
brinkdp Feb 17, 2026
bd78a05
Update test after updating validator warning msg
brinkdp Feb 17, 2026
2be7adf
Update metadata user guide with results example
brinkdp Feb 17, 2026
0104e57
Ensure same handling of # in validator and queries
brinkdp Feb 17, 2026
e21e90f
Refactor TSV validator to a shared logic
brinkdp Feb 17, 2026
ce5e27f
Update query overview with sample metadata text
brinkdp Feb 18, 2026
d35419e
Add script to print df from TSV using shared logic
brinkdp Feb 18, 2026
3097371
Add checks and warnings for array notation in TSV
brinkdp Feb 18, 2026
b97dc21
Add text on bracket array warning to user guide
brinkdp Feb 18, 2026
8452994
Move shared unit test TSV fixtures to conftest.py
brinkdp Feb 18, 2026
7b95999
Add unit test for TSV->df->TSV validation
brinkdp Feb 18, 2026
17268d8
Update mock VCF and metadata scripts
brinkdp Feb 19, 2026
5693686
Merge branch 'pr70' into sample-metadata-functionalities
brinkdp Feb 23, 2026
5816ca5
Update test to not rely on \t in stout
brinkdp Feb 23, 2026
35f73d1
Merge remote-tracking branch 'origin/main' into sample-metadata-funct…
brinkdp Feb 24, 2026
59cfe7b
WIP: refactor shared metadata validator for lists
brinkdp Feb 24, 2026
847c7f9
WIP:make validation results more robust for worker
brinkdp Feb 25, 2026
4468f5e
WIP: refactor query engine for updated validator
brinkdp Feb 26, 2026
f2e19d8
Update tests to match refactoring
brinkdp Feb 26, 2026
cc40504
Update user guide for sidecar metadata
brinkdp Feb 26, 2026
df0caed
Update mkdocs CLI autogen docs on dimensions
brinkdp Feb 26, 2026
801676c
Limit number of samples validator prints to user
brinkdp Feb 26, 2026
467b38c
Make MetadataTSVValidator instance-based instead
brinkdp Feb 26, 2026
bf9ff81
Rename class to ClientSideMetadataTSVValidator
brinkdp Feb 26, 2026
7 changes: 3 additions & 4 deletions .gitignore
@@ -13,14 +13,15 @@ __pycache__/

# project specific files
/sample_metadata.tsv
sample_metadata_*.tsv
/sample_metadata_*.tsv
*.vcf
*.vcf.gz
*.vcf.gz.csi
*.vcf.gz.tbi
!tests/fixtures/*.vcf.gz
tests/fixtures/temp*
tests/fixtures/merged*
divbase_metadata_template*.tsv

# query job config files
bcftools_divbase_job_config.json
@@ -37,6 +38,4 @@ scripts/benchmarking/results
.DS_Store

# mkdocs build cache
.cache/
# pypi
dist/
.cache/
57 changes: 57 additions & 0 deletions docs/cli/_auto_generated/dimensions.md
@@ -18,6 +18,8 @@ $ divbase-cli dimensions [OPTIONS] COMMAND [ARGS]...

* `update`: Calculate and add the dimensions of a VCF...
* `show`: Show the dimensions index file for a project.
* `create-metadata-template`: Use the samples index in a projects...
* `validate-metadata-file`: Validate a sidecar metadata TSV file...

## `divbase-cli dimensions update`

@@ -49,5 +51,60 @@ $ divbase-cli dimensions show [OPTIONS]

* `--filename TEXT`: If set, will show only the entry for this VCF filename.
* `--unique-scaffolds`: If set, will show all unique scaffold names found across all the VCF files in the project.
* `--unique-samples`: If set, will show all unique sample names found across all the VCF files in the project.
* `--sample-names-limit INTEGER RANGE`: Maximum number of sample names to display per list in terminal output. [default: 20; x>=1]
* `--sample-names-output TEXT`: Write full sample names to file instead of truncating in terminal output. Mutually exclusive with --sample-names-stdout.
* `--sample-names-stdout`: Print full sample names to stdout (useful for piping). Mutually exclusive with --sample-names-output.
* `--project TEXT`: Name of the DivBase project, if not provided uses the default in your DivBase config file
* `--help`: Show this message and exit.

## `divbase-cli dimensions create-metadata-template`

Use the samples index in a project's dimensions cache to create a TSV metadata template file
that has the sample names pre-filled as the first column.

**Usage**:

```console
$ divbase-cli dimensions create-metadata-template [OPTIONS]
```

**Options**:

* `-o, --output TEXT`: Name of the output TSV file to create. Defaults to sample_metadata_<project_name>.tsv. If a file with the same name already exists in the current directory, you will be prompted to confirm if you want to overwrite it.
* `--project TEXT`: Name of the DivBase project, if not provided uses the default in your DivBase config file
* `--help`: Show this message and exit.

## `divbase-cli dimensions validate-metadata-file`

Validate a sidecar metadata TSV file against DivBase formatting requirements and project dimensions.

Validation is run client-side to keep sensitive metadata local during validation.

Validation checks:
- File is properly tab-delimited
- First column is named '#Sample_ID'
- No commas in cells
- Sample_ID has only one value per row (no semicolons)
- No duplicate sample IDs
- Invalid characters
- Basic type consistency in user-defined columns. This does not use Pandas type inference, since we want to avoid requiring the user to install Pandas just for validation; it only checks that numeric columns contain only numeric values (excluding the header).
- All samples in the TSV exist in the project's dimensions index

Returns errors for critical issues and warnings for non-critical issues.
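
A few of the checks listed above can be illustrated with a standard-library-only sketch. This is not the DivBase validator: the function name `check_metadata_tsv` and the message texts are hypothetical, only a subset of the checks is covered, and treating commas as warnings rather than errors is an assumption based on the commit titled "Allow commas, but send warnings to user".

```python
import csv
import io


def check_metadata_tsv(text):
    """Return (errors, warnings) for a sidecar metadata TSV given as a string.

    Hypothetical helper for illustration; it does not reproduce DivBase's
    actual validation logic or messages.
    """
    errors, warnings = [], []
    rows = list(csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE))
    if not rows:
        return ["File is empty"], warnings

    header = rows[0]
    if not header or header[0] != "#Sample_ID":
        errors.append("First column must be named '#Sample_ID'")

    seen = set()
    for line_no, row in enumerate(rows[1:], start=2):
        # Every data row needs the same number of tab-separated cells as the header.
        if len(row) != len(header):
            errors.append(f"Line {line_no}: expected {len(header)} cells, found {len(row)}")
            continue
        sample_id = row[0].strip()
        # Sample_ID must hold exactly one value, so semicolons are rejected here.
        if ";" in sample_id:
            errors.append(f"Line {line_no}: Sample_ID must contain a single value")
        if sample_id in seen:
            errors.append(f"Line {line_no}: duplicate sample ID '{sample_id}'")
        seen.add(sample_id)
        # Assumption: commas are reported as a warning, not an error.
        if any("," in cell for cell in row):
            warnings.append(f"Line {line_no}: comma found in a cell")
    return errors, warnings
```

Keeping errors and warnings in separate lists mirrors the distinction between critical and non-critical issues described above.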

**Usage**:

```console
$ divbase-cli dimensions validate-metadata-file [OPTIONS] INPUT_FILENAME
```

**Arguments**:

* `INPUT_FILENAME`: Name of the input TSV file to validate. [required]

**Options**:

* `--project TEXT`: Name of the DivBase project, if not provided uses the default in your DivBase config file
* `--help`: Show this message and exit.
10 changes: 10 additions & 0 deletions docs/user-guides/query-syntax.md
@@ -1,3 +1,13 @@
# DivBase Query Syntax for VCF data

TODO

## Combined sample metadata and VCF queries

TODO - there is a link to here from the sample metadata guide, so the combined queries should be described in detail here

It is also possible to run a sidecar sample metadata query as part of a VCF query by passing the query as a string to the `--tsv-filter` flag:

```bash
divbase-cli query bcftools-pipe --tsv-filter "Area:North" --command "view -s SAMPLES"
```
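
To make the effect of a filter such as `"Area:North"` concrete, here is a minimal sketch of how a `Column:value` expression could narrow a set of samples. It only models the plain equality form shown in the command above; the function names `parse_filter` and `matching_samples` are hypothetical, and the sketch makes no claim about how DivBase parses or evaluates filters internally.

```python
def parse_filter(expression):
    """Split a 'Column:value' filter string into (column, value).

    Hypothetical helper; only the simple equality form is handled.
    """
    column, sep, value = expression.partition(":")
    if not sep or not column.strip() or not value.strip():
        raise ValueError(f"Expected 'Column:value', got {expression!r}")
    return column.strip(), value.strip()


def matching_samples(rows, expression):
    """Return the Sample_ID of every row whose column equals the filter value."""
    column, value = parse_filter(expression)
    return [row["Sample_ID"] for row in rows if row.get(column) == value]


# Rows as they might look after loading a sidecar metadata TSV.
rows = [
    {"Sample_ID": "129P2", "Population": "1", "Area": "North"},
    {"Sample_ID": "129S1", "Population": "2", "Area": "East"},
    {"Sample_ID": "129S5", "Population": "3", "Area": "South"},
]
print(matching_samples(rows, "Area:North"))
```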
54 changes: 36 additions & 18 deletions docs/user-guides/quick-start.md
@@ -79,23 +79,7 @@ Check your uploaded files:
divbase-cli files ls
```

## Step 7: Upload sample metadata

TODO It might make more sense to have run the dimensions update job before this if we are to use a pre-populated template file

Sample metadata must be uploaded as follows:

- In TSV format and be named "sample_metadata.tsv"
- Must contain a column named "sample_id" which matches the sample IDs in your VCF files
- The names and values of all other columns are optional.

TODO update this after `sidecar-metadata.md` docs are done, there are changes planned for some details.

```bash
divbase-cli files upload path/to/your/sample_metadata.tsv
```

## Step 8: Dimensions update
## Step 7: Dimensions update

For DivBase to be able to efficiently handle the VCF files in the project, some key information about each VCF file is fetched from the files. In DivBase, this is referred to as "VCF dimensions". These include, for instance, which samples and scaffolds a VCF file contains.

@@ -112,7 +96,7 @@ This submits a task to the DivBase task management system. The task will wait in

2. Please also note that the `divbase-cli dimensions update` command needs to be done every time a new VCF or a new version of a VCF file is uploaded.

## Step 9: Confirm dimensions update job completion
## Step 8: Confirm dimensions update job completion

Check the task history to confirm the dimensions update job has completed:

@@ -128,6 +112,40 @@ It is possible to inspect the cached VCF dimensions data for the project at any
divbase-cli dimensions show
```

## Step 9: Upload sample metadata

DivBase can check out data based on the VCF files themselves, but can also take an optional sidecar sample metadata file into account. The metadata file must be a TSV (tab-separated values) file. The metadata content of the file is defined by the user. If the VCF dimensions command has been run for the project, the cached dimensions data can be used to create a template in which the samples of the project are pre-filled:

```bash
divbase-cli dimensions create-metadata-template
```

Details on how to write this file are given in [Sidecar Metadata TSV files: creating and querying sample metadata files](sidecar-metadata.md). In short, the first row starts with `#` and contains the headers for the different metadata columns. The first column (`Sample_ID`) is mandatory and can be created by the system as just described; if you create it manually, make sure that each sample name is spelled exactly as in the VCF files. The rest of the columns are free for the user to define.

Example of a sidecar metadata TSV file with the mandatory `Sample_ID` column and two user-defined columns:

```
#Sample_ID Population Area
129P2 1 North
129S1 2 East
129S5 3 South
```
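
Since sample names must be spelled exactly as in the VCF files, a quick local comparison can catch mismatches before upload. The sketch below loads the example above with Python's `csv` module and lists the sample IDs that are absent from a given collection of VCF sample names. The helper names `load_sidecar_rows` and `samples_missing_from_vcfs` are hypothetical and not part of `divbase-cli`; the list of VCF sample names is assumed to come from elsewhere.

```python
import csv
import io


def load_sidecar_rows(text):
    """Parse a sidecar metadata TSV string into a list of dicts keyed by header."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    # The header row starts with '#', so '#Sample_ID' becomes 'Sample_ID'.
    header[0] = header[0].lstrip("#")
    return [dict(zip(header, row)) for row in reader if row]


def samples_missing_from_vcfs(rows, vcf_sample_names):
    """Return sample IDs from the TSV that do not appear among the VCF sample names."""
    known = set(vcf_sample_names)
    return [row["Sample_ID"] for row in rows if row["Sample_ID"] not in known]


example = (
    "#Sample_ID\tPopulation\tArea\n"
    "129P2\t1\tNorth\n"
    "129S1\t2\tEast\n"
    "129S5\t3\tSouth\n"
)
rows = load_sidecar_rows(example)
print(samples_missing_from_vcfs(rows, ["129P2", "129S1"]))
```

The comparison is an exact string match, so differences in case or stray characters are reported as missing samples.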

!!! note
    Please use a text editor that preserves the tabs when the file is saved. Incorrect tabs can lead to issues with running metadata queries in DivBase.

There is a command to help check that the sidecar metadata TSV is correctly formatted for use with DivBase. Running it is optional:

```bash
divbase-cli dimensions validate-metadata-file path/to/your/sample_metadata.tsv
```

When you are happy with the sample metadata file, it should be uploaded to the DivBase project with the following:

```bash
divbase-cli files upload path/to/your/sample_metadata.tsv
```

## Step 10: Run your queries

There are three types of queries in DivBase:
48 changes: 34 additions & 14 deletions docs/user-guides/running-queries.md
@@ -14,40 +14,60 @@ TODO

The system will use the latest version of the files for all queries.

TODO COPIED OVER FROM QUICKSTART REIMPLEMENT
## Before running any queries: run the VCF dimensions command

DivBase only allows `bcftools view` in its query syntax and no other `bcftools` commands. The `merge`, `concat`, and `annotate` commands are used when processing a query, but should not be defined by the user.
TODO - finish writing this section

## Side car sample metadata queries
For more details see [VCF Dimensions caching](vcf-dimensions.md)

For more details, see [Sidecar Metadata TSV files: creating and querying sample metadata files](sidecar-metadata.md).
For performance reasons and to ensure query feasibility, key metadata from the VCF files must first be cached in DivBase.
For each VCF in the DivBase project, the system extracts the file name, the number and names of all samples, the number and names of all scaffolds, and the number of variants.

how to write the sample metadata TSV (+template)
DivBase will use this whenever a user submits a query to the project. For instance, the user might apply a sample metadata filter that selects only certain samples. The system knows which files each requested sample is located in, and will ensure that only those files are transferred to the worker.
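
The file selection described above amounts to a lookup from samples to the files that contain them. A minimal sketch, assuming the cached dimensions can be viewed as a mapping from VCF file name to its sample names; the function name `files_for_samples` and the file names are made up for illustration and do not reflect DivBase's internal data model:

```python
def files_for_samples(samples_by_file, requested_samples):
    """Return the sorted names of files that contain at least one requested sample."""
    wanted = set(requested_samples)
    return sorted(
        name for name, samples in samples_by_file.items() if wanted & set(samples)
    )


# Hypothetical cache contents: file name -> sample names found in that file.
samples_by_file = {
    "a.vcf.gz": ["129P2", "129S1"],
    "b.vcf.gz": ["129S5"],
    "c.vcf.gz": ["129S1", "129S5"],
}
print(files_for_samples(samples_by_file, ["129S5"]))
```

Files that contain none of the requested samples are left out, which is what keeps unnecessary transfers to the worker from happening.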

example
dimensions show

how to use dimensions show to get all samples in the project

how to write query
## Sidecar sample metadata queries

## Before any VCF queries: run the VCF dimensions command
DivBase lets users store extensive sample metadata in a separate TSV file. This metadata can be queried on its own or in combination with VCF data. Users are free to define their own metadata as they see fit: column names represent metadata categories and rows represent the samples found in the VCF files in the DivBase project. The TSVs need to follow a few mandatory requirements, but no strict metadata schema is enforced. This allows DivBase to accommodate a variety of research projects with different metadata needs.

For more details see [VCF Dimensions caching](vcf-dimensions.md)
The metadata can be queried on its own to learn which samples fulfil a certain metadata query and which VCF files those samples are present in. The same query syntax is used in the combined sample metadata and VCF data queries, so users can use the dedicated sample metadata query command as a dry run before running a full combined query, to ensure that the metadata query produces the intended results.

For performance reasons and to ensure query feasibility, key metadata from the VCF files must first be cached in DivBase.
For each VCF in the DivBase project, the system extracts file name, number and name of all samples, number and name of all scaffolds, number of variants,
For instructions on how to create the sidecar sample metadata TSV files and how to run sample metadata queries, see the guide on [Sidecar Metadata TSV files: creating and querying sample metadata files](sidecar-metadata.md). The guide also describes the CLI commands that specifically relate to the sample metadata TSV files. These are:

DivBase will use this whenever a user submits a query to the project. For instance, the user might make a sample metadata filtering that results in only certain samples. The system knows which file names each requested sample are located in, and will ensure that only those files will be transferred to the worker.
```bash
divbase-cli dimensions create-metadata-template

example
dimensions show
divbase-cli dimensions validate-metadata-file path/to/your/sample_metadata.tsv

divbase-cli files upload path/to/your/sample_metadata.tsv

divbase-cli query tsv "Area:Northern Portugal"
```

## VCF queries

TODO - finish writing this section

Run the query on all VCF files in the DivBase project unless specified. There are two ways to specify the files: either as direct input to the `bcftools` command, or by combining the VCF query with a sample metadata query to determine which VCF files to use.

See also [DivBase Query Syntax for VCF data](query-syntax.md), [How to create efficient DivBase queries](how-to-create-efficient-divbase-queries.md), and [Tutorial: Running a query on a public dataset](tutorial-query-on-public-data.md).

can be run with or without sample metadata filtering
can be run with or without sample metadata filtering.

for sample metadata linked VCF queries, it can be good to do a dry run first [TO BE IMPLEMENTED, at the moment it needs to be run separately]

how to write a query

how to optimize a query (see separate markdown)

TODO COPIED OVER FROM QUICKSTART REIMPLEMENT

DivBase only allows `bcftools view` in its query syntax and no other `bcftools` commands. The `merge`, `concat`, and `annotate` commands are used when processing a query, but should not be defined by the user.

## Combined sample metadata and VCF data query

Uses a sample metadata query to identify the VCF files in the DivBase project to run the VCF queries on.