
SQ-837: Refine the sample metadata TSV handling and write user guides#67

Open
brinkdp wants to merge 104 commits into main from sample-metadata-functionalities

Conversation

@brinkdp
Contributor

@brinkdp brinkdp commented Feb 16, 2026

This PR is a major refactoring of the sidecar sample metadata TSV query logic. The focus was on adding functionality such as numerical operations and on improving UX.

Things covered in this PR:

  • supported numerical operators (inequalities, ranges) and negative values in query filters (see the example commands after this list)
  • supported ! for NOT filtering
  • supported multi-value cells in the TSV, delimited by semicolons
  • added a CLI command to generate a sidecar TSV template from the project's dimensions index
  • added a CLI command for a client-side sidecar TSV validator (decided against server-side validation since it would need to send potentially big request payloads). The validation logic lives in the libs package so that the same code can be used in the client-side validator and in the server-side query processing.
  • specified requirements for sample metadata files in DivBase in the user guide and implemented them as warnings and errors in the validator and the query manager. The most important one: if a column is not strictly numeric, treat it as string and warn the user when they try to use numerical operators on that column.
  • ensured errors (and warnings) propagate back to the user (the auth exception is triggered first, so I thought it should be fine to have details in the subsequent exceptions)
  • refactored the error raised when VCF dimensions are not up to date and now raise it for metadata queries too (previously it was raised only for VCF queries)
  • added separate CRUDs and endpoints for getting only the unique samples (and unique scaffolds; not really used in this PR but very similar logic) from the dimensions table. divbase-cli dimensions show --unique-samples now gives the number and names of samples (and scaffolds for --unique-scaffolds)
  • added unit and e2e tests for many, many use cases
  • added a script that loads a TSV and prints the resulting df to the terminal for inspection. It calls the same function that DivBase uses to load TSVs on the client and server side.
  • wrote user guide pages. Related pages on dimensions and VCF queries are a TODO that will be covered in their own branch(es).
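
To give a feel for the query filter syntax (these invocations are illustrative; the user guide linked below has the exact grammar):

# Range filter on a numerical column:
divbase-cli query tsv "Population:2-3" --metadata-tsv-name tutorial_mock_metadata_mgpv3snps.tsv

# Combining an inequality, a range, and discrete values (OR semantics within the column):
divbase-cli query tsv "Weight:>50,30-40,25,45" --metadata-tsv-name tutorial_mock_metadata_mgpv3snps.tsv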

Sidecar metadata TSV format and filtering syntax

To not repeat what I wrote in the user guide, I would suggest spinning up mkdocs and reading: http://localhost:8008/divbase/user-guides/sidecar-metadata/

Also updated the quick-start guide, but there is really no room there to discuss the metadata format and query syntax.

Create pre-filled template

The first column in the TSV, Sample_ID, is the only column that we strictly enforce, since it needs to match exactly the sample IDs in the VCFs in the project. To help users get this column right, I added a CLI command that reads the project's dimensions index and generates a TSV with this first column populated with the sample names.

For instance, if local-project-1 has files as per local_dev_setup.py and divbase-cli dimensions update --project local-project-1 has been run:

divbase-cli dimensions create-metadata-template 

will save a single-column TSV with all unique sample names from the dimensions file to sample_metadata_<project_name>.tsv. Output filename can be controlled with --output.
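
For example, the generated template might look like this (sample names here are hypothetical; the real ones come from the project's dimensions index):

Sample_ID
sample_A
sample_B
sample_C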

TSV file validation

The TSV validator runs client-side and is intended to be run on local TSV files before they are uploaded to DivBase. It validates the content of the TSV; checks on the query filters are done elsewhere. The validator does make one request to the server, to fetch the list of samples from the project's dimensions index so it can validate sample names. I therefore placed the CLI command under divbase-cli dimensions, just like the TSV template creator.

We talked about having Pandas in libs to be able to ensure that the same code can be used client-side and server-side for doing the same checks. To my surprise, Pandas was already in the libs package... I could have sworn that it was in the api package when I looked last week! Well, anyway, that made it an easy choice to refactor all shared code into a class in the libs package.

For local-project-1 after its dimensions index has been updated:

# Should give a warning about mixed-type column, but should otherwise be fine:
divbase-cli dimensions validate-metadata-file tests/fixtures/sample_metadata.tsv --project local-project-1

# Intentionally, no sample names match dimensions here, should return a lot of warnings and errors:
divbase-cli dimensions validate-metadata-file tests/fixtures/sample_metadata_incorrect_formatting_to_test_tsv_validator.tsv --project local-project-1

Metadata queries

In addition to the shared TSV content validator, which is also run by SidecarQueryManager when a metadata query begins processing, there are checks for the filter processing, for instance whether numerical filters can be applied or not. This part of the code does not raise errors, only warnings. The user will typically get metadata query results, but with warnings.

Users can use this to add their own user-defined columns for these
samples. For example:
divbase-cli query tsv "Population:2-3" --metadata-tsv-name \
tutorial_mock_metadata_mgpv3snps.tsv

Pandas read_csv() infers whether columns are numerical or string based
on the data, so we already had the correct type in the df.
See:
https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
Might need some more testing and docs, but the basics are in place,
i.e. inequality OR range OR discrete value,
for example "Weight": ">50,30-40,25,45"
Up until now, warnings were only displayed in the worker logs.
Use textwrap to ensure that bulleted warnings keep their bullet
indentation when the warnings string is too long to fit within the
terminal width.
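
Roughly what that looks like (a sketch, assuming bullets like the ones used in this PR; the real helper may differ):

import shutil
import textwrap

def format_warnings(warnings: list[str]) -> str:
    """Wrap each warning so continuation lines line up under the bullet text."""
    width = shutil.get_terminal_size().columns
    return "\n".join(
        textwrap.fill(w, width=width, initial_indent="  • ", subsequent_indent="    ")
        for w in warnings
    )

print(format_warnings(["This is a long warning about a mixed-type column that will wrap."]))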
Use lambda functions and pandas .apply() to split values on semicolons
and apply filters accordingly.

This allows users to define sample metadata TSVs where cells can contain
multiple values separated by semicolons, and filter strings
can match any of those values.

Sample_ID  Group
S1   1
S2   1;2
S3   2
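
With the table above, a minimal sketch of that .apply() pattern (illustrative, not the PR's exact code) could look like:

import pandas as pd

df = pd.DataFrame({"Sample_ID": ["S1", "S2", "S3"], "Group": ["1", "1;2", "2"]})

# A row matches if ANY of its semicolon-separated values equals the filter value.
target = "2"
mask = df["Group"].apply(lambda cell: target in str(cell).split(";"))
print(df[mask])  # S2 and S3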
There might be an issue with semicolon-separated numerical values...
Pandas will not infer a column as numeric if it contains
semicolon-separated values, even if those values are numeric. This means
that the inequalities and range logic will not work for such values.

This adds a helper method that checks whether a column contains
semicolon-separated numeric values and raises an error if the values
are of mixed types (e.g. 1;two;3). The lambda functions in the
run_query() logic will convert each semicolon-separated numerical value
to float or int before applying the inequality or range logic.
For semicolon-separated columns: return 400 and print an error message
with details about the mixed numeric and string types in the column.
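
A self-contained sketch of that kind of check (the real helper in the PR may be structured differently):

import pandas as pd

def _is_numeric(token: str) -> bool:
    try:
        float(token)
        return True
    except ValueError:
        return False

def check_semicolon_column(series: pd.Series) -> None:
    """Raise if any cell mixes numeric and non-numeric semicolon-separated values."""
    for cell in series.dropna():
        kinds = {_is_numeric(v) for v in str(cell).split(";")}
        if len(kinds) > 1:
            raise ValueError(f"Mixed numeric and string values in cell: {cell!r}")

check_semicolon_column(pd.Series(["1;2", "3"]))  # fine
check_semicolon_column(pd.Series(["1;two;3"]))   # raises ValueError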
String filtering has fewer set operations than the numerical filters
(inequalities, ranges, discrete values) since it only takes discrete
values. It will thus need less test coverage for now.
And tests that act on that column. This ensures that the query logic can
handle columns that contain some semicolon-separated values and columns
with none.
Some of these tests fail at the moment, which shows spots where the
SidecarQueryManager's type inference and handling should either be
improved or clarified in the docs.
Add helper methods to keep the checks DRY, and update fixture and tests.
To make the code easier to read, and a little more DRY.

Also took the opportunity to change the lambda functions used to
create filters for numerical operations on the dataframe into
named nested functions. This should hopefully be a little more
readable and still comply with the Ruff linter.
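
For example, the pattern (Ruff's E731 rule flags lambdas assigned to names, while a nested def is fine; this is an illustrative sketch, not the PR's code):

import pandas as pd

def make_greater_than_filter(threshold: float):
    # Named nested function instead of: greater_than = lambda s: s > threshold
    def greater_than(series: pd.Series) -> pd.Series:
        return series > threshold

    return greater_than

weights = pd.Series([20, 55, 70])
print(weights[make_greater_than_filter(50.0)(weights)])  # 55 and 70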
Could be a common use case to have hyphens in string values,
e.g. "North-East", "South-West".

Numerical values are not supported and will raise an exception.
Big refactoring. The same processing needs to be applied to positive
and to negative user-inputted filter values. NOT conditions are applied
with AND logic after positive conditions: in the end, the rows must NOT
match any negated values.
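
In pandas terms, the end result is something like this (a sketch of the combination logic, not the PR's code):

import pandas as pd

df = pd.DataFrame({"Population": [1, 2, 3, 4]})

positive_mask = df["Population"].between(1, 3)  # rows matching the positive filters
negated_mask = df["Population"] == 2            # rows matching any negated (!) value

# NOT conditions are ANDed in after the positive conditions:
print(df[positive_mask & ~negated_mask])        # keeps 1 and 3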

To try to keep the code DRY and manageable, several
helper methods have been added.
Whitespace inside values is preserved.
This ensures that the validation of the TSV content has a single
source of truth in the form of the SharedMetadataValidator,
which is used by both the CLI and the API.

Note! This is only for the contents of the TSV files. The filtering
logic for the queries remains separate, in the query manager on the
server side.
This script runs SharedMetadataValidator, i.e. the same logic that the
CLI and API use when loading a sample metadata TSV file. Thus, this
script can be used to ensure that the TSV results in the expected
DataFrame structure after being ingested by DivBase.
If users include Python/JSON-style array notation (e.g., '[1, 2, 3]')
in their sample metadata TSV, it will be treated as a string column by
DivBase. The warning tells the user this and advises them to use
semicolons (e.g., '1;2;3') instead for multi-value columns.
@brinkdp brinkdp marked this pull request as ready for review February 18, 2026 12:36
@brinkdp brinkdp requested a review from a team as a code owner February 18, 2026 12:36
Turns out that some TSVs exported from a df will not match the original
TSV, and that mainly has to do with ints and floats, e.g. a 0 being
exported as 0.0. I think that the discrepancies are acceptable but it
could be something to work on further.
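
One way this happens (assuming a missing value somewhere in an otherwise integer column, which forces pandas to use a float dtype):

import io

import pandas as pd

df = pd.DataFrame({"Weight": [0, 1, None]})  # the None upcasts the column to float64
buf = io.StringIO()
df.to_csv(buf, sep="\t", index=False)
print(buf.getvalue())  # the 0 is exported as 0.0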
@brinkdp
Contributor Author

brinkdp commented Feb 18, 2026

Hi @RMCrean, this is ready for review now. Happy to demo this PR and talk about it in person if you like.

About nested arrays and pandas:

I had a little look at it since we talked about this yesterday, and my current understanding is still that the main way to load a TSV and treat bracketed arrays in the TSV as nparrays, e.g. [1,2,3], is to apply a custom function to the dataframe or to use a pandas converter in the read_csv() step. For instance with something like:

import json
import numpy as np

# For serialised arrays in the TSV cells, e.g.: [1,2,3]
df['values'] = df['values'].apply(json.loads).apply(np.array)

# For semicolon values in the TSV cells, e.g. 1;2;3
df['values'] = df['values'].apply(lambda x: np.array(x.split(';'), dtype=int))

I'm not using nparrays at all for the multi-value cells at the moment, though. But I added a check that sends a warning to the user if they try to use bracket array notation in their sidecar TSV files to make the current treatment clear to the user.

@RMCrean
Member

RMCrean commented Feb 19, 2026

Thanks for this! I have started looking through it now to try and get to grips with it, but I agree it will be a good idea to look at this together. I think being able to validate the tsv as a user will be nice! I also like the use of dataclasses and pydantic models :) The logic for handling filtering/queries is very impressive too!

As this is such a big and impressive piece of work I do have some general questions/points from the first pass:

  • Have you tested this on a large metadata file? I'm wondering how validation impacts the query time for a tsv query?

  • I have two failing tests on this branch about a results file not showing up. Do you also have them?

  • Validation run on a very large file that is badly formatted (e.g. the same problem on every row) might cause the terminal to hang, as it will try to print out every validation failure. Maybe there should be some kind of cap at e.g. 500 errors/warnings before you return it, and just say there are potentially more errors but here are the first 500.

  • Regarding "ensured errors (and warnings) propagate back to the user (auth exception is triggered first, so it should be fine to have details in the subsequent exceptions, I thought.)" I agree it is good that the auth checks come first, but it would still not be desirable to potentially leak secrets back to an authenticated user in an error log. We can perhaps treat the S3 operations a little differently and, for example, not give tracebacks. We can probably also make use of pydantic SecretStr for the sensitive variables, as done in the API and CLI, to further reduce the risk.

  • The exception handling in each task seems to be different, and I think it would be good to standardise this somewhat so it is more predictable. This Stack Overflow thread seems worth discussing: https://stackoverflow.com/questions/16658371/how-to-log-exceptions-occurring-in-a-django-celery-task/45333231#45333231; the idea is to create a class that all tasks can inherit from that tells each task how to handle an unexpected error. Perhaps you've experimented with this already though? I see your comments about how one has to be careful about making sure errors can be propagated up correctly unchanged (which I had no idea about), so I appreciate this is not an easy topic at all, but I think it could be worth some time brainstorming on together.

Regarding pandas, my point the few times we've talked about this has been: can we use as much of the pandas default implementation as possible? Using the pandas way to store an array instead of a custom way with ";" is one example, but the general idea is about limiting what DivBase has to do itself and hopefully thereby limiting the number of edge cases and the complexity of the code.

Here is a small snippet to show more of what I mean (my pandas knowledge is very rusty so this was rather reliant on AI). In this case it handles both the treatment of arrays and the assignment of their underlying types.

import ast
import pandas as pd

data = {
    "ID": [1, 2, 3, 4],
    "Name": ["alpha", "beta", "gamma", "delta"],
    "Tags": [["tag1, with, special chars%^()", "tag2"], ["tag2"], ["tag1", "tag3", "tag4"], ["tag3"]],
    "float_list": [[1.1, 2.2], [3.3], [4.4, 5.5, 6.6], [7.7]],
    "mixed_list": [[1, "two", 3.0], ["four", 5, 6.6], [7.7, "eight", 9], ["ten"]],
    "ints_list": [[1, 2], [3], [4, 5, 6], [7]],
}

df_original = pd.DataFrame(data)

# save and load as tsv
file_path = "mixed_data.tsv"
df_original.to_csv(file_path, sep='\t', index=False)
df_loaded = pd.read_csv(file_path, sep='\t')

# Only apply ast.literal_eval to object (string) columns that look like Python literals
for col in df_loaded.select_dtypes(include=['object']).columns:
    try:
        df_loaded[col] = df_loaded[col].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)
    except ValueError:
        print(f"Skipping column '{col}' due to invalid literal values.")

print(f"""
    {type(df_loaded["Tags"][0][0])=},
    {type(df_loaded["float_list"][0][0])=},
    {type(df_loaded["mixed_list"][0][2])=},
    {type(df_loaded["ints_list"][0][0])=}
""")
print((df_loaded == df_original).all().all())  # elementwise compare, then reduce to one bool

Output:

Skipping column 'Name' due to invalid literal values.
    type(df_loaded["Tags"][0][0])=<class 'str'>,
    type(df_loaded["float_list"][0][0])=<class 'float'>,
    type(df_loaded["mixed_list"][0][2])=<class 'float'>,
    type(df_loaded["ints_list"][0][0])=<class 'int'>
True

But I appreciate that would be a major refactor, so I understand completely if you'd prefer not to consider this as an alternative.

@brinkdp
Contributor Author

brinkdp commented Feb 19, 2026

Thanks! That's many great ideas already and it sounds good to work on some of these things together.

large metadata file? a very large file that is badly formatted?

Oh, thanks for thinking about that. I did not try it yet, but that sounds like something we really need to do. I can generate a large TSV file and some mock VCFs with many samples and run some tests. Paginating the validation errors is also a good idea. Maybe also an option to print the validation messages to a log file.

I have two failing tests on this branch about a results file not showing up. Do you also have them?

No, all my tests pass... Let's look at that together.

We can perhaps treat the S3 operations a little differently and not for example give tracebacks. Can probably also make use of pydantic secretstrs for the sensitive variables like done in the API and CLI, to further reduce the risk.

I like that! Secrets should be well protected. Project info, like file and sample metadata, I feel needs to be returned to make the messages useful, but it would be good to think about the level of detail there too.

The exception handling in each task seems to be different and I think it would be good to standardise this somewhat so it is more predictable.

Yes, you are right 😄 The reason why this one is quite a bit different is that the route polls for task completion. We should maybe consider whether it should still be doing that? The other tasks pass their errors to the celery logs that users access with the task-history CLI. This task was extra awkward since the celery JSON serialisation/deserialisation did not work properly with custom exception classes that use custom __init__ methods, like DivBaseAPIException. Sounds good to brainstorm on this together to centralise the task exception handling.
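
For reference, my understanding of the underlying constraint (an assumption worth verifying together): Celery's JSON result backend reconstructs an exception roughly as exc_type(*args), so a custom exception only round-trips cleanly if all of its __init__ arguments are passed through to Exception.__init__. A hypothetical sketch, not the real DivBaseAPIException:

class TaskError(Exception):
    """Hypothetical task exception that survives Celery's JSON round-trip."""

    def __init__(self, message: str, status_code: int = 500):
        # Pass everything to Exception.__init__ so self.args holds all
        # constructor arguments and the exception can be rebuilt from them.
        super().__init__(message, status_code)
        self.message = message
        self.status_code = status_code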

can we use as much pandas default implementation as possible? and hopefully therefore limit the number of edge cases/complexity of the code.

I am completely with you here. The challenge, I think, is to understand what the pandas default or best practice is for multi-value cells. I think we are currently looking at "pandas default" from two different perspectives, so we just need to decide what to go for. I am using the default behaviour in the sense that columns that are not inferred as int or float in the df will be treated as string. If I understand your perspective, you are thinking about how to represent the arrays in the TSV in the most pandas-default way? If bracket array notation feels more universal (and it is what many programming languages use), then I think it would be worth refactoring to use that instead of semicolons.

Let's find time when we can look more at this together! In the meantime I can look at the large file-case.

To be able to test the sample metadata validation and queries with
larger files. The large files should not be under source control
due to their potential size. The scripts use fixed random seeds for
reproducibility.

Example to generate a mock VCF file with 5000 samples and 100 variants,
and then generate the corresponding sample metadata TSV file with four
fixed columns, one warning column that will trigger validator warnings,
followed by 25 random columns.

bash scripts/benchmarking/generate_mock_vcf.sh -s 5000 -r 100
will generate a file named mock_vcf_5000s_100r.vcf.gz

python scripts/generate_mock_sample_metadata.py
--vcf mock_vcf_5000s_100r.vcf.gz
--output mock_metadata_mock_vcf_5000s_100r.tsv
--columns 25 --add-warning-column
@RMCrean
Member

RMCrean commented Feb 20, 2026

That sounds like a good plan. For the error messages I did mean to just cap after e.g. 500 messages and give those back to the user. You could implement pagination etc., but my guess would be that if they are getting 500+ errors, there is probably something generally wrong in how they've formatted the file (e.g. every cell in a column misformatted). So I'm not sure pagination is needed, but I don't have very strong opinions. A log file for all errors is perhaps easier to implement, I suppose.

And sorry about the tests; I think the problem was me not having run uv sync, or at least everything is passing now.

Re: the pandas part. I think about it like this: the goal is to have a way to store an array (aka a collection of data) in a single cell. What is a way that pandas/Python natively supports? A Python list (or np.array) is one way to do it. A cell with "[1, 1.0, 'hello']" can use ast.literal_eval and be auto-converted into a list with the correct underlying types for each element. So "[1, 1.0, 'hello']" becomes list[int, float, str], rather than having to rely on type casting. If ast.literal_eval can't convert it, then it's a user error and they have to reformat their input. There is no direct way in Python to convert "1;1.0;'hello'". So if we store an array in the TSV using e.g. "[1,2,3]", Python can use ast.literal_eval to auto-convert that into an array for us and assign the underlying data types correctly.

(I think the benefits of using an array are more than just the conversion though)

This also makes me think: I wonder if there should be a constraint that each column must contain only arrays/lists of values or only single values. It's perfectly fine to have a column called e.g. "Groups" with the following data:

Groups
["A", "C"] # fine to be a member of multiple groups
["B"] # fine to be a member of 1 group
[] # fine to not be a member of any group

And this would not be allowed:

Groups
["A", "C"] 
"B"
# empty cell here 

The ";" separation syntax does not have that same benefit, where you cannot directly tell if you're dealing with a list/array of items or just a string and therefore what kind of logic to apply to it. A column made up of lists/array columns can use e.g. if "A" in column style checks for simple filtering for example.

Can cause issues with terminal settings, rich, typer etc. Add an
assertion for the expected sample names.
Use Python list syntax for multi-value cells instead of the previous
semicolon-separated string format.

To test the new functionality, CLI commands and their unit tests are
also updated here.

Tagged WIP since there are probably many breaking changes outside of the
CLI command and its unit tests.
The shared metadata validator has two callers: the CLI validator and
the server-side query engine. The CLI validator will only need to
display the error and warning messages to the user. But the query engine
needs to act on them (raise or warn).

The previous implementation, which relied on string-matching the error
messages, was brittle: if message phrasings change (which they are prone
to do in the pilot phase, when we expect a lot of feedback), the query
engine will often need troubleshooting. Inspired by HTTP status codes,
this creates a custom DivBase metadata validation Enum category system
that allows the query engine to look for the validation Enums rather
than the exact strings in the messages.

Tagged WIP here too since things are probably still broken as the
refactoring is not complete
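
The general shape of such an Enum category system might look like this (the category names here are hypothetical, not the PR's actual enum):

from dataclasses import dataclass
from enum import Enum

class ValidationCode(Enum):
    MIXED_TYPE_COLUMN = "mixed_type_column"
    UNKNOWN_SAMPLE = "unknown_sample"

@dataclass
class ValidationMessage:
    code: ValidationCode
    text: str

def handle(msg: ValidationMessage) -> None:
    # The CLI just prints msg.text; the query engine branches on msg.code,
    # so rephrasing the message text cannot break the engine.
    if msg.code is ValidationCode.MIXED_TYPE_COLUMN:
        raise ValueError(msg.text)
    print(f"warning: {msg.text}")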
Reuse calculations from the shared metadata validator: no need to
recheck if a col is num, string, or mixed-type, just import the results.

The shared validator does the parsing of multi-value lists and passes
the dataframe to the query engine logic.
Also collect all shared unit test fixtures together, now that all tests
pass again.
To reflect new syntax using bracket arrays.
Should probably have been committed as part of #70.
When using
divbase-cli dimensions validate-metadata-file

with the mock 5000-sample data, it will print every single sample name
to the terminal if they are not in the dimensions of that project. This
limits it to 20.
Having the classmethod was convenient for calling the class from the
CLI since the instance was never reused. But the classmethod still
captured state and used the __init__ method, so it is clearer to just
follow the instance-based style used in the rest of the codebase.
The old name was MetadataTSVValidator, but this could become a source of
confusion since the core validation logic is in the
SharedMetadataValidator.
