
SQ-837: Refine the sample metadata TSV handling and write user guides#67

Open
brinkdp wants to merge 104 commits into main from sample-metadata-functionalities

Conversation

@brinkdp
Contributor

@brinkdp brinkdp commented Feb 16, 2026

This PR is a major refactoring of the sidecar sample metadata TSV query logic. The focus was on adding functionality such as numerical operations and on improving UX.

Things covered in this PR:

  • supported numerical operators (inequalities, ranges) and negative values in query filters (see the example commands after this list)
  • supported ! for NOT filtering
  • supported multi-value cells in the TSV, delimited by semicolons
  • added a CLI command to generate a sidecar TSV template from the project's dimensions index
  • added a CLI command for a client-side sidecar TSV validator (decided against server-side validation since it would need to send potentially big request payloads). The validation logic lives in the libs package so that the same code can be used in the client-side validator and in the server-side query processing.
  • specified requirements for sample metadata files in DivBase in the user guide and implemented them as warnings and errors in the validator and the query manager. The most important one: if a column is not strictly numeric, treat it as string and warn the user when they try to use numerical operators on that column.
  • ensured errors (and warnings) propagate back to the user (the auth exception is triggered first, so I thought it should be fine to have details in the subsequent exceptions)
  • refactored the error raised when VCF dimensions are not up to date and now raise it for metadata queries too (previously it was raised only for VCF queries)
  • added separate CRUDs and endpoints for getting only the unique samples (and unique scaffolds; not really used in this PR but very similar logic) from the dimensions table. divbase-cli dimensions show --unique-samples now gives the number and names of samples (and scaffolds for --unique-scaffolds)
  • added unit and e2e tests for many, many use cases
  • added a script that loads a TSV and prints the resulting df to the terminal for inspection. It calls the same function that DivBase uses to load TSVs on the client and server side.
  • wrote user guide pages. Related pages on dimensions and VCF queries are a TODO that will be covered in their own branch(es).
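
To give a feel for the query filter syntax (these invocations are illustrative; the user guide linked below has the exact grammar):

# Range filter on a numerical column:
divbase-cli query tsv "Population:2-3" --metadata-tsv-name tutorial_mock_metadata_mgpv3snps.tsv

# Combining an inequality, a range, and discrete values (OR semantics within the column):
divbase-cli query tsv "Weight:>50,30-40,25,45" --metadata-tsv-name tutorial_mock_metadata_mgpv3snps.tsv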

Sidecar metadata TSV format and filtering syntax

To not repeat what I wrote in the user guide, I would suggest spinning up mkdocs and reading: http://localhost:8008/divbase/user-guides/sidecar-metadata/

Also updated the quick-start guide, but there is really no room there to discuss the metadata format and query syntax.

Create pre-filled template

The first column in the TSV, Sample_ID, is the only column that we strictly enforce, since it needs to match exactly the sample IDs in the VCFs in the project. To help users get this column right, I added a CLI command that reads the project's dimensions index and generates a TSV with this first column populated with the sample names.

For instance, if local-project-1 has files as per local_dev_setup.py and divbase-cli dimensions update --project local-project-1 has been run:

divbase-cli dimensions create-metadata-template 

will save a single-column TSV with all unique sample names from the dimensions file to sample_metadata_<project_name>.tsv. Output filename can be controlled with --output.
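
For example, the generated template might look like this (sample names here are hypothetical; the real ones come from the project's dimensions index):

Sample_ID
sample_A
sample_B
sample_C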

TSV file validation

The TSV validator runs client-side and is intended to be run on local TSV files before they are uploaded to DivBase. It validates the content of the TSV; checks on the query filters are done elsewhere. The validator does make one request to the server, to fetch the list of samples from the project's dimensions index so it can validate sample names. I therefore placed the CLI command under divbase-cli dimensions, just like the TSV template creator.

We talked about having Pandas in libs to be able to ensure that the same code can be used client-side and server-side for doing the same checks. To my surprise, Pandas was already in the libs package... I could have sworn that it was in the api package when I looked last week! Well, anyway, that made it an easy choice to refactor all shared code into a class in the libs package.

For local-project-1 after its dimensions index has been updated:

# Should give a warning about mixed-type column, but should otherwise be fine:
divbase-cli dimensions validate-metadata-file tests/fixtures/sample_metadata.tsv --project local-project-1

# Intentionally, no sample names match dimensions here, should return a lot of warnings and errors:
divbase-cli dimensions validate-metadata-file tests/fixtures/sample_metadata_incorrect_formatting_to_test_tsv_validator.tsv --project local-project-1

Metadata queries

In addition to the shared TSV content validator, which is also run by SidecarQueryManager when a metadata query begins processing, there are checks for the filter processing, for instance whether numerical filters can be applied or not. This part of the code does not raise errors, only warnings. The user will typically get metadata query results, but with warnings.

Users can use this to add their own user-defined columns for these
samples. For example:
divbase-cli query tsv "Population:2-3" --metadata-tsv-name \
tutorial_mock_metadata_mgpv3snps.tsv

Pandas read_csv() infers whether columns are numerical or string based
on the data, so we already had the correct type in the df.
See:
https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
Might need some more testing and docs, but the basics are in place,
i.e. inequality OR range OR discrete value,
for example "Weight": ">50,30-40,25,45"
Up until now, warnings were only displayed in the worker logs.
Use textwrap to ensure that bulleted warnings keep their bullet
indentation when the warnings string is too long to fit within the
terminal width.
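
Roughly what that looks like (a sketch, assuming bullets like the ones used in this PR; the real helper may differ):

import shutil
import textwrap

def format_warnings(warnings: list[str]) -> str:
    """Wrap each warning so continuation lines line up under the bullet text."""
    width = shutil.get_terminal_size().columns
    return "\n".join(
        textwrap.fill(w, width=width, initial_indent="  • ", subsequent_indent="    ")
        for w in warnings
    )

print(format_warnings(["This is a long warning about a mixed-type column that will wrap."]))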
Use lambda functions and pandas .apply() to split values on semicolons
and apply filters accordingly.

This allows users to define sample metadata TSVs where cells can contain
multiple values separated by semicolons, and filter strings
can match any of those values.

Sample_ID  Group
S1   1
S2   1;2
S3   2
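
With the table above, a minimal sketch of that .apply() pattern (illustrative, not the PR's exact code) could look like:

import pandas as pd

df = pd.DataFrame({"Sample_ID": ["S1", "S2", "S3"], "Group": ["1", "1;2", "2"]})

# A row matches if ANY of its semicolon-separated values equals the filter value.
target = "2"
mask = df["Group"].apply(lambda cell: target in str(cell).split(";"))
print(df[mask])  # S2 and S3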
There might be an issue with semicolon-separated numerical values...
Pandas will not infer a column as numeric if it contains
semicolon-separated values, even if those values are numeric. This means
that the inequalities and range logic will not work for such values.

This adds a helper method that checks whether a column contains
semicolon-separated numeric values and raises an error if the values
are of mixed types (e.g. 1;two;3). The lambda functions in the
run_query() logic will convert each semicolon-separated numerical value
to float or int before applying the inequality or range logic.
For semicolon-separated columns: return 400 and print an error message
with details about the mixed numeric and string types in the column.
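
A self-contained sketch of that kind of check (the real helper in the PR may be structured differently):

import pandas as pd

def _is_numeric(token: str) -> bool:
    try:
        float(token)
        return True
    except ValueError:
        return False

def check_semicolon_column(series: pd.Series) -> None:
    """Raise if any cell mixes numeric and non-numeric semicolon-separated values."""
    for cell in series.dropna():
        kinds = {_is_numeric(v) for v in str(cell).split(";")}
        if len(kinds) > 1:
            raise ValueError(f"Mixed numeric and string values in cell: {cell!r}")

check_semicolon_column(pd.Series(["1;2", "3"]))  # fine
check_semicolon_column(pd.Series(["1;two;3"]))   # raises ValueError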
String filtering has fewer set operations than the numerical filters
(inequalities, ranges, discrete values) since it only takes discrete
values. It will thus need less test coverage for now.
And tests that act on that column. This ensures that the query logic can
handle columns that contain some semicolon-separated values and columns
with none.
Some of these tests fail at the moment, which shows spots where the
SidecarQueryManager's type inference and handling should either be
improved or clarified in the docs.
Add helper methods to keep the checks DRY, and update fixture and tests.
To make the code easier to read, and a little more DRY.

Also took the opportunity to change the lambda functions used to
create filters for numerical operations on the dataframe into
named nested functions. This should hopefully be a little more
readable and still comply with the Ruff linter.
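
For example, the pattern (Ruff's E731 rule flags lambdas assigned to names, while a nested def is fine; this is an illustrative sketch, not the PR's code):

import pandas as pd

def make_greater_than_filter(threshold: float):
    # Named nested function instead of: greater_than = lambda s: s > threshold
    def greater_than(series: pd.Series) -> pd.Series:
        return series > threshold

    return greater_than

weights = pd.Series([20, 55, 70])
print(weights[make_greater_than_filter(50.0)(weights)])  # 55 and 70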
Could be a common use case to have hyphens in string values,
e.g. "North-East", "South-West".

Numerical values are not supported and will raise an exception.
Big refactoring. The same processing needs to be applied to positive
and to negative user-inputted filter values. NOT conditions are applied
with AND logic after positive conditions: in the end, the rows must NOT
match any negated values.
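
In pandas terms, the end result is something like this (a sketch of the combination logic, not the PR's code):

import pandas as pd

df = pd.DataFrame({"Population": [1, 2, 3, 4]})

positive_mask = df["Population"].between(1, 3)  # rows matching the positive filters
negated_mask = df["Population"] == 2            # rows matching any negated (!) value

# NOT conditions are ANDed in after the positive conditions:
print(df[positive_mask & ~negated_mask])        # keeps 1 and 3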

To try to keep the code DRY and manageable, several
helper methods have been added.
Whitespace inside values is preserved.
This ensures that the validation of the TSV content has a single
source of truth in the form of the SharedMetadataValidator,
which is used by both the CLI and the API.

Note! This is only for the contents of the TSV files. The filtering
logic for the queries remains separate, in the query manager on the
server side.
This script runs SharedMetadataValidator, i.e. the same logic that the
CLI and API use when loading a sample metadata TSV file. Thus, this
script can be used to ensure that the TSV results in the expected
DataFrame structure after being ingested by DivBase.
If users include Python/JSON-style array notation (e.g., '[1, 2, 3]')
in their sample metadata TSV, it will be treated as a string column by
DivBase. The warning tells the user this and advises them to use
semicolons (e.g., '1;2;3') instead for multi-value columns.
@brinkdp brinkdp marked this pull request as ready for review February 18, 2026 12:36
@brinkdp brinkdp requested a review from a team as a code owner February 18, 2026 12:36
Turns out that some TSVs exported from a df will not match the original
TSV, and that mainly has to do with ints and floats, e.g. a 0 being
exported as 0.0. I think that the discrepancies are acceptable but it
could be something to work on further.
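
One way this happens (assuming a missing value somewhere in an otherwise integer column, which forces pandas to use a float dtype):

import io

import pandas as pd

df = pd.DataFrame({"Weight": [0, 1, None]})  # the None upcasts the column to float64
buf = io.StringIO()
df.to_csv(buf, sep="\t", index=False)
print(buf.getvalue())  # the 0 is exported as 0.0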
@brinkdp
Contributor Author

brinkdp commented Feb 18, 2026

Hi @RMCrean, this is ready for review now. Happy to demo this PR and talk about it in person if you like.

About nested arrays and pandas:

I had a little look at it since we talked about this yesterday, and my current understanding is still that the main way to load a TSV and treat bracketed arrays in the TSV as nparrays, e.g. [1,2,3], is to apply a custom function to the dataframe or to use a pandas converter in the read_csv() step. For instance with something like:

import json
import numpy as np

# For serialised arrays in the TSV cells, e.g.: [1,2,3]
df['values'] = df['values'].apply(json.loads).apply(np.array)

# For semicolon values in the TSV cells, e.g. 1;2;3
df['values'] = df['values'].apply(lambda x: np.array(x.split(';'), dtype=int))

I'm not using nparrays at all for the multi-value cells at the moment, though. But I added a check that sends a warning to the user if they try to use bracket array notation in their sidecar TSV files to make the current treatment clear to the user.

@RMCrean
Member

RMCrean commented Feb 19, 2026

Thanks for this! I have started looking through it now to try and get to grips with it, but I agree it will be a good idea to look at this together. I think being able to validate the tsv as a user will be nice! I also like the use of dataclasses and pydantic models :) The logic for handling filtering/queries is very impressive too!

As this is such a big and impressive piece of work I do have some general questions/points from the first pass:

  • Have you tested this on a large metadata file? I'm wondering how validation impacts the query time for a tsv query?

  • I have two failing tests on this branch about a results file not showing up. Do you also have them?

  • Validation run on a very large file that is badly formatted (e.g. the same problem on every row) might cause the terminal to hang, as it will try to print out every validation failure. Maybe there should be some kind of cap at e.g. 500 errors/warnings before you return it, and just say there are potentially more errors but here are the first 500.

  • Regarding "ensured errors (and warnings) propagate back to the user (auth exception is triggered first, so it should be fine to have details in the subsequent exceptions, I thought.)" I agree it is good that the auth checks come first, but it would still not be desirable to potentially leak secrets back to an authenticated user in an error log. We can perhaps treat the S3 operations a little differently and, for example, not give tracebacks. We can probably also make use of pydantic SecretStr for the sensitive variables, as done in the API and CLI, to further reduce the risk.

  • The exception handling in each task seems to be different, and I think it would be good to standardise this somewhat so it is more predictable. This Stack Overflow thread seems worth discussing: https://stackoverflow.com/questions/16658371/how-to-log-exceptions-occurring-in-a-django-celery-task/45333231#45333231; the idea is to create a class that all tasks can inherit from that tells each task how to handle an unexpected error. Perhaps you've experimented with this already though? I see your comments about how one has to be careful about making sure errors can be propagated up correctly unchanged (which I had no idea about), so I appreciate this is not an easy topic at all, but I think it could be worth some time brainstorming on together.

Regarding pandas, my point the few times we've talked about this has been: can we use as much of the pandas default implementation as possible? Using the pandas way to store an array instead of a custom way with ";" is one example, but the general idea is about limiting what DivBase has to do itself and hopefully thereby limiting the number of edge cases and the complexity of the code.

Here is a small snippet to show more of what I mean (my pandas knowledge is very rusty so this was rather reliant on AI). In this case it handles both the treatment of arrays and the assignment of their underlying types.

import ast
import pandas as pd

data = {
    "ID": [1, 2, 3, 4],
    "Name": ["alpha", "beta", "gamma", "delta"],
    "Tags": [["tag1, with, special chars%^()", "tag2"], ["tag2"], ["tag1", "tag3", "tag4"], ["tag3"]],
    "float_list": [[1.1, 2.2], [3.3], [4.4, 5.5, 6.6], [7.7]],
    "mixed_list": [[1, "two", 3.0], ["four", 5, 6.6], [7.7, "eight", 9], ["ten"]],
    "ints_list": [[1, 2], [3], [4, 5, 6], [7]],
}

df_original = pd.DataFrame(data)

# save and load as tsv
file_path = "mixed_data.tsv"
df_original.to_csv(file_path, sep='\t', index=False)
df_loaded = pd.read_csv(file_path, sep='\t')

# Only apply ast.literal_eval to object (string) columns that look like Python literals
for col in df_loaded.select_dtypes(include=['object']).columns:
    try:
        df_loaded[col] = df_loaded[col].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)
    except ValueError:
        print(f"Skipping column '{col}' due to invalid literal values.")

print(f"""
    {type(df_loaded["Tags"][0][0])=},
    {type(df_loaded["float_list"][0][0])=},
    {type(df_loaded["mixed_list"][0][2])=},
    {type(df_loaded["ints_list"][0][0])=}
""")
print((df_loaded == df_original).all().all())  # elementwise compare, then reduce to one bool

Output:

Skipping column 'Name' due to invalid literal values.
    type(df_loaded["Tags"][0][0])=<class 'str'>,
    type(df_loaded["float_list"][0][0])=<class 'float'>,
    type(df_loaded["mixed_list"][0][2])=<class 'float'>,
    type(df_loaded["ints_list"][0][0])=<class 'int'>
True

But I appreciate that would be a major refactor, so I understand completely if you'd prefer not to consider this as an alternative.

@brinkdp
Contributor Author

brinkdp commented Feb 19, 2026

Thanks! That's many great ideas already and it sounds good to work on some of these things together.

large metadata file? a very large file that is badly formatted?

Oh, thanks for thinking about that. I did not try it yet, but that sounds like something we really need to do. I can generate a large TSV file and some mock VCFs with many samples and run some tests. Paginating the validation errors is also a good idea. Maybe also an option to print the validation messages to a log file.

I have two failing tests on this branch about a results file not showing up. Do you also have them?

No, all my tests pass... Let's look at that together.

We can perhaps treat the S3 operations a little differently and not for example give tracebacks. Can probably also make use of pydantic secretstrs for the sensitive variables like done in the API and CLI, to further reduce the risk.

I like that! Secrets should be well protected. Project info, like file and sample metadata, I feel needs to be returned to make the messages useful, but it would be good to think about the level of detail there too.

The exception handling in each task seems to be different and I think it would be good to standardise this somewhat so it is more predictable.

Yes, you are right 😄 The reason why this one is quite a bit different is that the route polls for task completion. We should maybe consider whether it should still be doing that? The other tasks pass their errors to the celery logs that users access with the task-history CLI. This task was extra awkward since the celery JSON serialisation/deserialisation did not work properly with custom exception classes that use custom __init__ methods, like DivBaseAPIException. Sounds good to brainstorm on this together to centralise the task exception handling.
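
For reference, my understanding of the underlying constraint (an assumption worth verifying together): Celery's JSON result backend reconstructs an exception roughly as exc_type(*args), so a custom exception only round-trips cleanly if all of its __init__ arguments are passed through to Exception.__init__. A hypothetical sketch, not the real DivBaseAPIException:

class TaskError(Exception):
    """Hypothetical task exception that survives Celery's JSON round-trip."""

    def __init__(self, message: str, status_code: int = 500):
        # Pass everything to Exception.__init__ so self.args holds all
        # constructor arguments and the exception can be rebuilt from them.
        super().__init__(message, status_code)
        self.message = message
        self.status_code = status_code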

can we use as much pandas default implementation as possible? and hopefully therefore limit the number of edge cases/complexity of the code.

I am completely with you here. The challenge, I think, is to understand what the pandas default or best practice is for multi-value cells. I think we are currently looking at "pandas default" from two different perspectives, so we just need to decide what to go for. I am using the default behaviour in the sense that columns that are not inferred as int or float in the df will be treated as string. If I understand your perspective, you are thinking about how to represent the arrays in the TSV in the most pandas-default way? If bracket array notation feels more universal (and it is what many programming languages use), then I think it would be worth refactoring to use that instead of semicolons.

Let's find time when we can look more at this together! In the meantime I can look at the large file-case.

To be able to test the sample metadata validation and queries with
larger files. The large files should not be under source control
due to their potential size. The scripts use fixed random seeds for
reproducibility.

Example to generate a mock VCF file with 5000 samples and 100 variants,
and then generate the corresponding sample metadata TSV file with four
fixed columns, one warning column that will trigger validator warnings,
followed by 25 random columns.

bash scripts/benchmarking/generate_mock_vcf.sh -s 5000 -r 100
will generate a file named mock_vcf_5000s_100r.vcf.gz

python scripts/generate_mock_sample_metadata.py
--vcf mock_vcf_5000s_100r.vcf.gz
--output mock_metadata_mock_vcf_5000s_100r.tsv
--columns 25 --add-warning-column
@RMCrean
Member

RMCrean commented Feb 20, 2026

That sounds like a good plan. For the error messages I did mean to just cap after e.g. 500 messages and give those back to the user. You could implement pagination etc., but my guess would be that if they are getting 500+ errors, there is probably something generally wrong in how they've formatted the file (e.g. every cell in a column misformatted). So I'm not sure pagination is needed, but I don't have very strong opinions. A log file for all errors is perhaps easier to implement, I suppose.

And sorry about the tests; I think the problem was me not having run uv sync, or at least everything is passing now.

Re: the pandas part. I think about it like this: the goal is to have a way to store an array (aka a collection of data) in a single cell. What is a way that pandas/Python natively supports? A Python list (or np.array) is one way to do it. A cell with "[1, 1.0, 'hello']" can use ast.literal_eval and be auto-converted into a list with the correct underlying types for each element. So "[1, 1.0, 'hello']" becomes list[int, float, str], rather than having to rely on type casting. If ast.literal_eval can't convert it, then it's a user error and they have to reformat their input. There is no direct way in Python to convert "1;1.0;'hello'". So if we store an array in the TSV using e.g. "[1,2,3]", Python can use ast.literal_eval to auto-convert that into an array for us and assign the underlying data types correctly.

(I think the benefits of using an array are more than just the conversion though)

This also makes me think: I wonder if there should be a constraint that each column must contain only arrays/lists of values or only single values. It's perfectly fine to have a column called e.g. "Groups" with the following data:

Groups
["A", "C"] # fine to be a member of multiple groups
["B"] # fine to be a member of 1 group
[] # fine to not be a member of any group

And this would not be allowed:

Groups
["A", "C"] 
"B"
# empty cell here 

The ";" separation syntax does not have that same benefit, where you cannot directly tell if you're dealing with a list/array of items or just a string and therefore what kind of logic to apply to it. A column made up of lists/array columns can use e.g. if "A" in column style checks for simple filtering for example.

Can cause issues with terminal settings, rich, typer etc. Add an
assertion for the expected sample names.
Use Python list syntax for multi-value cells instead of the previous
semicolon-separated string format.

To test the new functionality, CLI commands and their unit tests are
also updated here.

Tagged WIP since there are probably many breaking changes outside of the
CLI command and its unit tests.
The shared metadata validator has two callers: the CLI validator and
the server-side query engine. The CLI validator will only need to
display the error and warning messages to the user. But the query engine
needs to act on them (raise or warn).

The previous implementation, which relied on string-matching the error
messages, was brittle: if message phrasings change (which they are prone
to do in the pilot phase, when we expect a lot of feedback), the query
engine will often need troubleshooting. Inspired by HTTP status codes,
this creates a custom DivBase metadata validation Enum category system
that allows the query engine to look for the validation Enums rather
than the exact strings in the messages.

Tagged WIP here too since things are probably still broken as the
refactoring is not complete
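
The general shape of such an Enum category system might look like this (the category names here are hypothetical, not the PR's actual enum):

from dataclasses import dataclass
from enum import Enum

class ValidationCode(Enum):
    MIXED_TYPE_COLUMN = "mixed_type_column"
    UNKNOWN_SAMPLE = "unknown_sample"

@dataclass
class ValidationMessage:
    code: ValidationCode
    text: str

def handle(msg: ValidationMessage) -> None:
    # The CLI just prints msg.text; the query engine branches on msg.code,
    # so rephrasing the message text cannot break the engine.
    if msg.code is ValidationCode.MIXED_TYPE_COLUMN:
        raise ValueError(msg.text)
    print(f"warning: {msg.text}")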
Reuse calculations from the shared metadata validator: no need to
recheck if a col is num, string, or mixed-type, just import the results.

The shared validator does the parsing of multi-value lists and passes
the dataframe to the query engine logic.
Also collect all shared unit test fixtures together, now that all tests
pass again.
To reflect new syntax using bracket arrays.
Should probably have been committed as part of #70.
When using
divbase-cli dimensions validate-metadata-file

with the mock 5000-sample data, it will print every single sample name
to the terminal if they are not in the dimensions of that project. This
limits it to 20.
Having the classmethod was convenient for calling the class from the
CLI since the instance was never reused. But the classmethod still
captured state and used the __init__ method, so it is clearer to just
follow the instance-based style used in the rest of the codebase.
The old name was MetadataTSVValidator, but this could become a source of
confusion since the core validation logic is in the
SharedMetadataValidator.
