SQ-837: Refine the sample metadata TSV handling and write user guides#67
Conversation
Users can use this to add their own user-defined columns for these samples.
For example: `divbase-cli query tsv "Population:2-3" --metadata-tsv-name tutorial_mock_metadata_mgpv3snps.tsv`. Pandas `read_csv()` infers whether columns are numerical or string based on the data, so we already had the correct type in the df. See: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
Might need some more testing and docs, but the basics are in place.
I.e. inequality OR range OR discrete value, for example `"Weight": ">50,30-40,25,45"`.
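A filter string like the one above can be parsed by splitting on commas and classifying each token. This is a minimal sketch of that idea; the function name and OR semantics are illustrative assumptions, not DivBase's actual implementation:

```python
import re

def parse_filter(expr: str):
    """Split a comma-separated filter string into predicates.

    Each token is an inequality (">50"), a range ("30-40"), or a
    discrete value ("25"). Returns a function testing one number.
    Hypothetical helper, not the DivBase API.
    """
    preds = []
    for token in expr.split(","):
        token = token.strip()
        if token.startswith((">", "<")):
            op, bound = token[0], float(token[1:])
            # bind `bound` via default arg so each closure keeps its own value
            preds.append((lambda v, b=bound: v > b) if op == ">" else (lambda v, b=bound: v < b))
        elif re.fullmatch(r"\d+(\.\d+)?-\d+(\.\d+)?", token):
            lo, hi = map(float, token.split("-"))
            preds.append(lambda v, lo=lo, hi=hi: lo <= v <= hi)
        else:
            preds.append(lambda v, t=float(token): v == t)
    # OR semantics: a value passes if any predicate matches
    return lambda v: any(p(v) for p in preds)

matches = parse_filter(">50,30-40,25,45")
# e.g. matches(35) and matches(60) hold; matches(28) does not
```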
Up until now, warnings were only displayed in the worker logs.
Use textwrap to ensure that bulleted warnings keep bullet indentation when the warning string is too long to fit within the terminal width.
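The indentation trick can be done with `textwrap.fill`'s `initial_indent`/`subsequent_indent` parameters. A sketch (the function name and bullet style are assumptions, not the actual DivBase code):

```python
import shutil
import textwrap

def wrap_warnings(warnings, width=None):
    """Wrap each bulleted warning so continuation lines stay aligned
    under the bullet text. Illustrative helper only."""
    width = width or shutil.get_terminal_size((80, 20)).columns
    lines = []
    for w in warnings:
        lines.append(textwrap.fill(
            w,
            width=width,
            initial_indent="  - ",
            subsequent_indent="    ",  # continuation aligns under the text
        ))
    return "\n".join(lines)

text = wrap_warnings(
    ["Column 'Weight' contains semicolon-separated values that could not all be parsed as numbers."],
    width=40,
)
print(text)
```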
Use lambda functions and pandas `.apply()` to split values on semicolons and apply filters accordingly. This allows users to define sample metadata TSVs where cells can contain multiple values separated by semicolons, and filter strings can match any of those values. For example:

| Sample_ID | Group |
|-----------|-------|
| S1        | 1     |
| S2        | 1;2   |
| S3        | 2     |
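A minimal sketch of the split-and-match approach on a table like the one above (the data and filter value are from the example; the exact code is an assumption):

```python
import pandas as pd

df = pd.DataFrame({"Sample_ID": ["S1", "S2", "S3"],
                   "Group": ["1", "1;2", "2"]})

# A filter value matches a cell if it equals ANY of the
# semicolon-separated entries in that cell.
wanted = "2"
mask = df["Group"].apply(lambda cell: wanted in str(cell).split(";"))
print(df[mask]["Sample_ID"].tolist())  # ['S2', 'S3']
```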
There might be an issue with semicolon separated numerical values...
Pandas will not infer a column as numeric if it contains semicolon-separated values, even if those values are numeric. This means that the inequality and range logic will not work for such values. This adds a helper method that checks whether a column contains semicolon-separated numeric values and raises an error if the values are of mixed types (e.g. `1;two;3`). The lambda functions in the `run_query()` logic will convert each semicolon-separated numerical value to float or int before applying the inequality or range logic.
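A sketch of such a mixed-type check (the function name and return convention are illustrative assumptions, not the actual DivBase helper):

```python
import pandas as pd

def check_semicolon_numeric(series: pd.Series) -> bool:
    """Return True if every semicolon-separated entry in the column is
    numeric; raise ValueError on mixed numeric/string entries.
    Illustrative sketch, not the actual DivBase helper."""
    def is_num(tok: str) -> bool:
        try:
            float(tok)
            return True
        except ValueError:
            return False

    flags = {is_num(tok) for cell in series.dropna() for tok in str(cell).split(";")}
    if flags == {True, False}:
        raise ValueError("Column mixes numeric and string values")
    return flags == {True}

print(check_semicolon_numeric(pd.Series(["1;2", "3"])))   # True
# check_semicolon_numeric(pd.Series(["1", "two;3"]))      # raises ValueError
```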
For semicolon-separated columns.
Return 400 and print error message with details about the mixed numeric and string types in the column.
String filtering involves fewer operations than the numerical filters (inequalities, ranges, discrete values) since it only takes discrete values. It will thus need less test coverage for now.
And tests that act on that column. This ensures that the query logic can handle both columns that contain semicolon-separated values and columns without them.
Some of these tests fail at the moment, which highlights spots where the SidecarQueryManager's type inference and handling should either be improved or clarified in the docs.
Add helper methods to keep the checks DRY, and update fixture and tests.
To make the code easier to read and a little more DRY. Also took the opportunity to change the lambda functions used to create filters for numerical operations on the dataframe into named nested functions. This should hopefully be a little more readable and still comply with the Ruff linter.
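A sketch of the lambda-to-named-nested-function change (names and the `>`/`<` dispatch are assumptions; Ruff's E731 rule discourages assigning lambdas, which named functions avoid):

```python
import pandas as pd

def build_numeric_filter(op: str, bound: float):
    """Return a named predicate instead of a lambda.
    Illustrative sketch, not DivBase's exact code."""
    def greater_than(value: float) -> bool:
        return value > bound

    def less_than(value: float) -> bool:
        return value < bound

    return greater_than if op == ">" else less_than

weights = pd.Series([25.0, 45.0, 60.0])
heavy = weights[weights.apply(build_numeric_filter(">", 50))]
print(heavy.tolist())  # [60.0]
```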
Could be a common use case to have hyphens in string values, e.g. "North-East", "South-West". Numerical values are not supported and will raise an exception.
Big refactoring. The same processing needs to be applied to both positive and negative user-supplied filter values. NOT conditions are applied with AND logic after positive conditions: in the end, the rows must not match any negated values. To keep the code DRY and manageable, several helper methods have been added.
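The positive-then-negated combination described above can be sketched with boolean masks (the column, data, and value sets are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"Sample_ID": ["S1", "S2", "S3", "S4"],
                   "Region": ["North", "South", "North", "East"]})

# Positive values are OR'ed together; negated values are then applied
# with AND logic: a row must not match ANY negated value.
positive = {"North", "South", "East"}
negated = {"South"}
mask = df["Region"].isin(positive) & ~df["Region"].isin(negated)
print(df[mask]["Sample_ID"].tolist())  # ['S1', 'S3', 'S4']
```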
Whitespace inside values is preserved.
…-loc' into sample-metadata-functionalities
This ensures that the validation of the TSV content has a single source of truth in the form of the SharedMetadataValidator, which is used by both the CLI and the API. Note! This is only for the contents of the TSV files. The filtering logic for the queries is still separate to the query manager on the server side.
This script runs SharedMetadataValidator, i.e. same logic that the CLI and API uses when loading a sample metadata TSV file. Thus, this script can be used to ensure that the TSV results in the expected DataFrame structure after being ingested by DivBase.
If users include Python/JSON-style array notation (e.g. `[1, 2, 3]`) in their sample metadata TSV, it will be treated as a string column by DivBase. The warning points this out to the user and advises using semicolons (e.g. `1;2;3`) instead for multi-value columns.
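A sketch of such a bracket-notation check (the function name, regex, and warning text are assumptions mirroring the behaviour described above, not the actual DivBase code):

```python
import re

def warn_on_bracket_arrays(cell: str):
    """Return a warning string if a cell looks like Python/JSON array
    notation, else None. Hypothetical helper."""
    if re.fullmatch(r"\s*\[.*\]\s*", str(cell)):
        return (f"Value {cell!r} uses array notation and will be kept "
                "as a plain string; use semicolons instead, e.g. '1;2;3'.")
    return None

print(warn_on_bracket_arrays("[1, 2, 3]"))  # warning text
print(warn_on_bracket_arrays("1;2;3"))      # None
```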
It turns out that some TSVs exported from a DataFrame will not match the original TSV, mainly because of ints and floats, e.g. a 0 being exported as 0.0. I think the discrepancies are acceptable, but it could be something to work on further.
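One common cause of the 0 → 0.0 discrepancy: a numeric column with any missing value is inferred as float64, so integers come back as floats on export. A minimal demonstration (hypothetical data):

```python
import io

import pandas as pd

# The empty Count cell for S2 becomes NaN, which forces float64,
# so the original 0 round-trips as 0.0.
tsv = "Sample_ID\tCount\nS1\t0\nS2\t\n"
df = pd.read_csv(io.StringIO(tsv), sep="\t")
print(df["Count"].dtype)   # float64
out = df.to_csv(sep="\t", index=False)
print("0.0" in out)        # True
```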
Hi @RMCrean, this is ready for review now. Happy to demo this PR and talk about it in person if you like. About nested arrays and pandas: I had a little look at it since we talked about this yesterday, and my current understanding is still that this is the main way to load a TSV and treat bracketed arrays in the TSV as nparrays:

```python
# For serialised arrays in the TSV cells, e.g. [1,2,3]
df['values'] = df['values'].apply(json.loads).apply(np.array)

# For semicolon values in the TSV cells, e.g. 1;2;3
df['values'] = df['values'].apply(lambda x: np.array(x.split(';'), dtype=int))
```

I'm not using nparrays at all for the multi-value cells at the moment, though. But I added a check that sends a warning to the user if they try to use bracket array notation in their sidecar TSV files, to make the current treatment clear to the user.
Thanks for this! I have started looking through it now to try and get to grips with it, but I agree it will be a good idea to look at this together. I think being able to validate the tsv as a user will be nice! I also like the use of dataclasses and pydantic models :) The logic for handling filtering/queries is very impressive too! As this is such a big and impressive piece of work I do have some general questions/points from the first pass:
Regarding pandas, my point each of the few times we've talked about this has been: can we use as much of the pandas default implementation as possible? Using the pandas way to store an array instead of a custom way with ";" is one example, but the general idea is more about limiting what divbase has to do itself, and hopefully therefore limiting the number of edge cases and the complexity of the code. Here is a small snippet to show more of what I mean (my pandas knowledge is very rusty so this was rather reliant on AI). In this case handling both the treatment of arrays and the assignment of their underlying types. Output: But I appreciate that would be a major refactor, so I understand completely if you'd prefer not to consider this as an alternative.
Thanks! That's many great ideas already and it sounds good to work on some of these things together.
Oh, thanks for thinking about that. I have not tried it yet, but that sounds like something we really need to do. I can generate a large TSV file and a mock VCF with many samples and run some tests. Paginating the validation errors is also a good idea. Maybe add an option to print the validation messages to a log file.
No, all my tests pass... Let's look at that together.
I like that! Secrets should be very protected. Project info, like file and sample metadata, I feel needs to be returned to make the messages useful, but it could be good to think about the level of detail there too.
Yes, you are right 😄 The reason why this one is quite a bit different is that the route polls for task completion. We should maybe consider whether it should still be doing that? The other tasks pass their errors to the celery logs that users access with the task-history CLI. This task was extra awkward since the celery JSON serialisation/deserialisation did not work properly with custom exception classes that use
I am completely with you here. The challenge, I think, is to understand what the pandas default or best practice for multi-value cells is. I think we are currently looking at "pandas default" from two different perspectives, so we just need to decide what to go for. I am using default behaviour in the sense that columns that are not inferred as int or float in the df will be treated as string. If I understand your perspective, you are thinking about how to represent the arrays in the TSV in the most pandas-default way? If bracket array notation feels more universal (and it is what many programming languages use), then I think it would be worth refactoring to use that instead of semicolons. Let's find time when we can look more at this together! In the meantime I can look at the large-file case.
To be able to test the sample metadata validation and queries with larger files. The large files should not be under source control due to their potential size. The scripts use random seeds for reproducibility. Example to generate a mock VCF file with 5000 samples and 100 variants, and then generate the corresponding sample metadata TSV file with four fixed columns, one warning column that will trigger validator warnings, followed by 25 random columns:

```
bash scripts/benchmarking/generate_mock_vcf.sh -s 5000 -r 100
```

will generate a file named `mock_vcf_5000s_100r.vcf.gz`, and

```
python scripts/generate_mock_sample_metadata.py --vcf mock_vcf_5000s_100r.vcf.gz --output mock_metadata_mock_vcf_5000s_100r.tsv --columns 25 --add-warning-column
```

generates the matching metadata TSV.
That sounds like a good plan. For the error messages I did mean to just cap after e.g. 500 messages and give them back to the user. You could implement pagination etc., but my guess would be that if they are getting 500+ errors, there is probably something generally wrong in how they've formatted the file (e.g. every cell in a column misformatted). So I'm not sure pagination is needed, but I don't have very strong opinions. A log file for all errors is perhaps easier to implement, I suppose. And sorry about the tests; I think the problem was me not having uv sync'd, or at least everything is passing now. R.E. the pandas part, I think about it like this: the goal is to have a way to store an array (aka a collection of data) in a single cell. What is a way that pandas/python natively supports? A python list (or np.array) is a way to do it. A cell with (I think the benefits of using an array are more than just the conversion though) This also makes me think: I wonder about the constraint that each column must contain only arrays/lists of multiple values or only single values. It's perfectly fine to have a column called e.g. "Groups" with the following data: And this would not be allowed: The ";" separation syntax does not have that same benefit, where you cannot directly tell if you're dealing with a list/array of items or just a string, and therefore what kind of logic to apply to it. A column made up of lists/arrays can use e.g.
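The list-in-a-cell idea above can be illustrated with a small sketch (hypothetical data): real Python lists let pandas tools like `explode()` work directly, whereas ";"-joined strings need a manual split first.

```python
import pandas as pd

# Cells holding real Python lists: pandas can tell a collection from
# a scalar, and explode() works directly.
df = pd.DataFrame({"Sample_ID": ["S1", "S2"], "Groups": [[1], [1, 2]]})
exploded = df.explode("Groups")
print(exploded["Groups"].tolist())  # [1, 1, 2]

# The ";" form always looks like one string, so it needs a split first:
df2 = pd.DataFrame({"Groups": ["1", "1;2"]})
df2["Groups"] = df2["Groups"].str.split(";")
print(df2.explode("Groups")["Groups"].tolist())  # ['1', '1', '2']
```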
Can cause issues with terminal settings, rich, typer, etc. Add assertion for the expected sample names.
Use python list syntax for multi-value cells instead of the previous semicolon-separated string format. To test the new functionality, CLI commands and their unit tests are also updated here. Tagged WIP since there are probably many breaking changes outside of the CLI command and its unit tests.
The shared metadata validator has two callers: the CLI validator and the server-side query engine. The CLI validator only needs to display the error and warning messages to the user, but the query engine needs to act on them (raise or warn). The previous implementation that relied on string-matching the error messages was brittle: if message phrasing changes (which it is prone to do in the pilot phase when we expect a lot of feedback), the query engine will often need troubleshooting. Inspired by HTTP status codes, this creates a custom DivBase metadata validation Enum category system that allows the query engine to look for the validation Enums rather than the exact strings in the messages. Tagged WIP here too since things are probably still broken as the refactoring is not complete.
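A minimal sketch of the Enum-code idea (the enum members, class names, and message text are illustrative assumptions, not DivBase's actual categories):

```python
from dataclasses import dataclass
from enum import Enum, auto

class ValidationCode(Enum):
    """Machine-readable categories, so callers match on codes rather
    than on message strings. Hypothetical members."""
    MIXED_NUMERIC_STRING = auto()
    ARRAY_NOTATION = auto()
    DUPLICATE_SAMPLE_ID = auto()

@dataclass
class ValidationMessage:
    code: ValidationCode
    message: str  # human-readable; free to change without breaking callers

msg = ValidationMessage(ValidationCode.MIXED_NUMERIC_STRING,
                        "Column 'Weight' mixes numbers and strings")

# The query engine checks the code, never the phrasing:
if msg.code is ValidationCode.MIXED_NUMERIC_STRING:
    print("raise")  # the CLI would instead just display msg.message
```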
Reuse calculations from the shared metadata validator: no need to recheck whether a column is numeric, string, or mixed-type; just import the results. The shared validator does the parsing of multi-value lists and passes the dataframe to the query engine logic.
Also collect all shared unit test fixtures together, now that all tests pass again.
To reflect new syntax using bracket arrays.
Should probably have been committed as part of #70.
When using `divbase-cli dimensions validate-metadata-file` with the mock 5000-sample data, it will print every single sample name to the terminal if they are not in the dimensions of that project. This limits it to 20.
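A sketch of such a capped listing (the helper name, bullet format, and cap of 20 are assumptions based on the description above):

```python
def summarize_missing(names, limit=20):
    """Print at most `limit` names, then a count of the rest.
    Illustrative helper, not the actual DivBase code."""
    lines = [f"  - {n}" for n in names[:limit]]
    if len(names) > limit:
        lines.append(f"  ... and {len(names) - limit} more")
    return "\n".join(lines)

out = summarize_missing([f"sample_{i}" for i in range(5000)])
print(out.splitlines()[-1])  # ... and 4980 more
```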
Having the classmethod was convenient for calling the class from the CLI since the instance was never reused. But the classmethod still captured state and used the `__init__` method, so it is clearer to just follow the instance-based style used in the rest of the codebase.
The old name was MetadataTSVValidator, but this could become a source of confusion since the core validation logic is in the SharedMetadataValidator
This PR is a major refactoring of the sidecar sample metadata TSV query logic. Focus was on adding functionality like numerical operations, and on improving UX.
Things covered in this PR:
`divbase-cli dimensions show --unique-samples` now gives the number and names of samples (and scaffolds for `--unique-scaffolds`)
Sidecar metadata TSV format and filtering syntax
To not repeat what I wrote in the user guide, I would suggest spinning up mkdocs and reading: http://localhost:8008/divbase/user-guides/sidecar-metadata/
Also updated the quick-start guide, but there is really no room to discuss the metadata format and query syntax there.
Create pre-filled template
The first column in the TSV, `Sample_ID`, is the only column that we strictly enforce since it needs to match exactly with the sample IDs in the VCFs in the project. To help users get this column right, I added a CLI command that reads the dimensions index of the project and generates a TSV with this first column populated with the sample names. For instance, if `local-project-1` has files as per `local_dev_setup.py` and `divbase-cli dimensions update --project local-project-1` has been run, the command will save a single-column TSV with all unique sample names from the dimensions file to `sample_metadata_<project_name>.tsv`. The output filename can be controlled with `--output`.
TSV file Validation
The TSV validator runs client-side and is intended to be run on local TSV files before they are uploaded to DivBase. It validates the content of the TSV; checks on the query filters are done elsewhere. The validator does make one request to the server to fetch a list of the samples from the project's dimensions index in order to validate sample names. Therefore I sorted the CLI command under `divbase-cli dimensions`, just like the TSV template creator.
We talked about having Pandas in the libs to be able to ensure that the same code can be used client-side and server-side for doing the same checks. To my surprise, Pandas was already in the libs package... I could have sworn that it was in the api package when I looked last week! Well, anyway, that made it an easy choice to refactor all shared code into a class in the libs package.
For `local-project-1` after its dimensions index has been updated:
Metadata queries
In addition to the shared TSV content validator, which is also run by SidecarQueryManager when a metadata query begins processing, there are checks for the filter processing itself: for instance, whether numerical filters can be applied or not. This part of the code does not raise errors, only warnings. The user will typically get metadata query results, but with warnings.