Guidelines
- I agree to follow this project's Contributing Guidelines.
Description
data.validator is built strongly around the idea of a table, with validations running on that table.
IMO this doesn't fit most use cases.
E.g. I do:
validate(data.frame(), name = "Comparing testing vs postgres data") |>
  validate_if(
    identical(
      names(get_cols(...)),
      names(get_cols(...))
    ),
    description = "Column names are the same in 1 table"
  ) |>
  validate_if(
    identical(
      as.vector(get_cols(...)),
      as.vector(get_cols(...))
    ),
    description = "Column types are the same in 1 table"
  ) |>
  add_results(report)
As you can see, I have to pass an empty data frame to validate() even though I don't use it.
Then, when I run print(report), I see:
|table_name |description |type | total_violations|
|:----------------------------------|:-------------------------------------------------|:-------|----------------:|
|Comparing testing vs ci data |Column names are the same in 1 table |success | NA|
|Comparing testing vs ci data |Column names are the same in 1 table |success | NA|
The column name table_name doesn't make sense to me in this situation. Maybe it should be Group?
Also, Violated data doesn't work with this flexible approach.
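For context, the `get_cols(...)` calls in the pipeline above are elided, so here is a minimal base R sketch of the kind of comparison being made. It assumes two in-memory data frames; `testing_tbl`, `postgres_tbl`, and `col_types()` are hypothetical names used only for illustration and are not part of data.validator.

```r
# Two small stand-ins for the tables being compared (hypothetical data).
testing_tbl  <- data.frame(id = 1:3, name = c("a", "b", "c"), stringsAsFactors = FALSE)
postgres_tbl <- data.frame(id = 4:6, name = c("d", "e", "f"), stringsAsFactors = FALSE)

# Column names match when both tables list the same names in the same order.
same_names <- identical(names(testing_tbl), names(postgres_tbl))

# Column types match when the first class of every column agrees.
# unname() drops the column names so that only the types are compared.
col_types <- function(df) {
  unname(vapply(df, function(col) class(col)[1], character(1)))
}
same_types <- identical(col_types(testing_tbl), col_types(postgres_tbl))
```

Neither check needs a single "main" table, which is why the empty data frame passed to validate() is only a placeholder here.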
Another example from practice
We used data.validator to show the rows returned by queries. The queries were built so that they return only invalid rows and return nothing when there is no invalid data. More documentation on how to hack data.validator for such cases would be nice.
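To make this use case concrete, below is a rough base R sketch of what such a summary could look like. It deliberately does not use data.validator: `summarise_invalid_rows()` is a hypothetical helper, the data are made up, and the `"error"` label and the `0` count for passing checks are assumptions rather than the package's actual behaviour.

```r
# Hypothetical helper, not part of data.validator: turn the rows returned by an
# "invalid rows only" query into a one-row summary.
summarise_invalid_rows <- function(invalid_rows, description) {
  n_invalid <- nrow(invalid_rows)
  data.frame(
    description = description,
    # An empty result means the check passed; "error" is an assumed label.
    type = if (n_invalid == 0) "success" else "error",
    total_violations = n_invalid,
    stringsAsFactors = FALSE
  )
}

# Stand-ins for query results (hypothetical data).
no_invalid  <- data.frame(id = integer(0), price = numeric(0))
two_invalid <- data.frame(id = c(3L, 7L), price = c(-1, -5))

summary_tbl <- rbind(
  summarise_invalid_rows(no_invalid, "No negative prices in orders"),
  summarise_invalid_rows(two_invalid, "No negative prices in archive")
)
print(summary_tbl)
```

Here the returned rows themselves are the violations, so no predicate has to be run over a table; the number of rows alone decides whether the check passes.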
Problem
My use of this package doesn't fit its standard use. I think the package should be more flexible and allow validations based on multiple data frames without specifying them explicitly in the validate call.
Proposed Solution
- Change column names in report object.
- Remove requirement of dataframe in validate()
- Update docs with examples of more advanced and customized use-cases.
Alternatives Considered
Stick to what you have, and state explicitly in the docs that the package is dedicated to working with data frames.