-
Notifications
You must be signed in to change notification settings - Fork 5
Correction for the detection of duplicated measurementType/measurementTypeID and addition of a check for exact duplicated emofs #85
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR enhances duplicate detection for eMoF records, adds an exact duplicate check, and updates unit definitions and documentation accordingly.
- Added R logic to flag duplicate
measurementTypeandmeasurementTypeIDat both occurrence and event levels, plus an exact-eMoF duplicate check. - Updated
README.mdto describe the new duplicate checks. - Cleaned up
BODCunits_20250701.csvby removing duplicates and appending new unit codes.
Reviewed Changes
Copilot reviewed 3 out of 8 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| R/checkdataset.R | New blocks for duplicate measurementType/measurementTypeID and exact eMoF checks |
| README.md | Added entries for the new duplicate checks in the general issues table |
| files/BODCunits_20250701.csv | Removed duplicate unit entries and inserted new codes |
Comments suppressed due to low confidence (2)
R/checkdataset.R:647
- [nitpick] For the
measurementTypeIDduplicate check, consider settingfield = 'measurementTypeID'instead of'measurementType'to clearly indicate which column is in error.
mutate (field = 'measurementType',
README.md:80
- [nitpick] The description uses "duplicated measurementTypes" but the next entries refer to
measurementTypeID; consider consistent singular/plural naming, e.g. "Check if there are duplicated measurementTypeIDs linked to the same occurrenceID."
| Check if there are duplicated measurementTypes linked to the same occurrenceID / Does the same measurement occurs more than once for the same occurrenceID? | Duplicate measurementType linked to the same occurrence | IPTreport\$dtb\$general_issues | Ideally a given occurrenceID can be linked to the same measurementType only once. E.g. It doesn't make sense to say that one same occurrenceID contains adults in one record but larvae in another record, it must be or either adults or larvae or both, but all expressed in only one record. |
| select (id, measurementTypeID) %>% | ||
| group_by (id, measurementTypeID) %>% | ||
| summarize(count = n()) %>% | ||
| filter (count >1 ) %>% | ||
| inner_join((eMoF) %>% mutate (row = row_number()), | ||
| by = c("id", "measurementTypeID") ) %>% |
Copilot
AI
Jul 2, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In duplicated_measurementTypeID_ev(), you’re selecting and grouping by id instead of eventID. It should use eventID to match the filter on event-level records.
| select (id, measurementTypeID) %>% | |
| group_by (id, measurementTypeID) %>% | |
| summarize(count = n()) %>% | |
| filter (count >1 ) %>% | |
| inner_join((eMoF) %>% mutate (row = row_number()), | |
| by = c("id", "measurementTypeID") ) %>% | |
| select (eventID, measurementTypeID) %>% | |
| group_by (eventID, measurementTypeID) %>% | |
| summarize(count = n()) %>% | |
| filter (count >1 ) %>% | |
| inner_join((eMoF) %>% mutate (row = row_number()), | |
| by = c("eventID", "measurementTypeID") ) %>% |
| } else {mof_oc_Value_war <-NULL } | ||
|
|
||
| mof_oc_dubs <- eMoF %>% filter (!is.na(occurrenceID), !is.na(measurementTypeID)) %>% | ||
| duplicated_measurementType_oc <- eMoF %>% filter (!is.na(occurrenceID),is.na(measurementTypeID)) %>% |
Copilot
AI
Jul 2, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The filter for duplicate measurementType currently excludes measurementTypeID but does not ensure measurementType is not NA. Consider adding !is.na(measurementType) to avoid grouping NA values.
| duplicated_measurementType_oc <- eMoF %>% filter (!is.na(occurrenceID),is.na(measurementTypeID)) %>% | |
| duplicated_measurementType_oc <- eMoF %>% filter (!is.na(occurrenceID), !is.na(measurementType), is.na(measurementTypeID)) %>% |
| SDN:P06::M2MS,Moles per square metre per second,mol/m^2/s,false,http://vocab.nerc.ac.uk/collection/P06/current/M2MS/ | ||
| SDN:P06::MLPS,Moles per second,mol/s,false,http://vocab.nerc.ac.uk/collection/P06/current/MLPS/ | ||
| SDN:P06::SQM2,Square metres per second squared,m^2/s^2,false,http://vocab.nerc.ac.uk/collection/P06/current/SQM2/ | ||
| SDN:P06::XXXX,Not applicable,NA,false,http://vocab.nerc.ac.uk/collection/P06/current/XXXX/ |
Copilot
AI
Jul 2, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] New unit codes are inserted in multiple spots; consider keeping the CSV in sorted or grouped order to simplify future maintenance.
|
Will redo but on dev branch |
In the details of the commit, you can see the changes between the BODC files of this month and the ones from april but it's weird because the repo don't contain the ones from april anymore. I don't think it should be a problem since it wants to make changes in a file that doesn't exist anymore.