Standardization and curation of biodiversity records using the Darwin Core (DwC) data standard
This repository documents a data management and curation workflow developed to improve the quality, structure, and interoperability of biodiversity datasets. The workflow focuses on transforming heterogeneous biodiversity records into a standardized structure based on the Darwin Core biodiversity data standard.
The repository provides scripts, documentation, and examples that demonstrate how biodiversity data can be curated, normalized, and prepared for integration into global biodiversity information infrastructures.
- Consolidate a centralized database containing information on species diversity, distribution, and abundance.
- Standardize and format biodiversity datasets according to the core structure and terms of the Darwin Core (DwC) standard.
- Improve data quality, traceability, and reproducibility during the curation process.
- Facilitate data interoperability, reuse, and publication across biodiversity information platforms.
This repository may be useful for:
- Researchers working with biodiversity and ecological datasets
- Environmental professionals and biodiversity monitoring practitioners
- Undergraduate and graduate students in ecology, biology, and environmental sciences
- Biodiversity data managers and curators
- Citizen science initiatives and biodiversity data contributors
This repository applies the Darwin Core biodiversity data standard as the primary framework for structuring biodiversity records.
Darwin Core is designed to document:
- Biological occurrences
- Sampling events
- Taxonomic information
- Spatial and temporal context of observations
Because of this design, some types of ecological data can be directly represented using DwC terms, while others must be documented as metadata or extensions.
The table below summarizes the compatibility of different ecological data types with the Darwin Core standard.
| Data type | DwC compatibility | Description/considerations |
|---|---|---|
| Occurrence and distribution data | ✅ High | Core of DwC, using terms such as occurrenceID, scientificName, eventDate, and locality. |
| Presence/absence data |  | Documented using occurrenceStatus; reliable absence data require a well-defined sampling design. |
| Abundance and individual counts | ✅ High | Represented using individualCount, organismQuantity, and organismQuantityType. |
| Biomass, size, and life stages |  | Life stages (lifeStage) can be recorded; detailed morphometric data may require extensions. |
| Abiotic measurements |  | Documented through the MeasurementOrFact extension; DwC is not primarily an environmental data standard. |
| Biotic measurements |  | Documented using MeasurementOrFact or descriptive fields, with limited support for complex interactions. |
| Sampling methods and study areas | ✅ High (metadata) | Documented using samplingProtocol, eventRemarks, locationID, and dataset-level metadata. |
| Sample processing protocols | ❌ Low | Outside DwC's main scope; best documented in EML metadata or external documentation. |
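As a minimal sketch of how the directly supported rows map onto DwC terms, the snippet below turns one raw field record into an occurrence record. The input keys (`species`, `count`, `date`, `site`), the function name `to_dwc_occurrence`, and the identifier `demo:occ:0001` are hypothetical and not taken from this repository's scripts.

```python
def to_dwc_occurrence(raw: dict, occurrence_id: str) -> dict:
    """Map a hypothetical raw field record to Darwin Core occurrence terms."""
    count = int(raw["count"])
    return {
        "occurrenceID": occurrence_id,
        "scientificName": raw["species"].strip(),
        "eventDate": raw["date"],
        "locality": raw["site"].strip(),
        "individualCount": count,
        # Assumption: a zero count is recorded as an absence. This is only
        # meaningful when the sampling design supports absence claims.
        "occurrenceStatus": "present" if count > 0 else "absent",
    }


record = to_dwc_occurrence(
    {"species": " Mytilus edulis", "count": "12", "date": "2021-03-04", "site": "Station A "},
    occurrence_id="demo:occ:0001",
)
print(record)
```

Whether a zero count should be published as an absence is a per-dataset decision; the rule in the sketch is only one possible convention.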
The workflow implemented in this repository includes several steps aimed at ensuring data consistency, traceability, and interoperability.
Multiple source datasets were integrated into a unified structure while preserving the original information.
Key fields were standardized, including:
- scientific names
- locality names
- taxonomic authorities
- date formats
- sampling metadata
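The field standardization listed above could look roughly like the following sketch. The helper names and the `LOCALITY_ALIASES` table are hypothetical, and the name normalizer assumes the string holds only a genus and epithets, without authorship.

```python
import re

# Hypothetical alias table: spelling variants mapped to one preferred locality name.
LOCALITY_ALIASES = {
    "sta. a": "Station A",
    "station a": "Station A",
}


def normalize_whitespace(value: str) -> str:
    """Collapse runs of whitespace and trim both ends."""
    return re.sub(r"\s+", " ", value).strip()


def normalize_scientific_name(name: str) -> str:
    """Capitalize the genus and lowercase the remaining epithets.

    Assumes the string contains no authorship (e.g. no "Linnaeus, 1758").
    """
    parts = normalize_whitespace(name).split(" ")
    if parts == [""]:
        return ""
    return " ".join([parts[0].capitalize()] + [p.lower() for p in parts[1:]])


def normalize_locality(value: str) -> str:
    """Replace known spelling variants; otherwise return the cleaned value."""
    cleaned = normalize_whitespace(value)
    return LOCALITY_ALIASES.get(cleaned.lower(), cleaned)


print(normalize_scientific_name("  mytilus   EDULIS "))
print(normalize_locality(" STA. A"))
```

Keeping the original value in a separate column next to the normalized one is a simple way to preserve traceability.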
Scientific names were reviewed and standardized using external taxonomic authorities such as the World Register of Marine Species to ensure taxonomic consistency.
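To illustrate the idea of matching names against a reference, the sketch below uses a small local list and fuzzy matching from Python's standard library. It is a stand-in, not the WoRMS service: `REFERENCE_NAMES`, `match_name`, and the 0.85 similarity cutoff are assumptions made for the example.

```python
import difflib

# Hypothetical reference list; in practice this would come from a taxonomic authority.
REFERENCE_NAMES = ["Mytilus edulis", "Carcinus maenas", "Littorina littorea"]


def match_name(name, reference=REFERENCE_NAMES, cutoff=0.85):
    """Return the closest reference name, or None if nothing is similar enough."""
    if name in reference:
        return name
    candidates = difflib.get_close_matches(name, reference, n=1, cutoff=cutoff)
    return candidates[0] if candidates else None


print(match_name("Mytilus edulsi"))  # typo in the epithet
print(match_name("Homo sapiens"))    # not in the reference list
```

Fuzzy matches are suggestions only; each accepted correction should be reviewed and logged so the original spelling remains traceable.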
Records were reorganized following the Event Core and Occurrence Core structure, allowing hierarchical relationships between sampling events and species occurrences.
Occurrence records were expanded when necessary to represent species-event relationships, allowing a normalized structure compatible with biodiversity databases.
Data processing steps were implemented using reproducible scripts to ensure transparency and facilitate future updates of the dataset.
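A minimal sketch of the Event Core and Occurrence Core relationship: events form a hierarchy through parentEventID, and each occurrence points to its event through eventID. The identifiers and the helper `orphan_occurrences` are hypothetical.

```python
# Event table: a cruise with one station nested under it.
events = [
    {"eventID": "cruise1", "parentEventID": "", "eventDate": "2021-03-04"},
    {"eventID": "cruise1:st1", "parentEventID": "cruise1", "eventDate": "2021-03-04"},
]

# Occurrence table: each record references an event by eventID.
occurrences = [
    {"occurrenceID": "cruise1:st1:occ1", "eventID": "cruise1:st1", "scientificName": "Mytilus edulis"},
    {"occurrenceID": "cruise1:st9:occ1", "eventID": "cruise1:st9", "scientificName": "Carcinus maenas"},
]


def orphan_occurrences(events, occurrences):
    """Return occurrenceIDs whose eventID does not exist in the event table."""
    known = {e["eventID"] for e in events}
    return [o["occurrenceID"] for o in occurrences if o["eventID"] not in known]


print(orphan_occurrences(events, occurrences))
```

A check like this is useful before publication, because an occurrence that references a missing event cannot be linked back to its sampling context.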
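The expansion step can be pictured as a wide-to-long reshape: one row per sampling event with a column per species becomes one occurrence per species-event pair. The sketch assumes that input layout and an identifier pattern of the form `<eventID>:occ<n>`; both are illustrative.

```python
def expand_event_row(event_row: dict, species_columns: list) -> list:
    """Turn one wide event row into one occurrence record per recorded species."""
    occurrences = []
    for species in species_columns:
        raw = event_row.get(species, "")
        if raw in ("", None):
            continue  # species not recorded for this event
        occurrences.append({
            "occurrenceID": f"{event_row['eventID']}:occ{len(occurrences) + 1}",
            "eventID": event_row["eventID"],
            "scientificName": species,
            "individualCount": int(raw),
        })
    return occurrences


row = {
    "eventID": "cruise1:st1",
    "Mytilus edulis": "12",
    "Carcinus maenas": "",
    "Littorina littorea": "3",
}
species = ["Mytilus edulis", "Carcinus maenas", "Littorina littorea"]
for occ in expand_event_row(row, species):
    print(occ)
```

Deriving occurrenceID from eventID keeps the hierarchy readable, but it is only stable if the column order and the set of recorded species do not change between runs.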
This repository demonstrates the scope and capabilities of biodiversity data curation and standardization under the Darwin Core framework, including:
- Curation of occurrence records with or without spatial information (geographic coordinates)
- Standardization and validation of scientific names, including correction of typographical errors and taxonomic normalization
- Conversion of heterogeneous date formats to ISO 8601
- Normalization of key fields such as locality identifiers, taxonomic authorities, and individual counts
- Implementation of hierarchical identifiers linking documents, events, and species occurrences
- Documentation of curation decisions to ensure traceability of original records
- Preparation of datasets compatible with publication in biodiversity infrastructures such as the Global Biodiversity Information Facility (GBIF) and the Ocean Biodiversity Information System (OBIS)
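Of the capabilities listed above, date conversion is the easiest to sketch. The example tries a short list of input formats in order and returns an ISO 8601 date. The list `ASSUMED_FORMATS` is an assumption; in particular, it reads slash-separated dates as day/month/year, which must be confirmed for each source dataset.

```python
from datetime import datetime

# Assumed input formats, tried in order. Day-first is an assumption for "/" and "." dates.
ASSUMED_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y", "%Y/%m/%d"]


def to_iso_date(value: str) -> str:
    """Convert a date string in one of the assumed formats to YYYY-MM-DD."""
    text = value.strip()
    for fmt in ASSUMED_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {value!r}")


print(to_iso_date("04/03/2021"))
print(to_iso_date("2021/03/04"))
```

Raising an error for unrecognized values, rather than guessing, leaves ambiguous dates for manual review and keeps the curation decision documented.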
Darwin Core is not a comprehensive ecological data standard, but rather a schema designed to facilitate interoperability of biodiversity occurrence data.
Environmental variables, detailed methodologies, and laboratory protocols should be documented primarily as metadata, often using complementary standards such as:
- EML (Ecological Metadata Language)
- OBIS-ENV extensions
- domain-specific environmental data schemas
Combining these standards allows biodiversity datasets to remain both interoperable and scientifically reproducible.
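As a rough illustration of how an environmental variable can travel alongside occurrence data, the sketch below writes one measurement row using MeasurementOrFact terms. Linking the row through an `eventID` column, the identifier pattern `<eventID>:mof<n>`, and the example values are assumptions for the example, not this repository's actual output.

```python
import csv
import io

FIELDS = ["eventID", "measurementID", "measurementType", "measurementValue", "measurementUnit"]


def make_measurement(event_id: str, measurement_type: str, value, unit: str, n: int) -> dict:
    """Build one measurement row linked to a sampling event (hypothetical ID pattern)."""
    return {
        "eventID": event_id,
        "measurementID": f"{event_id}:mof{n}",
        "measurementType": measurement_type,
        "measurementValue": str(value),
        "measurementUnit": unit,
    }


rows = [make_measurement("cruise1:st1", "sea surface temperature", 14.2, "degrees Celsius", 1)]

# Write the rows as CSV text; a real workflow would write to a file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Details that do not fit these columns, such as instrument calibration or laboratory procedures, would still belong in the dataset metadata.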