
b-cubed-eu/comp-unstructured-data


Compare unstructured data (Flanders case study)

Langeraert, Ward [1,2,3]; Cartuyvels, Emma [1,3]; Van Daele, Toon [1,3]; Research Institute for Nature and Forest (INBO) [4]; European Union's Horizon Europe Research and Innovation Programme (ID No 101059592) [5]

keywords: structured data; data quality; unstructured data; data cubes; biodiversity informatics

Scripts to explore the conditions that determine the reliability of models, trends and status by comparing aggregated cubes with structured monitoring schemes.

This code was developed in the context of Task 4.5 (T4.5) of the B-Cubed project.

Analysis workflow

This repository follows a reproducible, pipeline-based workflow built around {targets}. The analysis proceeds in four clearly separated stages: data acquisition, preparation, pipeline execution, and reporting.

1. Data acquisition (raw → processed)

Raw biodiversity data are downloaded and pre-processed using dedicated R Markdown reports.

What to run

  • prepare_abv_data.Rmd
  • prepare_data_10km.Rmd

Where

  • source/reports/prepare_data/

What happens

  • Downloads the latest available versions of the required datasets (mainly via GBIF).
  • Alternatively, the exact same data versions used in the analyses can be retrieved by following the GBIF download links embedded in the Rmd files.
  • Performs initial cleaning and standardisation.
  • Adds spatial (geometric) information.

Outputs

  • Raw data are stored in: data/raw/
  • Cleaned and enriched datasets are written to: data/processed/ in both .csv and .gpkg formats.

2. Species list preparation (shared input)

Both analysis pipelines rely on a consistent list of ABV bird species.

What to run

  • get_abv_species.R

Where

  • source/R/

What happens

  • Extracts and prepares the list of ABV species.
  • This list is used to filter observations consistently across all pipelines, ensuring comparability between structured and unstructured data sources.
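A minimal sketch of how such a shared species list filters an occurrence table (the species names and counts are invented; the real list is produced by get_abv_species.R):

```r
# Hypothetical ABV species list, standing in for the output of get_abv_species.R
abv_species <- c("Parus major", "Turdus merula", "Fringilla coelebs")

# Hypothetical occurrence records from a structured or unstructured source
occurrences <- data.frame(
  species = c("Parus major", "Pica pica", "Turdus merula", "Columba livia"),
  n_obs   = c(12, 5, 8, 20)
)

# Apply the same filter in every pipeline so the data sources stay comparable
abv_occurrences <- occurrences[occurrences$species %in% abv_species, ]
```

Because every pipeline applies exactly the same filter, differences between the structured and unstructured results cannot be caused by differing species selections.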

3. Analysis pipelines (targets)

All core analyses are implemented as {targets} pipelines, allowing reproducible, incremental, and efficient execution.

What to run

  • run_pipeline.R

Where

  • Inside the folder of the pipeline you want to execute, e.g.: source/pipelines/<pipeline_name>/

What happens

  • Builds and runs the complete dependency graph defined by {targets}.
  • Aggregates data into cubes, fits models, and computes indicators as defined in the pipeline.
  • Intermediate and final results are cached automatically by {targets}.

See https://books.ropensci.org/targets/ for details on how {targets} works and how to inspect or debug pipelines.
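The {targets} mechanics can be illustrated with a self-contained toy pipeline, run entirely in a temporary directory (the two targets below are invented; the real pipelines live in source/pipelines/):

```r
library(targets)

# Run a throwaway two-target pipeline in a temporary directory.
# tar_script() writes a _targets.R file, tar_make() builds the
# dependency graph, and tar_read() retrieves a cached result.
y_value <- tar_dir({
  tar_script({
    list(
      tar_target(x, 1:10),   # toy "data" target
      tar_target(y, sum(x))  # toy "indicator" target depending on x
    )
  })
  tar_make()
  tar_read(y)
})
```

On a repeated tar_make(), up-to-date targets are skipped, which is what makes the incremental execution described above cheap.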

4. Reporting and visualisation

Once a pipeline has been successfully run, results can be summarised and visualised using dedicated reports.

What to run

  • The relevant R Markdown (.Rmd) files

Where

  • source/reports/<analysis_name>/

What happens

  • Reads outputs generated by the corresponding {targets} pipeline.
  • Produces figures, tables, and narrative summaries.
  • Creates output directories automatically if they do not yet exist.
  • A logical order in which to run the reports is:
    1. explorative_analysis
    2. comparing_biodiv_indicators
    3. standardisation
    4. dataset_cv
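The directory handling the reports perform amounts to a call like the following (the output path is illustrative, and tempdir() is used here so the sketch leaves no trace in the project):

```r
# Create the report's output directory if it does not yet exist
out_dir <- file.path(tempdir(), "output", "explorative_analysis")
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
```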

Outputs

  • Stored under: output/<analysis_name>/

Repository structure

The repository is organised to clearly separate data, analysis pipelines, and reporting. All necessary directories are created automatically during execution.

├── source
│   ├── pipelines                  ├ {targets} pipelines (one folder per analysis)
│   │     └── ...
│   ├── R                          ├ shared R helper scripts
│   └── reports                    ├ Rmd reports based on pipeline outputs
│         └── ...
│
├── data
│   ├── raw                        ├ manually created; stores raw downloaded data
│   ├── interim                    ├ automatically created; stores R Markdown cache data
│   └── processed                  ├ automatically created; cleaned & spatialised data
│
├── output                         ├ automatically created; analysis outputs (figures, results)
│     └── ...
│
├── README.md                      ├ project description
├── LICENSE.md                     ├ license
├── CITATION.cff                   ├ citation metadata
├── comp-unstructured-data.Rproj   ├ RStudio project file
│
├── checklist.yml                  ├ checklist package configuration
├── organisation.yml               ├ organisation metadata
│
├── inst
│   └── en_gb.dic                  ├ custom dictionary for checklist
├── .github
│   ├── workflows
│   │   └── checklist_project.yml  ├ GitHub Actions workflow
│   ├── CODE_OF_CONDUCT.md
│   └── CONTRIBUTING.md
└── .gitignore

Footnotes

  1. author

  2. contact person

  3. Research Institute for Nature and Forest (INBO), Herman Teirlinckgebouw, Havenlaan 88, PO Box 73, B-1000 Brussels, Belgium

  4. copyright holder

  5. funder
