The function that tests for geographic outliers, cc_outl(), does not appear to drop records as expected when using the raster approximation (or "thinning").
I'm working with large datasets (over 20,000 records across 100+ species) and would expect the function to use the "raster approximation" method. However, I'm consistently finding that no records are dropped or flagged when calling the function, counter to expectations. Forcing "thinning = F" leads to infeasibly long computation times.
Here is a reproducible example with the function applied to a smaller subset of my data. I'm seeing this issue across multiple methods of cc_outl(), although only method = "distance" is shown below.
Any help in getting this running efficiently would be greatly appreciated.
records.csv
Records for use in the below example.
Data prep and visualisation:
library(sf)
# v1.0-19
library(CoordinateCleaner)
# v3.0.1
records.sf <- read.csv("records.csv") |>
  st_as_sf(coords = c("decimalLongitude", "decimalLatitude"), crs = "EPSG:4326", remove = FALSE)
# Example dataset attached with github submission. Highlights two example species datasets for reprex
# Read in csv and convert to sf object, assigning crs, and not removing original coordinate columns
plot(records.sf$geometry[which(records.sf$scientificName == unique(records.sf$scientificName)[1])])
plot(records.sf$geometry[which(records.sf$scientificName == unique(records.sf$scientificName)[2])])
## Two sets of occurrence records, both with isolated points one would expect to be removed by cc_outl implementations
Example of default implementation working as expected:
## Default implementation of cc_outl ##
# Method applied when record numbers are < 10,000 #
x <- cc_outl(x = as.data.frame(records.sf),
             lon = "decimalLongitude",
             lat = "decimalLatitude",
             species = "scientificName",
             method = "distance", tdi = 80
)
plot(records.sf$geometry[which(records.sf$scientificName == unique(records.sf$scientificName)[1])])
plot(x$geometry[which(x$scientificName == unique(records.sf$scientificName)[1])])
# Example 1 (cleaned output is indexed by its own species column, since rows have been dropped)
plot(records.sf$geometry[which(records.sf$scientificName == unique(records.sf$scientificName)[2])])
plot(x$geometry[which(x$scientificName == unique(records.sf$scientificName)[2])])
# Example 2
## Working as expected based on understanding of "distance" method implementation,
## isolated data points are dropped
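To make the expectation explicit: as I read the documentation, the "distance" method should flag a record when its minimum distance to all other records of the same species is greater than tdi km. Below is a minimal base-R sketch of that rule on made-up coordinates. The helpers haversine_km() and flag_isolated() are hypothetical names written for this illustration; this is not CoordinateCleaner's implementation, only a brute-force check of what I would expect to be flagged.

```r
# Great-circle distance in km between one point and a vector of points
# (haversine formula, assuming a spherical Earth of radius r km).
haversine_km <- function(lon1, lat1, lon2, lat2, r = 6371) {
  to_rad <- pi / 180
  dlon <- (lon2 - lon1) * to_rad
  dlat <- (lat2 - lat1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# TRUE for records whose nearest neighbouring record is more than tdi km away.
# Hypothetical helper: brute force, O(n^2), intended for small checks only.
flag_isolated <- function(lon, lat, tdi = 80) {
  n <- length(lon)
  if (n < 2) return(rep(FALSE, n))
  vapply(seq_len(n), function(i) {
    d <- haversine_km(lon[i], lat[i], lon[-i], lat[-i])
    min(d) > tdi
  }, logical(1))
}

# Toy data (made up for illustration): a tight cluster plus one distant point.
toy <- data.frame(
  decimalLongitude = c(151.20, 151.21, 151.22, 151.23, 145.00),
  decimalLatitude  = c(-33.87, -33.88, -33.86, -33.87, -37.80)
)
toy$isolated <- flag_isolated(toy$decimalLongitude, toy$decimalLatitude, tdi = 80)
print(toy)
```

Applied per species to a small subset, a check like this gives a reference set of isolated records to compare against whatever cc_outl() returns.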
Raster approximation method which is failing to drop points:
## Raster approximation method ##
# Method applied when records are > 10,000 for computational efficiency #
# I have forced with a small dataset to demonstrate issue,
# although my full datasets are in the 20,000 + and over multiple species.
y <- cc_outl(x = as.data.frame(records.sf),
             lon = "decimalLongitude",
             lat = "decimalLatitude",
             species = "scientificName",
             method = "distance", tdi = 80,
             thinning = TRUE,
             # Forced use of thinning method
             thinning_res = 0.01
             # Set raster threshold conservatively. Isolated points should be much farther away than this?
)
# No records dropped
plot(records.sf$geometry[which(records.sf$scientificName == unique(records.sf$scientificName)[1])])
plot(y$geometry[which(y$scientificName == unique(records.sf$scientificName)[1])])
# Example 1
plot(records.sf$geometry[which(records.sf$scientificName == unique(records.sf$scientificName)[2])])
plot(y$geometry[which(y$scientificName == unique(records.sf$scientificName)[2])])
# Example 2
## Outputs are the same, in contrast to original examples, records are not dropped as expected
## Distance method shown, but failing across all other implementations
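Since the plots only show the difference visually, a per-species count of dropped records makes the comparison easier to verify. The sketch below uses a hypothetical helper, dropped_per_species(), and toy stand-in data frames (the species names are made up); it assumes only that the input and the cleaned output share a species column.

```r
# Count records per species before and after cleaning.
# Hypothetical helper for illustration; base R only.
dropped_per_species <- function(before, after, species = "scientificName") {
  lv <- unique(as.character(before[[species]]))
  n_before <- table(factor(as.character(before[[species]]), levels = lv))
  n_after  <- table(factor(as.character(after[[species]]),  levels = lv))
  data.frame(
    species  = lv,
    n_before = as.integer(n_before),
    n_after  = as.integer(n_after),
    dropped  = as.integer(n_before - n_after),
    stringsAsFactors = FALSE
  )
}

# Toy stand-ins for the input and the cleaned output (made-up species names).
before <- data.frame(scientificName = c("sp_a", "sp_a", "sp_a", "sp_b", "sp_b"))
after  <- before[c(1, 2, 4, 5), , drop = FALSE]  # one sp_a record removed

res <- dropped_per_species(before, after)
print(res)
```

With the objects from this example, calling dropped_per_species(as.data.frame(records.sf), x) and dropped_per_species(as.data.frame(records.sf), y) should show non-zero counts for x and, if the problem reproduces, zeros for y.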
Session info:
## Session info ##
# sessionInfo()
# R version 4.3.2 (2023-10-31 ucrt)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows 11 x64 (build 26100)
#
# Matrix products: default
#
#
# locale:
# [1] LC_COLLATE=English_Australia.utf8 LC_CTYPE=English_Australia.utf8 LC_MONETARY=English_Australia.utf8 LC_NUMERIC=C
# [5] LC_TIME=English_Australia.utf8
#
# time zone: Australia/Sydney
# tzcode source: internal
#
# attached base packages:
# [1] stats graphics grDevices utils datasets methods base
#
# other attached packages:
# [1] CoordinateCleaner_3.0.1 sf_1.0-19