cc_outl() Raster approximation method not correctly flagging outliers #110

@amvincent92

The function that tests for geographic outliers, cc_outl(), does not appear to flag or drop records as expected when using the raster approximation (or "thinning").

I'm working with large datasets (over 20,000 records across 100+ species) and would expect the function to use the "raster approximation" method. However, I'm consistently finding that no records are dropped or flagged when calling the function, counter to expectations. Forcing "thinning = FALSE" leads to infeasibly long computation times.

Here is a reproducible example that applies the function to a smaller subset of my data. I'm seeing this issue across multiple methods of cc_outl(), although only method = "distance" is shown below.

Any help in getting this running efficiently would be greatly appreciated.

records.csv
Records for use in the below example.

Data prep and visualisation:

library(sf)
# v1.0-19
library(CoordinateCleaner)
# v3.0.1

records.sf <- read.csv("records.csv") |>
  st_as_sf(coords = c("decimalLongitude", "decimalLatitude"), crs = "EPSG:4326", remove = FALSE)

# Example dataset attached with github submission. Highlights two example species datasets for reprex
# Read in csv and convert to sf object, assigning crs, and not removing original coordinate columns

plot(records.sf$geometry[which(records.sf$scientificName == unique(records.sf$scientificName)[1])])
plot(records.sf$geometry[which(records.sf$scientificName == unique(records.sf$scientificName)[2])])

## Two sets of occurrence records, both with isolated points one would expect to be removed by cc_outl implementations

Example of the default implementation working as expected:

## Default implementation of cc_outl ##
# Method applied when record numbers are < 10,000 #

x <- cc_outl(x = as.data.frame(records.sf),
             lon = "decimalLongitude", 
             lat = "decimalLatitude", 
             species = "scientificName", 
             method = "distance", tdi = 80
)

plot(records.sf$geometry[which(records.sf$scientificName == unique(records.sf$scientificName)[1])])
# Index the cleaned output by its own species column: once rows are dropped,
# positions in x no longer line up with positions in records.sf
plot(x$geometry[which(x$scientificName == unique(records.sf$scientificName)[1])])
# Example 1

plot(records.sf$geometry[which(records.sf$scientificName == unique(records.sf$scientificName)[2])])
plot(x$geometry[which(x$scientificName == unique(records.sf$scientificName)[2])])
# Example 2

## Working as expected based on understanding of "distance" method implementation,
## isolated data points are dropped 

Raster approximation method which is failing to drop points:

## Raster approximation method ##
# Method applied when records are > 10,000 for computational efficiency #

# I have forced with a small dataset to demonstrate issue, 
# although my full datasets are in the 20,000 + and over multiple species.

y <- cc_outl(x = as.data.frame(records.sf),
             lon = "decimalLongitude", 
             lat = "decimalLatitude", 
             species = "scientificName", 
             method = "distance", tdi = 80,
             thinning = TRUE, 
             # Forced use of thinning method
             thinning_res = 0.01
             # Set raster threshold conservatively. Isolated points should be much farther away than this?
)
# No records dropped

plot(records.sf$geometry[which(records.sf$scientificName == unique(records.sf$scientificName)[1])])
# Index the output by its own species column so the subset stays valid if rows are ever dropped
plot(y$geometry[which(y$scientificName == unique(records.sf$scientificName)[1])])
# Example 1

plot(records.sf$geometry[which(records.sf$scientificName == unique(records.sf$scientificName)[2])])
plot(y$geometry[which(y$scientificName == unique(records.sf$scientificName)[2])])
# Example 2

## Output is identical to the input: unlike the default example above, no records are dropped
## Only the "distance" method is shown, but the other methods fail in the same way
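
To compare the two runs with numbers rather than plots, a small per-species count may help. The sketch below is only an illustration: `count_dropped()` is a hypothetical helper written for this report (it is not part of CoordinateCleaner), and the toy data frames stand in for the attached records and a cleaned output, with made-up values.

```r
# Hypothetical helper (not part of CoordinateCleaner): per-species record counts
# before and after cleaning, so dropped records show up as numbers.
count_dropped <- function(original, cleaned, species_col = "scientificName") {
  species <- unique(original[[species_col]])
  # factor() with fixed levels keeps species that lost all their records (count 0)
  n_before <- table(factor(original[[species_col]], levels = species))
  n_after  <- table(factor(cleaned[[species_col]],  levels = species))
  data.frame(
    species   = species,
    n_before  = as.integer(n_before),
    n_after   = as.integer(n_after),
    n_dropped = as.integer(n_before - n_after),
    stringsAsFactors = FALSE
  )
}

# Toy data standing in for the real records (values are made up)
toy_original <- data.frame(
  scientificName   = c("sp_a", "sp_a", "sp_a", "sp_b", "sp_b"),
  decimalLongitude = c(150.1, 150.2, 120.0, 145.0, 145.1),
  decimalLatitude  = c(-33.9, -33.8, -20.0, -37.8, -37.7)
)
# Pretend the isolated sp_a record (row 3) was dropped by cleaning
toy_cleaned <- toy_original[-3, ]

dropped_summary <- count_dropped(toy_original, toy_cleaned)
print(dropped_summary)
```

With the objects from the example above, the same helper could be called as count_dropped(as.data.frame(records.sf), x) and count_dropped(as.data.frame(records.sf), y) to put the default and thinning results side by side.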

Session info:

## Session info ##

# sessionInfo()

# R version 4.3.2 (2023-10-31 ucrt)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows 11 x64 (build 26100)
# 
# Matrix products: default
# 
# 
# locale:
#   [1] LC_COLLATE=English_Australia.utf8  LC_CTYPE=English_Australia.utf8    LC_MONETARY=English_Australia.utf8 LC_NUMERIC=C                      
# [5] LC_TIME=English_Australia.utf8    
# 
# time zone: Australia/Sydney
# tzcode source: internal
# 
# attached base packages:
#   [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
#   [1] CoordinateCleaner_3.0.1 sf_1.0-19    
