This is not a bug report, but rather a question about preventing inadvertent misuse of the data product downstream.
The help file currently states: "Calculate the total reported catch weight by species and haul." I initially interpreted this to be the actual catch in the haul. But if I understand things correctly, the interpretation of the value in CatCatchWgt in the HL-data depends on the DataType in the HH-data, i.e. it is standardized to one hour if DataType is "C", and otherwise is as reported for the haul.
So the question is whether this function should in addition return an explicit cpue value, with the help file making a clear distinction between that value and CatchWgt (and also whether zeros should be returned explicitly, see the side note below).
Below is code that illustrates the difference these two values would make in a typical downstream user analysis:
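library(icesDatras)
library(tidyverse)

# Build a haul identifier from the fields shared by the HH data and the catch table
dr_add_id <- function(d) {
  d |>
    dplyr::mutate(.id = paste(Survey, Year, Quarter, Country,
                              Ship, Gear, StNo, HaulNo, sep = ":"))
}

# Catch weight by species and haul, as returned by getCatchWgt()
cw <-
  icesDatras::getCatchWgt("NS-IBTS", years = 2020:2025, quarters = 1, aphia = 126437) |>
  dr_add_id() |>
  select(.id, Valid_Aphia, CatchWgt)

# Haul information, including DataType and haul duration (minutes)
hh <-
  icesDatras::getDATRAS("HH", "NS-IBTS", 2020:2025, 1) |>
  dr_add_id() |>
  select(.id, DataType, HaulDur, Year)

d <-
  hh |>
  left_join(cw,
            by = join_by(.id)) |>
  as_tibble() |>
  mutate(CatchWgt = replace_na(CatchWgt, 0), # This may not be kosher in some cases
         # DataType "C": CatchWgt is per hour, so scale back to the actual haul duration
         wgt = case_when(DataType == "C" ~ CatchWgt / 60 * HaulDur,
                         .default = CatchWgt),
         # Catch per hour, comparable across DataTypes
         cpue = wgt / HaulDur * 60)

# Number of hauls by year and DataType
d |>
  count(Year, DataType) |>
  ggplot(aes(Year, n, colour = DataType)) +
  geom_point()

# Mean of CatchWgt versus cpue by year (mean_cl_boot needs the Hmisc package)
d |>
  select(.id, Year, CatchWgt, cpue) |>
  pivot_longer(c(CatchWgt, cpue), names_to = "var", values_to = "value") |>
  ggplot(aes(Year, value, colour = var)) +
  stat_summary(fun.data = "mean_cl_boot")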
Side note: In the above, missing CatchWgt values are interpreted as zero. I have, however, come across cases where CatCatchWgt is missing for some CatIdentifier within the same tow. I guess these should return NA as the CatchWgt of that particular haul (or possibly be treated as a QC-flag issue), whereas cases where the species is not reported at all in the HL-data for a given haul should be explicitly set to zero.
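To make the distinction in the side note concrete, here is a minimal sketch on made-up toy data (not real DATRAS records, and not how getCatchWgt is implemented). The haul ids, the weights and the helper column `reported` are all hypothetical; only the names `.id`, `CatIdentifier`, `CatCatchWgt` and `CatchWgt` follow the issue text.

```r
library(dplyr)

# Hypothetical hauls (HH-like) and catch records for one species (HL-like)
hauls <- tibble(.id = c("H1", "H2", "H3"))
catch <- tibble(
  .id           = c("H1", "H1", "H2", "H2"),
  CatIdentifier = c(1, 2, 1, 2),
  CatCatchWgt   = c(100, 50, 80, NA)  # one weight is missing in haul H2
)

# sum() without na.rm = TRUE propagates NA, so a partly reported haul stays NA
per_haul <-
  catch |>
  group_by(.id) |>
  summarise(CatchWgt = sum(CatCatchWgt), reported = TRUE)

# Hauls with no record at all for the species get an explicit zero
out <-
  hauls |>
  left_join(per_haul, by = ".id") |>
  mutate(CatchWgt = if_else(is.na(reported), 0, CatchWgt)) |>
  select(-reported)

out
```

Here H1 gets the summed weight, H2 stays NA because one CatIdentifier lacks a weight, and H3 becomes zero because the species is absent from the haul. A plain `replace_na(CatchWgt, 0)` after the join would not separate the last two cases.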
Just some food for thought.