An R implementation for calculating scaled length compositions and age compositions from fisheries sampling data, supporting sex-based analysis and bootstrap uncertainty estimation.
Supports both commercial fisheries (weight-based scaling) and research surveys (density-based scaling).
Note: This package is under active development with comprehensive testing. Core functionality is stable and well-tested (341+ passing tests), though new features may be added.
- Installation
- Quick Start
- Overview
- Key Concepts
- Requirements
- Available Functions
- Cohort-Based Age Composition Workflow
- Length Measurement Bias Correction
- Creating Complete Age-Length Keys
- Input Data Format
- Usage
- Data Visualisation
- Function Parameters
- Output Structure
- Worked Example
- Advanced Usage
- Notes for Real Applications
- Package Information
- Citation
- Contributing
- Troubleshooting
- R (version 4.0 or higher)
- Base R packages: stats
- Clone or download the repository
- In R, navigate to the package directory and install:
# Install development dependencies (optional)
install.packages(c("roxygen2", "devtools"))
# Build and install the package
devtools::install_github("alistairdunn1/scala", subdir = "scala")
# Or using R CMD (from command line)
R CMD build scala
R CMD INSTALL scala_0.1.tar.gz
For Windows users, use the provided build script:
build.bat
This script will:
- Generate documentation using roxygen2
- Build the package
- Run R CMD check for validation
- Install the package locally
library(scala)
For plotting capabilities, install suggested packages:
install.packages(c("ggplot2", "dplyr", "tidyr", "RColorBrewer", "viridis", "patchwork"))
Here's a minimal example to get you started:
# Load the package
library(scala)
# Generate test data (commercial fisheries example)
test_data <- generate_test_data(data_type = "commercial")
# Get default length-weight parameters
lw_params <- get_default_lw_params()
# Calculate scaled length compositions
results <- calculate_length_compositions(
fish_data = test_data$fish_data,
strata_data = test_data$strata_data,
length_range = c(20, 35),
lw_params_male = lw_params$male,
lw_params_female = lw_params$female,
lw_params_unsexed = lw_params$unsexed,
bootstraps = 100
)
# View results
print(results)
# Generate age-length key for age composition analysis
age_key <- generate_age_length_key(
length_range = c(20, 35),
age_range = c(2, 8),
growth_type = "vonbert"
)
# For incomplete age-length keys, create a complete key first
complete_alk <- create_alk(
age_length_key = age_key,
lengths = 20:35, # All length bins in your data
ages = 2:8, # All age bins you want
verbose = TRUE
)
# Calculate age compositions (auto-detects complete vs incomplete keys)
age_results <- calculate_age_compositions(
x = results,
age_length_key = complete_alk, # Uses complete key
age_range = c(2, 8)
)
# View age composition results
print(age_results)
# Alternative: Cohort-based age composition workflow (for multi-year data)
# 1. Fit cohort model on aged subsample
aged_data <- generate_test_age_data()
aged_data$year <- sample(2018:2022, nrow(aged_data), replace = TRUE)
cohort_model <- fit_cohort_alk(alk_data = aged_data, by_sex = TRUE)
# 2. Assign ages to full dataset
fish_with_ages <- assign_ages_from_cohort(
fish_data = test_data$fish_data,
cohort_model = cohort_model,
method = "mode"
)
# 3. Calculate scaled age compositions
age_results_cohort <- calculate_age_compositions_from_cohort(
fish_data = fish_with_ages,
strata_data = test_data$strata_data,
age_range = c(1, 10),
lw_params_male = lw_params$male,
lw_params_female = lw_params$female,
lw_params_unsexed = lw_params$unsexed,
bootstraps = 100
)
# Create publication-quality plots with lines and uncertainty ribbons
if (requireNamespace("ggplot2", quietly = TRUE)) {
# Length composition plots
plot(results, by_stratum = FALSE, show_CIs = TRUE) # Pooled
plot(results, by_stratum = TRUE, show_CIs = TRUE) # Faceted
# Age composition plots (same interface!)
plot(age_results, by_stratum = FALSE, show_CIs = TRUE) # Pooled
plot(age_results, by_stratum = TRUE, show_CIs = TRUE) # Faceted
}
This tool calculates scaled length compositions and age compositions from fish sampling data by:
- Scaling within samples: Upweighting sampled fish to represent the entire sample catch (weight-based) or area coverage (density-based)
- Scaling within strata: Upweighting samples to represent the entire stratum catch or area
- Scaling across strata: Combining strata to get total population estimates
- Bootstrap resampling: Providing uncertainty estimates through resampling procedures
- Age composition conversion: Converting length compositions to age compositions using complete age-length keys with streamlined workflow and clear warnings for incomplete data
- Cohort-based age assignment: Direct age assignment to individual fish using cohort models for multi-year datasets, followed by scaled age composition calculations
The package supports two scaling approaches:
Weight-based scaling (commercial fisheries):
- Sample scaling: total_catch_weight / observed_sample_weight
- Stratum scaling: stratum_total_catch / sum_of_sample_catches_in_stratum
- Final scaling: Combined effect scales individual fish counts to population estimates
Density-based scaling (research surveys):
- Sample scaling: (catch_density * sample_area) / observed_sample_weight
- Stratum scaling: stratum_area / sum_of_sample_areas_in_stratum
- Final scaling: Combined effect scales individual fish counts to population estimates based on area coverage
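The two scaling chains can be sketched with toy numbers in base R; the values and variable names below are purely illustrative, not package internals:

```r
# Weight-based (commercial): a 50 kg sample drawn from a 500 kg tow,
# in a stratum whose total catch was 2000 kg across 1000 kg of sampled tows.
sample_scaling  <- 500 / 50      # total_catch_weight / observed_sample_weight
stratum_scaling <- 2000 / 1000   # stratum_total_catch / sum of sample catches
fish_at_length  <- 3             # fish counted at one length bin in the sample
scaled_count    <- fish_at_length * sample_scaling * stratum_scaling
# scaled_count estimates the number of fish at that length in the stratum

# Density-based (survey): density 120 kg/km2 over a 0.05 km2 tow,
# with 40 kg actually measured; stratum area 300 km2, sampled area 0.25 km2.
sample_scaling_d  <- (120 * 0.05) / 40  # (catch_density * sample_area) / observed_sample_weight
stratum_scaling_d <- 300 / 0.25         # stratum_area / sum of sample areas
scaled_count_d    <- fish_at_length * sample_scaling_d * stratum_scaling_d
```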
- Resamples samples within each stratum (with replacement)
- Resamples individual fish within each sample
- Recalculates scaled length compositions for each bootstrap iteration
- Propagates uncertainty through age-length key application for age compositions
- Calculates coefficient of variation (CV) from bootstrap distribution
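A minimal base-R sketch of the two-level resampling idea, using a toy set of samples (illustrative only; within the package, resampling is handled by resample_fish_data()):

```r
# Resample samples within a stratum, then fish within each sample,
# and compute a bootstrap CV for the count at a single length bin.
set.seed(1)
samples <- list(s1 = c(24, 25, 25, 26), s2 = c(25, 26, 27), s3 = c(24, 24, 25))
boot_count_at_25 <- replicate(200, {
  picked <- sample(names(samples), length(samples), replace = TRUE)  # resample samples
  fish_lengths <- unlist(lapply(samples[picked], function(f) {
    sample(f, length(f), replace = TRUE)                             # resample fish
  }))
  sum(fish_lengths == 25)
})
cv_25 <- sd(boot_count_at_25) / mean(boot_count_at_25)  # CV from bootstrap distribution
```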
- R (version 4.0 or higher)
- Required packages (per DESCRIPTION): ggplot2, rlang, dplyr, stats, zoo, mgcv
- Suggested packages: tidyr, RColorBrewer, viridis, patchwork, plotly, gridExtra, rmarkdown, testthat
- Memory: Sufficient RAM for bootstrap operations (depends on data size and bootstrap iterations)
- Storage: Minimal disk space required (package size ~1MB)
- Bootstrap iterations: More iterations provide better uncertainty estimates but increase computation time
- Data size: Large datasets with many strata/samples will require more memory and processing time
- Recommended: 300+ bootstrap iterations for final analyses, fewer (50-100) for exploratory work
The package provides the following main functions:
- calculate_length_compositions(): Main function for calculating length compositions with optional bootstrap uncertainty estimation
- calculate_age_compositions(): Convert length compositions to age compositions using age-length keys, with smart detection for complete vs incomplete keys and a warning system for missing length data
- create_alk(): Create complete age-length keys from raw age data with interpolation/extrapolation for missing length-age combinations, supporting both single and sex-specific keys, length binning, and comprehensive sampling requirement analysis
- calculate_multinomial_n(): Calculate multinomial effective sample size from length or age composition proportions and CVs
- get_summary(): Get comprehensive summary statistics including mean weighted CV, fish counts, and haul counts from length composition results
- fit_ordinal_alk(): Fit an ordinal GAM (cumulative logit) age-at-length model (optionally by sex) with optional additional smooth terms, and return a prediction function for age probabilities by length
- fit_cohort_alk(): Fit a cohort-based ordinal GAM model to estimate year classes from length, year, and sex, with optional additional smooth terms and age back-calculation capability for multi-year datasets
- assign_ages_from_cohort(): Assign ages to individual fish observations using predictions from a fitted cohort model (mode, expected, or random assignment); supports predicting ages for years with length data but no age data via predict_missing = TRUE
- calculate_age_compositions_from_cohort(): Calculate scaled age compositions from cohort model-assigned ages using the same scaling methodology as length compositions
- compare_alks(): Compare age-length keys from empirical (create_alk) and model-based (fit_ordinal_alk) methods with summary metrics and optional visualisation
- generate_age_length_key(): Create sample age-length keys with various growth models
- evaluate_sample_sizes(): Evaluate whether sample sizes are adequate for reliable bootstrap uncertainty estimation by checking fish per sample and samples per stratum against recommended thresholds
- filter_small_samples(): Create filtered datasets by excluding specific samples identified by sample evaluation (simplified interface)
- diagnose_length_bias(): Detect preference for certain length values (e.g., multiples of 5) in fish measurement data, with diagnostic plots and statistical tests
- correct_length_bias(): Apply smoothing corrections to reduce the effects of length measurement bias using smooth, jitter, or redistribute methods
- adjust_age_length_key_bias(): Adjust age-length key proportions to account for known length measurement bias patterns
- compare_length_distributions(): Create side-by-side comparison plots of original vs corrected length distributions
- generate_test_data(): Generate sample datasets for testing and examples (wrapper for commercial/survey data)
- generate_commercial_test_data(): Generate commercial fisheries test data (weight-based scaling)
- generate_survey_test_data(): Generate research survey test data (density-based scaling)
- generate_test_age_data(): Generate synthetic age-at-length data for testing age-length key functions
- get_default_lw_params(): Get default length-weight parameters for testing
- get_abundance_lw_params(): Get length-weight parameters that disable weight scaling for abundance-based (count) calculations instead of biomass-based scaling
- plot.length_composition(): Create professional visualisations with lines and uncertainty ribbons (requires ggplot2)
- plot.age_composition(): Visualise age compositions with uncertainty ribbons using an identical interface to length plotting
- plot_alk(): Visualise age-length keys as heatmaps or bubble plots, with optional y-axis rug showing length distribution
- plot_length_composition_comparison(): Create multi-panel comparison plots for multiple length composition results
- plot_sample_size_distribution(): Create stacked histograms showing sample size distributions by strata, with samples categorised by adequacy (non-representative small, representative small, adequate)
resample_fish_data(): Internal function for bootstrap resampling
The plotting system includes:
- Consistent interface: Both length and age composition plotting functions use identical parameter names and functionality
- Line plots with uncertainty ribbons instead of traditional bar charts
- Faceted layouts for multi-stratum, multi-sex visualisation
- Bootstrap uncertainty visualisation as shaded confidence regions
- Flexible binning for both length (length_bin_size) and age (age_bin_size) data aggregation
- Length binning for aggregated visualisation at coarser scales
- Sex category filtering with optional inclusion of unsexed fish
Functionality for calculating multinomial effective sample sizes from both length and age compositions:
- Individual calculations for specific stratum and sex combinations
- Batch processing for all combinations simultaneously
- Robust fitting with outlier detection and removal options
- Quality diagnostics including model fit statistics
- Compatible with both length and age composition data
This is particularly useful for:
- Stock assessment applications requiring effective sample sizes
- Data weighting in integrated models
- Quality control of length composition data
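One way to see the underlying relationship: under idealised multinomial sampling with effective size N, CV(p) = sqrt((1 - p) / (N p)), so each composition bin implies an N that can be summarised robustly. The sketch below uses toy values; calculate_multinomial_n() may use a different fitting procedure:

```r
# Implied multinomial effective sample size per bin, from proportions and CVs.
# Inverting CV(p) = sqrt((1 - p) / (N * p)) gives N = (1 - p) / (p * CV^2).
props <- c(0.10, 0.25, 0.40, 0.25)   # illustrative composition proportions
cvs   <- c(0.30, 0.17, 0.12, 0.17)   # illustrative bootstrap CVs
n_eff_per_bin <- (1 - props) / (props * cvs^2)
n_eff <- median(n_eff_per_bin)       # robust summary across bins
```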
The package provides a workflow for converting length compositions to age compositions:
Smart Age-Length Key Detection:
- Complete keys: calculate_age_compositions() automatically detects complete age-length keys (created by create_alk()) and applies them directly
- Incomplete keys: Traditional age-length keys are detected and users are warned about missing length data instead of silent interpolation
- Warnings: Explicit warnings identify which lengths are missing from incomplete keys
Two-Step Workflow for Incomplete Keys:
1. create_alk(): Creates complete age-length keys with interpolation/extrapolation for missing combinations
2. calculate_age_compositions(): Applies the complete key with full uncertainty propagation
Key Features:
- Exact-match processing: Simplified age-length key application using only exact length matches (no interpolation during application)
- Linear interpolation: Smart interpolation between known length-age combinations in create_alk()
- Tail extrapolation: Automatic extrapolation for lengths beyond the key range, with optional user-specified tail ages
- Sex-specific support: Full support for male, female, and unsexed fish categories
- Bootstrap uncertainty: Uncertainty propagation through all age composition calculations
- Comprehensive visualisation: Plots with uncertainty ribbons and multi-panel layouts
Benefits of the Simplified Approach:
- Clear separation: Data preparation (create_alk()) vs calculation (calculate_age_compositions())
- Explicit warnings: Users know exactly which length data is problematic
- Robust processing: Complete keys ensure all length bins are covered
- Maintainable code: Single-responsibility functions that are easier to debug and extend
This enables:
- Full age composition analysis from length-based sampling with complete uncertainty estimation
- Robust handling of incomplete age-length keys through explicit interpolation step
- Clear user feedback about data completeness and processing decisions
- Stock assessment ready outputs with proper uncertainty quantification
For detailed documentation of any function, use ?function_name in R (e.g., ?calculate_length_compositions).
When raw age-length observations are available, you can fit a smooth ordinal age-at-length model and predict an ALK over desired lengths (and sex):
# Fit model from raw age-length-sex data (one row per aged fish)
ord <- fit_ordinal_alk(alk_data = age_data, by_sex = TRUE)
# Predict per-age probabilities for given lengths and sex
lengths <- 20:70
sex <- rep(c("male", "female"), each = length(lengths))
probs <- ord$predict_function(lengths, sex)
# Combine into an ALK-like data.frame if needed
pred_alk <- cbind(
data.frame(length = rep(lengths, 2), sex = sex),
as.data.frame(probs)
)
Notes:
- Uses mgcv::ocat with cumulative logit; returns a robust prediction function.
- Uses select = TRUE and gamma = 1.4 by default for regularisation and overfitting protection.
- The basis dimension k is chosen adaptively from the data when left at the default (-1): min(10, max(3, floor(n_unique_lengths / 3))).
- Supports additional_terms for extra covariates (e.g., spatial or temporal terms).
- Probabilities are normalised per row and non-negative; ages increase with length on average.
The package provides a complete workflow for calculating scaled age compositions using cohort-based modelling. This approach is particularly powerful for multi-year datasets where cohort tracking provides temporal coherence in age assignments.
The cohort workflow consists of three steps:
1. Fit the cohort model on an aged subsample using fit_cohort_alk()
2. Assign ages to all fish using assign_ages_from_cohort()
3. Calculate scaled age compositions using calculate_age_compositions_from_cohort()
This approach directly assigns ages to individual fish observations and then applies the same scaling methodology used in length composition calculations, providing a seamless integration with existing workflows.
# Your aged subsample needs: age, length, year, sex (optional)
aged_data <- data.frame(
age = c(2, 3, 3, 4, 5),
length = c(25, 30, 32, 38, 42),
year = c(2020, 2020, 2021, 2021, 2022),
sex = c("male", "female", "male", "female", "male")
)
# Fit cohort model
cohort_model <- fit_cohort_alk(
alk_data = aged_data,
by_sex = TRUE, # Fit sex-specific growth patterns
age_offset = 1, # Year class = (Year - Age) - 1
verbose = TRUE
)
# View model summary
print(cohort_model)
Key Parameters:
- by_sex: Whether to fit sex-specific smooth terms (recommended if you have sex data)
- age_offset: Year class offset (default 1, meaning YC = Year - Age - 1)
- k_length, k_year: Basis dimensions for smooth terms (auto-selected by default; see below)
- select: Whether to add an extra penalty allowing smooth terms to be penalised to zero for variable selection (default TRUE)
- gamma: Multiplier for effective degrees of freedom in smoothing parameter selection; values > 1 produce smoother models (default 1.4, following Wood 2006)
- additional_terms: Optional character vector of extra GAM formula terms (e.g., "te(lat, long)", "s(day_of_year, bs = 'cc')"), automatically fitted with by = sex interactions when by_sex = TRUE
- method: Smoothing parameter estimation method (default "REML")
Basis dimension (k) selection:
The k parameter controls the maximum complexity (wiggliness) of each smooth term. When left at the default (-1), both fit_ordinal_alk() and fit_cohort_alk() choose k adaptively based on the data:
- Length terms (k / k_length): min(10, max(3, floor(n_unique_lengths / 3))) -- uses one-third of the unique length values, clamped between 3 and 10.
- Year terms (k_year, cohort model only): min(10, max(3, floor(n_unique_years / 2))) -- uses half of the unique year values, clamped between 3 and 10. For very short time series (<= 3 years), k_year is further restricted to at most n_unique_years - 1.
These defaults provide enough flexibility to capture non-linear relationships while avoiding overfitting in small datasets. Users can override them by passing explicit positive values. The actual k values used are printed when verbose = TRUE.
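The adaptive defaults can be written out as small helper functions (a sketch of the stated formulas, not the package's internal code):

```r
# Adaptive basis-dimension defaults, as described in the text.
k_length_default <- function(n_unique_lengths) {
  min(10, max(3, floor(n_unique_lengths / 3)))
}
k_year_default <- function(n_unique_years) {
  k <- min(10, max(3, floor(n_unique_years / 2)))
  if (n_unique_years <= 3) k <- min(k, n_unique_years - 1)  # short-series guard
  k
}
```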
Model Features:
- Fits cohort ~ s(length) + s(year) with optional sex interactions
- Uses mgcv::ocat with cumulative logit link for stable ordinal modelling
- Uses select = TRUE and gamma = 1.4 by default for regularisation and overfitting protection
- Provides two prediction functions: predict_cohort() and predict_age()
predict_cohort()andpredict_age() - Automatically handles sex-specific growth patterns
- Validates year range and data structure
Important: Do not use predict(cohort_model$model) directly -- it returns the raw cumulative logit linear predictor for cohorts, not ages. Always use cohort_model$predict_age() which converts cohort probabilities to age probabilities.
# Your fish data needs: length, year, sex (if model was fitted with by_sex=TRUE)
# Plus standard columns: stratum, sample_id, male, female, unsexed counts
fish_with_ages <- assign_ages_from_cohort(
fish_data = raw_fish_data,
cohort_model = cohort_model,
method = "mode", # "mode", "expected", or "random"
predict_missing = TRUE, # Also predict for years without age data
verbose = TRUE
)
# Check assigned ages
table(fish_with_ages$age)
Assignment Methods:
- "mode": Most probable age (deterministic, recommended for standard analyses)
- "expected": Expected age value (weighted mean of probability distribution)
- "random": Stochastic sampling from probability distribution (for uncertainty propagation)
Additional Options:
- predict_missing: When TRUE, predict ages for all years in the fish data, including years not present in the cohort model's training data (i.e., years with length observations but no age data); when FALSE (default), fish from non-training years are assigned NA. The cohort model interpolates/extrapolates from surrounding year classes to fill these gaps.
- seed: Set random seed for reproducible sampling (when method = "random")
- keep_probabilities: Retain all age probabilities in output (default FALSE)
# Define length-weight parameters
lw_male <- c(a = 0.01, b = 3.0)
lw_female <- c(a = 0.011, b = 2.95)
lw_unsexed <- c(a = 0.0105, b = 2.98)
# Calculate scaled age compositions with bootstrap uncertainty
age_comps <- calculate_age_compositions_from_cohort(
fish_data = fish_with_ages,
strata_data = strata_data,
age_range = c(1, 10),
lw_params_male = lw_male,
lw_params_female = lw_female,
lw_params_unsexed = lw_unsexed,
bootstraps = 100,
plus_group_age = TRUE,
verbose = TRUE
)
# View results
print(age_comps$pooled_age_composition)
print(age_comps$pooled_age_cv)
Key Features:
- Same scaling methodology as calculate_length_compositions()
- Supports weight-based (commercial) and density-based (survey) scaling
- Hierarchical bootstrap for uncertainty estimation at sample and fish levels
- Returns CVs and 95% confidence intervals
- Compatible with existing plotting and analysis functions
Output Structure:
- age_composition: 3D array (age x sex x stratum)
- age_proportions: Proportions by age, sex, stratum
- pooled_age_composition: Pooled across all strata
- age_cvs: Coefficients of variation from bootstrap
- age_ci_lower, age_ci_upper: Bootstrap confidence intervals
# Complete workflow from start to finish
library(scala)
# Step 1: Fit cohort model on aged subsample
cohort_model <- fit_cohort_alk(
alk_data = aged_fish_subsample,
by_sex = TRUE,
age_offset = 1
)
# Step 2: Assign ages to full dataset
fish_with_ages <- assign_ages_from_cohort(
fish_data = full_fish_data,
cohort_model = cohort_model,
method = "mode",
predict_missing = TRUE # Include years with length data but no age data
)
# Step 3: Calculate scaled age compositions
age_results <- calculate_age_compositions_from_cohort(
fish_data = fish_with_ages,
strata_data = strata_info,
age_range = c(1, 10),
lw_params_male = lw_male,
lw_params_female = lw_female,
lw_params_unsexed = lw_unsexed,
bootstraps = 100,
plus_group_age = TRUE
)
# Analyze results
summary_stats <- data.frame(
Age = age_results$ages,
Composition = age_results$pooled_age_composition[, "total"],
Proportion = age_results$pooled_age_proportions[, "total"],
CV = age_results$pooled_age_cv[, "total"]
)
print(summary_stats)
Advantages:
- Leverages temporal patterns for improved age predictions
- Better extrapolation beyond observed length ranges
- Natural handling of multi-year datasets
- Flexible sex-specific growth modelling
- Captures cohort-specific characteristics
Best suited for:
- Multi-year sampling programs
- Data with clear cohort progression signals
- Need for temporal smoothing
- Sparse age sampling in some years/lengths
Comparison with Traditional ALK:
- Traditional ALK: Single-year focus, simple age-length relationship
- Cohort Model: Multi-year temporal tracking, cohort-based predictions
For complete details, see COHORT_AGE_WORKFLOW.md and run the demo script:
source("Test/demo_cohort_age_compositions.R")
Fisheries length data often exhibit measurement bias, particularly "digit preference", where measurers round to convenient values (typically multiples of 5). This package provides tools to diagnose and correct such biases.
Use diagnose_length_bias() to detect and assess the severity of measurement bias:
# Check for length measurement bias
test_data <- generate_test_data()
bias_check <- diagnose_length_bias(test_data$fish_data, modulus = 5, plot_results = TRUE)
# View results
print(bias_check$bias_severity) # "None", "Mild", "Moderate", or "Strong"
print(bias_check$preference_ratio) # Ratio of preferred vs non-preferred lengths
The function returns:
- preference_ratio: Ratio of mean frequency at preferred vs non-preferred lengths
- bias_severity: Classification ("None", "Mild", "Moderate", "Strong")
- chi_square_test: Statistical test for uniformity
- Diagnostic plots: Histograms and frequency comparisons
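The preference ratio can be illustrated in a few lines of base R on synthetic lengths heaped at multiples of 5 (illustrative only; diagnose_length_bias() adds statistical tests and plots on top of this idea):

```r
# Synthetic length frequencies with heaping at 25 and 30.
lengths <- c(rep(25, 40), rep(30, 38), rep(26, 10), rep(27, 12),
             rep(28, 11), rep(29, 9), rep(31, 10))
freq <- table(lengths)
vals <- as.numeric(names(freq))
preferred <- vals %% 5 == 0                       # multiples of 5
preference_ratio <- mean(freq[preferred]) / mean(freq[!preferred])
# ratios well above 1 indicate heaping at multiples of 5
```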
Apply corrections using correct_length_bias() with three methods:
# Smooth method (recommended) - redistributes fish using rolling mean
corrected_data <- correct_length_bias(test_data$fish_data, method = "smooth")
# Jitter method - adds random noise to break up heaping
corrected_data <- correct_length_bias(test_data$fish_data, method = "jitter")
# Redistribute method - explicitly moves fish from over to under-represented lengths
corrected_data <- correct_length_bias(test_data$fish_data, method = "redistribute")
Visualise the effect of corrections:
# Compare original vs corrected distributions
compare_length_distributions(test_data$fish_data, corrected_data)
When length bias affects your age-length keys, use adjust_age_length_key_bias():
# Create age-length key
alk <- generate_age_length_key(length_range = c(20, 40), age_range = c(1, 8))
# Define bias pattern (multiples of 5 are over-represented by 50%)
bias_pattern <- data.frame(
length = 20:40,
bias_factor = ifelse((20:40) %% 5 == 0, 1.5, 1.0)
)
# Adjust the age-length key
adjusted_alk <- adjust_age_length_key_bias(alk, bias_pattern)
The create_alk() function provides a comprehensive solution for creating complete age-length keys from raw age data, addressing common problems in fisheries analysis where age-length coverage is incomplete or sparse.
- Raw Data Input: Takes raw age-length data (one row per aged fish) rather than pre-processed age-length keys
- Automatic Key Creation: Converts raw data to proportional age-length keys with comprehensive interpolation/extrapolation
- Length Binning: Optional binning of length data (e.g., 2cm or 5cm bins) to consolidate sparse data
- Interpolation Control: Optional linear interpolation between observed length bins
- Extrapolation Control: Optional extrapolation beyond the observed length range
- Sex-Specific Support: Handles male, female, and combined unsexed keys automatically
- Sampling Analysis: Calculates current otolith counts and additional requirements for minimum (3 per length bin) and optimum (10 per length bin) coverage
- User-Specified Tails: Optional specification of age assignments for extreme lengths
- Validation: Functions to undertake input validation and progress reporting
- Normalisation: Ensures age proportions sum to 1.0 within each length bin
The default optimum target of 10 otoliths per length bin follows a simulation study recommending the number of otoliths needed per length bin for reliable age composition estimates (Coggins et al., 2013).
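The counts-to-proportions (normalisation) step at the heart of key creation can be sketched in base R; the data frame and variable names below are illustrative:

```r
# Build a proportional age-length key from raw aged fish, one row per fish.
age_data <- data.frame(
  length = c(22, 22, 22, 25, 25, 26),
  age    = c(2, 2, 3, 3, 3, 4)
)
counts <- table(age_data$length, age_data$age)  # rows = length bins, cols = ages
alk <- prop.table(counts, margin = 1)           # age proportions per length bin
rowSums(alk)                                    # each row sums to 1, as required
```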
The create_alk() function uses sophisticated algorithms to fill missing length-age combinations, applied in priority order:
User-defined length-age pairs are applied first for extreme lengths:
- Applied to lengths at or beyond the observed data range
- Takes precedence over automatic interpolation/extrapolation
- Allows expert knowledge to override automatic methods
For missing lengths that fall between observed data points:
- Method: Distance-weighted linear interpolation
- Calculation:
interpolated_prop = weight_lower x lower_prop + weight_upper x upper_prop - Weights: Based on distance from bracketing lengths
- Assumption: Age composition changes gradually between observed lengths
Example:
- Observed: Length 20 (Age 2: 60%, Age 3: 40%), Length 24 (Age 2: 30%, Age 3: 70%)
- Missing: Length 22 (halfway between)
- Result: Length 22 gets (Age 2: 45%, Age 3: 55%)
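The worked example, computed directly from the stated distance-weighted formula (variable names are illustrative):

```r
# Distance-weighted linear interpolation between bracketing lengths.
lower_prop <- c(age2 = 0.60, age3 = 0.40)  # observed at length 20
upper_prop <- c(age2 = 0.30, age3 = 0.70)  # observed at length 24
target  <- 22
w_upper <- (target - 20) / (24 - 20)       # weight toward the upper length
w_lower <- 1 - w_upper
interpolated_prop <- w_lower * lower_prop + w_upper * upper_prop
# interpolated_prop is c(age2 = 0.45, age3 = 0.55), summing to 1
```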
For missing lengths outside the observed data range:
- Method: Copy age proportions from nearest observed length
- Lower tail: Copy from smallest observed length
- Upper tail: Copy from largest observed length
- Assumption: Age composition remains constant beyond observed range
Example:
- Observed range: Lengths 22-28
- Missing length 20: Copies all proportions from length 22
- Missing length 30: Copies all proportions from length 28
The algorithms are applied in this order for each missing length:
- Check for user-specified tail ages -> if found, use exact specification
- If not found and length is between observed data -> use linear interpolation (if enabled)
- If not found and length is beyond observed range -> use constant extrapolation (if enabled)
- If disabled, the length remains unfilled (user must handle manually)
This hierarchical approach ensures expert knowledge takes precedence while providing automatic methods for routine data gaps.
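A compact sketch of this priority order for a single missing length (illustrative pseudologic; the function and argument names are assumptions, not the package's implementation):

```r
# Decide how a missing length bin would be filled, in priority order.
fill_method <- function(len, tail_spec, obs_range,
                        interpolate = TRUE, extrapolate = TRUE) {
  if (!is.null(tail_spec) && len %in% tail_spec$length) {
    "user_specified_tail"                      # 1. expert specification wins
  } else if (len > min(obs_range) && len < max(obs_range)) {
    if (interpolate) "linear_interpolation" else "unfilled"  # 2. between data
  } else {
    if (extrapolate) "constant_extrapolation" else "unfilled"  # 3. beyond range
  }
}
```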
# Create age data frame (one row per aged fish)
age_data <- data.frame(
age = c(2, 3, 3, 4, 4, 4, 5, 5, 6),
length = c(22, 25, 26, 28, 29, 30, 32, 33, 35),
sex = c("male", "male", "female", "male", "female", "male", "female", "male", "female")
)
# Create complete age-length key from raw data
complete_alk <- create_alk(
age_data = age_data,
lengths = 20:40, # All lengths in your length composition data
ages = 1:8, # All ages you want to include
verbose = TRUE # Show progress and interpolation summary
)
# Use the complete key in age composition calculation
age_results <- calculate_age_compositions(
x = length_composition_results,
age_length_key = complete_alk,
age_range = c(1, 8)
)
# Use length binning and control interpolation/extrapolation
complete_alk <- create_alk(
age_data = age_data,
lengths = 20:40,
ages = 1:8,
length_bin_size = 2, # Use 2cm length bins
interpolate = TRUE, # Allow linear interpolation (default)
extrapolate = FALSE, # Disable extrapolation beyond observed range
min_ages_per_length = 5, # Minimum otoliths per length bin
optimum_ages_per_length = 15, # Optimum otoliths per length bin
verbose = TRUE
)
The function provides detailed analysis of current sampling coverage and additional requirements:
# The summary table shows for each sex category:
# - missing_lengths: Number of length bins requiring interpolation/extrapolation
# - interpolation_rate: Percentage of length bins requiring linear interpolation
# - extrapolation_rate: Percentage requiring extrapolation beyond observed range
# - current_otoliths: Number of aged fish currently available
# - min_additional_required: Additional otoliths needed for minimum coverage (3 per length)
# - optimum_additional_required: Additional otoliths needed for optimum coverage (10 per length)
# Specify ages for extreme lengths that might not interpolate well
complete_alk <- create_alk(
age_data = age_data,
lengths = 15:45,
ages = 1:10,
tail_ages = list(
min_lengths = data.frame(length = c(13, 14), age = c(1, 1)), # Young fish
max_lengths = data.frame(length = c(46, 47), age = c(10, 10)) # Old fish
),
verbose = TRUE
)
The calculate_age_compositions() function automatically detects complete age-length keys created by create_alk() and processes them efficiently:
# Complete workflow with raw age data
complete_key <- create_alk(age_data, lengths = 15:35, ages = 1:8)
age_results <- calculate_age_compositions(lc_results, complete_key, age_range = c(1, 8))
The package supports two types of scaling approaches:
A data frame with the following required columns:
- stratum (character/factor): Sampling stratum identifier
- sample_id (character/factor): Unique sample identifier
- length (numeric): Fish length in cm
- male (numeric): Number of male fish at this length in this sample
- female (numeric): Number of female fish at this length in this sample
- unsexed (numeric): Number of unsexed fish at this length in this sample
- total (numeric, optional): Total fish at this length (auto-calculated if missing)
- sample_weight_kg (numeric): Total weight of this sample in kg
- total_catch_weight_kg (numeric): Total catch weight this sample represents in kg
A data frame with stratum-level information:
- stratum (character/factor): Stratum identifier that matches fish_data
- stratum_total_catch_kg (numeric): Total catch weight for entire stratum in kg
A data frame with the following required columns:
- stratum (character/factor): Sampling stratum identifier
- sample_id (character/factor): Unique sample identifier
- length (numeric): Fish length in cm
- male (numeric): Number of male fish at this length in this sample
- female (numeric): Number of female fish at this length in this sample
- unsexed (numeric): Number of unsexed fish at this length in this sample
- total (numeric, optional): Total fish at this length (auto-calculated if missing)
- sample_area_km2 (numeric): Area sampled in this sample in km^2
- catch_density_kg_km2 (numeric): Catch density observed in this sample in kg/km^2
A data frame with stratum-level information:
- stratum (character/factor): Stratum identifier that matches fish_data
- stratum_area_km2 (numeric): Total area of the stratum in km^2
The function automatically detects whether you're using weight-based or density-based data based on the column names provided.
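A sketch of how such column-name detection might look (illustrative logic only; the package's internal check may differ):

```r
# Classify input data as weight-based or density-based from its columns.
detect_scaling_type <- function(fish_data) {
  if (all(c("sample_weight_kg", "total_catch_weight_kg") %in% names(fish_data))) {
    "weight_based"
  } else if (all(c("sample_area_km2", "catch_density_kg_km2") %in% names(fish_data))) {
    "density_based"
  } else {
    stop("Required scaling columns not found")
  }
}
```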
Length-Weight Parameters Required: All analyses require species and sex-specific length-weight parameters using the allometric relationship: Weight (g) = a x Length(cm)^b
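The allometric relationship can be evaluated directly in base R; the parameter values below are illustrative only, not estimates for any particular species:

```r
# Weight (g) = a * Length(cm)^b, with illustrative parameters only
lw_female <- c(a = 0.0092, b = 3.05)

predict_weight_g <- function(length_cm, lw) {
  unname(lw["a"] * length_cm^lw["b"])
}

predict_weight_g(40, lw_female) # roughly 700 g for a 40 cm fish
```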
# Load test data for commercial fisheries
test_data <- generate_test_data(data_type = "commercial")
# Example length-weight parameters
lw_male <- c(a = 0.0085, b = 3.10) # Males
lw_female <- c(a = 0.0092, b = 3.05) # Females
lw_unsexed <- c(a = 0.0089, b = 3.08) # Unsexed
# Calculate scaled length compositions
result <- calculate_length_compositions(
fish_data = test_data$fish_data,
strata_data = test_data$strata_data,
lw_params_male = lw_male,
lw_params_female = lw_female,
lw_params_unsexed = lw_unsexed,
length_range = c(20, 80),
bootstraps = 300
)
# View results
print(result)

The package includes professional plotting capabilities to visualise both length and age composition results using lines with uncertainty ribbons. Both plotting functions share a consistent interface for ease of use:
# Install ggplot2 if not already installed
if (!requireNamespace("ggplot2", quietly = TRUE)) {
install.packages("ggplot2")
}
# LENGTH COMPOSITIONS
# Plot pooled results across all strata (lines with confidence intervals)
plot(lc_result, by_stratum = FALSE, show_CIs = TRUE)
# Plot results by stratum (faceted layout: rows=strata, columns=sex categories)
plot(lc_result, by_stratum = TRUE, show_CIs = TRUE)
# Plot only a specific stratum (with sex facets)
plot(lc_result, stratum = "North", show_CIs = TRUE)
# AGE COMPOSITIONS (same interface!)
# Plot pooled age compositions across all strata
plot(age_result, by_stratum = FALSE, show_CIs = TRUE)
# Plot age compositions by stratum with faceting
plot(age_result, by_stratum = TRUE, show_CIs = TRUE)
# Plot specific stratum for age compositions
plot(age_result, stratum = "North", show_CIs = TRUE)
# COMMON OPTIONS FOR BOTH FUNCTIONS
# Show proportions instead of absolute compositions
plot(lc_result, by_stratum = FALSE, type = "proportion")
plot(age_result, by_stratum = FALSE, type = "proportion")
# Hide confidence intervals
plot(lc_result, by_stratum = FALSE, show_CIs = FALSE)
plot(age_result, by_stratum = FALSE, show_CIs = FALSE)
# Use partial matching for convenience
plot(lc_result, type = "prop") # equivalent to type = "proportion"
# Aggregate data into larger bins for summarized view
plot(lc_result, length_bin_size = 2) # 2 cm length bins
plot(age_result, age_bin_size = 2) # 2 year age bins
# Combine specific stratum with other options
plot(lc_result, stratum = "South", type = "proportion", length_bin_size = 5)
plot(age_result, stratum = "South", type = "proportion", age_bin_size = 2)
# Include or exclude unsexed fish category
plot(lc_result, unsexed = FALSE) # Default: exclude unsexed category
plot(lc_result, unsexed = TRUE) # Include unsexed category
plot(age_result, unsexed = FALSE) # Same options for age compositions
plot(age_result, unsexed = TRUE)

# By-stratum faceted plot with confidence intervals for length compositions
# Rows represent different strata (e.g., "North", "South")
# Columns represent sex categories (Male, Female, Unsexed, Total)
plot(lc_result,
by_stratum = TRUE,
show_CIs = TRUE,
type = "composition")
# Same layout for age compositions - consistent interface!
plot(age_result,
by_stratum = TRUE,
show_CIs = TRUE,
type = "composition")
# Proportional view for comparing distribution shapes
plot(lc_result,
by_stratum = TRUE,
type = "proportion",
show_CIs = TRUE)
plot(age_result,
by_stratum = TRUE,
type = "proportion",
show_CIs = TRUE)
# Combine binning with other options for simplified views
plot(lc_result,
by_stratum = TRUE,
length_bin_size = 5,
type = "proportion")
plot(age_result,
by_stratum = TRUE,
age_bin_size = 2,
type = "proportion")
# Include unsexed category with other options
plot(lc_result,
by_stratum = TRUE,
unsexed = TRUE,
show_CIs = TRUE)
plot(age_result,
by_stratum = TRUE,
unsexed = TRUE,
show_CIs = TRUE)
# Clean plot without confidence intervals for publication
plot(lc_result,
by_stratum = FALSE,
show_CIs = FALSE,
type = "proportion")
plot(age_result,
by_stratum = FALSE,
show_CIs = FALSE,
type = "proportion")

Both `plot.length_composition()` and `plot.age_composition()` share identical parameters:
- `by_stratum`:
  - `FALSE` - Pooled across strata with separate lines for each sex
  - `TRUE` - Faceted layout (rows = strata, columns = sex categories)
- `stratum`:
  - `NULL` (default) - Use `by_stratum` setting
  - Character string - Plot only the specified stratum (overrides `by_stratum`)
- `type`:
  - `"composition"` - Absolute scaled compositions (number of fish)
  - `"proportion"` - Relative proportions (0-1 scale)
- `show_CIs`:
  - `TRUE` - Display bootstrap confidence interval ribbons (95% CI)
  - `FALSE` - Show only the point estimates (lines without ribbons)
- `length_bin_size` / `age_bin_size`:
  - `NULL` (default) - Use original 1 cm/1 year bins
  - Numeric value (e.g., 2, 5, 10) - Aggregate data into larger bins
- `unsexed`:
  - `FALSE` (default) - Exclude unsexed fish category from plots
  - `TRUE` - Include unsexed fish category alongside male, female, and total
The binning feature allows aggregation of fine-scale data into larger, more manageable bins:
- Flexible bin sizes: Choose any bin size for both length (e.g., 2 cm, 5 cm) and age (e.g., 2 years, 3 years)
- Automatic aggregation: Compositions and confidence intervals are properly summed across bins
- Cleaner visualisation: Reduces noise in data visualisation for broader pattern analysis
- Compatible with all options: Works with both pooled and by-stratum plots, confidence intervals, and proportion scaling
- Consistent interface: Same parameter structure for both length (`length_bin_size`) and age (`age_bin_size`) binning
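The binning arithmetic for the point estimates can be sketched in base R with toy numbers (the plot methods do this internally, including the handling of confidence intervals):

```r
# Aggregate 1 cm counts into 5 cm bins by summing within each bin
lengths  <- 20:29
counts   <- c(5, 8, 12, 15, 11, 9, 7, 4, 3, 1)
bin_size <- 5

bins   <- floor(lengths / bin_size) * bin_size # lower edge of each bin
binned <- tapply(counts, bins, sum)
binned # bin 20: 51 fish, bin 25: 24 fish
```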
The plotting system generates line plots with confidence interval ribbons when bootstrap results are available, providing a clear and professional visualisation of uncertainty in both length and age compositions.
Key Parameter Updates:
- Consistent Interface: Both `plot.length_composition()` and `plot.age_composition()` use identical parameter names and functionality
- `by_stratum`: Logical (TRUE/FALSE) for faceted by-stratum display
- `stratum`: Character string to plot only a specific stratum (overrides `by_stratum`)
- `show_CIs`: Display bootstrap confidence interval ribbons (replaces old `include_uncertainty`)
- `type`: "composition" (default) or "proportion" for y-axis scaling
- Binning parameters: `length_bin_size` and `age_bin_size` for data aggregation
- `unsexed`: Logical to include/exclude unsexed fish category
- Removed deprecated parameters from age plotting: `strata`, `sex_categories`, `include_pooled`, `age_bins`
- Lines: Clean representation of length composition trends
- Confidence interval ribbons: Shaded areas showing 95% bootstrap confidence intervals
- Faceted layout: Professional multi-panel display for comparing strata and sex categories
- Colour customisation: Flexible theming options for publication-quality figures
# Load test data for research surveys
test_data <- generate_test_data(data_type = "survey")
# Example length-weight parameters for different species/populations
lw_male <- c(a = 0.0067, b = 3.15) # Males
lw_female <- c(a = 0.0071, b = 3.12) # Females
lw_unsexed <- c(a = 0.0069, b = 3.14) # Unsexed
# Calculate scaled length compositions with bootstrap uncertainty
result <- calculate_length_compositions(
fish_data = test_data$fish_data,
strata_data = test_data$strata_data,
lw_params_male = lw_male,
lw_params_female = lw_female,
lw_params_unsexed = lw_unsexed,
length_range = c(25, 85),
bootstraps = 300
)
# Create comprehensive visualisations
library(ggplot2)
# 1. Pooled results with confidence interval ribbons
pooled_plot <- plot(result,
by_stratum = FALSE,
show_CIs = TRUE)
# 2. Faceted by-stratum plot showing all sex x stratum combinations
faceted_plot <- plot(result,
by_stratum = TRUE,
show_CIs = TRUE)
# 3. Proportional comparison for pattern analysis
proportion_plot <- plot(result,
by_stratum = TRUE,
type = "proportion")
# 4. 5cm length binning for simplified view
binned_plot <- plot(result,
length_bin_size = 5,
show_CIs = TRUE)
# Display plots
print(pooled_plot)
print(faceted_plot)
print(proportion_plot)
print(binned_plot)

The plot_length_composition_comparison() function allows for side-by-side comparison of multiple length composition results in a multi-panel layout.
plot_length_composition_comparison(
x, # List of length_composition objects
by_stratum = FALSE, # Use pooled (FALSE) or by-stratum (TRUE) data
stratum = NULL, # Specific stratum name for by-stratum plots
type = "composition", # "composition" or "proportion"
length_bin_size = NULL, # Optional length binning (e.g., 2, 5, 10)
unsexed = FALSE, # Include unsexed category
show_CIs = TRUE # Show confidence intervals
)

# Calculate length compositions for different scenarios
commercial_result <- calculate_length_compositions(...)
survey_result <- calculate_length_compositions(...)
# Create comparison plots
results_list <- list(
"Commercial" = commercial_result,
"Survey" = survey_result
)
# Basic comparison (pooled across strata)
comparison_plot <- plot_length_composition_comparison(results_list)
# Comparison including unsexed fish
comparison_with_unsexed <- plot_length_composition_comparison(
results_list,
unsexed = TRUE
)
# Comparison with 5cm length binning
binned_comparison <- plot_length_composition_comparison(
results_list,
length_bin_size = 5,
type = "proportion"
)
# By-stratum comparison for first available stratum
stratum_comparison <- plot_length_composition_comparison(
results_list,
by_stratum = TRUE
)
# Specific stratum comparison
specific_stratum <- plot_length_composition_comparison(
results_list,
by_stratum = TRUE,
stratum = "North"
)

The comparison plots use a multi-panel layout:
- Rows: Each element from the input list (e.g., different scenarios/datasets)
- Columns: Sex categories (Male, Female, Total, and optionally Unsexed)
- Shared scales: Consistent x and y axes across all panels for easy comparison
- Named scenarios: Use named lists for better plot labels
- Length binning: Apply consistent binning across all scenarios
- Sex filtering: Include/exclude unsexed category across all scenarios
- Confidence intervals: Show uncertainty consistently across comparisons
- Flexible data: Mix pooled and by-stratum data as needed
# Advanced comparison example
advanced_comparison <- plot_length_composition_comparison(
list(
"Scenario A" = result_a,
"Scenario B" = result_b,
"Scenario C" = result_c
),
type = "proportion",
length_bin_size = 5,
unsexed = TRUE,
show_CIs = TRUE
)

The plot_alk() function creates visualisations of age-length keys showing the probability distribution of ages for each length.
# Create an age-length key
alk <- create_alk(
age_data = generate_test_age_data(),
lengths = 15:35,
ages = 1:8
)
# Heatmap visualisation (default)
plot_alk(alk, type = "heatmap")
# Bubble plot showing sample sizes
plot_alk(alk, type = "points")
# Add length distribution rug on y-axis
plot_alk(alk, type = "heatmap", rug = TRUE)

- `type = "heatmap"` (default): Colour-coded tiles showing age probabilities at each length, with darker colours indicating higher probabilities
- `type = "points"`: Bubble plot where point size represents sample size (n) or proportion
# Plot with sex-specific facets (default when sex data available)
plot_alk(alk, by_sex = TRUE)
# Combined plot without sex separation
plot_alk(alk, by_sex = FALSE)

- `by_sex`: Logical, whether to facet by sex category (default TRUE)
- `type`: "heatmap" or "points" for different visualisation styles
- `rug`: Logical, add a rug plot on the y-axis showing length distribution (default FALSE)
The get_summary() function provides comprehensive summary statistics from length composition results, including mean weighted coefficient of variation (CV), number of fish sampled by sex category, and total number of hauls sampled.
get_summary(
x, # length_composition object
by_stratum = FALSE, # Use pooled (FALSE) or by-stratum (TRUE) data
type = "composition", # "composition" or "proportion"
sex = c("male", "female", "total"), # Sex categories to include
stratum = NULL, # Specific stratum for by-stratum analysis
length_range = NULL # Length range c(min, max) to include
)

The function returns a list containing:
- `mean_weighted_cv`: Mean weighted coefficient of variation
- `n_fish`: Number of fish sampled by sex category (for pooled data)
- `total_fish`: Total number of fish sampled
- `n_hauls`: Total number of hauls/samples
- `fish_haul_summary`: Detailed breakdown by stratum (for by-stratum data)
# Calculate length compositions with bootstrap uncertainty
result <- calculate_length_compositions(
fish_data = test_data$fish_data,
strata_data = test_data$strata_data,
bootstraps = 300
)
# Basic summary statistics (pooled across strata)
summary_stats <- get_summary(result)
print(summary_stats$mean_weighted_cv) # Overall uncertainty
print(summary_stats$total_fish) # Total fish count
print(summary_stats$n_hauls) # Number of hauls
# Summary by sex category
summary_by_sex <- get_summary(result, sex = c("male", "female", "total"))
print(summary_by_sex$n_fish) # Fish counts by sex
# Summary for proportions instead of absolute compositions
summary_proportions <- get_summary(result, type = "proportion")
# Summary by stratum
summary_by_stratum <- get_summary(result, by_stratum = TRUE)
print(summary_by_stratum$fish_haul_summary) # Detailed stratum breakdown
# Summary for specific length range
summary_length_range <- get_summary(result, length_range = c(30, 50))
# Summary for specific stratum and sex
summary_specific <- get_summary(
result,
by_stratum = TRUE,
stratum = "North",
sex = "total"
)

The weighted CV is calculated as:

CV_weighted = sum(w_l x CV_l) / sum(w_l)

where w_l is the scaled composition (or proportion) at length l and CV_l is the corresponding bootstrap CV, so length classes with more fish contribute more to the summary.
Typical CV ranges:
- CV < 0.20: Low uncertainty, high precision
- CV 0.20-0.40: Moderate uncertainty
- CV > 0.40: High uncertainty, results should be interpreted with caution
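A base-R sketch of this summary, assuming (as the name suggests) that length-specific CVs are averaged with the scaled composition values as weights; all numbers here are invented:

```r
comp <- c(100, 400, 800, 300, 50)       # scaled counts by length class
cvs  <- c(0.45, 0.20, 0.12, 0.25, 0.60) # bootstrap CVs by length class

# Weighted mean: abundant length classes dominate the summary CV
mean_weighted_cv <- sum(comp * cvs) / sum(comp)
round(mean_weighted_cv, 3) # 0.198, i.e. low-to-moderate uncertainty
```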
# Compare uncertainty and sample sizes across different scenarios
commercial_summary <- get_summary(commercial_result)
survey_summary <- get_summary(survey_result)
print("Commercial fisheries:")
print(paste("CV:", round(commercial_summary$mean_weighted_cv, 3)))
print(paste("Fish:", commercial_summary$total_fish))
print(paste("Hauls:", commercial_summary$n_hauls))
print("Research survey:")
print(paste("CV:", round(survey_summary$mean_weighted_cv, 3)))
print(paste("Fish:", survey_summary$total_fish))
print(paste("Hauls:", survey_summary$n_hauls))
# Stratum-specific uncertainty and sample size assessment
stratum_summary <- get_summary(result, by_stratum = TRUE)
for (stratum in names(stratum_summary$fish_haul_summary)) {
cv_val <- stratum_summary$mean_weighted_cv[stratum, "total"]
fish_count <- stratum_summary$fish_haul_summary[[stratum]]$total_fish
haul_count <- stratum_summary$fish_haul_summary[[stratum]]$n_hauls
print(paste(stratum, "- CV:", round(cv_val, 3),
"Fish:", fish_count, "Hauls:", haul_count))
}
# Length-specific uncertainty patterns with sample information
early_summary <- get_summary(result, length_range = c(25, 40))
late_summary <- get_summary(result, length_range = c(40, 60))
print(paste("Early lengths - CV:", round(early_summary$mean_weighted_cv, 3),
"Fish:", early_summary$total_fish))
print(paste("Late lengths - CV:", round(late_summary$mean_weighted_cv, 3),
"Fish:", late_summary$total_fish))

# Load the functions
source('length_composition_calculator.R')
# Define length-weight parameters (same as above)
lw_params_male <- c(a = 0.0085, b = 3.10)
lw_params_female <- c(a = 0.0092, b = 3.05)
lw_params_unsexed <- c(a = 0.0088, b = 3.08)
# Density-based data
survey_results <- calculate_length_compositions(
fish_data = your_survey_fish_data, # Contains sample_area_km2, catch_density_kg_km2
strata_data = your_survey_strata_data, # Contains stratum_area_km2
length_range = c(min_length, max_length),
lw_params_male = lw_params_male,
lw_params_female = lw_params_female,
lw_params_unsexed = lw_params_unsexed,
bootstraps = 300,
plus_group = FALSE,
minus_group = FALSE
)
# View results
print(survey_results)

The plot_sample_size_distribution() function creates stacked histograms showing the distribution of sample sizes by strata, with samples categorised by adequacy for bootstrap analysis. This visualisation helps identify potential issues with bootstrap uncertainty estimation before running analyses.
# Generate test data
test_data <- generate_test_data()
# Evaluate sample sizes for bootstrap adequacy
evaluation <- evaluate_sample_sizes(test_data$fish_data)
# Create sample size distribution plot
plot_sample_size_distribution(evaluation)
# Customized plot with smaller bins and specific axis range
plot_sample_size_distribution(
evaluation,
bin_width = 3, # 3 fish per bin
max_fish = 100, # Show up to 100 fish
show_thresholds = TRUE # Show minimum size threshold lines
)

The plot uses colour coding to show different sample categories:
- Non-representative Small (red): Small samples with poor representativeness (< 20% of total catch)
- Representative Small (amber): Small samples that are highly representative (>= 20% of total catch)
- Adequate (green): Samples meeting minimum fish count thresholds (>= 15 fish by default)
When sample weight data is not available, all small samples are grouped as "Small Samples" since representativeness cannot be assessed.
# Comprehensive sample size workflow
test_data <- generate_test_data()
# Step 1: Evaluate sample adequacy
evaluation <- evaluate_sample_sizes(test_data$fish_data)
# Step 2: Review recommendations
cat("Sample Size Recommendations:\n")
for (rec in evaluation$recommendations) {
cat("-", rec, "\n")
}
# Step 3: Visualise sample distribution
p <- plot_sample_size_distribution(evaluation)
# Step 4: Filter problematic samples if needed
if (nrow(evaluation$non_representative_small_samples) > 0) {
filtered_data <- filter_small_samples(evaluation)
cat("Filtered data contains", nrow(filtered_data), "samples\n")
}
# Step 5: Save visualisation
ggplot2::ggsave("sample_size_distribution.png", p, width = 10, height = 6)

This workflow helps ensure robust bootstrap uncertainty estimation by:
- Identifying samples with inadequate fish counts
- Assessing representativeness of small samples when weight data is available
- Providing visual confirmation of sample size patterns across strata
- Supporting informed decisions about sample filtering
- `fish_data`: Data frame with fish sampling data including columns: stratum, sample_id, length, male, female, unsexed, plus either sample_weight_kg + total_catch_weight_kg (weight-based) or sample_area_km2 + catch_density_kg_km2 (density-based)
- `strata_data`: Data frame with columns: stratum, plus either stratum_total_catch_kg (weight-based) or stratum_area_km2 (density-based)
- `length_range`: Numeric vector of min and max lengths to include (e.g., c(15, 35))
- `lw_params_male`: Named vector with length-weight parameters for males: c(a = 0.01, b = 3.0) (required)
- `lw_params_female`: Named vector with length-weight parameters for females: c(a = 0.01, b = 3.0) (required)
- `lw_params_unsexed`: Named vector with length-weight parameters for unsexed fish: c(a = 0.01, b = 3.0) (required)
- `bootstraps`: Integer, number of bootstrap iterations. Set to 0 for no bootstrapping (default: 300)
- `plus_group`: Logical, combine lengths >= max length into a plus group (default: FALSE)
- `minus_group`: Logical, combine lengths <= min length into a minus group (default: FALSE)
- `return_full_bootstraps`: Logical, whether to return all individual bootstrap results along with summaries (default: FALSE)
- `verbose`: Logical, whether to print progress messages (default: TRUE)
- `x`: A length_composition object from calculate_length_compositions()
- `age_length_key`: Either a single data frame or a named list of sex-specific data frames containing age-length key data
- `age_range`: Numeric vector of min and max ages to include (e.g., c(1, 10))
- `plus_group_age`: Logical, combine ages >= max age into a plus group (default: TRUE)
- `minus_group_age`: Logical, combine ages <= min age into a minus group (default: FALSE)
- `verbose`: Logical, whether to print progress messages (default: TRUE)
- `age_data`: Data frame containing age-length data with required columns: 'age', 'length', and optionally 'sex'
- `lengths`: Vector of all length bins that will be encountered (required)
- `ages`: Vector of all age bins (required)
- `tail_ages`: Named list specifying length-age pairs for lengths at tails (optional)
- `min_ages_per_length`: Minimum number of ages required per length bin (default: 3)
- `optimum_ages_per_length`: Optimum number of ages per length bin (default: 10)
- `length_bin_size`: Numeric value for binning lengths (e.g., 2 for 2 cm bins). If NULL (default), uses raw length values
- `interpolate`: Logical, whether to allow linear interpolation for missing length bins (default: TRUE)
- `extrapolate`: Logical, whether to allow extrapolation for lengths outside the observed range (default: TRUE)
- `verbose`: Logical, whether to print progress messages
- `alk`: Age-length key data frame
- `by_sex`: Logical, whether to plot by sex (default: TRUE)
- `type`: Character, type of plot: "heatmap" (heatmap visualisation) or "points" (point plot) (default: "heatmap")
- `rug`: Logical, whether to add a rug on the y-axis showing length distribution (default: FALSE)
data_type: Character, either "commercial" for weight-based data or "survey" for density-based data
- `x`: A length_composition object from calculate_length_compositions() or an age_composition object from calculate_age_compositions()
- `stratum`: Character, name of stratum to analyse. If NULL (default), uses pooled data across all strata
- `sex`: Character, sex category to analyse: "male", "female", "unsexed", or "total" (default)
- `all`: Logical, whether to calculate for all combinations of strata and sex categories (default: FALSE)
- `sex_categories`: Character vector of sex categories to analyse when all = TRUE (default: c("male", "female", "unsexed", "total"))
- `include_pooled`: Logical, whether to include pooled results when all = TRUE (default: TRUE)
- `remove_outliers`: Numeric, proportion of outliers to remove (0-1, default: 0.05)
- `min_proportion`: Numeric, minimum proportion threshold to include in analysis (default: 0.0001)
- `max_cv`: Numeric, maximum CV threshold to include in analysis (default: 5.0)
- `trace`: Logical, whether to show fitting details (default: FALSE)
- `quiet`: Logical, whether to suppress individual fitting messages when all = TRUE (default: TRUE)
- No parameters: Returns default length-weight parameters for Ross Sea toothfish
- `fish_data`: Data frame with fish observation data including columns: stratum, sample_id, length, male, female, unsexed, total
- Parameters follow the allometric relationship: Weight (g) = a x Length(cm)^b
- Both `a` and `b` must be positive values
- Different parameters for each sex category allow for sexual dimorphism in growth
- Example parameters for common species:
- Snapper: Male c(a = 0.0085, b = 3.10), Female c(a = 0.0092, b = 3.05)
- Haddock: Male c(a = 0.0067, b = 3.15), Female c(a = 0.0071, b = 3.12)
The function returns a scaled_length_composition object containing sex-based arrays:
- `length_composition`: 3D array (length x sex x stratum) of scaled counts
- `proportions`: 3D array (length x sex x stratum) of proportions
- `pooled_length_composition`: matrix (length x sex) of total scaled counts
- `pooled_proportions`: matrix (length x sex) of total proportions
- `lf_cvs`: 3D array of CVs for length compositions by stratum
- `proportions_cvs`: 3D array of CVs for proportions by stratum
- `pooled_lf_cv`: matrix of CVs for pooled length compositions
- `pooled_proportions_cv`: matrix of CVs for pooled proportions
- `lf_bootstraps`: 4D array of full bootstrap results (if requested)
We have length sampling data from a trawl survey with two depth strata (Shallow and Deep). We want to estimate the length distribution of the entire fish population.
# Load the functions
source('length_composition_calculator.R')
# Example fish sampling data
fish_data <- data.frame(
stratum = c(rep("Shallow", 8), rep("Deep", 6)),
sample_id = c("T001", "T001", "T001", "T002", "T002", "T003", "T003", "T003",
"T004", "T004", "T005", "T005", "T006", "T006"),
length = c(22, 24, 26, 23, 25, 21, 23, 25,
28, 30, 29, 31, 27, 29),
count = c(15, 25, 10, 20, 18, 12, 22, 8,
18, 12, 25, 8, 20, 15),
sample_weight_kg = c(rep(8.5, 3), rep(7.2, 2), rep(6.8, 3),
rep(9.1, 2), rep(8.8, 2), rep(7.5, 2)),
total_catch_weight_kg = c(rep(85, 3), rep(120, 2), rep(95, 3),
rep(110, 2), rep(140, 2), rep(90, 2))
)
# Stratum total catches (from fisheries statistics)
strata_data <- data.frame(
stratum = c("Shallow", "Deep"),
stratum_total_catch_kg = c(15000, 12000) # Total estimated catch by stratum
)
# View the input data
print("Fish sampling data:")
print(fish_data)
print("\nStrata totals:")
print(strata_data)

# Calculate with bootstrap uncertainty
results <- calculate_length_compositions(
fish_data = fish_data,
strata_data = strata_data,
length_range = c(20, 32),
bootstraps = 100, # Use 100 for faster example (300+ recommended for real analysis)
plus_group = FALSE,
minus_group = FALSE
)
# Display results
print(results)

Expected output:
Scaled Length Frequency Results
===============================
Length range: 20 to 32
Number of strata: 2
Bootstrap iterations: 100
Pooled Length Frequency:
Length Count Proportion CV_Count CV_Prop
1 20 0.00 0.00 0.0 0.0
2 21 1234.56 0.08 15.2 14.8
3 22 2345.67 0.15 12.3 11.9
4 23 3456.78 0.22 10.5 10.2
5 24 2890.12 0.18 11.8 11.4
...
For example, if we had:
- Sample T001 in Shallow stratum: 15 fish at 22cm
- Sample weight: 8.5 kg, represents 85 kg total catch
- Sample scaling factor: 85/8.5 = 10
- Shallow stratum total: 15,000 kg, samples represent 300 kg
- Stratum scaling factor: 15,000/300 = 50
- Final scaling: 15 fish x 10 x 50 = 7,500 fish at 22cm in shallow stratum
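The same two-stage arithmetic, reproduced in base R:

```r
# Two-stage scaling for sample T001 (numbers from the example above)
n_fish        <- 15           # fish observed at 22 cm
sample_scale  <- 85 / 8.5     # total catch weight / sample weight = 10
stratum_scale <- 15000 / 300  # stratum total / sampled catch = 50

scaled_fish <- n_fish * sample_scale * stratum_scale
scaled_fish # 7500 fish at 22 cm in the Shallow stratum
```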
- CV < 10%: High precision, reliable estimates
- CV 10-20%: Moderate precision, adequate for most purposes
- CV 20-30%: Lower precision, interpret with caution
- CV > 30%: Low precision, may need more sampling
# Get pooled length distribution (sex-based)
pooled_results <- data.frame(
Length = results$lengths,
Male = round(results$pooled_length_composition[, "male"]),
Female = round(results$pooled_length_composition[, "female"]),
Unsexed = round(results$pooled_length_composition[, "unsexed"]),
Total = round(results$pooled_length_composition[, "total"]),
CV_Male = round(results$pooled_lf_cv[, "male"] * 100, 1),
CV_Female = round(results$pooled_lf_cv[, "female"] * 100, 1),
CV_Unsexed = round(results$pooled_lf_cv[, "unsexed"] * 100, 1),
CV_Total = round(results$pooled_lf_cv[, "total"] * 100, 1)
)
print("Population length distribution (sex-based):")
print(pooled_results)
# Get results by stratum (sex-based)
shallow_counts <- results$length_composition[, "total", which(results$strata_names == "Shallow")]
deep_counts <- results$length_composition[, "total", which(results$strata_names == "Deep")]
stratum_comparison <- data.frame(
Length = results$lengths,
Shallow = round(shallow_counts),
Deep = round(deep_counts),
Total = round(shallow_counts + deep_counts)
)
print("\nLength distribution by stratum (total across sexes):")
print(stratum_comparison)

# Check CVs to assess if more sampling is needed
high_cv_lengths <- results$lengths[results$pooled_lf_cv[, "total"] > 0.3]
if (length(high_cv_lengths) > 0) {
cat("Lengths with high uncertainty (CV > 30%):", high_cv_lengths, "\n")
cat("Consider additional sampling for these size classes.\n")
} else {
cat("All length classes have adequate precision (CV <= 30%).\n")
}
# Summary statistics
cat("\nSampling summary:\n")
cat("Total estimated fish:", round(sum(results$pooled_length_composition)), "\n")
cat("Mean CV across lengths:", round(mean(results$pooled_lf_cv) * 100, 1), "%\n")
cat("Length classes with CV > 20%:", sum(results$pooled_lf_cv[, "total"] > 0.2), "of", length(results$lengths), "\n")

# Access full bootstrap results
if (!is.null(results$lf_bootstraps) && !all(is.na(results$lf_bootstraps))) {
# Calculate confidence intervals for each length class
bootstrap_data <- results$lf_bootstraps
# 95% confidence intervals for pooled results
pooled_bootstraps <- apply(bootstrap_data, c(1, 3), sum) # Sum across strata
ci_results <- data.frame(
Length = results$lengths,
Estimate = round(results$pooled_length_composition),
Lower_95 = round(apply(pooled_bootstraps, 1, quantile, 0.025)),
Upper_95 = round(apply(pooled_bootstraps, 1, quantile, 0.975))
)
print("95% Confidence intervals:")
print(ci_results)
}

# Test sensitivity to number of bootstrap iterations
bootstrap_counts <- c(50, 100, 200, 300)
cv_stability <- data.frame()
for (n_boot in bootstrap_counts) {
temp_results <- calculate_length_compositions(
fish_data = fish_data,
strata_data = strata_data,
bootstraps = n_boot
)
cv_stability <- rbind(cv_stability, data.frame(
n_bootstrap = n_boot,
mean_cv = mean(temp_results$pooled_lf_cv, na.rm = TRUE)
))
}
print("Bootstrap stability analysis:")
print(cv_stability)

Bootstrap uncertainty estimation requires adequate sample sizes at both the sample and stratum levels. Use evaluate_sample_sizes() to check whether your data meets the recommended thresholds and assess sample representativeness:
# Evaluate sample sizes with default thresholds
sample_eval <- evaluate_sample_sizes(fish_data)
# Check overall data adequacy
cat("Sample size evaluation:\n")
cat("Total samples:", nrow(sample_eval$sample_sizes), "\n")
cat("Samples with < 15 fish:", nrow(sample_eval$all_small_samples), "\n")
cat("Strata with < 8 samples:", nrow(sample_eval$small_strata), "\n")
# Review representativeness analysis (if weight data available)
if (nrow(sample_eval$representative_small_samples) > 0) {
cat("\nHighly representative small samples (>=20% of catch - RETAIN):\n")
print(sample_eval$representative_small_samples[c("sample_id", "total", "sample_proportion")])
}
if (nrow(sample_eval$non_representative_small_samples) > 0) {
cat("\nProblematic small samples (<20% of catch - consider exclusion):\n")
print(sample_eval$non_representative_small_samples[c("sample_id", "total", "sample_proportion")])
}
# Display problematic samples if any
if (nrow(sample_eval$all_small_samples) > 0) {
cat("\nSamples with insufficient fish (< 15):\n")
print(sample_eval$all_small_samples[order(sample_eval$all_small_samples$total), ])
}
# Display problematic strata if any
if (nrow(sample_eval$small_strata) > 0) {
cat("\nStrata with insufficient samples (< 8):\n")
print(sample_eval$small_strata)
}
# Review specific recommendations
cat("\nRecommendations for handling small samples:\n")
for(rec in sample_eval$recommendations) {
cat("-", rec, "\n")
}
# Check exclusion impact summary
cat("\nImpact of excluding small samples:\n")
print(sample_eval$exclusion_summary)
# Example: Smart filtering that uses specific sample list
filtered_data <- filter_small_samples(
fish_data,
sample_eval$non_representative_small_samples # Only exclude problematic samples
)
cat("\nSmart filtering impact (excludes only problematic samples):\n")
print(filtered_data$filter_summary)
# Alternative: Exclude all small samples if preferred
filtered_data_all <- filter_small_samples(
fish_data,
sample_eval$all_small_samples # Exclude all small samples
)
cat("\nFull filtering impact (excludes all small samples):\n")
print(filtered_data_all$filter_summary)
# Compare bootstrap results with and without small samples
cat("\nSensitivity analysis - comparing results:\n")
# Original data results
results_original <- calculate_length_compositions(
fish_data = fish_data,
strata_data = strata_data,
bootstraps = 100
)
# Filtered data results
results_filtered <- calculate_length_compositions(
fish_data = filtered_data$filtered_fish_data,
strata_data = strata_data,
bootstraps = 100
)
# Compare mean CVs
cv_original <- mean(results_original$pooled_lf_cv, na.rm = TRUE)
cv_filtered <- mean(results_filtered$pooled_lf_cv, na.rm = TRUE)
cat("Mean CV - Original data:", round(cv_original, 3), "\n")
cat("Mean CV - Filtered data:", round(cv_filtered, 3), "\n")
cat("CV improvement:", round((cv_original - cv_filtered) / cv_original * 100, 1), "%\n")
# Use stricter thresholds for critical analyses
strict_eval <- evaluate_sample_sizes(
fish_data,
min_fish_per_sample = 20, # Stricter fish threshold
min_samples_per_stratum = 10 # Stricter sample threshold
)
# Compare sample size distributions
hist(sample_eval$sample_sizes$total,
main = "Distribution of Fish per Sample",
xlab = "Number of Fish",
breaks = 20)
abline(v = 15, col = "red", lty = 2, lwd = 2) # Minimum threshold
text(15, par("usr")[4] * 0.9, "Min threshold", pos = 4, col = "red")

Sample Representativeness Assessment:
When sample weight and total catch weight data are available, the function evaluates whether small samples represent a significant proportion of the total catch:
- Highly Representative (>=20% of total catch): Small samples that are retained despite low fish counts because they provide adequate coverage of the catch
- Poorly Representative (<20% of total catch): Small samples that may be excluded because they provide limited information relative to the total catch
- Smart Filtering: The `filter_small_samples()` function can automatically preserve representative samples while excluding problematic ones
Key Insight: A sample with only 10 fish might be perfectly adequate if it represents 50% of the total catch, while a sample with 12 fish representing only 2% of the catch may be problematic.
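The representativeness rule above reduces to a few lines of arithmetic. This sketch mirrors the description (including the 20% threshold and the two examples from the key insight); the column names are illustrative and this is not the package's internal logic:

```r
# Illustrative check of sample representativeness: a sample's share of the
# total catch, not its fish count, decides whether it is retained
samples <- data.frame(
  sample_id = c("A", "B"),
  n_fish = c(10, 12),
  sample_weight_kg = c(500, 40),
  total_catch_weight_kg = c(1000, 2000)
)
samples$pct_of_catch <- 100 * samples$sample_weight_kg / samples$total_catch_weight_kg
samples$retain <- samples$pct_of_catch >= 20  # assumed 20% threshold
print(samples)
```

Sample A (10 fish, 50% of the catch) is retained, while sample B (12 fish, 2% of the catch) is flagged for exclusion.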
Sample Size Guidelines for Bootstrap Analysis:
- Minimum fish per sample: 15 fish (default) provides stable within-sample resampling
- Minimum samples per stratum: 8 samples (default) provides stable between-sample resampling
- Recommended fish per sample: 20+ fish for critical stock assessment applications
- Recommended samples per stratum: 10+ samples for robust uncertainty estimation

Arguments for GAM-based age-length key fitting:

- `alk_data`: Data frame (or named list of sex-specific frames) with columns `age`, `length`, and optionally `sex` (one row per aged fish)
- `by_sex`: Logical, whether to fit sex-specific smooths (default: TRUE)
- `k`: Basis dimension for length smooth terms; when -1 (default), chosen adaptively as `min(10, max(3, floor(n_unique_lengths / 3)))`
- `select`: Whether to add an extra penalty allowing smooth terms to be penalized to zero (default: TRUE)
- `gamma`: Multiplier for effective degrees of freedom; values > 1 produce smoother models (default: 1.4)
- `method`: Smoothing parameter estimation method for mgcv (default: "REML")
- `weights`: Optional observation weights (default: NULL)
- `verbose`: Logical, print model details (default: TRUE)

Note: `additional_terms` can also be passed as a character vector of extra GAM formula terms (e.g., `"te(lat, long)"`). When `by_sex = TRUE`, terms without a `by =` specification are automatically fitted with `by = sex` interactions.

Returns a list with:

- `model`: fitted `mgcv::gam` object (ocat family)
- `predict_function(lengths, sex)`: function returning a matrix of per-age probabilities
- `summary`: key model metrics
- `age_levels`, `sex_levels`, `by_sex`
- `additional_terms`: the terms passed to the model
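The ocat (ordered categorical) GAM approach described above can be illustrated directly with mgcv on simulated data. This sketch shows the mechanics only, not the package's internal code; the simulated growth pattern is arbitrary:

```r
library(mgcv)
set.seed(1)

# Simulated aged fish: age class (1-5) increases with length
n <- 300
length_cm <- runif(n, 20, 40)
age <- findInterval(length_cm + rnorm(n, 0, 2), c(24, 28, 32, 36)) + 1
dat <- data.frame(age = age, length_cm = length_cm)

# Ordered-categorical GAM: a smooth of length predicts the age class
fit <- gam(age ~ s(length_cm, k = 5), family = ocat(R = 5),
           method = "REML", data = dat)

# Per-age probabilities for new lengths; each row sums to 1
p <- predict(fit, newdata = data.frame(length_cm = c(25, 35)), type = "response")
```

This is the same prediction shape returned by `predict_function(lengths, sex)`: one row per requested length, one column per age class.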
Consequences of Insufficient Sample Sizes:
- Small samples (< 15 fish): Unreliable within-sample bootstrap distributions
- Few strata samples (< 8): Unstable between-sample bootstrap distributions
- Both issues: Biased variance estimates and unreliable confidence intervals
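A small simulation illustrates the within-sample instability described above; the sample sizes and distribution are illustrative only:

```r
set.seed(42)

# Bootstrap standard error of mean length for a small vs an adequate sample
boot_se <- function(x, reps = 1000) {
  sd(replicate(reps, mean(sample(x, replace = TRUE))))
}

small_sample <- rnorm(5, mean = 30, sd = 5)   # well below the 15-fish minimum
large_sample <- rnorm(50, mean = 30, sd = 5)  # comfortably above it

se_small <- boot_se(small_sample)
se_large <- boot_se(large_sample)
c(small = se_small, large = se_large)  # small sample gives a much larger, less stable SE
```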
Data Quality Recommendations:
- Run `evaluate_sample_sizes()` before any bootstrap analysis
- Review specific recommendations provided by the evaluation function
- Consider filtering only extremely small samples (< 5 fish) that represent minimal data loss
- Always perform sensitivity analysis comparing results with and without small samples
- Prefer combining adjacent strata over excluding samples when sample sizes are inadequate
- Use fewer bootstrap iterations (50-100) for exploratory analysis with marginal sample sizes
- Increase to 300+ bootstrap iterations only when sample sizes are adequate
Decision Framework for Small Sample Handling:
- Assess impact: Check what percentage of data would be lost
- Review coverage: Ensure filtering doesn't create spatial/temporal gaps
- Test sensitivity: Compare results with and without filtering
- Document decisions: Record rationale for inclusion/exclusion choices
- Consider alternatives: Stratum combination, weighted analysis, or conditional bootstrapping
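The first step of the framework ("assess impact") can be checked in a few lines; the sample counts and threshold here are illustrative:

```r
# What fraction of samples, and of fish, would a 15-fish threshold remove?
fish_per_sample <- c(3, 7, 12, 18, 25, 40, 9, 30)
threshold <- 15

dropped <- fish_per_sample < threshold
pct_samples_lost <- 100 * mean(dropped)
pct_fish_lost <- 100 * sum(fish_per_sample[dropped]) / sum(fish_per_sample)

c(samples_lost = pct_samples_lost, fish_lost = round(pct_fish_lost, 1))
```

Here half the samples would be excluded but only about a fifth of the fish, which is the kind of trade-off the framework asks you to document.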
### Multinomial Effective Sample Size Analysis
```r
# Calculate effective sample sizes for all combinations
all_n <- calculate_all_multinomial_n(results)
print(all_n)
# Calculate for specific stratum and sex
eff_n_male <- calculate_multinomial_n(results, stratum = "North", sex = "male")
cat("Effective sample size for North stratum males:", eff_n_male$effective_n, "\n")
# Calculate for pooled data (recommended for stock assessment)
eff_n_total <- calculate_multinomial_n(results, sex = "total")
cat("Pooled effective sample size:", eff_n_total$effective_n, "\n")
cat("Based on", eff_n_total$n_bins, "length bins\n")
cat("Model fit quality (residual SE):", round(eff_n_total$fit_summary$sigma, 4), "\n")
# Use stricter filtering for robust estimates
robust_n <- calculate_all_multinomial_n(
results,
min_proportion = 0.001, # Exclude very small proportions
max_cv = 3.0, # Exclude high-CV length bins
remove_outliers = 0.1 # Remove 10% of worst-fitting points
)
print("Robust effective sample sizes:")
print(robust_n)
# Create age-length key using von Bertalanffy
# Note: use this function only for testing and simulation purposes. It should
# not be used to generate an age-length key that represents biological reality.
age_key <- generate_age_length_key(
length_range = c(20, 40),
age_range = c(1, 8),
growth_type = "vonbert",
growth_params = list(linf = 50, k = 0.3, t0 = -0.5),
cv = 0.25
)
# Visualise the age-length key
plot_alk(age_key)
# Convert length compositions to age compositions using complete age-length key
age_results <- calculate_age_compositions(
x = results,
age_length_key = age_key,
age_range = c(1, 8)
)
# View age composition results
print(age_results)
# Plot age compositions with consistent interface (same as length compositions!)
plot(age_results, by_stratum = FALSE, show_CIs = TRUE) # Pooled across strata
plot(age_results, by_stratum = TRUE, show_CIs = TRUE) # Faceted by stratum
plot(age_results, stratum = "North", show_CIs = TRUE) # Specific stratum only
# Plot age proportions instead of absolute compositions
plot(age_results, type = "proportion", show_CIs = TRUE)
# Use age binning to aggregate into larger age groups
plot(age_results, age_bin_size = 2, show_CIs = TRUE) # 2-year age bins
# Include unsexed fish in age composition plots
plot(age_results, unsexed = TRUE, show_CIs = TRUE)
# Calculate effective sample sizes for age compositions
age_eff_n <- calculate_multinomial_n(age_results, sex = "total")
cat("Age composition effective sample size:", age_eff_n$effective_n, "\n")
# Compare length vs age effective sample sizes
length_eff_n <- calculate_multinomial_n(results, sex = "total")
cat("\nComparison of effective sample sizes:\n")
cat("Length compositions:", length_eff_n$effective_n, "based on", length_eff_n$n_bins, "length bins\n")
cat("Age compositions:", age_eff_n$effective_n, "based on", age_eff_n$n_bins, "age bins\n")
# Calculate effective sample sizes for all combinations (both length and age)
all_length_n <- calculate_all_multinomial_n(results)
all_age_n <- calculate_all_multinomial_n(age_results)
cat("\nEffective sample sizes by stratum and sex:\n")
print("Length compositions:")
print(all_length_n)
print("Age compositions:")
print(all_age_n)
```

In real-world applications, age-length keys often don't cover the full length range of the composition data. The `create_alk()` function provides a streamlined solution:
# Create a limited age-length key (30-35 cm range)
limited_age_key <- generate_age_length_key(
length_range = c(30, 35), # Limited range
age_range = c(3, 6)
)
# Length compositions might span 20-40 cm, creating gaps
print("Length composition range vs age-length key coverage:")
cat("Length compositions: 20-40 cm\n")
cat("Age-length key: 30-35 cm\n")
cat("Gaps: 20-29 cm and 36-40 cm need interpolation\n")
# Step 1: Create complete age-length key from raw age data with interpolation
complete_age_key <- create_alk(
age_data = age_sample_data, # Raw age data (one row per aged fish)
lengths = 20:40, # Full length range in composition data
ages = 1:8, # Full age range desired
verbose = TRUE # Show interpolation details
)
# Step 2: Use complete key for age composition calculation
age_results_complete <- calculate_age_compositions(
x = results,
age_length_key = complete_age_key, # Complete key automatically detected
age_range = c(1, 8)
)
# Alternative: Specify tail ages for better biological realism
tail_ages <- list(
min_lengths = data.frame(
length = c(20, 22, 25), # Specific lengths below key range
age = c(1, 1, 2) # Corresponding ages
),
max_lengths = data.frame(
length = c(38, 40), # Specific lengths above key range
age = c(7, 8) # Corresponding ages
)
)
complete_age_key_tails <- create_alk(
age_data = age_sample_data,
lengths = 20:40,
ages = 1:8,
tail_ages = tail_ages,
verbose = TRUE
)
age_results_tails <- calculate_age_compositions(
x = results,
age_length_key = complete_age_key_tails,
age_range = c(1, 8)
)
# For comparison: Using incomplete key directly (will show warnings)
cat("\nTrying incomplete key (will show warnings):\n")
age_results_incomplete <- calculate_age_compositions(
x = results,
age_length_key = limited_age_key, # Incomplete key
age_range = c(1, 8)
)
cat("\nComparison of approaches:\n")
cat("Complete key (create_alk): 100% of fish allocated to ages\n")
cat("Incomplete key: Only fish with exact length matches allocated\n")
cat("Missing lengths generate explicit warnings\n")

Weight-based scaling is appropriate when:
- Working with commercial fisheries data
- You have total catch weights at sample and stratum levels
- Scaling from samples to represent total commercial catch
- Working with observer data or port sampling
Density-based scaling is appropriate when:
- Working with research survey data
- You have swept area or sampling area information
- You know catch density (biomass per unit area)
- You want to estimate population abundance over specific areas
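Both approaches reduce to simple expansion factors applied to the sampled length-frequency counts. This sketch uses illustrative numbers and is not the package's internal code:

```r
# Fish counts per length bin from one sample
sample_counts <- c(`30` = 12, `32` = 20, `34` = 8)

# Weight-based (commercial): expand by total catch weight / sample weight
sample_weight_kg <- 50
total_catch_weight_kg <- 5000
scaled_commercial <- sample_counts * total_catch_weight_kg / sample_weight_kg

# Density-based (survey): expand by stratum area / sampled (swept) area
sample_area_km2 <- 0.05
stratum_area_km2 <- 800
scaled_survey <- sample_counts / sample_area_km2 * stratum_area_km2
```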
- Length-Weight Relationships: The examples use placeholder weight calculations. In practice, incorporate species-specific length-weight relationships.
- Sample Size Considerations: Ensure adequate sample sizes within each stratum. Small samples can lead to unstable bootstrap estimates.
- Stratification: Choose strata that minimize within-stratum variability while maintaining adequate sample sizes.
- Bootstrap Iterations: Use 300+ iterations for final analyses. The examples use fewer for speed.
- Data Quality: Validate input data for consistency in units, completeness, and biological plausibility.
- Survey Design: For density-based scaling, ensure that sampling areas are representative of the strata and that density estimates are reliable.
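For the length-weight point above, the standard allometric form W = a · L^b can replace placeholder weights; the parameter values here are illustrative, not package defaults:

```r
# Allometric length-weight relationship: weight (g) = a * length (cm)^b
lw <- list(a = 0.01, b = 3.0)  # illustrative species-specific parameters
length_cm <- c(25, 30, 35)
weight_g <- lw$a * length_cm^lw$b
round(weight_g, 1)
```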
- Version: 2025-10 (automatically updated from git commit date)
- Author: Alistair Dunn
- Maintainer: Alistair Dunn (alistair.dunn@OceanEnvironmental.co.nz)
- License: GPL (>= 3)
- Encoding: UTF-8
The package uses a year-month (YYYY-MM) versioning scheme:
- Version number is automatically generated from the git commit date
- Format: `YYYY-MM` (e.g., `2025-10` for October 2025)
- Package date is set to the exact commit date (`YYYY-MM-DD`)
- This ensures version numbers are meaningful and chronologically ordered
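The versioning rule above can be reproduced with base R date formatting; this is an assumed sketch of the build script's logic, not its actual code:

```r
# Derive the YYYY-MM version and the full package date from a commit date
commit_date <- as.Date("2025-10-07")          # illustrative commit date
version <- format(commit_date, "%Y-%m")       # package Version field
pkg_date <- format(commit_date, "%Y-%m-%d")   # package Date field
```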
Coggins, L.G.; Gwinn, D.C.; Allen, M.S. (2013). Evaluation of Age-Length Key Sample Sizes Required to Estimate Fish Total Mortality and Growth. Transactions of the American Fisheries Society, 142(3), 832-840. https://doi.org/10.1080/00028487.2013.768550
To cite the scala package in publications, please use:
# Get citation information
citation("scala")

Suggested citation:
Dunn, A. (2025). scala: Scaled catch at length and age composition analyses. R package version 2025-10. https://github.com/alistairdunn1/scala
This package is under active development. For bug reports, feature requests, or contributions:
- Check existing issues on the GitHub repository
- Submit bug reports with reproducible examples
- Suggest improvements or additional features
- Submit pull requests with proposed changes
The package includes development tools for maintainers:
# Generate documentation
roxygen2::roxygenise()
# Run tests
devtools::test()
# Check package
devtools::check()
# Build package
devtools::build()

Use `build.bat` for automated building on Windows systems:

build.bat

This script automatically:
- Updates version and date from git commit information
- Generates documentation with roxygen2
- Builds the package with updated version number
- Installs the package locally
- Runs R CMD check for validation
Manual Version Update: To update version without full build:
update-version.bat

View Current Version:
# In R console
source('show-version.R')

The build process ensures the package version always reflects the actual development state from git.
Error: "Package not found"
- Ensure you have built and installed the package correctly
- Try `devtools::install(".")` from the package directory
Error: "Length-weight parameters missing"
- All three sex-specific length-weight parameters are required
- Use `get_default_lw_params()` for testing purposes
Error: "Column not found in data"
- Check that your data has the required column names
- For weight-based: `sample_weight_kg`, `total_catch_weight_kg`, `stratum_total_catch_kg`
- For density-based: `sample_area_km2`, `catch_density_kg_km2`, `stratum_area_km2`
High CVs (> 30%)
- Increase sample sizes within strata
- Consider increasing bootstrap iterations (300+ recommended)
- Review stratification scheme
Plotting errors
- Install ggplot2: `install.packages("ggplot2")`
- Check that the results object is valid
- Use `?function_name` for function documentation
- Check function examples with `example(function_name)`
- Review the worked examples in this README
- Submit issues on the GitHub repository for bugs or questions