This repository was archived by the owner on Mar 15, 2026. It is now read-only.

burning-cost/insurance-copula

insurance-copula

Copula models for insurance pricing — D-vine temporal dependence, two-part occurrence/severity.

Merged from: insurance-vine-longitudinal (D-vine copula for panel data).

The problem

A policyholder who claimed last year is more likely to claim again next year. This is not just adverse selection — it is genuine claim persistence. Standard GLM pricing captures risk factors (age, vehicle type, region) but ignores temporal dependence in residuals. NCD scales encode a binary rule: claimed or didn't. Neither approach gives you a principled conditional distribution.

This library implements the Yang & Czado (2022) two-part D-vine copula for longitudinal insurance claims. You observe a policyholder over T years. The model learns the full joint distribution of claim occurrence and severity across those years, then conditions on observed history to give the next-year claim distribution.

What it does

  1. Fits a logistic GLM for claim occurrence and a gamma/log-normal GLM for severity. These strip out systematic risk factors.
  2. Applies the probability integral transform (PIT) to the residuals — what the GLM cannot explain.
  3. Fits a stationary D-vine copula on the occurrence PIT residuals. The vine structure is temporal: tree level k captures lag-k dependence.
  4. Does the same for severity PIT residuals.
  5. Uses h-function recursion to compute the conditional distribution of next year's claim given observed history.
  6. Returns: conditional claim probability, conditional severity quantiles, experience-rated premium, relativity factors.
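
Steps 2 and 5 can be sketched for the severity part in a few lines. This is a minimal illustration, not the library's implementation: it assumes a gamma severity with a hypothetical shape parameter, a Gaussian pair copula with a hypothetical correlation `rho`, and truncation at p=1, so the conditional distribution depends only on last year's PIT value. The helper names `pit_gamma`, `h_gaussian` and `h_gaussian_inverse` are invented for this example. Occurrence residuals are discrete and need extra care that is omitted here.

```python
import numpy as np
from scipy.stats import gamma, norm


def pit_gamma(y, mean, shape):
    """PIT of an observed severity y under a gamma model with given mean and shape."""
    # A gamma distribution with shape k and mean mu has scale mu / k.
    return gamma.cdf(y, a=shape, scale=mean / shape)


def h_gaussian(u, v, rho):
    """h-function of a Gaussian pair copula: P(U <= u | V = v)."""
    z_u, z_v = norm.ppf(u), norm.ppf(v)
    return norm.cdf((z_u - rho * z_v) / np.sqrt(1.0 - rho**2))


def h_gaussian_inverse(q, v, rho):
    """Conditional quantile: the u for which h_gaussian(u, v, rho) equals q."""
    z_q, z_v = norm.ppf(q), norm.ppf(v)
    return norm.cdf(z_q * np.sqrt(1.0 - rho**2) + rho * z_v)


# Hypothetical inputs: last year's claim was 3,000 against a GLM mean of 2,000.
shape, rho = 2.0, 0.4
v_last = pit_gamma(3000.0, mean=2000.0, shape=shape)

# Conditional median of next year's PIT value, mapped back to the severity
# scale with next year's GLM mean (assumed to be 2,100 here).
u_median = h_gaussian_inverse(0.5, v_last, rho)
severity_median = gamma.ppf(u_median, a=shape, scale=2100.0 / shape)

# The unconditional median ignores history entirely.
unconditional_median = gamma.ppf(0.5, a=shape, scale=2100.0 / shape)
```

With `rho = 0` the h-function returns `u` unchanged, which corresponds to no persistence after the GLM. At higher truncation levels the same h-functions are applied recursively, tree by tree.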

What it is not

This is not a neural/sequence model. It does not replace your GLM. It operates on the GLM residuals and quantifies how much temporal persistence remains after controlling for risk factors. The statistical structure is transparent and auditable — relevant for Consumer Duty documentation.

Installation

pip install insurance-copula

Quick start

import pandas as pd
from insurance_copula.vine import PanelDataset, TwoPartDVine

# Your panel data: one row per (policyholder, year)
df = pd.read_parquet("motor_panel.parquet")

# Build the panel object (validates, handles unbalanced panels)
panel = PanelDataset.from_dataframe(
    df,
    id_col="policy_id",
    year_col="year",
    claim_col="has_claim",
    severity_col="claim_amount",
    covariate_cols=["age", "vehicle_group", "region"],
)

# Fit the two-part D-vine
model = TwoPartDVine(severity_family="gamma", max_truncation=2)
model.fit(panel)

print(model)
# TwoPartDVine(fitted, t_dim=4, occurrence_p=1, severity_p=2)

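# Predict next-year claim probability given history
# (history_df holds the observed history of the policies to score; it is not defined in this snippet)
proba = model.predict_proba(history_df)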
# policy_id
# POL00001    0.142
# POL00002    0.089
# POL00003    0.247
# Name: claim_proba, dtype: float64

# Conditional severity quantiles
quantiles = model.predict_severity_quantile(history_df, quantiles=[0.5, 0.95])

# Experience-rated premium
premium = model.predict_premium(history_df, loading=0.15)

# Experience relativity = copula premium / a priori GLM premium
relativity = model.experience_relativity(history_df)
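
The experience relativity is a ratio of two premiums. A rough arithmetic sketch follows, assuming (the source does not confirm this) that the premium is claim probability times mean severity, scaled by one plus the loading. `loaded_premium` is a hypothetical helper and all numbers are invented.

```python
def loaded_premium(claim_proba, mean_severity, loading=0.15):
    """Expected claim cost with a proportional loading (assumed form)."""
    return claim_proba * mean_severity * (1.0 + loading)


# A priori (GLM-only) inputs vs history-conditioned inputs for one policy.
a_priori = loaded_premium(claim_proba=0.10, mean_severity=2000.0)
conditional = loaded_premium(claim_proba=0.142, mean_severity=2150.0)

# Relativity = conditional premium / a priori premium.
relativity = conditional / a_priori
```

Under this assumed form the loading cancels in the ratio, so the relativity reflects only the shift in claim probability and severity.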

Top-level imports also work:

from insurance_copula import PanelDataset, TwoPartDVine, extract_relativity_curve

Relativity table

The output pricing teams actually use: how does claim history shift the predicted premium relative to the a priori estimate?

from insurance_copula import extract_relativity_curve, compare_to_ncd

curve = extract_relativity_curve(
    model,
    claim_counts=[0, 1, 2, 3],
    n_years_list=[1, 2, 3, 4, 5],
)
print(curve.pivot(index="claim_count", columns="n_years", values="relativity").round(3))

#              1yr   2yr   3yr   4yr   5yr
# 0 claims    1.00  1.00  1.00  1.00  1.00
# 1 claim     1.35  1.28  1.22  1.18  1.14
# 2 claims    NaN   1.71  1.58  1.48  1.40
# 3 claims    NaN   NaN   2.01  1.87  1.74

# Compare against NCD scale
comparison = compare_to_ncd(curve)
print(comparison[comparison["claim_count"] == 0].to_string())

Truncation and Markov order

The D-vine is truncated at order p, selected by BIC. At p=1, the model is a first-order Markov chain: only the most recent year matters after conditioning on covariates. At p=2, the last two years matter. For UK motor data, p=1 or p=2 is typical.

print(model.occurrence_vine.truncation_level)   # e.g., 1
print(model.occurrence_vine.fit_result_.bic_by_level)
# {1: 4821.3, 2: 4832.1}  → p=1 selected
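
Selection by BIC amounts to picking the truncation level with the smallest criterion value. A minimal sketch using the illustrative numbers above; `bic` and `select_truncation` are hypothetical helpers, not part of the library.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k * ln(n) - 2 * log-likelihood."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


def select_truncation(bic_by_level):
    """Return the truncation level with the lowest BIC (lower is better)."""
    return min(bic_by_level, key=bic_by_level.get)


bic_by_level = {1: 4821.3, 2: 4832.1}
p = select_truncation(bic_by_level)  # 1
```

The extra pair copula at level 2 adds parameters without improving the likelihood enough to offset the penalty, so the first-order model is kept.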

FCA Consumer Duty context

Since the FCA's PS21/5 pricing rules took effect in January 2022, renewal pricing must be fair. A D-vine model gives an auditable conditional distribution that separates genuine claim persistence (a legitimate risk signal) from premium optimisation targeting (what the FCA is policing). The relativity table above can go into documentation as it stands.

Performance

Benchmarked against a flat NCD adjustment (Poisson GLM + fixed step function: 0 claims = 0.55×, 1 claim = 0.75×, 2+ claims = 1.30×) on a synthetic panel of 5,000 policyholders over 3 years with a known latent frailty DGP. Oracle predictions (exact Gamma-Poisson posterior) serve as an upper bound. Full notebook: notebooks/benchmark.py.

| Metric | NCD Baseline | D-vine Copula | Oracle |
|---|---|---|---|
| Out-of-sample log-likelihood | lower | higher | highest |
| Brier score | higher | lower | lowest |
| MAE (predicted probability vs outcome) | higher | lower | lowest |
| Recency sensitivity (year-1 vs year-2 claim) | none | captures it | captures it |
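
For reference, the first three metrics can be computed from predicted claim probabilities and binary outcomes as follows. This is a generic sketch with invented data, not the benchmark notebook's code; the function names are hypothetical.

```python
import numpy as np


def bernoulli_log_likelihood(y, p, eps=1e-12):
    """Sum of log-probabilities assigned to the observed outcomes (higher is better)."""
    p = np.clip(p, eps, 1.0 - eps)  # guard against log(0)
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))


def brier_score(y, p):
    """Mean squared difference between probability and outcome (lower is better)."""
    return float(np.mean((p - y) ** 2))


def mae(y, p):
    """Mean absolute difference between probability and outcome (lower is better)."""
    return float(np.mean(np.abs(p - y)))


# Invented outcomes and two sets of predictions: one flat, one that discriminates.
y = np.array([0, 1, 0, 0, 1])
p_flat = np.full(5, 0.4)
p_sharp = np.array([0.1, 0.8, 0.2, 0.1, 0.7])

scores_flat = (bernoulli_log_likelihood(y, p_flat), brier_score(y, p_flat), mae(y, p_flat))
scores_sharp = (bernoulli_log_likelihood(y, p_sharp), brier_score(y, p_sharp), mae(y, p_sharp))
```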

The benchmark tests the core weakness of NCD: a policyholder whose only claim was in year 1 receives the same multiplier as one whose only claim was in year 2, even though the DGP makes recency matter. The D-vine conditions on the full sequence and assigns a higher next-year claim probability after a recent claim than after an older one. The notebook also reports calibration (A/E by NCD band) and the fraction of the oracle's improvement over NCD that the vine captures.
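
The recency blindness of the NCD baseline is easy to see in code. A small sketch using the step multipliers quoted above; `ncd_multiplier` is a hypothetical helper and the claim histories are invented.

```python
def ncd_multiplier(claim_history):
    """Fixed step function on claim count only: 0 -> 0.55, 1 -> 0.75, 2+ -> 1.30."""
    n_claims = sum(claim_history)
    if n_claims == 0:
        return 0.55
    if n_claims == 1:
        return 0.75
    return 1.30


# One claim each, oldest year first: one early in the window, one in the latest year.
old_claim = (1, 0, 0)
recent_claim = (0, 0, 1)

# Both histories have the same count, so the step function cannot tell them apart.
same_multiplier = ncd_multiplier(old_claim) == ncd_multiplier(recent_claim)
```

Any rule that depends only on the count treats these two histories identically, whereas a model conditioning on the ordered sequence can distinguish them.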

When to use: You have a panel of 3+ years of policyholder history and want experience-rated renewal pricing that goes beyond NCD steps — particularly where claim recency, not just count, matters.

When NOT to use: You have only one year of history per policyholder, or your book turns over too rapidly to build meaningful multi-year panels. The vine needs at least 2 prior years to condition on; with one year it reduces to a standard credibility adjustment.

References

Yang, L. & Czado, C. (2022). Two-part D-vine copula models for longitudinal insurance claim data. Scandinavian Journal of Statistics, 49(4), 1534–1561.

Shi, P. & Zhao, Z. (2024). Enhanced pricing and management of bundled insurance risks with dependence-aware prediction using pair copula construction. Journal of Econometrics, 240(1), 105676.

Licence

MIT
