urme-b/CalmSense

CalmSense

The Accuracy You Read Is Not the Accuracy You Get: Leakage, Motion, Shift, and Calibration

ML: Logistic Regression, Random Forest, XGBoost, LightGBM

DL: 1D-CNN

Explainability: SHAP

Live demo · Colab · Paper

[Image: CalmSense dashboard]

What this is

  • Detects stress vs baseline from wearable signals: ECG, EDA (skin conductance), temperature, respiration, motion.
  • Scored with Leave-One-Subject-Out (LOSO) cross-validation: train on 14 people, test on the 15th, and rotate until every subject has been held out once.
  • Shows where the usual high numbers come from: subject leakage, motion, dataset shift, calibration.
  • Runs in the browser (ONNX, no backend). make demo runs the full pipeline offline on synthetic signals.
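
The LOSO protocol described above can be sketched with scikit-learn's `LeaveOneGroupOut`. This is a minimal illustration on synthetic data, not the repository's evaluation code: the helper name `loso_accuracy`, the feature count, the window count per subject, and the class separation are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut


def loso_accuracy(X, y, subjects, make_model):
    """Return {held-out subject id: accuracy}, one entry per subject."""
    scores = {}
    splitter = LeaveOneGroupOut()
    for train_idx, test_idx in splitter.split(X, y, groups=subjects):
        model = make_model()  # fresh model per fold, so nothing carries over
        model.fit(X[train_idx], y[train_idx])
        held_out = int(subjects[test_idx][0])  # every test row shares one subject
        scores[held_out] = float(np.mean(model.predict(X[test_idx]) == y[test_idx]))
    return scores


# Synthetic stand-in (assumption): 15 subjects, 40 windows each, 5 features.
rng = np.random.default_rng(0)
n_subjects, windows = 15, 40
subjects = np.repeat(np.arange(n_subjects), windows)
one_subject_labels = np.r_[np.zeros(windows // 2, dtype=int), np.ones(windows // 2, dtype=int)]
y = np.tile(one_subject_labels, n_subjects)
X = rng.normal(size=(n_subjects * windows, 5)) + 1.5 * y[:, None]

loso_scores = loso_accuracy(X, y, subjects, lambda: LogisticRegression(max_iter=1000))
mean_accuracy = float(np.mean(list(loso_scores.values())))
print(f"{len(loso_scores)} folds, mean accuracy {mean_accuracy:.3f}")
```

Grouping the split by subject is what keeps one person's windows from appearing on both sides of a fold, which is the leakage the project sets out to measure.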

Results

Binary (baseline vs stress), 15 subjects, LOSO, mean over held-out subjects.

| Model | Accuracy | F1 (macro) |
|---|---|---|
| Random Forest | 0.913 | 0.898 |
| XGBoost | 0.903 | 0.873 |
| Logistic Regression | 0.902 | 0.883 |
| LightGBM | 0.894 | 0.860 |
| 1D-CNN (raw signal) | 0.718 | 0.648 |

  • The 4 feature models are a statistical tie (Friedman p = 0.81). RF 95% CI: [0.860, 0.960].
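
As a rough illustration of how a tie like this can be checked, the sketch below runs SciPy's `friedmanchisquare` on a subjects-by-models accuracy matrix and computes a percentile bootstrap interval over subjects. The accuracy matrix is randomly generated, so it will not reproduce the numbers above. The README does not say how the reported interval was computed, so the bootstrap and the helper name `bootstrap_ci` are assumptions.

```python
import numpy as np
from scipy.stats import friedmanchisquare

# Synthetic per-subject accuracies (assumption): 15 subjects x 4 feature models.
rng = np.random.default_rng(0)
per_subject = np.clip(rng.normal(0.90, 0.08, size=(15, 4)), 0.0, 1.0)

# Friedman test: each subject is a block, each model a treatment.
stat, p_value = friedmanchisquare(*per_subject.T)


def bootstrap_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean, resampling subjects with replacement."""
    boot_rng = np.random.default_rng(seed)
    idx = boot_rng.integers(0, len(values), size=(n_boot, len(values)))
    means = values[idx].mean(axis=1)
    low, high = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(low), float(high)


first_model = per_subject[:, 0]  # treat column 0 as one model's LOSO scores
ci_low, ci_high = bootstrap_ci(first_model)
print(f"Friedman p = {p_value:.2f}, 95% CI [{ci_low:.3f}, {ci_high:.3f}]")
```

With only 15 held-out subjects, the interval is driven by a handful of per-subject scores, which is why the CIs are wide.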

Key findings, one per check:

| Check | Question | Result |
|---|---|---|
| Subject leakage | Does same-person testing inflate scores? | 3-class 0.66 to 0.79 (+13 pts) |
| Motion confound | Is it just movement? | Drop all motion features: 0.913 to 0.901 |
| Wrist vs chest | Is a cheap sensor enough? | 0.893 vs 0.913 (2 pts lower) |
| Dataset shift | Does it transfer to another dataset? | Near chance (balanced accuracy 0.50) |
| Calibration | Are the probabilities trustworthy? | ECE 0.070; isotonic mapping lowers it to 0.025 |
| Personalization | Does a short enrollment help? | 20 windows: ECE 0.146 to 0.069 |
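
To make the calibration check concrete, here is a minimal sketch of expected calibration error (ECE) and an isotonic recalibration map. It uses synthetic, deliberately overconfident scores, so the values will differ from the table. The equal-width binning, the bin count of 10, and the 50/50 calibration/test split are assumptions, not the project's settings.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression


def expected_calibration_error(y_true, p_pos, n_bins=10):
    """Weighted mean gap between predicted and observed positive rate per bin."""
    y_true = np.asarray(y_true, dtype=float)
    p_pos = np.asarray(p_pos, dtype=float)
    bins = np.minimum((p_pos * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            # mask.mean() is the fraction of samples that fall in this bin
            ece += mask.mean() * abs(y_true[mask].mean() - p_pos[mask].mean())
    return float(ece)


# Synthetic overconfident classifier (assumption): logits scaled by 2.5.
rng = np.random.default_rng(0)
n = 4000
p_true = rng.uniform(0.05, 0.95, size=n)
y = (rng.uniform(size=n) < p_true).astype(int)
p_raw = 1.0 / (1.0 + np.exp(-2.5 * np.log(p_true / (1.0 - p_true))))

# Fit the isotonic map on one half, evaluate on the other half.
cal, test = slice(0, n // 2), slice(n // 2, n)
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(p_raw[cal], y[cal])

ece_before = expected_calibration_error(y[test], p_raw[test])
ece_after = expected_calibration_error(y[test], iso.predict(p_raw[test]))
print(f"ECE before {ece_before:.3f}, after isotonic {ece_after:.3f}")
```

Under LOSO, the map would have to be fit on data from the training subjects only. Fitting it on the held-out subject would reintroduce the leakage the benchmark is designed to avoid.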

Models

| Model | Type | Key settings |
|---|---|---|
| Logistic Regression | Linear | C=1.0, L2, class-balanced |
| Random Forest | Bagged trees | 200 trees, depth 10, class-balanced |
| XGBoost | Boosted trees | 200 trees, depth 7, lr 0.1 |
| LightGBM | Boosted trees | 200 trees, 50 leaves, lr 0.1 |
| 1D-CNN | Deep net on raw signal | Residual blocks, AdamW, early stopping |

  • Every model runs inside an impute (median) -> scale -> classifier pipeline, fit per fold and seeded.
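
One way to express the impute, scale, classify chain with the settings in the table is a scikit-learn `Pipeline`. The sketch below covers only the two scikit-learn models to stay dependency-free. The seed value, the helper name `make_model_pipeline`, the choice of `StandardScaler`, and `max_iter` are assumptions, not values confirmed by the README.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

SEED = 42  # assumption: the README says "seeded" but gives no value


def make_model_pipeline(name):
    """Build an unfitted impute -> scale -> classifier pipeline."""
    classifiers = {
        # L2 is scikit-learn's default penalty for LogisticRegression.
        "logreg": LogisticRegression(
            C=1.0, class_weight="balanced", max_iter=1000, random_state=SEED
        ),
        "rf": RandomForestClassifier(
            n_estimators=200, max_depth=10, class_weight="balanced", random_state=SEED
        ),
    }
    return Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
        ("clf", classifiers[name]),
    ])


# Synthetic fold (assumption): 150 training rows, 50 test rows, 5% missing values.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
X[rng.uniform(size=X.shape) < 0.05] = np.nan

pipe = make_model_pipeline("rf")
pipe.fit(X[:150], y[:150])      # medians and scaling learned from training rows only
preds = pipe.predict(X[150:])   # test rows reuse the training-fold statistics
print(preds.shape, pipe.named_steps["impute"].statistics_.round(2))
```

Because the imputer and scaler sit inside the pipeline, refitting it in each fold keeps their statistics from ever seeing the held-out subject.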

Features (58)

| Group | Count | Examples |
|---|---|---|
| HRV time domain | 12 | MeanNN, SDNN, RMSSD, pNN50 |
| HRV frequency | 8 | LF/HF power, LF/HF ratio |
| HRV nonlinear | 10 | SampEn, DFA, SD1/SD2, CSI |
| EDA (skin conductance) | 15 | SCL level, SCR count, SCR amplitude |
| Temperature + respiration | 8 | temp slope, respiration rate |
| Accelerometer (motion) | 5 | magnitude mean, std, energy |
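
For readers unfamiliar with the HRV time-domain names, the four examples in the first row follow standard textbook definitions, written out below in plain NumPy. This is only an illustration. The project lists NeuroKit2 for signal processing, and the helper name `hrv_time_features` and the sample intervals are assumptions.

```python
import numpy as np


def hrv_time_features(nn_ms):
    """Time-domain HRV features from NN (beat-to-beat) intervals in milliseconds."""
    nn = np.asarray(nn_ms, dtype=float)
    diffs = np.diff(nn)  # successive differences between adjacent intervals
    return {
        "MeanNN": float(nn.mean()),
        "SDNN": float(nn.std(ddof=1)),                       # sample std of intervals
        "RMSSD": float(np.sqrt(np.mean(diffs ** 2))),        # root mean square of diffs
        "pNN50": float(100.0 * np.mean(np.abs(diffs) > 50.0)),  # % of diffs above 50 ms
    }


features = hrv_time_features([800, 810, 790, 860, 800])
print(features)
```

For the five sample intervals, the successive differences are 10, -20, 70, and -60 ms, so two of the four exceed 50 ms and pNN50 is 50%.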

Graphs & charts

  • Model comparison (LOSO)
  • Optimism gap (leakage)
  • Feature ablation
  • Wrist vs chest
  • Cross-dataset transfer
  • Calibration reliability
  • Few-shot personalization
  • Top features (SHAP)
  • Confusion matrix

Tech stack

| Area | Tools |
|---|---|
| Modelling | scikit-learn, XGBoost, LightGBM, PyTorch |
| Signal processing | NeuroKit2, SciPy |
| Explainability | SHAP |
| Dashboard | React, TypeScript, ONNX Runtime Web |
| Tooling | GitHub Actions, ruff, mypy, pytest |

Limitations

  • 15 subjects, lab-induced stress. Underpowered, wide CIs. No clinical claim.
  • Ablation, calibration, and personalization are exploratory, not multiplicity-corrected.
  • The 1D-CNN is a small baseline, not a fair test of deep learning.
  • Cross-dataset uses one confounded pair. Illustrative, not conclusive.

Future work

  • A third corpus (SWELL / AffectiveROAD) for leave-one-dataset-out generalization.
  • Real-world, non-lab stress data beyond the 15-subject benchmark.
  • Real-time streaming inference from a live wearable.

Ethics & data use

  • Physiological signals are sensitive personal data.
  • This is a research benchmark, not a product.
  • Data minimization: collect and keep only what an analysis needs.
  • No surveillance: do not monitor or penalize people without informed consent.
  • Datasets keep their own licenses and are not redistributed here.

License

MIT License