tsauditor

A data-quality auditing library for time-series tabular data, with a focus on financial and sensor domains. tsauditor scans a DataFrame and returns a structured report of structural problems, anomalies, and, as its core contribution, data leakage between features and the prediction target.

The project grew out of a real bug in a Pakistani equity (OGDC) direction-prediction model: a same-day percentage-change feature (ChangeP) was mathematically near-identical to the target it was meant to predict. With ChangeP included, a Random Forest classifier reached 99.68% accuracy (AUC 0.9987); a Gradient Boosting classifier reached the same 99.68% accuracy (AUC 0.9967). Removing it — along with same-day Open, High, and Low, which are equally unavailable at prediction time — dropped accuracy to 69.81% (RF, AUC 0.7795) and 73.70% (GBM, AUC 0.8072) on a held-out test period (2025-01-09 to 2026-04-03). Both models still beat a 50% baseline, but the headline accuracy had been almost entirely an artifact of the leak. tsauditor exists to catch this class of mistake automatically before it reaches a model. See examples/ogdc_leakage_case for the full experiment, script, and measured results.

Installation

pip install tsauditor

Requires Python ≥ 3.9. Core dependencies: pandas, numpy, scipy, statsmodels, rich.

Development setup

git clone https://github.com/imann128/tsauditor.git
cd tsauditor
pip install -e ".[dev]"

Note: Set domain="None" for domain-agnostic usage. Omitting the domain argument entirely also works.

For usage snippets, see the sections below or the sample scripts in the examples directory.

Quickstart

import tsauditor as tsa

report = tsa.scan(df, target="Direction", domain="finance")

report.summary()                 # rich-formatted CLI table
report.critical                  # list[Issue] that block modeling
report.filter(module="leakage")  # programmatic filtering
report.to_json("report.json")    # structured export

scan() returns a GuardReport holding Issue dataclasses bucketed by severity (critical, warnings, info) plus dataset metadata.
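
To make the shape of that return value concrete, here is a toy stand-in written in plain Python. It is only a sketch: ToyIssue and ToyReport are hypothetical names, and the fields (module, code, severity, message) are assumptions taken from the columns of the checks table, not the actual classes in report/summary.py.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToyIssue:
    # Hypothetical fields, assumed for illustration only.
    module: str    # e.g. "leakage"
    code: str      # e.g. "LEK001"
    severity: str  # "critical", "warning", or "info"
    message: str


@dataclass
class ToyReport:
    issues: list = field(default_factory=list)

    @property
    def critical(self):
        # Issues that should block modeling.
        return [i for i in self.issues if i.severity == "critical"]

    def filter(self, module):
        # Keep only the issues raised by one module.
        return [i for i in self.issues if i.module == module]


toy = ToyReport([
    ToyIssue("leakage", "LEK001", "critical", "ChangeP reproduces the target"),
    ToyIssue("anomaly", "ANO001", "warning", "stuck values in air_temperature"),
    ToyIssue("profiler", "PRF003", "info", "series looks non-stationary"),
])

print([i.code for i in toy.critical])
print([i.code for i in toy.filter(module="leakage")])
```

The real GuardReport also carries dataset metadata and the rich/JSON output methods shown in the Quickstart; the toy version leaves those out.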

Sensor:

Real-World Sensor Validation Example

Below is an example using real weather station telemetry data. To showcase how tsauditor behaves during typical field failures, we manually inject three classic hardware faults: a frozen sensor reading, a complete network dropout gap, and a high-voltage electrical spike.

import pandas as pd
import tsauditor as tsa

print("Fetching real-world weather station sensor dataset...")
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/daily-min-temperatures.csv"

try:
    df = pd.read_csv(url, parse_dates=["Date"], index_col="Date")
    df.columns = ["air_temperature"]
    print("Dataset successfully loaded into memory")
except Exception as e:
    print(f"Error loading dataset: {e}")
    raise  # stop here: the steps below need df

print("Injecting typical hardware field failures for evaluation...")
# 1. Stuck sensor condition: flatlined at 12.2°C for 15 days straight
df.iloc[100:115] = 12.2

# 2. Transmission blackout: 10 days of completely missing telemetry
df.iloc[300:310] = None

# 3. Electrical surge: an impossible 75°C transient spike
df.iloc[500] = 75.0

print("\nRunning tsauditor validation sweep...")

# Execute the audit using the optimized sensor preset
report = tsa.scan(df, domain="sensor")
report.summary()
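
For intuition about what the three injected faults look like to a detector, the sketch below finds them with the standard library only. It is not tsauditor's implementation: the function names, window size, and thresholds are assumptions chosen for illustration, and a deterministic synthetic series stands in for the downloaded dataset so the snippet runs offline.

```python
import math
import statistics


def is_missing(v):
    return v is None or (isinstance(v, float) and math.isnan(v))


def stuck_runs(values, min_len=5):
    """Return (start, length) for runs of one repeated, non-missing value."""
    found, start = [], 0
    for i in range(1, len(values) + 1):
        # A run ends at the end of the series or when the value changes.
        if i == len(values) or values[i] != values[start]:
            if i - start >= min_len and not is_missing(values[start]):
                found.append((start, i - start))
            start = i
    return found


def missing_runs(values, min_len=3):
    """Return (start, length) for runs of consecutive missing values."""
    found, start = [], None
    for i, v in enumerate(list(values) + [0.0]):  # sentinel closes a trailing gap
        if is_missing(v):
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_len:
                found.append((start, i - start))
            start = None
    return found


def rolling_spikes(values, window=30, threshold=6.0, min_points=10):
    """Flag points far from the mean of the preceding window, in window stdevs."""
    flagged = []
    for i, v in enumerate(values):
        if is_missing(v):
            continue
        history = [x for x in values[max(0, i - window):i] if not is_missing(x)]
        if len(history) < min_points:
            continue
        spread = statistics.stdev(history)
        if spread > 0 and abs(v - statistics.fmean(history)) / spread > threshold:
            flagged.append(i)
    return flagged


# Synthetic stand-in for a daily temperature series: seasonal cycle plus a fast wiggle.
series = [10 + 3 * math.sin(2 * math.pi * t / 365) + 0.5 * math.sin(1.7 * t)
          for t in range(730)]
series[100:115] = [12.2] * 15   # stuck sensor
series[300:310] = [None] * 10   # transmission blackout
series[500] = 75.0              # electrical surge

print(stuck_runs(series))
print(missing_runs(series))
print(rolling_spikes(series))
```

Using a trailing window for the spike check keeps the score local: a 75°C reading is judged against the preceding weeks rather than against the whole series.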

What it checks

Module    Code    Severity  Detects
profiler  PRF001  warning   Irregular timestamp frequency
profiler  PRF002  warning   Clustered missing values
profiler  PRF003  info      Non-stationarity (Augmented Dickey-Fuller)
profiler  PRF004  warning   Duplicate timestamps
profiler  PRF005  warning   Clustered gaps
profiler  PRF006  warning   High overall missing rate
anomaly   ANO001  warning   Stuck / repeated constant values
anomaly   ANO002  warning   Point outliers (z-score + IQR)
anomaly   ANO003  warning   Contextual spikes (local rolling z-score)
leakage   LEK001  critical  Target equivalence (feature reproduces the target)
leakage   LEK002  warning   Positive-lag cross-correlation peak (future info)
leakage   LEK003  warning   Rolling-window lookahead (excess over persistence)
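
As an illustration of the two timestamp checks in the table (PRF001 and PRF004), the following standard-library sketch finds duplicate timestamps and steps that deviate from the most common interval. The function name and the "most common step" rule are assumptions made for this example; the profiler module may decide regularity differently.

```python
from collections import Counter
from datetime import datetime, timedelta


def timestamp_checks(timestamps):
    """Return (duplicates, dominant_step, irregular_steps) for a list of datetimes."""
    counts = Counter(timestamps)
    duplicates = sorted(t for t, c in counts.items() if c > 1)

    unique = sorted(counts)
    steps = [b - a for a, b in zip(unique, unique[1:])]
    # Treat the most frequent step as the series' nominal frequency.
    dominant = Counter(steps).most_common(1)[0][0] if steps else None
    irregular = [(a, b - a) for a, b in zip(unique, unique[1:]) if b - a != dominant]
    return duplicates, dominant, irregular


start = datetime(2025, 1, 1)
stamps = [start + timedelta(days=d) for d in range(10)]
stamps.remove(start + timedelta(days=4))   # drop one day: creates a two-day step
stamps.append(start + timedelta(days=7))   # repeat one day: creates a duplicate

duplicates, dominant, irregular = timestamp_checks(stamps)
print(duplicates)
print(dominant)
print(irregular)
```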

Leakage detection (the research core)

Leakage checks are rank-based, chosen by target type:

LEK001: equivalence. Continuous targets use |Spearman ρ|; binary targets use AUC separation (max(AUC, 1−AUC)). This is deliberate: Pearson correlation against a binary 0/1 target is the point-biserial correlation, which for a roughly normally distributed feature split at its center tops out near √(2/π) ≈ 0.798. A feature whose sign defines the target therefore scores only ~0.80 and slips under a naive threshold, while AUC scores it 1.0.
LEK002: cross-correlation. Flags features whose peak association with the target falls at a positive lag (the feature aligns with the target's future).
LEK003: temporal lookahead. Flags features that correlate with the future target beyond what the target's own autocorrelation can explain, which is the signature of a forward-looking or centered window. The persistence baseline is what keeps a legitimate trailing feature from being false-flagged.
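
The gap between Pearson and AUC on a sign-defined target can be reproduced in a few lines. This is a standard-library sketch, not the code in leakage/equivalence.py: the helper names are made up for the example, and the feature is simulated as standard normal noise standing in for a same-day percentage change.

```python
import math
import random


def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def auc(scores, labels):
    """ROC AUC via the Mann-Whitney rank formula, with average ranks for ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1  # average 1-based rank of the tie group
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


rng = random.Random(0)
change = [rng.gauss(0, 1) for _ in range(20000)]   # simulated same-day change
direction = [1 if c > 0 else 0 for c in change]    # target defined by its sign

r = pearson(change, direction)
a = auc(change, direction)
separation = max(a, 1 - a)

print(f"Pearson r      = {r:.3f}")
print(f"AUC separation = {separation:.3f}")
```

Pearson lands near 0.80 even though the feature determines the target exactly, whereas the rank-based separation reaches 1.0. That is why a threshold such as 0.9 on Pearson alone would miss this leak.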

LEK002/LEK003 are WARNING-level suspicions: in pure cross-correlation a genuine strong predictor and a leak are distinguishable only by magnitude. LEK001 is CRITICAL because equivalence is near-deterministic.
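
The persistence idea behind LEK003 can be sketched as follows. The example simulates an autocorrelated target, builds one trailing and one centered 3-step mean, and compares each feature's correlation with the next-step target against the target's own lag-1 autocorrelation. This is an illustration under stated assumptions, not the code in leakage/temporal.py: the lag convention (feature at t versus target at t + lag), the AR(1) simulation, and the use of Pearson instead of a rank correlation are all choices made for brevity.

```python
import math
import random


def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def lagged_corr(feature, target, lag):
    """Correlation of feature[t] with target[t + lag], for lag >= 0."""
    pairs = [(feature[t], target[t + lag])
             for t in range(len(target) - lag)
             if feature[t] is not None]
    xs, ys = zip(*pairs)
    return pearson(xs, ys)


# Simulated target: AR(1) process with coefficient 0.5.
rng = random.Random(0)
y = [0.0]
for _ in range(4999):
    y.append(0.5 * y[-1] + rng.gauss(0, 1))
n = len(y)

# Trailing mean uses y[t-2..t]; centered mean uses y[t-1..t+1] and so peeks ahead.
trailing = [None, None] + [sum(y[t - 2:t + 1]) / 3 for t in range(2, n)]
centered = [None] + [sum(y[t - 1:t + 2]) / 3 for t in range(1, n - 1)] + [None]

persistence = lagged_corr(y, y, 1)  # what the target's own past explains
trailing_excess = lagged_corr(trailing, y, 1) - persistence
centered_excess = lagged_corr(centered, y, 1) - persistence

print(f"persistence baseline = {persistence:.3f}")
print(f"trailing excess      = {trailing_excess:+.3f}")
print(f"centered excess      = {centered_excess:+.3f}")
```

The trailing feature stays below the baseline because it only repackages past values, while the centered window exceeds it because it contains the value being predicted. A positive excess is therefore the quantity worth flagging.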

Architecture

tsauditor/
├── scanner.py          # scan() — orchestrates all modules into a GuardReport
├── profiler/           # structural checks: frequency, missing, stationarity
├── anomaly/            # point.py, contextual.py
├── leakage/            # equivalence.py, correlation.py, temporal.py
├── report/summary.py   # GuardReport + Issue dataclasses, rich/JSON output
└── utils/validation.py # input validation & DataFrame normalization

Testing

pytest -q

Contributing

Contributions are welcome. Check open issues for ideas, or look for the good first issue label. Run pytest -q before opening a PR — all 93 tests must pass, and CI will verify this across Python 3.9–3.14 on Linux, Windows, and macOS.

Status

Beta (0.1.2). Profiler, anomaly, and leakage modules are implemented and tested (93 tests passing, CI across Python 3.9–3.14 on Linux, Windows, macOS).

License

MIT — see LICENSE.
