
V4FinBench

V4FinBench is a public benchmark for corporate financial distress prediction in Visegrád Group economies (2006–2021). It accompanies the paper "V4FinBench: Benchmarking Tabular Foundation Models, LLMs, and Standard Methods on Corporate Bankruptcy Prediction".

The repository is intended to reproduce the public benchmark pipeline and reference evaluations from the released Kaggle data. It does not reproduce the original EMIS extraction process because the raw EMIS files are not redistributed.

License

The code in this repository is released under the MIT License. The released V4FinBench dataset hosted on Kaggle is licensed under CC BY 4.0.

Public Data

The canonical public dataset is hosted on Kaggle:

https://www.kaggle.com/datasets/sebastiantomczak10/v4-group-corporate-bankruptcy/data

Dataset license: Creative Commons Attribution 4.0 International (CC BY 4.0).

Expected files:

File                       Meaning
company_years.parquet      Unlabeled company-year records
company_years_h1.parquet   Horizon file 1 (paper horizon h=0): current-year distress
company_years_h2.parquet   Horizon file 2 (paper horizon h=1): one-year-ahead distress
company_years_h3.parquet   Horizon file 3 (paper horizon h=2): two-year-ahead distress
company_years_h4.parquet   Horizon file 4 (paper horizon h=3): three-year-ahead distress
company_years_h5.parquet   Horizon file 5 (paper horizon h=4): four-year-ahead distress
company_years_h6.parquet   Horizon file 6 (paper horizon h=5): five-year-ahead distress
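
Note the off-by-one naming: Kaggle file company_years_h{i}.parquet corresponds to paper horizon h = i - 1. A tiny helper (hypothetical, not part of the repository) makes the mapping explicit:

from pathlib import Path

def horizon_path(data_dir: str, paper_h: int) -> Path:
    # Kaggle file index is the paper horizon plus one,
    # e.g. paper h=0 maps to company_years_h1.parquet.
    return Path(data_dir) / f"company_years_h{paper_h + 1}.parquet"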

You can either download the files manually from Kaggle and place them under data/raw/, or use the download script once it is added:

uv run python scripts/download_kaggle.py --out data/raw
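
A minimal sketch of what scripts/download_kaggle.py could look like, assuming the official kaggle package and API credentials in ~/.kaggle/kaggle.json (the dataset slug comes from the Kaggle URL above; this is not the final script):

# Hypothetical sketch of scripts/download_kaggle.py.
import argparse

from kaggle.api.kaggle_api_extended import KaggleApi

DATASET = "sebastiantomczak10/v4-group-corporate-bankruptcy"

def main() -> None:
    parser = argparse.ArgumentParser(description="Download the V4FinBench Kaggle dataset")
    parser.add_argument("--out", default="data/raw", help="target directory for the parquet files")
    args = parser.parse_args()

    api = KaggleApi()
    api.authenticate()  # reads ~/.kaggle/kaggle.json or KAGGLE_USERNAME/KAGGLE_KEY
    # Downloads the dataset archive and unpacks the parquet files into --out.
    api.dataset_download_files(DATASET, path=args.out, unzip=True)

if __name__ == "__main__":
    main()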

Manual layout:

data/raw/
├── company_years.parquet
├── company_years_h1.parquet
├── company_years_h2.parquet
├── company_years_h3.parquet
├── company_years_h4.parquet
├── company_years_h5.parquet
└── company_years_h6.parquet

What Is Reproducible

The repository should make the following reproducible from public artifacts:

  1. loading the Kaggle parquet files,
  2. verifying or regenerating horizon labels from company_years.parquet (a sketch follows this list),
  3. generating deterministic train/validation/test folds,
  4. running the standard tabular baselines,
  5. running TabPFN fine-tuning and evaluation,
  6. running the separate Llama-3-8B QLoRA experiment,
  7. aggregating results into paper-style tables.
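
For step 2, a minimal sketch of horizon-label regeneration, assuming company_years.parquet carries a company identifier, a year, and a per-year distress flag (the column names here are assumptions, not the published schema):

import pandas as pd

def add_horizon_label(df: pd.DataFrame, h: int) -> pd.DataFrame:
    # Label year t with the distress status observed h years later.
    # Assumes one row per company-year with consecutive annual records;
    # column names `company`, `year`, `distress` are illustrative.
    df = df.sort_values(["company", "year"]).copy()
    df[f"label_h{h}"] = df.groupby("company")["distress"].shift(-h)
    # Rows with no observation h years ahead cannot be labeled.
    return df.dropna(subset=[f"label_h{h}"])

verify_labels.py can then compare the regenerated labels against the shipped company_years_h* files.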

The original EMIS reconstruction is documented as data provenance only. The raw EMIS source files are not shared, so scripts for raw Excel ingestion should not be part of the required reproduction path.

Evaluation Protocol

The benchmark uses five grouped folds. All observations from a company must stay in the same fold, and fold construction must preserve the country structure.

Fold generation must use the fixed settings from the paper:

n_splits = 5
random_state = 42
group_col = company
country_col = country

For each run, one fold is used for validation, the next fold is used for testing, and the remaining three folds are used for training:

val_fold = fold
test_fold = (fold + 1) % 5
train_folds = all other folds

The generated fold files should therefore be identical for all users given the same Kaggle parquet files and seed.
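
A minimal sketch of both steps, assuming scikit-learn's StratifiedGroupKFold is an acceptable stand-in for the paper's grouped, country-preserving splitter (the actual implementation may differ):

import pandas as pd
from sklearn.model_selection import StratifiedGroupKFold

def assign_folds(df: pd.DataFrame, n_splits: int = 5, seed: int = 42) -> pd.Series:
    # Group by company so every observation of a company lands in one fold;
    # stratify on country so each fold mirrors the country distribution.
    splitter = StratifiedGroupKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    folds = pd.Series(-1, index=df.index, name="fold")
    for k, (_, held_out) in enumerate(splitter.split(df, y=df["country"], groups=df["company"])):
        folds.iloc[held_out] = k
    return folds

def fold_split(fold: int, n_splits: int = 5):
    # One fold validates, the next fold tests, the remaining three train.
    val_fold = fold
    test_fold = (fold + 1) % n_splits
    train_folds = [k for k in range(n_splits) if k not in (val_fold, test_fold)]
    return val_fold, test_fold, train_folds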

Models Included

The standard tabular baselines should come from the economic-data implementation path, not from the XGBoost-only code in financial-distress-foundational-models. The benchmark should include the following models (see the sketch after this list):

  • logistic regression,
  • multilayer perceptron,
  • random forest,
  • XGBoost,
  • LightGBM,
  • CatBoost.
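
A hypothetical registry of these six baselines with illustrative default constructors (the tuned hyperparameter grids belong in configs/baselines/, not here):

from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

# Illustrative constructors only; real settings come from configs/baselines/.
BASELINES = {
    "logistic_regression": lambda seed: LogisticRegression(max_iter=1000, random_state=seed),
    "mlp": lambda seed: MLPClassifier(random_state=seed),
    "random_forest": lambda seed: RandomForestClassifier(random_state=seed),
    "xgboost": lambda seed: XGBClassifier(random_state=seed),
    "lightgbm": lambda seed: LGBMClassifier(random_state=seed),
    "catboost": lambda seed: CatBoostClassifier(random_seed=seed, verbose=0),
}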

TabPFN fine-tuning should come from the foundation-model code, but with hard-coded cluster paths removed and the data/fold paths made configurable.

The old Llama evaluation code from economic-data is not used. Llama-3-8B QLoRA fine-tuning is implemented as a separate experiment under src/v4finbench/llama, configs/llama, and scripts/llama_*.

Repository Structure

V4FinBench/
├── README.md
├── pyproject.toml
├── uv.lock
├── configs/
│   ├── data.yaml
│   ├── baselines/
│   ├── llama/
│   └── tabpfn/
├── data/
│   ├── raw/
│   ├── processed/
│   └── folds/
├── docs/
│   ├── data_provenance.md
│   ├── llama_experiment.md
│   ├── benchmark_protocol.md
│   └── reproduction.md
├── scripts/
│   ├── aggregate_finetune_best.py
│   ├── aggregate_results.py
│   ├── download_kaggle.py
│   ├── build_labels.py
│   ├── build_folds.py
│   ├── finetune_tabpfn.py
│   ├── llama_eval.py
│   ├── llama_prepare_data.py
│   ├── llama_threshold.py
│   ├── llama_train_qlora.py
│   ├── reproduce_*.sh
│   ├── run_baselines.py
│   ├── run_tabpfn.py
│   ├── summarize_data.py
│   └── verify_labels.py
├── src/
│   └── v4finbench/
│       ├── data/
│       │   ├── io.py
│       │   ├── labels.py
│       │   ├── folds.py
│       │   └── preprocessing.py
│       ├── evaluation/
│       │   ├── metrics.py
│       │   ├── thresholds.py
│       │   └── protocol.py
│       ├── llama/
│       │   ├── formatting.py
│       │   ├── inference.py
│       │   ├── metrics.py
│       │   └── sampling.py
│       ├── models/
│       │   ├── baselines.py
│       │   ├── tabpfn.py
│       │   └── tabpfn_finetune.py
│       └── sampling/
│           ├── strategies.py
│           └── prototypes.py
├── tests/
├── slurm/
└── results/

Development

This project should use uv.

uv sync --extra dev
uv run --extra dev pytest

Suggested reproduction commands once the scripts are in place:

uv run python scripts/build_labels.py --input data/raw/company_years.parquet --out data/processed
uv run python scripts/verify_labels.py --generated data/processed --reference data/raw
uv run python scripts/summarize_data.py --data-dir data/raw --out results/generated/data_summary.csv
uv run python scripts/build_folds.py --data-dir data/raw --out data/folds --seed 42
uv run python scripts/run_baselines.py --data-dir data/raw --folds-dir data/folds
uv run python scripts/run_tabpfn.py --data-dir data/raw --folds-dir data/folds
uv run python scripts/aggregate_results.py --input results/generated/baselines/metrics.csv --out results/generated/baselines/summary.csv

For a quick baseline smoke run before launching the full grid:

uv run python scripts/run_baselines.py \
  --data-dir data/raw \
  --folds-dir data/folds \
  --horizon 0 \
  --fold 0 \
  --model logistic_regression \
  --max-candidates 1 \
  --no-save-model

Aggregate baseline metrics after running multiple folds:

uv run python scripts/aggregate_results.py \
  --input results/generated/baselines/metrics.csv \
  --out results/generated/baselines/summary.csv

Run a local vanilla TabPFN smoke test after installing the optional TabPFN dependencies. Keep the context small for local checks.

uv sync --extra tabpfn
uv run --extra tabpfn python scripts/run_tabpfn.py \
  --config configs/tabpfn/local_smoke.yaml \
  --data-dir data/raw \
  --folds-dir data/folds \
  --horizon 0 \
  --fold 0

To evaluate a specific TabPFN checkpoint or weights file, add:

--model-path /path/to/tabpfn_checkpoint.ckpt

Fine-tune TabPFN for one horizon/fold:

uv run --extra tabpfn python scripts/finetune_tabpfn.py \
  --config configs/tabpfn/finetune_prototype_undersample.yaml \
  --data-dir data/raw \
  --folds-dir data/folds \
  --horizon 0 \
  --fold 0 \
  --model-path /path/to/tabpfn_checkpoint.ckpt \
  --device cuda

Aggregate fine-tuning best epochs:

uv run python scripts/aggregate_finetune_best.py \
  --root results/generated/tabpfn_finetune \
  --out results/generated/tabpfn_finetune/best_epochs.csv \
  --summary results/generated/tabpfn_finetune/summary.csv

Prepare the separate Llama QLoRA experiment data:

uv run python scripts/llama_prepare_data.py \
  --config configs/llama/qlora_llama3_8b.yaml \
  --data-dir data/raw \
  --out data/llama \
  --horizon 0

The Llama system prompt is configurable in configs/llama/qlora_llama3_8b.yaml or via --system-prompt / --system-prompt-file. The default asks whether a company will go bankrupt within {horizon_years} year(s), not whether it will file a legal bankruptcy case.
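
An illustrative sketch of how such a prompt might be instantiated; the template wording and the mapping from paper horizon to {horizon_years} are assumptions here, and the real default lives in configs/llama/qlora_llama3_8b.yaml:

# Hypothetical prompt construction; see configs/llama/qlora_llama3_8b.yaml
# for the actual default template.
SYSTEM_PROMPT_TEMPLATE = (
    "You are a financial analyst. Based on the company's financial indicators, "
    "answer Yes or No: will the company go bankrupt within {horizon_years} year(s)?"
)

def build_system_prompt(horizon_years: int) -> str:
    return SYSTEM_PROMPT_TEMPLATE.format(horizon_years=horizon_years)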

Train and evaluate a Llama adapter. This path requires GPU infrastructure and the optional Llama dependencies. Train one separate adapter per horizon dataset; do not train one shared adapter across all six horizons.

uv sync --extra llama
uv run --extra llama python scripts/llama_train_qlora.py \
  --config configs/llama/qlora_llama3_8b.yaml \
  --train-file data/llama/llama_h0_train.csv \
  --output-dir results/generated/llama/h0_adapter

uv run --extra llama python scripts/llama_eval.py \
  --model-name meta-llama/Meta-Llama-3-8B \
  --adapter-path results/generated/llama/h0_adapter \
  --test-file data/llama/llama_h0_test.csv \
  --out results/generated/llama/h0_predictions.csv \
  --compute-yes-no-probs
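
A minimal sketch of what --compute-yes-no-probs can correspond to, assuming a standard transformers causal LM; adapter loading via peft is omitted for brevity, and treating "Yes"/"No" as single tokens is an assumption:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)

# First token id of each answer word.
yes_id = tokenizer.encode("Yes", add_special_tokens=False)[0]
no_id = tokenizer.encode("No", add_special_tokens=False)[0]

def yes_probability(prompt: str) -> float:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    # Renormalize over just the two answer tokens to get P(yes).
    pair = torch.softmax(next_token_logits[[yes_id, no_id]], dim=0)
    return pair[0].item()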
