A tool to evaluate LLMs on time-series forecasting and policy NLP tasks, alongside other forecasting models.
Humun Org
Instruct prompt method inspired by CiK forecasting [ paper | github ].
Note
Requires uv. See instructions here if not available on your system.
```
make install
```
The dataset used is economic time-series data scraped from FRED by the Data Collection team. It is currently mounted on the server at:
```
/workspace/datasets/fred/fred.parquet
```
When read via `humun_benchmark.data.load_from_parquet()`, the file is assumed to follow the format:
```python
{
    "id1": {"history": pd.DataFrame, "forecast": pd.DataFrame, "title": str, "notes": str},
    "id2": ...,
}
```
Note
This data is truncated and split in the function - see the function definition for details.
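A minimal sketch of loading and inspecting one entry (the path argument and the "GDP" series id are illustrative assumptions; check the function definition for the actual signature):
```python
from humun_benchmark.data import load_from_parquet

# Sketch only: path and series id are assumptions, not confirmed defaults.
data = load_from_parquet("/workspace/datasets/fred/fred.parquet")

entry = data["GDP"]
print(entry["title"])
entry["history"].head()   # truncated training portion of the series
entry["forecast"].head()  # held-out portion to forecast
```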
Note
Alternative values can be provided at runtime via benchmark.py or the benchmark() parameters.
The following environment variables are contained in .env and loaded by python-dotenv.
| Variable Name | Description | Default Value |
|---|---|---|
| DATASETS_PATH | Path to FRED time series data parquet file | /workspace/datasets/fred/fred.parquet |
| RESULTS_STORE | Directory for storing benchmark results | /workspace/pretraining/benchmarks |
| HF_HOME | Directory for shared HuggingFace model cache | /workspace/huggingface_cache |
| HF_TOKEN_PATH | Path for HuggingFace authentication token | ~/.cache/huggingface/token |
| HF_STORED_TOKENS_PATH | Path for additional HuggingFace tokens | Auto-set based on HF_TOKEN_PATH (see here) |
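For reference, a .env using the defaults above might look like this (HF_STORED_TOKENS_PATH is omitted since it is auto-set; adjust paths to your environment):
```
DATASETS_PATH=/workspace/datasets/fred/fred.parquet
RESULTS_STORE=/workspace/pretraining/benchmarks
HF_HOME=/workspace/huggingface_cache
HF_TOKEN_PATH=~/.cache/huggingface/token
```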
To run a benchmark, simply run the benchmark.py file, which calls the benchmark() function defined in the same file with a set of config parameters you can edit directly (argparse will be re-added soon for easier configuration). A call sketch follows the parameter list below.
Required:
* models: A dictionary containing the models to benchmark, e.g.
```python
models = {
    "llm": [
        "Qwen/Qwen2.5-7B-Instruct",
        "meta-llama/Llama-3.1-8B-Instruct",
        "Ministral-8B-Instruct-2410",
    ],
    "statistical": ["arima"],
}
```
Optional:
* output_path: Where to store results
* datasets_path: Path to the time series data parquet file
* series_ids: List of series IDs from the FRED data
* n_datasets: Number of datasets to retrieve (used with filters)
* batch_size: Number of runs per inference
* train_ratio: Multiplier for the training period
* forecast_steps: Number of forecast steps
* context: Whether to include textual context for LLMs (bool)
* available_gpu_ids: List of GPU IDs to use; all available GPUs are used when not provided
* level: Logging level (default: logging.INFO)
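A minimal call sketch (the import path, keyword names, and series ids are assumptions based on the parameter list above; check benchmark.py for the actual signature):
```python
# Sketch only: import path and keyword names follow the parameter list above.
from humun_benchmark.benchmark import benchmark

models = {
    "llm": ["Qwen/Qwen2.5-7B-Instruct"],
    "statistical": ["arima"],
}

benchmark(
    models=models,                 # required
    series_ids=["GDP", "UNRATE"],  # illustrative FRED series ids
    forecast_steps=12,
    context=True,                  # include textual context for LLMs
)
```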
Generating forecasts -
```
> make install
> source .venv/bin/activate
> python humun_benchmark/benchmark.py
```
Results are stored under the RESULTS_STORE path from .env plus a datetime string by default:
```
/workspace/pretraining/benchmarks/YYYYMMDD_HHMMSS/
├ benchmark.log
├ Qwen…parquet
├ meta-llama…parquet
└ …
```
Calculating metrics -
```python
import glob

from humun_benchmark.data.metrics import read_results, compute_all_metrics

paths = glob.glob("/workspace/pretraining/benchmarks/<folder_name>/*.parquet")
results = read_results(paths)
metrics = compute_all_metrics(results)
metrics["overall_metrics"]  # pd.DataFrame of cross-dataset results for all selected models
```
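Since the overall metrics are a plain pandas DataFrame, they can be persisted with standard pandas calls, e.g.:
```python
# Save the cross-dataset summary for later comparison.
metrics["overall_metrics"].to_csv("overall_metrics.csv")
```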