
Humanity-Unleashed/benchmarking


A tool to evaluate LLMs with time-series forecasting and policy NLP tasks, alongside other forecasting models.

Humun Org

Notion Page

Description

Instruct prompt method inspired by CiK forecasting [ paper | github ].

Installation

Note

Requires uv. See instructions here if not available on your system.

make install

Data

The dataset is economic time series data scraped from FRED by the Data Collection team. It is currently mounted on the server at:

  • /workspace/datasets/fred/fred.parquet

When read via humun_benchmark.data.load_from_parquet(), the data takes the following format:

    { "id1": { "history": pd.DataFrame, "forecast": pd.DataFrame, "title": str, "notes" : str },
      "id2": ... }

Note

The function truncates and splits the data; see the function definition for details.
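
To make the dictionary format above concrete, the sketch below builds a small stand-in by hand and walks over it. The series id, the column names (date, value) and the numbers are invented for illustration; the real frames come from load_from_parquet() and may use different columns.

```python
import pandas as pd

# Hand-built stand-in for the dictionary format described above.
# The id, column names, and values are illustrative assumptions, not real FRED data.
series = {
    "EXAMPLE_ID": {
        "history": pd.DataFrame(
            {
                "date": pd.date_range("2020-01-01", periods=4, freq="MS"),
                "value": [1.0, 1.5, 2.0, 2.5],
            }
        ),
        "forecast": pd.DataFrame(
            {
                "date": pd.date_range("2020-05-01", periods=2, freq="MS"),
                "value": [3.0, 3.5],
            }
        ),
        "title": "Example series",
        "notes": "Synthetic data for illustration only.",
    }
}


def summarise(data: dict) -> dict:
    """Return {series_id: (n_history_rows, n_forecast_rows)}."""
    return {
        sid: (len(entry["history"]), len(entry["forecast"]))
        for sid, entry in data.items()
    }


summary = summarise(series)
```

A summary like this is a quick way to check that each series has both a history and a forecast frame before benchmarking.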

Environment Variables

Note

Alternative values can be provided at runtime through the parameters of benchmark.py or benchmark().

These variables are defined in .env and loaded by pydotenv.

| Variable Name | Description | Default Value |
| --- | --- | --- |
| DATASETS_PATH | Path to FRED time series data parquet file | /workspace/datasets/fred/fred.parquet |
| RESULTS_STORE | Directory for storing benchmark results | /workspace/pretraining/benchmarks |
| HF_HOME | Directory for shared HuggingFace model cache | /workspace/huggingface_cache |
| HF_TOKEN_PATH | Path for HuggingFace authentication token | ~/.cache/huggingface/token |
| HF_STORED_TOKENS_PATH | Path for additional HuggingFace tokens | Auto-set based on HF_TOKEN_PATH (see here) |
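
As a rough illustration of how the defaults above interact with overrides, the following sketch resolves each variable from a mapping and falls back to the documented default. The helper resolve_settings is hypothetical and uses only the standard library; it is not the project's actual loading code, and HF_STORED_TOKENS_PATH is left out because its derivation is not described here.

```python
import os

# Defaults copied from the table above; a value supplied in the environment wins.
DEFAULTS = {
    "DATASETS_PATH": "/workspace/datasets/fred/fred.parquet",
    "RESULTS_STORE": "/workspace/pretraining/benchmarks",
    "HF_HOME": "/workspace/huggingface_cache",
    "HF_TOKEN_PATH": "~/.cache/huggingface/token",
}


def resolve_settings(env=None) -> dict:
    """Hypothetical helper: look up each variable, falling back to its default."""
    env = os.environ if env is None else env
    settings = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    # Expand "~" so the token path can be used as a real filesystem path.
    settings["HF_TOKEN_PATH"] = os.path.expanduser(settings["HF_TOKEN_PATH"])
    return settings


# Override one variable and keep the documented defaults for the rest.
settings = resolve_settings({"RESULTS_STORE": "/tmp/results"})
```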

Benchmarking Instructions

To run a benchmark, run benchmark.py. The script calls the benchmark() function defined in the same file with a set of config parameters that you can edit (argparse support will be re-added soon for easier configuration):

Required:
* models: A dictionary containing models to benchmark.
    e.g. models = {
        "llm": [
            "Qwen/Qwen2.5-7B-Instruct",
            "meta-llama/Llama-3.1-8B-Instruct",
            "Ministral-8B-Instruct-2410",
        ],
        "statistical": ['arima'],
    }

Optional:
* output_path: Where to store results
* datasets_path: Path to time series data
* series_ids: List of series IDs from FRED data
* n_datasets: Number of datasets to retrieve (used with filters)
* batch_size: Number of runs per inference
* train_ratio: Multiplier for training period  
* forecast_steps: Number of forecast steps
* context: Bool for whether to include context or not for LLMs
* available_gpu_ids: List of available GPU IDs to use. Tries to use all when not provided.
* level: Logging level (default is logging.INFO)
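
The parameters above can be collected in a plain dictionary before a run. The sketch below is only an illustration: the parameter names come from the lists above, but every value is an assumption, and flatten_models is a hypothetical helper for checking the models dictionary, not part of the package.

```python
# Hypothetical configuration: names are from the lists above, values are assumptions.
config = {
    "models": {
        "llm": ["Qwen/Qwen2.5-7B-Instruct"],
        "statistical": ["arima"],
    },
    "n_datasets": 5,
    "forecast_steps": 12,
    "context": True,
}


def flatten_models(models: dict) -> list:
    """Hypothetical helper: list (kind, name) pairs and reject an empty selection."""
    pairs = [(kind, name) for kind, names in models.items() for name in names]
    if not pairs:
        raise ValueError("models must contain at least one model name")
    return pairs


pairs = flatten_models(config["models"])
# Assuming benchmark() accepts these names as keyword arguments,
# the dictionary could then be passed on as benchmark(**config).
```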

Generating forecasts -

> make install
> source .venv/bin/activate
> python humun_benchmark/benchmark.py 

Results are stored under RESULTS_STORE (from .env) in a directory named with a datetime string by default.

/workspace/pretraining/benchmarks/YYYYMMDD_HHMMSS/
  ├ benchmark.log
  ├ Qwen…parquet
  ├ meta-llama…parquet
  └ …
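
The YYYYMMDD_HHMMSS directory name shown above matches the strftime pattern %Y%m%d_%H%M%S. A minimal sketch of building such a path, assuming the default results store; results_dir is a hypothetical helper and not necessarily how the project names its directories:

```python
from datetime import datetime
from pathlib import PurePosixPath


def results_dir(store: str, now: datetime) -> PurePosixPath:
    """Hypothetical helper: join the results store with a YYYYMMDD_HHMMSS stamp."""
    return PurePosixPath(store) / now.strftime("%Y%m%d_%H%M%S")


# A fixed timestamp keeps the example reproducible.
example = results_dir(
    "/workspace/pretraining/benchmarks", datetime(2025, 1, 31, 14, 5, 9)
)
# str(example) == "/workspace/pretraining/benchmarks/20250131_140509"
```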

Calculating metrics -

import glob

from humun_benchmark.data.metrics import read_results, compute_all_metrics

# Replace <folder_name> with the timestamped results directory.
paths = glob.glob("/workspace/pretraining/benchmarks/<folder_name>/*.parquet")
results = read_results(paths)
metrics = compute_all_metrics(results)
metrics['overall_metrics']  # pd.DataFrame of cross-dataset results for all models selected
