This repository contains the code and data for the InferF paper. InferF is a system for factorizing ML inference queries, built on top of Meta's high-performance database engine, Velox, with CactusDB as the backend.
To set up the environment for running the project, refer to Setup_Instructions.md for more details.
To run the Greedy, Genetic, and Exhaustive Search optimizers, run the following commands:

```
python velox/ml_functions/python/greedy.py --query_path PATH_TO_QUERY --model_path PATH_TO_MODEL --output_path PATH_TO_OUTPUT_PLAN
python velox/ml_functions/python/genetic.py --query_path PATH_TO_QUERY --model_path PATH_TO_MODEL --output_path PATH_TO_OUTPUT_PLAN
python velox/ml_functions/python/dynamic.py --query_path PATH_TO_QUERY --model_path PATH_TO_MODEL --output_path PATH_TO_OUTPUT_PLAN
```

Similarly, you can run the Morpheus BU, Morpheus TD, FL BU, and FL TD baselines via morpheus_bu.py, morpheus_td.py, fl_bu.py, and fl_td.py respectively, with a command like the following (change the Python file name accordingly):
```
python velox/ml_functions/python/morpheus_bu.py --query_path PATH_TO_QUERY --model_path PATH_TO_MODEL --output_path PATH_TO_OUTPUT_PLAN
```

Here is an example:

```
python velox/ml_functions/python/greedy.py --query_path resources/queries/10_1.txt --model_path resources/model/dummy_1000_32.pth --output_path resources/output_plan.txt
```

To generate the synthetic tables, run:

```
python velox/ml_functions/python/synthetic_gen.py --data_path PATH_TO_BASE_DATA --config_path resources/data/table_configs.json --output_dir resources/data/synthetic_tables
```

In the above command, PATH_TO_BASE_DATA is the path to the Epsilon or Bosch dataset.
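Since the optimizer and baseline scripts above all take the same three flags, they can be driven from one small script. The sketch below is illustrative and not part of the repository; the script names and flags come from this README, and the paths in the example invocation are the ones used above.

```python
# Sketch: build (and optionally run) one invocation per optimizer/baseline
# script listed in this README. Not part of the repository.
import subprocess  # used only by the commented-out run below
from pathlib import Path

SCRIPT_DIR = "velox/ml_functions/python"
# Optimizers (greedy, genetic, dynamic) and baselines; all share the same CLI.
SCRIPTS = ["greedy", "genetic", "dynamic",
           "morpheus_bu", "morpheus_td", "fl_bu", "fl_td"]

def build_command(script, query_path, model_path, output_path):
    """Assemble one command line for an optimizer or baseline script."""
    assert script in SCRIPTS, f"unknown script: {script}"
    return ["python", f"{SCRIPT_DIR}/{script}.py",
            "--query_path", query_path,
            "--model_path", model_path,
            "--output_path", output_path]

if __name__ == "__main__":
    for name in SCRIPTS:
        cmd = build_command(name, "resources/queries/10_1.txt",
                            "resources/model/dummy_1000_32.pth",
                            f"resources/{name}_plan.txt")
        print(" ".join(cmd))
        # subprocess.run(cmd, check=True)  # uncomment when run inside the repo
```

Writing each plan to its own `resources/<name>_plan.txt` file keeps the outputs of different optimizers from overwriting one another.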
Download the IMDB dataset from here and place the CSV files into the resources/data/imdb directory.
The NYC dataset available in this repository should be placed into the resources/data directory.
Generate the TPC-DS dataset with a scale factor of 1 and place all of its files inside the resources/data/tpcds directory.
Download the zipped datasets from here. After unzipping the downloaded file, put the expedia and flights folders inside the resources/data directory.
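A quick way to confirm the datasets landed where the steps above expect them is to check the directories by name. This is an illustrative helper, not part of the repository; the directory names come from this README, and the files inside each directory are not validated.

```python
# Sketch: sanity-check the dataset layout described in this README.
from pathlib import Path

EXPECTED_DIRS = [
    "resources/data/imdb",     # IMDB CSV files
    "resources/data/tpcds",    # TPC-DS, scale factor 1
    "resources/data/expedia",  # from the zipped datasets
    "resources/data/flights",  # from the zipped datasets
]

def missing_dirs(root="."):
    """Return the expected data directories that do not exist under root."""
    return [d for d in EXPECTED_DIRS if not (Path(root) / d).is_dir()]

if __name__ == "__main__":
    for d in missing_dirs():
        print(f"missing: {d}")
```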
After you build and compile the project, run the following command:
```
_build/release/velox/ml_functions/factorize_test PATH_TO_DATASET PATH_TO_QUERY PATH_TO_FACTORIZED_PLAN PATH_TO_MODEL
```

Here is an example:

```
_build/release/velox/ml_functions/factorize_test resources/data/imdb resources/queries/5_1.txt resources/plan_5_1.txt resources/model/dummy_1000_32.h5
```

In the above command, PATH_TO_FACTORIZED_PLAN is a plan generated by one of the optimizers, such as Greedy or Genetic. For no factorization and full factorization, the labels in the generated plan can be updated as follows:
For no factorization: labels of all table scan nodes and join nodes can be set to 0.
For full factorization: labels of all table scan nodes should be set to 1 and join nodes should be set to 0.
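The two labeling rules above can be expressed as a small function. The sketch below uses a deliberately simplified in-memory representation of a plan (a list of (node_type, label) pairs); the actual on-disk plan format produced by the optimizers may differ, so treat this as pseudocode for the rule rather than a parser for the plan files.

```python
# Sketch of the no/full factorization labeling rules from this README,
# over a hypothetical (node_type, label) pair representation of a plan.

def relabel(plan, mode):
    """Apply a labeling policy to a plan.

    mode="none": table scan nodes and join nodes all get label 0.
    mode="full": table scan nodes get label 1, join nodes get label 0.
    Other node types keep their existing labels.
    """
    assert mode in ("none", "full")
    out = []
    for node_type, label in plan:
        if node_type == "table_scan":
            label = 1 if mode == "full" else 0
        elif node_type == "join":
            label = 0
        out.append((node_type, label))
    return out
```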
After you build and compile the project, run the following command (similarly to the synthetic workload queries):
```
_build/release/velox/ml_functions/job_rewrite_test PATH_TO_DATASET PATH_TO_QUERY PATH_TO_FACTORIZED_PLAN PATH_TO_MODEL
```

Place the IMDB and MovieLens datasets in the resources/data folder and run the following command. It will execute all LLM queries with various versions of optimizations.

```
_build/release/velox/ml_functions/chatgpt_infef_test
```

To run the In-DB ML system baselines, first load the IMDB and TPC-DS datasets into a PostgreSQL database by running the following files:
```
python velox/ml_functions/python/indb_ml/load_imdb_postgres.py
python velox/ml_functions/python/indb_ml/load_tpcds_postgres.py
python velox/ml_functions/python/indb_ml/load_tpcds_extra_postgres.py
```

Execute the DL Centric, EvaDB, PySpark, and MADlib baselines of Q1 by running the following four Python files:
```
python velox/ml_functions/python/indb_ml/tpcds_two_tower_postgres.py
python velox/ml_functions/python/indb_ml/tpcds_two_tower_evadb.py
python velox/ml_functions/python/indb_ml/tpcds_two_tower_pyspark.py
python velox/ml_functions/python/indb_ml/tpcds_two_tower_madlib.py
```

Execute the DL Centric, EvaDB, PySpark, and MADlib baselines of Q2 by running the following four Python files:
```
python velox/ml_functions/python/indb_ml/tpcds_forecasting_postgres.py
python velox/ml_functions/python/indb_ml/tpcds_forecasting_evadb.py
python velox/ml_functions/python/indb_ml/tpcds_forecasting_pyspark.py
python velox/ml_functions/python/indb_ml/tpcds_forecasting_madlib.py
```

Execute the DL Centric, EvaDB, PySpark, and MADlib baselines of Q3 by running the following four Python files:
```
python velox/ml_functions/python/indb_ml/imdb_two_tower_postgres.py
python velox/ml_functions/python/indb_ml/imdb_two_tower_evadb.py
python velox/ml_functions/python/indb_ml/imdb_two_tower_pyspark.py
python velox/ml_functions/python/indb_ml/imdb_two_tower_madlib.py
```

After the above commands, run the Python files whose names start with 'expedia_' in the folder velox/ml_functions/python/indb_ml to execute the in-DB baselines for Q4 (the Expedia workload query). Similarly, run the Python files whose names start with 'flights_' in the same folder to execute the in-DB baselines for Q5 (the Flights workload query).
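Because the Q4 and Q5 baselines are identified by file-name prefix, they can be discovered and run with a glob rather than typed out one by one. This is an illustrative sketch, not part of the repository; the folder and the 'expedia_'/'flights_' prefixes come from this README, and the set of matching files depends on your checkout.

```python
# Sketch: discover the Q4/Q5 in-DB baseline scripts by file-name prefix.
import subprocess  # used only by the commented-out run below
from pathlib import Path

INDB_DIR = "velox/ml_functions/python/indb_ml"

def scripts_for(prefix, folder=INDB_DIR):
    """All Python scripts in folder whose names start with prefix, sorted."""
    return sorted(str(p) for p in Path(folder).glob(f"{prefix}*.py"))

if __name__ == "__main__":
    for prefix in ("expedia_", "flights_"):   # Q4 and Q5 baselines
        for script in scripts_for(prefix):
            print("running", script)
            # subprocess.run(["python", script], check=True)  # inside the repo
```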
If you use this repository, please cite the InferF paper:

```
@misc{chowdhury2025inferfdeclarativefactorizationaiml,
  title={InferF: Declarative Factorization of AI/ML Inferences over Joins},
  author={Kanchan Chowdhury and Lixi Zhou and Lulu Xie and Xinwei Fu and Jia Zou},
  year={2025},
  eprint={2511.20489},
  archivePrefix={arXiv},
  primaryClass={cs.DB},
  url={https://arxiv.org/abs/2511.20489},
}
```