This repository contains the code and data for the InferF paper. InferF is a system for factorizing ML inference queries, built on top of Meta's high-performance database engine, Velox, with CactusDB as the backend.
To set up the environment for running the project, refer to Setup_Instructions.md for more details.
To run the Greedy, Genetic, and Exhaustive Search optimizers, run the following commands:

```
python velox/ml_functions/python/greedy.py --query_path PATH_TO_QUERY --model_path PATH_TO_MODEL --output_path PATH_TO_OUTPUT_PLAN
python velox/ml_functions/python/genetic.py --query_path PATH_TO_QUERY --model_path PATH_TO_MODEL --output_path PATH_TO_OUTPUT_PLAN
python velox/ml_functions/python/dynamic.py --query_path PATH_TO_QUERY --model_path PATH_TO_MODEL --output_path PATH_TO_OUTPUT_PLAN
```

Similarly, you can run the Morpheus BU, Morpheus TD, FL BU, and FL TD baselines via morpheus_bu.py, morpheus_td.py, fl_bu.py, and fl_td.py respectively, with a command like the following (change the Python file name accordingly):
```
python velox/ml_functions/python/morpheus_bu.py --query_path PATH_TO_QUERY --model_path PATH_TO_MODEL --output_path PATH_TO_OUTPUT_PLAN
```

Here is an example:

```
python velox/ml_functions/python/greedy.py --query_path resources/queries/10_1.txt --model_path resources/model/dummy_1000_32.pth --output_path resources/output_plan.txt
```

To generate the synthetic tables, run:

```
python velox/ml_functions/python/synthetic_gen.py --data_path PATH_TO_BASE_DATA --config_path resources/data/table_configs.json --output_dir resources/data/synthetic_tables
```

In the above command, PATH_TO_BASE_DATA is the path to the Epsilon or Bosch dataset.
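Since the optimizer and baseline scripts above all take the same three flags, they can be driven from one small script. The sketch below is illustrative and not part of the repository; the script names and flags come from this README, and the paths in the example invocation are the ones used above.

```python
# Sketch: build (and optionally run) one invocation per optimizer/baseline
# script listed in this README. Not part of the repository.
import subprocess  # used only by the commented-out run below
from pathlib import Path

SCRIPT_DIR = "velox/ml_functions/python"
# Optimizers (greedy, genetic, dynamic) and baselines; all share the same CLI.
SCRIPTS = ["greedy", "genetic", "dynamic",
           "morpheus_bu", "morpheus_td", "fl_bu", "fl_td"]

def build_command(script, query_path, model_path, output_path):
    """Assemble one command line for an optimizer or baseline script."""
    assert script in SCRIPTS, f"unknown script: {script}"
    return ["python", f"{SCRIPT_DIR}/{script}.py",
            "--query_path", query_path,
            "--model_path", model_path,
            "--output_path", output_path]

if __name__ == "__main__":
    for name in SCRIPTS:
        cmd = build_command(name, "resources/queries/10_1.txt",
                            "resources/model/dummy_1000_32.pth",
                            f"resources/{name}_plan.txt")
        print(" ".join(cmd))
        # subprocess.run(cmd, check=True)  # uncomment when run inside the repo
```

Writing each plan to its own `resources/<name>_plan.txt` file keeps the outputs of different optimizers from overwriting one another.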
Download the IMDB dataset from here and place the CSV files into the resources/data/imdb directory.
The NYC dataset available in this repository should be placed into the resources/data directory.
Generate the TPC-DS dataset with a scale factor of 1 and place all of its files inside the resources/data/tpcds directory.
Download the zipped datasets from here. After unzipping the downloaded file, put the expedia and flights folders inside the resources/data directory.
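A quick way to confirm the datasets landed where the steps above expect them is to check the directories by name. This is an illustrative helper, not part of the repository; the directory names come from this README, and the files inside each directory are not validated.

```python
# Sketch: sanity-check the dataset layout described in this README.
from pathlib import Path

EXPECTED_DIRS = [
    "resources/data/imdb",     # IMDB CSV files
    "resources/data/tpcds",    # TPC-DS, scale factor 1
    "resources/data/expedia",  # from the zipped datasets
    "resources/data/flights",  # from the zipped datasets
]

def missing_dirs(root="."):
    """Return the expected data directories that do not exist under root."""
    return [d for d in EXPECTED_DIRS if not (Path(root) / d).is_dir()]

if __name__ == "__main__":
    for d in missing_dirs():
        print(f"missing: {d}")
```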
After you build and compile the project, run the following command:
```
_build/release/velox/ml_functions/factorize_test PATH_TO_DATASET PATH_TO_QUERY PATH_TO_FACTORIZED_PLAN PATH_TO_MODEL
```

Here is an example:

```
_build/release/velox/ml_functions/factorize_test resources/data/imdb resources/queries/5_1.txt resources/plan_5_1.txt resources/model/dummy_1000_32.h5
```

In the above command, PATH_TO_FACTORIZED_PLAN is a plan generated by one of the optimizers, such as Greedy or Genetic. For no factorization and full factorization, the labels in the generated plan can be updated as follows:
For no factorization: labels of all table scan nodes and join nodes can be set to 0.
For full factorization: labels of all table scan nodes should be set to 1 and join nodes should be set to 0.
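The two labeling rules above can be expressed as a small function. The sketch below uses a deliberately simplified in-memory representation of a plan (a list of (node_type, label) pairs); the actual on-disk plan format produced by the optimizers may differ, so treat this as pseudocode for the rule rather than a parser for the plan files.

```python
# Sketch of the no/full factorization labeling rules from this README,
# over a hypothetical (node_type, label) pair representation of a plan.

def relabel(plan, mode):
    """Apply a labeling policy to a plan.

    mode="none": table scan nodes and join nodes all get label 0.
    mode="full": table scan nodes get label 1, join nodes get label 0.
    Other node types keep their existing labels.
    """
    assert mode in ("none", "full")
    out = []
    for node_type, label in plan:
        if node_type == "table_scan":
            label = 1 if mode == "full" else 0
        elif node_type == "join":
            label = 0
        out.append((node_type, label))
    return out
```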
After you build and compile the project, run the following command (similarly to the synthetic workload queries):
```
_build/release/velox/ml_functions/job_rewrite_test PATH_TO_DATASET PATH_TO_QUERY PATH_TO_FACTORIZED_PLAN PATH_TO_MODEL
```

Place the IMDB and MovieLens datasets in the resources/data folder and run the following command. It will execute all LLM queries with various versions of optimizations.

```
_build/release/velox/ml_functions/chatgpt_infef_test
```

To run the In-DB ML system baselines, first load the IMDB and TPC-DS datasets into a PostgreSQL database by running the following files:
```
python velox/ml_functions/python/indb_ml/load_imdb_postgres.py
python velox/ml_functions/python/indb_ml/load_tpcds_postgres.py
python velox/ml_functions/python/indb_ml/load_tpcds_extra_postgres.py
```

Execute the DL Centric, EvaDB, PySpark, and MADlib baselines of Q1 by running the following four Python files:
```
python velox/ml_functions/python/indb_ml/tpcds_two_tower_postgres.py
python velox/ml_functions/python/indb_ml/tpcds_two_tower_evadb.py
python velox/ml_functions/python/indb_ml/tpcds_two_tower_pyspark.py
python velox/ml_functions/python/indb_ml/tpcds_two_tower_madlib.py
```

Execute the DL Centric, EvaDB, PySpark, and MADlib baselines of Q2 by running the following four Python files:
```
python velox/ml_functions/python/indb_ml/tpcds_forecasting_postgres.py
python velox/ml_functions/python/indb_ml/tpcds_forecasting_evadb.py
python velox/ml_functions/python/indb_ml/tpcds_forecasting_pyspark.py
python velox/ml_functions/python/indb_ml/tpcds_forecasting_madlib.py
```

Execute the DL Centric, EvaDB, PySpark, and MADlib baselines of Q3 by running the following four Python files:
```
python velox/ml_functions/python/indb_ml/imdb_two_tower_postgres.py
python velox/ml_functions/python/indb_ml/imdb_two_tower_evadb.py
python velox/ml_functions/python/indb_ml/imdb_two_tower_pyspark.py
python velox/ml_functions/python/indb_ml/imdb_two_tower_madlib.py
```

After the above commands, run the Python files whose names start with 'expedia_' in the folder velox/ml_functions/python/indb_ml to execute the in-DB baselines for Q4 (the Expedia workload query). Similarly, run the Python files whose names start with 'flights_' in the same folder to execute the in-DB baselines for Q5 (the Flights workload query).
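Because the Q4 and Q5 baselines are identified by file-name prefix, they can be discovered and run with a glob rather than typed out one by one. This is an illustrative sketch, not part of the repository; the folder and the 'expedia_'/'flights_' prefixes come from this README, and the set of matching files depends on your checkout.

```python
# Sketch: discover the Q4/Q5 in-DB baseline scripts by file-name prefix.
import subprocess  # used only by the commented-out run below
from pathlib import Path

INDB_DIR = "velox/ml_functions/python/indb_ml"

def scripts_for(prefix, folder=INDB_DIR):
    """All Python scripts in folder whose names start with prefix, sorted."""
    return sorted(str(p) for p in Path(folder).glob(f"{prefix}*.py"))

if __name__ == "__main__":
    for prefix in ("expedia_", "flights_"):   # Q4 and Q5 baselines
        for script in scripts_for(prefix):
            print("running", script)
            # subprocess.run(["python", script], check=True)  # inside the repo
```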
If you use this repository, please cite the InferF paper:

```
@misc{chowdhury2025inferfdeclarativefactorizationaiml,
  title={InferF: Declarative Factorization of AI/ML Inferences over Joins},
  author={Kanchan Chowdhury and Lixi Zhou and Lulu Xie and Xinwei Fu and Jia Zou},
  year={2025},
  eprint={2511.20489},
  archivePrefix={arXiv},
  primaryClass={cs.DB},
  url={https://arxiv.org/abs/2511.20489},
}
```