This repository contains the solution for the Hackapizza 2025 coding challenge, organized by DataPizza. The challenge involved developing a Generative AI-powered assistant to help intergalactic travelers navigate a complex culinary landscape.
As detailed in `challenge_description.md`, the core task was to build an AI system capable of:
- Interpreting natural language queries about food preferences and restrictions.
- Processing information from diverse sources like restaurant menus, blog posts, galactic laws, and cooking manuals.
- Suggesting appropriate dishes based on user requests.
- Verifying dish compliance with (simulated) galactic regulations.
- Utilizing Generative AI techniques such as Retrieval Augmented Generation (RAG) and AI Agents.
Our solution leverages a knowledge graph and language models to address the challenge. The system is built around three main Python scripts:
- `parsing.py`:
  - Responsible for extracting structured information from various unstructured data sources, primarily PDF documents (e.g., restaurant menus, culinary manuals).
  - Utilizes an OpenAI language model (e.g., `gpt-4o-mini`) to understand and parse the content of these documents.
  - Outputs the extracted data into JSON files, preparing it for ingestion into the knowledge graph.
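The extraction step can be sketched roughly as follows. This is a minimal illustration of the pattern (raw document text in, strict JSON out), not the actual code in `parsing.py`; the prompt wording, JSON schema, and the `extract_menu` helper are assumptions.

```python
import json

# Illustrative extraction prompt: the real prompt in parsing.py will differ,
# but the pattern (raw menu text in, strict JSON out) is the same.
EXTRACTION_PROMPT = """Extract every dish from the menu text below.
Return ONLY a JSON object shaped like:
{{"restaurant": str, "dishes": [{{"name": str, "ingredients": [str], "techniques": [str]}}]}}

Menu text:
{menu_text}"""

def extract_menu(menu_text: str, model: str = "gpt-4o-mini") -> dict:
    """Send one menu's raw text to the LLM and parse its JSON reply."""
    from openai import OpenAI  # imported lazily; requires OPENAI_API_KEY

    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": EXTRACTION_PROMPT.format(menu_text=menu_text)}],
        response_format={"type": "json_object"},  # force the reply to be valid JSON
    )
    return json.loads(response.choices[0].message.content)
```

The `response_format={"type": "json_object"}` option asks the API to emit syntactically valid JSON, which keeps the downstream `json.loads` from failing on free-form prose.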
- `graph_construction.py`:
  - Builds a Neo4j graph database from the structured data generated by `parsing.py` and other provided data files (e.g., CSVs for planet distances).
  - Creates nodes representing entities like Restaurants, Dishes, Ingredients, Chefs, Planets, Culinary Techniques, and Licenses.
  - Establishes relationships between these nodes to capture the complex connections within the culinary universe (e.g., a `Chef` `WORKS_AT` a `Restaurant`, a `Dish` `CONTAINS` an `Ingredient`, a `Restaurant` `IS_LOCATED_ON` a `Planet`).
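A sketch of what this loading step can look like. `CONTAINS` and `IS_LOCATED_ON` are the relationship types named above; the `SERVES` relationship, the property keys, and the `load_menu` helper are illustrative assumptions, not the script's actual code.

```python
# Parameterized Cypher statements. MERGE makes loading idempotent:
# re-running the script matches existing nodes instead of duplicating them.
UPSERT_RESTAURANT = """
MERGE (r:Restaurant {name: $restaurant})
MERGE (p:Planet {name: $planet})
MERGE (r)-[:IS_LOCATED_ON]->(p)
"""

UPSERT_DISH = """
MATCH (r:Restaurant {name: $restaurant})
MERGE (d:Dish {name: $dish})
MERGE (r)-[:SERVES]->(d)
"""

ADD_INGREDIENT = """
MATCH (d:Dish {name: $dish})
MERGE (i:Ingredient {name: $ingredient})
MERGE (d)-[:CONTAINS]->(i)
"""

def load_menu(driver, restaurant: str, planet: str, dishes: list[dict]) -> None:
    """Write one parsed menu into Neo4j via an already-open neo4j.Driver."""
    with driver.session() as session:
        session.run(UPSERT_RESTAURANT, restaurant=restaurant, planet=planet)
        for dish in dishes:
            session.run(UPSERT_DISH, restaurant=restaurant, dish=dish["name"])
            for ingredient in dish.get("ingredients", []):
                session.run(ADD_INGREDIENT, dish=dish["name"], ingredient=ingredient)
```

Using `MERGE` rather than `CREATE` is what allows the construction script's flags to re-run individual node/relationship passes without duplicating existing data.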
- `graph_retrieval.py`:
  - Implements the core RAG (Retrieval Augmented Generation) pipeline for answering user queries.
  - Takes a natural language question as input (from `data/domande.csv` for evaluation).
  - Uses a large language model (e.g., DeepSeek or OpenAI models) to translate the natural language question into a Cypher query tailored to the Neo4j graph schema.
  - Executes the generated Cypher query against the Neo4j database to retrieve relevant dishes.
  - Maps the retrieved dish names to their corresponding IDs using `data/Misc/dish_mapping.json`.
  - Includes an evaluation component that calculates the Jaccard similarity between the system's output and a ground truth dataset (`solution/ground_truth_mapped.csv`).
  - Saves a detailed report of the queries, generated Cypher, results, and performance metrics to `report/report.json`.
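The name-to-ID mapping step can be sketched like this. It assumes `dish_mapping.json` is a flat `{dish name -> id}` dictionary; the helper name and the case-insensitive matching are illustrative choices, not necessarily what the script does.

```python
def map_dishes_to_ids(dish_names: list[str], mapping: dict[str, int]) -> list[int]:
    """Translate dish names retrieved from the graph into submission IDs.

    `mapping` is assumed to be the contents of data/Misc/dish_mapping.json
    as a flat {dish name -> id} dict. Matching is case-insensitive to be
    robust to small differences between graph values and the official names.
    Unknown dishes are silently skipped.
    """
    lookup = {name.lower(): dish_id for name, dish_id in mapping.items()}
    return [lookup[name.lower()] for name in dish_names if name.lower() in lookup]
```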
```
.
├── data/                    # Input data files (CSVs, PDFs, JSONs)
│   ├── Blogpost/
│   ├── Codice Galattico/
│   ├── Manuale/
│   ├── Misc/
│   ├── Ristoranti/
│   ├── domande.csv
│   └── ...
├── images/                  # Images used in documentation
├── manual_licenses/         # Parsed license information from manuals
├── report/                  # Output reports (e.g., report.json)
├── restaurant_licenses/     # Parsed license information for restaurants
├── restaurant_menu/         # Parsed menu information
├── restaurant_planet/       # Parsed planet information for restaurants
├── solution/                # Ground truth solution files
│   └── ground_truth_mapped.csv
├── .gitignore
├── challenge_description.md # Description of the Hackapizza challenge
├── graph_construction.py    # Script to build the Neo4j graph
├── graph_retrieval.py       # Script for querying the graph and RAG pipeline
├── parsing.py               # Script for parsing input documents
├── requirements.txt         # Python dependencies (ensure this is created/updated)
└── README.md                # This file
```
- **Prerequisites**:
  - Python 3.x
  - A running Neo4j instance.
  - API keys for OpenAI and/or DeepSeek (depending on the LLM used in `graph_retrieval.py` and `parsing.py`).
- **Setup**:
  - Clone the repository.
  - Install Python dependencies: `pip install -r requirements.txt`
  - Set up the following environment variables:
    - `NEO4J_URI`: The URI for your Neo4j instance (e.g., `bolt://localhost:7687`)
    - `NEO4J_USERNAME`: Your Neo4j username
    - `NEO4J_PASSWORD`: Your Neo4j password
    - `OPENAI_API_KEY`: Your OpenAI API key (used in `parsing.py` and potentially `graph_retrieval.py`)
    - `DEEPSEEK_API_KEY`: Your DeepSeek API key (used in `graph_retrieval.py`)
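For example, on Linux/macOS the variables can be exported in your shell before running the scripts (the values below are placeholders):

```shell
# Example values only; substitute your own credentials and keys.
export NEO4J_URI="bolt://localhost:7687"
export NEO4J_USERNAME="neo4j"
export NEO4J_PASSWORD="your-password"
export OPENAI_API_KEY="sk-..."
export DEEPSEEK_API_KEY="sk-..."
```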
- **Data Parsing**:
  - Configure the flags in `parsing.py` to select which documents to parse.
  - Run the parsing script: `python parsing.py`
  - This will generate JSON files in directories like `restaurant_menu/`, `restaurant_planet/`, `restaurant_licenses/`, and `manual_licenses/`.
- **Graph Construction**:
  - Configure the flags in `graph_construction.py` to control which nodes and relationships are created/updated.
  - Run the graph construction script: `python graph_construction.py`
  - This will populate your Neo4j database.
- **Querying and Evaluation**:
  - Ensure `data/domande.csv` contains the questions to be processed.
  - Run the graph retrieval and evaluation script: `python graph_retrieval.py`
  - This will process the questions, query the graph, and generate `report/report.json` with the results and Jaccard similarity scores. The script will also produce a CSV file ready for submission to the Kaggle competition.
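The Jaccard similarity used for scoring compares the set of predicted dish IDs against the ground-truth set for each question. A minimal version (the empty-vs-empty convention here is an assumption; the script's exact handling may differ):

```python
def jaccard_similarity(predicted: set[int], expected: set[int]) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| between predicted and ground-truth IDs.

    Two empty sets are treated as a perfect match here, a common convention
    when a question legitimately has no matching dishes.
    """
    if not predicted and not expected:
        return 1.0
    return len(predicted & expected) / len(predicted | expected)
```

For example, predicting `{1, 2, 3}` against a ground truth of `{2, 3, 4}` scores 2/4 = 0.5.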
- **Retrieval Augmented Generation (RAG)**: The core of the `graph_retrieval.py` script involves using an LLM to generate Cypher queries (generation) based on the user's question and the graph schema, then retrieving data from the Neo4j knowledge graph (retrieval) to answer the question.
- **LLM-based Data Extraction**: `parsing.py` uses an LLM to understand and extract structured information from unstructured PDF documents.
- **Knowledge Graphs**: A Neo4j graph is used to store and relate complex information about the culinary universe, enabling sophisticated querying.
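The text-to-Cypher step can be sketched as below. The prompt wording and the `build_cypher_prompt` helper are illustrative, not the script's actual prompt; the essential structure is that both the graph schema and the question are handed to the model, which returns only Cypher.

```python
# Illustrative text-to-Cypher prompt: schema + question in, bare Cypher out.
CYPHER_PROMPT = """You are a Neo4j expert. Given the graph schema below,
write a single Cypher query that answers the user's question.
Return only the Cypher query, with no explanation.

Schema:
{schema}

Question:
{question}"""

def build_cypher_prompt(schema: str, question: str) -> str:
    """Assemble the prompt sent to the LLM for Cypher generation."""
    return CYPHER_PROMPT.format(schema=schema, question=question)
```

Grounding the model in the actual schema is what keeps the generated queries aligned with the node labels and relationship types that exist in the graph.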
This project demonstrates an approach to building a sophisticated AI assistant by combining the strengths of large language models and knowledge graphs.
