This repository contains Team Titos' solution to the Aggity challenge in UniversityHack 2024.
The project is centered on a bioprocess analytics problem: build a lot-level predictive pipeline from industrial production data and generate final submission files for unseen lots.
The final task in this repository is to predict `PRODUCTO 1` for the test lots in the Cultivo final test sheet and export the results in the competition delivery format:

```
LOTE|PRODUCTO 1
24054|...
24055|...
...
```
The final pipeline implemented in `src/prediccion.py` takes the available production history, engineering signals, process kinetics, component movements, and environmental measurements, converts them into a single modeling table indexed by `lote`, selects features, trains the final model, and writes the submission text file.
This repo is intentionally split into two layers:
- A clean final-project view for public review.
- A phase-by-phase trace of the actual competition work.
- `src/`
  - `prediccion.py`: final end-to-end prediction pipeline
  - `exploracion.py`: exploration-oriented pipeline, optional PostgreSQL export, wider modeling experiments
- `notebooks/`
  - `notebook.ipynb`: final notebook version kept for interactive review
- `docs/`
  - `brief.pdf`: challenge brief
  - `proceso_metodologia.pdf`: methodology document from phase 1
  - `memoria.pdf`: final written report
  - `presentacion_final.pdf`: final presentation in PDF
  - `presentacion_final.pptx`: final presentation deck
  - `index.html`: GitHub Pages entrypoint for browser-friendly presentation viewing
- `phases/phase_1`: final assets from phase 1
- `phases/phase_2`: final assets from phase 2
- `phases/phase_3`: final presentation assets
The phases/ directory is not redundant archive material. It preserves the actual competition structure and the official deliverables produced at each stage.
The final solution works with the datasets stored in `phases/phase_2/data`, which is the canonical input used by the top-level scripts.
- `train/of.xlsx`: manufacturing order information, later reduced to lot-level order and quantity context.
- `train/fases_produccion.xlsx`: multi-sheet production workbook with:
  - Preinoculo
  - Inoculo
  - Cultivo final
- `test/fases_produccion_test.xlsx`: test lots for final prediction.
- `train/cineticos_ipc.xlsx`: kinetic measurements for:
  - inocula
  - final cultures
  - centrifugation
- `train/horas_inicio_fin_centrifugas.xlsx`: start/end timestamps for centrifuge operations.
- `train/movimientos_componentes.xlsx`: component movement records linked to lots.
- `train/temperaturas_humedades.xlsx`: environmental measurements used as contextual process features.
- `train/bioreactor/`: 9 Excel files with reactor time-series data:
  - 3 small-scale bioreactors used for inoculation-related features
  - 6 larger bioreactors used for culture-related features
- `train/centrifugadora/`: 3 Excel files with centrifuge time-series data.
The final production pipeline lives in `src/prediccion.py`. The script is long, but its structure is clear and intentionally staged.
The `importar_*` functions load every source independently:
- manufacturing orders
- each production stage
- small and large bioreactor logs
- kinetic tables
- centrifuge logs
- centrifuge timestamps
- component movements
- temperature and humidity
- test lots from the final culture sheet
At this stage, column names are normalized and key identifiers such as `lote`, `id_of`, `id_bio`, and `id_cent` are cast into stable types.
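For illustration, a loader in this style might look like the sketch below; the file path, the exact column names, and the helper name are assumptions, not the script's real signatures.

```python
import pandas as pd

def importar_ejemplo(path="phases/phase_2/data/train/of.xlsx"):
    """Illustrative loader in the style of the importar_* functions."""
    df = pd.read_excel(path)
    # Normalize column names: trim whitespace, lowercase, underscores.
    df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
    # Cast the lot identifier to a stable string type so later merges
    # never mix int and float representations of the same lot.
    if "lote" in df.columns:
        df["lote"] = df["lote"].astype(str).str.strip()
    return df
```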
The `preproces_*` functions standardize and simplify each source before feature building. This includes:
- selecting relevant columns,
- fixing malformed numeric strings,
- converting timestamps,
- normalizing identifiers,
- deriving cleaner phase-specific views,
- reconstructing parent-child relations for culture lots through `pro_cp_padre`.
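A minimal sketch of what one such cleaning pass can look like; the decimal-comma fix, the timestamp column names, and the function name are illustrative assumptions:

```python
import pandas as pd

def preproces_ejemplo(df):
    """Illustrative cleaning pass in the style of the preproces_* functions."""
    df = df.copy()
    # Fix malformed numeric strings, e.g. decimal commas ("3,14" -> 3.14).
    for col in df.select_dtypes(include="object").columns:
        converted = pd.to_numeric(
            df[col].str.replace(",", ".", regex=False), errors="coerce"
        )
        # Keep the conversion only if the column is mostly numeric.
        if converted.notna().mean() > 0.9:
            df[col] = converted
    # Convert timestamp columns (names here are hypothetical).
    for col in ("fecha_inicio", "fecha_fin"):
        if col in df.columns:
            df[col] = pd.to_datetime(df[col], errors="coerce")
    return df
```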
The `proces_*` functions summarize time-series and event tables into lot-level features. This is one of the most important parts of the project.
Examples:
- `proces_bios(...)`: aggregates bioreactor signals over the production interval associated with each lot.
- `proces_cc_ino(...)` and `proces_cc_cp(...)`: compute summary statistics from kinetic measurements during inoculation and final culture windows.
- `proces_cc_cent(...)`: aggregates centrifugation kinetics by lot, centrifuge, and operation number.
- `proces_cent(...)`: converts centrifuge event streams into lot-aligned process summaries.
These steps turn raw operational traces into statistics such as mean, max, min, std, and median for each relevant lot-phase pairing.
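In pandas terms, an aggregation of this shape is essentially a single groupby; the helper name and signal column names below are assumptions:

```python
import pandas as pd

def aggregate_signals(ts: pd.DataFrame, value_cols, by="lote"):
    """Collapse long-format time-series rows into per-lot summary statistics."""
    agg = ts.groupby(by)[value_cols].agg(["mean", "max", "min", "std", "median"])
    # Flatten the MultiIndex columns: ("ph", "mean") -> "ph_mean".
    agg.columns = [f"{col}_{stat}" for col, stat in agg.columns]
    return agg.reset_index()
```

A call such as `aggregate_signals(bio_logs, ["ph", "temperatura"])` (hypothetical column names) would yield one row per lot with five statistics per signal.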
The merge stage uses `lote` as the main entity and combines:
- order-level information,
- pre-inoculation features,
- inoculation features,
- final culture features,
- centrifugation features,
- environmental features,
- inherited information from parent lots when required.
Important helper blocks:
- `premergear(...)`: builds the unified lot universe.
- `make_inoculo(...)`, `make_cultivo(...)`, `make_centri(...)`: build stage-specific feature blocks.
- `mergear_th(...)`: adds environmental summaries aligned to operation windows.
- `union(...)`: produces the final wide modeling dataframe.
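Schematically, the union step reduces to a chain of left joins on `lote`; the names in this sketch are placeholders, not the script's real objects:

```python
def union_ejemplo(lotes, bloques):
    """Start from the lot universe and left-join each feature block on lote."""
    tabla = lotes.copy()
    for bloque in bloques:
        tabla = tabla.merge(bloque, on="lote", how="left")
    return tabla
```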
`desunion(...)` separates the unified table into:

- training rows with known `pro_cp_prod1`
- test rows corresponding to the lots present in `fases_produccion_test.xlsx`
The split is done by lot membership, not by random row slicing.
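Roughly, the split works like the following sketch; only the target column name `pro_cp_prod1` is taken from the script itself, the rest is illustrative:

```python
def desunion_ejemplo(tabla, lotes_test):
    """Split the unified table into train and test by lot membership."""
    es_test = tabla["lote"].isin(lotes_test)
    train = tabla[~es_test & tabla["pro_cp_prod1"].notna()]
    test = tabla[es_test]
    return train, test
```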
The final path applies several filters to reduce noise and stabilize the model:
- `statistical_select(...)`
- `handle_data_issues(...)`
- `lasso_select(...)`
- `rfr_select(...)`
In practice, the script:
- removes low-value or unstable columns,
- handles missing values,
- applies outlier treatment,
- uses Lasso-based selection,
- then applies Random Forest RFECV-style selection on the reduced feature space.
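As a rough illustration of the Lasso-based step, keeping only features with a non-zero L1 coefficient looks like this; whether `lasso_select(...)` is implemented exactly this way is an assumption (X is taken to be a pandas dataframe):

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

def lasso_select_ejemplo(X, y):
    """Keep only the features that survive L1 regularization."""
    # Scaling matters for Lasso: the penalty is applied per coefficient.
    Xs = StandardScaler().fit_transform(X)
    lasso = LassoCV(cv=5, random_state=0).fit(Xs, y)
    # Features whose coefficient was shrunk to (near) zero are dropped.
    return X.columns[np.abs(lasso.coef_) > 1e-8]
```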
The final `training(...)` function in `src/prediccion.py` keeps the production choice intentionally narrow:

- it fits `LassoCV`,
- keeps the best `alpha`,
- evaluates the resulting Lasso model with 5-fold cross-validation using RMSE and MAPE.
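A compact sketch of that training recipe, using scikit-learn's built-in scorers (the function body is illustrative, not a copy of `training(...)`):

```python
from sklearn.linear_model import Lasso, LassoCV
from sklearn.model_selection import cross_validate

def training_ejemplo(X, y):
    """Fit LassoCV, keep the best alpha, report 5-fold RMSE and MAPE."""
    alpha = LassoCV(cv=5, random_state=0).fit(X, y).alpha_
    model = Lasso(alpha=alpha)
    scores = cross_validate(
        model, X, y, cv=5,
        scoring=("neg_root_mean_squared_error",
                 "neg_mean_absolute_percentage_error"),
    )
    rmse = -scores["test_neg_root_mean_squared_error"].mean()
    mape = -scores["test_neg_mean_absolute_percentage_error"].mean()
    return model, rmse, mape
```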
Then `predicting(...)`:

- fits the chosen model on all training data,
- predicts `PRODUCTO 1` for the test lots,
- writes the official submission file to `phases/phase_2/deliverables/generated`.
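Since the delivery format is just a pipe-separated two-column text file, the writing step can be as small as this sketch (the output filename is an assumption based on the deliverables listed further below):

```python
def predicting_ejemplo(model, X_train, y_train, X_test, lotes_test):
    """Fit on all training data and write the delivery-format file."""
    model.fit(X_train, y_train)
    predicciones = model.predict(X_test)
    out = "phases/phase_2/deliverables/generated/Titos_UH2024.txt"
    with open(out, "w") as f:
        f.write("LOTE|PRODUCTO 1\n")
        for lote, pred in zip(lotes_test, predicciones):
            f.write(f"{lote}|{pred}\n")
```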
Even though the final script uses Lasso only, `src/exploracion.py` still contains a broader benchmark function, `multiple_training(...)`, with experiments across:
- Linear Regression
- Ridge
- Lasso
- Elastic Net
- SVR
- KNN
- Decision Tree
- Random Forest
- Gradient Boosting
- XGBoost
- LightGBM
- CatBoost
- a simple voting ensemble
That script is useful to understand the wider exploration space even though it is not the final public entrypoint.
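For reference, a benchmark of that shape usually boils down to a loop like the sketch below (model list trimmed; matching the actual `multiple_training(...)` signature is not implied):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

def benchmark_ejemplo(X, y):
    """Compare candidate regressors with 5-fold cross-validated RMSE."""
    modelos = {
        "linear": LinearRegression(),
        "ridge": Ridge(),
        "lasso": Lasso(),
        "random_forest": RandomForestRegressor(random_state=0),
    }
    return {
        nombre: -cross_val_score(
            modelo, X, y, cv=5, scoring="neg_root_mean_squared_error"
        ).mean()
        for nombre, modelo in modelos.items()
    }
```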
If you only want the shortest path through the repo: read this README, then run `src/prediccion.py` and inspect the generated submission file.

If you want to follow the competition chronologically: walk through `phases/phase_1`, `phases/phase_2`, and `phases/phase_3` in order, reading the deliverables of each stage.
The repository keeps the original final competition artifacts:
- `phases/phase_1/deliverables/submission.zip`
- `phases/phase_1/deliverables/Titos_UH2024.txt`
- `phases/phase_2/deliverables/submission.zip`
- `phases/phase_2/deliverables/Titos_UH2024.txt`
- `phases/phase_3/presentations/presentacion.pdf`
- `phases/phase_3/presentations/presentacion.pptx`
The final presentation can be accessed in three ways:
- GitHub Pages entrypoint: https://pabloferrergonzalez333.github.io/aggity/
- final presentation PDF: `docs/presentacion_final.pdf`
- final presentation PPTX: `docs/presentacion_final.pptx`
The GitHub Pages viewer is implemented in `docs/index.html`.
Install dependencies:

```bash
pip install -r requirements.txt
```

Main entrypoint:

```bash
python src/prediccion.py
```

This will generate a prediction file under `phases/phase_2/deliverables/generated/`.
Some exploration flows include an optional PostgreSQL export through `to_sql(...)`, mainly for external inspection and dashboarding. No credentials are stored in the repository.
To enable that path, copy `.env.example` to `.env` or define:

- `AGGITY_DB_HOST`
- `AGGITY_DB_PORT`
- `AGGITY_DB_NAME`
- `AGGITY_DB_USER`
- `AGGITY_DB_PASSWORD`
If you are only interested in the predictive pipeline, this configuration is not required.
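For completeness, a minimal sketch of that optional export, assuming the environment variables above are set and `df` is a pandas dataframe (the table name is hypothetical):

```python
import os

import pandas as pd
from sqlalchemy import create_engine

def export_ejemplo(df: pd.DataFrame, table: str = "aggity_features"):
    """Optional export of a modeling table to PostgreSQL via to_sql."""
    url = (
        f"postgresql+psycopg2://{os.environ['AGGITY_DB_USER']}"
        f":{os.environ['AGGITY_DB_PASSWORD']}"
        f"@{os.environ['AGGITY_DB_HOST']}:{os.environ['AGGITY_DB_PORT']}"
        f"/{os.environ['AGGITY_DB_NAME']}"
    )
    engine = create_engine(url)
    df.to_sql(table, engine, if_exists="replace", index=False)
```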
The repository currently declares the following Python dependencies in requirements.txt:
- pandas
- numpy
- scipy
- scikit-learn
- xgboost
- lightgbm
- catboost
- shap
- matplotlib
- SQLAlchemy
- psycopg2-binary
This project is released under the MIT License. See LICENSE.