This repository contains Team Titos' solution to the Aggity challenge in UniversityHack 2024.
The project is centered on a bioprocess analytics problem: build a lot-level predictive pipeline from industrial production data and generate final submission files for unseen lots.
The final task in this repository is to predict `PRODUCTO 1` for the test lots in the Cultivo final test sheet and export the results in the competition delivery format:

```
LOTE|PRODUCTO 1
24054|...
24055|...
...
```
The final pipeline implemented in `src/prediccion.py` takes the available production history, engineering signals, process kinetics, component movements, and environmental measurements, converts them into a single modeling table indexed by `lote`, selects features, trains the final model, and writes the submission text file.
This repo is intentionally split into two layers:
- A clean final-project view for public review.
- A phase-by-phase trace of the actual competition work.
- `src/`
  - `prediccion.py`: final end-to-end prediction pipeline
  - `exploracion.py`: exploration-oriented pipeline, optional PostgreSQL export, wider modeling experiments
- `notebooks/`
  - `notebook.ipynb`: final notebook version kept for interactive review
- `docs/`
  - `brief.pdf`: challenge brief
  - `proceso_metodologia.pdf`: methodology document from phase 1
  - `memoria.pdf`: final written report
  - `presentacion_final.pdf`: final presentation in PDF
  - `presentacion_final.pptx`: final presentation deck
  - `index.html`: GitHub Pages entrypoint for browser-friendly presentation viewing
- `phases/phase_1`: final assets from phase 1
- `phases/phase_2`: final assets from phase 2
- `phases/phase_3`: final presentation assets
The phases/ directory is not redundant archive material. It preserves the actual competition structure and the official deliverables produced at each stage.
The final solution works with the datasets stored in `phases/phase_2/data`, which is the canonical input used by the top-level scripts.
- `train/of.xlsx`: manufacturing order information, later reduced to lot-level order and quantity context.
- `train/fases_produccion.xlsx`: multi-sheet production workbook with:
  - Preinoculo
  - Inoculo
  - Cultivo final
- `test/fases_produccion_test.xlsx`: test lots for final prediction.
- `train/cineticos_ipc.xlsx`: kinetic measurements for:
  - inocula
  - final cultures
  - centrifugation
- `train/horas_inicio_fin_centrifugas.xlsx`: start/end timestamps for centrifuge operations.
- `train/movimientos_componentes.xlsx`: component movement records linked to lots.
- `train/temperaturas_humedades.xlsx`: environmental measurements used as contextual process features.
- `train/bioreactor/`: 9 Excel files with reactor time-series data:
  - 3 small-scale bioreactors used for inoculation-related features
  - 6 larger bioreactors used for culture-related features
- `train/centrifugadora/`: 3 Excel files with centrifuge time-series data.
The final production pipeline lives in `src/prediccion.py`. The script is long, but its structure is clear and intentionally staged.
The `importar_*` functions load every source independently:
- manufacturing orders
- each production stage
- small and large bioreactor logs
- kinetic tables
- centrifuge logs
- centrifuge timestamps
- component movements
- temperature and humidity
- test lots from the final culture sheet
At this stage, column names are normalized and key identifiers such as `lote`, `id_of`, `id_bio`, and `id_cent` are cast into stable types.
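For illustration, a loader in this style might look like the sketch below; the file path, the exact column names, and the helper name are assumptions, not the script's real signatures.

```python
import pandas as pd

def importar_ejemplo(path="phases/phase_2/data/train/of.xlsx"):
    """Illustrative loader in the style of the importar_* functions."""
    df = pd.read_excel(path)
    # Normalize column names: trim whitespace, lowercase, underscores.
    df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
    # Cast the lot identifier to a stable string type so later merges
    # never mix int and float representations of the same lot.
    if "lote" in df.columns:
        df["lote"] = df["lote"].astype(str).str.strip()
    return df
```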
The `preproces_*` functions standardize and simplify each source before feature building. This includes:
- selecting relevant columns,
- fixing malformed numeric strings,
- converting timestamps,
- normalizing identifiers,
- deriving cleaner phase-specific views,
- reconstructing parent-child relations for culture lots through `pro_cp_padre`.
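A minimal sketch of what one such cleaning pass can look like; the decimal-comma fix, the timestamp column names, and the function name are illustrative assumptions:

```python
import pandas as pd

def preproces_ejemplo(df):
    """Illustrative cleaning pass in the style of the preproces_* functions."""
    df = df.copy()
    # Fix malformed numeric strings, e.g. decimal commas ("3,14" -> 3.14).
    for col in df.select_dtypes(include="object").columns:
        converted = pd.to_numeric(
            df[col].str.replace(",", ".", regex=False), errors="coerce"
        )
        # Keep the conversion only if the column is mostly numeric.
        if converted.notna().mean() > 0.9:
            df[col] = converted
    # Convert timestamp columns (names here are hypothetical).
    for col in ("fecha_inicio", "fecha_fin"):
        if col in df.columns:
            df[col] = pd.to_datetime(df[col], errors="coerce")
    return df
```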
The `proces_*` functions summarize time-series and event tables into lot-level features. This is one of the most important parts of the project.
Examples:
- `proces_bios(...)`: aggregates bioreactor signals over the production interval associated with each lot.
- `proces_cc_ino(...)` and `proces_cc_cp(...)`: compute summary statistics from kinetic measurements during inoculation and final culture windows.
- `proces_cc_cent(...)`: aggregates centrifugation kinetics by lot, centrifuge, and operation number.
- `proces_cent(...)`: converts centrifuge event streams into lot-aligned process summaries.
These steps turn raw operational traces into statistics such as mean, max, min, std, and median for each relevant lot-phase pairing.
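In pandas terms, an aggregation of this shape is essentially a single groupby; the helper name and signal column names below are assumptions:

```python
import pandas as pd

def aggregate_signals(ts: pd.DataFrame, value_cols, by="lote"):
    """Collapse long-format time-series rows into per-lot summary statistics."""
    agg = ts.groupby(by)[value_cols].agg(["mean", "max", "min", "std", "median"])
    # Flatten the MultiIndex columns: ("ph", "mean") -> "ph_mean".
    agg.columns = [f"{col}_{stat}" for col, stat in agg.columns]
    return agg.reset_index()
```

A call such as `aggregate_signals(bio_logs, ["ph", "temperatura"])` (hypothetical column names) would yield one row per lot with five statistics per signal.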
The merge stage uses `lote` as the main entity and combines:
- order-level information,
- pre-inoculation features,
- inoculation features,
- final culture features,
- centrifugation features,
- environmental features,
- inherited information from parent lots when required.
Important helper blocks:
- `premergear(...)`: builds the unified lot universe.
- `make_inoculo(...)`, `make_cultivo(...)`, `make_centri(...)`: build stage-specific feature blocks.
- `mergear_th(...)`: adds environmental summaries aligned to operation windows.
- `union(...)`: produces the final wide modeling dataframe.
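Schematically, the union step reduces to a chain of left joins on `lote`; the names in this sketch are placeholders, not the script's real objects:

```python
def union_ejemplo(lotes, bloques):
    """Start from the lot universe and left-join each feature block on lote."""
    tabla = lotes.copy()
    for bloque in bloques:
        tabla = tabla.merge(bloque, on="lote", how="left")
    return tabla
```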
`desunion(...)` separates the unified table into:

- training rows with known `pro_cp_prod1`
- test rows corresponding to the lots present in `fases_produccion_test.xlsx`
The split is done by lot membership, not by random row slicing.
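Roughly, the split works like the following sketch; only the target column name `pro_cp_prod1` is taken from the script itself, the rest is illustrative:

```python
def desunion_ejemplo(tabla, lotes_test):
    """Split the unified table into train and test by lot membership."""
    es_test = tabla["lote"].isin(lotes_test)
    train = tabla[~es_test & tabla["pro_cp_prod1"].notna()]
    test = tabla[es_test]
    return train, test
```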
The final path applies several filters to reduce noise and stabilize the model:
- `statistical_select(...)`
- `handle_data_issues(...)`
- `lasso_select(...)`
- `rfr_select(...)`
In practice, the script:
- removes low-value or unstable columns,
- handles missing values,
- applies outlier treatment,
- uses Lasso-based selection,
- then applies Random Forest RFECV-style selection on the reduced feature space.
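As a rough illustration of the Lasso-based step, keeping only features with a non-zero L1 coefficient looks like this; whether `lasso_select(...)` is implemented exactly this way is an assumption (X is taken to be a pandas dataframe):

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

def lasso_select_ejemplo(X, y):
    """Keep only the features that survive L1 regularization."""
    # Scaling matters for Lasso: the penalty is applied per coefficient.
    Xs = StandardScaler().fit_transform(X)
    lasso = LassoCV(cv=5, random_state=0).fit(Xs, y)
    # Features whose coefficient was shrunk to (near) zero are dropped.
    return X.columns[np.abs(lasso.coef_) > 1e-8]
```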
The final `training(...)` function in `src/prediccion.py` keeps the production choice intentionally narrow:

- it fits `LassoCV`,
- keeps the best `alpha`,
- evaluates the resulting Lasso model with 5-fold cross-validation using RMSE and MAPE.
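A compact sketch of that training recipe, using scikit-learn's built-in scorers (the function body is illustrative, not a copy of `training(...)`):

```python
from sklearn.linear_model import Lasso, LassoCV
from sklearn.model_selection import cross_validate

def training_ejemplo(X, y):
    """Fit LassoCV, keep the best alpha, report 5-fold RMSE and MAPE."""
    alpha = LassoCV(cv=5, random_state=0).fit(X, y).alpha_
    model = Lasso(alpha=alpha)
    scores = cross_validate(
        model, X, y, cv=5,
        scoring=("neg_root_mean_squared_error",
                 "neg_mean_absolute_percentage_error"),
    )
    rmse = -scores["test_neg_root_mean_squared_error"].mean()
    mape = -scores["test_neg_mean_absolute_percentage_error"].mean()
    return model, rmse, mape
```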
Then `predicting(...)`:

- fits the chosen model on all training data,
- predicts `PRODUCTO 1` for the test lots,
- writes the official submission file to `phases/phase_2/deliverables/generated`.
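Since the delivery format is just a pipe-separated two-column text file, the writing step can be as small as this sketch (the output filename is an assumption based on the deliverables listed further below):

```python
def predicting_ejemplo(model, X_train, y_train, X_test, lotes_test):
    """Fit on all training data and write the delivery-format file."""
    model.fit(X_train, y_train)
    predicciones = model.predict(X_test)
    out = "phases/phase_2/deliverables/generated/Titos_UH2024.txt"
    with open(out, "w") as f:
        f.write("LOTE|PRODUCTO 1\n")
        for lote, pred in zip(lotes_test, predicciones):
            f.write(f"{lote}|{pred}\n")
```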
Even though the final script uses Lasso only, `src/exploracion.py` still contains a broader benchmark function, `multiple_training(...)`, with experiments across:
- Linear Regression
- Ridge
- Lasso
- Elastic Net
- SVR
- KNN
- Decision Tree
- Random Forest
- Gradient Boosting
- XGBoost
- LightGBM
- CatBoost
- a simple voting ensemble
That script is useful to understand the wider exploration space even though it is not the final public entrypoint.
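For reference, a benchmark of that shape usually boils down to a loop like the sketch below (model list trimmed; matching the actual `multiple_training(...)` signature is not implied):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

def benchmark_ejemplo(X, y):
    """Compare candidate regressors with 5-fold cross-validated RMSE."""
    modelos = {
        "linear": LinearRegression(),
        "ridge": Ridge(),
        "lasso": Lasso(),
        "random_forest": RandomForestRegressor(random_state=0),
    }
    return {
        nombre: -cross_val_score(
            modelo, X, y, cv=5, scoring="neg_root_mean_squared_error"
        ).mean()
        for nombre, modelo in modelos.items()
    }
```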
If you only want the shortest path through the repo: read this README, then run `src/prediccion.py` and inspect the generated submission file.

If you want to follow the competition chronologically: walk through `phases/phase_1`, `phases/phase_2`, and `phases/phase_3` in order, reading the deliverables of each stage.
The repository keeps the original final competition artifacts:
- `phases/phase_1/deliverables/submission.zip`
- `phases/phase_1/deliverables/Titos_UH2024.txt`
- `phases/phase_2/deliverables/submission.zip`
- `phases/phase_2/deliverables/Titos_UH2024.txt`
- `phases/phase_3/presentations/presentacion.pdf`
- `phases/phase_3/presentations/presentacion.pptx`
The final presentation can be accessed in three ways:
- GitHub Pages entrypoint: https://pabloferrergonzalez333.github.io/aggity/
- final presentation PDF: `docs/presentacion_final.pdf`
- final presentation PPTX: `docs/presentacion_final.pptx`
The GitHub Pages viewer is implemented in `docs/index.html`.
Install dependencies:

```bash
pip install -r requirements.txt
```

Main entrypoint:

```bash
python src/prediccion.py
```

This will generate a prediction file under `phases/phase_2/deliverables/generated/`.
Some exploration flows include an optional PostgreSQL export through `to_sql(...)`, mainly for external inspection and dashboarding. No credentials are stored in the repository.
To enable that path, copy `.env.example` to `.env` or define:

- `AGGITY_DB_HOST`
- `AGGITY_DB_PORT`
- `AGGITY_DB_NAME`
- `AGGITY_DB_USER`
- `AGGITY_DB_PASSWORD`
If you are only interested in the predictive pipeline, this configuration is not required.
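For completeness, a minimal sketch of that optional export, assuming the environment variables above are set and `df` is a pandas dataframe (the table name is hypothetical):

```python
import os

import pandas as pd
from sqlalchemy import create_engine

def export_ejemplo(df: pd.DataFrame, table: str = "aggity_features"):
    """Optional export of a modeling table to PostgreSQL via to_sql."""
    url = (
        f"postgresql+psycopg2://{os.environ['AGGITY_DB_USER']}"
        f":{os.environ['AGGITY_DB_PASSWORD']}"
        f"@{os.environ['AGGITY_DB_HOST']}:{os.environ['AGGITY_DB_PORT']}"
        f"/{os.environ['AGGITY_DB_NAME']}"
    )
    engine = create_engine(url)
    df.to_sql(table, engine, if_exists="replace", index=False)
```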
The repository currently declares the following Python dependencies in requirements.txt:
- pandas
- numpy
- scipy
- scikit-learn
- xgboost
- lightgbm
- catboost
- shap
- matplotlib
- SQLAlchemy
- psycopg2-binary
This project is released under the MIT License. See LICENSE.