Aggity | UniversityHack 2024

This repository contains Team Titos' solution to the Aggity challenge in UniversityHack 2024.

The project is centered on a bioprocess analytics problem: build a lot-level predictive pipeline from industrial production data and generate final submission files for unseen lots.

Challenge Goal

The final task in this repository is to predict PRODUCTO 1 for the test lots in the Cultivo final test sheet and export the results in the competition delivery format:

LOTE|PRODUCTO 1
24054|...
24055|...
...

The final pipeline implemented in src/prediccion.py takes the available production history, engineering signals, process kinetics, component movements, and environmental measurements, converts them into a single modeling table indexed by lote, selects features, trains the final model, and writes the submission text file.
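The delivery format above is just a pipe-separated text file. As an illustrative sketch (the lot ids and values here are placeholders, and the output filename is an assumption, not the script's actual path), writing it with pandas looks like:

```python
import pandas as pd

# Hypothetical predictions for two test lots (values are illustrative).
preds = pd.DataFrame({
    "LOTE": [24054, 24055],
    "PRODUCTO 1": [1234.5, 1310.2],
})

# Write the pipe-separated competition delivery format.
preds.to_csv("submission.txt", sep="|", index=False)
```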

What Is In This Repository

This repo is intentionally split into two layers:

  1. A clean final-project view for public review.
  2. A phase-by-phase trace of the actual competition work.

Final Project View

Competition Traceability

The phases/ directory is not redundant archive material. It preserves the actual competition structure and the official deliverables produced at each stage.

Available Data

The final solution works with the datasets stored in phases/phase_2/data, which is the canonical input used by the top-level scripts.

Core tabular sources

  • train/of.xlsx Manufacturing order information, later reduced to lot-level order and quantity context.
  • train/fases_produccion.xlsx Multi-sheet production workbook with:
    • Preinoculo
    • Inoculo
    • Cultivo final
  • test/fases_produccion_test.xlsx Test lots for final prediction.
  • train/cineticos_ipc.xlsx Kinetic measurements for:
    • inocula
    • final cultures
    • centrifugation
  • train/horas_inicio_fin_centrifugas.xlsx Start/end timestamps for centrifuge operations.
  • train/movimientos_componentes.xlsx Component movement records linked to lots.
  • train/temperaturas_humedades.xlsx Environmental measurements used as contextual process features.

Time-series process logs

  • train/bioreactor/ 9 Excel files with reactor time-series data:
    • 3 small-scale bioreactors used for inoculation-related features
    • 6 larger bioreactors used for culture-related features
  • train/centrifugadora/ 3 Excel files with centrifuge time-series data

Final Pipeline

The final production pipeline lives in src/prediccion.py. The script is long, but its structure is clear and intentionally staged.

1. Import

The importar_* functions load every source independently:

  • manufacturing orders
  • each production stage
  • small and large bioreactor logs
  • kinetic tables
  • centrifuge logs
  • centrifuge timestamps
  • component movements
  • temperature and humidity
  • test lots from the final culture sheet

At this stage, column names are normalized and key identifiers such as lote, id_of, id_bio, and id_cent are cast into stable types.
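A minimal sketch of that normalization step, assuming pandas DataFrames (the helper names `normalize_columns` and `cast_ids` are illustrative, not the script's actual functions):

```python
import pandas as pd

def normalize_columns(df: pd.DataFrame) -> pd.DataFrame:
    # Strip, lower-case, and snake-case column names so every source
    # exposes the same identifiers (e.g. " LOTE " -> "lote").
    df = df.copy()
    df.columns = (
        df.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)
    )
    return df

def cast_ids(df: pd.DataFrame,
             id_cols=("lote", "id_of", "id_bio", "id_cent")) -> pd.DataFrame:
    # Cast key identifiers to a stable string form so later merges do not
    # silently fail on int/float/object dtype mismatches; drop the ".0"
    # suffix that float-typed id columns pick up in Excel imports.
    df = df.copy()
    for col in id_cols:
        if col in df.columns:
            df[col] = (
                df[col].astype("string").str.strip()
                .str.replace(r"\.0$", "", regex=True)
            )
    return df
```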

2. Preprocess

The preproces_* functions standardize and simplify each source before feature building. This includes:

  • selecting relevant columns,
  • fixing malformed numeric strings,
  • converting timestamps,
  • normalizing identifiers,
  • deriving cleaner phase-specific views,
  • reconstructing parent-child relations for culture lots through pro_cp_padre.
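Two of those cleaning steps can be sketched as follows, assuming European-formatted numbers ("1.234,5") and day-first timestamps, which is a guess about the raw data, not a confirmed property of it:

```python
import pandas as pd

def clean_numeric(series: pd.Series) -> pd.Series:
    # Fix malformed numeric strings such as "1.234,5" (dot thousands,
    # comma decimals); anything unparseable becomes NaN.
    s = series.astype("string").str.strip()
    s = s.str.replace(".", "", regex=False).str.replace(",", ".", regex=False)
    return pd.to_numeric(s, errors="coerce")

def clean_timestamp(series: pd.Series) -> pd.Series:
    # Parse day-first timestamps (e.g. "31/01/2024 08:15"); bad rows -> NaT.
    return pd.to_datetime(series, dayfirst=True, errors="coerce")
```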

3. Process Operational Signals

The proces_* functions summarize time-series and event tables into lot-level features. This is one of the most important parts of the project.

Examples:

  • proces_bios(...) Aggregates bioreactor signals over the production interval associated with each lot.
  • proces_cc_ino(...) and proces_cc_cp(...) Compute summary statistics from kinetic measurements during inoculation and final culture windows.
  • proces_cc_cent(...) Aggregates centrifugation kinetics by lot, centrifuge, and operation number.
  • proces_cent(...) Converts centrifuge event streams into lot-aligned process summaries.

These steps turn raw operational traces into statistics such as mean, max, min, std, and median for each relevant lot-phase pairing.
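The core of every `proces_*` function is a groupby-aggregate of that shape. A self-contained sketch (the function name and the `temp` column are illustrative):

```python
import pandas as pd

def summarize_by_lot(log: pd.DataFrame, value_cols) -> pd.DataFrame:
    # Collapse a time-series log into one row per lot with the summary
    # statistics used downstream: mean, max, min, std, and median.
    agg = log.groupby("lote")[list(value_cols)].agg(
        ["mean", "max", "min", "std", "median"]
    )
    # Flatten the MultiIndex columns: ("temp", "mean") -> "temp_mean".
    agg.columns = [f"{col}_{stat}" for col, stat in agg.columns]
    return agg.reset_index()
```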

4. Build The Modeling Table

The merge stage uses lote as the main entity and combines:

  • order-level information,
  • pre-inoculation features,
  • inoculation features,
  • final culture features,
  • centrifugation features,
  • environmental features,
  • inherited information from parent lots when required.

Important helper blocks:

  • premergear(...) Builds the unified lot universe.
  • make_inoculo(...), make_cultivo(...), make_centri(...) Build stage-specific feature blocks.
  • mergear_th(...) Adds environmental summaries aligned to operation windows.
  • union(...) Produces the final wide modeling dataframe.
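The merge pattern behind those helpers can be sketched as a chain of left joins on lote, so lots missing a stage keep their row with NaN features instead of being dropped (the function name here is illustrative, not the script's):

```python
from functools import reduce

import pandas as pd

def build_modeling_table(lot_universe: pd.DataFrame,
                         feature_blocks) -> pd.DataFrame:
    # Left-join every stage-specific feature block onto the unified
    # lot universe, keyed on "lote".
    return reduce(
        lambda left, right: left.merge(right, on="lote", how="left"),
        feature_blocks,
        lot_universe,
    )
```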

5. Split Train And Test

desunion(...) separates the unified table into:

  • training rows with known pro_cp_prod1
  • test rows corresponding to the lots present in fases_produccion_test.xlsx

The split is done by lot membership, not by random row slicing.
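A membership split of that kind reduces to a boolean mask over lote (this helper is a sketch of the idea, not the actual `desunion(...)` signature):

```python
import pandas as pd

def split_by_lot(table: pd.DataFrame, test_lots) -> tuple:
    # A row goes to test iff its lot appears in the test sheet;
    # everything else with a known target stays in train.
    is_test = table["lote"].isin(set(test_lots))
    train = table[~is_test & table["pro_cp_prod1"].notna()]
    test = table[is_test]
    return train, test
```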

6. Feature Selection And Data Quality Handling

The final path applies several filters to reduce noise and stabilize the model:

  • statistical_select(...)
  • handle_data_issues(...)
  • lasso_select(...)
  • rfr_select(...)

In practice, the script:

  • removes low-value or unstable columns,
  • handles missing values,
  • applies outlier treatment,
  • uses Lasso-based selection,
  • then applies Random Forest RFECV-style selection on the reduced feature space.
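The Lasso-based selection step can be sketched on synthetic data: fit a cross-validated Lasso on standardized features and keep only the columns with non-zero coefficients (the data here is a stand-in, not the competition table):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the lot-level modeling table.
X, y = make_regression(n_samples=200, n_features=30,
                       n_informative=5, random_state=0)

# Lasso-based selection: keep only features with a non-zero coefficient.
Xs = StandardScaler().fit_transform(X)
lasso = LassoCV(cv=5, random_state=0).fit(Xs, y)
selected = np.flatnonzero(lasso.coef_ != 0)
X_reduced = X[:, selected]
```

A Random Forest RFECV-style pass would then run on `X_reduced` rather than the full feature space.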

7. Train And Predict

The final training(...) function in src/prediccion.py keeps the production choice intentionally narrow:

  • it fits LassoCV
  • keeps the best alpha
  • evaluates the resulting Lasso model with 5-fold cross-validation using RMSE and MAPE
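That evaluation scheme maps directly onto scikit-learn's scorer strings; a sketch on synthetic data (the alpha and the target shift are illustrative, chosen only so MAPE is well-defined away from zero):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=120, n_features=10, noise=5.0, random_state=0)
y = y + 200  # shift the target away from zero so MAPE is meaningful

model = Lasso(alpha=0.1)
# scikit-learn exposes both metrics as negated scores; flip the sign back.
rmse = -cross_val_score(model, X, y, cv=5,
                        scoring="neg_root_mean_squared_error")
mape = -cross_val_score(model, X, y, cv=5,
                        scoring="neg_mean_absolute_percentage_error")
```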

Then predicting(...) applies the trained model to the test lots and produces the values that end up in the submission file.

Even though the final script uses Lasso only, src/exploracion.py still contains a broader benchmark function, multiple_training(...), with experiments across:

  • Linear Regression
  • Ridge
  • Lasso
  • Elastic Net
  • SVR
  • KNN
  • Decision Tree
  • Random Forest
  • Gradient Boosting
  • XGBoost
  • LightGBM
  • CatBoost
  • a simple voting ensemble

That script is useful to understand the wider exploration space even though it is not the final public entrypoint.
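The benchmarking pattern in `multiple_training(...)` presumably reduces to looping candidate regressors through the same cross-validation; a minimal sketch with a subset of the models listed above, on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=150, n_features=12, noise=10.0, random_state=0)

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}
# Mean cross-validated RMSE per model (lower is better).
scores = {
    name: -cross_val_score(m, X, y, cv=5,
                           scoring="neg_root_mean_squared_error").mean()
    for name, m in models.items()
}
```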

Files Worth Reading First

If you only want the shortest path through the repo:

  1. docs/memoria.pdf
  2. docs/presentacion_final.pdf
  3. src/prediccion.py
  4. notebooks/notebook.ipynb

If you want to follow the competition chronologically:

  1. phases/phase_1
  2. phases/phase_2
  3. phases/phase_3

Deliverables Preserved In The Repo

The repository keeps the original final competition artifacts:

Presentation

The final presentation is preserved as docs/presentacion_final.pdf and can also be viewed through the GitHub Pages viewer implemented in docs/index.html.

Running The Code

Install dependencies:

pip install -r requirements.txt

Main entrypoint:

python src/prediccion.py

This will generate a prediction file under:

phases/phase_2/deliverables/generated/

Optional PostgreSQL Export

Some exploration flows include an optional PostgreSQL export through to_sql(...), mainly for external inspection and dashboarding. No credentials are stored in the repository.

To enable that path, copy .env.example to .env or define:

  • AGGITY_DB_HOST
  • AGGITY_DB_PORT
  • AGGITY_DB_NAME
  • AGGITY_DB_USER
  • AGGITY_DB_PASSWORD

If you are only interested in the predictive pipeline, this configuration is not required.
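Assembling the connection URL from those variables might look like this (the helper name is illustrative; only the environment variable names come from the list above):

```python
import os

def aggity_db_url() -> str:
    # Build a SQLAlchemy-style PostgreSQL URL from the environment;
    # no credentials ever live in the repository itself.
    return "postgresql+psycopg2://{user}:{password}@{host}:{port}/{name}".format(
        user=os.environ["AGGITY_DB_USER"],
        password=os.environ["AGGITY_DB_PASSWORD"],
        host=os.environ["AGGITY_DB_HOST"],
        port=os.environ["AGGITY_DB_PORT"],
        name=os.environ["AGGITY_DB_NAME"],
    )
```

The resulting URL would then be passed to sqlalchemy.create_engine(...) before calling to_sql(...).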

Dependencies

The repository currently declares the following Python dependencies in requirements.txt:

  • pandas
  • numpy
  • scipy
  • scikit-learn
  • xgboost
  • lightgbm
  • catboost
  • shap
  • matplotlib
  • SQLAlchemy
  • psycopg2-binary

License

This project is released under the MIT License. See LICENSE.
