Usage Guide

How to use the hotel_management project for typical workflows.

Notes on Orchestration

IMPORTANT: All of the provided orchestrators should be used with caution, preferably not outside of development and testing (major side-effects are possible)
Orchestrators were created almost exclusively for dev/test convenience and efficiency in single-owner, single-dev work
skip-if-existing flag determines whether an orchestrator will run the given pipelines if it notices at least one snapshot in the target location (e.g. if set to true and feature_store/{feature_set_name}/{feature_set_version}/{snapshot_id}/ already exists, it will not freeze that feature set; otherwise it will - new snapshot gets created)

Included Artifacts and Configs

This repo comes with a few artifacts for quick checking of what to expect with various pipelines. These include:

original dataset from kaggle
two sets of generated data (10k and 20k rows respectively)
interim and processed datasets related to the previous three
feature sets related to the mentioned processed datasets
four full experiment runs in test env
- search + train + evaluate + explain
- included models -> cancellation global v1, adr city_hotel_online_ta v1, no_show city_hotel v1
one promoted and one staging cancellation global model
one inference run and one monitoring run on cancellation global

This repo also includes pre-defined configs for 14 different models, optimized to likely perform well. It also includes some of the other configs, all of which can be found in the configs/ directory. This enables the user to start training some models immediately, while not occupying their local workspace with too many artifacts.

Docker

If you use docker, parts of the remainder of this file will be less relevant to you. Read regardless for more clarity.

For quick use with docker:

Run the following command whenever the code is updated:

docker compose build

Add --no-cache when you want to rebuild the image from scratch:

docker compose build --no-cache

Run this command to operate the ml workflow from your browser:

docker compose up

Simply press ctrl+c stop running the container
- Ideally avoid doing this while pipelines are mid-execution
Backend is now on localhost:8000 by default
Frontend is now on localhost:8050 by default

ML Service

Overview

Code within the ml_service/ folder provides Dash + FastAPI apps that can be used to:
- Run pipelines
- Create and store configs
  - Includes validation to ensure proper quality
- Run scripts
- Read docs
- View json and yaml files
- View directory structure

Configurations

Supported

Currently supports the following configs:

model specs + search + training
- configs/model_specs/{problem_type}/{segment}{model_version}.yaml
- configs/search/{problem_type}/{segment}{model_version}.yaml
- configs/train/{problem_type}/{segment}{model_version}.yaml
data (interim + processed)
- configs/data/interim/{dataset_name}/{dataset_version}.yaml
- configs/data/processed/{dataset_name}/{dataset_version}.yaml
feature set configs (feature registry)
- configs/feature_registry/features.yaml
pipeline configs
- configs/pipelines/{data_type}/{algorithm}/{pipeline_version}.yaml
promotion thresholds
- configs/promotion/thresholds.yaml
snapshot bindings (via a script)
- configs/snapshot_bindings_registry/bindings.yaml

Not Supported

Currently does not support the following configs:

defaults:
- configs/defaults/global.yaml
- configs/defaults/{algorithm}.yaml
environment overlay (configs):
- configs/env/{env_name}.yaml

Reasoning

The supported defaults are written more often, require lineage with timestamp, and are versioned
The unsupported configs are not meant to be altered, do not require lineage, and are not versioned

Instructions

In order to use the ml service:

Launch the backend with

uvicorn ml_service.backend.main:app --reload

Launch the frontend with

python -m ml_service.frontend.app

Open the dashboard in your browser at the specified port and use it.

Examples

Pipelines:

Modeling Configs:

Running Pipelines

Use CLI commands with python scripts found in pipelines/
This section includes a brief overview of the pipelines
The diagrams describing each pipeline in more detail (e.g. which artifacts are used and produced at each step) can be viewed in the architecture overview
More architecture information in general is located in the docs/architecture folder as well

Data Preprocessing

The pipelines/data/register_raw_snapshot.py pipeline registers raw data it finds in data/raw/{dataset_name}/{dataset_version}/{snapshot_id}/data.{format}, based on cli arguments
The pipelines/data/build_interim_dataset.py pipeline builds an interim dataset from one of the raw data snapshots, based on the interaction between cli arguments and configs from configs/data/interim/{dataset_name}/{dataset_version}.yaml
The pipelines/data/build_processed_dataset.py pipeline builds a processed dataset from one of the interim datasets, based on the interaction between cli arguments and configs from configs/data/processed/{dataset_name}/{dataset_version}.yaml
The pipelines/orchestration/data/execute_all_data_preprocessing.py orchestrator executes all of the three pipelines for all of the available raw snapshots and interim and processed configs

Feature Set Freezing

The pipelines/features/freeze.py pipeline freezes a feature set based on the interaction of cli arguments with feature registry (configs/feature_registry/features.yaml)
The pipelines/orchestration/features/freeze_all_feature_sets.py orchestrator freezes all of the feature sets found in the feature registry (configs/feature_registry/features.yaml)

Experiments

Search (hyperparameter searching)

The pipelines/search/search.py pipeline performs a hyperparameter search for a given model, based on the interaction between cli arguments and resolved configs (a graph in the architecture overview shows how configs are resolved at runtime)
A search run defines an experiment (one search = one experiment)

Runners

Training

The pipelines/runners/train.py pipeline performs a training run, based on the interaction between cli arguments and resolved configs (a graph in the architecture overview shows how configs are resolved at runtime)
A training run is canonical for evaluation and explainability runs

Evaluation

The pipelines/runners/evaluate.py pipeline performs an evaluation run, based on cli arguments

Explainability

The pipelines/runners/explain.py pipeline performs an explainability run, based on cli arguments

Orchestration

The pipelines/orchestration/experiments/execute_experiment_with_latest.py orchestrator executes search.py, train.py, evalute.py and explain.py in sequence by defaulting to the latest experiment id for all runners, and the latest train run id for evaluation and explainability runs
The pipelines/orchestration/experiments/execute_all_experiments_with_latest.py orchestrator executes execute_experiment_with_latest.py for all of the models, based on file structure within configs/model_specs, such that problem type + segment + model version = one model.

Promotion

The pipelines/promotion/promote.py pipeline stages or promotes a model, and archives the previous one (if promotion occurs), based on cli arguments and predefined thresholds (configs/promotion/thresholds.yaml)

Post-promotion

Inference

The pipelines/post_promotion/infer.py pipeline runs inference using a defined trained model or pipeline with defined snapshot bindings, based on cli arguments

Monitoring

The pipelines/post_promotion/monitor.py pipeline monitors model performance, based on cli arguments and relevant artifacts

The Grand Orchestrator

The pipelines/orchestration/master/run_all_workflows.py executes execute_all_data_preprocessing.py, freeze_all_feature_sets.py and execute_all_experiments_with_latest.py in sequence

Artifacts

All data-related artifacts can be found in data/
All feature-set-related artifacts can be found in feature_store/
All experiment-related artifacts can be found in experiments/
All promotion-related artifacts, as well as the model registry and archive, can be found in model_registry/

Scripts

Use CLI commands with python scripts found in scripts/
This section describes what each script does

Generators

The generate_cols_for_row_id_fingerprint.py script generates a fingerprint that ensures consistency in generating row_id of hotel_bookings
- Impossible to ensure perfect consistency in python code alone, but acts as an additional sanity check
- Good enough for local individual or small team use
The generate_fake_data.py script generates fake data that can then be used by pipelines.
- The data, along with the synthesizer_metadata and a quality_report, is saved in data/raw/{dataset_name}/{dataset_version}/{dataset_snapshot}/.
- The trained model can be saved in synthesizers/snapshot_id/, named ctgan_model.pkl by default, and then reused, which greatly reduces the scripts' runtime (by up to 99% - training is expensive).
- Data is stored in csv format by default. Alter the script if needs evolved.
- The script is not modularized, as it is not considered to be a core part of the repo, and the repo comes with some pre-generated synthetic data, so the need for the script is not high.
- May be modularized in the future.
- The relationships between columns are likely not captured accurately, but that is considered acceptable at this stage.
  - Adding relationship logic would increase complexity with questionable justification for it.
  - The generated data is expected to be used for experimenting, rather than training production models.
- This script requires extra setup steps to use, as mentioned in setup.md
The generate_operator_hash.py script generates an operator hash, which is needed when writing into the feature registry.
- Ensure that the operators exist, and write them in proper format (e.g. TotalStay)
- In CLI, separate the operators with a space character (e.g. --operators TotalStay ArrivalDate)
- In GUI, use commas for separation (e.g. TotalStay, ArrivalDate)
The generate_snapshot_binding.py script generates a new snapshot of snapshot bindings in the snapshot binding registry.
- It always writes the latest snapshot for each existing dataset and feature set.
- Alter the results manually if you need older snapshots for specific datasets and/or feature sets.

Quality Scripts

These scripts are used by the pre-commit hook, as well as GitHub Actions CI, to ensure code quality.
The check_import_layers.py script checks import layers and dependencies across the codebase to enforce architectural boundaries (specified in boundaries.md)
The check_naming_conventions.py script checks the naming conventions across the codebase.
- In order to satisfy the requirements:
  - use snake_case for modules and functions
  - use PascalCase for classes
  - do not prefix module names with _ (except __init__ and __all__)
- The script also allows for ignoring certain folders, especially tests/

Logging

All individual data pipelines' logs can be found in data/
All individual features pipelines' logs can be found in feature_store/
All individual experiment-related pipelines' logs can be found in experiments/
All individual promotion pipelines' logs can be found in model_registry/
All orchestration logs can be found in orchestration_logs/
Logging level is defined through CLI
Expect detailed, informative logs from individual pipelines
Expect high-level, helpful logs from orchestration pipelines
Each pipeline run logs to a new location that is logically easy and intuitive to find
All scripts' logs can be found in scripts_logs/

EDA

See notebooks/EDA_and_Data_Preparation.ipynb for initial exploration

Configurations

All configs are defined exclusively within configs/
All configs have to respect the existing file structure - otherwise the pipelines will not work
Naming of datasets, feature sets, versions, problem types and segments needs to be consistent across all configs
All configs are currently required to be in yaml
New change in anything that needs to be defined in configs -> new version
The versioning format across the repository is v{integer} (e.g. v1, v2, v3), and it is important to respect this format in order for everything to work properly
Whenever you see "version" in the context of configs within this repo, assume the v{integer} format

Defaults

Define global and algorithm-specific defaults in configs/defaults/global.yaml or configs/defaults/{algorithm_name}.yaml
These configs are meant to be defined once and never changed
The repo comes with some predefined configs - feel free to change them for your individual use-cases

Data Configs

Define configs for build_interim_dataset.py runs in configs/data/interim/{dataset_name}/{dataset_version}.yaml
Define configs for build_processed_dataset.py runs in configs/data/processed/{dataset_name}/{dataset_version}.yaml

Feature Registry

Define configs for freeze.py in configs/feature_registry/features.py
Generate the operator hash with scripts/generators/generate_operator_hash.py

Snapshot Bindings

Define snapshot bindings for various pipelines in configs/snapshot_bindings_registry/bindings.yaml

Model-specific Configs

Every model needs exactly three configs defined - model specs, search and training
Each of these should follow the exact same file structure in order for the pipelines to work as expected
The expected nesting is configs/{current}/{problem_type}/{segment}/{model_version}.yaml, where current = model_specs, search or training
Problem type can be cancellation, no_show, lead_time, etc.
Segment can be global, city_hotel, resort_hotel_online_ta, etc. - use clear abbrevations
Model specs are foundational for each model, while search and training configs help define what is relevant for search and training runs respectively (check the architecture overview to understand how configs resolve at runtime)
The repo comes with predefined model-specific configs for 14 models, spanning 7 problem types
The predefined configs to not guarantee the most optimal results, but are considered to be a good starting point - adjust them as you wish

Environment Configs

Define environment configs in configs/env/{environment}.yaml, where environment = dev, prod or test
The pipelines will only recognize dev, prod and test as valid names
It is not crucial to have these configs defined, but they are useful, as they come last in config resolution
Since they override all of the other configs, it is important to be mindful of what is included in these configs
Their expected use is primarily convenience in dev/test environment and assurance of quality in prod environment
The repo comes with some predefined configs for each of the environments - feel free to adjust them to your use-cases

Pipeline Configs

Define pipeline configs in configs/pipelines/{data_type}/{algorithm}/{pipeline_version}.yaml
Data type can currently only be tabular; time-series is a planned implementation
These configs define the logic used within an sklearn Pipeline that wraps some models
Model specs define which pipeline version (if any) will be used

Promotion Configs

Define promotion thresholds for each model in configs/promotion/thresholds.yaml
These configs are used by promote.py
Thresholds can be changed, but changing them too often may be considered a bad business practice
The repo comes with predefined thresholds for the 14 included models
Thresholds are subjective, so you are encouraged to define them on your own

FilesExpand file tree

usage.md

Latest commit

History

usage.md

File metadata and controls

Usage Guide

Notes on Orchestration

Included Artifacts and Configs

Docker

ML Service

Overview

Configurations

Supported

Not Supported

Reasoning

Instructions

Examples

Pipelines:

Modeling Configs:

Running Pipelines

Data Preprocessing

Feature Set Freezing

Experiments

Search (hyperparameter searching)

Runners

Training

Evaluation

Explainability

Orchestration

Promotion

Post-promotion

Inference

Monitoring

The Grand Orchestrator

Artifacts

Scripts

Generators

Quality Scripts

Logging

EDA

Configurations

Defaults

Data Configs

Feature Registry

Snapshot Bindings

Model-specific Configs

Environment Configs

Pipeline Configs

Promotion Configs