How to use the hotel_management project for typical workflows.
- IMPORTANT: All of the provided orchestrators should be used with caution, preferably only in development and testing (major side effects are possible)
- Orchestrators were created almost exclusively for dev/test convenience and efficiency in single-owner, single-dev work
- The `skip-if-existing` flag determines whether an orchestrator will run the given pipelines when it notices at least one snapshot in the target location (e.g. if set to true and `feature_store/{feature_set_name}/{feature_set_version}/{snapshot_id}/` already exists, it will not freeze that feature set; otherwise it will, and a new snapshot gets created)
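The behaviour described above can be sketched as a simple directory check. The paths and variable names below are hypothetical illustrations, not the orchestrators' actual implementation:

```shell
# Minimal sketch of what skip-if-existing means: when the target snapshot
# directory already exists, the pipeline is skipped; otherwise it runs.
skip_if_existing=true
target="feature_store/demo_feature_set/v1/snap_001"  # hypothetical snapshot location
mkdir -p "$target"
if [ "$skip_if_existing" = true ] && [ -d "$target" ]; then
  decision="skip"
else
  decision="run"
fi
echo "$decision"
```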
This repo comes with a few artifacts for quickly checking what to expect from various pipelines. These include:
- the original dataset from Kaggle
- two sets of generated data (10k and 20k rows respectively)
- interim and processed datasets related to the previous three
- feature sets related to the mentioned processed datasets
- four full experiment runs in the test environment
  - search + train + evaluate + explain
  - included models -> cancellation global v1, adr city_hotel_online_ta v1, no_show city_hotel v1
- one promoted and one staging cancellation global model
- one inference run and one monitoring run on cancellation global
This repo also includes pre-defined configs for 14 different models, tuned to likely perform well.
It also includes other supporting configs, all of which can be found in the `configs/` directory.
This enables the user to start training models immediately, without occupying their local
workspace with too many artifacts.
If you use docker, parts of the remainder of this file will be less relevant to you. Read regardless for more clarity.
For quick use with docker:
- Run the following command whenever the code is updated: `docker compose build`
- Add `--no-cache` when you want to rebuild the image from scratch: `docker compose build --no-cache`
- Run this command to operate the ML workflow from your browser: `docker compose up`
- Press `ctrl+c` to stop the running container
  - Ideally avoid doing this while pipelines are mid-execution
- Backend is now on `localhost:8000` by default
- Frontend is now on `localhost:8050` by default
- Code within the `ml_service/` folder provides `Dash` + `FastAPI` apps that can be used to:
  - Run pipelines
  - Create and store configs
    - Includes validation to ensure proper quality
  - Run scripts
  - Read docs
  - View JSON and YAML files
  - View directory structure
Currently supports the following configs:
- model specs + search + training
  - `configs/model_specs/{problem_type}/{segment}/{model_version}.yaml`
  - `configs/search/{problem_type}/{segment}/{model_version}.yaml`
  - `configs/train/{problem_type}/{segment}/{model_version}.yaml`
- data (interim + processed)
  - `configs/data/interim/{dataset_name}/{dataset_version}.yaml`
  - `configs/data/processed/{dataset_name}/{dataset_version}.yaml`
- feature set configs (feature registry)
  - `configs/feature_registry/features.yaml`
- pipeline configs
  - `configs/pipelines/{data_type}/{algorithm}/{pipeline_version}.yaml`
- promotion thresholds
  - `configs/promotion/thresholds.yaml`
- snapshot bindings (via a script)
  - `configs/snapshot_bindings_registry/bindings.yaml`
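For a concrete example, the three model-related templates above can be resolved for the cancellation global v1 model that ships with the repo:

```shell
# Resolve the model-config path templates for one concrete model
# (cancellation / global / v1, one of the included models)
problem_type=cancellation
segment=global
model_version=v1
model_specs_path="configs/model_specs/${problem_type}/${segment}/${model_version}.yaml"
search_path="configs/search/${problem_type}/${segment}/${model_version}.yaml"
train_path="configs/train/${problem_type}/${segment}/${model_version}.yaml"
printf '%s\n' "$model_specs_path" "$search_path" "$train_path"
```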
Currently does not support the following configs:
- defaults:
  - `configs/defaults/global.yaml`
  - `configs/defaults/{algorithm}.yaml`
- environment overlay configs:
  - `configs/env/{env_name}.yaml`
- The supported configs are written more often, require lineage with a timestamp, and are versioned
- The unsupported configs are not meant to be altered, do not require lineage, and are not versioned
In order to use the ml service:
- Launch the backend with `uvicorn ml_service.backend.main:app --reload`
- Launch the frontend with `python -m ml_service.frontend.app`
- Open the dashboard in your browser at the specified port and use it
- Use CLI commands with python scripts found in `pipelines/`
- This section includes a brief overview of the pipelines
- Diagrams describing each pipeline in more detail (e.g. which artifacts are used and produced at each step) can be viewed in the architecture overview
- More architecture information in general is located in the `docs/architecture` folder as well
- The `pipelines/data/register_raw_snapshot.py` pipeline registers raw data it finds in `data/raw/{dataset_name}/{dataset_version}/{snapshot_id}/data.{format}`, based on CLI arguments
- The `pipelines/data/build_interim_dataset.py` pipeline builds an interim dataset from one of the raw data snapshots, based on the interaction between CLI arguments and configs from `configs/data/interim/{dataset_name}/{dataset_version}.yaml`
- The `pipelines/data/build_processed_dataset.py` pipeline builds a processed dataset from one of the interim datasets, based on the interaction between CLI arguments and configs from `configs/data/processed/{dataset_name}/{dataset_version}.yaml`
- The `pipelines/orchestration/data/execute_all_data_preprocessing.py` orchestrator executes all three pipelines for all of the available raw snapshots and interim and processed configs
- The `pipelines/features/freeze.py` pipeline freezes a feature set based on the interaction of CLI arguments with the feature registry (`configs/feature_registry/features.yaml`)
- The `pipelines/orchestration/features/freeze_all_feature_sets.py` orchestrator freezes all of the feature sets found in the feature registry (`configs/feature_registry/features.yaml`)
- The `pipelines/search/search.py` pipeline performs a hyperparameter search for a given model, based on the interaction between CLI arguments and resolved configs (a graph in the architecture overview shows how configs are resolved at runtime)
  - A search run defines an experiment (one search = one experiment)
- The `pipelines/runners/train.py` pipeline performs a training run, based on the interaction between CLI arguments and resolved configs (a graph in the architecture overview shows how configs are resolved at runtime)
  - A training run is canonical for evaluation and explainability runs
- The `pipelines/runners/evaluate.py` pipeline performs an evaluation run, based on CLI arguments
- The `pipelines/runners/explain.py` pipeline performs an explainability run, based on CLI arguments
- The `pipelines/orchestration/experiments/execute_experiment_with_latest.py` orchestrator executes `search.py`, `train.py`, `evaluate.py` and `explain.py` in sequence, defaulting to the latest experiment id for all runners, and the latest train run id for evaluation and explainability runs
- The `pipelines/orchestration/experiments/execute_all_experiments_with_latest.py` orchestrator executes `execute_experiment_with_latest.py` for all of the models, based on the file structure within `configs/model_specs`, such that problem type + segment + model version = one model
- The `pipelines/promotion/promote.py` pipeline stages or promotes a model, and archives the previous one (if promotion occurs), based on CLI arguments and predefined thresholds (`configs/promotion/thresholds.yaml`)
- The `pipelines/post_promotion/infer.py` pipeline runs inference using a defined trained model or pipeline with defined snapshot bindings, based on CLI arguments
- The `pipelines/post_promotion/monitor.py` pipeline monitors model performance, based on CLI arguments and relevant artifacts
- The `pipelines/orchestration/master/run_all_workflows.py` orchestrator executes `execute_all_data_preprocessing.py`, `freeze_all_feature_sets.py` and `execute_all_experiments_with_latest.py` in sequence
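The experiment sequence above can also be driven manually. The sketch below only prints the commands rather than executing them, because the `--experiment-id` flag is a hypothetical placeholder; check each runner's actual CLI arguments before running:

```shell
# Dry run: print (rather than execute) the search -> train -> evaluate -> explain
# sequence. The --experiment-id flag is illustrative, not the runners' real API.
sequence="pipelines/search/search.py pipelines/runners/train.py pipelines/runners/evaluate.py pipelines/runners/explain.py"
count=0
for step in $sequence; do
  echo "python $step --experiment-id latest"
  count=$((count + 1))
done
```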
- All data-related artifacts can be found in `data/`
- All feature-set-related artifacts can be found in `feature_store/`
- All experiment-related artifacts can be found in `experiments/`
- All promotion-related artifacts, as well as the model registry and archive, can be found in `model_registry/`
- Use CLI commands with python scripts found in `scripts/`
- This section describes what each script does
- The `generate_cols_for_row_id_fingerprint.py` script generates a fingerprint that ensures consistency in generating the row_id of `hotel_bookings`
  - It is impossible to ensure perfect consistency in Python code alone, but the fingerprint acts as an additional sanity check
  - Good enough for local individual or small-team use
- The `generate_fake_data.py` script generates fake data that can then be used by pipelines
  - The data, along with the `synthesizer_metadata` and a `quality_report`, is saved in `data/raw/{dataset_name}/{dataset_version}/{dataset_snapshot}/`
  - The trained model can be saved in `synthesizers/snapshot_id/`, named `ctgan_model.pkl` by default, and then reused, which greatly reduces the script's runtime (by up to 99% - training is expensive)
  - Data is stored in `csv` format by default; alter the script if your needs evolve
  - The script is not modularized, as it is not considered a core part of the repo, and since the repo comes with some pre-generated synthetic data, the need for the script is low
    - It may be modularized in the future
  - The relationships between columns are likely not captured accurately, but that is considered acceptable at this stage
    - Adding relationship logic would increase complexity with questionable justification
  - The generated data is expected to be used for experimenting, rather than for training production models
  - This script requires extra setup steps, as mentioned in setup.md
- The `generate_operator_hash.py` script generates an operator hash, which is needed when writing into the feature registry
  - Ensure that the operators exist, and write them in the proper format (e.g. TotalStay)
  - In the CLI, separate the operators with a space character (e.g. `--operators TotalStay ArrivalDate`)
  - In the GUI, use commas for separation (e.g. TotalStay, ArrivalDate)
- The `generate_snapshot_binding.py` script generates a new snapshot of snapshot bindings in the snapshot binding registry
  - It always writes the latest snapshot for each existing dataset and feature set
  - Alter the results manually if you need older snapshots for specific datasets and/or feature sets
- These scripts are used by the `pre-commit` hook, as well as `GitHub Actions CI`, to ensure code quality
- The `check_import_layers.py` script checks import layers and dependencies across the codebase to enforce architectural boundaries (specified in boundaries.md)
- The `check_naming_conventions.py` script checks the naming conventions across the codebase
  - In order to satisfy the requirements:
    - use `snake_case` for modules and functions
    - use `PascalCase` for classes
    - do not prefix module names with `_` (except `__init__` and `__all__`)
  - The script also allows for ignoring certain folders, especially `tests/`
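The module-naming rule can be expressed as a small pattern check. The regex below is an assumption for illustration, not the exact one used by `check_naming_conventions.py`:

```shell
# Sketch of the module-naming rule: snake_case, no leading underscore,
# with __init__ and __all__ as the allowed exceptions.
valid_module() { echo "$1" | grep -Eq '^(__init__|__all__|[a-z][a-z0-9_]*)$'; }
valid_module "build_interim_dataset" && m1=ok
valid_module "_private_module" || m2=rejected
```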
- All individual data pipelines' logs can be found in `data/`
- All individual features pipelines' logs can be found in `feature_store/`
- All individual experiment-related pipelines' logs can be found in `experiments/`
- All individual promotion pipelines' logs can be found in `model_registry/`
- All orchestration logs can be found in `orchestration_logs/`
- Logging level is defined through the CLI
- Expect detailed, informative logs from individual pipelines
- Expect high-level, helpful logs from orchestration pipelines
- Each pipeline run logs to a new location that is easy and intuitive to find
- All scripts' logs can be found in `scripts_logs/`
- See `notebooks/EDA_and_Data_Preparation.ipynb` for initial exploration
- All configs are defined exclusively within `configs/`
- All configs have to respect the existing file structure - otherwise the pipelines will not work
- Naming of datasets, feature sets, versions, problem types and segments needs to be consistent across all configs
- All configs are currently required to be in `yaml`
- Any change to anything that needs to be defined in configs -> new version
- The versioning format across the repository is `v{integer}` (e.g. v1, v2, v3), and it is important to respect this format in order for everything to work properly
- Whenever you see "version" in the context of configs within this repo, assume the `v{integer}` format
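The `v{integer}` convention can be validated with a one-line pattern check, sketched here as an illustration:

```shell
# Minimal check of the v{integer} version format (e.g. v1, v2, v3)
is_valid_version() { echo "$1" | grep -Eq '^v[0-9]+$'; }
is_valid_version "v2" && v_ok=yes
is_valid_version "2.0" || v_bad=yes
```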
- Define global and algorithm-specific defaults in `configs/defaults/global.yaml` or `configs/defaults/{algorithm_name}.yaml`
- These configs are meant to be defined once and never changed
- The repo comes with some predefined configs - feel free to change them for your individual use-cases
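As a shape illustration only - the key names below are hypothetical, not the repo's actual schema - a global defaults file might look like:

```yaml
# configs/defaults/global.yaml - hypothetical keys, for illustration only
random_seed: 42
logging_level: INFO
artifact_format: csv
```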
- Define configs for `build_interim_dataset.py` runs in `configs/data/interim/{dataset_name}/{dataset_version}.yaml`
- Define configs for `build_processed_dataset.py` runs in `configs/data/processed/{dataset_name}/{dataset_version}.yaml`
- Define configs for `freeze.py` in `configs/feature_registry/features.yaml`
- Generate the operator hash with `scripts/generators/generate_operator_hash.py`
- Define snapshot bindings for various pipelines in `configs/snapshot_bindings_registry/bindings.yaml`
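A bindings file might be shaped like the following sketch; the keys and ids are hypothetical, grounded only in the dataset and feature-set names used elsewhere in this file:

```yaml
# configs/snapshot_bindings_registry/bindings.yaml - hypothetical structure
datasets:
  hotel_bookings:
    dataset_version: v1
    snapshot_id: snap_001
feature_sets:
  cancellation_features:
    feature_set_version: v1
    snapshot_id: snap_001
```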
- Every model needs exactly three configs defined - model specs, search and training
- Each of these should follow the exact same file structure in order for the pipelines to work as expected
- The expected nesting is `configs/{current}/{problem_type}/{segment}/{model_version}.yaml`, where current = `model_specs`, `search` or `training`
- Problem type can be cancellation, no_show, lead_time, etc.
- Segment can be global, city_hotel, resort_hotel_online_ta, etc. - use clear abbreviations
- Model specs are foundational for each model, while search and training configs help define what is relevant for search and training runs respectively (check the architecture overview to understand how configs resolve at runtime)
- The repo comes with predefined model-specific configs for 14 models, spanning 7 problem types
- The predefined configs do not guarantee optimal results, but are considered a good starting point - adjust them as you wish
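To make the nesting concrete, a model specs file for the included cancellation global v1 model could look like the sketch below; the key names and algorithm choice are hypothetical, except `pipeline_version`, whose role (selecting a pipeline config, if any) is described in the pipeline-configs section:

```yaml
# configs/model_specs/cancellation/global/v1.yaml - hypothetical keys
problem_type: cancellation
segment: global
model_version: v1
algorithm: lightgbm   # hypothetical choice of algorithm
pipeline_version: v1  # which sklearn Pipeline config (if any) wraps the model
```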
- Define environment configs in `configs/env/{environment}.yaml`, where environment = dev, prod or test
- The pipelines will only recognize dev, prod and test as valid names
- It is not crucial to have these configs defined, but they are useful, as they come last in config resolution
- Since they override all of the other configs, it is important to be mindful of what is included in them
- Their expected use is primarily convenience in dev/test environments and assurance of quality in the prod environment
- The repo comes with some predefined configs for each of the environments - feel free to adjust them to your use-cases
- Define pipeline configs in `configs/pipelines/{data_type}/{algorithm}/{pipeline_version}.yaml`
- Data type can currently only be tabular; time-series support is planned
- These configs define the logic used within an sklearn Pipeline that wraps some models
- Model specs define which pipeline version (if any) will be used
- Define promotion thresholds for each model in `configs/promotion/thresholds.yaml`
- These configs are used by `promote.py`
- Thresholds can be changed, but changing them too often may be considered bad business practice
- The repo comes with predefined thresholds for the 14 included models
- Thresholds are subjective, so you are encouraged to define them on your own
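A thresholds file might be organized per problem type and segment, as in the sketch below; the metric names and values are hypothetical illustrations, not the repo's predefined thresholds:

```yaml
# configs/promotion/thresholds.yaml - hypothetical metrics and values
cancellation:
  global:
    min_f1: 0.75
    min_roc_auc: 0.80
```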

