A repository for the final submission in the Computing for Data Science class at BSE.
Group members:
• Hannes Schiemann
• Julian Romero
• Lucia Sauer
• Moritz Peist
- `api` - FastAPI with one endpoint to make predictions.
- `app` - Streamlit application folder for demonstration.
- `data` - Contains the training data.
- `custom_library`
  - `src`
    - `fifa_library`
      - `eda` - Quick exploratory analysis of the data.
      - `model` - Whole pipeline including feature creation, preprocessing, model fitting, hyperparameter tuning, model evaluation, and prediction.
      - `tests` - Unit tests for preprocessing and feature creation.
Ensure you have Docker Desktop installed on your system. It provides the Docker environment required to build and run the services.
1. Open your terminal and navigate to the project folder containing the `docker-compose.yml` file.
2. Execute the following command:

   ```
   docker-compose up
   ```

   This command will build the Docker images (if not already built) and start the services.
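For orientation, a `docker-compose.yml` for these two services typically looks something like the sketch below. The build contexts and service names are assumptions based on the repository layout; the repository's actual file is authoritative.

```yaml
services:
  api:
    build: ./api        # assumed build context; adjust to the repo's layout
    ports:
      - "8000:8000"     # FastAPI / Swagger UI
  app:
    build: ./app
    ports:
      - "8501:8501"     # Streamlit
```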
Once the services are running, you can access them via your browser:
- Streamlit App: Access the Streamlit application at http://localhost:8501.
- FastAPI Swagger UI: Explore the FastAPI endpoints using the Swagger UI at http://localhost:8000/docs.
The following outlines the design philosophy, scalability considerations, and best practices for contributing to the project's library, as well as guidelines for adapting the library to different versions of the FIFA dataset.
The project is structured around a pipeline of well-defined, modular preprocessors. Each preprocessor is responsible for a specific set of transformations on the input data. By adhering to clear interface contracts and separation of concerns, the library remains extensible and maintainable as it evolves.
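The modular-preprocessor idea can be sketched as follows. This is a minimal, self-contained illustration; the class and column names are hypothetical and not the library's actual API:

```python
# Hypothetical sketch: each preprocessor owns one well-defined transformation
# and exposes the same `transform` interface, so steps can be mixed and matched.

class ColumnDropper:
    """Drops columns that should not reach the model."""
    def __init__(self, columns):
        self.columns = set(columns)

    def transform(self, rows):
        return [{k: v for k, v in row.items() if k not in self.columns}
                for row in rows]

class RatioAdder:
    """Adds a derived ratio column from two existing stats."""
    def __init__(self, numerator, denominator, name):
        self.numerator = numerator
        self.denominator = denominator
        self.name = name

    def transform(self, rows):
        out = []
        for row in rows:
            row = dict(row)  # copy, to avoid mutating the caller's data
            row[self.name] = row[self.numerator] / row[self.denominator]
            out.append(row)
        return out

def run_pipeline(rows, steps):
    """Applies each preprocessor in order."""
    for step in steps:
        rows = step.transform(rows)
    return rows

players = [{"name": "A", "pace": 80, "stamina": 40, "club": "X"}]
steps = [ColumnDropper(["club"]), RatioAdder("pace", "stamina", "pace_per_stamina")]
result = run_pipeline(players, steps)  # club removed, pace_per_stamina added
```

Because every step honors the same interface contract, a new transformation only requires a new class; the rest of the pipeline is untouched.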
To run the `main.ipynb` notebook, you need to create a new environment, install `requirements.txt`, and install the custom FIFA library. To install the custom library locally, follow these steps:

1. Open a terminal and navigate to the `custom_library` directory:

   ```
   cd custom_library
   ```

2. Install the library using pip:

   ```
   pip install .
   ```
This will install the library and make its modules available for use in your Python environment. Ensure that your Python environment is activated before running the command.
The dataset is from EA's football computer game FIFA 2021. The aim is to predict a player's position. The initial dataset had a fine-granular structure of 24 positions; for simplicity, we reduced the number of positions to eight, reflecting the key positions in the game.
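Such a reduction amounts to a lookup from fine-grained positions to coarse groups. The mapping below is purely illustrative; the project's actual eight-position grouping may differ:

```python
# Illustrative only: maps FIFA's fine-grained position codes to coarser
# groups. The project's actual grouping may use different labels or splits.
POSITION_GROUPS = {
    "ST": "Striker", "CF": "Striker",
    "LW": "Winger", "RW": "Winger",
    "CAM": "Attacking Midfielder",
    "CM": "Central Midfielder",
    "CDM": "Defensive Midfielder",
    "LB": "Full Back", "RB": "Full Back",
    "CB": "Centre Back",
    "GK": "Goalkeeper",
}

def coarse_position(position):
    """Collapses a fine-grained position code into its coarse group."""
    return POSITION_GROUPS[position]
```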
1. Modularity:
Each transformation step should be encapsulated in its own class. This ensures:
- Clear separation of responsibilities.
- Easier debugging, testing, and maintenance.
- The ability to mix and match preprocessors as needed.
2. Scalability:
The codebase is expected to grow in terms of:
- Number of Preprocessors: As new feature engineering ideas are introduced, they can be implemented as new classes or integrated into existing ones.
- Variety of Features: The FeatureCreation class uses feature flags to enable or disable certain transformations. Adding a new feature involves:
- Defining it in a separate method.
- Adding a feature flag.
- Updating get_feature_names_out and the input validation logic accordingly.
- Number of Models and Metrics: Although the FeatureCreation class focuses on feature engineering, the same principles apply across the pipeline. Models and metrics should follow a similar, modular approach.
3. Flexibility and Extensibility:
The FeatureCreation class allows customization through:
- Feature Flags: Users can enable or disable sets of transformations.
- Traits Mapping: Users can provide a custom traits_mapping dictionary to handle different trait sets or adapt to changes in the dataset’s trait definitions.
4. Consistency with scikit-learn Conventions:
By inheriting from BaseEstimator and TransformerMixin , the FeatureCreation class integrates seamlessly into a sklearn pipeline. This ensures:
- Familiar APIs for users.
- Compatibility with tools like GridSearchCV , Pipeline , and FeatureUnion .
- Easy and fast extensibility, and scaling any step within the pipeline up or down.
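As a minimal illustration of these conventions, here is a toy transformer (not the library's `FeatureCreation` class) that inherits from `BaseEstimator` and `TransformerMixin` and therefore drops straight into a `Pipeline`:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

class AddRowSum(BaseEstimator, TransformerMixin):
    """Toy transformer: appends one derived column (the row sum).
    Illustrative only; FeatureCreation follows the same contract."""

    def fit(self, X, y=None):
        return self  # nothing to learn, but fit must return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        return np.hstack([X, X.sum(axis=1, keepdims=True)])

    def get_feature_names_out(self, input_features=None):
        base = list(input_features) if input_features is not None else ["x0", "x1"]
        return base + ["row_sum"]

pipe = Pipeline([("features", AddRowSum())])
out = pipe.fit_transform([[1, 2], [3, 4]])  # third column holds each row's sum
```

Because the step honors the fit/transform contract, the same object also works inside `GridSearchCV` or `FeatureUnion` without modification.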
• Feature Flags:
The class accepts a feature_flags dictionary at initialization. This dictionary controls which features are computed. For example:
```python
feature_flags = {
    "ratio_features": True,
    "foot_pace": True,
    "wide_player": True,
    "playmaker": True,
    "traits": True,
}
```

In general, we expect the structure of player stats not to change dramatically between versions, so in principle the library should still be applicable. However, some new attributes may be introduced, old ones removed, or names changed. To adapt the library to a new version:
• Check for Renamed or Missing Columns: Review the dataset schema. If attributes like power_stamina or movement_balance are renamed or removed, you must:
- Update the FeatureCreation class’s input validation (_validate_feature_requirements) to reflect the new names.
- Adjust the feature computation methods to use the correct columns.
- Or turn off the feature creation involving non-existent features.
• Player_traits: We expect the encoding of player_traits to be especially sensitive to changes in the FIFA version. If new traits are added or old ones removed, the mapping of these traits can be updated and passed as a dictionary.
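The adaptation workflow can be sketched as below. The column and trait names are illustrative, and `validate_feature_requirements` is a stand-in for the class's internal `_validate_feature_requirements`, not the library's real function:

```python
# Sketch of adapting to a new dataset version: check which required columns
# are missing from the new schema, and override the trait mapping.

REQUIRED_COLUMNS = {"power_stamina", "movement_balance"}  # illustrative

def validate_feature_requirements(columns, required=REQUIRED_COLUMNS):
    """Returns the required columns missing from the new dataset's schema."""
    return sorted(required - set(columns))

# Hypothetical custom mapping for a newer FIFA version; pass a dictionary
# like this to handle added or removed traits.
traits_mapping = {
    "Long Shot Taker": "shooting",
    "Speed Dribbler": "dribbling",
    "Newly Added Trait": "misc",
}

missing = validate_feature_requirements(["power_stamina", "age"])
# `missing` lists columns to rename upstream or to disable via feature flags
```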
For now, we test only preprocessing and feature engineering, reaching 100% coverage. To run the tests:

1. Navigate to the `custom_library` folder.
2. Execute the following command in the terminal:

   ```
   pytest --cov --cov-report=html:coverage_re
   ```

3. This outputs a coverage report in HTML format under `coverage_re`.
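A unit test in this spirit typically looks like the following pytest-style sketch; the function under test and the test names are illustrative, not the project's actual code:

```python
# Illustrative pytest-style unit test for a feature-creation step.
# The function under test is a toy stand-in, not the library's API.

def add_pace_ratio(row):
    """Toy feature-creation step: adds a pace/stamina ratio column."""
    row = dict(row)  # copy, so the caller's row is untouched
    row["pace_per_stamina"] = row["pace"] / row["stamina"]
    return row

def test_add_pace_ratio_creates_expected_column():
    out = add_pace_ratio({"pace": 90, "stamina": 45})
    assert out["pace_per_stamina"] == 2.0

def test_add_pace_ratio_does_not_mutate_input():
    row = {"pace": 90, "stamina": 45}
    add_pace_ratio(row)
    assert "pace_per_stamina" not in row
```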