The project consists of two parts:
- a basic ETL job that transforms a CSV file into Parquet format
- a job that uses the Parquet files to compute some basic analytics
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
Python (3.8), Apache Spark 3.1+ (Hadoop 3.0+)
Data from https://www.kaggle.com/open-powerlifting/powerlifting-database/download
Running the tests also requires pandas and pytest.
Follow links:
- Install Apache Spark on Windows 10 using WSL
- Install Hadoop Single Cluster
- Hadoop 3.2.2 source
- pandas, or install it with: pip install pandas, pip install pyspark-pandas
- pytest
Example run:
pyspark < src/job_etl.py > log/job_etl.log 2>&1
pyspark < src/job_analytics.py > log/job_analytics.log 2>&1
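The two jobs described above (CSV to Parquet, then analytics on the Parquet files) might be sketched roughly as follows. This is a minimal illustration, not the actual content of src/job_etl.py or src/job_analytics.py: the function names, the app name, and any paths or column names passed in are hypothetical. Only standard PySpark calls are used (spark.read.csv, DataFrame.write.parquet, groupBy(...).avg(...)).

```python
def build_session(app_name="job_etl"):
    """Create (or reuse) a SparkSession; pyspark is imported lazily.

    The app name is a hypothetical placeholder.
    """
    from pyspark.sql import SparkSession

    return SparkSession.builder.appName(app_name).getOrCreate()


def csv_to_parquet(spark, csv_path, parquet_path):
    """Read a CSV file with a header row and write it out as Parquet."""
    df = spark.read.csv(csv_path, header=True, inferSchema=True)
    # Overwrite so that repeated runs do not fail on an existing output directory.
    df.write.mode("overwrite").parquet(parquet_path)
    return df


def mean_by_group(spark, parquet_path, group_col, value_col):
    """Load the Parquet data and return the mean of value_col per group_col."""
    df = spark.read.parquet(parquet_path)
    return df.groupBy(group_col).avg(value_col)
```

When a script is piped into the pyspark shell as in the commands above, a session named spark already exists, so build_session() is only needed when running under plain Python. A call such as csv_to_parquet(spark, "data/input.csv", "data/output.parquet") uses placeholder paths that would have to be adapted to the downloaded dataset.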
To run the automated tests for this system, pandas and pytest have to be installed and configured.
The tests focus on evaluating the ETL transformation. The analytic calculations are checked by computing the same output with different formulas or algorithms that should give the same result. Verifying these calculations currently requires inspecting the output files and logs, or adding more automated tests. Example test run:
pytest tests/test_wlConvert.py > log/test_wlConvert.log 2>&1
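The idea of checking a calculation by computing it with two different formulas can be turned into an automated test. The sketch below is an illustration only, not part of the existing test suite: the function names and sample values are hypothetical, and the mean is used simply as an example of a quantity with two independent formulas.

```python
import math


def mean_direct(values):
    """Mean as sum divided by count."""
    return sum(values) / len(values)


def mean_incremental(values):
    """Mean via a running update: m_k = m_(k-1) + (x_k - m_(k-1)) / k."""
    mean = 0.0
    for k, x in enumerate(values, start=1):
        mean += (x - mean) / k
    return mean


def test_mean_formulas_agree():
    # Hypothetical sample values, not taken from the dataset.
    sample = [100.0, 142.5, 180.0, 227.5]
    # Floating-point results are compared with a tolerance, not with ==.
    assert math.isclose(mean_direct(sample), mean_incremental(sample), rel_tol=1e-9)
```

Saved in a file under tests/ whose name starts with test_, a function like this would be picked up by pytest automatically, which reduces the need for manual log inspection.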
Copying the project folders and files to the target system should be sufficient.
This project uses SemVer for versioning.
- Radovan Jablonovsky - Initial work - Spark
See also the list of contributors who participated in this project.
This project is licensed under the BSD License; see the LICENSE.md file for details.
- Hat tip to the data analytics community, Google, and GitHub