Spark challenge for ETL and Analytics

The project consists of two parts:

A basic ETL job that transforms a CSV file into Parquet format
A job that uses the Parquet files to compute some basic analytics

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Prerequisites

Python (3.8), Apache Spark 3.1+ (Hadoop 3.0+)
Data from https://www.kaggle.com/open-powerlifting/powerlifting-database/download
To run the tests, pandas and pytest are needed

Installing

Follow links:

Install Apache Spark on Windows 10 using WSL
Install Hadoop Single Cluster
Hadoop 3.2.2 source
pandas (or install with: pip install pandas, pip install pyspark-pandas)
pytest

Example of the run:

pyspark < src/job_etl.py > log/job_etl.log 2>&1
pyspark < src/job_analytics.py > log/job_analytics.log 2>&1

Running the tests

To run the automated tests for this system, pandas and pytest have to be installed and configured.

Break down into end to end tests

The tests focus on evaluating the ETL transformation. Checks of the analytic calculations focus on computing the same output with different formulas/algorithms that should produce the same result. Verifying the calculations currently requires inspecting the output files/logs or adding more automated tests. Example of a test run:

pytest tests/test_wlConvert.py > log/test_wlConvert.log 2>&1
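
The cross-check idea described above, computing the same quantity with two different formulas and asserting that the results agree, could look like the following pytest-style sketch. The function names and sample values are hypothetical and are not taken from the project's tests.

```python
import math
import statistics


def variance_two_pass(values):
    """Population variance: mean of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def variance_shortcut(values):
    """Population variance via E[x^2] - (E[x])^2."""
    n = len(values)
    return sum(v * v for v in values) / n - (sum(values) / n) ** 2


def test_variance_formulas_agree():
    totals = [412.5, 530.0, 275.0, 610.0, 455.0]  # made-up sample values
    a = variance_two_pass(totals)
    b = variance_shortcut(totals)
    # Compare with a tolerance, since the two formulas round differently.
    assert math.isclose(a, b, rel_tol=1e-9)
    # A third, independent implementation from the standard library.
    assert math.isclose(a, statistics.pvariance(totals), rel_tol=1e-9)
```

Comparing with a tolerance instead of `==` matters here because the two formulas accumulate floating-point error differently even when they are mathematically equivalent.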

Deployment

Copying the project folders and files to the target system should be sufficient.

Versioning

This project uses SemVer for versioning.

Authors

Radovan Jablonovsky - Initial work - Spark

See also the list of contributors who participated in this project.

License

This project is licensed under the BSD License - see the LICENSE.md file for details.

Acknowledgments

Hat tip to the data analytics community, Google, and GitHub

