rjablonovsky/spark

Spark challenge for ETL and Analytics

The challenge consists of two parts:

  1. a basic ETL job that transforms a CSV file into Parquet format
  2. a job that uses the Parquet files to compute some basic analytics
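
As an illustration of the first part, here is a minimal sketch of a CSV-to-Parquet step in PySpark. It is not the code in src/job_etl.py; the function names, the application name, and the example file name are assumptions made for this sketch, and schema inference is used only to keep it short.

```python
from pathlib import Path


def parquet_dir_for(csv_path: str, out_root: str) -> str:
    """Derive an output location of the form <out_root>/<csv stem>.parquet."""
    return str(Path(out_root) / (Path(csv_path).stem + ".parquet"))


def csv_to_parquet(csv_path: str, out_root: str) -> str:
    """Read a CSV file that has a header row and rewrite it as Parquet."""
    # Imported lazily so the path helper above works without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("csv_to_parquet_sketch").getOrCreate()
    # header=True takes column names from the first row;
    # inferSchema=True asks Spark to guess column types (an extra pass over the data).
    df = spark.read.csv(csv_path, header=True, inferSchema=True)
    out_dir = parquet_dir_for(csv_path, out_root)
    # Overwrite any previous output so the job can be re-run.
    df.write.mode("overwrite").parquet(out_dir)
    return out_dir


if __name__ == "__main__":
    import sys

    # Hypothetical usage: spark-submit sketch.py <csv_path> <out_root>
    if len(sys.argv) == 3:
        print(csv_to_parquet(sys.argv[1], sys.argv[2]))
```

The analytics job could then load the result with spark.read.parquet on the returned directory. An explicit schema is usually preferable to inference for a production ETL job.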

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Prerequisites

Python (3.8), Apache Spark 3.1+ (Hadoop 3.0+)
Data from https://www.kaggle.com/open-powerlifting/powerlifting-database/download
To run the tests, pandas and pytest are also needed.

Installing

Follow links:

Example run:

pyspark < src/job_etl.py > log/job_etl.log 2>&1
pyspark < src/job_analytics.py > log/job_analytics.log 2>&1

Running the tests

To run the automated tests for this system, pandas and pytest have to be installed and configured.

Breakdown into end-to-end tests

The tests focus on evaluating the ETL transformation. The analytics calculations are checked by computing the same output with different formulas or algorithms that should give the same result. Verifying those calculations currently requires inspecting the output files and logs, or adding more automated tests. Example test run:

pytest tests/test_wlConvert.py > log/test_wlConvert.log 2>&1
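
The cross-checking idea described above (computing the same figure by two different routes and comparing) can be sketched in plain Python. The sample rows and function names below are hypothetical and are not taken from the project or the dataset; the real analytics run in Spark.

```python
import math
from collections import defaultdict

# Hypothetical sample rows of (group, value); not real data.
rows = [("F", 100.0), ("M", 180.5), ("F", 122.5), ("M", 210.0)]


def mean_direct(values):
    """Route 1: overall mean computed in a single pass over the values."""
    return sum(values) / len(values)


def mean_from_group_totals(pairs):
    """Route 2: overall mean rebuilt from per-group sums and counts."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for group, value in pairs:
        totals[group] += value
        counts[group] += 1
    return sum(totals.values()) / sum(counts.values())


def test_mean_routes_agree():
    direct = mean_direct([value for _, value in rows])
    grouped = mean_from_group_totals(rows)
    # Compare with a tolerance, since floating-point sums may differ slightly.
    assert math.isclose(direct, grouped, rel_tol=1e-9)
```

Written as a pytest test function like this, such a comparison runs automatically instead of requiring manual log inspection.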

Deployment

Copying the project folders and files to the target system should be sufficient.

Versioning

This project uses SemVer for versioning.

Authors

  • Radovan Jablonovsky - Initial work - Spark

See also the list of contributors who participated in this project.

License

This project is licensed under the BSD License; see the LICENSE.md file for details.

Acknowledgments

  • Hat tip to the data analytics community, Google, and GitHub
