This is the companion repository for the Data Analysis with Python and PySpark book (Manning, 2022). It contains the source code and data download scripts, when pertinent.
With Databricks offering free access to its most important features, you can now avoid installing (and paying for) your own installation. I've created a notebook/file you can use to load all the data into tables and volumes. Five minutes and you're ready to work through the code examples, no fuss!
Just clone the repository in Databricks and open the data download notebook.
The complete data set for the book is around 1 GB. Because of this, I moved the data sources to another repository, so you don't have to clone a gigantic repository just to get the code. The book assumes the data is under ./data.
If you encounter mistakes in the book manuscript (including the printed source code), please use the Manning platform to provide feedback.
When I execute a *.py file in my PyCharm IDE, it uses the directory containing that file as the root of the execution.
Therefore, the root of execution would be ~/git/DataAnalysisWithPythonAndPySpark/code/Chxx in my configuration.
The book, however, assumes the root of the project ~/git/DataAnalysisWithPythonAndPySpark to be the root of the
execution.
Hence, we change the relative path to a data resource from
./data/$specific_data_dir to ../../data/$specific_data_dir.
For example, in src/Ch04/checkpoint.py we use
DIRECTORY = "../../data/broadcast_logs"
instead of
DIRECTORY = "./data/broadcast_logs"
Whenever we want to execute a *.py file from the bash terminal, we first go into the directory that contains the
Python file, e.g.:
~/git/DataAnalysisWithPythonAndPySpark$ cd src/Ch04
~/git/DataAnalysisWithPythonAndPySpark/src/Ch04$ spark-submit ./checkpoint.py
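If you would rather not depend on the working directory at all, one possible alternative is to locate the data directory from the script's own location. The sketch below is not part of the book or of this repository: the helper name find_data_dir is hypothetical, and it assumes a directory called data sits somewhere above the chapter script, as in the layout described above.

```python
from pathlib import Path


def find_data_dir(start: Path, name: str = "data") -> Path:
    """Walk upward from `start` and return the first directory called `name`.

    Hypothetical helper: it assumes the project keeps its data in a
    directory named `name` at, or above, the directory given in `start`.
    """
    start = start.resolve()
    # Check the starting directory first, then each of its ancestors in turn.
    for candidate in (start, *start.parents):
        data_dir = candidate / name
        if data_dir.is_dir():
            return data_dir
    raise FileNotFoundError(f"no '{name}' directory found at or above {start}")


# Possible usage inside a chapter script such as src/Ch04/checkpoint.py:
#
#     DIRECTORY = str(find_data_dir(Path(__file__).parent) / "broadcast_logs")
#
# The resulting absolute path does not change with the directory the script
# is launched from.
```

With a helper like this, the same script should resolve the same path whether it is started from PyCharm, from the project root, or from src/Ch04. The trade-off is that the constant is no longer the literal ./data path printed in the book.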