Data Analysis with Python and PySpark

This is the companion repository for the Data Analysis with Python and PySpark book (Manning, 2022). It contains the source code and data download scripts, when pertinent.

NEW (June 2025): Databricks Free

With Databricks now offering free access to most of its important functionality, you can avoid installing (and paying for) your own version. I've created a notebook/file you can use to get all the data into tables and volumes. Five minutes and you're ready to work through the code examples, no fuss!

Just clone the repository in Databricks and open the data download notebook.

Get the data (old version, still works)

The complete data set for the book is around 1 GB. Because of this, I moved the data sources to another repository, so that you don't have to clone a gigantic repository just to get the code. The book assumes the data is under ./data.
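Since the code expects the data under ./data, a quick check before running a chapter can save some confusion. The following is a minimal sketch, not part of the book's code: the helper name missing_data_dirs is hypothetical, and broadcast_logs is used only because it is the one data directory named in this README.

```python
from pathlib import Path


def missing_data_dirs(expected, root="./data"):
    """Return the names in `expected` that are not directories under `root`."""
    base = Path(root)
    return [name for name in expected if not (base / name).is_dir()]


if __name__ == "__main__":
    # "broadcast_logs" is the only data directory this README mentions;
    # extend the list with whatever the chapter you are working on needs.
    missing = missing_data_dirs(["broadcast_logs"])
    if missing:
        print("Missing under ./data:", ", ".join(missing))
    else:
        print("All expected data directories found.")
```

Run it from the project root, because the default root is relative to the current working directory.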

Mistakes or omissions

If you encounter mistakes in the book manuscript (including the printed source code), please use the Manning platform to provide feedback.

Note on relative paths and program execution

When I execute a *.py file in my PyCharm IDE, the directory containing that file becomes the root of the execution.

Therefore, the root of execution is ~/git/DataAnalysisWithPythonAndPySpark/src/Chxx in my configuration. The book, however, assumes that the project root, ~/git/DataAnalysisWithPythonAndPySpark, is the root of the execution.

Hence, we change the relative path to a data resource from ./data/$specific_data_dir to ../../data/$specific_data_dir.

For example, in src/Ch04/checkpoint.py we use

DIRECTORY = "../../data/broadcast_logs"

instead of

DIRECTORY = "./data/broadcast_logs"

Whenever we want to execute a *.py file from the bash terminal, we first change into the directory that contains the Python file, e.g.:

~/git/DataAnalysisWithPythonAndPySpark$ cd src/Ch04
~/git/DataAnalysisWithPythonAndPySpark/src/Ch04$ spark-submit ./checkpoint.py
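An alternative to editing relative paths is to resolve the data directory from the location of the script itself, so that the result does not depend on the working directory. This is a minimal sketch and not how the repository's code is written: the helpers find_project_root and data_dir are hypothetical, and the sketch assumes the project root is the nearest parent directory that contains a data folder.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "data") -> Path:
    """Walk upward from `start` until a directory containing `marker` is found."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No '{marker}' directory found at or above {start}")


def data_dir(script_file: str, name: str) -> Path:
    """Return <project root>/data/<name>, independent of the working directory."""
    return find_project_root(Path(script_file).parent) / "data" / name


# Hypothetical usage inside src/Ch04/checkpoint.py:
# DIRECTORY = str(data_dir(__file__, "broadcast_logs"))
```

With this approach the same script gives the same DIRECTORY whether it is started from PyCharm, from the project root, or from src/Ch04, as long as the script is run from a local file system where __file__ points into the cloned repository.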
