This project sets up a Docker-based environment that includes Postgres, Spark cluster with 2 workers, and a Jupyter Notebook instance. It allows you to perform data processing and analysis using PySpark and interact with a Postgres database seamlessly.
To get started with this project, follow the steps below:
-
Clone the repository:
git clone https://github.com/donbigode/spark-container-for-elt
cd spark-container-for-elt
-
Docker Compose: Before running the environment, make sure you have Docker and Docker Compose installed on your system. If not, you can install Docker from the official website: https://www.docker.com/get-started
-
Update the Dockerfiles: If you are using the Spark Dockerfile provided in this repository, no changes are needed. However, if you prefer to use your own Spark Dockerfile, ensure it includes the necessary dependencies.
-
Update the Docker Compose: The
docker-compose.ymlfile should be updated with your desired configurations, such as database credentials, usernames, passwords, etc. You can also modify the ports, volume mappings, or any other settings to suit your needs. -
Run the Environment: Once you've made the necessary updates, run the following command to set up the environment:
docker-compose up -dThis will start the Postgres database, Spark cluster with 2 workers, and Jupyter Notebook instance.
-
Access Services:
- Postgres: localhost:5432
- Spark Master UI: localhost:8080
- Jupyter Notebook: localhost:8888
- Use the token provided during Jupyter startup to access the notebook. (docker logs jupyter_notebook is the command in the bash to receive it)
-
Data and Code Deployment:
- Place any inbound data you wish to process in the
inbound/directory. - Write your PySpark code in the
spark_scripts/directory, using thepyspark/directory for code deployment.
- Place any inbound data you wish to process in the
-
Cleaning Up: To stop and remove the containers (but retain the data in volumes), run:
docker-compose down
Here is the expected directory structure of the project:
project_folder/
├── inbound/
│ └── sample_data.csv
│
├── postgres/
│ └── Dockerfile
│
├── spark/
│ └── Dockerfile
│
├── spark_scripts/src
│ └── file_operations.py
│ └── main.py
│ └── requirements.txt
|
├── docker-compose.yml
└── Readme.md
- Ensure that the necessary data and code are placed in their respective directories (
inbound/andspark_scripts/) before running the Spark jobs in the Jupyter Notebook. - If you make any changes to the Dockerfiles or the Docker Compose file, remember to rebuild the containers using
docker-compose up -d --build.
Happy data processing and analysis with Spark and Jupyter Notebook!
Replace `your_username` and `your_project` with your actual GitHub username and project name, respectively. You can also customize any other parts of the `Readme.md` file as needed to suit your project.