Project

This project sets up a Docker-based environment that includes Postgres, Spark cluster with 2 workers, and a Jupyter Notebook instance. It allows you to perform data processing and analysis using PySpark and interact with a Postgres database seamlessly.

Getting Started

To get started with this project, follow the steps below:

Clone the repository:

git clone https://github.com/donbigode/spark-container-for-elt

cd spark-container-for-elt
Docker Compose: Before running the environment, make sure you have Docker and Docker Compose installed on your system. If not, you can install Docker from the official website: https://www.docker.com/get-started
Update the Dockerfiles: If you are using the Spark Dockerfile provided in this repository, no changes are needed. However, if you prefer to use your own Spark Dockerfile, ensure it includes the necessary dependencies.
Update the Docker Compose: The docker-compose.yml file should be updated with your desired configurations, such as database credentials, usernames, passwords, etc. You can also modify the ports, volume mappings, or any other settings to suit your needs.
Run the Environment: Once you've made the necessary updates, run the following command to set up the environment:
```
docker-compose up -d
```
This will start the Postgres database, Spark cluster with 2 workers, and Jupyter Notebook instance.
Access Services:
- Postgres: localhost:5432
- Spark Master UI: localhost:8080
- Jupyter Notebook: localhost:8888
  - Use the token provided during Jupyter startup to access the notebook. (docker logs jupyter_notebook is the command in the bash to receive it)
Data and Code Deployment:
- Place any inbound data you wish to process in the inbound/ directory.
- Write your PySpark code in the spark_scripts/ directory, using the pyspark/ directory for code deployment.
Cleaning Up: To stop and remove the containers (but retain the data in volumes), run:
```
docker-compose down
```

Directory Structure

Here is the expected directory structure of the project:

   project_folder/
├── inbound/
│   └── sample_data.csv
│
├── postgres/
│   └── Dockerfile
│
├── spark/
│   └── Dockerfile
│
├── spark_scripts/src
│   └── file_operations.py
│   └── main.py
│   └── requirements.txt
|
├── docker-compose.yml
└── Readme.md

Note

Ensure that the necessary data and code are placed in their respective directories (inbound/ and spark_scripts/) before running the Spark jobs in the Jupyter Notebook.
If you make any changes to the Dockerfiles or the Docker Compose file, remember to rebuild the containers using docker-compose up -d --build.

Happy data processing and analysis with Spark and Jupyter Notebook!


Replace `your_username` and `your_project` with your actual GitHub username and project name, respectively. You can also customize any other parts of the `Readme.md` file as needed to suit your project.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Project

Getting Started

Directory Structure

Note

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
inbound		inbound
postgres		postgres
spark-scripts/src		spark-scripts/src
spark		spark
Readme.md		Readme.md
docker-compose.yml		docker-compose.yml

Folders and files

Latest commit

History

Repository files navigation

Project

Getting Started

Directory Structure

Note

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages