Skip to content

donbigode/spark-container-for-elt

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

13 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Project

This project sets up a Docker-based environment that includes Postgres, Spark cluster with 2 workers, and a Jupyter Notebook instance. It allows you to perform data processing and analysis using PySpark and interact with a Postgres database seamlessly.

Getting Started

To get started with this project, follow the steps below:

  1. Clone the repository:

    git clone https://github.com/donbigode/spark-container-for-elt

    cd spark-container-for-elt

  2. Docker Compose: Before running the environment, make sure you have Docker and Docker Compose installed on your system. If not, you can install Docker from the official website: https://www.docker.com/get-started

  3. Update the Dockerfiles: If you are using the Spark Dockerfile provided in this repository, no changes are needed. However, if you prefer to use your own Spark Dockerfile, ensure it includes the necessary dependencies.

  4. Update the Docker Compose: The docker-compose.yml file should be updated with your desired configurations, such as database credentials, usernames, passwords, etc. You can also modify the ports, volume mappings, or any other settings to suit your needs.

  5. Run the Environment: Once you've made the necessary updates, run the following command to set up the environment:

    docker-compose up -d
    

    This will start the Postgres database, Spark cluster with 2 workers, and Jupyter Notebook instance.

  6. Access Services:

    • Postgres: localhost:5432
    • Spark Master UI: localhost:8080
    • Jupyter Notebook: localhost:8888
      • Use the token provided during Jupyter startup to access the notebook. (docker logs jupyter_notebook is the command in the bash to receive it)
  7. Data and Code Deployment:

    • Place any inbound data you wish to process in the inbound/ directory.
    • Write your PySpark code in the spark_scripts/ directory, using the pyspark/ directory for code deployment.
  8. Cleaning Up: To stop and remove the containers (but retain the data in volumes), run:

    docker-compose down
    

Directory Structure

Here is the expected directory structure of the project:

   project_folder/
├── inbound/
│   └── sample_data.csv
│
├── postgres/
│   └── Dockerfile
│
├── spark/
│   └── Dockerfile
│
├── spark_scripts/src
│   └── file_operations.py
│   └── main.py
│   └── requirements.txt
|
├── docker-compose.yml
└── Readme.md

Note

  • Ensure that the necessary data and code are placed in their respective directories (inbound/ and spark_scripts/) before running the Spark jobs in the Jupyter Notebook.
  • If you make any changes to the Dockerfiles or the Docker Compose file, remember to rebuild the containers using docker-compose up -d --build.

Happy data processing and analysis with Spark and Jupyter Notebook!


Replace `your_username` and `your_project` with your actual GitHub username and project name, respectively. You can also customize any other parts of the `Readme.md` file as needed to suit your project.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors