The goal of this project is to build an ETL pipeline for a database hosted on Amazon Redshift. The pipeline loads data from Amazon S3 into staging tables on Redshift and executes SQL statements to create analytics tables from these staging tables.
In the first step, we define all required SQL queries in the `sql_queries.py` file. These queries include:
- Creating fact and dimension tables
- Loading data from S3 into staging tables
- Inserting data from the staging tables into the final tables
- Retrieving data ready for analysis
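The queries above can be sketched as plain SQL strings grouped into lists, which the later scripts iterate over. This is a minimal illustration; the table names, columns, bucket path, and IAM role ARN below are placeholders, not values from the actual project:

```python
# Hypothetical excerpt of sql_queries.py. All identifiers (staging_events,
# songplays, the S3 path, the IAM role) are illustrative placeholders.

staging_events_table_create = """
CREATE TABLE IF NOT EXISTS staging_events (
    event_id  BIGINT IDENTITY(0, 1),
    user_id   INT,
    song      VARCHAR,
    ts        BIGINT
);
"""

# Redshift's COPY command bulk-loads S3 objects into a staging table.
staging_events_copy = """
COPY staging_events
FROM 's3://example-bucket/log_data'
IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-role'
FORMAT AS JSON 'auto'
REGION 'us-west-2';
"""

# Final tables are filled with INSERT ... SELECT from the staging tables.
songplay_table_insert = """
INSERT INTO songplays (user_id, song, start_time)
SELECT user_id, song, TIMESTAMP 'epoch' + ts / 1000 * INTERVAL '1 second'
FROM staging_events;
"""

# Grouped lists let create_tables.py and etl.py loop over the queries.
create_table_queries = [staging_events_table_create]
copy_table_queries = [staging_events_copy]
insert_table_queries = [songplay_table_insert]
```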
Next, we write a Python script, `create_tables.py`, that executes the table-creation queries.
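A minimal sketch of what `create_tables.py` might look like, assuming the query lists come from `sql_queries.py` and `dwh.cfg` has a `[CLUSTER]` section (a single inline statement stands in for the real queries here):

```python
import configparser

# Stand-ins for the lists imported from sql_queries.py in the real script.
drop_table_queries = ["DROP TABLE IF EXISTS staging_events;"]
create_table_queries = [
    "CREATE TABLE IF NOT EXISTS staging_events (event_id BIGINT, user_id INT);"
]

def run_queries(conn, queries):
    """Execute each SQL statement on an open DB-API connection."""
    cur = conn.cursor()
    for query in queries:
        cur.execute(query)
        conn.commit()

def main():
    """Entry point: connect to Redshift using dwh.cfg and (re)create tables."""
    import psycopg2  # imported lazily so the sketch can be read without it
    config = configparser.ConfigParser()
    config.read("dwh.cfg")
    # Assumes a [CLUSTER] section listing HOST, DB_NAME, DB_USER,
    # DB_PASSWORD, and DB_PORT in that order.
    conn = psycopg2.connect(
        "host={} dbname={} user={} password={} port={}".format(
            *config["CLUSTER"].values()
        )
    )
    run_queries(conn, drop_table_queries)    # start from a clean slate
    run_queries(conn, create_table_queries)
    conn.close()
```

Dropping tables before creating them makes the script safe to re-run during development.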
We then write `etl.py`, which loads data from Amazon S3 into the staging tables on Redshift and inserts the loaded data into the final tables.
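The two phases of `etl.py` could be sketched as follows; the COPY and INSERT statements would normally be imported from `sql_queries.py`, and the bucket, IAM role, and table names below are made up for illustration:

```python
# Hypothetical sketch of etl.py; all identifiers are placeholders.
copy_table_queries = ["""
COPY staging_events FROM 's3://example-bucket/log_data'
IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-role'
FORMAT AS JSON 'auto';
"""]
insert_table_queries = ["""
INSERT INTO songplays (user_id, song)
SELECT user_id, song FROM staging_events;
"""]

def load_staging_tables(conn):
    """COPY raw S3 data into the Redshift staging tables."""
    cur = conn.cursor()
    for query in copy_table_queries:
        cur.execute(query)
        conn.commit()

def insert_tables(conn):
    """Populate the final tables from the staging tables."""
    cur = conn.cursor()
    for query in insert_table_queries:
        cur.execute(query)
        conn.commit()

def main():
    """Entry point: connect using dwh.cfg, then run both ETL phases."""
    import configparser, psycopg2  # lazy imports, as in create_tables.py
    config = configparser.ConfigParser()
    config.read("dwh.cfg")
    conn = psycopg2.connect(
        "host={} dbname={} user={} password={} port={}".format(
            *config["CLUSTER"].values()
        )
    )
    load_staging_tables(conn)
    insert_tables(conn)
    conn.close()
```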
We also write `analysis.py`, which retrieves data from the final tables for analysis.
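One plausible shape for `analysis.py` is a dictionary of named analytics queries run against the final tables. The query and table names below are illustrative placeholders, not the project's actual analytics queries:

```python
# Hypothetical sketch of analysis.py.
analysis_queries = {
    "songplays_per_user": (
        "SELECT user_id, COUNT(*) FROM songplays GROUP BY user_id;"
    ),
}

def run_analysis(conn):
    """Run each analytics query and return its rows, keyed by query name."""
    cur = conn.cursor()
    results = {}
    for name, query in analysis_queries.items():
        cur.execute(query)
        results[name] = cur.fetchall()
    return results

def main():
    """Entry point: connect using dwh.cfg and print each query's rows."""
    import configparser, psycopg2  # lazy imports, as in the other scripts
    config = configparser.ConfigParser()
    config.read("dwh.cfg")
    conn = psycopg2.connect(
        "host={} dbname={} user={} password={} port={}".format(
            *config["CLUSTER"].values()
        )
    )
    for name, rows in run_analysis(conn).items():
        print(name)
        for row in rows:
            print(row)
    conn.close()
```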
Finally, we write a Jupyter Notebook, `project_exe.ipynb`, that executes all of the previous scripts step by step.
- Ensure that all required Python packages are installed (e.g., `psycopg2`, `pandas`, `boto3`).
- Configure the `dwh.cfg` file with your Amazon Redshift cluster and AWS credentials.
- Run the `create_tables.py` script to create the necessary tables in Redshift.
- Run the `etl.py` script to load data from S3 into staging tables and insert it into the final tables.
- Run the `analysis.py` script to retrieve data from the final tables for analysis.
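The exact layout of `dwh.cfg` is not shown here; a plausible structure, with every section name, key, and value being a placeholder you would replace with your own cluster details, might look like:

```ini
[CLUSTER]
HOST=example-cluster.abc123.us-west-2.redshift.amazonaws.com
DB_NAME=dev
DB_USER=awsuser
DB_PASSWORD=your-password
DB_PORT=5439

[IAM_ROLE]
ARN=arn:aws:iam::123456789012:role/example-redshift-role

[S3]
LOG_DATA=s3://example-bucket/log_data
SONG_DATA=s3://example-bucket/song_data
```

Since this file contains credentials, it should not be committed to version control.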
- `sql_queries.py`: Contains SQL queries for creating tables, loading data, and inserting data into final tables.
- `create_tables.py`: Python script to create tables in Redshift.
- `etl.py`: Python script to load data from S3 and insert it into final tables.
- `analysis.py`: Python script to retrieve data from final tables for analysis.
- `project_exe.ipynb`: Jupyter Notebook to execute the scripts step by step.
- `dwh.cfg`: Configuration file containing Redshift cluster and AWS credentials.
- `README.md`: Documentation explaining the project, how to run the scripts, and the files in the repository.