diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..cf0f486 --- /dev/null +++ b/.gitignore @@ -0,0 +1,4 @@ +.env +__pycache__/ +*.pyc +.ipynb_checkpoints/ \ No newline at end of file diff --git a/lab-sql-python-connection.ipynb b/lab-sql-python-connection.ipynb new file mode 100644 index 0000000..970e704 --- /dev/null +++ b/lab-sql-python-connection.ipynb @@ -0,0 +1,857 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "3b49dda7", + "metadata": {}, + "source": [ + "# Connecting Python to SQL Lab\n", + "\n", + "## Introduction\n", + "\n", + "Welcome to the Connecting Python to SQL lab!\n", + "\n", + "In this lab, you will be working with the Sakila database on movie rentals. Specifically, you will be practicing how to do basic SQL queries using Python. By connecting Python to SQL, you can leverage the power of both languages to efficiently manipulate and analyze large datasets.\n", + "\n", + "Throughout this lab, you will practice how to use Python to retrieve and manipulate data stored in the Sakila database using SQL queries. Let's get started!" + ] + }, + { + "cell_type": "markdown", + "id": "2119b713", + "metadata": {}, + "source": [ + "## Challenge\n", + "\n", + "In this lab, the objective is to identify the customers who were active in both May and June and to determine how their activity differed between the two months.\n", + "\n", + "To achieve this, follow these steps:" + ] + }, + { + "cell_type": "markdown", + "id": "5204279a", + "metadata": {}, + "source": [ + "## Step 1: Establish a connection between Python and the Sakila database\n", + "\n", + "Establish a connection between Python and the Sakila database." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "172c1c89", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import pandas as pd\n", + "from dotenv import load_dotenv\n", + "from sqlalchemy import create_engine\n", + "from sqlalchemy import text\n", + "\n", + "load_dotenv()\n", + "\n", + "DB_USER = os.getenv(\"DB_USER\")\n", + "DB_PASSWORD = os.getenv(\"DB_PASSWORD\")\n", + "DB_HOST = os.getenv(\"DB_HOST\")\n", + "DB_PORT = os.getenv(\"DB_PORT\")\n", + "DB_NAME = os.getenv(\"DB_NAME\")" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "d817753e", + "metadata": {}, + "outputs": [], + "source": [ + "connection_string = f\"mysql+pymysql://{DB_USER}:{DB_PASSWORD}@{DB_HOST}:{DB_PORT}/{DB_NAME}\"\n", + "\n", + "engine = create_engine(connection_string)" + ] + }, + { + "cell_type": "markdown", + "id": "87448b69", + "metadata": {}, + "source": [ + "## Step 2: Create the `rentals_month` function\n", + "\n", + "Write a Python function called `rentals_month` that retrieves rental data for a given month and year, passed as parameters, from the Sakila database as a Pandas DataFrame.\n", + "\n", + "The function should take in three parameters:\n", + "\n", + "- `engine`: an object representing the database connection engine to be used to establish a connection to the Sakila database.\n", + "- `month`: an integer representing the month for which rental data is to be retrieved.\n", + "- `year`: an integer representing the year for which rental data is to be retrieved.\n", + "\n", + "The function should execute a SQL query to retrieve the rental data for the specified month and year from the `rental` table in the Sakila database, and return it as a pandas DataFrame." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "82ec8103", + "metadata": {}, + "outputs": [], + "source": [ + "def rentals_month(engine, month, year):\n", + " start_date = f\"{year}-{month:02d}-01\"\n", + " end_date = pd.to_datetime(start_date) + pd.DateOffset(months=1)\n", + " end_date = end_date.strftime(\"%Y-%m-%d\")\n", + "\n", + " query = text(\"\"\"\n", + " SELECT *\n", + " FROM rental\n", + " WHERE rental_date >= :start_date\n", + " AND rental_date < :end_date;\n", + " \"\"\")\n", + "\n", + " rentals_df = pd.read_sql(\n", + " query,\n", + " engine,\n", + " params={\n", + " \"start_date\": start_date,\n", + " \"end_date\": end_date\n", + " }\n", + " )\n", + "\n", + " return rentals_df" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "1809b2d2", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " rental_id rental_date inventory_id customer_id \\\n", + "0 1 2005-05-24 22:53:30 367 130 \n", + "1 2 2005-05-24 22:54:33 1525 459 \n", + "2 3 2005-05-24 23:03:39 1711 408 \n", + "3 4 2005-05-24 23:04:41 2452 333 \n", + "4 5 2005-05-24 23:05:21 2079 222 \n", + "\n", + " return_date staff_id last_update \n", + "0 2005-05-26 22:04:30 1 2006-02-15 21:30:53 \n", + "1 2005-05-28 19:40:33 1 2006-02-15 21:30:53 \n", + "2 2005-06-01 22:12:39 1 2006-02-15 21:30:53 \n", + "3 2005-06-03 01:43:41 2 2006-02-15 21:30:53 \n", + "4 2005-06-02 04:33:21 1 2006-02-15 21:30:53 " + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "may_rentals = rentals_month(engine, 5, 2005)\n", + "\n", + "may_rentals.head()" + ] + }, + { + "cell_type": "markdown", + "id": "95eb1504", + "metadata": {}, + "source": [ + "## Step 3: Create the `rental_count_month` function\n", + "\n", + "Develop a Python function called `rental_count_month` that takes the DataFrame provided by `rentals_month` as input along with the month and year and returns a new DataFrame containing the number of rentals made by each `customer_id` during the selected month and year.\n", + "\n", + "The function should also include the month and year as parameters and use them to name the new column according to the month and year.\n", + "\n", + "For example, if the input month is `05` and the year is `2005`, the column name should be `\"rentals_05_2005\"`.\n", + "\n", + "Hint: Consider making use of pandas `groupby()`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "4d674e13", + "metadata": {}, + "outputs": [], + "source": [ + "def rental_count_month(rentals_df, month, year):\n", + " column_name = f\"rentals_{month:02d}_{year}\"\n", + "\n", + " rentals_count_df = (\n", + " rentals_df\n", + " .groupby(\"customer_id\")\n", + " .size()\n", + " .reset_index(name=column_name)\n", + " )\n", + "\n", + " return rentals_count_df" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "5b91c735", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " customer_id rentals_05_2005\n", + "0 1 2\n", + "1 2 1\n", + "2 3 2\n", + "3 5 3\n", + "4 6 3" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "may_count = rental_count_month(may_rentals, 5, 2005)\n", + "\n", + "may_count.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "1ade84c2", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(520, 2)" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "may_count.shape" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "f3232e68", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " customer_id rentals_06_2005\n", + "0 1 7\n", + "1 2 1\n", + "2 3 4\n", + "3 4 6\n", + "4 5 5" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "june_rentals = rentals_month(engine, 6, 2005)\n", + "\n", + "june_count = rental_count_month(june_rentals, 6, 2005)\n", + "\n", + "june_count.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "7424db71", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(590, 2)" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "june_count.shape" + ] + }, + { + "cell_type": "markdown", + "id": "d4a0f849", + "metadata": {}, + "source": [ + "## Step 4: Create the `compare_rentals` function\n", + "\n", + "Create a Python function called `compare_rentals` that takes two DataFrames as input containing the number of rentals made by each customer in different months and years.\n", + "\n", + "The function should return a combined DataFrame with a new `difference` column, which is the difference between the number of rentals in the two months." + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "7fd70a12", + "metadata": {}, + "outputs": [], + "source": [ + "def compare_rentals(df1, df2):\n", + " comparison_df = pd.merge(\n", + " df1,\n", + " df2,\n", + " on=\"customer_id\",\n", + " how=\"inner\"\n", + " )\n", + "\n", + " first_month_col = df1.columns[1]\n", + " second_month_col = df2.columns[1]\n", + "\n", + " comparison_df[\"difference\"] = (\n", + " comparison_df[second_month_col] - comparison_df[first_month_col]\n", + " )\n", + "\n", + " return comparison_df" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "793a53bb", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " customer_id rentals_05_2005 rentals_06_2005 difference\n", + "0 1 2 7 5\n", + "1 2 1 1 0\n", + "2 3 2 4 2\n", + "3 5 3 5 2\n", + "4 6 3 4 1" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "comparison_df = compare_rentals(may_count, june_count)\n", + "\n", + "comparison_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "154acd8e", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " customer_id rentals_05_2005 rentals_06_2005 difference\n", + "386 454 1 10 9\n", + "389 457 1 9 8\n", + "178 213 1 9 8\n", + "248 295 1 9 8\n", + "23 27 1 8 7" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " customer_id rentals_05_2005 rentals_06_2005 difference\n", + "173 207 6 1 -5\n", + "230 274 6 2 -4\n", + "210 250 5 1 -4\n", + "167 198 5 1 -4\n", + "509 596 6 2 -4" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "display(comparison_df.sort_values(\"difference\", ascending=False).head())\n", + "display(comparison_df.sort_values(\"difference\", ascending=True).head())" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "01a96b96", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(512, 4)" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "comparison_df.shape" + ] + }, + { + "cell_type": "markdown", + "id": "d59e50f5", + "metadata": {}, + "source": [ + "## Conclusions\n", + "\n", + "After comparing May and June 2005, we can see that rental activity was higher in June.\n", + "\n", + "In May, there were **520 active customers**, while in June there were **590 active customers**. When comparing both months together, **512 customers were active in both May and June**. This means that most of the customers who rented movies in May also rented again in June.\n", + "\n", + "To compare the activity between both months, I created the \"difference\" column as:\n", + "\n", + "\"rentals_06_2005\" - \"rentals_05_2005\"\n", + "\n", + "This helped me understand whether each customer rented more, less, or the same number of movies in June compared to May.\n", + "\n", + "A positive value means that the customer rented more movies in June, a value of zero means that the activity stayed the same, and a negative value means that the customer rented fewer movies in June.\n", + "\n", + "Looking at the results, some customers clearly increased their activity in June. For example, a few customers went from only one rental in May to nine or ten rentals in June. 
There were also some customers who rented less in June, but overall the results suggest that June was a stronger month for rentals.\n", + "\n", + "In conclusion, the main finding is that customer activity increased from May to June, both in terms of the number of active customers and the rental activity of many customers who were active in both months.\n" + ] + }, + { + "cell_type": "markdown", + "id": "f19af788", + "metadata": {}, + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "base", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.13.9" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}
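The three steps in the notebook (a half-open date-range query with bound parameters, a per-customer count, and an inner join with a `difference` column) can be checked without a MySQL server. The following is a minimal sketch under stated assumptions, not the notebook's implementation: it swaps in the standard-library `sqlite3` module and plain dictionaries for SQLAlchemy and pandas, the rows are made-up toy data rather than Sakila records, and the helper names `month_bounds`, `rentals_per_customer`, and `compare_counts` are hypothetical.

```python
import sqlite3
from collections import Counter


def month_bounds(month, year):
    """Return ISO dates for the half-open range [first of month, first of next month)."""
    next_month, next_year = (1, year + 1) if month == 12 else (month + 1, year)
    return f"{year}-{month:02d}-01", f"{next_year}-{next_month:02d}-01"


def rentals_per_customer(conn, month, year):
    """Count rentals per customer_id for one month using a parameterized query."""
    start, end = month_bounds(month, year)
    rows = conn.execute(
        "SELECT customer_id FROM rental "
        "WHERE rental_date >= ? AND rental_date < ?",
        (start, end),
    ).fetchall()
    return Counter(customer_id for (customer_id,) in rows)


def compare_counts(first, second):
    """Inner join on customer_id: keep only customers present in both months."""
    shared = sorted(first.keys() & second.keys())
    return {cid: second[cid] - first[cid] for cid in shared}


# Toy data, made up for illustration (not taken from Sakila).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rental (customer_id INTEGER, rental_date TEXT)")
conn.executemany(
    "INSERT INTO rental VALUES (?, ?)",
    [
        (1, "2005-05-24 22:53:30"),
        (1, "2005-06-01 00:00:00"),  # boundary row: counted in June, not May
        (1, "2005-06-15 10:00:00"),
        (2, "2005-05-30 09:00:00"),  # active in May only, dropped by the join
        (3, "2005-06-30 23:59:59"),  # active in June only, dropped by the join
    ],
)

may = rentals_per_customer(conn, 5, 2005)
june = rentals_per_customer(conn, 6, 2005)
differences = compare_counts(may, june)
print(differences)
```

With this toy data only customer 1 appears in both months, so the printed result is `{1: 1}`. The same inner-join behaviour explains why the notebook's comparison has 512 rows although May has 520 active customers and June has 590: customers active in only one of the two months are excluded from the `difference` column.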