From 2c88138354bd901ff1af69534a923e7092028c29 Mon Sep 17 00:00:00 2001 From: salvadorciaurriz-lang Date: Mon, 18 May 2026 15:20:18 +0200 Subject: [PATCH] Create lab_connecting_python_to_sql_SOLVED_ES.ipynb --- lab_connecting_python_to_sql_SOLVED_ES.ipynb | 554 +++++++++++++++++++ 1 file changed, 554 insertions(+) create mode 100644 lab_connecting_python_to_sql_SOLVED_ES.ipynb diff --git a/lab_connecting_python_to_sql_SOLVED_ES.ipynb b/lab_connecting_python_to_sql_SOLVED_ES.ipynb new file mode 100644 index 0000000..7b88129 --- /dev/null +++ b/lab_connecting_python_to_sql_SOLVED_ES.ipynb @@ -0,0 +1,554 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "b0d81d3e", + "metadata": {}, + "source": [ + "# LAB | Connecting Python to SQL\n", + "\n", + "## Objective\n", + "\n", + "In this lab we connect **Python** to a SQL database, specifically the **Sakila** database, to analyze film rentals.\n", + "\n", + "The main goal is:\n", + "\n", + "> Identify the customers who were active in both May and June, and compare how their activity changed between the two months.\n", + "\n", + "We will work with:\n", + "\n", + "- `sqlalchemy` to create the connection to MySQL.\n", + "- `pandas` to load SQL query results as DataFrames.\n", + "- `groupby()` to count rentals per customer.\n", + "- `merge()` to compare activity between the two months." + ] + }, + { + "cell_type": "markdown", + "id": "b510d66b", + "metadata": {}, + "source": [ + "# 1. Import libraries" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "75121f78", + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "from sqlalchemy import create_engine" + ] + }, + { + "cell_type": "markdown", + "id": "df827af6", + "metadata": {}, + "source": [ + "# 2. Create the connection to Sakila\n", + "\n", + "Before running this block, make sure you have:\n", + "\n", + "1. MySQL running.\n", + "2. The `sakila` database loaded.\n", + "3. 
Your correct username and password.\n", + "\n", + "The general structure of the connection string is:\n", + "\n", + "```python\n", + "mysql+pymysql://user:password@host/database\n", + "```\n", + "\n", + "If your password contains special characters (such as `@`, `:` or `/`), the connection may fail because they are read as part of the URL. In that case, temporarily change the password to a simpler one or URL-encode it (for example with `urllib.parse.quote_plus`)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b67b6702", + "metadata": {}, + "outputs": [], + "source": [ + "# Replace these values with the ones for your machine\n", + "username = \"root\"\n", + "password = \"password\"\n", + "host = \"localhost\"\n", + "database = \"sakila\"\n", + "\n", + "connection_string = f\"mysql+pymysql://{username}:{password}@{host}/{database}\"\n", + "\n", + "engine = create_engine(connection_string)" + ] + }, + { + "cell_type": "markdown", + "id": "14183020", + "metadata": {}, + "source": [ + "## 2.1 Test the connection\n", + "\n", + "This query checks that Python connects to Sakila correctly." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4f56924c", + "metadata": {}, + "outputs": [], + "source": [ + "# Connection test\n", + "test_query = '''\n", + "SELECT *\n", + "FROM rental\n", + "LIMIT 5;\n", + "'''\n", + "\n", + "pd.read_sql(test_query, engine)" + ] + }, + { + "cell_type": "markdown", + "id": "ce86ac60", + "metadata": {}, + "source": [ + "# 3. Function `rentals_month`\n", + "\n", + "## Task\n", + "\n", + "Create a function called `rentals_month` that takes:\n", + "\n", + "- `engine`: the database connection.\n", + "- `month`: the month to query.\n", + "- `year`: the year to query.\n", + "\n", + "The function must return a DataFrame with the rentals for that month and year from the `rental` table.\n", + "\n", + "## SQL logic\n", + "\n", + "We use:\n", + "\n", + "```sql\n", + "MONTH(rental_date)\n", + "YEAR(rental_date)\n", + "```\n", + "\n", + "to filter by date." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c68981a5", + "metadata": {}, + "outputs": [], + "source": [ + "def rentals_month(engine, month, year):\n", + "    query = f'''\n", + "    SELECT *\n", + "    FROM rental\n", + "    WHERE MONTH(rental_date) = {month}\n", + "      AND YEAR(rental_date) = {year};\n", + "    '''\n", + "\n", + "    rentals_df = pd.read_sql(query, engine)\n", + "\n", + "    return rentals_df" + ] + }, + { + "cell_type": "markdown", + "id": "35518149", + "metadata": {}, + "source": [ + "## 3.1 Get the rentals for May and June 2005\n", + "\n", + "In Sakila, most rental records are dated 2005, so we use:\n", + "\n", + "- May 2005\n", + "- June 2005" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "da1c33b4", + "metadata": {}, + "outputs": [], + "source": [ + "may_rentals = rentals_month(engine, 5, 2005)\n", + "june_rentals = rentals_month(engine, 6, 2005)\n", + "\n", + "display(may_rentals.head())\n", + "display(june_rentals.head())\n", + "\n", + "print(\"May rentals shape:\", may_rentals.shape)\n", + "print(\"June rentals shape:\", june_rentals.shape)" + ] + }, + { + "cell_type": "markdown", + "id": "fe4fe668", + "metadata": {}, + "source": [ + "# 4. Function `rental_count_month`\n", + "\n", + "## Task\n", + "\n", + "Create a function called `rental_count_month` that takes:\n", + "\n", + "- a DataFrame of rentals,\n", + "- `month`,\n", + "- `year`.\n", + "\n", + "The function must return a new DataFrame with:\n", + "\n", + "- `customer_id`\n", + "- the number of rentals made by that customer in that month\n", + "\n", + "The count column must be named using the format:\n", + "\n", + "```python\n", + "rentals_05_2005\n", + "```\n", + "\n", + "## Method used\n", + "\n", + "We use:\n", + "\n", + "```python\n", + "groupby(\"customer_id\").size()\n", + "```\n", + "\n", + "to count how many records each customer has." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8d05cb52", + "metadata": {}, + "outputs": [], + "source": [ + "def rental_count_month(rentals_df, month, year):\n", + "    column_name = f\"rentals_{month:02d}_{year}\"\n", + "\n", + "    rental_count_df = (\n", + "        rentals_df\n", + "        .groupby(\"customer_id\")\n", + "        .size()\n", + "        .reset_index(name=column_name)\n", + "    )\n", + "\n", + "    return rental_count_df" + ] + }, + { + "cell_type": "markdown", + "id": "3d8738c3", + "metadata": {}, + "source": [ + "## 4.1 Count rentals per customer in May and June" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "71ef4d50", + "metadata": {}, + "outputs": [], + "source": [ + "may_rental_count = rental_count_month(may_rentals, 5, 2005)\n", + "june_rental_count = rental_count_month(june_rentals, 6, 2005)\n", + "\n", + "display(may_rental_count.head())\n", + "display(june_rental_count.head())\n", + "\n", + "print(\"Customers active in May:\", may_rental_count.shape[0])\n", + "print(\"Customers active in June:\", june_rental_count.shape[0])" + ] + }, + { + "cell_type": "markdown", + "id": "4cd6ddbd", + "metadata": {}, + "source": [ + "# 5. Function `compare_rentals`\n", + "\n", + "## Task\n", + "\n", + "Create a function called `compare_rentals` that takes two DataFrames with per-customer rental counts for different months.\n", + "\n", + "The function must return a combined DataFrame with a new column:\n", + "\n", + "```python\n", + "difference\n", + "```\n", + "\n", + "This column is the number of rentals in the second month minus the number of rentals in the first month.\n", + "\n", + "## Methodological decision\n", + "\n", + "We use `merge()` with `how=\"inner\"` because the goal of the lab is to identify customers active in **both months**.\n", + "\n", + "With `outer`, we would also include customers active in only one of the months, with `NaN` in the count column for the month in which they were inactive." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e392d71c", + "metadata": {}, + "outputs": [], + "source": [ + "def compare_rentals(df_month_1, df_month_2):\n", + "    comparison_df = pd.merge(\n", + "        df_month_1,\n", + "        df_month_2,\n", + "        on=\"customer_id\",\n", + "        how=\"inner\"\n", + "    )\n", + "\n", + "    month_1_column = comparison_df.columns[1]\n", + "    month_2_column = comparison_df.columns[2]\n", + "\n", + "    comparison_df[\"difference\"] = (\n", + "        comparison_df[month_2_column] - comparison_df[month_1_column]\n", + "    )\n", + "\n", + "    return comparison_df" + ] + }, + { + "cell_type": "markdown", + "id": "8f4050b4", + "metadata": {}, + "source": [ + "## 5.1 Compare May and June" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ef1e2b5e", + "metadata": {}, + "outputs": [], + "source": [ + "rentals_comparison = compare_rentals(may_rental_count, june_rental_count)\n", + "\n", + "rentals_comparison.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "727fadc8", + "metadata": {}, + "outputs": [], + "source": [ + "rentals_comparison.shape" + ] + }, + { + "cell_type": "markdown", + "id": "d04deec8", + "metadata": {}, + "source": [ + "# 6. Analysis of results\n", + "\n", + "We now look at:\n", + "\n", + "- customers active in both months,\n", + "- customers who rented more in June than in May,\n", + "- customers who rented less,\n", + "- customers with no change." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fe4bcbd0", + "metadata": {}, + "outputs": [], + "source": [ + "total_active_both_months = rentals_comparison.shape[0]\n", + "\n", + "more_active_in_june = rentals_comparison[rentals_comparison[\"difference\"] > 0]\n", + "less_active_in_june = rentals_comparison[rentals_comparison[\"difference\"] < 0]\n", + "same_activity = rentals_comparison[rentals_comparison[\"difference\"] == 0]\n", + "\n", + "print(\"Customers active in both months:\", total_active_both_months)\n", + "print(\"Customers with more rentals in June:\", more_active_in_june.shape[0])\n", + "print(\"Customers with fewer rentals in June:\", less_active_in_june.shape[0])\n", + "print(\"Customers with the same activity:\", same_activity.shape[0])" + ] + }, + { + "cell_type": "markdown", + "id": "5ea0c3a8", + "metadata": {}, + "source": [ + "## 6.1 Customers with the largest increase in activity" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "dc0c0bdb", + "metadata": {}, + "outputs": [], + "source": [ + "rentals_comparison.sort_values(by=\"difference\", ascending=False).head(10)" + ] + }, + { + "cell_type": "markdown", + "id": "4023ac3b", + "metadata": {}, + "source": [ + "## 6.2 Customers with the largest drop in activity" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "613c4a2f", + "metadata": {}, + "outputs": [], + "source": [ + "rentals_comparison.sort_values(by=\"difference\", ascending=True).head(10)" + ] + }, + { + "cell_type": "markdown", + "id": "8f8392d6", + "metadata": {}, + "source": [ + "## 6.3 Descriptive statistics of the difference" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c4dbecdc", + "metadata": {}, + "outputs": [], + "source": [ + "rentals_comparison[\"difference\"].describe()" + ] + }, + { + "cell_type": "markdown", + "id": "35bcbfa2", + "metadata": {}, + "source": [ + "# 7. 
Simple visualization\n", + "\n", + "Although the lab focuses on the Python-SQL connection and data manipulation with pandas, a simple plot helps interpret the distribution of changes between months." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "87c91e8a", + "metadata": {}, + "outputs": [], + "source": [ + "import matplotlib.pyplot as plt\n", + "\n", + "plt.figure(figsize=(8, 5))\n", + "plt.hist(rentals_comparison[\"difference\"], bins=15)\n", + "\n", + "plt.title(\"Difference in rentals: June 2005 vs May 2005\")\n", + "plt.xlabel(\"Difference\")\n", + "plt.ylabel(\"Number of customers\")\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "id": "a1915ac2", + "metadata": {}, + "source": [ + "# 8. Conclusions\n", + "\n", + "In this lab we connected Python to SQL and used pandas to transform the results.\n", + "\n", + "## Functions created\n", + "\n", + "1. `rentals_month(engine, month, year)`  \n", + "   Returns the rentals for a given month and year.\n", + "\n", + "2. `rental_count_month(rentals_df, month, year)`  \n", + "   Counts how many rentals each customer made in that month.\n", + "\n", + "3. `compare_rentals(df_month_1, df_month_2)`  \n", + "   Merges two months and computes the difference in activity.\n", + "\n", + "## Interpretation\n", + "\n", + "The `difference` column is read as follows:\n", + "\n", + "- Positive value: the customer rented more in June than in May.\n", + "- Negative value: the customer rented less in June than in May.\n", + "- Zero: the customer kept the same level of activity.\n", + "\n", + "This workflow answers the lab's question: identify the customers active in both months and compare their activity between May and June." + ] + }, + { + "cell_type": "markdown", + "id": "5cdcb5a6", + "metadata": {}, + "source": [ + "# 9. 
Alternative version using an aggregated SQL query\n", + "\n", + "This version is not required to complete the lab, but it shows how to do part of the counting directly in SQL.\n", + "\n", + "The main logic of the lab stays in pandas, but SQL can also group with `GROUP BY`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3ed50661", + "metadata": {}, + "outputs": [], + "source": [ + "query_may_count = '''\n", + "SELECT customer_id,\n", + "       COUNT(*) AS rentals_05_2005\n", + "FROM rental\n", + "WHERE MONTH(rental_date) = 5\n", + "  AND YEAR(rental_date) = 2005\n", + "GROUP BY customer_id;\n", + "'''\n", + "\n", + "may_count_sql = pd.read_sql(query_may_count, engine)\n", + "\n", + "may_count_sql.head()" + ] + }, + { + "cell_type": "markdown", + "id": "6c22d311", + "metadata": {}, + "source": [ + "# 10. Final checklist\n", + "\n", + "- Connection to Sakila created.\n", + "- `rentals_month` function completed.\n", + "- `rental_count_month` function completed.\n", + "- `compare_rentals` function completed.\n", + "- Comparison between May and June done.\n", + "- Conclusions added." + ] + } + ], + "metadata": {}, + "nbformat": 4, + "nbformat_minor": 5 +}