This project is designed to process, clean, standardise, cluster, and visualise survey data from multiple Excel files. It was originally used to process responses to questionnaires completed by my classmates during my Minería de Datos (Data Mining) course.
The project provides tools to:
- Import, clean, and standardise raw data.
- Export the processed data to CSV files and optionally upload it to Google BigQuery for further analysis (this is how I worked with this data on Looker Studio).
- Perform clustering using multiple algorithms (KMeans, DBSCAN, Agglomerative Clustering, and Gaussian Mixture Models).
- Visualise clustering results interactively in a Jupyter Notebook.
- `process.py`: The main script that orchestrates the data processing workflow. It:
  - Imports raw data from Excel files.
  - Cleans and standardises the data.
  - Exports the processed data to CSV files.
  - Optionally uploads the data to Google BigQuery if a configuration file is provided.
- `cluster.py`: The main script for clustering. It:
  - Builds a master dataframe by merging selected features from multiple questionnaires.
  - Scales the data for clustering.
  - Performs clustering using KMeans, DBSCAN, Agglomerative Clustering, and Gaussian Mixture Models.
  - Exports the clustering results to a CSV file.
- `requirements.txt`: Lists the Python packages required to run the project.
- `import_data.py`: Handles the import of Excel files from the `data/import/xlsx` folder. It reads the files into pandas DataFrames and cleans column names.
- `clean_data.py`: Cleans the imported data by:
  - Removing duplicate rows based on account numbers.
  - Cleaning text fields (e.g., removing accents, punctuation, and normalising whitespace).
  - Dropping unnecessary columns like email addresses.
- `standardise_data.py`: Standardises the data based on predefined question types. It processes numeric, boolean, time, and categorical data to ensure consistency across all datasets.
- `export_data.py`: Exports the processed data:
  - To CSV files in the `data/export` folder.
  - To Google BigQuery, if configured, using the `pandas-gbq` library.
- `question_types.py`: Contains a predefined list of question types for each questionnaire. These types guide the standardisation process and correspond to the questionnaires this project was originally developed for.
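To illustrate the kind of text normalisation described for `clean_data.py`, here is a minimal standalone sketch; the function name and exact cleaning rules are assumptions for illustration, not the project's actual code:

```python
import re
import unicodedata

def clean_text(value: str) -> str:
    """Normalise a free-text survey answer: strip accents,
    drop punctuation, collapse whitespace, and lowercase."""
    # Decompose accented characters and drop the combining marks.
    value = unicodedata.normalize("NFKD", value)
    value = "".join(ch for ch in value if not unicodedata.combining(ch))
    # Remove punctuation, keeping letters, digits, and spaces.
    value = re.sub(r"[^\w\s]", "", value)
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", value).strip().lower()
```

Applied element-wise (e.g. via `Series.map`), this makes free-text answers comparable across respondents.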
- `analysis.ipynb`: A Jupyter Notebook for analysing clustering statistics. It:
  - Allows you to specify which statistics to show (mean, median, standard deviation, variance, range).
  - Allows you to select the method whose clusters are to be analysed.
- `cluster_stats.py`: Contains functions to compute various statistics (mean, median, standard deviation, variance, range) for clustering results.
- `prepare_data.py`: Prepares the master dataframe for clustering by merging selected features from multiple questionnaires and scaling the data.
- `selected_features.py`: Contains a list of selected features for each questionnaire. These features are used to build the master dataframe for clustering.
- `kmeans.py`: Implements KMeans clustering using scikit-learn.
- `dbscan.py`: Implements DBSCAN clustering using scikit-learn. Includes adjustable `eps` and `min_samples` parameters.
- `agglomerative.py`: Implements Agglomerative Clustering using scikit-learn.
- `gmm.py`: Implements Gaussian Mixture Model clustering using scikit-learn.
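To show what the adjustable `eps` and `min_samples` parameters control, a small sketch on toy data (not the project's survey data):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups plus one far-away outlier.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
              [20.0, 20.0]])

# eps sets the neighbourhood radius; min_samples the density threshold.
# Points that belong to no dense region are labelled -1 (noise).
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
# → two clusters, with the outlier labelled -1
```

Unlike KMeans, DBSCAN does not need the number of clusters in advance, which is why tuning these two parameters matters.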
- `visualisation.ipynb`: A Jupyter Notebook for visualising clustering results. It:
  - Allows you to specify which features to use for the x and y axes.
  - Supports optional convex hull visualisation for clusters.
  - Visualises results for all clustering methods (KMeans, DBSCAN, Agglomerative Clustering, and Gaussian Mixture Models).
1. Clone the repository:

   ```shell
   git clone <repository-url>
   cd Tablero-Analitico
   ```

2. Create a virtual environment (optional but recommended):

   ```shell
   python -m venv .venv
   .venv\Scripts\activate  # on Windows
   ```

3. Install the required packages:

   ```shell
   pip install -r requirements.txt
   ```
Run `python process.py` with the following considerations:

- Input Files: Place the raw Excel files in the `data/import/xlsx` folder. The script will automatically detect and process all `.xlsx` files in this directory.
- Processing: The script performs the following steps:
  - Cleans and standardises the data.
  - Removes duplicates and unnecessary columns.
  - Applies transformations based on question types.
- Output Files: Processed data is exported as CSV files to the `data/export` folder.
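The processing steps above can be sketched in a few lines of pandas; the toy dataframe, its column names, and the `account_number` key are illustrative assumptions, not the project's actual code:

```python
import pandas as pd

# Toy stand-in for one imported questionnaire; the real script instead
# reads every .xlsx file found in data/import/xlsx.
df = pd.DataFrame({
    "account_number": [1, 2, 2, 3],
    "email": ["a@x.com", "b@x.com", "b@x.com", "c@x.com"],
    "hours_of_sleep": ["7", "8", "8", "6"],
})

# Remove duplicate rows based on account numbers.
df = df.drop_duplicates(subset="account_number")

# Drop unnecessary columns like email addresses.
df = df.drop(columns=["email"])

# Apply a transformation guided by the question type (numeric here).
df["hours_of_sleep"] = pd.to_numeric(df["hours_of_sleep"])

# The real script writes each processed dataframe to data/export:
# df.to_csv("data/export/questionnaire.csv", index=False)
```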
Run `python cluster.py` with the following considerations:

- Build Master Dataframe: The `cluster.py` script merges selected features from multiple questionnaires into a single dataframe.
- Scale Data: The numeric features are scaled using `StandardScaler` to ensure proper clustering.
- Clustering Algorithms: The following clustering algorithms are applied:
  - KMeans: Groups data into a predefined number of clusters.
  - DBSCAN: Identifies clusters based on density, with adjustable `eps` and `min_samples` parameters.
  - Agglomerative Clustering: Performs hierarchical clustering.
  - Gaussian Mixture Models (GMM): Fits the data to a mixture of Gaussian distributions.
- Export Results: The clustering results are exported to `data/export/clustering/clustering_results.csv`. The file includes the original features and cluster labels for each method.
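The scale-then-cluster pipeline can be sketched as follows; the toy dataframe, parameter values, and column names are assumptions for illustration, not the project's actual configuration:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture

# Toy master dataframe standing in for the merged questionnaire features.
rng = np.random.default_rng(42)
master = pd.DataFrame(rng.normal(size=(60, 3)), columns=["f1", "f2", "f3"])

# Scale the numeric features before clustering.
X = StandardScaler().fit_transform(master)

# One label column per method, mirroring the structure of
# clustering_results.csv (original features + labels for each method).
master["kmeans"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
master["dbscan"] = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
master["agglomerative"] = AgglomerativeClustering(n_clusters=3).fit_predict(X)
master["gmm"] = GaussianMixture(n_components=3, random_state=0).fit_predict(X)

# master.to_csv("data/export/clustering/clustering_results.csv", index=False)
```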
1. Open the `visualisation.ipynb` notebook.
2. Load the clustering results from `data/export/clustering/clustering_results.csv`.
3. Use the `visualise_all_clusters()` function to visualise the clusters:
   - Specify the features to use for the x and y axes.
   - Enable or disable convex hull visualisation for clusters.
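`visualise_all_clusters()` lives in the notebook itself; the hypothetical helper below sketches one way such a plot can be built with matplotlib and SciPy's `ConvexHull`, assuming the cluster labels sit in one column of the results dataframe:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; the notebook shows figures inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.spatial import ConvexHull

def plot_clusters(df, x, y, label_col, hull=False):
    """Scatter-plot two features coloured by cluster label, optionally
    outlining each cluster with its convex hull."""
    fig, ax = plt.subplots()
    for label, group in df.groupby(label_col):
        pts = group[[x, y]].to_numpy()
        ax.scatter(pts[:, 0], pts[:, 1], label=f"cluster {label}")
        if hull and len(pts) >= 3:
            h = ConvexHull(pts)
            ring = np.append(h.vertices, h.vertices[0])  # close the outline
            ax.plot(pts[ring, 0], pts[ring, 1], linewidth=1)
    ax.set_xlabel(x)
    ax.set_ylabel(y)
    ax.legend()
    return fig
```

Note that a convex hull needs at least three non-collinear points, which is why the sketch skips the outline for tiny clusters.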
1. Open the `analysis.ipynb` notebook.
2. Load the clustering results from `data/export/clustering/clustering_results.csv`.
3. Use the `generate_cluster_stats()` function to explore cluster statistics:
   - Specify the statistics to be shown (mean, median, standard deviation, variance, range).
   - Specify the method whose clusters are to be analysed.
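These statistics map naturally onto a pandas `groupby`. The sketch below uses a hypothetical `cluster_stats` helper (not the project's `generate_cluster_stats()`) and defines range as max minus min, since pandas has no built-in aggregation for it:

```python
import pandas as pd

def value_range(series):
    """Spread of a feature within a cluster (max - min)."""
    return series.max() - series.min()

def cluster_stats(df, method, stats=("mean", "median", "std", "var", value_range)):
    """Per-cluster statistics for every numeric feature, grouped by the
    label column of one clustering method (e.g. a 'kmeans' column)."""
    feature_cols = df.select_dtypes("number").columns.drop(method)
    return df.groupby(method)[list(feature_cols)].agg(list(stats))
```

The result has one row per cluster and a (feature, statistic) column pair for each requested statistic.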
To upload the processed data to Google BigQuery, follow these steps:

1. Create a JSON configuration file with the following structure:

   ```json
   {
     "project_id": "your-google-cloud-project-id",
     "dataset_id": "your-dataset-id",
     "table_names": ["table1", "table2", "table3"]
   }
   ```

2. Run the `process.py` script and pass the configuration file as a command-line argument:

   ```shell
   python process.py path/to/gbq_config.json
   ```

3. The script will validate the configuration and upload the processed data to the specified BigQuery tables.
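A sketch of how such a configuration might be validated and used with `pandas-gbq`; the helper names are hypothetical, and the upload call is shown only for shape, since it needs real Google Cloud credentials to run:

```python
import json

REQUIRED_KEYS = ("project_id", "dataset_id", "table_names")

def load_gbq_config(path):
    """Read the JSON configuration and check the fields the upload needs."""
    with open(path) as f:
        config = json.load(f)
    missing = [k for k in REQUIRED_KEYS if k not in config]
    if missing:
        raise ValueError(f"gbq config is missing keys: {missing}")
    return config

def upload_frames(frames, config):
    """Upload one DataFrame per configured table name via pandas-gbq."""
    import pandas_gbq  # deferred: only needed when BigQuery is used
    for df, table in zip(frames, config["table_names"]):
        pandas_gbq.to_gbq(
            df,
            f"{config['dataset_id']}.{table}",
            project_id=config["project_id"],
            if_exists="replace",
        )
```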
- Ensure your Google Cloud credentials are properly set up before using the BigQuery integration. In my case, the first upload attempt simply prompted me to authenticate in my browser.
- The script assumes a specific folder structure for input and output files. Modify the paths in the code if necessary.
- Use the `visualisation.ipynb` and `analysis.ipynb` notebooks for interactive exploration of clustering results and statistics.