- Problem Statement
- Datasets Used
- Repository Structure and Purpose
- Data Storage
- Notebooks
- Analysis & Machine Learning Approach
- Collaboration Guidelines
- Notes and Limitations
This project analyzes Aadhaar enrolment and authentication datasets to uncover societal trends, regional disparities, operational stress signals, and short-term predictive indicators. The analysis combines exploratory data analysis, simple and explainable machine learning techniques, and an administrative dashboard to support data-driven decision-making and improved service delivery.
## Problem Statement

Aadhaar enrolment and update services generate large volumes of data across regions, districts, and PIN codes. However, existing reporting systems primarily provide retrospective summaries, offering limited visibility into underlying societal trends, emerging risks, or future demand. This makes it difficult for administrators to anticipate service pressure, understand behavioural patterns, or plan resources proactively.
- Enrolment and update demand is unevenly distributed across regions and PIN codes.
- Sudden spikes and abnormal patterns are often detected only after service disruption.
- Capacity planning is largely reactive, leading to operational stress and longer wait times.
These challenges highlight the need for insights that go beyond static counts and enable early intervention.
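As a minimal, dependency-free sketch of such early intervention (the daily counts below are made up for illustration, not project data), a simple rolling-statistics check can flag an abnormal spike as soon as it appears rather than after service disruption:

```python
from statistics import mean, stdev

# Hypothetical daily update-request counts for one district;
# the final value is a sudden spike.
counts = [100, 104, 98, 101, 99, 103, 97, 250]

window = counts[:-1]                          # recent history
latest = counts[-1]                           # today's count
threshold = mean(window) + 3 * stdev(window)  # 3-sigma alert level

if latest > threshold:
    print(f"spike detected: {latest} exceeds threshold {threshold:.1f}")
```

In practice a sliding window over each region's time series would be used, but the idea is the same: compare today's volume against recent history instead of waiting for a retrospective report.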
## Datasets Used

This project uses UIDAI-provided datasets:

- **Aadhaar Enrolment Dataset**
  - Age groups: 0–5, 5–17, 18+
  - Geographic levels: State, District, Pincode
- **Aadhaar Biometric Authentication Dataset**
  - Authentication counts by age group and region
- **Aadhaar Demographic Authentication Dataset**
  - Fallback authentication usage by age group and region
## Repository Structure and Purpose

The project is organized as follows:

```
UIDAI Data Hackathon - 2026/
├── data/
│   ├── processed/
│   │   ├── analysis/
│   │   ├── cleaned/
│   │   │   ├── biometric_clean.csv
│   │   │   ├── demographic_clean.csv
│   │   │   └── enrolment_clean.csv
│   │   ├── forecasts/
│   │   └── interim/
│   │       ├── biometric_raw_merged.csv
│   │       ├── demographic_raw_merged.csv
│   │       └── enrolment_raw_merged.csv
│   └── raw/
│       ├── biometric/
│       │   ├── biometric1.csv
│       │   ├── biometric2.csv
│       │   ├── biometric3.csv
│       │   └── biometric4.csv
│       ├── demographic/
│       │   ├── demographic1.csv
│       │   ├── demographic2.csv
│       │   ├── demographic3.csv
│       │   ├── demographic4.csv
│       │   └── demographic5.csv
│       └── enrolment/
│           ├── enrolment1.csv
│           ├── enrolment2.csv
│           └── enrolment3.csv
├── models/
│   └── prophet/
├── Notebooks/
│   ├── state_wise_cleaning/
│   ├── 01_data_loading.ipynb
│   ├── 02_enrolment_cleaning.ipynb
│   ├── 03_biometric_cleaning.ipynb
│   ├── 04_demographic_cleaning.ipynb
│   ├── 05_create_final_datasets.ipynb
│   ├── 06_enrolment_visuals.ipynb
│   └── 07_demand_forecasting_prophet.ipynb
└── README.md
```
**Folder Descriptions:**

- `data/raw/`: Original UIDAI CSV files, organized by type (biometric, demographic, enrolment). Never modify these files.
- `data/processed/interim/`: Merged raw datasets, used as intermediate files during processing.
- `data/processed/cleaned/`: Cleaned and final datasets, ready for analysis.
- `data/processed/analysis/`: Analysis results (e.g., correlations, statistics).
- `data/processed/forecasts/`: Forecast outputs.
- `models/`: Predictive models and scripts (e.g., Prophet) for forecasting.
- `Notebooks/`: All Jupyter notebooks for data loading, cleaning, and analysis, including state-wise cleaning logic.
- `README.md`: Project overview and documentation.
## Data Storage

The `data/` folder contains only datasets.
### data/processed/

Contains processed datasets organized into subfolders:

- `cleaned/`: Final datasets ready for analysis. Files:
  - `biometric_clean.csv`
  - `demographic_clean.csv`
  - `enrolment_clean.csv`
- `interim/`: Intermediate files generated during processing.

Purpose: Used directly for analysis and visualization.
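As a small illustration of working with the cleaned files (the column names below are hypothetical, not taken from the actual datasets), a typical first analysis step in pandas might look like:

```python
import io

import pandas as pd

# Hypothetical sample mimicking enrolment_clean.csv;
# the real column names may differ.
sample = io.StringIO(
    "state,district,pincode,age_group,enrolments\n"
    "Maharashtra,Pune,411001,18+,1520\n"
    "Maharashtra,Pune,411001,5-17,430\n"
)

# In the notebooks this would instead be a relative path, e.g.:
# df = pd.read_csv("../data/processed/cleaned/enrolment_clean.csv")
df = pd.read_csv(sample)

# Aggregate enrolments per age group, a common first summary
totals = df.groupby("age_group")["enrolments"].sum()
print(totals.to_dict())
```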
### data/raw/

- Original Aadhaar CSV files as provided
- Files are kept unchanged
- Never edit or delete files here

Purpose: Preserve the original data for reference and reproducibility.
## Notebooks

All analysis is performed using Jupyter Notebooks inside this folder.
Current notebooks:

- `01_data_loading.ipynb`: Reads raw CSV files and prepares them for processing.
- `02_enrolment_cleaning.ipynb`: Cleans and preprocesses the Aadhaar enrolment dataset.
- `03_biometric_cleaning.ipynb`: Cleans and preprocesses the biometric authentication data.
- `04_demographic_cleaning.ipynb`: Cleans and preprocesses the demographic authentication data.
- `05_create_final_datasets.ipynb`: Builds the final analysis-ready datasets.
- `06_enrolment_visuals.ipynb`: Produces enrolment visualizations.
- `07_demand_forecasting_prophet.ipynb`: Forecasts demand using Prophet.
Rule: One notebook should have one clear responsibility.
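A minimal sketch of the merge step performed early in this pipeline (the DataFrames below are illustrative stand-ins; the real notebooks read from `data/raw/` and write to `data/processed/interim/`):

```python
import pandas as pd

# Illustrative stand-ins for data/raw/enrolment/enrolment1.csv, enrolment2.csv, ...
part1 = pd.DataFrame({"state": ["Kerala"], "enrolments": [120]})
part2 = pd.DataFrame({"state": ["Bihar"], "enrolments": [340]})

# Concatenate the raw parts into one interim table
merged = pd.concat([part1, part2], ignore_index=True)

# In the project this would then be saved with a relative path, e.g.:
# merged.to_csv("../data/processed/interim/enrolment_raw_merged.csv", index=False)
print(len(merged))  # 2 rows
```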
## Analysis & Machine Learning Approach

The project follows a structured analytical workflow:
- Analyze historical Aadhaar data to understand user behaviour and regional patterns.
- Identify trends and anomalies through time-based and statistical analysis.
- Forecast future enrolment and update demand using interpretable ML models.
- Convert insights into advisory recommendations for proactive decision-making.
## Collaboration Guidelines

- Use VS Code with Jupyter Notebook support
- Use relative file paths
- Do not modify raw data
- Avoid editing the same notebook simultaneously
- Use GitHub or shared storage for collaboration
## Notes and Limitations

- Analysis is performed on aggregated data and does not represent individual behavior
- Forecasts are short-term and assume continuation of historical trends
- External socio-economic factors are not explicitly modeled
- All methods prioritize explainability and responsible use of data
This structure ensures:
- Clean separation of data, analysis, and reporting
- Easy collaboration
- Reproducibility
- Alignment with hackathon evaluation criteria