This repository contains a secure, hybrid risk-aware tokenization system built with Python, Flask, and MongoDB. The system dynamically evaluates the sensitivity of input data using a combination of rule-based logic and machine learning, and tokenizes it securely using a vault architecture.
- Hybrid Risk Evaluation Engine:
  - Rule-Based Pre-Flight: Instantly detects explicit sensitive patterns using regular expressions (e.g., PAN numbers, Indian Aadhaar numbers, credit card numbers, email addresses, and phone numbers).
  - Machine Learning Fallback: Evaluates text that matches no explicit pattern using a trained machine learning model (`model.pkl` and `vectorizer.pkl`), applying a confidence threshold.
- Secure Vault Architecture:
  - Separates non-sensitive application data (`records`) from highly sensitive, encrypted source data (`vault`).
- Selective Encryption:
  - Uses `cryptography.fernet` to securely encrypt data classified as `HIGH` or `MEDIUM` risk before it hits the database.
- Role-Based Access Control (RBAC) & Detokenization:
  - Supports role-based detokenization (e.g., `Admin` gets full decryption, `Analyst` gets masked data, `User` is denied).
- Audit Logging:
  - Maintains strict `audit_logs` that track detokenization attempts, usernames, roles, and timestamps.
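The rules-first, model-second flow described above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the regexes cover only a subset of the listed patterns and are simplified guesses, the function signature, the `ml_predict` callable, the `0.6` threshold, and the choice to fall back to `MEDIUM` when the model is unsure are all assumptions.

```python
import re
from typing import Callable, Optional, Tuple

# Illustrative patterns only (assumption); the project's real regexes may differ.
PATTERNS = {
    "PAN": re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b"),
    "AADHAAR": re.compile(r"\b\d{4}\s?\d{4}\s?\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "PHONE": re.compile(r"\b[6-9]\d{9}\b"),
}

# Hypothetical type for the ML step: takes text, returns (label, confidence).
MlPredict = Callable[[str], Tuple[str, float]]


def hybrid_predict_risk(
    text: str,
    ml_predict: Optional[MlPredict] = None,
    threshold: float = 0.6,  # assumed value; the README only says a threshold exists
) -> Tuple[str, str]:
    """Return (risk_level, reason) using rules first, then the ML fallback."""
    # 1. Rule-based pre-flight: any explicit pattern is treated as HIGH risk.
    for name, pattern in PATTERNS.items():
        if pattern.search(text):
            return "HIGH", f"rule:{name}"

    # 2. No model supplied: nothing more to evaluate in this sketch.
    if ml_predict is None:
        return "LOW", "no-model"

    # 3. ML fallback with a confidence threshold.
    label, confidence = ml_predict(text)
    if confidence < threshold:
        # Assumption: an unsure prediction is handled conservatively.
        return "MEDIUM", "ml:low-confidence"
    return label, "ml"
```

In the real application, `ml_predict` would presumably wrap the model and vectorizer loaded from `model.pkl` and `vectorizer.pkl`; passing it in as a callable keeps this sketch free of third-party imports.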
- Python 3.8+
- MongoDB (running locally on default port `27017`)
- Clone the repository:

      git clone https://github.com/anchalbha/Risk-Aware-Vault-based-Data-Tokenization-System.git
      cd Risk-Aware-Vault-based-Data-Tokenization-System

- Create and activate a virtual environment (optional but recommended):

      python -m venv venv
      source venv/bin/activate  # On Windows: venv\Scripts\activate

- Install the requirements:

      pip install -r requirements.txt

- Environment Variables: Create a `.env` file in the root directory and add a Fernet encryption key:

      ENCRYPTION_KEY=your_generated_fernet_key_here

  (You can generate a key using: `from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())`)

- Train the ML Model: If `model.pkl` and `vectorizer.pkl` are not present or need updating, generate the dataset and run the training script:

      python dataset_generation.py
      python train_model.py

- Start MongoDB: Ensure your local MongoDB instance is running.
- Run the Flask Application:

      python app.py

  The web server will start at `http://127.0.0.1:5000/`.

- Interacting with the UI:
  - Navigate to `http://127.0.0.1:5000/` in your browser.
  - Enter text into the form to evaluate its risk and safely tokenize it.
- Detokenization API: You can securely retrieve data via the `/detokenize` endpoint using tools like `curl` or Postman.

      curl -X POST http://127.0.0.1:5000/detokenize \
        -H "Content-Type: application/json" \
        -d '{"token": "YOUR_TOKEN_HERE", "username": "alice"}'

  (`alice` maps to Admin, `bob` maps to Analyst)
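
For scripted access, the same `/detokenize` request can be built with Python's standard library. This is a sketch under assumptions: the helper names `build_detokenize_request` and `detokenize` are hypothetical, and the response is assumed to be JSON, which the README does not state.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:5000"  # address the Flask app is documented to start on


def build_detokenize_request(
    token: str, username: str, base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build the POST request that mirrors the curl call in this README."""
    payload = json.dumps({"token": token, "username": username}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/detokenize",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def detokenize(token: str, username: str) -> dict:
    """Send the request and decode the body (assumed to be JSON)."""
    request = build_detokenize_request(token, username)
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses,
    # for example if the server rejects the role.
    with urllib.request.urlopen(request, timeout=5) as response:
        return json.loads(response.read().decode("utf-8"))


# Example (requires the Flask app to be running):
# print(detokenize("YOUR_TOKEN_HERE", "alice"))
```

Splitting request construction from sending makes the payload easy to inspect without a running server.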
- Rule-Based Engine (`hybrid_predict_risk`): Regex patterns catch highly sensitive structured data first.
- Machine Learning Model: NLP-based fallback classification for text that matches no rule.
- Vault Isolation: Actual encrypted payloads sit safely in the `vault` collection, disconnected from application usage identifiers.
- Endpoint (`/detokenize`): Audits each request before decrypting the text mapped to the supplied `token`.
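
The role-dependent detokenization and audit trail can be illustrated with a small in-memory sketch. Everything here is an assumption made for illustration: the vault is a plain dict holding plaintext (the real `vault` collection stores Fernet-encrypted payloads in MongoDB, and the decryption step is omitted), the masking format, the `outcome` labels, and the function names are hypothetical. Only the `alice`/`bob` role mapping comes from this README.

```python
from datetime import datetime, timezone
from typing import Dict, List, Optional

# Username-to-role mapping from the README; anyone else is treated as "User".
USER_ROLES = {"alice": "Admin", "bob": "Analyst"}


def mask(value: str, visible: int = 4) -> str:
    """Hide all but the last `visible` characters (assumed masking format)."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


def detokenize_for(
    username: str,
    token: str,
    vault: Dict[str, str],    # stand-in for the `vault` collection
    audit_log: List[dict],    # stand-in for `audit_logs`
) -> Optional[str]:
    """Return the role-appropriate view of a token and record the attempt."""
    role = USER_ROLES.get(username, "User")
    plaintext = vault.get(token)

    if plaintext is None:
        outcome, result = "not_found", None
    elif role == "Admin":
        outcome, result = "full", plaintext
    elif role == "Analyst":
        outcome, result = "masked", mask(plaintext)
    else:
        outcome, result = "denied", None

    # Every attempt is logged, including denied ones.
    audit_log.append({
        "username": username,
        "role": role,
        "token": token,
        "outcome": outcome,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })
    return result
```

Logging inside the same function that decides the outcome means no code path can return data without leaving an audit entry.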
MIT License