Guided Universal Adherence & Regulatory Document Intelligence Assistant Network
MSc AI in Biosciences Dissertation Project
Queen Mary University of London
Author: Yusuf Mohammed
Supervisor: Mohammed Elbadawi
Executive Summary • System Features • Architecture • Installation • API Documentation • Configuration • Deployment • Citation
GUARDIAN is a production-ready, privacy-first, multi-tenant pharmaceutical compliance analysis system developed as part of an MSc dissertation exploring the integration of artificial intelligence into electronic laboratory notebooks. The system validates pharmaceutical protocols against European Pharmacopoeia standards through a zero-trust architecture in which all user data remains exclusively in the user's personal Google Drive, eliminating the server-side data-breach surface by design whilst providing enterprise-grade AI compliance analysis.
GUARDIAN addresses the critical challenge of pharmaceutical compliance whilst maintaining strict data privacy through four architectural guarantees:
- Zero-trust implementation: Backend servers store no user data, only authentication tokens and metadata
- Complete user control: All documents, analyses, and results persist exclusively in user's Google Drive
- Session-based processing: Temporary RAM processing with automatic cleanup ensures no data persistence
- Multi-tenant isolation: Complete separation between users with session-based vector databases
- Zero User Data Storage: Backend never stores documents, analysis results, or chat history
- Google Drive Integration: All user content persists exclusively in user's Drive
- PostgreSQL Metadata Only: Database contains only auth tokens and file metadata
- Temporary RAM Processing: Documents processed in memory with immediate cleanup
- Complete User Isolation: Multi-tenant architecture with zero data leakage
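The temporary-processing guarantee above can be sketched with only the Python standard library. This is an illustrative pattern, not GUARDIAN's actual code: `session_workspace` and the session id are hypothetical names, and real sessions would also sync results to the user's Drive before cleanup.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path

@contextmanager
def session_workspace(session_id: str):
    """Create an isolated scratch directory for one user session and
    guarantee it is removed when processing finishes."""
    workdir = Path(tempfile.mkdtemp(prefix=f"guardian_{session_id}_"))
    try:
        yield workdir
    finally:
        # Automatic cleanup: nothing from the session persists on the server.
        shutil.rmtree(workdir, ignore_errors=True)

# Usage: process a document entirely inside the throwaway workspace.
with session_workspace("user42") as ws:
    (ws / "protocol.txt").write_text("Sample protocol text")
    processed = (ws / "protocol.txt").read_text().upper()
    workspace_path = ws

# Once the block exits, the workspace directory no longer exists.
```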
- 40+ REST API Endpoints across 7 functional areas
- Google OAuth 2.0 authentication with encrypted token storage
- Session-Based Vector Databases loaded on-demand from user's Drive
- AI-Powered Compliance Analysis against European Pharmacopoeia standards
- Multi-Format Document Support (PDF, DOCX, TXT) with intelligent chunking
- FAISS Vector Search with user-isolated similarity matching
- LLM Integration for contextual compliance feedback
- Professional Report Generation with clustering visualisations
- Automatic Session Management with Drive backup and cleanup
- Modern SvelteKit 5 with TypeScript and static site generation
- Real-Time Processing Updates with WebSocket integration
- Interactive Chat Interface for compliance guidance
- Visual Analytics Dashboard with Plotly.js visualisations
- Drag-and-Drop Upload with progress tracking
- Responsive Design optimised for all devices
- Professional UI/UX tailored for pharmaceutical industry
┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐
│ │ │ │ │ │
│ SvelteKit │────▶│ Flask Backend │────▶│ PostgreSQL │
│ Frontend │ │ (Port 5051) │ │ (Metadata) │
│ │ │ │ │ │
└─────────────────┘ └──────────────────┘ └─────────────────┘
│
▼
┌──────────────────┐
│ │
│ Google Drive │
│ (User Data) │
│ │
└──────────────────┘
guardian/
├── backend/ # Flask API server
│ ├── api/ # REST API endpoints
│ │ ├── routes/ # Route handlers (auth, analysis, etc.)
│ │ ├── schemas/ # Pydantic request/response models
│ │ └── middleware/ # Authentication, validation, error handling
│ ├── core/ # ML components
│ │ ├── ml/ # Embedding models, vector databases
│ │ └── processors/ # Document processing, chunking
│ ├── models/ # PostgreSQL database models
│ ├── services/ # Business logic layer
│ │ ├── auth/ # Authentication services
│ │ └── *.py # Document, analysis, report services
│ ├── integrations/ # External integrations
│ │ ├── google/ # OAuth 2.0, Drive API
│ │ └── llm/ # LLM client and services
│ └── config/ # Configuration management
├── frontend/ # SvelteKit application
│ ├── src/
│ │ ├── lib/
│ │ │ ├── components/ # UI components
│ │ │ ├── services/ # API client
│ │ │ ├── stores/ # State management
│ │ │ └── types/ # TypeScript definitions
│ │ └── routes/ # Page components
│ └── static/ # Static assets
├── docker/ # Docker configuration
├── scripts/ # Deployment scripts
└── docs/ # Architecture documentation
- Privacy by Design: User data never persists on servers
- Multi-Tenant Isolation: Complete separation between users at all layers
- Session-Based Processing: Temporary in-memory processing with cleanup
- Google Drive Persistence: All user data stored in personal Drive
- Microservice-Ready: Clean service boundaries for future scaling
- Python 3.10-3.11 (Python 3.12+ has compatibility issues)
- Node.js 18+ with npm
- PostgreSQL 14+ for metadata storage
- Docker & Docker Compose (optional for containerised deployment)
- Google Cloud Project with OAuth 2.0 credentials configured
git clone https://github.com/yusufmo1/guardian.git
cd guardian

# Create virtual environment
cd backend
python3.10 -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
# Install Dependencies
pip install -r requirements.txt
# Configure Environment
cp .env.example .env
# Edit .env with your configuration (see Configuration section)
# Start PostgreSQL with Docker
docker run --name guardian-postgres \
  -e POSTGRES_DB=guardian \
  -e POSTGRES_USER=guardian \
  -e POSTGRES_PASSWORD=guardian_pass \
  -p 5432:5432 -d postgres:14
# Run backend server (from guardian/backend directory)
python main.py # Runs on http://localhost:5051
# Alternative: Run from guardian root directory
cd .. # Go back to guardian root
python -m backend.main # Runs on http://localhost:5051

# Install Dependencies
cd ../frontend
npm install
# Start development server
npm run dev # Runs on http://localhost:3000

Open http://localhost:3000 in your browser. The frontend will proxy API requests to the backend.
# Automated deployment
cd guardian
./scripts/deploy.sh
# Or using Docker Compose
docker-compose -f docker-compose.yml -f docker-compose.prod.yml up -d

# Initialize Google OAuth flow
POST /auth/google/initiate
Content-Type: application/json
{
  "redirect_uri": "http://localhost:3000/auth/callback"
}
# Validate session
GET /auth/validate
Cookie: session_token=<token>
# Get user profile
GET /auth/user
Cookie: session_token=<token>

# Initialize user session
POST /api/session/initialize
Cookie: session_token=<token>
# Upload document
POST /api/session/documents/upload
Cookie: session_token=<token>
Content-Type: multipart/form-data
file: <file>
# Analyse Protocol
POST /api/session/analyze
Cookie: session_token=<token>
Content-Type: application/json
{
  "query": "Analyse this protocol for GMP compliance",
  "document_ids": ["doc_123"],
  "analysis_type": "comprehensive"
}
# Generate report
POST /api/reports/generate
Cookie: session_token=<token>
Content-Type: application/json
{
  "analysis_id": "analysis_123",
  "format": "pdf",
  "include_visualizations": true
}

import requests

# Authenticate
session = requests.Session()
response = session.post(
    'http://localhost:5051/auth/google/callback',
    json={'code': google_auth_code}
)

# Initialise session
session.post('http://localhost:5051/api/session/initialize')

# Upload document
with open('protocol.pdf', 'rb') as f:
    files = {'file': f}
    response = session.post(
        'http://localhost:5051/api/session/documents/upload',
        files=files
    )
document_id = response.json()['document']['id']

# Analyse for compliance
analysis = session.post(
    'http://localhost:5051/api/session/analyze',
    json={
        'query': 'Check GMP compliance',
        'document_ids': [document_id]
    }
)

# Database Configuration
DATABASE_URL=postgresql://guardian:guardian_pass@localhost:5432/guardian
# Google OAuth 2.0 (from Google Cloud Console)
GOOGLE_CLIENT_ID=your-client-id.apps.googleusercontent.com
GOOGLE_CLIENT_SECRET=your-client-secret
GOOGLE_REDIRECT_URI=http://localhost:3000/auth/callback
# Security
SECRET_KEY=your-super-secret-key-for-production
SESSION_SECRET=your-session-secret-key
SESSION_DURATION_HOURS=24
# LLM Integration
LLM_API_URL=http://localhost:1234/v1/chat/completions
LLM_MODEL_NAME=qwen3-30b-a3b@q4_M
# ML Configuration
EMBEDDING_MODEL_NAME=all-MiniLM-L6-v2
EMBEDDING_DEVICE=cpu # cpu | cuda | mps
# Server Configuration
FLASK_ENV=development
API_HOST=localhost
API_PORT=5051

- Create a project in Google Cloud Console
- Enable the Google Drive API and the Google People API (successor to the deprecated Google+ API)
- Create OAuth 2.0 credentials (Web application type)
- Add authorised redirect URIs:
  - http://localhost:3000/auth/callback (development)
  - https://yourdomain.com/auth/callback (production)
- Download credentials and add to .env
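A fail-fast loader for the settings above can be sketched with only the standard library. The variable names match the sample .env, but `load_config` itself is a hypothetical helper, not GUARDIAN's actual config module.

```python
import os

# Settings the sample .env treats as mandatory (illustrative subset).
REQUIRED_VARS = [
    "DATABASE_URL",
    "GOOGLE_CLIENT_ID",
    "GOOGLE_CLIENT_SECRET",
    "GOOGLE_REDIRECT_URI",
    "SECRET_KEY",
]

def load_config(env=os.environ) -> dict:
    """Return the required settings, raising early if any are missing,
    so misconfiguration surfaces at startup rather than mid-request."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing required settings: {', '.join(missing)}")
    config = {name: env[name] for name in REQUIRED_VARS}
    # Optional settings fall back to the defaults shown above.
    config["SESSION_DURATION_HOURS"] = int(env.get("SESSION_DURATION_HOURS", "24"))
    config["API_PORT"] = int(env.get("API_PORT", "5051"))
    return config
```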
- Document Processing: Intelligent chunking for pharmaceutical documents
- Embedding Generation: SentenceTransformer with GPU acceleration
- Vector Storage: FAISS indexes stored in user's Google Drive
- Similarity Search: Cosine similarity with configurable thresholds
- LLM Analysis: Context-aware compliance checking with structured prompts
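The retrieval step in the pipeline above can be sketched with plain NumPy. GUARDIAN uses FAISS over SentenceTransformer embeddings, but with L2-normalised vectors cosine similarity reduces to the same inner product a FAISS flat index computes; the 4-dimensional "embeddings" here are toy stand-ins for real model output.

```python
import numpy as np

def cosine_search(query_vec, index_vecs, threshold=0.3, top_k=3):
    """Return (chunk_index, score) pairs for the most similar chunks,
    dropping anything below the configurable similarity threshold."""
    q = query_vec / np.linalg.norm(query_vec)
    X = index_vecs / np.linalg.norm(index_vecs, axis=1, keepdims=True)
    scores = X @ q                      # cosine similarity per chunk
    order = np.argsort(scores)[::-1][:top_k]
    return [(int(i), float(scores[i])) for i in order if scores[i] >= threshold]

# Toy embeddings standing in for encoded document chunks.
chunks = np.array([
    [1.0, 0.0, 0.0, 0.0],   # chunk 0
    [0.9, 0.1, 0.0, 0.0],   # chunk 1: near-duplicate of chunk 0
    [0.0, 0.0, 1.0, 0.0],   # chunk 2: unrelated content
])
query = np.array([1.0, 0.05, 0.0, 0.0])

hits = cosine_search(query, chunks)
# chunks 0 and 1 rank highest; chunk 2 falls below the threshold
```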
- OAuth 2.0: Industry-standard authentication
- Token Encryption: Fernet encryption for Google tokens
- Session Management: Secure session tokens with configurable expiry
- CORS Protection: Configurable cross-origin policies
- Input Validation: Pydantic models for all API endpoints
- SQL Injection Prevention: SQLAlchemy ORM with parameterised queries
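The session-management point above can be illustrated with the standard library. The helper names are hypothetical, and GUARDIAN's real implementation additionally Fernet-encrypts Google tokens, which is omitted here.

```python
import hashlib
import secrets
from datetime import datetime, timedelta, timezone

SESSION_DURATION_HOURS = 24  # mirrors SESSION_DURATION_HOURS in .env

def issue_session():
    """Generate an unguessable session token plus its expiry timestamp.
    Only the hash would be stored server-side, so a metadata-database
    leak does not expose usable session credentials."""
    token = secrets.token_urlsafe(32)  # 256 bits of randomness
    token_hash = hashlib.sha256(token.encode()).hexdigest()
    expires_at = datetime.now(timezone.utc) + timedelta(hours=SESSION_DURATION_HOURS)
    return token, token_hash, expires_at

def is_valid(presented: str, stored_hash: str, expires_at) -> bool:
    """Constant-time hash comparison combined with an expiry check."""
    presented_hash = hashlib.sha256(presented.encode()).hexdigest()
    return secrets.compare_digest(presented_hash, stored_hash) and \
        datetime.now(timezone.utc) < expires_at
```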
- Framework: SvelteKit 5.34.7 with TypeScript 5.8.3
- Build Tool: Vite for optimised development
- Styling: CSS custom properties design system
- Icons: Lucide Svelte with TypeScript-safe wrapper system
- Charts: Plotly.js for data visualisations
- State Management: Svelte stores with reactive patterns
- Embedding Generation: ~100 documents/minute (CPU), ~500 documents/minute (GPU)
- Vector Search: <100ms for 10k documents
- Report Generation: 2-5 seconds for comprehensive PDF
- Frontend Build: <10 seconds production build
- API Response Time: <200ms average (excluding ML operations)
# Start all services
docker-compose up -d
# View logs
docker-compose logs -f
# Stop services
docker-compose down

# Build and deploy
docker-compose -f docker-compose.yml -f docker-compose.prod.yml up -d --build
# Health check
./scripts/health-check.sh
# View metrics
docker-compose exec backend curl http://localhost:9090/metrics

- Backend Services: Service layer in backend/services/, routing in backend/api/routes/
- Frontend Components: Component library in frontend/src/lib/components/ with reactive store management
- Database Models: SQLAlchemy ORM models in backend/models/ with migration support
- API Validation: Pydantic schemas in backend/api/schemas/ for request/response validation
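The validation extension point can be sketched as follows. GUARDIAN's backend/api/schemas/ uses Pydantic models; this standard-library dataclass mirrors the shape of the analyse request shown earlier, and the allowed `analysis_type` values are illustrative, not the backend's actual set.

```python
from dataclasses import dataclass, field

ALLOWED_TYPES = {"comprehensive", "targeted"}  # illustrative values only

@dataclass
class AnalyzeRequest:
    """Validated shape of a POST /api/session/analyze payload (sketch)."""
    query: str
    document_ids: list = field(default_factory=list)
    analysis_type: str = "comprehensive"

    def __post_init__(self):
        # Reject malformed payloads before they reach the service layer.
        if not self.query.strip():
            raise ValueError("query must not be empty")
        if not self.document_ids:
            raise ValueError("at least one document_id is required")
        if self.analysis_type not in ALLOWED_TYPES:
            raise ValueError(f"unknown analysis_type: {self.analysis_type}")

# Parsing the JSON payload from the API example above:
req = AnalyzeRequest(
    query="Analyse this protocol for GMP compliance",
    document_ids=["doc_123"],
)
```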
This system was developed as one of two complementary implementations demonstrating AI integration into electronic laboratory notebooks. Together with the SMILES2SPEC molecular spectral prediction system, it forms the technical foundation of the dissertation "Integrating AI into Electronic Lab Notebooks" submitted for the MSc AI in Biosciences programme at Queen Mary University of London.
For academic use of this work, please cite:
@mastersthesis{mohammed2025guardian,
title = {GUARDIAN: Privacy-First Pharmaceutical Compliance Analysis System for Electronic Laboratory Notebooks},
author = {Mohammed, Yusuf},
year = {2025},
school = {Queen Mary University of London},
department = {MSc AI in Biosciences},
supervisor = {Elbadawi, Mohammed},
note = {MSc Dissertation Project: Integrating AI into Electronic Lab Notebooks}
}

Developed as part of the MSc AI in Biosciences dissertation at Queen Mary University of London (2025)