This guide walks you through setting up the CADS Research Visualization System from scratch. Follow the steps in order to ensure a successful installation.
Before starting the installation, ensure you have the following prerequisites:
- Operating System: macOS, Linux, or Windows with WSL2
- Python: Version 3.8 or higher
- Node.js: Version 16 or higher (for development tools)
- Git: For repository management
- Modern Web Browser: Chrome, Firefox, Safari, or Edge
- Supabase Account: For PostgreSQL database hosting
- OpenAlex API Access: Free with email registration
- Groq API Key: Optional, for AI-powered theme generation
- Sentry Account: Optional, for error monitoring
- RAM: Minimum 8GB, recommended 16GB
- Storage: At least 5GB free space
- CPU: Multi-core processor recommended for ML processing
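The software prerequisites above can be checked before you start. The following is a minimal sketch, not part of the repository: the function name `check_prerequisites` is hypothetical, and it only confirms that `git` and `node` are on the PATH without verifying the Node.js version.

```python
import shutil
import sys

MIN_PYTHON = (3, 8)  # minimum version from the prerequisites list
REQUIRED_TOOLS = ["git", "node"]  # executables expected on PATH


def check_prerequisites(version_info=sys.version_info, which=shutil.which):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if tuple(version_info[:2]) < MIN_PYTHON:
        problems.append(
            f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ required, found "
            f"{version_info[0]}.{version_info[1]}"
        )
    for tool in REQUIRED_TOOLS:
        # shutil.which returns None when the executable is not on PATH
        if which(tool) is None:
            problems.append(f"'{tool}' not found on PATH")
    return problems


if __name__ == "__main__":
    issues = check_prerequisites()
    for issue in issues:
        print(f"MISSING: {issue}")
    print("All checked prerequisites found" if not issues else f"{len(issues)} problem(s) found")
```

Accounts, RAM, and disk space are not covered by this check and still need to be confirmed by hand.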
# Clone the repository
git clone https://github.com/your-org/cads-research-visualization.git
cd cads-research-visualization
# Verify repository structure
ls -la
Expected output:
drwxr-xr-x cads/ # Core data processing pipeline
drwxr-xr-x visuals/ # Interactive visualization dashboard
drwxr-xr-x database/ # Database schema and migrations
drwxr-xr-x scripts/ # Utility scripts
drwxr-xr-x docs/ # Documentation
drwxr-xr-x data/ # Data storage
drwxr-xr-x tests/ # Test suite
-rw-r--r-- README.md # Main documentation
-rw-r--r-- .env # Environment variables
# Check Python version (must be 3.8+)
python3 --version
# Create virtual environment (recommended)
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Navigate to CADS directory
cd cads
# Install Python dependencies
pip install -r requirements.txt
Verify installation:
# Test imports
python3 -c "import pandas, numpy, sklearn, sentence_transformers, umap, hdbscan; print('✅ All ML libraries installed successfully')"
- Go to supabase.com and create an account
- Create a new project
- Wait for the project to be ready (2-3 minutes)
- Go to Settings → Database
- Copy the connection string
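Before pasting the connection string into the environment file, it can help to sanity-check its shape. The sketch below is an illustration only: `check_database_url` is a hypothetical helper, it assumes the `postgresql://user:password@host:port/database` form used in this guide, and it never contacts the database.

```python
from urllib.parse import urlsplit


def check_database_url(url):
    """Return a list of problems found in a PostgreSQL connection string."""
    try:
        parts = urlsplit(url)
        host, password = parts.hostname, parts.password
        parts.port  # accessing .port raises ValueError if it is not a valid number
    except ValueError as exc:
        # Newer Python versions reject leftover [bracketed] placeholders here
        return [f"could not parse URL ({exc}); are [placeholders] still present?"]

    problems = []
    if parts.scheme not in ("postgresql", "postgres"):
        problems.append(f"unexpected scheme: {parts.scheme!r}")
    if not host:
        problems.append("missing host")
    if not password:
        problems.append("missing password")
    elif password.startswith("["):
        problems.append("password placeholder was not replaced")
    if not parts.path.lstrip("/"):
        problems.append("missing database name")
    return problems
```

An empty list means the string is well-formed; it does not prove that the credentials are correct or that the project is reachable.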
# Copy environment template
cp .env.example .env
# Edit environment file
nano .env # or use your preferred editor
Required environment variables:
# Database Connection (from Supabase)
DATABASE_URL=postgresql://postgres:[password]@[host]:5432/postgres
# API Configuration
OPENALEX_EMAIL=your_email@domain.com
# Optional: AI Theme Generation
GROQ_API_KEY=your_groq_api_key
# Optional: ML Configuration
EMBEDDING_MODEL=all-MiniLM-L6-v2
UMAP_N_NEIGHBORS=15
HDBSCAN_MIN_CLUSTER_SIZE=5
# Run database migration
python3 ../scripts/migration/execute_cads_migration.py
Expected output:
🗄️ Creating CADS database tables...
✅ Connected to database successfully
✅ Created cads_researchers table
✅ Created cads_works table
✅ Created cads_topics table
✅ Database setup completed successfully
Verify database setup:
# Check database connection
python3 ../scripts/utilities/check_cads_data_location.py
# Process CADS research data from OpenAlex
python3 ../scripts/processing/process_cads_with_openalex_ids.py
This will:
- Read the CADS faculty list from data/cads.txt
- Search OpenAlex for matching researchers
- Collect research papers and metadata
- Store data in the database
Expected processing time: 5-10 minutes
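To give a rough idea of what the researcher lookup involves, here is a minimal sketch of searching the OpenAlex authors endpoint by name. It is not the repository's script: the helper names are hypothetical, taking the first search result is a deliberately naive matching rule, and the one-second pause is an assumed polite delay.

```python
import json
import time
import urllib.parse
import urllib.request

OPENALEX_AUTHORS = "https://api.openalex.org/authors"


def build_author_search_url(name, email):
    """Build an OpenAlex author search URL; `mailto` identifies the caller."""
    query = urllib.parse.urlencode({"search": name, "mailto": email})
    return f"{OPENALEX_AUTHORS}?{query}"


def search_authors(names, email, delay=1.0):
    """Yield (name, first matching author record or None) for each name."""
    for name in names:
        url = build_author_search_url(name, email)
        with urllib.request.urlopen(url, timeout=30) as resp:
            results = json.load(resp).get("results", [])
        # Assumption for illustration: accept the top-ranked result as the match
        yield name, (results[0] if results else None)
        time.sleep(delay)  # pause between requests to avoid rate limiting
```

A production matcher would likely also compare affiliations or use known OpenAlex IDs, since common names return many candidates.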
# Migrate data to CADS-specific tables
python3 ../scripts/processing/migrate_cads_data_to_cads_tables.py
This creates CADS-specific copies of the data for processing.
# Execute the complete ML pipeline
python3 process_data.py
This will:
- Generate semantic embeddings (384-dimensional vectors)
- Perform UMAP dimensionality reduction
- Execute HDBSCAN clustering
- Generate AI-powered cluster themes
- Create visualization data files
Expected processing time: 5-10 minutes
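The embedding, reduction, and clustering stages listed above can be sketched in a few lines with the libraries installed earlier. This is an illustration under assumptions, not the contents of `process_data.py`: the function names are hypothetical, and clustering on the 2-D UMAP coordinates (rather than on the raw embeddings) is an assumed design choice.

```python
from collections import Counter


def embed_reduce_cluster(texts, model_name="all-MiniLM-L6-v2",
                         n_neighbors=15, min_cluster_size=5):
    """Embed texts, project them to 2-D with UMAP, and cluster with HDBSCAN."""
    # Imported lazily because these libraries are slow to load
    from sentence_transformers import SentenceTransformer
    import hdbscan
    import umap

    # all-MiniLM-L6-v2 produces one 384-dimensional vector per text
    embeddings = SentenceTransformer(model_name).encode(list(texts))
    coords = umap.UMAP(n_neighbors=n_neighbors, n_components=2,
                       random_state=42).fit_transform(embeddings)
    # Assumption: cluster on the reduced coordinates
    labels = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(coords)
    return coords, labels


def summarize_clusters(labels):
    """Count clusters and noise points; HDBSCAN labels noise as -1."""
    counts = Counter(int(label) for label in labels)
    noise = counts.pop(-1, 0)
    return {"clusters": len(counts), "noise": noise, "sizes": dict(counts)}
```

Calling `summarize_clusters(labels)` after a run gives a quick view of how many clusters were found and how many works were left unassigned. UMAP needs more samples than `n_neighbors` to behave well, so try this on at least a few dozen texts.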
# Navigate to visualization directory
cd ../visuals/public
# Verify data files exist
ls -la data/
Expected files:
-rw-r--r-- visualization-data.json # Complete dataset
-rw-r--r-- visualization-data.json.gz # Compressed version
-rw-r--r-- cluster_themes.json # AI-generated themes
-rw-r--r-- cluster_themes.json.gz # Compressed version
-rw-r--r-- clustering_results.json # Clustering results
-rw-r--r-- clustering_results.json.gz # Compressed version
-rw-r--r-- search-index.json # Search index
-rw-r--r-- search-index.json.gz # Compressed version
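Listing the directory shows that the files exist but not that they are usable. The sketch below goes one step further; `check_data_files` is a hypothetical helper, and it assumes each `.gz` file is a byte-for-byte compressed copy of its `.json` counterpart.

```python
import gzip
import json
import zlib
from pathlib import Path

EXPECTED = ["visualization-data.json", "cluster_themes.json",
            "clustering_results.json", "search-index.json"]


def check_data_files(data_dir):
    """Return a list of problems with the generated visualization data files."""
    problems = []
    for name in EXPECTED:
        plain = Path(data_dir) / name
        packed = Path(data_dir) / (name + ".gz")
        if not plain.is_file():
            problems.append(f"missing {name}")
            continue
        try:
            json.loads(plain.read_text(encoding="utf-8"))
        except ValueError:  # covers JSON and Unicode decoding errors
            problems.append(f"{name} is not valid JSON")
        if not packed.is_file():
            problems.append(f"missing {name}.gz")
            continue
        try:
            same = gzip.decompress(packed.read_bytes()) == plain.read_bytes()
        except (OSError, EOFError, zlib.error):
            same = False  # corrupt or truncated archive
        if not same:
            problems.append(f"{name}.gz does not match {name}")
    return problems
```

From `visuals/public`, `check_data_files("data")` returns an empty list when all eight files are present and consistent.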
# Start local web server
python3 -m http.server 8000
# Open in browser
open http://localhost:8000 # macOS
# or visit http://localhost:8000 manually
# Navigate back to project root
cd ../../
# Run comprehensive test suite
python3 tests/run_tests.py --all
# Run visual integration test
python3 tests/visualization/test_visual_integration.py
This will:
- Test data file integrity
- Start a local server
- Verify visualization loads correctly
- Provide manual testing checklist
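The "start a local server and verify it responds" part of such a test can be approximated with the standard library alone. This is a sketch of the idea rather than the repository's test: `fetch_from_temp_server` is a hypothetical name, and it checks only that a file is served, not that the visualization renders.

```python
import functools
import http.server
import threading
import urllib.request


def fetch_from_temp_server(directory, path="/"):
    """Serve `directory` on a free local port, fetch `path`, return (status, body)."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=str(directory)
    )
    # Port 0 asks the operating system for any free port
    server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), handler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        port = server.server_address[1]
        # An empty ProxyHandler keeps the request local even if a proxy is configured
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(f"http://127.0.0.1:{port}{path}", timeout=10) as resp:
            return resp.status, resp.read()
    finally:
        server.shutdown()
        server.server_close()
```

For example, `fetch_from_temp_server("visuals/public", "/data/search-index.json")` should return status 200 once the data files have been generated; a missing file raises `urllib.error.HTTPError`.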
# Test database connectivity
python3 tests/database/test_connection.py
# Test complete ML pipeline
python3 tests/pipeline/test_full_pipeline.py
| Variable | Required | Default | Description |
|---|---|---|---|
| DATABASE_URL | Yes | - | PostgreSQL connection string |
| OPENALEX_EMAIL | Yes | - | Email for OpenAlex API access |
| GROQ_API_KEY | No | - | API key for AI theme generation |
| EMBEDDING_MODEL | No | all-MiniLM-L6-v2 | Sentence transformer model |
| UMAP_N_NEIGHBORS | No | 15 | UMAP neighbors parameter |
| HDBSCAN_MIN_CLUSTER_SIZE | No | 5 | Minimum cluster size |
| LOG_LEVEL | No | INFO | Logging level (DEBUG, INFO, WARNING, ERROR) |
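The table above maps naturally onto a small configuration loader that fails early when a required variable is missing. The following is a minimal sketch under that reading; `load_config` is a hypothetical helper and may differ from how the project actually reads its settings.

```python
import os

REQUIRED = ("DATABASE_URL", "OPENALEX_EMAIL")
DEFAULTS = {
    "EMBEDDING_MODEL": "all-MiniLM-L6-v2",
    "UMAP_N_NEIGHBORS": "15",
    "HDBSCAN_MIN_CLUSTER_SIZE": "5",
    "LOG_LEVEL": "INFO",
}
LOG_LEVELS = ("DEBUG", "INFO", "WARNING", "ERROR")


def load_config(env=None):
    """Build a settings dict from environment variables, applying documented defaults."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError("missing required variables: " + ", ".join(missing))

    config = {name: env[name] for name in REQUIRED}
    config["GROQ_API_KEY"] = env.get("GROQ_API_KEY")  # optional, may be None
    for name, default in DEFAULTS.items():
        config[name] = env.get(name, default)

    # Numeric settings arrive as strings and must be converted
    config["UMAP_N_NEIGHBORS"] = int(config["UMAP_N_NEIGHBORS"])
    config["HDBSCAN_MIN_CLUSTER_SIZE"] = int(config["HDBSCAN_MIN_CLUSTER_SIZE"])
    if config["LOG_LEVEL"] not in LOG_LEVELS:
        raise ValueError(f"LOG_LEVEL must be one of {LOG_LEVELS}")
    return config
```

Passing a plain dict instead of relying on `os.environ` makes the loader easy to exercise in tests.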
For better performance on large datasets:
# Increase UMAP neighbors for better global structure
export UMAP_N_NEIGHBORS=30
# Decrease minimum cluster size for more granular clusters
export HDBSCAN_MIN_CLUSTER_SIZE=3
# Enable debug logging for troubleshooting
export LOG_LEVEL=DEBUG
Problem: pip install fails with compilation errors
Solution:
# Update pip and setuptools
pip install --upgrade pip setuptools wheel
# Install system dependencies (Ubuntu/Debian)
sudo apt-get install python3-dev build-essential
# Install system dependencies (macOS)
xcode-select --install
Problem: Cannot connect to Supabase database
Solutions:
# Confirm the PostgreSQL driver (psycopg2) is installed
python3 -c "import psycopg2; print('✅ psycopg2 working')"
# Check the variable is set in the current shell
echo $DATABASE_URL
# Verify Supabase project is active
# Go to Supabase dashboard and check project status
Problem: API requests failing or rate limited
Solutions:
# Verify email is set
echo $OPENALEX_EMAIL
# Test API access
curl "https://api.openalex.org/works?filter=author.display_name:john+smith&mailto=$OPENALEX_EMAIL"
# Add delays between requests if rate limited
export OPENALEX_DELAY=1 # 1 second delay between requests
Problem: Out of memory during UMAP/HDBSCAN processing
Solutions:
# Reduce dataset size for testing
export MAX_WORKS=1000
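# Use a smaller embedding model (all-MiniLM-L6-v2 is already the default, so this only helps if you changed it)
export EMBEDDING_MODEL=all-MiniLM-L6-v2
# Check current swap space on Linux (this only reports swap; add more if it is too small)
sudo swapon --show
Problem: Blank page or JavaScript errors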
Solutions:
# Check data files exist
ls -la visuals/public/data/
# Test with simple HTTP server
cd visuals/public
python3 -m http.server 8001
# Check browser console for errors
# Open Developer Tools → Console
- Check logs: Enable debug logging with export LOG_LEVEL=DEBUG
- Run tests: Use the test suite to identify specific issues
- Check documentation: Review component-specific README files
- Verify prerequisites: Ensure all system requirements are met
After successful installation:
- Explore the visualization: Open http://localhost:8000 and test all features
- Review documentation: Read the User Guide for detailed usage instructions
- Set up monitoring: Configure Sentry integration for production use
- Configure CI/CD: Set up GitHub Actions for automated testing
- Customize data: Add your own research data or modify the CADS faculty list
- Update research data: python3 scripts/processing/process_cads_with_openalex_ids.py
- Regenerate visualizations: python3 cads/process_data.py
- Run test suite: python3 tests/run_tests.py --all
- Update Python dependencies: pip install -r requirements.txt --upgrade
- Review database performance and optimize queries
- Check for new OpenAlex data and API changes
- Add new CADS faculty to data/cads.txt
- Update visualization themes and styling
- Scale database resources based on usage
🎉 Installation Complete!
Your CADS Research Visualization System is now ready for use. Visit http://localhost:8000 to start exploring research data and patterns.
For additional help, see the Troubleshooting Guide or User Guide.