This directory contains all data files for the CADS Research Visualization project, organized by processing stage and usage type. Data flows from raw sources through processing stages to final visualization formats.
data/
├── README.md # This documentation
├── raw/ # Raw, unprocessed data files
│   └── [original data files]
├── processed/ # Processed and analyzed data
│   ├── cads_search_patterns.json
│   ├── cluster_themes.json
│   ├── clustering_results.json
│   └── visualization-data.json
└── search/ # Search index files
    └── search-index.json
Raw Data → Processing Pipeline → Processed Data → Visualization
    ↓              ↓                   ↓               ↓
OpenAlex      CADS Pipeline       JSON Files       Dashboard
Database      Embeddings          Clusters         Display
raw/
Purpose: Original, unprocessed data files
Contents:
- Source data from OpenAlex API
- Database exports
- Original research datasets
- Backup files
Characteristics:
- Immutable - never modified after creation
- Source of truth for all processing
- May contain sensitive or unfiltered information
processed/
Purpose: Cleaned, analyzed, and structured data ready for visualization
Key Files:
cads_search_patterns.json:
- Purpose: Search patterns and keywords for CADS research
- Structure: JSON array of search terms and patterns
- Usage: Powers search functionality in visualization
- Size: ~50KB
cluster_themes.json:
- Purpose: AI-generated descriptions for research clusters
- Structure: JSON object mapping cluster IDs to theme descriptions
- Usage: Provides human-readable cluster names
- Generated by: Groq AI based on cluster content
- Size: ~100KB
clustering_results.json:
- Purpose: HDBSCAN clustering assignments for research works
- Structure: JSON array with work IDs and cluster assignments
- Usage: Groups similar research works for visualization
- Generated by: CADS pipeline clustering process
- Size: ~500KB
visualization-data.json:
- Purpose: Complete dataset formatted for web visualization
- Structure: JSON object with works, researchers, and metadata
- Usage: Primary data source for interactive dashboard
- Generated by: CADS pipeline final output
- Size: ~2-5MB
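
The cluster assignments and cluster themes are meant to be read together. As a minimal sketch, assuming each entry in clustering_results.json is an object with `work_id` and `cluster` keys (hypothetical names; the real schema is not documented here) and that cluster_themes.json is keyed by the cluster ID as a string, the two can be joined like this:

```python
from collections import defaultdict


def group_works_by_theme(clustering_results, cluster_themes):
    """Group work IDs by cluster and attach each cluster's theme.

    Assumes (hypothetically) entries shaped like
    {"work_id": "W1", "cluster": 0}; adjust the keys to the real schema.
    """
    grouped = defaultdict(list)
    for entry in clustering_results:
        grouped[entry["cluster"]].append(entry["work_id"])

    return {
        cluster_id: {
            # JSON object keys are always strings, so look up by str(id).
            "theme": cluster_themes.get(str(cluster_id), "(no theme)"),
            "works": work_ids,
        }
        for cluster_id, work_ids in grouped.items()
    }


# Illustrative in-memory data, not taken from the real files.
sample_results = [
    {"work_id": "W1", "cluster": 0},
    {"work_id": "W2", "cluster": 1},
    {"work_id": "W3", "cluster": 0},
]
sample_themes = {"0": "Machine learning", "1": "Data ethics"}

summary = group_works_by_theme(sample_results, sample_themes)
```

In practice the two arguments would come from `json.load` on the files in processed/ rather than from inline samples.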
search/
Purpose: Optimized data structures for search functionality
Key Files:
search-index.json:
- Purpose: Pre-built search index for fast text search
- Structure: Inverted index mapping terms to document IDs
- Usage: Enables real-time search in visualization
- Generated by: Search indexing process
- Size: ~1-2MB
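
To illustrate what an inverted index mapping terms to document IDs looks like, here is a small sketch. It is not the project's indexing code: the tokenizer, the function names, and the AND-style query semantics are all assumptions chosen for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map each term to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search(index, query):
    """Return IDs of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return []
    matches = set(index.get(terms[0], []))
    for term in terms[1:]:
        matches &= set(index.get(term, []))
    return sorted(matches)


# Illustrative titles, not real project data.
docs = {
    "W1": "Deep learning for climate data",
    "W2": "Climate policy analysis",
    "W3": "Deep reinforcement learning",
}
index = build_inverted_index(docs)
```

Because the index is computed once ahead of time, a query only needs dictionary lookups and set intersections, which is what makes real-time search feasible in the browser.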
All data files use JSON format for:
- Cross-platform compatibility
- Easy parsing in JavaScript
- Human-readable structure
- Version control friendly
Large files may be stored in compressed format:
- .json.gz: Gzip-compressed JSON
- Automatic decompression in pipeline
- ~70% size reduction for large datasets
- UTF-8 encoding for all text files
- Unicode support for international characters
- Consistent line endings (LF)
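
The compression and encoding conventions above can be handled in one place. A minimal sketch, assuming files carry a .json or .json.gz suffix; `load_json` and `save_json` are hypothetical helper names, not functions from the CADS pipeline:

```python
import gzip
import json
from pathlib import Path


def load_json(path):
    """Load a JSON file, transparently handling a .gz suffix."""
    path = Path(path)
    # gzip.open in text mode decompresses and decodes in one step.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


def save_json(data, path):
    """Write JSON as UTF-8 with LF line endings, gzipped if path ends in .gz."""
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    # newline="\n" stops LF from being translated to CRLF on Windows.
    with opener(path, "wt", encoding="utf-8", newline="\n") as handle:
        # ensure_ascii=False keeps international characters readable.
        json.dump(data, handle, ensure_ascii=False, indent=2)
        handle.write("\n")
```

Callers can then pass either `data/processed/visualization-data.json` or its `.json.gz` counterpart without branching on the format themselves.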
- Raw Data Collection
  # Collect from OpenAlex API
  python3 scripts/processing/process_cads_with_openalex_ids.py
- Data Processing
  # Run CADS pipeline
  python3 cads/process_data.py
- Search Index Generation
  # Generate search indexes
  python3 scripts/utilities/generate_search_index.py
To update data manually:
# Backup existing data
cp -r data/ data_backup_$(date +%Y%m%d)/
# Run pipeline to regenerate
python3 cads/process_data.py
# Verify new data
python3 scripts/utilities/verify_data_integrity.py

The pipeline includes automatic validation:
- Schema validation: Ensures correct JSON structure
- Completeness checks: Verifies all required fields
- Consistency checks: Cross-references between files
- Size validation: Checks for reasonable file sizes
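
The checks listed above might look roughly like the following. This is a sketch under stated assumptions, not the contents of verify_data_integrity.py: the `works` and `cluster` keys, the function names, and the parameters are hypothetical, and the size check is approximated by a record count rather than bytes on disk.

```python
def validate_dataset(data, required_fields, min_works=1):
    """Return a list of human-readable problems found in a dataset dict.

    Assumes (hypothetically) that `data` has a "works" list of objects;
    `required_fields` and `min_works` are illustrative parameters.
    """
    problems = []

    # Schema check: the top-level shape must match expectations.
    works = data.get("works") if isinstance(data, dict) else None
    if not isinstance(works, list):
        return ["schema: top-level 'works' list is missing"]

    # Size check: an unexpectedly small dataset suggests a failed run.
    if len(works) < min_works:
        problems.append(f"size: only {len(works)} works, expected >= {min_works}")

    # Completeness check: every work carries all required fields.
    for position, work in enumerate(works):
        missing = [field for field in required_fields if field not in work]
        if missing:
            problems.append(f"completeness: work #{position} lacks {missing}")

    return problems


def find_unknown_clusters(works, cluster_themes):
    """Consistency check: cluster IDs used by works but absent from themes."""
    used = {str(work["cluster"]) for work in works if "cluster" in work}
    return sorted(used - set(cluster_themes))
```

Returning a list of problems instead of raising on the first one lets a single run report every issue at once.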
| Metric | Expected Value | Description |
|---|---|---|
| Works with embeddings | 100% | All works have semantic vectors |
| Cluster coverage | >95% | Works assigned to clusters |
| Search index coverage | 100% | All works in search index |
| Theme generation | >90% | Clusters have AI-generated themes |
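
The metrics in the table above are simple coverage ratios. A minimal sketch of how they could be computed, assuming each work is an object with `id`, `embedding`, and `cluster` fields and that unclustered works carry the label -1 (the label HDBSCAN gives noise points); the field names and function names are assumptions for illustration:

```python
def coverage(works, predicate):
    """Percentage of works for which predicate(work) is true."""
    if not works:
        return 0.0
    return 100.0 * sum(1 for work in works if predicate(work)) / len(works)


def quality_report(works, indexed_ids, cluster_themes):
    """Compute the four quality metrics as percentages.

    Field names ("id", "embedding", "cluster") and the use of -1 as
    the unclustered label are assumptions for illustration only.
    """
    clusters = {w["cluster"] for w in works if w.get("cluster", -1) != -1}
    themed = sum(1 for c in clusters if str(c) in cluster_themes)
    return {
        "works_with_embeddings": coverage(works, lambda w: bool(w.get("embedding"))),
        "cluster_coverage": coverage(works, lambda w: w.get("cluster", -1) != -1),
        "search_index_coverage": coverage(works, lambda w: w["id"] in indexed_ids),
        "theme_generation": 100.0 * themed / len(clusters) if clusters else 0.0,
    }
```

Each value can then be compared against the expected thresholds (100%, >95%, 100%, >90%) to decide whether a pipeline run is acceptable.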
- No personal data: Only public research information
- Anonymized where needed: Personal identifiers removed
- Public research only: All data from public sources
- Read-only for visualization: Dashboard only reads data
- Write access for pipeline: Only processing scripts modify data
- Backup procedures: Regular backups of processed data
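
Regular backups can also be scripted. The sketch below mirrors the `data_backup_YYYYMMDD` naming and the 30-day retention used by the shell commands in this document; the function names and the choice to read the age from the directory name are assumptions, not part of the existing tooling.

```python
import shutil
from datetime import date, datetime
from pathlib import Path


def backup_directory(source, backup_root, today=None):
    """Copy `source` to backup_root/data_backup_YYYYMMDD and return the path.

    `backup_root` must live outside `source`, or the copy would recurse.
    """
    stamp = (today or date.today()).strftime("%Y%m%d")
    target = Path(backup_root) / f"data_backup_{stamp}"
    # dirs_exist_ok (Python 3.8+) lets a same-day rerun refresh the backup.
    shutil.copytree(source, target, dirs_exist_ok=True)
    return target


def prune_backups(backup_root, max_age_days=30, today=None):
    """Delete data_backup_* directories whose date stamp is too old."""
    today = today or date.today()
    removed = []
    for candidate in Path(backup_root).glob("data_backup_*"):
        try:
            stamp = datetime.strptime(candidate.name, "data_backup_%Y%m%d").date()
        except ValueError:
            continue  # No parseable date in the name; leave it alone.
        if candidate.is_dir() and (today - stamp).days > max_age_days:
            shutil.rmtree(candidate)
            removed.append(candidate.name)
    return sorted(removed)
```

The age is taken from the directory name rather than the modification time because `shutil.copytree` copies the source directory's timestamps onto the backup.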
- Data Updates (Monthly)
  # Update research data
  python3 scripts/processing/process_cads_with_openalex_ids.py
- Index Rebuilding (As needed)
  # Rebuild search indexes
  python3 scripts/utilities/rebuild_search_index.py
- Quality Checks (Weekly)
  # Verify data integrity
  python3 scripts/utilities/verify_data_integrity.py
# Remove old backup files (older than 30 days)
find data/ -name "*_backup_*" -mtime +30 -delete
# Compress large files
gzip data/processed/visualization-data.json
# Verify compression
gunzip -t data/processed/visualization-data.json.gz

# Check data file sizes
du -sh data/*/*.json
# Monitor growth over time
ls -lah data/processed/

# Count records in data files
jq 'length' data/processed/clustering_results.json
jq 'keys | length' data/processed/cluster_themes.json
jq '.works | length' data/processed/visualization-data.json

# Regenerate missing files
python3 cads/process_data.py

# Validate JSON syntax
jq . data/processed/visualization-data.json > /dev/null

# Compress large files
gzip data/processed/visualization-data.json

# Check file modification times
ls -la data/processed/

# Restore from backup
cp -r data_backup_YYYYMMDD/* data/
# Regenerate from database
python3 cads/process_data.py
# Verify restoration
python3 scripts/utilities/verify_data_integrity.py

- Pipeline generates processed data files
- Automatic validation and quality checks
- Consistent file formats and structures
- Dashboard reads processed data files
- Real-time updates when data changes
- Optimized formats for web display
- Search indexes enable fast text search
- Pre-computed indexes for performance
- Regular index updates with data changes
Organized data structure supporting reliable CADS research visualization!