CADS Data Directory

📊 Overview

This directory contains all data files for the CADS Research Visualization project, organized by processing stage and usage type. Data flows from raw sources through processing stages to final visualization formats.

📁 Directory Structure

data/
├── README.md                    # This documentation
├── raw/                         # Raw, unprocessed data files
│   └── [original data files]
├── processed/                   # Processed and analyzed data
│   ├── cads_search_patterns.json
│   ├── cluster_themes.json
│   ├── clustering_results.json
│   └── visualization-data.json
└── search/                      # Search index files
    └── search-index.json

🔄 Data Flow

Raw Data → Processing Pipeline → Processed Data → Visualization
   ↓              ↓                    ↓              ↓
OpenAlex      CADS Pipeline      JSON Files     Dashboard
Database      Embeddings         Clusters       Display

📂 Data Categories

🗃️ Raw Data (raw/)

Purpose: Original, unprocessed data files

Contents:

  • Source data from OpenAlex API
  • Database exports
  • Original research datasets
  • Backup files

Characteristics:

  • Immutable - never modified after creation
  • Source of truth for all processing
  • May contain sensitive or unfiltered information

🔧 Processed Data (processed/)

Purpose: Cleaned, analyzed, and structured data ready for visualization

Key Files:

cads_search_patterns.json

  • Purpose: Search patterns and keywords for CADS research
  • Structure: JSON array of search terms and patterns
  • Usage: Powers search functionality in visualization
  • Size: ~50KB

cluster_themes.json

  • Purpose: AI-generated descriptions for research clusters
  • Structure: JSON object mapping cluster IDs to theme descriptions
  • Usage: Provides human-readable cluster names
  • Generated by: Groq AI based on cluster content
  • Size: ~100KB

clustering_results.json

  • Purpose: HDBSCAN clustering assignments for research works
  • Structure: JSON array with work IDs and cluster assignments
  • Usage: Groups similar research works for visualization
  • Generated by: CADS pipeline clustering process
  • Size: ~500KB
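
As a rough illustration of how the cluster assignments and theme descriptions might be combined, the sketch below groups work IDs by cluster and attaches a theme label. The field names `work_id` and `cluster_id`, and the idea that theme keys are cluster IDs stored as strings, are assumptions made for this example; the actual schema may differ.

```python
from collections import defaultdict


def group_works_by_cluster(assignments, themes):
    """Group work IDs by cluster and attach a readable theme label.

    `assignments` is assumed to be a list of {"work_id", "cluster_id"} dicts
    and `themes` a dict keyed by the cluster ID as a string (JSON object keys
    are always strings). Both shapes are illustrative assumptions.
    """
    grouped = defaultdict(list)
    for record in assignments:
        grouped[record["cluster_id"]].append(record["work_id"])

    return {
        cluster_id: {
            "theme": themes.get(str(cluster_id), "(no theme)"),
            "works": work_ids,
        }
        for cluster_id, work_ids in grouped.items()
    }


# Tiny in-memory example standing in for the two JSON files.
assignments = [
    {"work_id": "W1", "cluster_id": 0},
    {"work_id": "W2", "cluster_id": 0},
    {"work_id": "W3", "cluster_id": 1},
]
themes = {"0": "Machine learning", "1": "Databases"}
clusters = group_works_by_cluster(assignments, themes)
```

In practice the two inputs would come from `json.load` on `clustering_results.json` and `cluster_themes.json`.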

visualization-data.json

  • Purpose: Complete dataset formatted for web visualization
  • Structure: JSON object with works, researchers, and metadata
  • Usage: Primary data source for interactive dashboard
  • Generated by: CADS pipeline final output
  • Size: ~2-5MB

🔍 Search Data (search/)

Purpose: Optimized data structures for search functionality

Key Files:

search-index.json

  • Purpose: Pre-built search index for fast text search
  • Structure: Inverted index mapping terms to document IDs
  • Usage: Enables real-time search in visualization
  • Generated by: Search indexing process
  • Size: ~1-2MB
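
To make the inverted-index idea concrete, here is a minimal sketch that maps each term to the documents containing it and answers multi-term queries by intersecting those lists. The tokenizer and the function names are hypothetical; the real indexing script may normalize text differently or store extra data such as term positions.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map each term to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search(index, query):
    """Return IDs of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Made-up titles used only to exercise the functions.
documents = {
    "W1": "Graph neural networks for molecules",
    "W2": "Neural networks in databases",
    "W3": "Query optimization in databases",
}
index = build_inverted_index(documents)
```

Because the index is computed ahead of time, a query only needs dictionary lookups and set intersections, which is what makes search-as-you-type feasible in the browser.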

📈 Data Specifications

File Formats

All data files use JSON format for:

  • Cross-platform compatibility
  • Easy parsing in JavaScript
  • Human-readable structure
  • Version control friendly

Compression

Large files may be stored in compressed format:

  • .json.gz - Gzip compressed JSON
  • Automatic decompression in pipeline
  • ~70% size reduction for large datasets
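
One way transparent decompression could work is sketched below: a loader that reads the plain `.json` file when it exists and otherwise falls back to the `.json.gz` sibling. The function name `load_json` is hypothetical and this is not necessarily how the pipeline handles it.

```python
import gzip
import json
import tempfile
from pathlib import Path


def load_json(path):
    """Load JSON from `path`, falling back to `path` + ".gz" if needed."""
    path = Path(path)
    gz_path = path if path.suffix == ".gz" else path.with_name(path.name + ".gz")
    if path.suffix != ".gz" and path.exists():
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    # gzip.open in text mode decompresses and decodes in one step.
    with gzip.open(gz_path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


# Demonstration in a temporary directory: only the .gz file exists,
# so the loader falls back to it.
with tempfile.TemporaryDirectory() as tmp:
    plain = Path(tmp) / "visualization-data.json"
    with gzip.open(str(plain) + ".gz", "wt", encoding="utf-8") as handle:
        json.dump({"works": [1, 2, 3]}, handle)
    loaded = load_json(plain)
```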

Encoding

  • UTF-8 encoding for all text files
  • Unicode support for international characters
  • Consistent line endings (LF)
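
The encoding rules above can be enforced at write time. The following sketch shows one way to do that in Python; the helper name `write_json` is an assumption for illustration, not part of the pipeline.

```python
import json
import tempfile
from pathlib import Path


def write_json(path, data):
    """Write `data` as UTF-8 JSON with LF line endings on every platform."""
    # newline="\n" stops Python from translating "\n" to "\r\n" on Windows.
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        # ensure_ascii=False keeps characters such as "é" readable in the file.
        json.dump(data, handle, ensure_ascii=False, indent=2)
        handle.write("\n")


# Write a sample record and read the raw bytes back for inspection.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "example.json"
    write_json(target, {"author": "José Müller"})
    raw = target.read_bytes()
```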

🔧 Data Generation

Pipeline Process

  1. Raw Data Collection

    # Collect from OpenAlex API
    python3 scripts/processing/process_cads_with_openalex_ids.py
  2. Data Processing

    # Run CADS pipeline
    python3 cads/process_data.py
  3. Search Index Generation

    # Generate search indexes
    python3 scripts/utilities/generate_search_index.py

Manual Data Updates

To update data manually:

# Backup existing data
cp -r data/ data_backup_$(date +%Y%m%d)/

# Run pipeline to regenerate
python3 cads/process_data.py

# Verify new data
python3 scripts/utilities/verify_data_integrity.py

📊 Data Quality

Validation Checks

The pipeline includes automatic validation:

  • Schema validation: Ensures correct JSON structure
  • Completeness checks: Verifies all required fields
  • Consistency checks: Cross-references between files
  • Size validation: Checks for reasonable file sizes
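
The first three checks could look roughly like the sketch below (size validation is omitted). The record fields `work_id` and `cluster_id`, the string-keyed themes object, and the use of `-1` for unclustered works are assumptions for illustration; the pipeline's own validation may be stricter or structured differently.

```python
def validate(assignments, themes, required_fields=("work_id", "cluster_id")):
    """Return a list of human-readable problems; empty means all checks passed."""
    problems = []

    # Schema check: the top-level containers have the expected types.
    if not isinstance(assignments, list):
        return ["clustering results should be a JSON array"]
    if not isinstance(themes, dict):
        return ["cluster themes should be a JSON object"]

    for position, record in enumerate(assignments):
        # Completeness check: every record carries the required fields.
        missing = [f for f in required_fields if f not in record]
        if missing:
            problems.append(f"record {position} is missing {missing}")
            continue
        # Consistency check: each assigned cluster has a theme entry.
        # A label of -1 is treated as "unclustered" and skipped (assumption).
        cluster_id = record["cluster_id"]
        if cluster_id != -1 and str(cluster_id) not in themes:
            problems.append(f"cluster {cluster_id} has no theme")

    return problems


# Sample input with one incomplete record and one cluster without a theme.
problems = validate(
    [
        {"work_id": "W1", "cluster_id": 0},
        {"work_id": "W2"},
        {"work_id": "W3", "cluster_id": 2},
    ],
    {"0": "Machine learning"},
)
```

Returning a list of problems rather than raising on the first one lets a single run report everything that needs attention.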

Quality Metrics

| Metric | Expected Value | Description |
|---|---|---|
| Works with embeddings | 100% | All works have semantic vectors |
| Cluster coverage | >95% | Works assigned to clusters |
| Search index coverage | 100% | All works in search index |
| Theme generation | >90% | Clusters have AI-generated themes |
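
Two of these metrics can be computed directly from the processed files. The sketch below assumes each clustering record has a `cluster_id` field and that unclustered works carry the label `-1` (the label HDBSCAN gives to noise points); both the field name and the function names are hypothetical.

```python
def cluster_coverage(assignments):
    """Fraction of works assigned to a real cluster (label other than -1)."""
    if not assignments:
        return 0.0
    clustered = sum(1 for record in assignments if record["cluster_id"] != -1)
    return clustered / len(assignments)


def theme_coverage(assignments, themes):
    """Fraction of distinct clusters that have a theme description."""
    cluster_ids = {r["cluster_id"] for r in assignments if r["cluster_id"] != -1}
    if not cluster_ids:
        return 0.0
    themed = sum(1 for cid in cluster_ids if str(cid) in themes)
    return themed / len(cluster_ids)


# Four works: three clustered, one labelled as noise.
sample = [
    {"work_id": "W1", "cluster_id": 0},
    {"work_id": "W2", "cluster_id": 0},
    {"work_id": "W3", "cluster_id": 1},
    {"work_id": "W4", "cluster_id": -1},
]
coverage = cluster_coverage(sample)
themed = theme_coverage(sample, {"0": "Machine learning"})
```

Comparing `coverage` against 0.95 and `themed` against 0.90 gives a simple pass/fail signal for the thresholds in the table.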

🔒 Data Security

Sensitive Information

  • No personal data: Only public research information
  • Anonymized where needed: Personal identifiers removed
  • Public research only: All data from public sources

Access Control

  • Read-only for visualization: Dashboard only reads data
  • Write access for pipeline: Only processing scripts modify data
  • Backup procedures: Regular backups of processed data

📝 Data Maintenance

Regular Tasks

  1. Data Updates (Monthly)

    # Update research data
    python3 scripts/processing/process_cads_with_openalex_ids.py
  2. Index Rebuilding (As needed)

    # Rebuild search indexes
    python3 scripts/utilities/rebuild_search_index.py
  3. Quality Checks (Weekly)

    # Verify data integrity
    python3 scripts/utilities/verify_data_integrity.py

Cleanup Procedures

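# Remove old backup directories (older than 30 days).
# Backups are created next to data/ as data_backup_YYYYMMDD/, so run this from the project root.
find . -maxdepth 1 -type d -name "data_backup_*" -mtime +30 -exec rm -rf {} +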

# Compress large files (gzip replaces the original with a .gz file; add -k to keep both)
gzip data/processed/visualization-data.json

# Verify compression
gunzip -t data/processed/visualization-data.json.gz
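
The 30-day retention rule for backups can also be expressed against the date stamp in the directory name rather than the filesystem modification time. The sketch below only selects candidates and deletes nothing; the function name is hypothetical, and it assumes backups follow the `data_backup_YYYYMMDD` naming used by the backup command in this document.

```python
from datetime import date, datetime, timedelta


def expired_backups(names, today, max_age_days=30):
    """Return backup directory names whose date stamp is older than the cutoff."""
    prefix = "data_backup_"
    cutoff = today - timedelta(days=max_age_days)
    expired = []
    for name in names:
        if not name.startswith(prefix):
            continue
        try:
            stamp = datetime.strptime(name[len(prefix):], "%Y%m%d").date()
        except ValueError:
            continue  # Not a date-stamped backup; leave it alone.
        if stamp < cutoff:
            expired.append(name)
    return expired


# Example directory listing evaluated against a fixed reference date.
names = ["data_backup_20240101", "data_backup_20240320", "data_backup_notes", "data"]
to_remove = expired_backups(names, today=date(2024, 3, 31))
```

Keeping selection separate from deletion makes it easy to print the list and review it before removing anything.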

🔍 Data Analysis

File Size Monitoring

# Check data file sizes
du -sh data/*/*.json

# List current sizes and modification times (compare across runs to track growth)
ls -lah data/processed/

Content Analysis

# Count records in data files
jq 'length' data/processed/clustering_results.json
jq 'keys | length' data/processed/cluster_themes.json
jq '.works | length' data/processed/visualization-data.json

🚨 Troubleshooting

Common Issues

1. Missing Data Files

# Regenerate missing files
python3 cads/process_data.py

2. Corrupted JSON Files

# Validate JSON syntax
jq . data/processed/visualization-data.json > /dev/null
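
Where `jq` is not installed, Python's standard `json` module can perform the same syntax check and report where parsing failed. This is a minimal sketch; the helper name `check_json_text` is an assumption for illustration.

```python
import json


def check_json_text(text):
    """Return None if `text` is valid JSON, else a short error location string."""
    try:
        json.loads(text)
    except json.JSONDecodeError as error:
        # lineno and colno point at the position where parsing failed.
        return f"line {error.lineno}, column {error.colno}: {error.msg}"
    return None


ok = check_json_text('{"works": []}')
broken = check_json_text('{"works": [1, 2,}')
```

To check a file, read it with `open(path, encoding="utf-8").read()` and pass the text to the helper.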

3. Large File Sizes

# Compress large files
gzip data/processed/visualization-data.json

4. Outdated Data

# Check file modification times
ls -la data/processed/

Recovery Procedures

# Restore from backup
cp -r data_backup_YYYYMMDD/* data/

# Regenerate from database
python3 cads/process_data.py

# Verify restoration
python3 scripts/utilities/verify_data_integrity.py

🔗 Integration

With CADS Pipeline

  • Pipeline generates processed data files
  • Automatic validation and quality checks
  • Consistent file formats and structures

With Visualization Dashboard

  • Dashboard reads processed data files
  • Real-time updates when data changes
  • Optimized formats for web display

With Search System

  • Search indexes enable fast text search
  • Pre-computed indexes for performance
  • Regular index updates with data changes

📊 Organized data structure supporting reliable CADS research visualization!