Welcome to MATRIX! This repository contains our drug repurposing platform which includes data science pipelines, infrastructure, and documentation.
MATRIX is a drug repurposing platform organized as a monorepo containing machine learning pipelines, infrastructure as code, shared libraries, applications, and services.
- `/pipelines/matrix` - 🧬 Main drug repurposing ML pipeline, built on the Kedro framework
- `/libs` - 📚 Shared libraries:
  - `matrix-auth/` - Authentication and environment utilities
  - `matrix-fabricator/` - Data fabrication and generation tools
  - `matrix-gcp-datasets/` - GCP integration and Spark utilities
  - `matrix-mlflow-utils/` - MLflow integration and metric utilities
- `/infra` - 🏗️ Infrastructure as Code (IaC) using Terraform/Terragrunt for GCP deployment
- `/services` - ⚙️ Supporting services and APIs (KG dashboard, MOA visualizer, synonymizer, etc.)
- `/docs` - 📖 Documentation site generation
The repository uses uv's workspace feature for efficient multi-package development:

- Root `pyproject.toml`: defines the workspace configuration
- Individual packages: each directory with a `pyproject.toml` is a separate package
- Shared dependencies: common dependencies are managed at the workspace level
- Local development: libraries are automatically linked in editable mode
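A root workspace configuration of this kind typically looks like the following sketch. The member paths match the directories above, but the exact file contents are an assumption for illustration, not a copy of the repository's actual `pyproject.toml`:

```toml
# Root pyproject.toml (illustrative sketch, not the real file)
[tool.uv.workspace]
members = ["pipelines/matrix", "libs/*"]

[tool.uv.sources]
# Workspace members resolve to local sources, linked in editable mode
matrix-auth = { workspace = true }
matrix-fabricator = { workspace = true }
```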
Ready to get started? Go to our Getting Started section
Start Development:

make setup # Check for dependencies and install pre-commit hooks
cd pipelines/matrix
make # Run full integration test locally

Setup and Installation:
make install # Install dependencies with uv
cd pipelines/matrix
make fetch_secrets # Fetch secrets from GCP Secret Manager

Testing:
make fast_test # Quick tests with testmon
make full_test # Complete test suite
make integration_test # Integration tests using fabricated data; services in Docker, pipeline not
make docker_test # Full E2E test, pipeline also in Docker

💡 Use `make docker_test TARGET_PLATFORM=linux/arm64` on ARM machines for better performance.
Linting and Formatting:
Run these at the root of the repo.
make format # Fix code formatting with ruff
make precommit # Run pre-commit hooks
uv run ruff check . --fix # Direct ruff usage

Running Pipelines:
Inside of pipelines/matrix/
uv run kedro run --env test -p test # Run test pipeline
uv run kedro run -p fabricator --env test # Run fabricator pipeline
make compose_up # Start Docker services
make integration_test # Run integration tests

Docker Operations:
make docker_build # Build Docker image
make docker_push # Push to registry
make compose_up # Start services
make compose_down # Stop services

Infrastructure uses Terragrunt with Terraform:
cd infra/deployments/hub/dev # Navigate to specific environment
terragrunt validate # Validate Terraform files
terragrunt plan # Plan changes
terragrunt apply # Apply changes

- Kedro Framework: Structured ML pipelines with data catalog and parameter management
- Apache Spark: Large-scale data preprocessing with PySpark
- Neo4j: Graph database for knowledge graph storage and querying
- MLflow: Experiment tracking and model management
- Docker Compose: Local development environment orchestration
- Ingestion: Raw data from multiple knowledge graph sources (RTX-KG2, ROBOKOP)
- Integration: Merging and normalizing knowledge graphs
- Preprocessing: Node normalization and data cleaning
- Embeddings: Graph embeddings generation for ML features
- Matrix Generation: Drug-disease association matrices
- Modeling: Machine learning model training and evaluation
- Inference: Generating predictions and visualizations
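Conceptually, the stages above form a linear data flow from raw knowledge graphs to predictions. The sketch below shows only that ordering; all function names and data shapes are illustrative, and the real stages are Kedro nodes operating on Spark DataFrames and graph stores, not plain dicts:

```python
# Illustrative sketch of the stage ordering only (hypothetical names).

def ingest(sources):
    # Raw data from each knowledge graph source (placeholder content)
    return {name: {"nodes": [f"{name}:drug", f"{name}:disease"], "edges": []}
            for name in sources}

def integrate(per_source):
    # Merge and normalize the per-source graphs into one graph
    merged = {"nodes": [], "edges": []}
    for graph in per_source.values():
        merged["nodes"].extend(graph["nodes"])
        merged["edges"].extend(graph["edges"])
    return merged

def embed(graph):
    # Graph embeddings as ML features (placeholder vectors)
    return {node: [0.0, 0.0] for node in graph["nodes"]}

def run_pipeline(sources):
    return embed(integrate(ingest(sources)))

features = run_pipeline(["RTX-KG2", "ROBOKOP"])
```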
- Python 3.11+ with uv for dependency management
- Kedro for pipeline structure and data catalog
- PySpark for distributed data processing
- Pandera for data validation
- FastAPI for API services
- Pydantic for settings and data validation
- Joblib for caching expensive computations
- Unit Tests: Individual component testing with pytest
- Integration Tests: Full pipeline testing with Docker services
- Spark Tests: Distributed processing validation
- Given/When/Then: Test organization format for clarity
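A unit test in the Given/When/Then format might look like this sketch; the helper under test and its behavior are hypothetical, standing in for a function that would be imported from a pipeline module:

```python
# Hypothetical helper; in the real codebase this would be imported
# from one of the pipeline modules.
def normalize_node_id(raw_id: str) -> str:
    """Normalize a knowledge-graph node identifier (illustrative)."""
    return raw_id.strip().upper()

def test_normalize_node_id_strips_and_uppercases():
    # Given: a node id with stray whitespace and mixed case
    raw_id = "  chembl:123  "
    # When: the id is normalized
    result = normalize_node_id(raw_id)
    # Then: whitespace is removed and the id is uppercased
    assert result == "CHEMBL:123"

test_normalize_node_id_strips_and_uppercases()
```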
- Local: Development with Docker Compose
- Sample: Subset data for quick testing
- Test: Full test environment
- Cloud: Production GCP environment
- Use Google-style Python docstrings
- Functional programming preferred
- Cache expensive functions with joblib
- Comments explain "why" not "what" or changes between versions
- Use `terragrunt validate` for Terraform changes in the respective folders
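Putting two of these conventions together, here is a sketch of a cached function with a Google-style docstring. For brevity it uses `functools.lru_cache`; in the repository itself expensive computations are cached with joblib (e.g. `joblib.Memory`, which persists results to disk). The function name and scoring rule are illustrative only:

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # stand-in for joblib's on-disk caching
def expensive_score(drug_id: str, disease_id: str) -> float:
    """Compute an association score for a drug-disease pair (illustrative).

    Args:
        drug_id: Normalized drug identifier.
        disease_id: Normalized disease identifier.

    Returns:
        A score in [0, 1); repeated calls with the same arguments
        are served from the cache.
    """
    # Why: scoring is expensive in the real pipeline, so results are
    # cached rather than recomputed on every call.
    return (hash((drug_id, disease_id)) % 100) / 100.0

score = expensive_score("CHEMBL:25", "MONDO:0005148")
```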
Run Full Pipeline:
make integration_test # Full pipeline with services

Debug Pipeline:
make compose_up # Start services
make wipe_neo # Clear Neo4j data
uv run kedro run --env test # Run specific pipeline

Infrastructure Changes:
cd infra/deployments/hub/dev
terragrunt validate # Validate before changes
terragrunt plan # Review changes
terragrunt apply # Apply if approved

- Never push to main (you can't anyway, but you also should not try)
- Never `rm -rf` anything that is not git-versioned
- When working on a feature, create a new branch before committing
Please visit our Documentation Page for all details regarding the infrastructure, the repurposing pipeline or evaluation metrics.
We welcome and encourage all external contributions! Please see our Contributing Guide for detailed information on how to contribute to the MATRIX project.
- MATRIX disease list - Repo to manage the MATRIX disease list.
- MATRIX drug list - Repo to manage the MATRIX drug list.
Note both of these will eventually be merged into this monorepo.
Important: The "Every Cure" name, logo, and related trademarks are the exclusive property of Every Cure. Contributors and users of this open-source project are not authorized to use the Every Cure brand, logo, or trademarks in any way that suggests endorsement, affiliation, or sponsorship without explicit written permission from Every Cure.
This project is open source and available under the terms of its license, but the Every Cure brand and trademarks remain protected. Please respect these intellectual property rights.
If you are using macOS, please run `brew install cmake` (assuming you have Homebrew installed). This should fix the problem. On Windows, please download and install CMake from https://cmake.org/download/.