This pipeline reproduces the results from the paper: "Machine Learning Recovers Corrupted Pharmaceutical 3D Printing Formulation Data"
The pipeline implements Denoising Autoencoders (DAEs) to impute missing values in pharmaceutical formulation data. The dataset contains 1,570+ formulations across 382 ingredients with ~99% sparsity (zeros for unused ingredients).
- MPS Support: Automatically uses Apple Silicon GPU (MPS) for training on Mac
- Reproducible: Uses fixed random seeds (42, 50, 100) for statistical robustness
- Comprehensive: Runs 243 DAE experiments and 90 KNN experiments
- Publication-Ready: Generates plots matching the paper figures
- Masked Loss: Loss computed only on artificially corrupted values
- Multiple Baselines: Includes KNN and zero imputation for comparison
- Parallel Execution: KNN experiments run in parallel for efficiency
The Denoising Autoencoder uses an overcomplete architecture:

```
Input (382 dims)
  → FC + BatchNorm + LeakyReLU (hidden_dim)
  → FC + BatchNorm + LeakyReLU (latent_dim)
  → FC + Sigmoid (382 dims)
```
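As a rough sketch, the architecture above maps to PyTorch as follows (class name and layer grouping are illustrative; the repo's real model lives in `src/dae/model.py`):

```python
import torch
import torch.nn as nn

class DAESketch(nn.Module):
    """Minimal sketch of the overcomplete DAE described above."""
    def __init__(self, input_dim=382, neuron_size=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, neuron_size),    # FC + BatchNorm + LeakyReLU (hidden)
            nn.BatchNorm1d(neuron_size),
            nn.LeakyReLU(),
            nn.Linear(neuron_size, neuron_size),  # FC + BatchNorm + LeakyReLU (latent)
            nn.BatchNorm1d(neuron_size),
            nn.LeakyReLU(),
            nn.Linear(neuron_size, input_dim),    # FC + Sigmoid back to 382 dims
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)

model = DAESketch()
out = model(torch.rand(8, 382))  # batch of 8 noisy formulations
```

The Sigmoid output keeps reconstructions in [0, 1], matching the normalized ingredient fractions.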
Key design choices:
- Overcomplete: hidden_dim = latent_dim = neuron_size (256/512/1024)
- Denoising: Gaussian noise (σ=0.1) added during training
- Masked loss: only corrupted values contribute to the loss (not the entire reconstruction)
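The masked-loss idea can be sketched as follows (hypothetical helper; the actual loss is implemented in `src/dae/train.py`):

```python
import torch

def masked_mse(reconstruction, target, mask):
    # mask == 1 marks artificially corrupted entries; everything else,
    # including the ~99% naturally-zero entries, is ignored.
    sq_err = (reconstruction - target) ** 2
    return (sq_err * mask).sum() / mask.sum().clamp(min=1)

target = torch.tensor([[1.0, 0.0, 0.5]])
recon  = torch.tensor([[0.8, 0.3, 0.5]])
mask   = torch.tensor([[1.0, 0.0, 1.0]])  # only columns 0 and 2 were corrupted
loss = masked_mse(recon, target, mask)    # (0.2² + 0²) / 2 = 0.02
```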
Training uses a two-step corruption process:
- Masking: Randomly select values based on missingness rate (1%, 5%, 10%)
- Zeroing + Noise: Set masked values to zero, then add Gaussian noise (σ=0.1) to all values
The model learns to denoise and impute the masked values while ignoring the ~99% of naturally-occurring zeros (unused ingredients).
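The two-step corruption can be sketched in NumPy (the repo's implementation in `src/common/data_preprocessing.py` may differ in detail):

```python
import numpy as np

def corrupt(data, missingness_rate=0.01, noise_std=0.1, seed=42):
    # Step 1: randomly select a fraction of values to mask.
    rng = np.random.default_rng(seed)
    mask = rng.random(data.shape) < missingness_rate
    # Step 2: zero the masked values, then add Gaussian noise everywhere.
    corrupted = data.copy()
    corrupted[mask] = 0.0
    corrupted += rng.normal(0.0, noise_std, data.shape)
    return corrupted, mask

data = np.random.default_rng(0).random((100, 382))
corrupted, mask = corrupt(data, missingness_rate=0.05)
```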
```bash
pip install -r requirements.txt
```

Test all methods with reduced parameters (approximately 10 minutes total):

```bash
python run_all.py --quick-test
```

This runs the complete pipeline: DAE + KNN + baselines + comparisons.
Or test individual methods:

```bash
# DAE only (~5 min)
python run_dae.py --quick-test

# KNN only (~10 seconds)
python run_knn.py --quick-test

# Baseline only (~5 seconds)
python run_baseline.py --quick-test
```

Complete Pipeline - All experiments and comparisons:

```bash
python run_all.py
```

This runs all methods and generates comparisons automatically.
Individual Methods:
DAE Experiments - Reproduce all paper results (approximately 6-8 hours on Apple Silicon):
```bash
python run_dae.py
```

This runs 243 experiments:
- 3 missingness rates (1%, 5%, 10%)
- 3 learning rates (10⁻¹, 10⁻³, 10⁻⁵)
- 3 neuron sizes (256, 512, 1024)
- 4 epoch settings (100, 500, 1000, 1200)
- 3 seeds per experiment
KNN Experiments - Classical ML baseline (approximately 15-30 minutes):
```bash
python run_knn.py
```

This runs 90 experiments in parallel:
- 3 missingness rates (1%, 5%, 10%)
- 5 n-neighbors (3, 5, 10, 20, 50)
- 2 weights (uniform, distance)
- 3 metrics (euclidean, manhattan, cosine)
- 3 seeds per experiment
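For intuition, scikit-learn's stock `KNNImputer` illustrates the same idea as the repo's imputer (`src/knn/imputation.py`); note the stock imputer only supports the nan_euclidean metric, so the manhattan/cosine settings above are specific to this codebase:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data: masked entries become NaN, which KNNImputer fills from
# the k nearest rows (distance-weighted here).
rng = np.random.default_rng(42)
data = rng.random((50, 10))

corrupted = data.copy()
mask = rng.random(data.shape) < 0.05  # 5% missingness
corrupted[mask] = np.nan

imputer = KNNImputer(n_neighbors=5, weights="distance")
imputed = imputer.fit_transform(corrupted)
```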
Baseline Experiments - Naive zero imputation (approximately 1 minute):
```bash
python run_baseline.py
```

After running experiments, generate comparison plots:

```bash
python run_comparison.py
```

This creates:
- R² comparison bar chart (DAE vs KNN vs Zero)
- Performance vs. time scatter plots
- Detailed comparison table
Run specific configurations:
DAE:

```bash
# Test only 1% missingness
python run_dae.py --missingness 0.01

# Test specific learning rates
python run_dae.py --learning-rates 0.001 0.00001

# Test specific neuron sizes and epochs
python run_dae.py --neuron-sizes 512 1024 --epochs 500 1000

# Use different seeds
python run_dae.py --seeds 1 2 3 4 5
```

KNN:

```bash
# Test specific K values
python run_knn.py --k-neighbors 5 10 20

# Test specific weights and metrics
python run_knn.py --weights uniform distance --metrics euclidean

# Test specific missingness rates
python run_knn.py --missingness 0.01 0.05

# Run sequentially (not parallel)
python run_knn.py --no-parallel
```

Baselines:

```bash
# Test specific missingness rates
python run_baseline.py --missingness 0.01 0.05 0.10

# Use different seeds
python run_baseline.py --seeds 42 50 100
```

Skip already-completed experiments:
```bash
# Run all but skip DAE (if already completed)
python run_all.py --skip-dae

# Skip multiple methods
python run_all.py --skip-knn --skip-baseline
```

Generate comparisons only:

```bash
python run_all.py --comparison-only
```

Generate plots only (skip training):

```bash
python run_dae.py --skip-training
```

```
DAE/
├── data/
│   └── material_name_smilesRemoved.csv  # Formulation dataset
├── src/
│   ├── config.py                        # Configuration classes
│   ├── common/                          # Shared utilities
│   │   ├── data_preprocessing.py        # Data loading, normalization, corruption
│   │   └── visualization.py             # Common plotting utilities
│   ├── dae/                             # Denoising Autoencoder
│   │   ├── model.py                     # DAE architecture & device selection
│   │   ├── train.py                     # Training loop with masked loss
│   │   ├── evaluate.py                  # Metrics computation
│   │   ├── plots.py                     # DAE-specific visualizations
│   │   └── experiments.py               # DAE experiment orchestration
│   ├── knn/                             # K-Nearest Neighbors baseline
│   │   ├── imputation.py                # KNN imputer
│   │   ├── evaluate.py                  # KNN metrics & timing
│   │   ├── plots.py                     # KNN visualizations
│   │   └── experiments.py               # KNN experiment orchestration
│   ├── baselines/                       # Additional baseline methods
│   │   ├── zero_imputer.py              # Naive zero-filling baseline
│   │   ├── evaluate.py                  # Baseline metrics
│   │   ├── plots.py                     # Baseline visualizations
│   │   └── experiments.py               # Baseline experiment orchestration
│   └── comparison/                      # Cross-method comparisons
│       └── plots.py                     # DAE vs KNN vs Zero comparisons
├── results/
│   ├── dae/                             # DAE outputs
│   │   ├── models/                      # Trained model checkpoints
│   │   ├── metrics/                     # Metrics & predictions
│   │   ├── plots/                       # DAE figures
│   │   └── summary.json                 # All DAE results
│   ├── knn/                             # KNN outputs
│   │   ├── metrics/                     # KNN metrics
│   │   ├── predictions/                 # KNN predictions
│   │   ├── plots/                       # KNN figures
│   │   └── summary.json                 # All KNN results
│   ├── baselines/                       # Baseline outputs
│   │   ├── metrics/
│   │   ├── predictions/
│   │   ├── plots/
│   │   └── summary.json
│   └── comparisons/                     # Method comparison outputs
│       ├── method_comparison.png
│       ├── performance_vs_time.png
│       └── comparison_table.txt
├── requirements.txt                     # Python dependencies
├── run_all.py                           # Master orchestrator (all experiments)
├── run_dae.py                           # DAE experiments entry point
├── run_knn.py                           # KNN experiments entry point
├── run_baseline.py                      # Baseline experiments entry point
├── run_comparison.py                    # Comparison generation entry point
├── CLAUDE.md                            # Developer documentation
└── README.md                            # This file
```
Saved to results/dae/models/:
- Filename format: `miss{rate}_lr{lr}_n{neurons}_ep{epochs}_seed{seed}.pt`
- Example: `miss0.01_lr0.001_n512_ep1000_seed42.pt`

Saved to results/dae/metrics/:
- Loss histories: `*_loss.json`
- Aggregated metrics: `*_metrics.json`
- Predictions: `*_predictions.npz`

Saved to results/knn/:
- Metrics: `metrics/miss{rate}_k{neighbors}_{weights}_{metric}_metrics.json`
- Predictions: `predictions/miss{rate}_k{neighbors}_{weights}_{metric}_predictions.npz`
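Because the configuration is encoded in the filename, results can be indexed without loading the checkpoints; a hypothetical parser:

```python
import re

def parse_checkpoint(name):
    # Recovers the configuration from the checkpoint filename pattern
    # shown above (helper name is illustrative, not in the repo).
    pattern = r"miss([\d.]+)_lr([\d.e-]+)_n(\d+)_ep(\d+)_seed(\d+)\.pt"
    rate, lr, neurons, epochs, seed = re.match(pattern, name).groups()
    return float(rate), float(lr), int(neurons), int(epochs), int(seed)

cfg = parse_checkpoint("miss0.01_lr0.001_n512_ep1000_seed42.pt")
# cfg == (0.01, 0.001, 512, 1000, 42)
```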
DAE plots saved to results/dae/plots/:
- Loss curves: `loss_curves_{pct}pct.png`
- R² bar plots: `r2_bars_{pct}pct.png`
- Predicted vs. truth: `pred_vs_truth_1pct.png`

KNN plots saved to results/knn/plots/:
- K-neighbor comparison: `k_comparison_{pct}pct.png`
- Weights comparison: `weights_comparison_{pct}pct.png`

Method comparisons saved to results/comparisons/:
- Bar chart: `method_comparison.png`
- Performance vs. time: `performance_vs_time.png`
- Comparison table: `comparison_table.txt`
The paper reports these R² scores for DAE imputation:
| Missing Data | Best R² (mean ± std) | Configuration |
|---|---|---|
| 1% | 0.94 ± 0.03 | 1024 neurons, 1200 epochs, LR=10⁻³ |
| 5% | 0.48 ± 0.09 | 256 neurons, fewer epochs, LR=10⁻³ |
| 10% | 0.37 ± 0.05 | 256 neurons, fewer epochs, LR=10⁻³ |
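R² is computed on the artificially masked entries only; a small sketch with scikit-learn's `r2_score` also shows how a poor imputer lands below zero (the repo's metric computation lives in `src/dae/evaluate.py` and may differ in detail):

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)
truth = rng.random(500)                   # true values at the masked positions
good = truth + rng.normal(0, 0.02, 500)   # accurate imputations
bad = np.zeros(500)                       # naive zero imputation

r2_good = r2_score(truth, good)  # close to 1
r2_bad = r2_score(truth, bad)    # negative: worse than predicting the mean
```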
- Learning rate has the strongest effect on DAE performance
  - 10⁻³ performs best across all missingness levels
  - 10⁻¹ and 10⁻⁵ produce negative R² scores (worse than baseline)
- Larger models work better for low missingness (1%)
  - 1024 neurons optimal for 1% missing
- Smaller models work better for high missingness (5-10%)
  - 256 neurons optimal for higher missingness
  - Suggests smaller models generalize better with less signal
- Method comparison
  - Run both DAE and KNN experiments to compare the neural network against classical ML
  - Use `python run_comparison.py` to create a side-by-side performance analysis
- CPU: Any modern CPU (2+ cores recommended)
- GPU: Optional but recommended
- Apple Silicon (M1/M2/M3): Automatically uses MPS
- NVIDIA: Automatically uses CUDA if available
- RAM: 8GB minimum, 16GB recommended
- Storage: ~500MB for models and results
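The automatic device selection described above can be sketched as follows (the repo's actual helper is `get_device()` in `src/dae/model.py`):

```python
import torch

def pick_device():
    # Prefer Apple Silicon's MPS backend, then CUDA, then fall back to CPU.
    # (Illustrative function name; the repo's helper may differ.)
    if torch.backends.mps.is_available():
        return torch.device("mps")   # Apple Silicon GPU
    if torch.cuda.is_available():
        return torch.device("cuda")  # NVIDIA GPU
    return torch.device("cpu")

device = pick_device()
```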
On Apple M1 Pro:
- Quick test (`run_all.py --quick-test`): ~10 minutes
- DAE only (`run_dae.py`): ~6-8 hours (243 experiments)
- KNN only (`run_knn.py`): ~15-30 minutes (90 experiments)
- Baselines only (`run_baseline.py`): ~1 minute
- Full pipeline (`run_all.py`): ~7-9 hours total
On CPU only:
- Quick test: ~20-30 minutes
- DAE only: ~24-48 hours
- KNN only: ~1-2 hours
- Baselines only: ~5 minutes
- Full pipeline: ~25-50 hours total
The modular structure allows independent use of each component:
Data preprocessing:

```python
from src.common.data_preprocessing import FormulationDataPreprocessor

preprocessor = FormulationDataPreprocessor(
    data_path='data/material_name_smilesRemoved.csv',
    metadata_cols=6
)
preprocessor.load_data()
preprocessor.normalize_data()

# Corrupt 1% of data
original, corrupted, mask = preprocessor.prepare_data(
    missingness_rate=0.01,
    noise_std=0.1,
    seed=42
)
```

Model creation:

```python
from src.dae.model import create_dae, get_device

device = get_device()
model = create_dae(input_dim=382, neuron_size=512, device=device)
```

Training:

```python
from src.dae.train import train_dae

trained_model, loss_history = train_dae(
    model=model,
    original_data=original,
    corrupted_data=corrupted,
    mask=mask,
    device=device,
    learning_rate=1e-3,
    num_epochs=1000
)
```

Evaluation:

```python
from src.dae.evaluate import evaluate_dae

predictions, metrics = evaluate_dae(
    model=trained_model,
    original_data=original,
    corrupted_data=corrupted,
    mask=mask,
    device=device,
    noise_std=0.1
)

print(f"R²: {metrics['r2']:.4f}")
print(f"RMSE: {metrics['rmse']:.4f}")
```

KNN imputation:

```python
from src.knn.imputation import create_knn_imputer

knn_imputer = create_knn_imputer(
    n_neighbors=5,
    weights='distance',
    metric='euclidean'
)

# Convert to numpy
original_np = original.numpy()
corrupted_np = corrupted.numpy()
mask_np = mask.numpy()

# Impute
imputed_data, metrics = knn_imputer.impute(original_np, corrupted_np, mask_np)
```

Comparison generation:

```python
from src.comparison.plots import generate_all_comparisons

generate_all_comparisons(
    dae_results_dir='results/dae',
    knn_results_dir='results/knn',
    zero_results_dir='results/baselines',
    output_dir='results/comparisons'
)
```

If you get "MPS not available" on Mac:
- Update to macOS 12.3+ and PyTorch 2.0+
- Check: `python -c "import torch; print(torch.backends.mps.is_available())"`
If training crashes with out-of-memory (OOM) errors:
- Reduce neuron sizes: `--neuron-sizes 256`
- Train on CPU (slower): edit `src/dae/model.py` to force CPU
Ensure `data/material_name_smilesRemoved.csv` exists:

```bash
ls data/material_name_smilesRemoved.csv
```

- Queen Mary University of London
- UCL School of Pharmacy