Batch-effect correction methods for MALDI-TOF spectra in clinical AMR prediction workflows
MaldiBatchKit is part of the MaldiSuite ecosystem and complements MaldiAMRKit: where MaldiAMRKit handles preprocessing, alignment and AMR-aware evaluation, MaldiBatchKit focuses on the harmonisation step, removing the inter-batch / inter-site shifts that plague multi-centre MALDI-TOF studies.
pip install maldibatchkit
Optional extras:
pip install maldibatchkit[viz] # UMAP plots, seaborn
pip install maldibatchkit[dev] # testing + linting
pip install maldibatchkit[docs] # sphinx
maldiamrkit is a core dependency - installing MaldiBatchKit pulls
it in automatically. BatchAwareWarping reuses
maldiamrkit.alignment.Warping under the hood, and the
MaldiSetAdapter bridges to maldiamrkit.MaldiSet for end-to-end
AMR workflows.
To install MaldiBatchKit together with MaldiAMRKit and MaldiDeepKit at compatible versions, install the maldisuite meta-package:
pip install maldisuite
Visit the MaldiSuite landing page at https://ettorerocchi.github.io/MaldiSuite/.
- Unified sklearn API (BaseEstimator + TransformerMixin) for every correction method. batch and covariates are passed at construction time and aligned to X.index at fit / transform, so the same object works inside Pipeline / cross-validation without data leakage.
- ComBat variants (Johnson 2007, Fortin 2018, Chen 2022 CovBat) re-exported from combatlearn.
- Limma removeBatchEffect (Ritchie et al. 2015).
- Harmony (Korsunsky et al. 2019) via harmonypy, with a mandatory, frozen PCA preprocessing stage so it behaves sensibly on high-dimensional MALDI-TOF intensity matrices (tune with the n_components= argument).
- Simple baselines: median centering, z-score per batch, reference scaling.
- MALDI-specific corrections:
  - BatchAwareWarping - per-batch m/z warping sharing a global reference (wraps maldiamrkit.alignment.Warping).
  - QualityWeightedComBat - weighted empirical-Bayes ComBat variant where low-SNR spectra contribute less to the shrinkage prior.
  - SpeciesAwareComBat - convenience preset for ComBat-Fortin with species as the protected biological covariate.
- Diagnostics: kBET, LISI, silhouette-by-batch, per-batch peak drift, per-batch TIC coefficient of variation, per-batch spectrum count, plus a combined diagnostic_report DataFrame summary.
- Visualization: UMAP before/after, per-batch peak-shape overlays, before/after bar charts.
- Integration adapter: MaldiSetAdapter turns a maldiamrkit.MaldiSet into a corrected MaldiSet in one call.
- CLI: maldibatchkit correct ... and maldibatchkit diagnose ....
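As a concrete picture of the simple baselines listed above, here is a minimal NumPy sketch of per-batch median centering. It illustrates the general technique only and is not MaldiBatchKit's MedianCentering implementation: the function name median_center_sketch is hypothetical, and re-centering every batch on the global median (rather than on zero) is an assumption made for this example.

```python
import numpy as np

def median_center_sketch(X, batch):
    """Shift each batch so its per-feature median equals the global median."""
    X = np.asarray(X, dtype=float)
    batch = np.asarray(batch)
    out = X.copy()
    global_median = np.median(X, axis=0)
    for level in np.unique(batch):
        mask = batch == level
        # Per-feature offset that moves this batch's median onto the global one.
        out[mask] += global_median - np.median(X[mask], axis=0)
    return out

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
X[3:] += 5.0                      # simulate an additive shift in batch "B"
batch = np.array(["A", "A", "A", "B", "B", "B"])
X_c = median_center_sketch(X, batch)
```

After the call, both batches share the same per-feature median. A train/test-safe version would learn the medians in fit and reuse them in transform, which is what the sklearn-style API above is for.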
from maldibatchkit import ComBat, QualityWeightedComBat, SpeciesAwareComBat
from maldibatchkit.diagnostics import diagnostic_report
# X: (n_samples, n_bins) DataFrame; batch & species indexed by X.index
corrector = SpeciesAwareComBat(batch=batch, species=species)
X_corrected = corrector.fit_transform(X)
report = diagnostic_report(X, X_corrected, batch)
print(report)
Train/test without leakage:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=batch)
corrector = ComBat(batch=batch, method="fortin", discrete_covariates=species)
corrector.fit(X_train) # learns on train only
X_train_c = corrector.transform(X_train)
X_test_c = corrector.transform(X_test) # same parameters applied to test
batch is indexed by the same sample IDs that X uses, so the
corrector picks the right subset on each call.
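The index-based lookup described above can be pictured with plain pandas. This is a sketch of the idea only: align_batch is a hypothetical helper, and it does not reproduce how the package's base class does the alignment internally.

```python
import pandas as pd

def align_batch(batch, X):
    """Return the batch labels for exactly the rows of X, in X's row order."""
    missing = X.index.difference(batch.index)
    if len(missing) > 0:
        raise KeyError(f"no batch label for samples: {list(missing)}")
    return batch.loc[X.index]

X = pd.DataFrame(
    {"bin_2000": [1.0, 2.0, 3.0, 4.0]},
    index=["s1", "s2", "s3", "s4"],
)
batch = pd.Series(["siteA", "siteA", "siteB", "siteB"], index=X.index)

# A shuffled subset, as a cross-validation split might produce.
X_test = X.loc[["s4", "s1"]]
batch_test = align_batch(batch, X_test)
print(batch_test.tolist())  # ['siteB', 'siteA']
```

Because the labels follow the sample IDs rather than row positions, the full batch Series can be handed over once and any split of X still receives the matching labels.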
from maldiamrkit import MaldiSet
from maldibatchkit.integrations import MaldiSetAdapter
from maldibatchkit import SpeciesAwareComBat
ds = MaldiSet.from_directory(...)
adapter = MaldiSetAdapter(
    batch_column="Batch",
    species_column="Species",
    quality_column="SNR",
)
corrected_ds = adapter.correct(ds, SpeciesAwareComBat)
corrected_ds.X # harmonised feature matrix
corrected_ds.y # AMR labels, unchanged
The CLI is organised as maldibatchkit correct <method> +
maldibatchkit diagnose. Every method has its own subcommand with
only the flags it actually uses:
# Vanilla Johnson ComBat
maldibatchkit correct combat \
-i X.csv --batch-csv batch.csv -o X_corrected.csv
# Fortin ComBat with a species covariate
maldibatchkit correct combat-fortin \
-i X.csv --batch-csv batch.csv \
--discrete-covariates-csv species.csv \
-o X_corrected.csv
# Species-aware preset (shortcut for the above)
maldibatchkit correct species-combat \
-i X.csv --batch-csv batch.csv --species-csv species.csv \
-o X_corrected.csv
# Quality-weighted ComBat
maldibatchkit correct quality-combat \
-i X.csv --batch-csv batch.csv --quality-csv snr.csv \
-o X_corrected.csv
# Diagnostic report
maldibatchkit diagnose \
-i X.csv --corrected X_corrected.csv \
--batch-csv batch.csv --mz-csv mz.csv -o report.csv
NPZ inputs bundle X, index, columns, and batch labels in one file, so the same commands work without sidecar CSVs:
maldibatchkit correct combat-fortin \
-i maldiset.npz \
--discrete-covariates-csv species.csv \
-o corrected.npz
Run maldibatchkit correct <method> --help for the full flag list of
any corrector. combat-fortin / combat-chen refuse to run without
covariates (they would silently reduce to Johnson ComBat);
species-combat / quality-combat require their dedicated
--species-csv / --quality-csv inputs.
| Method | Class | Protects covariates? | Train/test safe? |
|---|---|---|---|
| ComBat (Johnson, Fortin, Chen) | ComBat | Fortin / Chen | yes |
| Limma | Limma | via design= | yes |
| Harmony | Harmony | via covariates= | yes |
| Median centering | MedianCentering | no | yes |
| Z-score per batch | ZScorePerBatch | no | yes |
| Reference scaling | ReferenceScaling | no | yes |
| Batch-aware warping | BatchAwareWarping | no | yes |
| Quality-weighted ComBat | QualityWeightedComBat | no | yes |
| Species-aware ComBat | SpeciesAwareComBat | species | yes |
See the QualityWeightedComBat docstring for the mathematical
formulation of the weighted empirical-Bayes update.
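To illustrate what the "Protects covariates?" column means in practice, the sketch below follows the general idea behind limma's removeBatchEffect: fit one linear model containing both the protected design and the batch terms, then subtract only the batch part. It is a NumPy illustration under stated assumptions, not the package's Limma class; the function name and the noise-free toy data are hypothetical.

```python
import numpy as np

def remove_batch_effect_sketch(X, batch, design):
    """Regress out batch while keeping the effects of the columns in `design`."""
    X = np.asarray(X, dtype=float)
    levels, codes = np.unique(batch, return_inverse=True)
    k = len(levels)
    # Sum-to-zero coding: k-1 columns, last level coded as -1 in every column.
    B = np.zeros((X.shape[0], k - 1))
    for j in range(k - 1):
        B[codes == j, j] = 1.0
    B[codes == k - 1, :] = -1.0
    # Intercept plus the protected covariates.
    D = np.column_stack([np.ones(X.shape[0]), design])
    coef, *_ = np.linalg.lstsq(np.column_stack([D, B]), X, rcond=None)
    gamma = coef[D.shape[1]:]          # batch coefficients only
    return X - B @ gamma

# Toy data: species adds +3, batch "b" adds +5, two identical features.
species = np.array([0, 1, 0, 1, 0, 1, 0, 1], dtype=float)
batch = np.array(["a"] * 4 + ["b"] * 4)
offset = np.where(batch == "a", 0.0, 5.0)
X = (10.0 + 3.0 * species + offset)[:, None] * np.ones((1, 2))

X_c = remove_batch_effect_sketch(X, batch, design=species)
```

In this toy case the two batches become identical after correction while the species difference of 3 is preserved. Correcting without the design would attribute part of any species imbalance between batches to the batch term and remove it.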
Every corrector in this package inherits from BaseBatchCorrector,
which is re-exported at the top level. Subclass it, implement
_fit_impl and _transform_impl, and you get a scikit-learn compatible,
train/test-safe corrector for free - the base class handles index
alignment between X and the stored batch labels, NaN / finite
checks, DataFrame-vs-ndarray round-tripping, and the feature_names_in_
/ n_features_in_ / get_feature_names_out sklearn bookkeeping.
Minimal custom corrector:
import pandas as pd
from maldibatchkit import BaseBatchCorrector
class MeanCentering(BaseBatchCorrector):
    """Subtract per-batch means from each feature."""

    def _fit_impl(self, X_df, batch):
        # Store whatever you learn as ``..._`` attributes so
        # ``sklearn.utils.validation.check_is_fitted`` picks them up.
        self.batch_means_ = X_df.groupby(batch).mean()
        self.grand_mean_ = X_df.mean(axis=0)

    def _transform_impl(self, X_df, batch):
        out = X_df.copy().astype(float)
        known = set(self.batch_means_.index)
        for lvl in pd.unique(batch):
            mask = batch == lvl
            offset = (
                self.batch_means_.loc[lvl].to_numpy()
                if lvl in known
                else self.grand_mean_.to_numpy()  # unseen-batch fallback
            )
            out.loc[mask] = out.loc[mask].to_numpy() - offset
        return out
Drop it straight into a pipeline:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
pipe = Pipeline([
    ("mean", MeanCentering(batch=batch)),
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier()),
])
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test) # no leakage: transform, never refit
Conventions (see CONTRIBUTING.md):
- NumPy-style docstring on every public class.
- Fitted attributes end in _ (self.batch_means_, not self.means).
- transform must not change fitted state - no side effects outside fit.
- Raise a clear ImportError (not a bare ModuleNotFoundError) when an optional dependency is missing; see Harmony._require_harmonypy for the reference pattern.
Look at maldibatchkit/corrections/baselines.py for the simplest
end-to-end references (MedianCentering, ZScorePerBatch,
ReferenceScaling), or at quality_weighted.py for a corrector with
an iterative fit.
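The optional-dependency convention above might look roughly like the following. This is a generic sketch, not the body of Harmony._require_harmonypy: the helper name require_optional and the wording of the message are assumptions, and the viz extra is used only because it is one of the extras listed in the installation notes.

```python
import importlib

def require_optional(module_name, extra):
    """Import an optional dependency or fail with an actionable message."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        # Re-raise as a plain ImportError that tells the user what to install.
        raise ImportError(
            f"{module_name!r} is required for this feature. "
            f"Install it with: pip install maldibatchkit[{extra}]"
        ) from exc

# A standard-library module stands in for a real optional dependency
# so that the sketch runs anywhere.
json_module = require_optional("json", extra="viz")
```

Calling the helper lazily, inside the method that needs the dependency, keeps the top-level import of the package cheap and free of hard requirements.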
from maldibatchkit.diagnostics import (
    silhouette_batch, kbet, lisi,
    peak_position_drift, tic_cov_per_batch, per_batch_spectrum_count,
    diagnostic_report,
)
All metrics take the same (X, batch) signature. diagnostic_report
composes them into a tidy DataFrame suitable for
plot_diagnostic_summary.
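To give a feel for what an (X, batch) metric computes, here is a minimal sketch of a per-batch TIC coefficient of variation, taking the total ion current of a spectrum to be the sum of its binned intensities. The function name tic_cv_sketch, the use of the sample standard deviation (ddof=1), and the dict return type are assumptions for illustration; the package's tic_cov_per_batch may define these details differently.

```python
import numpy as np

def tic_cv_sketch(X, batch):
    """Coefficient of variation of the per-spectrum intensity sum, per batch."""
    X = np.asarray(X, dtype=float)
    batch = np.asarray(batch)
    tic = X.sum(axis=1)                      # one total per spectrum
    result = {}
    for level in np.unique(batch):
        values = tic[batch == level]
        result[str(level)] = float(values.std(ddof=1) / values.mean())
    return result

X = np.array([
    [1.0, 1.0], [2.0, 2.0], [3.0, 3.0],          # batch A: TIC 2, 4, 6
    [10.0, 10.0], [10.0, 10.0], [10.0, 10.0],    # batch B: TIC 20, 20, 20
])
batch = np.array(["A", "A", "A", "B", "B", "B"])
cv = tic_cv_sketch(X, batch)
print(cv)  # {'A': 0.5, 'B': 0.0}
```

A batch with a markedly higher value than the others points to unstable acquisition or normalisation within that batch, which is worth checking before and after correction.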
MaldiBatchKit is the harmonisation package of the MaldiSuite ecosystem:
- MaldiAMRKit - data model (MaldiSpectrum, MaldiSet), preprocessing, alignment, peak detection, differential analysis, and AMR-aware evaluation.
- MaldiBatchKit (this package) - batch-effect correction and harmonisation for multi-centre / multi-instrument MALDI-TOF spectra.
- MaldiDeepKit - sklearn-compatible deep learning classifiers (MLP, CNN, ResNet, Transformer).
The three packages share the MaldiSet / MaldiSpectrum data model and are designed to compose in a single end-to-end pipeline. Install the full suite with pip install maldisuite. Landing page: https://ettorerocchi.github.io/MaldiSuite/.
If you use MaldiBatchKit in academic work, please cite:
Citation will be available soon.
along with the upstream references for whichever methods you apply (Johnson 2007, Fortin 2018, Chen 2022, Ritchie 2015, Korsunsky 2019).
MIT. See LICENSE.
