# GPU-accelerated Incremental PCA for PyTorch
A PyTorch implementation of Incremental PCA adapted from scikit-learn, with full GPU support for fitting and transforming large datasets that don't fit in memory.
- GPU-accelerated: All operations run on CUDA when available.
- Incremental fitting: Process data in batches with constant memory usage.
- scikit-learn compatible API: Drop-in replacement for `sklearn.decomposition.IncrementalPCA`.
- Save/load support: Persist fitted models with `save()` and `load()`.
Installation:

```shell
pip install git+https://github.com/ndgigliotti/torch-ipca.git
```

```python
import torch
from torch_ipca import IncrementalPCA

# Create some data
X = torch.randn(10000, 768, device="cuda")

# Fit incrementally
ipca = IncrementalPCA(n_components=128, device="cuda")
for batch in X.split(1000):
    ipca.partial_fit(batch)

# Transform
X_reduced = ipca.transform(X)  # Shape: (10000, 128)
```

Or fit and transform in one call:

```python
ipca = IncrementalPCA(n_components=128, device="cuda")
X_reduced = ipca.fit_transform(X)
```

```python
# Save fitted model
ipca.save("pca_model.pt")

# Load later
ipca = IncrementalPCA.load("pca_model.pt", device="cuda")
X_reduced = ipca.transform(new_data)
```

Parameters:

- `n_components`: Number of components to keep. If `None`, keeps `min(n_samples, n_features)`.
- `whiten`: If `True`, whitens the output to have unit variance.
- `device`: PyTorch device (`"cuda"` or `"cpu"`).
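To illustrate what whitening does, here is a hedged NumPy sketch (not this library's code): PCA scores are divided by the square root of each component's explained variance, so every output dimension ends up with unit variance.

```python
import numpy as np

rng = np.random.default_rng(0)
# Features with very different scales
X = rng.normal(size=(1000, 5)) * np.array([10.0, 5.0, 2.0, 1.0, 0.5])

# PCA via SVD of the centered data
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained_variance = S**2 / (len(X) - 1)

Z = Xc @ Vt.T                          # unwhitened scores: per-axis variance varies widely
Zw = Z / np.sqrt(explained_variance)   # whitened scores: per-axis variance is ~1.0
```

Whitening is useful when downstream models (e.g. k-means, linear probes) are sensitive to feature scale.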
Methods:

- `fit(X)`: Fit the model with X using minibatches.
- `partial_fit(X)`: Incremental fit on a batch X.
- `transform(X)`: Apply dimensionality reduction to X.
- `inverse_transform(X)`: Transform reduced data back to the original space.
- `fit_transform(X)`: Fit and transform in one call.
- `save(path)`: Save the fitted model to a file.
- `load(path, device)`: Load a fitted model from a file (classmethod).
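For intuition, `transform` and `inverse_transform` are conceptually a projection onto the components and a projection back, plus mean handling. A NumPy sketch under that assumption (not this library's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 8))

mean = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
components = Vt[:4]                 # keep 4 of 8 principal axes

Z = (X - mean) @ components.T       # transform: shape (500, 4)
X_hat = Z @ components + mean       # inverse_transform: shape (500, 8)

# Reconstruction is lossy because half the components were dropped
rel_err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
```

The reconstruction error shrinks as `n_components` approaches the number of features, and vanishes when all components are kept.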
Attributes (after fitting):

- `components_`: Principal axes, shape `(n_components, n_features)`.
- `explained_variance_`: Variance explained by each component.
- `explained_variance_ratio_`: Percentage of variance explained by each component.
- `mean_`: Per-feature mean.
- `n_samples_seen_`: Number of samples processed.
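The variance attributes follow the standard PCA definitions; a NumPy sketch of how they relate (assumed from those definitions, not taken from this library's source):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 6))

Xc = X - X.mean(axis=0)
S = np.linalg.svd(Xc, compute_uv=False)        # singular values, descending

explained_variance = S**2 / (len(X) - 1)       # one value per component
explained_variance_ratio = explained_variance / explained_variance.sum()
```

A common use: keep enough components for `explained_variance_ratio_.cumsum()` to cross a target such as 0.95.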
- Large datasets: When data doesn't fit in GPU memory, use `partial_fit()` to process it in batches.
- Streaming data: Continuously update the PCA as new data arrives.
- Non-MRL models: For embedding models without Matryoshka training, PCA provides dimensionality reduction. For models trained with Matryoshka Representation Learning (nomic-embed, jina-v3, OpenAI v3), simple truncation of the embedding is preferred over PCA.
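One ingredient of incremental fitting is merging running statistics batch by batch, so memory stays constant no matter how many samples stream in. A minimal NumPy sketch of the running-mean update (the component update additionally runs an SVD per batch, omitted here):

```python
import numpy as np

rng = np.random.default_rng(3)
stream = np.array_split(rng.normal(size=(10_000, 16)), 10)  # 10 incoming batches

n_seen = 0
mean = np.zeros(16)
for batch in stream:
    m = len(batch)
    # Weighted merge of the old mean with this batch's mean;
    # only the (n_features,) mean and a sample count are stored
    mean = (n_seen * mean + m * batch.mean(axis=0)) / (n_seen + m)
    n_seen += m
```

The merged result matches the mean computed over all samples at once, which is what makes batch order and batch size irrelevant to the final statistics.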
Apache 2.0. Portions derived from scikit-learn (BSD 3-Clause License).