# GPU-accelerated Incremental PCA for PyTorch
A PyTorch implementation of Incremental PCA adapted from scikit-learn, with full GPU support for fitting and transforming large datasets that don't fit in memory.
- GPU-accelerated: All operations run on CUDA when available.
- Incremental fitting: Process data in batches with constant memory usage.
- scikit-learn compatible API: Drop-in replacement for `sklearn.decomposition.IncrementalPCA`.
- Save/load support: Persist fitted models with `save()` and `load()`.
Installation:

```shell
pip install git+https://github.com/ndgigliotti/torch-ipca.git
```

```python
import torch
from torch_ipca import IncrementalPCA

# Create some data
X = torch.randn(10000, 768, device="cuda")

# Fit incrementally
ipca = IncrementalPCA(n_components=128, device="cuda")
for batch in X.split(1000):
    ipca.partial_fit(batch)

# Transform
X_reduced = ipca.transform(X)  # Shape: (10000, 128)
```

Or fit and transform in one call:

```python
ipca = IncrementalPCA(n_components=128, device="cuda")
X_reduced = ipca.fit_transform(X)
```

```python
# Save fitted model
ipca.save("pca_model.pt")

# Load later
ipca = IncrementalPCA.load("pca_model.pt", device="cuda")
X_reduced = ipca.transform(new_data)
```

Parameters:

- `n_components`: Number of components to keep. If `None`, keeps `min(n_samples, n_features)`.
- `whiten`: If `True`, whitens the output to have unit variance.
- `device`: PyTorch device (`"cuda"` or `"cpu"`).
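To illustrate what whitening does, here is a hedged NumPy sketch (not this library's code): PCA scores are divided by the square root of each component's explained variance, so every output dimension ends up with unit variance.

```python
import numpy as np

rng = np.random.default_rng(0)
# Features with very different scales
X = rng.normal(size=(1000, 5)) * np.array([10.0, 5.0, 2.0, 1.0, 0.5])

# PCA via SVD of the centered data
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained_variance = S**2 / (len(X) - 1)

Z = Xc @ Vt.T                          # unwhitened scores: per-axis variance varies widely
Zw = Z / np.sqrt(explained_variance)   # whitened scores: per-axis variance is ~1.0
```

Whitening is useful when downstream models (e.g. k-means, linear probes) are sensitive to feature scale.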
Methods:

- `fit(X)`: Fit the model with X using minibatches.
- `partial_fit(X)`: Incremental fit on a batch X.
- `transform(X)`: Apply dimensionality reduction to X.
- `inverse_transform(X)`: Transform reduced data back to the original space.
- `fit_transform(X)`: Fit and transform in one call.
- `save(path)`: Save the fitted model to a file.
- `load(path, device)`: Load a fitted model from a file (classmethod).
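For intuition, `transform` and `inverse_transform` are conceptually a projection onto the components and a projection back, plus mean handling. A NumPy sketch under that assumption (not this library's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 8))

mean = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
components = Vt[:4]                 # keep 4 of 8 principal axes

Z = (X - mean) @ components.T       # transform: shape (500, 4)
X_hat = Z @ components + mean       # inverse_transform: shape (500, 8)

# Reconstruction is lossy because half the components were dropped
rel_err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
```

The reconstruction error shrinks as `n_components` approaches the number of features, and vanishes when all components are kept.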
Attributes (after fitting):

- `components_`: Principal axes, shape `(n_components, n_features)`.
- `explained_variance_`: Variance explained by each component.
- `explained_variance_ratio_`: Percentage of variance explained by each component.
- `mean_`: Per-feature mean.
- `n_samples_seen_`: Number of samples processed.
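The variance attributes follow the standard PCA definitions; a NumPy sketch of how they relate (assumed from those definitions, not taken from this library's source):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 6))

Xc = X - X.mean(axis=0)
S = np.linalg.svd(Xc, compute_uv=False)        # singular values, descending

explained_variance = S**2 / (len(X) - 1)       # one value per component
explained_variance_ratio = explained_variance / explained_variance.sum()
```

A common use: keep enough components for `explained_variance_ratio_.cumsum()` to cross a target such as 0.95.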
- Large datasets: When data doesn't fit in GPU memory, use `partial_fit()` to process it in batches.
- Streaming data: Continuously update the PCA as new data arrives.
- Non-MRL models: For embedding models without Matryoshka training, PCA provides dimensionality reduction. For models trained with Matryoshka Representation Learning (nomic-embed, jina-v3, OpenAI v3), simple truncation of the embedding is preferred over PCA.
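One ingredient of incremental fitting is merging running statistics batch by batch, so memory stays constant no matter how many samples stream in. A minimal NumPy sketch of the running-mean update (the component update additionally runs an SVD per batch, omitted here):

```python
import numpy as np

rng = np.random.default_rng(3)
stream = np.array_split(rng.normal(size=(10_000, 16)), 10)  # 10 incoming batches

n_seen = 0
mean = np.zeros(16)
for batch in stream:
    m = len(batch)
    # Weighted merge of the old mean with this batch's mean;
    # only the (n_features,) mean and a sample count are stored
    mean = (n_seen * mean + m * batch.mean(axis=0)) / (n_seen + m)
    n_seen += m
```

The merged result matches the mean computed over all samples at once, which is what makes batch order and batch size irrelevant to the final statistics.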
Apache 2.0. Portions derived from scikit-learn (BSD 3-Clause License).