Adding symmetry-compatible / equivariant optimizers for general matrix-valued parameters

**Is your feature request related to a problem? Please describe.**
I would like to request support for symmetry-compatible / equivariant optimizers for general matrix-valued parameters, proposed in [Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers](https://arxiv.org/abs/2605.18106).

Many modern neural-network parameters are naturally matrices rather than unstructured vectors: attention projections, MLP projections, embeddings, LM heads, and MoE routers. Standard coordinate-wise optimizers such as AdamW treat these matrices as collections of independent scalar coordinates, which can break natural symmetries such as left/right orthogonal equivariance, vocabulary permutation equivariance, or expert permutation equivariance.
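
To make the broken-symmetry claim concrete, here is a minimal NumPy sketch (not taken from the paper or the reference implementation). It assumes `np.sign(G)` is a fair stand-in for a coordinate-wise rule, since the first Adam step is roughly the sign of the gradient, and it compares that against the orthogonal polar factor of `G`. The helper names `polar_factor` and `random_orthogonal` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def polar_factor(g):
    # Orthogonal polar factor U @ Vt from the thin SVD g = U diag(s) Vt.
    u, _, vt = np.linalg.svd(g, full_matrices=False)
    return u @ vt

def random_orthogonal(n):
    # The Q factor of a Gaussian matrix is orthogonal.
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

g = rng.standard_normal((6, 4))
q_left, q_right = random_orthogonal(6), random_orthogonal(4)
g_rot = q_left @ g @ q_right.T

# Equivariance asks for: update(Q_l G Q_r^T) == Q_l update(G) Q_r^T.
polar_gap = np.abs(polar_factor(g_rot) - q_left @ polar_factor(g) @ q_right.T).max()
sign_gap = np.abs(np.sign(g_rot) - q_left @ np.sign(g) @ q_right.T).max()

print(f"polar-factor gap: {polar_gap:.2e}")  # expected to be near machine precision
print(f"sign-update gap:  {sign_gap:.2e}")   # expected to be far from zero
```

The polar factor commutes with the rotations up to floating-point error, while the coordinate-wise sign rule does not.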

Reference implementation:
https://github.com/timlautk/equivariant_optimizers

**Describe the solution you'd like**
I would like the optimizer API to support assigning equivariant optimizer rules to matrix-valued parameters based on their layer type and symmetry structure.

Concretely, this could include support for optimizer variants such as:

- **Full spectral / polar-gradient updates** for ordinary dense matrix parameters, such as attention and linear layers.
- **Right-spectral updates** for embedding and LM-head matrices, respecting vocabulary permutation symmetry and hidden-feature orthogonal symmetry.
- **Row-norm or hybrid row-norm/right-spectral updates** for large vocabulary-indexed matrices.
- **Left-spectral or centered equivariant updates** for MoE router matrices, respecting expert permutation symmetry and logit-shift structure.
- Momentum variants such as EMA or momentum-first polar updates.
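
As a rough illustration of what some of these rules might compute, the sketch below implements a polar (full spectral) direction, a row-norm direction, and a momentum-first step in NumPy. These are my own simplified readings of the names above, not the definitions from the paper or the reference implementation: "row-norm" is assumed to mean rescaling each row to unit Euclidean norm, and "momentum-first" is assumed to mean updating an EMA buffer before applying the direction map. The right-spectral and left-spectral variants are deliberately left out because I do not want to guess their exact form. All function names and default hyperparameters are hypothetical.

```python
import numpy as np

def polar_direction(m):
    # Full spectral direction: orthogonal polar factor of m via thin SVD.
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt

def row_norm_direction(m, eps=1e-12):
    # Assumed meaning of "row-norm": scale every row to unit Euclidean norm.
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    return m / np.maximum(norms, eps)

def momentum_first_step(weight, grad, momentum, direction_fn, lr=0.02, beta=0.9):
    # Assumed meaning of "momentum-first": update the EMA buffer, then map the
    # buffer (not the raw gradient) through the chosen direction function.
    momentum = beta * momentum + (1.0 - beta) * grad
    weight = weight - lr * direction_fn(momentum)
    return weight, momentum

# Toy usage on a 5 x 3 matrix parameter.
rng = np.random.default_rng(0)
w = rng.standard_normal((5, 3))
buf = np.zeros_like(w)
grad = rng.standard_normal((5, 3))

w_new, buf = momentum_first_step(w, grad, buf, polar_direction)
step = (w - w_new) / 0.02  # recovers the applied direction
print(np.round(step.T @ step, 6))  # columns of the polar direction are orthonormal
```

Swapping `polar_direction` for `row_norm_direction` in the call gives the row-normalized variant with the same momentum handling. A production version would likely replace the SVD with a cheaper iterative approximation.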

Ideally, the implementation would allow users to define parameter groups by module type or tensor shape and attach different symmetry-compatible update rules to different matrix classes.
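
A minimal, framework-free sketch of what such an assignment step could look like is below. Everything in it is hypothetical: the `ParamInfo` record, the role tags, the rule names, and the AdamW fallback for non-matrix tensors are placeholders for illustration, not an existing API in this project or in the reference implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ParamInfo:
    name: str
    shape: tuple
    role: str  # hypothetical tag: "linear", "embedding", "lm_head", "router", "other"

# Hypothetical mapping from architectural role to update rule,
# following the variants listed in this request.
RULE_BY_ROLE = {
    "linear": "full_spectral",
    "embedding": "right_spectral",
    "lm_head": "right_spectral",
    "router": "left_spectral_centered",
}

def assign_rule(param, fallback="adamw"):
    # Only 2-D tensors get a matrix rule; biases, norm gains, etc. fall back.
    if len(param.shape) != 2:
        return fallback
    return RULE_BY_ROLE.get(param.role, fallback)

def build_param_groups(params):
    # Group parameter names by the rule assigned to them.
    groups = {}
    for p in params:
        groups.setdefault(assign_rule(p), []).append(p.name)
    return groups

params = [
    ParamInfo("tok_embed.weight", (50000, 512), "embedding"),
    ParamInfo("blocks.0.attn.q_proj.weight", (512, 512), "linear"),
    ParamInfo("blocks.0.moe.router.weight", (8, 512), "router"),
    ParamInfo("blocks.0.norm.weight", (512,), "other"),
    ParamInfo("lm_head.weight", (50000, 512), "lm_head"),
]

for rule, names in build_param_groups(params).items():
    print(rule, names)
```

In a real integration the role tag would presumably be derived from the module type rather than supplied by hand, and each group would carry its own hyperparameters.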

**Describe alternatives you've considered**
An alternative is to use existing matrix optimizers such as Muon or Shampoo on some matrix parameters while leaving embeddings, LM heads, and routers on AdamW. This is useful in practice, but it does not provide a unified layerwise principle for assigning optimizers according to parameter symmetry.

Another alternative is to maintain this as an external optimizer package. However, native or officially supported integration would make it much easier to use these methods in large-scale training workflows.

**Additional context**
The motivation is that optimizer design for neural networks may benefit from being **layerwise and symmetry-compatible** rather than uniformly coordinate-wise. Matrix-valued parameters have different natural symmetry groups depending on their architectural role. For example, standard linear layers naturally suggest bi-orthogonal equivariance, embeddings and LM heads suggest left-permutation/right-orthogonal equivariance, and MoE routers suggest expert-permutation equivariance with centered updates.
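
The embedding and router symmetries can be checked numerically in the same spirit. The sketch below is an illustration under my own assumptions, not the paper's construction: rows of the embedding gradient are indexed by vocabulary entries, rows of the router gradient are indexed by experts, "row-norm" means per-row unit normalization, and "centered" means subtracting the mean over the expert axis. The function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def row_norm_direction(g, eps=1e-12):
    # Assumed row-norm rule: scale each row to unit Euclidean norm.
    norms = np.linalg.norm(g, axis=1, keepdims=True)
    return g / np.maximum(norms, eps)

def center_over_experts(g):
    # Assumed centering: remove the mean across the expert (row) axis.
    return g - g.mean(axis=0, keepdims=True)

vocab, hidden, experts = 8, 4, 3

# Embedding-style check: left permutation P, right orthogonal Q.
g_embed = rng.standard_normal((vocab, hidden))
perm = np.eye(vocab)[rng.permutation(vocab)]
q, _ = np.linalg.qr(rng.standard_normal((hidden, hidden)))
embed_gap = np.abs(
    row_norm_direction(perm @ g_embed @ q) - perm @ row_norm_direction(g_embed) @ q
).max()

# Router-style check: centering commutes with expert permutations, and the
# centered update has no component that shifts every expert row equally.
g_router = rng.standard_normal((experts, hidden))
expert_perm = np.eye(experts)[rng.permutation(experts)]
centered = center_over_experts(g_router)
column_sums = centered.sum(axis=0)
perm_gap = np.abs(
    center_over_experts(expert_perm @ g_router) - expert_perm @ centered
).max()

print(f"embedding equivariance gap: {embed_gap:.2e}")
print(f"router permutation gap:     {perm_gap:.2e}")
print("centered column sums:", np.round(column_sums, 12))
```

The zero column sums matter because adding the same vector to every expert row shifts all router logits by the same amount, which a softmax over experts does not see.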

The reference implementation above contains prototype optimizers and parameter-group assignment logic for experimenting with these ideas in transformer and MoE pre-training settings.

