Adding symmetry-compatible / equivariant optimizers for general matrix-valued parameters #188

@timlautk

Is your feature request related to a problem? Please describe.
I would like to request support for symmetry-compatible / equivariant optimizers for general matrix-valued parameters, as proposed in the paper "Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers".

Many modern neural-network parameters are naturally matrices rather than unstructured vectors: attention projections, MLP projections, embeddings, LM heads, and MoE routers. Standard coordinate-wise optimizers such as AdamW treat these matrices as collections of independent scalar coordinates, which can break natural symmetries such as left/right orthogonal equivariance, vocabulary permutation equivariance, or expert permutation equivariance.

Reference implementation:
https://github.com/timlautk/equivariant_optimizers

Describe the solution you'd like
I would like the optimizer API to support assigning equivariant optimizer rules to matrix-valued parameters based on their layer type and symmetry structure.

Concretely, this could include support for optimizer variants such as:

  • Full spectral / polar-gradient updates for ordinary dense matrix parameters, such as attention and linear layers.
  • Right-spectral updates for embedding and LM-head matrices, respecting vocabulary permutation symmetry and hidden-feature orthogonal symmetry.
  • Row-norm or hybrid row-norm/right-spectral updates for large vocabulary-indexed matrices.
  • Left-spectral or centered equivariant updates for MoE router matrices, respecting expert permutation symmetry and logit-shift structure.
  • Momentum variants such as EMA or momentum-first polar updates.
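To make two of these variants concrete, here is a minimal NumPy sketch, not the reference implementation. It assumes the "full spectral / polar-gradient" direction is the orthogonal polar factor of the gradient (computed here via SVD) and that the "row-norm" direction rescales each row to unit Euclidean norm. The function names are hypothetical, and plain sign descent is used only as a stand-in for a coordinate-wise rule. The checks at the end verify the corresponding equivariance properties numerically.

```python
import numpy as np

def polar_update(grad):
    """Orthogonal polar factor U @ Vt of the gradient (assumed full spectral direction)."""
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return u @ vt

def row_norm_update(grad, eps=1e-12):
    """Rescale each row of the gradient to unit Euclidean norm (assumed row-norm direction)."""
    norms = np.linalg.norm(grad, axis=1, keepdims=True)
    return grad / np.maximum(norms, eps)

def random_orthogonal(n, rng):
    """Random orthogonal matrix from the QR factorization of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

rng = np.random.default_rng(0)
grad = rng.standard_normal((6, 4))      # toy gradient, e.g. 6 "tokens" x 4 features
q_left = random_orthogonal(6, rng)
q_right = random_orthogonal(4, rng)
perm = np.eye(6)[rng.permutation(6)]    # permutation matrix acting on rows

# Polar update: equivariant under orthogonal transforms on both sides.
polar_ok = np.allclose(
    polar_update(q_left @ grad @ q_right),
    q_left @ polar_update(grad) @ q_right,
)

# Row-norm update: equivariant under row permutations and right orthogonal transforms.
row_ok = np.allclose(
    row_norm_update(perm @ grad @ q_right),
    perm @ row_norm_update(grad) @ q_right,
)

# Coordinate-wise stand-in (sign descent): not equivariant under orthogonal transforms.
sign_ok = np.allclose(
    np.sign(q_left @ grad @ q_right),
    q_left @ np.sign(grad) @ q_right,
)

print(polar_ok, row_ok, sign_ok)
```

The SVD is used here only for clarity; a large-scale implementation would more likely approximate the polar factor iteratively.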

Ideally, the implementation would allow users to define parameter groups by module type or tensor shape and attach different symmetry-compatible update rules to different matrix classes.
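As a rough illustration of what such an assignment could look like, the sketch below maps parameter names and shapes to update-rule labels. Everything in it is hypothetical: the name patterns, the rule labels, the example parameter names, and the choice to fall back to AdamW for non-matrix tensors are assumptions made for the example, not an existing API.

```python
from fnmatch import fnmatch

# Hypothetical rule table: (parameter-name pattern, update-rule label).
# The first matching pattern wins.
RULES = [
    ("*embed*", "right_spectral"),
    ("*lm_head*", "right_spectral"),
    ("*router*", "left_spectral_centered"),
]

def assign_rule(name, shape):
    """Pick an update-rule label for one parameter from its name and shape."""
    if len(shape) != 2:
        return "adamw"   # assumption: non-matrix tensors stay coordinate-wise
    for pattern, rule in RULES:
        if fnmatch(name, pattern):
            return rule
    return "polar"       # assumption: default for ordinary dense matrices

def build_groups(named_shapes):
    """Group parameter names by their assigned update rule."""
    groups = {}
    for name, shape in named_shapes:
        groups.setdefault(assign_rule(name, shape), []).append(name)
    return groups

# Made-up parameter names and shapes for a small transformer with one MoE router.
named_shapes = [
    ("tok_embed.weight", (32000, 512)),
    ("blocks.0.attn.q_proj.weight", (512, 512)),
    ("blocks.0.mlp.up_proj.weight", (2048, 512)),
    ("blocks.0.moe.router.weight", (8, 512)),
    ("blocks.0.norm.weight", (512,)),
    ("lm_head.weight", (32000, 512)),
]

groups = build_groups(named_shapes)
for rule, names in groups.items():
    print(rule, names)
```

In a real integration the resulting groups would presumably be turned into optimizer parameter groups, each carrying its own update rule and hyperparameters.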

Describe alternatives you've considered
An alternative is to use existing matrix optimizers such as Muon or Shampoo on some matrix parameters while leaving embeddings, LM heads, and routers on AdamW. This is useful in practice, but it does not provide a unified layerwise principle for assigning optimizers according to parameter symmetry.

Another alternative is to maintain this as an external optimizer package. However, native or officially supported integration would make it much easier to use these methods in large-scale training workflows.

Additional context
The motivation is that optimizer design for neural networks may benefit from being layerwise and symmetry-compatible rather than uniformly coordinate-wise. Matrix-valued parameters have different natural symmetry groups depending on their architectural role. For example, standard linear layers naturally suggest bi-orthogonal equivariance, embeddings and LM heads suggest left-permutation/right-orthogonal equivariance, and MoE routers suggest expert-permutation equivariance with centered updates.
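For readers less familiar with the terminology, the equivariance conditions above can be written out as follows. The notation is my own shorthand rather than the paper's: $\Phi$ denotes the map from a gradient matrix $G \in \mathbb{R}^{m \times n}$ to an update direction, $O(k)$ is the group of $k \times k$ orthogonal matrices, $P$ and $P_E$ are permutation matrices, and I assume rows index vocabulary entries (for embeddings and LM heads) or experts (for routers).

```latex
% Bi-orthogonal equivariance (ordinary linear layers)
\Phi(Q_L\, G\, Q_R) = Q_L\, \Phi(G)\, Q_R
  \qquad \text{for all } Q_L \in O(m),\ Q_R \in O(n)

% Left-permutation / right-orthogonal equivariance (embeddings, LM heads)
\Phi(P\, G\, Q) = P\, \Phi(G)\, Q
  \qquad \text{for all permutation matrices } P,\ Q \in O(n)

% Expert-permutation equivariance (MoE routers)
\Phi(P_E\, G) = P_E\, \Phi(G)
  \qquad \text{for all permutation matrices } P_E
```

For example, the polar factor $\Phi(G) = U V^\top$ obtained from a thin SVD $G = U \Sigma V^\top$ of a full-rank $G$ satisfies the first condition, whereas a coordinate-wise map such as the elementwise sign does not in general.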

The reference implementation above contains prototype optimizers and parameter-group assignment logic for experimenting with these ideas in transformer and MoE pre-training settings.

Labels: enhancement (New feature or request)
