| 2604 | Newton-Muon | Adds input-side Newton preconditioning to Muon and reduces training steps and wall-clock time in reported GPT-2 pretraining runs. |
| 2604 | CLion | Applies cautious updates to Lion to improve generalization while preserving lightweight optimizer state and efficiency. |
| 2604 | Adam-HNAG | Reformulates Adam with a curvature-aware correction and provides accelerated convergence guarantees in the reported setting. |
| 2602 | TSR-Adam | Uses two-sided low-rank synchronization to reduce Adam-family communication cost in distributed training. |
| 2602 | NAMO | Combines Muon-style orthogonalized momentum with Adam-type noise adaptation to improve stability at negligible extra cost. |
| 2601 | Spectral Sphere Optimizer (SSO) | Constrains both weights and updates on a spectral sphere to improve LLM training stability and outperform AdamW and Muon in reported experiments. |
| 2510 | NorMuon | Combines Muon with neuron-wise normalization and second-order statistics to improve scalability and efficiency. |
| 2510 | Hill-ADAM | Alternates minimization and maximization phases to help Adam escape local minima in non-convex loss landscapes. |
| 2510 | DP-Adam-AC | Adds adaptive clipping to Adam-based private fine-tuning to improve the privacy-utility trade-off for localizable language models. |
| 2509 | Conda | Blends Adam-style adaptivity with column-normalized updates to improve optimization efficiency in LLM training. |
| 2506 | SPlus | Uses stable whitening-style preconditioning to cut gradient steps and wall-clock time in reported neural network training runs. |
| 2505 | PolarGrad | Unifies matrix-gradient preconditioning and introduces polar-decomposition updates that outperform Adam and Muon in reported studies. |
| 2505 | Gluon | Generalizes Muon and Scion within an LMO framework and improves layer-wise large-model optimization in the reported setting. |
| 2504 | Dion | Distributed orthonormalized updates that reduce large-scale training overhead while preserving Muon-style gains. |
| 2502 | Scion | Uses norm-constrained LMO updates that improve stability, memory efficiency, and hyperparameter transfer. |
| 2502 | D-Muon | Scales Muon-style orthogonalized updates to distributed LLM training and improves compute efficiency over strong AdamW baselines. |
| 2412 | Muon | Uses orthogonalized matrix updates for hidden-layer weights and is typically paired with AdamW for non-matrix parameters. |
| 2411 | MARS | Injects variance reduction into adaptive and sign-based optimizers and reports strong GPT-2 training gains. |
| 2411 | Cautious Optimizers | Adds a one-line cautious mask to momentum optimizers such as AdamW and Lion. |
| 2411 | ADOPT | A modified Adam-family method with stronger convergence guarantees and improved practical stability. |
| 2409 | SOAP | Blends Shampoo-style preconditioning with Adam-style moment updates. |
| 2409 | AdEMAMix | Mixes older-gradient EMAs into AdamW to improve token efficiency. |
| 2406 | Adam-mini | Uses fewer learning-rate groups to reduce optimizer memory with AdamW-like quality. |
| 2405 | SF-AdamW (Schedule-Free) | Removes explicit learning-rate schedules and simplifies tuning while keeping AdamW-style behavior. |
| 2405 | MicroAdam | Compresses optimizer-state updates to reduce memory overhead while preserving convergence quality. |
| 2405 | FAdam | Uses diagonal empirical Fisher preconditioning to make Adam behave more like a lightweight natural-gradient optimizer. |
| 2312 | AGD | Auto-switches preconditioning based on stepwise gradient differences to balance adaptivity and efficiency. |
| 2310 | AdaLOMO | Low-memory optimizer with adaptive learning rates for resource-constrained full-parameter LLM fine-tuning. |
| 2309 | AdaPlus | Adds Nesterov momentum and more precise stepsize control on top of AdamW-style updates. |
| 2307 | CoRe | All-in-one optimizer designed to work robustly across tasks with less retuning. |
| 2307 | CAME | Confidence-guided memory-efficient optimization for large-scale model training. |
| 2307 | Adam+CM | Adds critical momenta to Adam-style updates to improve exploration and escape poor minima. |
| 2306 | Prodigy | Parameter-free learner derived from D-Adaptation that reduces learning-rate tuning. |
| 2306 | LOMO | Fuses gradient computation and parameter updates to enable low-memory full-parameter LLM fine-tuning. |
| 2305 | WSAM | Revisits SAM with weighted sharpness to improve generalization while keeping optimization practical. |
| 2305 | UAdam | Unified Adam-type framework that studies convergence behavior across a broad class of Adam-family methods. |
| 2305 | Sophia | Scalable stochastic second-order optimizer for language-model pretraining. |
| 2305 | DoWG | Universal parameter-free gradient method that extends DoG-style step-size adaptation with stronger empirical performance. |
| 2302 | Lion | Sign-based momentum optimizer discovered by symbolic search. |
| 2302 | FOSI | Combines first-order optimizers with second-order curvature information for faster convergence on difficult objectives. |
| 2302 | DoG | Parameter-free dynamic step-size schedule that makes SGD-style optimization much less tuning-sensitive. |
| 2301 | D-Adaptation | Learning-rate-free optimization for SGD, Adam, and AdaGrad variants. |
| 2211 | VeLO | Learned optimizer trained at scale to transfer across tasks and architectures better than smaller learned optimizers. |
| 2210 | Amos | Adam-style optimizer with adaptive decay and scale-aware weight decay. |
| 2210 | AdaNorm | Corrects gradient-norm scaling to stabilize adaptive optimization for CNNs. |
| 2208 | Adan | Adaptive Nesterov momentum optimizer for faster and more stable deep-model training. |
| 2206 | GradaGrad | Non-monotone adaptive stochastic gradient method aimed at improving practical convergence over monotone variants. |