JiwenJ/Awesome-Optimizers

Awesome Optimizers List

License: MIT. PRs welcome.

Curated optimizer-design papers from 2022 onward, listed in reverse chronological order. The Date column uses YYMM format.

| Date | Optimizer Name | Links | Advantage |
| --- | --- | --- | --- |
| 2604 | Newton-Muon | Paper Link | Adds input-side Newton preconditioning to Muon and reduces training steps and wall-clock time in reported GPT-2 pretraining runs. |
| 2604 | CLion | Paper Link | Applies cautious updates to Lion to improve generalization while preserving lightweight optimizer state and efficiency. |
| 2604 | Adam-HNAG | Paper Link | Reformulates Adam with a curvature-aware correction and provides accelerated convergence guarantees in the reported setting. |
| 2602 | TSR-Adam | Paper Link | Uses two-sided low-rank synchronization to reduce Adam-family communication cost in distributed training. |
| 2602 | NAMO | Paper Link | Combines Muon-style orthogonalized momentum with Adam-type noise adaptation to improve stability at negligible extra cost. |
| 2601 | Spectral Sphere Optimizer (SSO) | Paper Link, GitHub | Constrains both weights and updates on a spectral sphere to improve LLM training stability and outperform AdamW and Muon in reported experiments. |
| 2510 | NorMuon | Paper Link, GitHub | Combines Muon with neuron-wise normalization and second-order statistics to improve scalability and efficiency. |
| 2510 | Hill-ADAM | Paper Link | Alternates minimization and maximization phases to help Adam escape local minima in non-convex loss landscapes. |
| 2510 | DP-Adam-AC | Paper Link | Adds adaptive clipping to Adam-based private fine-tuning to improve the privacy-utility trade-off for localizable language models. |
| 2509 | Conda | Paper Link | Blends Adam-style adaptivity with column-normalized updates to improve optimization efficiency in LLM training. |
| 2506 | SPlus | Paper Link, GitHub | Uses stable whitening-style preconditioning to cut gradient steps and wall-clock time in reported neural network training runs. |
| 2505 | PolarGrad | Paper Link | Unifies matrix-gradient preconditioning and introduces polar-decomposition updates that outperform Adam and Muon in reported studies. |
| 2505 | Gluon | Paper Link | Generalizes Muon and Scion within an LMO framework and improves layer-wise large-model optimization in the reported setting. |
| 2504 | Dion | Paper Link, GitHub | Distributed orthonormalized updates that reduce large-scale training overhead while preserving Muon-style gains. |
| 2502 | Scion | Paper Link, GitHub | Uses norm-constrained LMO updates that improve stability, memory efficiency, and hyperparameter transfer. |
| 2502 | D-Muon | Paper Link, GitHub | Scales Muon-style orthogonalized updates to distributed LLM training and improves compute efficiency over strong AdamW baselines. |
| 2412 | Muon | Paper Link, GitHub | Uses orthogonalized matrix updates for hidden-layer weights and is typically paired with AdamW for non-matrix parameters. |
| 2411 | MARS | Paper Link, GitHub | Injects variance reduction into adaptive and sign-based optimizers and reports strong GPT-2 training gains. |
| 2411 | Cautious Optimizers | Paper Link, GitHub | Adds a one-line cautious mask to momentum optimizers such as AdamW and Lion. |
| 2411 | ADOPT | Paper Link, GitHub | A modified Adam-family method with stronger convergence guarantees and improved practical stability. |
| 2409 | SOAP | Paper Link, GitHub | Blends Shampoo-style preconditioning with Adam-style moment updates. |
| 2409 | AdEMAMix | Paper Link, GitHub | Mixes older-gradient EMAs into AdamW to improve token efficiency. |
| 2406 | Adam-mini | Paper Link, GitHub | Uses fewer learning-rate groups to reduce optimizer memory with AdamW-like quality. |
| 2405 | SF-AdamW (Schedule-Free) | Paper Link, GitHub | Removes explicit learning-rate schedules and simplifies tuning while keeping AdamW-style behavior. |
| 2405 | MicroAdam | Paper Link, GitHub | Compresses optimizer-state updates to reduce memory overhead while preserving convergence quality. |
| 2405 | FAdam | Paper Link, GitHub | Uses diagonal empirical Fisher preconditioning to make Adam behave more like a lightweight natural-gradient optimizer. |
| 2312 | AGD | Paper Link, GitHub | Auto-switches preconditioning based on stepwise gradient differences to balance adaptivity and efficiency. |
| 2310 | AdaLOMO | Paper Link, GitHub | Low-memory optimizer with adaptive learning rates for resource-constrained full-parameter LLM fine-tuning. |
| 2309 | AdaPlus | Paper Link, GitHub | Adds Nesterov momentum and more precise stepsize control on top of AdamW-style updates. |
| 2307 | CoRe | Paper Link, GitHub | All-in-one optimizer designed to work robustly across tasks with less retuning. |
| 2307 | CAME | Paper Link, GitHub | Confidence-guided memory-efficient optimization for large-scale model training. |
| 2307 | Adam+CM | Paper Link, GitHub | Adds critical momenta to Adam-style updates to improve exploration and escape poor minima. |
| 2306 | Prodigy | Paper Link, GitHub | Parameter-free learner derived from D-Adaptation that reduces learning-rate tuning. |
| 2306 | LOMO | Paper Link, GitHub | Fuses gradient computation and parameter updates to enable low-memory full-parameter LLM fine-tuning. |
| 2305 | WSAM | Paper Link, GitHub | Revisits SAM with weighted sharpness to improve generalization while keeping optimization practical. |
| 2305 | UAdam | Paper Link | Unified Adam-type framework that studies convergence behavior across a broad class of Adam-family methods. |
| 2305 | Sophia | Paper Link, GitHub | Scalable stochastic second-order optimizer for language-model pretraining. |
| 2305 | DoWG | Paper Link, GitHub | Universal parameter-free gradient method that extends DoG-style step-size adaptation with stronger empirical performance. |
| 2302 | Lion | Paper Link, GitHub | Sign-based momentum optimizer discovered by symbolic search. |
| 2302 | FOSI | Paper Link, GitHub | Combines first-order optimizers with second-order curvature information for faster convergence on difficult objectives. |
| 2302 | DoG | Paper Link, GitHub | Parameter-free dynamic step-size schedule that makes SGD-style optimization much less tuning-sensitive. |
| 2301 | D-Adaptation | Paper Link, GitHub | Learning-rate-free optimization for SGD, Adam, and AdaGrad variants. |
| 2211 | VeLO | Paper Link, GitHub | Learned optimizer trained at scale to transfer across tasks and architectures better than smaller learned optimizers. |
| 2210 | Amos | Paper Link, GitHub | Adam-style optimizer with adaptive decay and scale-aware weight decay. |
| 2210 | AdaNorm | Paper Link, GitHub | Corrects gradient-norm scaling to stabilize adaptive optimization for CNNs. |
| 2208 | Adan | Paper Link, GitHub | Adaptive Nesterov momentum optimizer for faster and more stable deep-model training. |
| 2206 | GradaGrad | Paper Link | Non-monotone adaptive stochastic gradient method aimed at improving practical convergence over monotone variants. |
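
To make two of the entries above concrete, here is a minimal sketch of a Lion-style sign-momentum step with an optional cautious mask, written in plain Python on flat lists of floats. It is an illustration under stated assumptions, not the reference implementation from either paper: the function name `lion_step`, the `cautious` flag, and the `1e-3` floor on the mask mean are choices made for this sketch, and the default betas follow the commonly cited Lion settings.

```python
def sign(x):
    """Return -1.0, 0.0, or 1.0 according to the sign of x."""
    return float((x > 0) - (x < 0))


def lion_step(params, grads, momentum, lr=1e-4, beta1=0.9, beta2=0.99,
              weight_decay=0.0, cautious=False):
    """One Lion-style update; returns (new_params, new_momentum).

    Hypothetical helper for illustration only; all inputs are equal-length
    lists of floats.
    """
    # Update direction: sign of an interpolation between momentum and gradient.
    updates = [sign(beta1 * m + (1.0 - beta1) * g)
               for m, g in zip(momentum, grads)]

    if cautious:
        # Cautious mask: keep only coordinates where the update agrees in sign
        # with the current gradient, then rescale by the kept fraction.
        # The 1e-3 floor is an assumption of this sketch.
        mask = [1.0 if u * g > 0 else 0.0 for u, g in zip(updates, grads)]
        scale = 1.0 / max(sum(mask) / len(mask), 1e-3)
        updates = [u * k * scale for u, k in zip(updates, mask)]

    # Decoupled weight decay is applied alongside the (possibly masked) update.
    new_params = [p - lr * (u + weight_decay * p)
                  for p, u in zip(params, updates)]
    # The momentum buffer tracks an EMA of raw gradients with a separate beta.
    new_momentum = [beta2 * m + (1.0 - beta2) * g
                    for m, g in zip(momentum, grads)]
    return new_params, new_momentum


# Example: one plain step and one cautious step on a two-parameter problem.
plain_params, plain_momentum = lion_step(
    [1.0, -2.0], [0.5, -0.25], [0.0, 0.0], lr=0.1)
cautious_params, _ = lion_step(
    [1.0, 1.0], [-0.5, 0.5], [1.0, 0.0], lr=0.1, cautious=True)
```

In the cautious call, the first coordinate's momentum still points against the fresh gradient, so its update is masked out and the remaining coordinate's step is scaled up to compensate.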
