Research

Taming Momentum: Rethinking Optimizer States Through Low-Rank Approximation

arXiv cs.CL · Feb 27, 2026 18:57 UTC · Paper: ~15 min

2-Minute Brief

According to arXiv cs.CL: Modern optimizers like Adam and Muon are central to training large language models, but their reliance on first- and second-order momenta introduces significant memory overhead, which constrains scalability and computational efficiency. In this work, we reframe the exponential moving average (EMA) used in these momenta as the training of a linear regressor via online gradient flow. Building on this equivalence, we introduce LoRA-Pre, a novel low-rank optimizer designed for efficient pre-training.
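
The EMA reframing can be illustrated with a small sketch. It rests on a standard identity rather than on the paper's exact formulation, which the brief does not spell out: the update m <- beta * m + (1 - beta) * g is the same as one gradient-descent step on the squared error 0.5 * ||m - g||^2 with step size 1 - beta. The function names and the toy gradient stream below are hypothetical, and nothing here implements LoRA-Pre itself.

```python
import random


def ema_update(m, g, beta):
    """Standard EMA momentum update: m <- beta * m + (1 - beta) * g."""
    return [beta * mi + (1.0 - beta) * gi for mi, gi in zip(m, g)]


def gradient_step_update(m, g, beta):
    """One gradient step on loss(m) = 0.5 * ||m - g||^2.

    The gradient of this loss with respect to m is (m - g). Using the
    step size (1 - beta) reproduces the EMA update algebraically.
    """
    lr = 1.0 - beta
    return [mi - lr * (mi - gi) for mi, gi in zip(m, g)]


# Toy stream of "gradients" (an assumption made purely for illustration).
random.seed(0)
beta = 0.9
m_ema = [0.0] * 4
m_gd = [0.0] * 4
for _ in range(5):
    g = [random.gauss(0.0, 1.0) for _ in range(4)]
    m_ema = ema_update(m_ema, g, beta)
    m_gd = gradient_step_update(m_gd, g, beta)

# Largest elementwise difference between the two momentum estimates.
max_gap = max(abs(a - b) for a, b in zip(m_ema, m_gd))
print(max_gap)
```

Under this view the momentum buffer is the parameter of a small online learner, which is presumably what makes it natural to swap the full buffer for a low-rank parameterization, as the brief says LoRA-Pre does.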
