DiscoverDaily Paper Cast

No More Adam: Learning Rate Scaling at Initialization is All You Need

Update: 2024-12-20

Description

🤗 Upvotes: 177 | cs.LG, cs.AI



Authors:

Minghao Xu, Lichuan Xiang, Xu Cai, Hongkai Wen



Title:

No More Adam: Learning Rate Scaling at Initialization is All You Need



Arxiv:

http://arxiv.org/abs/2412.11768v2



Abstract:

In this work, we question the necessity of adaptive gradient methods for training deep neural networks. SGD-SaI is a simple yet effective enhancement to stochastic gradient descent with momentum (SGDM). SGD-SaI performs learning rate Scaling at Initialization (SaI) for distinct parameter groups, guided by their respective gradient signal-to-noise ratios (g-SNR). By adjusting learning rates without relying on adaptive second-order momentum, SGD-SaI helps prevent training imbalances from the very first iteration and cuts the optimizer's memory usage by half compared to AdamW. Despite its simplicity and efficiency, SGD-SaI consistently matches or outperforms AdamW on a variety of Transformer-based tasks, effectively overcoming a long-standing challenge of using SGD for training Transformers. SGD-SaI excels in ImageNet-1K classification with Vision Transformers (ViT) and GPT-2 pretraining for large language models (LLMs, transformer decoder-only), demonstrating robustness to hyperparameter variations and practicality for diverse applications. We further tested its robustness on tasks like LoRA fine-tuning for LLMs and diffusion models, where it consistently outperforms state-of-the-art optimizers. From a memory efficiency perspective, SGD-SaI achieves substantial memory savings for optimizer states, reducing memory usage by 5.93 GB for GPT-2 (1.5B parameters) and 25.15 GB for Llama2-7B compared to AdamW in full-precision training settings.
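
Below is a minimal PyTorch sketch of the "scaling at initialization" idea described in the abstract: gradients from the first batch yield one fixed learning-rate scale per parameter group, and training then proceeds with plain SGD with momentum, with no per-step adaptive state. The g-SNR formula, the per-tensor grouping, and the way the scale multiplies the base learning rate are assumptions made for illustration; the helper names gsnr, scales_at_init, build_sgdm, loss_fn, and batch are hypothetical, and none of this is the authors' reference implementation.

```python
# Illustrative sketch of SGD-SaI-style learning rate Scaling at Initialization (SaI).
# ASSUMPTIONS: g-SNR is taken as |mean(g)| / std(g) per parameter tensor, each tensor
# forms its own parameter group, and the scale simply multiplies the base learning rate.
# The paper's exact definitions may differ; this is not the reference implementation.
import torch


def gsnr(grad: torch.Tensor, eps: float = 1e-8) -> float:
    """Assumed gradient signal-to-noise ratio of one parameter tensor."""
    noise = grad.std() if grad.numel() > 1 else grad.abs().mean()
    return (grad.mean().abs() / (noise + eps)).item()


def scales_at_init(model, loss_fn, batch):
    """Compute fixed per-group scales from the gradients of the first batch."""
    loss_fn(model, batch).backward()
    scales = {name: gsnr(p.grad) for name, p in model.named_parameters() if p.grad is not None}
    model.zero_grad()
    return scales  # frozen for the rest of training; no adaptive second-order state


def build_sgdm(model, scales, base_lr=1e-3, momentum=0.9, weight_decay=0.01):
    """Plain SGD with momentum; only the per-group learning rates carry the scaling."""
    groups = [
        {"params": [p], "lr": base_lr * scales[name]}
        for name, p in model.named_parameters()
        if name in scales
    ]
    return torch.optim.SGD(groups, momentum=momentum, weight_decay=weight_decay)
```

Because the update then needs only SGDM's momentum buffer rather than AdamW's two moment estimates, the optimizer state shrinks by roughly one full-precision value per parameter; for GPT-2 at 1.5B parameters that is on the order of 1.5e9 × 4 bytes ≈ 6 GB, in line with the 5.93 GB saving quoted in the abstract.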
