Fourier Position Embedding: Enhancing Attention's Periodic Extension for Length Generalization

Update: 2024-12-26
Description

🤗 Upvotes: 20 | cs.AI, cs.CL



Authors:

Ermo Hua, Che Jiang, Xingtai Lv, Kaiyan Zhang, Ning Ding, Youbang Sun, Biqing Qi, Yuchen Fan, Xue Kai Zhu, Bowen Zhou



Title:

Fourier Position Embedding: Enhancing Attention's Periodic Extension for Length Generalization



Arxiv:

http://arxiv.org/abs/2412.17739v1



Abstract:

Extending the context length of Language Models (LMs) by improving Rotary Position Embedding (RoPE) has become a trend. While existing works mainly address RoPE's limitations within the attention mechanism, this paper analyzes nearly all parts of LMs, uncovering the adverse effects these components have on the length generalization of RoPE-based attention. Using Discrete Signal Processing theory, we show that RoPE enables periodic attention by implicitly performing a Non-Uniform Discrete Fourier Transform. However, this periodicity is undermined by spectral damage caused by: 1) linear layers and activation functions outside of attention; 2) insufficiently trained frequency components arising from time-domain truncation. Building on these observations, we propose Fourier Position Embedding (FoPE), which enhances attention's frequency-domain properties to improve both its periodic extension and length generalization. FoPE constructs a Fourier series and zeroes out the destructive frequency components, increasing the model's robustness to spectral damage. Experiments across various model scales show that, within varying context windows, FoPE maintains a more stable perplexity and more consistent accuracy on a needle-in-a-haystack task than RoPE and ALiBi. Several analyses and ablations lend further support to our method and theoretical modeling.
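
To make the idea concrete, here is a minimal, illustrative sketch of a FoPE-style positional signal based only on the abstract: each dimension's single RoPE frequency is replaced by a small Fourier series, and frequencies below a floor are zeroed out as "destructive" under-trained components. The function name, the number of extra terms, their coefficients, and the floor value are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def fope_signal(seq_len, head_dim, base=10000.0, num_extra=4,
                floor_freq=1e-3, rng=None):
    """Illustrative FoPE-style positional signal (not the paper's code).

    RoPE assigns dimension pair j a single frequency w_j = base**(-2j/d)
    and rotates features by exp(i * w_j * t). Following the abstract, this
    sketch instead builds a small Fourier series per dimension and zeroes
    out frequencies below `floor_freq`, treating them as "destructive"
    under-trained components.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    t = np.arange(seq_len)[:, None]                      # positions, shape (T, 1)
    w = base ** (-np.arange(0, head_dim, 2) / head_dim)  # dominant RoPE frequencies

    # Extra frequencies and small coefficients per dimension (assumed random here).
    w_extra = rng.uniform(0.0, w.max(), size=(num_extra, w.size))
    a = rng.normal(0.0, 0.1, size=(num_extra, w.size))

    # Zero out destructive components: anything below the floor becomes a
    # zero-frequency (constant) term instead of a slow rotation.
    w = np.where(w < floor_freq, 0.0, w)
    w_extra = np.where(w_extra < floor_freq, 0.0, w_extra)

    # Fourier series per dimension: dominant term plus weighted extra terms.
    series = np.exp(1j * t * w) + (a * np.exp(1j * t[:, None] * w_extra)).sum(axis=1)
    return series                                        # shape (T, head_dim // 2)

# Example: 128-token context, 64-dimensional heads.
signal = fope_signal(seq_len=128, head_dim=64)
print(signal.shape)  # (128, 32)
```

In a real model this complex signal would modulate query/key pairs the same way RoPE rotations do; the sketch only illustrates the frequency-domain structure described in the abstract: a dominant frequency per dimension plus extra Fourier terms, with under-trained low frequencies clipped to zero.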

Hosts: Jingwen Liang, Gengyu Wang