Attention
Selective Attention Improves Transformer • arXiv:2410.02703
Differential Transformer • arXiv:2410.05258
TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention • arXiv:2410.05076
SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs • arXiv:2410.13276
Star Attention: Efficient LLM Inference over Long Sequences • arXiv:2411.17116
KV Shifting Attention Enhances Language Modeling • arXiv:2411.19574
Entropy-Guided Attention for Private LLMs • arXiv:2501.03489
Not All Language Model Features Are Linear • arXiv:2405.14860
Your Transformer is Secretly Linear • arXiv:2405.12250
MiniMax-01: Scaling Foundation Models with Lightning Attention • arXiv:2501.08313
Tensor Product Attention Is All You Need • arXiv:2501.06425
Sigma: Differential Rescaling of Query, Key and Value for Efficient Language Models • arXiv:2501.13629
TransMLA: Multi-head Latent Attention Is All You Need • arXiv:2502.07864
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention • arXiv:2502.11089
Does Time Have Its Place? Temporal Heads: Where Language Models Recall Time-specific Information • arXiv:2502.14258
How Do Large Vision-Language Models See Text in Image? Unveiling the Distinctive Role of OCR Heads • arXiv:2505.15865
Learning to Skip the Middle Layers of Transformers • arXiv:2506.21103
Limitations of Normalization in Attention Mechanism • arXiv:2508.17821
Native Hybrid Attention for Efficient Sequence Modeling • arXiv:2510.07019
Attention Sinks in Diffusion Language Models • arXiv:2510.15731
Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning • arXiv:2510.19338
arXiv:2510.23052
Kimi Linear: An Expressive, Efficient Attention Architecture • arXiv:2510.26692
Why Attention Patterns Exist: A Unifying Temporal Perspective Analysis • arXiv:2601.21709