TriAttention: Efficient Long Reasoning with Trigonometric KV Compression

Weian Mao1* Xi Lin3* Wei Huang2* Yuxin Xie1 Tianfu Fu1 Bohan Zhuang3
Song Han1,2 Yukang Chen2
1MIT 2NVIDIA 3ZJU * Equal contribution

Real-World Deployment

OpenClaw + 32B Model on a 24GB GPU: From OOM to Task Complete

Running a 32B model on a 24GB GPU leaves very little room for the KV cache. OpenClaw ships with default instructions so lengthy that Full Attention runs out of memory before the agent can even start. TriAttention compresses the KV cache on the fly, letting the agent run to completion.

2.5×

Throughput Gain

On AIME25 at matched accuracy (40.8%), TriAttention delivers 2.5× higher throughput than Full Attention.

10.7×

Memory Reduction

TriAttention reduces KV cache memory by 10.7× while matching Full Attention reasoning accuracy on AIME25.

6.3×

Peak Speedup

On MATH 500, TriAttention reaches 1,405 tokens/s vs. 223 tokens/s for Full Attention (68.4% vs. 69.6% accuracy).

Abstract

Extended reasoning in large language models (LLMs) requires long, accurate decoding and creates severe KV cache memory bottlenecks. Leading KV cache compression methods estimate KV importance using attention scores from recent post-RoPE queries. However, queries rotate with position under RoPE, so only a narrow window of recent queries is representative, which leads to poor top-key selection and unstable reasoning.

To avoid this issue, we turn to the pre-RoPE space, where we observe that Q and K vectors are highly concentrated around fixed non-zero centers and remain stable across positions, a property we call Q/K concentration. We show that this concentration causes queries to preferentially attend to keys at specific distances (e.g., the nearest keys), with the centers determining which distances are preferred via a trigonometric series. Based on this, we propose TriAttention, which estimates key importance from these centers: the trigonometric series converts each head's distance preference into a position-dependent score for every key, and Q/K norms supply an additional importance signal.

On AIME25 with 32K-token generation, TriAttention matches Full Attention reasoning accuracy while achieving 2.5× higher throughput or 10.7× KV memory reduction, whereas leading baselines achieve only about half the accuracy at the same efficiency.

Key Insight

Q/K Concentration in Pre-RoPE Space

Across most attention heads, pre-RoPE Q and K vectors are highly concentrated around fixed non-zero centers. This concentration is stable across positions and input contexts, and it makes attention patterns predictable via a trigonometric series.

Q/K concentration phenomenon: pre-RoPE vectors cluster around stable non-zero centers, enabling attention reconstruction via trigonometric series
(A) Pre-RoPE Q/K vectors at the dominant frequency band are highly concentrated (high Mean Resultant Length R). (B) RoPE rotation disperses these vectors into arc patterns. Three distinct input sequences are overlaid, showing this structure is stable across content. (C) This concentration holds across nearly all heads. (D) When Q/K are concentrated, attention logits can be accurately reconstructed using a trigonometric series (Pearson r = 0.72).
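The concentration statistic in panel (A) can be sketched as follows. This is a minimal illustration of how the Mean Resultant Length R of a head's pre-RoPE queries could be measured at one 2-D RoPE frequency band; the function name and array layout are our assumptions, not the paper's code.

```python
import numpy as np

def mean_resultant_length(q: np.ndarray, band: int) -> float:
    """Concentration of pre-RoPE query vectors in one 2-D RoPE band.

    q: (num_tokens, head_dim) pre-RoPE queries for a single head.
    band: index of the 2-D frequency band (pairs dims 2*band, 2*band+1).
    Returns R in [0, 1]; R near 1 means the band's 2-D projections
    point in nearly the same direction (high concentration).
    """
    xy = q[:, 2 * band:2 * band + 2]          # project onto the band
    angles = np.arctan2(xy[:, 1], xy[:, 0])   # direction of each 2-D vector
    # Mean resultant length of the directions on the unit circle.
    return float(np.abs(np.exp(1j * angles).mean()))

# Tightly clustered directions give R close to 1.
rng = np.random.default_rng(0)
q = np.tile([3.0, 1.0, 0.0, 0.0], (256, 1)) + 0.05 * rng.standard_normal((256, 4))
print(mean_resultant_length(q, band=0))  # close to 1.0
```

Isotropic (dispersed) vectors would instead give R near 0, which is how the method distinguishes concentrated heads from the dispersed minority.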

Post-RoPE is Unstable

Existing methods use recent post-RoPE queries to estimate importance. But queries rotate with position, so only a tiny window of queries is usable: important keys go undetected and are permanently evicted.

Pre-RoPE is Stable

In pre-RoPE space, Q and K vectors cluster around non-zero centers that stay consistent across positions and inputs. These centers are intrinsic properties, unaffected by positional rotation.

Distance Preferences are Predictable

When Q/K are concentrated, the attention logit reduces to a trigonometric series in Q-K distance. The learned centers determine which distances each head prefers.
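The reduction can be sketched as follows; the notation (band frequencies θ_j, relative distance Δ, centers q̄ and k̄) is ours, chosen to match standard RoPE conventions rather than taken verbatim from the paper.

```latex
% RoPE pairs dims (2j, 2j+1) and rotates each pair by \theta_j \Delta,
% so for relative distance \Delta the post-RoPE logit is exactly
\ell(\Delta) = q^{\top} R_{\Delta}\, k
  = \sum_{j}\Big[\underbrace{(q_{2j}k_{2j} + q_{2j+1}k_{2j+1})}_{a_j}\cos(\theta_j \Delta)
    + \underbrace{(q_{2j}k_{2j+1} - q_{2j+1}k_{2j})}_{b_j}\sin(\theta_j \Delta)\Big].
% Under Q/K concentration, q \approx \bar{q} and k \approx \bar{k}, hence
\ell(\Delta) \approx \sum_{j}\big[\bar{a}_j\cos(\theta_j \Delta)
    + \bar{b}_j\sin(\theta_j \Delta)\big],
% a trigonometric series in \Delta whose coefficients \bar{a}_j, \bar{b}_j
% are fixed by the centers, so each head's preferred distances are
% determined by its centers alone.
```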

Experimental Validation

Trigonometric Series Accurately Reconstructs Attention

We validate across three DeepSeek-R1 distilled architectures that the trigonometric series computed from Q/K centers faithfully predicts actual attention patterns.

Attention reconstruction correlation across three DeepSeek-R1 distilled LLMs. All models show right-skewed distributions with means above 0.5.
Distribution of per-head reconstruction Pearson correlation (r) across all attention heads for Qwen3, Qwen2.5, and Llama3. The red dashed line indicates the mean. All models show right-skewed distributions with means above 0.5, confirming that Q/K concentration enables accurate attention prediction across architectures.

Method

TriAttention: Scoring Keys via Trigonometric Series

TriAttention scores each key by combining a trigonometric series score (capturing distance preferences) with a norm-based score (handling dispersed heads), balanced by Q/K concentration.

TriAttention method overview: offline calibration computes Q centers, then scoring combines trigonometric series and norm-based components for KV cache pruning
Method overview. Offline calibration computes Q distribution centers. During inference, keys are scored by combining Strig (distance preference from trigonometric series) and norm-based components. Strig assigns low scores to keys at non-preferred distances, while the norm-based score identifies low-norm keys. Together, they accurately identify unimportant tokens for pruning.

Trigonometric Series Score

Uses Q centers from offline calibration and the trigonometric series to predict how much attention each key will receive at its current distance. Captures the distance preferences encoded in Q/K concentration.
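A minimal sketch of this score, substituting the calibrated query center for the actual query in the exact RoPE logit; the function name and argument layout are illustrative assumptions, not the released implementation.

```python
import numpy as np

def trig_score(q_center: np.ndarray, keys: np.ndarray,
               key_pos: np.ndarray, query_pos: int,
               thetas: np.ndarray) -> np.ndarray:
    """Predicted attention logit for each key at its current distance.

    q_center: (head_dim,) calibrated pre-RoPE query center for this head.
    keys:     (num_keys, head_dim) pre-RoPE key vectors.
    key_pos:  (num_keys,) absolute positions of the keys.
    thetas:   (head_dim // 2,) RoPE frequency of each 2-D band.
    Returns one score per key; higher means more attention expected.
    """
    delta = query_pos - key_pos                        # (num_keys,)
    qc = q_center.reshape(-1, 2)                       # (d/2, 2) bands
    kb = keys.reshape(len(keys), -1, 2)                # (num_keys, d/2, 2)
    # Per-band trig coefficients a_j, b_j from center-key products.
    a = qc[:, 0] * kb[:, :, 0] + qc[:, 1] * kb[:, :, 1]
    b = qc[:, 0] * kb[:, :, 1] - qc[:, 1] * kb[:, :, 0]
    phase = np.outer(delta, thetas)                    # (num_keys, d/2)
    return (a * np.cos(phase) + b * np.sin(phase)).sum(axis=1)
```

Because the center replaces the query, the score depends only on stored keys, their positions, and offline calibration, so no future queries are needed at eviction time.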

Norm-Based Score

Complements the trigonometric series for the minority of heads where Q/K are less concentrated. Weights each frequency band by expected query contribution, accounting for variation around centers.

Adaptive Weighting

Uses Mean Resultant Length R to automatically balance the two components. When concentration is high, Strig dominates; when low, the norm-based score provides complementary signal.
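Putting the pieces together, a sketch of the final scoring-and-pruning step. The page does not specify the exact combination rule, so the linear blend and z-score normalization below are our assumptions for illustration.

```python
import numpy as np

def key_importance(s_trig: np.ndarray, key_norms: np.ndarray,
                   R: float) -> np.ndarray:
    """Blend the two signals by the head's concentration R in [0, 1].

    s_trig:    trigonometric-series score per key (distance preference).
    key_norms: L2 norm of each pre-RoPE key (signal for dispersed heads).
    When the head is concentrated (R high) the trig score dominates;
    when dispersed (R low) the norm score takes over.
    """
    z = lambda x: (x - x.mean()) / (x.std() + 1e-6)  # shared scale
    return R * z(s_trig) + (1.0 - R) * z(key_norms)

def prune(scores: np.ndarray, budget: int) -> np.ndarray:
    """Indices of the `budget` highest-scoring keys to keep in cache."""
    return np.argsort(scores)[-budget:]
```

For a highly concentrated head (R near 1), the ranking is driven almost entirely by the trigonometric-series score, matching the adaptive-weighting behavior described above.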

Results

Matching Full Attention at a Fraction of the Cost

Performance trade-offs on AIME25: TriAttention achieves 2.5x higher throughput and 10.7x KV memory reduction while matching Full Attention accuracy
Performance trade-offs on AIME25 (Qwen3-8B). (A) At equivalent accuracy (40.8%), TriAttention achieves 2.5× higher throughput than Full Attention. (B) TriAttention reduces KV cache memory by 10.7× while matching Full Attention accuracy.

Reasoning Performance on AIME24 & AIME25 (KV Budget = 2048)

Method            AIME24                                   AIME25
                  Qwen3-8B  DS-Llama  DS-Qwen  GPT-OSS     Qwen3-8B  DS-Llama  DS-Qwen  GPT-OSS
Full Attention      57.1      50.4      43.8     69.2        40.8      31.4      34.2     60.0
SnapKV              34.6       5.0      34.6     48.3        20.0       6.7      25.0     36.7
R-KV                25.4      25.8      34.6     49.6        17.5      11.2      23.3     39.2
TriAttention        42.1      33.8      42.5     59.2        32.9      19.6      30.0     49.2
Performance comparison on Qwen3-8B: accuracy vs KV cache budget on three math reasoning benchmarks, plus memory retention benchmark
Performance on Qwen3-8B. (A–C) Accuracy vs. KV cache budget on MATH 500, AIME24, and AIME25. TriAttention consistently outperforms R-KV across all budget levels. (D) Recursive State Query benchmark measuring memory retention under increasing depth. TriAttention performs comparably to Full Attention up to depth 16, while R-KV shows catastrophic degradation.

BibTeX

@inproceedings{mao2026triattention,
  title={TriAttention: Efficient Long Reasoning with Trigonometric KV Compression},
  author={Mao, Weian and Lin, Xi and Huang, Wei and Xie, Yuxin and Fu, Tianfu and Zhuang, Bohan and Han, Song and Chen, Yukang},
  booktitle={Preprint},
  year={2026}
}