ReasonCACHE: Teaching LLMs To Reason Without Weight Updates

FAIR at Meta · MIT · TU Munich



Reasoning with Training KV Cache (ReasonCACHE). ReasonCACHE adapts a frozen backbone by learning a trainable KV cache (prefixes) at each attention layer, compressing demonstrations into a compact cached state. In contrast, (a) in-weight methods adapt by updating model parameters, and (b) in-context learning adapts only through additional exemplar tokens at inference time. (Right) Test accuracy (%) on GPQA-Diamond where ReasonCACHE outperforms both in-context and in-weight baselines, and surpasses full supervised fine-tuning.




Abstract

Can large language models (LLMs) learn to reason without any weight updates, only through in-context learning (ICL)? Acquiring complex reasoning capabilities typically requires learning from many demonstrations. However, standard ICL approaches break down at this scale: attention costs grow quadratically, performance degrades, and learning becomes shallow as the context is overloaded. Due to these limitations, practitioners predominantly rely on in-weight learning (IWL) to induce reasoning. We show that, using Prefix Tuning, LLMs can learn to reason without overloading the context window and without any weight updates. We introduce ReasonCACHE, an instantiation of this mechanism that distills demonstrations into a fixed key-value cache. Empirically, across challenging reasoning benchmarks, including GPQA-Diamond, ReasonCACHE outperforms standard ICL and matches or surpasses IWL approaches. It achieves all of this while being more efficient along three key axes: data, inference, and parameter efficiency. Theoretically, we prove that ReasonCACHE can be strictly more expressive than low-rank weight updates: the latter tie expressivity to the rank of the input, whereas ReasonCACHE bypasses this constraint by injecting key–value pairs directly into the attention mechanism.



The Mechanism Behind Prefix Tuning


As shown in the figure above, Prefix Tuning (PT) adds $m$ trainable key–value vectors at each transformer layer $\ell$, i.e., $(P_K^{(\ell)}, P_V^{(\ell)}) \in \mathbb{R}^{m \times d}$. These prefix keys and values are prepended to the token-derived keys and values $(K^{(\ell)}, V^{(\ell)})$, so each attention layer operates over both prefix and token information. $$ \tilde{K}^{(\ell)} = \begin{bmatrix} P_K^{(\ell)} \\[2pt] K^{(\ell)} \end{bmatrix}, \qquad \tilde{V}^{(\ell)} = \begin{bmatrix} P_V^{(\ell)} \\[2pt] V^{(\ell)} \end{bmatrix} $$ All pretrained weights are frozen; only $\mathcal{P}=\{(P_K^{(\ell)},P_V^{(\ell)})\}_{\ell=1}^L$ is optimized to minimize the standard next-token prediction loss. We use ReasonCACHE to denote Prefix Tuning specialized to reasoning. Note that ReasonCACHE introduces no new adaptation primitive; rather, it studies and demonstrates how learned KV prefixes can scale in-context learning into a reliable and efficient mechanism for reasoning.
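
To make the mechanism concrete, here is a minimal single-head PyTorch sketch of a prefix-tuned attention layer. It is an illustrative simplification rather than the paper's implementation: the class name `PrefixAttention`, the initialization scale, and the omission of multi-head splitting, causal masking, and the output projection are our own choices.

```python
import torch
import torch.nn as nn


class PrefixAttention(nn.Module):
    """Single-head self-attention with m trainable prefix key/value vectors.

    Minimal sketch (hypothetical): the pretrained projections are frozen,
    and only (P_K, P_V) receive gradients.
    """

    def __init__(self, d_model: int, num_prefix: int):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        for proj in (self.W_q, self.W_k, self.W_v):
            proj.weight.requires_grad_(False)          # backbone stays frozen
        # Trainable prefix keys and values, each of shape (m, d).
        self.P_K = nn.Parameter(0.02 * torch.randn(num_prefix, d_model))
        self.P_V = nn.Parameter(0.02 * torch.randn(num_prefix, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        B = x.size(0)
        q = self.W_q(x)                                                    # (B, T, d)
        # Prepend the learned prefix to the token-derived keys and values.
        k = torch.cat([self.P_K.expand(B, -1, -1), self.W_k(x)], dim=1)   # (B, m+T, d)
        v = torch.cat([self.P_V.expand(B, -1, -1), self.W_v(x)], dim=1)   # (B, m+T, d)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.size(-1) ** 0.5, dim=-1)
        return attn @ v                                                    # (B, T, d)
```

In this sketch, only `P_K` and `P_V` would appear in the optimizer's parameter list; the frozen linear projections stand in for the pretrained backbone.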




Experimental Results


1. Effective Reasoning Without Weight Updates

(a) ReasonCACHE Scales In-Context Learning. We first observe a clear limitation of shallow contextual adaptation: both ICL and prompt tuning yield only modest gains, especially on long-horizon reasoning. In contrast, ReasonCACHE consistently outperforms other in-context methods.

(b) ReasonCACHE Outperforms In-Weight Adaptation. We next compare ReasonCACHE to IWL methods. Even under matched parameter budgets, and with pretrained weights frozen, ReasonCACHE matches or beats LoRA and even full SFT, with the gap widening on long-form reasoning. Specifically, ReasonCACHE achieves the best performance on GPQA-Diamond ($41.92\%$), exceeding both LoRA and full SFT.



2. ReasonCACHE is Data Efficient

ReasonCACHE data efficiency results

ReasonCACHE traces the accuracy–data Pareto frontier across nearly four orders of magnitude of training data. In the low-data regime, ReasonCACHE matches or exceeds ICL, while ICL degrades as examples accumulate due to long-context dilution and position effects. IWL methods (e.g., LoRA, SFT) often overfit with limited data. ReasonCACHE is $\approx 59\%$ more data-efficient than LoRA (measured as the data needed to reach 50% accuracy), outperforms it throughout this regime, and closes much of the gap to high-data SFT, all while keeping the backbone frozen. Overall, ReasonCACHE bridges in-context and in-weight adaptation: it inherits ICL's inductive bias via few-shot initialization and gains IWL-style scaling via optimization.



3. ReasonCACHE is Inference Efficient

(a) ReasonCACHE Reduces Prefill Cost. As shown in the figure below (left), ReasonCACHE dominates ICL, outperforming the best ICL configuration by 44.8 points while using 90% less total inference compute. This gap is driven by prefill costs, since ICL incurs quadratic overhead as demonstrations are added, whereas ReasonCACHE replaces them with a short learned prefix.
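
As a rough illustration of why prefill dominates for ICL, the back-of-the-envelope sketch below counts only the attention matmuls during prefill. The layer count, hidden size, and token counts are hypothetical and are not the paper's measured numbers.

```python
def attn_prefill_flops(new_tokens: int, cache_len: int, num_layers: int, d_model: int) -> float:
    """Rough attention FLOPs to prefill `new_tokens` queries against `cache_len`
    key/value positions (QK^T and AV matmuls only; MLPs and projections ignored)."""
    return 2 * 2 * num_layers * new_tokens * cache_len * d_model


# Hypothetical configuration, for illustration only.
num_layers, d_model = 32, 4096
question = 200        # tokens in the test question
demos = 8000          # tokens of in-context demonstrations for ICL
prefix = 64           # learned prefix vectors per layer for ReasonCACHE

icl = attn_prefill_flops(question + demos, question + demos, num_layers, d_model)
cache = attn_prefill_flops(question, prefix + question, num_layers, d_model)
print(f"attention-prefill ratio (ICL / ReasonCACHE): {icl / cache:.0f}x")
```

Because ICL must prefill the demonstrations themselves and attend over them quadratically, while the learned prefix adds only a short constant-length cache, the gap grows rapidly with the number of demonstrations.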

(b) ReasonCACHE Reduces Generation Length. As shown in the figure below (right), on long-form reasoning tasks such as GPQA-Diamond, decoding dominates inference. ReasonCACHE achieves higher accuracy with substantially shorter generations than SFT: it reduces generation length by 34% while improving accuracy by 11%.



4. ReasonCACHE is Parameter Efficient

ReasonCACHE parameter efficiency results

Under matched parameter budgets, ReasonCACHE consistently dominates LoRA across the entire accuracy–parameter curve on GSM8K. For example, to reach a target test accuracy of 50%, it requires 46% fewer parameters than LoRA. ReasonCACHE is more parameter-efficient in the low-budget regime and continues to improve as the budget increases, while LoRA saturates early.
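
For intuition on how such budgets can be matched, the sketch below counts trainable parameters under common conventions: a KV prefix contributes $2 \times m \times d$ parameters per layer, while LoRA of rank $r$ on a $d \times d$ projection contributes $2 \times r \times d$. The backbone shape and the assumption that LoRA adapts two projections per layer are hypothetical, not the paper's exact setup.

```python
def prefix_param_count(num_layers: int, m: int, d_model: int) -> int:
    """Trainable parameters of a learned KV prefix: P_K and P_V of shape (m, d) per layer."""
    return num_layers * 2 * m * d_model


def lora_param_count(num_layers: int, rank: int, d_model: int, matrices_per_layer: int = 2) -> int:
    """Trainable parameters of LoRA at the given rank on `matrices_per_layer`
    square (d x d) projections per layer (A: d x r plus B: r x d each)."""
    return num_layers * matrices_per_layer * 2 * rank * d_model


# Hypothetical backbone: 32 layers, hidden size 4096.
num_layers, d_model = 32, 4096
print(prefix_param_count(num_layers, m=64, d_model=d_model))   # 16,777,216
print(lora_param_count(num_layers, rank=32, d_model=d_model))  # 16,777,216
```

Under these assumptions, a 64-vector prefix per layer matches a rank-32 LoRA applied to two projections per layer; the exact correspondence depends on which weight matrices LoRA adapts.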





Theoretical Insights


We prove that Prefix Tuning (and thus ReasonCACHE) and LoRA face fundamentally different expressivity bottlenecks. Both can introduce arbitrary new directions outside the base model's subspace; the difference is how many independent directions they can realize. LoRA is jointly constrained by (i) the intrinsic rank of the input context and (ii) the adapter rank. Prefix Tuning is constrained only by the number of prefix vectors. Thus, when the input lies in a low-dimensional subspace, Prefix Tuning can be strictly more expressive than LoRA. For details, please refer to our paper.

In the figure below, the input is assumed to have rank $1$. The original value vectors span a 1D subspace $S_X$ (vertical axis); any new information must lie in the orthogonal space $S_X^\perp$ (horizontal plane). Here, (a) the base model is confined to $S_X$. (b) Since LoRA is limited by the rank of the input, it adds at most one new dimension regardless of the adapter rank $r$. (c) Prefix Tuning, even with just two prefixes, can span the full 2D space $S_X^\perp$, which LoRA cannot reach.
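
The toy NumPy check below illustrates the rank argument; it is not the formal construction from the paper. With a rank-1 input, a LoRA-style value update $X(AB)$ has rank at most 1 no matter the adapter rank, while two prefix value vectors contribute two independent directions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, r = 8, 5, 4                                               # model dim, tokens, LoRA rank
X = np.outer(rng.standard_normal(n), rng.standard_normal(d))    # rank-1 input context

# (b) LoRA on the value projection: the update to the values is X @ (A @ B).
A = rng.standard_normal((d, r))
B = rng.standard_normal((r, d))
delta_V = X @ (A @ B)
print(np.linalg.matrix_rank(delta_V))   # 1 -> capped by rank(X), regardless of r

# (c) Prefix Tuning injects value vectors directly, independent of X's span.
P_V = rng.standard_normal((2, d))
print(np.linalg.matrix_rank(P_V))       # 2 -> limited only by the number of prefixes
```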





To cite this work, please use the following BibTeX:

@article{sharut2026reasoncache,
    title={ReasonCACHE: Teaching LLMs To Reason Without Weight Updates},
    author={Gupta, Sharut and Isola, Phillip and Jegelka, Stefanie and 
        Lopez-Paz, David and Ahuja, Kartik and Ibrahim, Mark and Pezeshki, Mohammad},
    journal={arXiv preprint arXiv:2602.02366},
    year={2026}
}