“Defending Against Model Weight Exfiltration Through Inference Verification” by Roy Rinberg

Update: 2025-12-15

Description

Authors: Roy Rinberg, Adam Karvonen, Alex Hoover, Daniel Reuter, Keri Warr

arXiv paper link

One Minute Summary

Anthropic has adopted upload limits to prevent model weight exfiltration. The idea is simple: model weights are very large while text outputs are small, so capping the output bandwidth makes transferring the weights take a long time. The problem is that inference servers now generate an enormous volume of tokens (on the order of ~1 TB of tokens per day), and the output text channel is the one channel you can't easily restrict.
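
As a rough illustration of the bandwidth argument (our own arithmetic with an assumed weight size, not figures from the paper):

```python
# Back-of-the-envelope sketch (ours, not the paper's): how long weight
# exfiltration takes through a capped output channel.

WEIGHT_BYTES = 2 * 10**12        # assumed ~2 TB of model weights (illustrative)
OUTPUT_BYTES_PER_DAY = 10**12    # ~1 TB/day of output tokens, per the post

# If only a small fraction of the output channel is usable for hidden
# payloads, exfiltration time scales inversely with that fraction.
for usable_fraction in (1.0, 0.005):   # full channel vs. the post's <0.5% bound
    days = WEIGHT_BYTES / (OUTPUT_BYTES_PER_DAY * usable_fraction)
    print(f"usable fraction {usable_fraction:.1%}: {days:,.0f} days to move the weights")
```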

Nonetheless, in this work we find that it's possible to dramatically limit the amount of information an adversary can send using those output tokens. This is because LLM inference is nearly deterministic: if you fix the sampling seed and regenerate an output, over ~98% of tokens match exactly. This means an attacker attempting to send secret information via steganography (the practice of embedding hidden messages inside otherwise normal-looking text) has very little entropy in the user channel to work with. We show that steganographic exfiltration can be limited to <0.5% of the total information sent through the user channel (e.g. from 1 TB/day to 5 GB/day), extending exfiltration [...]
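
A minimal sketch of the determinism check this relies on, using a hypothetical `generate` function as a stand-in for the inference server's sampling loop (not the paper's actual verification code):

```python
from typing import Callable, List

def token_match_rate(
    generate: Callable[..., List[int]],
    prompt: str,
    logged_tokens: List[int],
    seed: int,
) -> float:
    """Re-run inference with the same sampling seed and compare token-by-token.

    Under honest fixed-seed sampling, ~98% of regenerated tokens should
    match the logged output exactly.
    """
    regenerated = generate(prompt, seed=seed, max_tokens=len(logged_tokens))
    if not logged_tokens:
        return 1.0
    matches = sum(a == b for a, b in zip(logged_tokens, regenerated))
    return matches / len(logged_tokens)
```

A verifier can then flag responses whose match rate falls well below the empirical baseline, since a steganographic encoder must deviate from the seeded sampler's choices in order to carry information.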

---

Outline:

(01:43) Paper: Verifying LLM Inference to Prevent Model Weight Exfiltration

(04:28) The Key Insight: LLM Inference is mostly Deterministic

(06:42) Not all tokens are equally likely, even under non-determinism

(09:14) The Verification Scheme

(10:58) Headline Results: Information-Theoretic Bounds on Exfiltration

(13:52) Other Applications of Inference Verification

(15:31) Limitations

(16:10) Seeing this work in production

(17:55) Resources

(18:13) How to Cite

The original text contained 2 footnotes which were omitted from this narration.

---


First published:

December 15th, 2025



Source:

https://www.lesswrong.com/posts/7i33FDCfcRLJbPs6u/defending-against-model-weight-exfiltration-through-1


---


Narrated by TYPE III AUDIO.


---

Images from the article:

Figure 1: Comparison of normal vs. compromised inference flows. In normal operation (left), the inference server simply returns model outputs. If compromised (right), the server can exfiltrate model weights by encoding them steganographically in ordinary-looking model responses.
Figure 2: The Token-IPT-DiFR method for estimating the Fixed-Seed Sampling Likelihood. Given a sampled token and seed, we estimate the probability that the token was honestly generated by integrating over the valid non-determinism in the model's probability distribution.
Figure 3: High-level diagram of how Token-DiFR works.
Figure 4: Overview of the verification system. The inference server logs all traffic, and a separate verification server randomly samples and re-verifies outputs.
Figure 5: Pareto frontiers for rank-aware classification thresholds across multiple models (Llama-3.1-8B, Llama-3.2-3B, Qwen3-30B-A3B, Qwen1.5-MoE-A2.7B), using logit-rank = 8 as a filtering function to differentiate suspicious from dangerous tokens. Only Pareto-optimal points are shown. At a 0.01% false positive rate, we can limit exfiltrable information to <0.5% of theoretical capacity.
Figure 6: Token-DiFR reliably detects a simulated bug that, with probability 1% per token, ignores the model logits and instead samples uniformly from the top-k tokens (k ∈ {2, 32}), and otherwise samples correctly. The curves show AUC at 1% FPR as a function of batch size for cross-entropy and Token-DiFR variants.
Figure 7: Sampling rate required to detect at least 1,000 malicious messages with given confidence. Assumes 1TB/day total throughput, where some fraction p is malicious. X-axis shows days to exfiltrate 1TB (e.g., if 10% of traffic is malicious, exfiltrating 1TB takes 10 days).
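
As a simplified illustration of the trade-off Figure 7 depicts (our own back-of-the-envelope model, not the paper's exact calculation): if an attacker must send N malicious messages, a verifier that re-checks a uniformly random fraction s of traffic catches at least one of them with probability 1 - (1 - s)^N, so the rate needed for a target confidence c is s = 1 - (1 - c)^(1/N).

```python
# Simplified sampling-rate calculation (our own model, not the paper's):
# fraction of traffic to re-verify so that, with the given confidence,
# at least one of n_malicious malicious messages is sampled.

def required_sampling_rate(n_malicious: int, confidence: float) -> float:
    return 1 - (1 - confidence) ** (1 / n_malicious)

print(f"{required_sampling_rate(1_000, 0.99):.4%}")  # ~0.46% of traffic suffices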

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts or another podcast app.

