First published benchmarks of Google's TurboQuant algorithm at 7B scale on an RTX 4080. 45 data points, 4 models, real VRAM measurements.
TL;DR
I benchmarked TurboQuant KV cache compression on an RTX 4080 (16 GB). At 4K context on a 3B model, it saves 1 GB of VRAM and runs 74% faster than FP16 (because FP16 starts thrashing). At 8K context on a 0.5B model, it saves 2 GB. The library is pip-installable and works with any HuggingFace model in 3 lines of code.
from turboquant import TurboQuantCache
cache = TurboQuantCache(bits=4)
outputs = model(**inputs, past_key_values=cache, use_cache=True)
Install: pip install turboquant
Repo: github.com/back2matching/turboquant
Background
Google published TurboQuant at ICLR 2026 — an algorithm that compresses LLM KV cache entries from 16 bits to 3-4 bits using random rotation + optimal scalar quantization. Their paper shows impressive results on H100s with 8B models. But nobody had published independent benchmarks on consumer GPUs at that scale.
I built an open-source implementation and ran the benchmarks.
Setup
- GPU: NVIDIA RTX 4080 (16 GB VRAM)
- PyTorch: 2.5.1, CUDA 12.1
- Models: Qwen2.5-7B-Instruct, Qwen2.5-3B-Instruct, Qwen2.5-0.5B-Instruct, StableLM-2-1.6B
- Methodology: FP16 baseline vs TurboQuant 4-bit vs TurboQuant 3-bit, greedy decoding, 100 tokens generated per run. Peak VRAM measured via
torch.cuda.max_memory_allocated().
All results are reproducible:
pip install turboquant
python benchmarks/benchmark_kv.py --model Qwen/Qwen2.5-3B-Instruct --context "512,1024,2048,4096"
Results
Qwen2.5-7B-Instruct (14.5 GB model weights)
The 7B model barely fits on 16 GB. This is where compression matters most.
| Context | FP16 Peak | TQ 4-bit Peak | Saved | FP16 Speed | TQ Speed |
|---|---|---|---|---|---|
| 460 | 14,833 MB | 14,758 MB | 75 MB | 17.7 tok/s | 23.8 tok/s |
| 1860 | 16,659 MB | 16,215 MB | 444 MB | 1.0 tok/s | 1.4 tok/s |
At 1.8K context, FP16 exceeds the 16 GB physical VRAM limit (16,659 > 16,376 MB). The GPU starts swapping and speed collapses to 1 tok/s. TurboQuant saves 444 MB, stays under the limit, and runs 40% faster.
This is the scenario the paper designed for — but nobody had shown it on consumer hardware.
Qwen2.5-3B-Instruct — Context Length Sweep (5.9 GB model)
With a smaller model, we can test longer contexts and see how savings scale.
| Context | FP16 Peak | TQ 4-bit Peak | Saved | FP16 Speed | TQ Speed |
|---|---|---|---|---|---|
| 460 | 6,126 MB | 6,078 MB | 48 MB | 30.7 tok/s | 18.7 tok/s |
| 930 | 6,451 MB | 6,267 MB | 184 MB | 31.4 tok/s | 18.8 tok/s |
| 1860 | 7,359 MB | 6,851 MB | 508 MB | 26.0 tok/s | 17.8 tok/s |
| 3720 | 10,222 MB | 9,206 MB | 1,016 MB | 3.5 tok/s | 6.1 tok/s |
The pattern is clear:
- VRAM savings scale linearly with context length. From 48 MB at 512 tokens to over 1 GB at 4K.
- Speed overhead is constant (~35% slower) at short contexts. The rotation math adds fixed overhead.
- At 4K, FP16 hits memory pressure and TQ wins on speed too. FP16 drops to 3.5 tok/s while TQ maintains 6.1 tok/s — 74% faster.
Qwen2.5-0.5B-Instruct — Long Context (942 MB model)
The tiny model leaves 15 GB for KV cache, so we can push to 8K tokens.
| Context | FP16 Peak | TQ 4-bit Peak | Saved | FP16 Speed | TQ Speed |
|---|---|---|---|---|---|
| 3720 | 4,654 MB | 3,621 MB | 1,033 MB | 31.9 tok/s | 26.5 tok/s |
| 7440 | 13,265 MB | 11,195 MB | 2,070 MB | 17.8 tok/s | 19.8 tok/s |
At 8K context, TurboQuant saves 2 GB and is 11% faster than FP16.
Cross-Architecture: StableLM-2-1.6B
I also tested StableLM to see if results hold across architectures. Short answer: mixed.
| Context | FP16 Peak | TQ 4-bit Peak | VRAM Diff |
|---|---|---|---|
| 460 | 3,433 MB | 3,488 MB | +55 MB |
| 3720 | 5,459 MB | 6,318 MB | +859 MB |
TQ uses MORE memory on StableLM. This exposed a limitation I fixed in v0.2.0 (see below), but the StableLM numbers are from v0.1.0 and haven't been re-tested yet.
The v0.2.0 Story: From Fake Compression to Real Compression
Here's something I discovered during benchmarking that I think is worth sharing.
The initial implementation (v0.1.0) had a subtle architectural problem: after quantizing the KV cache entries to 4-bit indices, it immediately dequantized them back to FP16 and stored the full-size tensors. Like compressing a file and then decompressing it and throwing away the zip. The "VRAM savings" were actually just side effects of different PyTorch memory allocation patterns between code paths.
I rewrote the cache layer in v0.2.0 to store the actual compressed data (uint8 indices + float32 norms) and dequantize on-the-fly when the model needs the data for attention. The dequantization adds ~1-2ms per forward pass — negligible compared to generation time. But now the compression is real.
The 3B results above are from v0.2.0 with genuine compressed storage.
How It Works (30-Second Version)
- Random rotation: Multiply each KV vector by a random orthogonal matrix. This spreads information evenly across coordinates, making them nearly independent.
- Optimal quantization: Each coordinate now follows a Beta distribution. Compute the mathematically optimal quantization levels analytically — no training data or calibration needed.
- Residual window: Keep the most recent 128 tokens in full FP16. Only compress older tokens. Recent tokens matter most for attention.
The codebook is derived from probability theory, not data. Works with any model out of the box. No fine-tuning, no calibration data, no configuration.
Comparison with Alternatives
| Method | Bits | Ecosystem | Real Compression | Speed |
|---|---|---|---|---|
| TurboQuant | 3-4 | Any HuggingFace model | Yes (v0.2.0) | ~35% slower, faster under pressure |
| Ollama q4_0 KV | 4 | Ollama only | Yes | Native speed |
| vLLM FP8 KV | 8 | vLLM only | Yes | Native speed |
| KIVI | 2 | Research code | Yes | Unknown |
TurboQuant is the only pip-installable option for HuggingFace users. If you're already using Ollama or vLLM, their built-in options are faster. But if you're writing custom HuggingFace inference code and hitting VRAM limits, TurboQuant is the simplest path.
What's Next
- Sub-byte bitpacking: Currently indices are stored as uint8 (1 byte each), wasting 4-5 bits per value. Packing will push compression from 1.9x to 3-4x — potentially 4+ GB saved at 8K context.
- Re-benchmark StableLM with v0.2.0 compressed storage.
- Perplexity benchmarks on WikiText-2 for rigorous quality measurement.
Try It
pip install turboquant
from transformers import AutoModelForCausalLM, AutoTokenizer
from turboquant import TurboQuantCache
import torch
model = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen2.5-3B-Instruct", dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
cache = TurboQuantCache(bits=4)
inputs = tokenizer("Hello!", return_tensors="pt").to(model.device)
outputs = model(**inputs, past_key_values=cache, use_cache=True)
Paper: arXiv:2504.19874 (ICLR 2026) Repo: github.com/back2matching/turboquant PyPI: turboquant
This is an independent implementation, not affiliated with Google Research.