Independent TurboQuant Benchmarks on Consumer GPUs

First published benchmarks of Google's TurboQuant algorithm at 7B scale on an RTX 4080. 45 data points, 4 models, real VRAM measurements.

TL;DR

I benchmarked TurboQuant KV cache compression on an RTX 4080 (16 GB). At 4K context on a 3B model, it saves 1 GB of VRAM and runs 74% faster than FP16 (because FP16 starts thrashing). At 8K context on a 0.5B model, it saves 2 GB. The library is pip-installable and works with any HuggingFace model in 3 lines of code.

from turboquant import TurboQuantCache
cache = TurboQuantCache(bits=4)
outputs = model(**inputs, past_key_values=cache, use_cache=True)

Install: pip install turboquant Repo: github.com/back2matching/turboquant

Background

Google published TurboQuant at ICLR 2026 — an algorithm that compresses LLM KV cache entries from 16 bits to 3-4 bits using random rotation + optimal scalar quantization. Their paper shows impressive results on H100s with 8B models. But nobody had published independent benchmarks on consumer GPUs at that scale.

I built an open-source implementation and ran the benchmarks.

Setup

GPU: NVIDIA RTX 4080 (16 GB VRAM)
PyTorch: 2.5.1, CUDA 12.1
Models: Qwen2.5-7B-Instruct, Qwen2.5-3B-Instruct, Qwen2.5-0.5B-Instruct, StableLM-2-1.6B
Methodology: FP16 baseline vs TurboQuant 4-bit vs TurboQuant 3-bit, greedy decoding, 100 tokens generated per run. Peak VRAM measured via torch.cuda.max_memory_allocated().

All results are reproducible:

pip install turboquant
python benchmarks/benchmark_kv.py --model Qwen/Qwen2.5-3B-Instruct --context "512,1024,2048,4096"

Results

Qwen2.5-7B-Instruct (14.5 GB model weights)

The 7B model barely fits on 16 GB. This is where compression matters most.

Context	FP16 Peak	TQ 4-bit Peak	Saved	FP16 Speed	TQ Speed
460	14,833 MB	14,758 MB	75 MB	17.7 tok/s	23.8 tok/s
1860	16,659 MB	16,215 MB	444 MB	1.0 tok/s	1.4 tok/s

At 1.8K context, FP16 exceeds the 16 GB physical VRAM limit (16,659 > 16,376 MB). The GPU starts swapping and speed collapses to 1 tok/s. TurboQuant saves 444 MB, stays under the limit, and runs 40% faster.

This is the scenario the paper designed for — but nobody had shown it on consumer hardware.

Qwen2.5-3B-Instruct — Context Length Sweep (5.9 GB model)

With a smaller model, we can test longer contexts and see how savings scale.

Context	FP16 Peak	TQ 4-bit Peak	Saved	FP16 Speed	TQ Speed
460	6,126 MB	6,078 MB	48 MB	30.7 tok/s	18.7 tok/s
930	6,451 MB	6,267 MB	184 MB	31.4 tok/s	18.8 tok/s
1860	7,359 MB	6,851 MB	508 MB	26.0 tok/s	17.8 tok/s
3720	10,222 MB	9,206 MB	1,016 MB	3.5 tok/s	6.1 tok/s

The pattern is clear:

VRAM savings scale linearly with context length. From 48 MB at 512 tokens to over 1 GB at 4K.
Speed overhead is constant (~35% slower) at short contexts. The rotation math adds fixed overhead.
At 4K, FP16 hits memory pressure and TQ wins on speed too. FP16 drops to 3.5 tok/s while TQ maintains 6.1 tok/s — 74% faster.

Qwen2.5-0.5B-Instruct — Long Context (942 MB model)

The tiny model leaves 15 GB for KV cache, so we can push to 8K tokens.

Context	FP16 Peak	TQ 4-bit Peak	Saved	FP16 Speed	TQ Speed
3720	4,654 MB	3,621 MB	1,033 MB	31.9 tok/s	26.5 tok/s
7440	13,265 MB	11,195 MB	2,070 MB	17.8 tok/s	19.8 tok/s

At 8K context, TurboQuant saves 2 GB and is 11% faster than FP16.

Cross-Architecture: StableLM-2-1.6B

I also tested StableLM to see if results hold across architectures. Short answer: mixed.

Context	FP16 Peak	TQ 4-bit Peak	VRAM Diff
460	3,433 MB	3,488 MB	+55 MB
3720	5,459 MB	6,318 MB	+859 MB

TQ uses MORE memory on StableLM. This exposed a limitation I fixed in v0.2.0 (see below), but the StableLM numbers are from v0.1.0 and haven't been re-tested yet.

The v0.2.0 Story: From Fake Compression to Real Compression

Here's something I discovered during benchmarking that I think is worth sharing.

The initial implementation (v0.1.0) had a subtle architectural problem: after quantizing the KV cache entries to 4-bit indices, it immediately dequantized them back to FP16 and stored the full-size tensors. Like compressing a file and then decompressing it and throwing away the zip. The "VRAM savings" were actually just side effects of different PyTorch memory allocation patterns between code paths.

I rewrote the cache layer in v0.2.0 to store the actual compressed data (uint8 indices + float32 norms) and dequantize on-the-fly when the model needs the data for attention. The dequantization adds ~1-2ms per forward pass — negligible compared to generation time. But now the compression is real.

The 3B results above are from v0.2.0 with genuine compressed storage.

How It Works (30-Second Version)

Random rotation: Multiply each KV vector by a random orthogonal matrix. This spreads information evenly across coordinates, making them nearly independent.
Optimal quantization: Each coordinate now follows a Beta distribution. Compute the mathematically optimal quantization levels analytically — no training data or calibration needed.
Residual window: Keep the most recent 128 tokens in full FP16. Only compress older tokens. Recent tokens matter most for attention.

The codebook is derived from probability theory, not data. Works with any model out of the box. No fine-tuning, no calibration data, no configuration.

Comparison with Alternatives

Method	Bits	Ecosystem	Real Compression	Speed
TurboQuant	3-4	Any HuggingFace model	Yes (v0.2.0)	~35% slower, faster under pressure
Ollama q4_0 KV	4	Ollama only	Yes	Native speed
vLLM FP8 KV	8	vLLM only	Yes	Native speed
KIVI	2	Research code	Yes	Unknown

TurboQuant is the only pip-installable option for HuggingFace users. If you're already using Ollama or vLLM, their built-in options are faster. But if you're writing custom HuggingFace inference code and hitting VRAM limits, TurboQuant is the simplest path.

What's Next

Sub-byte bitpacking: Currently indices are stored as uint8 (1 byte each), wasting 4-5 bits per value. Packing will push compression from 1.9x to 3-4x — potentially 4+ GB saved at 8K context.
Re-benchmark StableLM with v0.2.0 compressed storage.
Perplexity benchmarks on WikiText-2 for rigorous quality measurement.

Try It

pip install turboquant

from transformers import AutoModelForCausalLM, AutoTokenizer
from turboquant import TurboQuantCache
import torch

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-3B-Instruct", dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")

cache = TurboQuantCache(bits=4)
inputs = tokenizer("Hello!", return_tensors="pt").to(model.device)
outputs = model(**inputs, past_key_values=cache, use_cache=True)

Paper: arXiv:2504.19874 (ICLR 2026) Repo: github.com/back2matching/turboquant PyPI: turboquant

This is an independent implementation, not affiliated with Google Research.