[Changed quote to Tether CEO, 6/2/26]
I wrote in March about Google’s TurboQuant for compressing data in memory for AI applications, focusing on data center applications. In that article, I said that TurboQuant is a compression algorithm that addresses the challenge of memory overhead in key-value storage for AI models with zero accuracy loss.
I also said that by enabling AI with lower memory and storage requirements, we make that memory and storage even more useful, and this will likely increase AI workflows, particularly on-premises. This could increase the memory and storage demand for implementing local AI inference. With today’s costs for digital memory and storage, this technology could enable useful AI implementations at much lower costs.
Recently a company called Tether introduced a version of TurboQuant that can be used on consumer devices like laptops and phones to process documents and extend AI conversations locally, using local memory and storage rather than public cloud-based resources. Tether TurboQuant is an open-source AI memory compression algorithm that reduces the key-value (KV) cache of large language models (LLMs) by 3-6 times, depending upon the workload. The figure below, from Tether, shows a 5 times reduction in required memory using TurboQuant.
Data resource requirements with and without TurboQuant
Tether
TurboQuant compresses the KV cache used during inference sessions but doesn’t change the trained LLM’s weights. This matters because the KV cache keeps the keys and values for past tokens in memory, and it grows with every token and every active session as users interact with the model. It can become a major constraint on throughput, latency, concurrency and maximum context length.
Because compression reduces the memory that a given amount of KV cache data needs, a system with a fixed, limited amount of memory can hold more KV cache information and thus make better, more useful inferences.
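To make the memory arithmetic concrete, here is a rough sketch in Python. It uses the common back-of-the-envelope estimate for KV cache size (one key vector and one value vector per token, per layer, per KV head). The model shape and the 4 GiB budget are hypothetical values chosen for illustration, not figures from Tether or Google, and the 5x ratio is simply the reduction cited in this article.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV cache size in bytes.

    The leading 2 counts one key and one value vector per token,
    per layer, per KV head. bytes_per_value=2 corresponds to 16-bit floats.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


def max_tokens(budget_bytes, layers, kv_heads, head_dim,
               bytes_per_value=2, compression=1.0):
    """Largest token count whose (compressed) KV cache fits in budget_bytes."""
    per_token = kv_cache_bytes(1, layers, kv_heads, head_dim, bytes_per_value)
    # A compression ratio of N lets the same budget hold N times more tokens.
    return int(budget_bytes * compression // per_token)


# Hypothetical model shape and memory budget (assumptions for illustration).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
BUDGET = 4 * 1024**3  # 4 GiB set aside for the KV cache

per_token = kv_cache_bytes(1, LAYERS, KV_HEADS, HEAD_DIM)
baseline_tokens = max_tokens(BUDGET, LAYERS, KV_HEADS, HEAD_DIM)
compressed_tokens = max_tokens(BUDGET, LAYERS, KV_HEADS, HEAD_DIM, compression=5.0)

print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"Tokens that fit uncompressed: {baseline_tokens}")
print(f"Tokens that fit at 5x compression: {compressed_tokens}")
```

Under these assumed numbers, the same memory budget holds five times as many tokens of context. Real savings depend on the model, the workload and the implementation.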
Because compression and decompression require computation and thus take some time, there is some performance hit from using TurboQuant. Tether CEO, Paolo Ardoino, says that “TurboQuant achieves roughly 5x KV-cache memory reduction while keeping output quality nearly identical to full precision. The primary tradeoff is lower prompt processing/prefill throughput, which ranges from roughly 30–60% of baseline depending on context length and hardware. Token generation speed is much less affected, remaining around 94–98% of full-precision performance. Quality impact is minimal, with only a -0.03% perplexity delta and LongBench results remaining very close to the f16 baseline.”
In other words, there is some noticeable delay in prompt processing using TurboQuant, but the overall accuracy of the LLM results is close to that with an uncompressed KV cache. This means that using TurboQuant, local AI can handle longer conversations, larger files, more context and heavier workloads with smaller and less expensive memory and storage.
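A simple way to see what those throughput figures mean for a user is to estimate the total response time as prompt processing time plus token generation time. The Python sketch below does that. The baseline speeds (1,000 prompt tokens per second and 20 generated tokens per second) and the request size are hypothetical assumptions for illustration only. The scaling factors are the two ends of the ranges in the quote above.

```python
def response_seconds(prompt_tokens, output_tokens, prefill_tps, decode_tps,
                     prefill_factor=1.0, decode_factor=1.0):
    """Estimate response time as prefill time plus generation time.

    prefill_factor and decode_factor scale the baseline throughput,
    e.g. 0.3 means prompt processing runs at 30% of baseline speed.
    """
    prefill = prompt_tokens / (prefill_tps * prefill_factor)
    decode = output_tokens / (decode_tps * decode_factor)
    return prefill + decode


# Hypothetical request and baseline speeds (assumptions, not measurements).
PROMPT, OUTPUT = 8000, 500
PREFILL_TPS, DECODE_TPS = 1000.0, 20.0

baseline = response_seconds(PROMPT, OUTPUT, PREFILL_TPS, DECODE_TPS)
# Low end of the quoted ranges: 30% prefill, 94% generation speed.
slow_case = response_seconds(PROMPT, OUTPUT, PREFILL_TPS, DECODE_TPS, 0.30, 0.94)
# High end of the quoted ranges: 60% prefill, 98% generation speed.
fast_case = response_seconds(PROMPT, OUTPUT, PREFILL_TPS, DECODE_TPS, 0.60, 0.98)

print(f"Baseline: {baseline:.1f} s")
print(f"Compressed, low end of ranges: {slow_case:.1f} s")
print(f"Compressed, high end of ranges: {fast_case:.1f} s")
```

With a long prompt and a short answer, most of the added time comes from prompt processing. For short prompts and long answers, the difference shrinks because token generation speed is nearly unchanged.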
This opens up opportunities to use strong AI models in more applications and can help startups deploy AI with less infrastructure, including on consumer-level PCs and other devices. It also enables developers to run larger-context models locally, saving cloud service costs and avoiding exposure of proprietary data to the cloud.
Tether’s TurboQuant enables useful and powerful local AI applications on consumer devices at much lower costs and without exposing private data to the cloud.






















