Abstract: One important bottleneck when visualizing large data sets is the data transfer between processor and memory. Cache-aware (CA) and cache-oblivious (CO) algorithms take into consideration the ...
Abstract: This paper describes FFTS, a discrete Fourier transform (DFT) library that achieves state-of-the-art performance using a new cache-oblivious algorithm implemented with run-time ...
First open-source implementation of TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate (Zandieh, Daliri, Hadian, Mirrokni — Google Research / Google DeepMind / NYU, April 2025).
TurboQuant is a compression algorithm introduced by Google Research (Zandieh et al.) at ICLR 2026 that solves the primary memory bottleneck in large language model inference: the key-value (KV) cache.
Running a 70-billion-parameter large language model for 512 concurrent users can consume 512 GB of cache memory alone, nearly four times the memory needed for the model weights themselves. Google on ...
As Large Language Models (LLMs) expand their context windows to process massive documents and intricate conversations, they encounter a brutal hardware reality known as the "Key-Value (KV) cache ...