When you ask an LLM a question, it doesn't write the whole answer at once. It generates one token (roughly a word or a piece of a word) at a time, and for every single token, it reads through all its weights (billions of numbers ...
If the draft was right, you accept those tokens. If not, you correct and continue.

The actual problem it solves

Autoregressive decoding is memory-bound, not compute-bound. Your GPU sits idle while ...
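
The draft-then-check loop described above (accept the draft tokens that were right, correct the first wrong one, continue) can be sketched in a few lines. This is a toy illustration under loud assumptions, not a real implementation: `target_model` and `draft_model` are hypothetical stand-ins that map a list of integer tokens to the next token, decoding is greedy, and the verification loop runs position by position, whereas a real system would score all drafted positions in one forward pass of the large model.

```python
from typing import List

Token = int


def target_model(ctx: List[Token]) -> Token:
    """Hypothetical 'large' model: a deterministic toy rule, not a real LLM."""
    return sum(ctx) % 10


def draft_model(ctx: List[Token]) -> Token:
    """Hypothetical 'small' model: agrees with the target except when the
    context length is a multiple of 3, where it guesses wrong on purpose."""
    guess = sum(ctx) % 10
    return guess if len(ctx) % 3 else (guess + 1) % 10


def speculative_step(ctx: List[Token], k: int) -> List[Token]:
    """Draft k tokens cheaply, then keep only what the target agrees with."""
    # 1. The draft model proposes k tokens, one after another.
    draft: List[Token] = []
    for _ in range(k):
        draft.append(draft_model(ctx + draft))

    # 2. The target model checks each drafted position.
    #    (Assumption: done sequentially here for clarity only.)
    accepted: List[Token] = []
    for tok in draft:
        expected = target_model(ctx + accepted)
        if tok == expected:
            accepted.append(tok)       # draft was right: accept it
        else:
            accepted.append(expected)  # draft was wrong: correct and stop
            return accepted

    # 3. Every draft token matched, so the target contributes one more token.
    accepted.append(target_model(ctx + accepted))
    return accepted


def generate(prompt: List[Token], n_new: int, k: int = 4) -> List[Token]:
    """Generate n_new tokens using repeated draft-and-verify steps."""
    out = list(prompt)
    while len(out) - len(prompt) < n_new:
        out.extend(speculative_step(out, k))
    return out[: len(prompt) + n_new]


def generate_plain(prompt: List[Token], n_new: int) -> List[Token]:
    """Baseline: ordinary one-token-at-a-time decoding with the target only."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_model(out))
    return out


if __name__ == "__main__":
    print(generate([1, 2], 12))
    print(generate_plain([1, 2], 12))
```

In this greedy sketch, every emitted token is one the target model would have chosen itself, so `generate` and `generate_plain` return the same sequence. The draft only changes how many tokens are produced per verification step (between 1 and `k + 1`), not what they are.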