Gemini Diffusion: how this new generation of models writes everything at once!
Imagine an AI that doesn’t type word by word but instead starts with a messy blob and carves out fully coherent paragraphs—all at once. That’s Google’s Gemini Diffusion, and it’s turning heads with speeds topping 1,600 tokens per second.
In our latest video, we explore:
- How it mimics image diffusion to generate text
- Real-world demos, including instant app generation and multi-language translation
- Benchmarks where it holds its own against much bigger models
- Why it might be the key to unlocking massive context windows with blazing speed
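The core idea behind "mimicking image diffusion" is that generation starts from a fully masked (noisy) sequence and the model reveals many tokens in parallel over a handful of refinement steps, rather than emitting one token at a time. Here is a minimal toy sketch of that denoising loop; the `target` argument stands in for what a trained denoiser would actually predict, and all names are illustrative, not Gemini's API:

```python
import random

MASK = "<mask>"

def toy_denoise(target, steps=4, seed=0):
    """Toy sketch of diffusion-style text generation: begin with an
    all-masked sequence and reveal tokens in parallel over a few steps.
    `target` stands in for a trained denoiser's predictions."""
    rng = random.Random(seed)
    seq = [MASK] * len(target)
    for step in range(steps):
        # Find positions still masked; reveal a batch of them in parallel,
        # revealing everything that remains on the final step.
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        k = len(masked) if step == steps - 1 else max(1, len(masked) // 2)
        for i in rng.sample(masked, k):
            seq[i] = target[i]  # a real model would sample a token here
    return seq

print(toy_denoise("the cat sat on the mat".split()))
```

Because each step updates many positions at once, the number of model passes depends on the step count, not the sequence length, which is where the speed advantage comes from.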
Average Tokens per Second for LLMs (2024–2025)
These are typical average output speeds for LLMs under standard conditions (actual throughput varies with model size, backend hardware, and prompt complexity):
| Model | Avg. Tokens/sec (output) | Notes |
| --- | --- | --- |
| GPT-4 (API) | ~30–60 | Slower, optimized for quality; GPT-4 Turbo is faster |
| GPT-3.5 (API) | ~60–100 | Snappier than GPT-4 |
| Claude 2 | ~20–40 | Focuses on coherence and safety |
| Claude 3 (Opus/Sonnet) | ~50–100 | Sonnet is noticeably faster than Opus |
| Mistral 7B (local) | ~60–120 | Blazing fast on good hardware |
| Gemini 1.5 | ~50–100 | Comparable to Claude 3 in speed |
| Gemini Diffusion (preview) | ~1,000–1,600 | An order of magnitude faster, thanks to parallel generation |
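To put those numbers in perspective, a quick back-of-the-envelope calculation (using rough mid-range figures from the table, which are assumptions, not measurements) shows what the gap means for a long answer:

```python
def seconds_for(tokens, tokens_per_sec):
    """Rough latency estimate: time to emit `tokens` output tokens
    at a steady `tokens_per_sec` throughput."""
    return tokens / tokens_per_sec

# A 2,000-token answer at illustrative mid-range speeds:
gpt4_latency = seconds_for(2000, 45)         # roughly 44 seconds
diffusion_latency = seconds_for(2000, 1300)  # roughly 1.5 seconds

print(f"GPT-4: ~{gpt4_latency:.0f}s, Gemini Diffusion: ~{diffusion_latency:.1f}s")
```

At these rates, a response that takes most of a minute to stream token by token arrives in about a second and a half, which is why parallel generation is so interesting for large context windows.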