Inference-Time Scaling vs. Training Compute

We’re witnessing a shift in AI where scaling during inference time takes center stage, driving real-world applications. As Sutton pointed out in the Bitter Lesson, there are two techniques that scale limitlessly with compute power: learning and search. Now, the spotlight turns toward search.

Why Is Inference-Time Scaling so Crucial?

Smarter Models, Not Just Bigger Ones

You don't need a massive model to do the reasoning itself. In today's large models, many parameters are dedicated to memorizing facts, particularly for tasks like trivia QA. But what if we separate reasoning from knowledge? By using a smaller “reasoning core” that knows how to leverage external tools (like a browser or a code verifier), we could drastically cut pre-training compute without sacrificing performance.
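
As a rough sketch of what that separation could look like: `call_llm`, `web_search`, and `run_code_verifier` below are hypothetical placeholders for a model call, a browser, and a code verifier, not any particular API.

```python
# Hypothetical sketch: a small "reasoning core" that delegates to external
# tools instead of storing every fact in its weights.

def answer(question: str, call_llm, web_search, run_code_verifier) -> str:
    # Let the reasoning core decide which tool the question needs.
    plan = call_llm(f"Decide which tool to use for: {question}\n"
                    "Reply with 'search' or 'code'.")
    if "search" in plan.lower():
        # Knowledge lives outside the model: look the fact up.
        evidence = web_search(question)
        return call_llm(f"Answer using this evidence:\n{evidence}\n\nQ: {question}")
    else:
        # Generate a candidate program and let an external verifier check it.
        program = call_llm(f"Write Python that solves: {question}")
        ok, output = run_code_verifier(program)
        return output if ok else call_llm(f"Fix this program:\n{program}")
```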

Shift to Inference Compute

The future lies in shifting a significant amount of compute to serving inference rather than focusing solely on pre- or post-training. Think of LLMs as text-based simulators. By running multiple strategies and simulations—akin to Monte Carlo Tree Search (MCTS) used by AlphaGo—the model can ultimately converge on good solutions. This technique emphasizes exploitation of the model’s knowledge through repeated sampling, but it comes at a cost: latency and the need for substantial inference compute power.
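
Here is a minimal best-of-n sketch of that idea, assuming a hypothetical `generate` sampler and a `score` verifier; the cost is plain to see, since every extra candidate is another full inference call.

```python
# Minimal sketch of inference-time scaling via repeated sampling.
# `generate` and `score` are hypothetical stand-ins for a sampling call
# (temperature > 0) and a reward/verifier function.

def best_of_n(prompt: str, generate, score, n: int = 64) -> str:
    # Spend inference compute on many rollouts of the same problem...
    candidates = [generate(prompt, temperature=0.8) for _ in range(n)]
    # ...then exploit the model's knowledge by keeping the best-scoring one.
    return max(candidates, key=score)
```

Without a reliable verifier, majority voting over final answers is a common stand-in for `score`.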

Academia Is Catching Up

OpenAI likely figured out these inference scaling laws a while back, and now academia is catching up. Just recently, two groundbreaking papers were published that highlight the power of inference-time scaling:

  • Large Language Monkeys: Scaling Inference Compute with Repeated Sampling by Brown et al. shows how DeepSeek-Coder’s coverage on SWE-bench Lite jumped from 15.9% with one sample to 56% with 250 samples, surpassing even Claude 3.5 Sonnet (see the pass@k sketch after this list).
  • Scaling LLM Test-Time Compute Optimally Can Be More Effective than Scaling Model Parameters by Snell et al. revealed that PaLM 2-S outperformed a model 14x its size on MATH, purely by optimizing test-time search.
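
For context, results like these are typically reported with the unbiased pass@k estimator of Chen et al. (2021): the probability that at least one of k samples is correct, given c correct samples out of n drawn. The n and c values below are illustrative only, not figures from either paper.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): chance that at least
    one of k samples is correct, given c correct out of n samples drawn."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Coverage grows quickly with the sample budget even at a low per-sample rate:
print(pass_at_k(n=250, c=40, k=1))   # 0.16, a single attempt
print(pass_at_k(n=250, c=40, k=10))  # ~0.34 with just 10 attempts
```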

The Challenges of Going Beyond Benchmarks

While this approach is groundbreaking, productionizing a system like o1 comes with unique challenges. In real-world reasoning problems, how do we decide when to stop searching? What’s the optimal reward function? How do we determine when to call tools like a code interpreter in the loop? These questions remain open, particularly once you factor in the compute cost of every additional rollout, and OpenAI’s research post didn’t provide many specifics on this front.
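
To make those questions concrete, here is one hypothetical shape a production search loop could take, with a fixed compute budget and a reward threshold as the stopping rule. `propose`, `reward`, and `run_interpreter` are stand-ins for illustration, not a description of how o1 actually works.

```python
# Illustrative sketch of the open questions above: when to stop, what to
# reward, and when to call tools. All callables here are hypothetical.

def search(problem: str, propose, reward, run_interpreter,
           budget: int = 32, threshold: float = 0.9):
    best, best_score = None, float("-inf")
    for _ in range(budget):                    # when to stop: a hard compute budget...
        trace = propose(problem, prior=best)   # extend or revise the current best trace
        tool_output = run_interpreter(trace)   # when to call tools: here, the interpreter
        if tool_output is not None:            # returns output only if the trace has code
            trace += tool_output
        score = reward(problem, trace)         # the reward function is the open question
        if score > best_score:
            best, best_score = trace, score
        if best_score >= threshold:            # ...or an early exit once we're confident
            break
    return best
```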

The Power of a Data Flywheel

o1 (formerly codenamed Strawberry) also has the potential to create a powerful data flywheel. If the correct answer is found, the entire search trace can become a mini dataset, training the model on both positive and negative rewards. Over time, this reinforces the reasoning core, similar to how AlphaGo’s value network improved as MCTS generated increasingly refined training data.
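
A minimal sketch of how such traces might be packaged into training data, assuming a hypothetical `verify` function that checks whether a trace reaches the correct answer:

```python
# Sketch of the data-flywheel idea: once search finds a verified answer,
# every explored trace becomes a labeled training example. Field names and
# `verify` are assumptions for illustration, not a known pipeline.

def traces_to_dataset(problem: str, traces: list[str], verify) -> list[dict]:
    examples = []
    for trace in traces:
        examples.append({
            "prompt": problem,
            "completion": trace,
            # Positive reward for traces that reach a verified answer,
            # negative for dead ends; both signals reinforce the reasoning core.
            "reward": 1.0 if verify(problem, trace) else -1.0,
        })
    return examples
```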

How do you feel about the shift toward inference-time scaling over massive pre-training? What trade-offs do you think we’ll face as we push more towards exploitation through search rather than exploration in model training? Share your thoughts below!
