
Small Language Models Outperform Large Counterparts with Test-Time Scaling

Small language models (SLMs) can outperform large language models (LLMs) on complex reasoning tasks, according to a recent study from Shanghai AI Laboratory. The study shows that an SLM with 1 billion parameters can exceed the performance of a 405-billion-parameter LLM on challenging math benchmarks.

Understanding Test-Time Scaling

Test-time scaling (TTS) improves model performance by allocating extra compute during inference. There are two main approaches: internal TTS, where the model is trained to generate a long chain of thought before answering, and external TTS, where a separate component guides or verifies the model's output, allowing existing models to tackle reasoning tasks without further fine-tuning.

Methods of External TTS

External TTS setups typically include a policy model, which generates answers, and a process reward model (PRM) that evaluates these answers. Common methods include:

  • Best-of-N: The policy model generates several complete answers, and the PRM selects the best one.
  • Beam Search: The model generates the answer step by step, sampling multiple candidates at each step; the PRM scores the partial answers, and only the top-scoring ones are extended.
  • Diverse Verifier Tree Search (DVTS): A variant of beam search that splits the search into several independent subtrees, explored separately to increase the diversity of candidate answers.
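The first two methods can be sketched in a few lines of Python. The sketch below is illustrative, not the study's code: `generate`, `extend`, and `score` are stand-ins for the policy model and the PRM, which in practice are neural networks.

```python
def best_of_n(generate, score, prompt, n=8):
    # Sample n complete answers from the policy model, then return
    # the candidate the reward model scores highest.
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)


def beam_search(extend, score, prompt, beam_width=4, samples_per_step=4, max_steps=3):
    # Grow answers one step at a time: sample several continuations of
    # each surviving partial answer, score the partials with the reward
    # model, and keep only the top beam_width for the next step.
    beams = [[]]
    for _ in range(max_steps):
        expanded = [partial + [extend(prompt, partial)]
                    for partial in beams
                    for _ in range(samples_per_step)]
        expanded.sort(key=score, reverse=True)
        beams = expanded[:beam_width]
    return beams[0]
```

DVTS follows the same pattern as beam search but runs several independent subtrees, merging their results at the end to keep the candidate pool diverse.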

Choosing the Right TTS Strategy

The effectiveness of a TTS strategy depends on the policy model, the PRM, and the difficulty of the problem. Smaller policy models tend to benefit more from search-based methods, and for them the best method varies with problem difficulty, while larger models perform best with best-of-N across difficulty levels.
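As a rough illustration, the selection rule described above could be encoded as a simple dispatch function. The parameter cutoff and per-difficulty choices below are hypothetical placeholders, not values reported by the study:

```python
def choose_tts_strategy(policy_params_billions, difficulty):
    # Illustrative heuristic: large policy models favor best-of-N at
    # every difficulty level, while for small models the best search
    # method varies with problem difficulty. The 70B cutoff and the
    # per-difficulty assignments are assumptions, not study results.
    if policy_params_billions >= 70 or difficulty == "easy":
        return "best-of-N"
    if difficulty == "medium":
        return "beam search"
    return "DVTS"  # hard problems: favor diverse tree search
```

In practice such a rule would be tuned empirically per model family and benchmark rather than hard-coded.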

Performance Insights

Developers can leverage these findings to create efficient TTS strategies tailored to specific models and problem types. The research indicates that a Llama-3.2-3B model using the optimal TTS strategy can outperform a Llama-3.1-405B model on challenging math benchmarks, demonstrating that smaller models can excel with the right compute strategy. Similar results emerged with other models, showing that SLMs can achieve superior outcomes with significantly less computational effort.

Future Research Directions

While the current study focuses on math benchmarks, researchers plan to explore other reasoning tasks, including coding and chemistry, to further validate the effectiveness of compute-optimal TTS strategies.

For more information, visit the original article on VentureBeat.