Strong LLM performance on mathematical reasoning benchmarks may reflect memorization of benchmark data rather than true reasoning ability.
This study addresses concerns about dataset contamination in LLMs by introducing Grade School Math 1000 (GSM1k), a new benchmark designed to match the style and complexity of the widely used GSM8k benchmark for mathematical reasoning. The results show that several model families, including Phi and Mistral, suffer accuracy drops of up to 13% on GSM1k, indicating systematic overfitting, while frontier models such as GPT and Claude show minimal signs of it. A positive correlation (Spearman's r² = 0.32) was found between a model's probability of generating examples from GSM8k and its performance gap between GSM8k and GSM1k, suggesting that some models have partially memorized GSM8k.
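A rank correlation of this kind can be computed with `scipy.stats.spearmanr`. The sketch below uses entirely synthetic, illustrative numbers (not the paper's data) for two hypothetical per-model quantities: the average log-likelihood each model assigns to GSM8k examples, and its GSM8k-minus-GSM1k accuracy gap in percentage points.

```python
from scipy.stats import spearmanr

# Synthetic per-model values, purely for illustration:
# higher (less negative) log-likelihood on GSM8k text ~ more memorization,
# larger accuracy gap ~ more overfitting to GSM8k.
log_likelihood = [-1.9, -1.5, -1.2, -1.7, -1.0, -1.4]
accuracy_gap = [1.0, 9.0, 4.0, 2.5, 13.0, 5.5]

# Spearman's rho correlates the *ranks* of the two variables,
# so it captures any monotone relationship, not just a linear one.
rho, p_value = spearmanr(log_likelihood, accuracy_gap)
print(f"Spearman rho = {rho:.2f}, rho^2 = {rho**2:.2f}")
```

Squaring rho gives the r² figure quoted above; a positive value indicates that models which assign higher likelihood to GSM8k text also tend to show larger GSM8k/GSM1k gaps.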
The study evaluated LLMs on both the GSM8k and GSM1k benchmarks using a standardized evaluation setup for fair comparison. Open-source models were tested using EleutherAI's LM Evaluation Harness, and closed-source models were queried through the LiteLLM library. All models received identical 5-shot prompts drawn from the GSM8k train set. Results indicate that some lesser-known models, particularly those near the top of the Open LLM Leaderboard, performed significantly worse on GSM1k, suggesting they may have over-optimized for GSM8k, in line with Goodhart's law.
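A 5-shot setup of this kind concatenates five worked examples before the target question. The following is a minimal sketch of that prompt assembly; the two example problems are hypothetical stand-ins, not actual GSM8k items, and the exact prompt template used in the study may differ.

```python
# Hypothetical few-shot examples in a GSM8k-like format, where "####"
# precedes the final numeric answer. The real setup uses five examples
# drawn from the GSM8k train set.
few_shot_examples = [
    {"question": "Ali has 3 apples and buys 2 more. How many apples does he have?",
     "answer": "Ali starts with 3 apples and buys 2 more, so 3 + 2 = 5. #### 5"},
    {"question": "A pen costs $2. How much do 4 pens cost?",
     "answer": "Each pen costs $2, so 4 pens cost 4 * 2 = 8. #### 8"},
]

def build_prompt(examples, target_question):
    """Concatenate worked examples, then the target question, leaving
    the final answer blank for the model to complete."""
    parts = [f"Question: {ex['question']}\nAnswer: {ex['answer']}"
             for ex in examples]
    parts.append(f"Question: {target_question}\nAnswer:")
    return "\n\n".join(parts)

prompt = build_prompt(few_shot_examples,
                      "Sam reads 10 pages a day. How many pages in 3 days?")
print(prompt)
```

Using one fixed prompt for every model keeps the comparison fair: any GSM8k/GSM1k accuracy difference then reflects the model, not the prompt.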
Analysis
- Certain model families, notably Phi and Mistral, consistently overfit to GSM8k, performing better on it than on GSM1k across almost all model versions and scales. Other model families, including Yi, Xwin, Gemma, and CodeLlama, show the same pattern to a lesser degree.
- Many models show no signs of overfitting and perform similarly on both GSM8k and GSM1k. Two possible explanations are: 1) these models possess strong enough reasoning abilities to generalize to new problems even if they have seen GSM8k data, and 2) frontier model builders may take greater care to avoid data contamination.
- Overfit models are still capable of reasoning. The fact that a model is overfit does not mean it is poor at reasoning, only that it is not as good as the benchmark indicates. In fact, many of the most overfit models remain capable of reasoning about and solving novel problems.
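The per-family comparison above boils down to one number per model: the GSM8k-minus-GSM1k accuracy gap. A minimal sketch of flagging suspect models by that gap, using invented model names, scores, and a 5-point threshold chosen purely for illustration:

```python
# Hypothetical benchmark accuracies (percent); not the paper's numbers.
scores = {
    "model-a": {"gsm8k": 80.0, "gsm1k": 78.5},
    "model-b": {"gsm8k": 75.0, "gsm1k": 64.0},
    "model-c": {"gsm8k": 88.0, "gsm1k": 86.0},
}

def overfit_gap(s):
    """Accuracy drop from GSM8k to GSM1k, in percentage points."""
    return s["gsm8k"] - s["gsm1k"]

# Flag models whose gap exceeds an (arbitrary) 5-point threshold,
# sorted from most to least overfit.
flagged = sorted(
    (name for name, s in scores.items() if overfit_gap(s) > 5.0),
    key=lambda name: -overfit_gap(scores[name]),
)
print(flagged)
```

Note that a flagged model may still solve many GSM1k problems outright; the gap measures inflated GSM8k performance, not an absence of reasoning ability.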