This year, Numina teamed up with Hugging Face to compete in the inaugural AI Math Olympiad (AIMO) Progress Prize. The aim of the competition was to fine-tune open LLMs to solve international mathematical olympiad training problems of high-school difficulty. We're happy to report that our model, NuminaMath 7B TIR, won the competition, solving 29 of the 50 problems on the private test set 🥳!
This article describes the Numina initiative and the technical details behind our winning solution. If you'd rather skip straight to testing the model on your own math problems, check out the demo and have some fun.
Here we go!
Introducing Numina - The Open AI For Math Program
Math touches each of us a little differently!
Everyone encounters math on a daily basis, and children are exposed to it even before they can read. One of the greatest mathematicians of all time, Srinivasa Ramanujan, was born in 1887 to an ordinary family in India and was largely self-taught. Everyone has some relationship with math, ranging from a pastime to a livelihood.
There's no denying that math is vital to humanity; everything in our world, from iPhones to nuclear power plants, is rooted in it. And even purely application-oriented math problems are fascinating in their own right.
Pure math transcends the intellect and is like an infinite ocean in which only the mind can wander.
That's why, when we launched Numina, going open source and open data was the natural choice. We believe AI deserves to play as broad a role in the advancement of mathematics as human intelligence does. If computers are bicycles for the mind, then AI is its engine, opening new horizons for the Ramanujans of our time.
Numina was founded in late 2023 by a group of people passionate about artificial intelligence and mathematics (Jia Li, Yann Fleureau, Guillaume Lample, Stan Polu, and Hélène Evain), with the support of Mistral AI, and was inspired by the AI Math Olympiad (AIMO) competition launched by Alex Gerko and XTX Markets.
In early 2024, two LLM fine-tuning experts from Hugging Face (👋 Lewis Tunstall and Ed Beeching) joined the Numina team to tackle the 2024 AIMO Progress Prize. We subsequently received support from General Catalyst, and by March 2024 Numina had assembled a team of top talent from around the world.
With the team in place, it was time to take on AIMO!
The AIMO Prize
Every year, high school students from around the world compete in the International Mathematical Olympiad (IMO), a competition of six challenging problems spanning algebra, geometry, and number theory. To give you a sense of the difficulty, here is one of last year's problems:
In November 2023, the AIMO Prize was launched to drive the open development of AI models that excel at mathematical reasoning. A grand prize of $5 million will go to whoever can train an AI model capable of winning an IMO gold medal. Alongside the grand prize, AIMO introduced a series of progress prizes to reward milestone work toward this ultimate goal. The first Progress Prize was held as a Kaggle competition, with problems that are simpler than those of the IMO but at the level of IMO preselection. Below is an example problem which, as you can see, is somewhat easier than the IMO problem above, but still tricky for LLMs:
Let \(k, l > 0\) be parameters. The parabola \(y = kx^2 - 2kx + l\) intersects the line \(y = 4\) at two points \(A\) and \(B\), which are a distance of 6 apart. What is the sum of the squares of the distances from \(A\) and \(B\) to the origin?
The problems are split into two sets of 50, which form a public and a private leaderboard, and are hidden from participants. They are comparable in difficulty to the AMC12 and AIME exams, and all have integer answers. The private leaderboard determines the final ranking. Participants may submit twice per day and may only use open-weight models released before February 23rd. Each submission is allocated either a P100 GPU or 2xT4 GPUs and up to 9 hours to solve all 50 problems.
Given these rules and constraints, choosing the right strategy was crucial for developing our winning solution.
Our winning solution for the first Progress Prize
After iterating over multiple rounds throughout the competition, our solution for the first Progress Prize consisted of three main components:
- A recipe to fine-tune DeepSeekMath-Base 7B into a "reasoning agent" that can solve mathematical problems by combining natural language reasoning with the use of a Python REPL to compute intermediate results.
- A new decoding algorithm for tool-integrated reasoning (TIR) with code execution feedback to generate candidate solutions at inference time.
- A variety of internal validation sets used to guide model selection and avoid overfitting to the public leaderboard.
We used several open-source libraries to train our models, notably TRL, PyTorch, vLLM, and DeepSpeed. On one node of 8 x H100 GPUs, each model took about 10 hours to train.
Training recipe
Our fine-tuning recipe is largely based on the MuMath-Code paper, in which model training is split into two stages:
The two-stage training approach from the MuMath-Code paper
- Stage 1: fine-tune the base model on a large, diverse dataset of natural language math problems and solutions, where each solution is templated with chain-of-thought (CoT) to facilitate reasoning.
- Stage 2: fine-tune the model from Stage 1 on a synthetic dataset of tool-integrated reasoning, where each math problem is decomposed into a sequence of rationales, Python programs, and their outputs. Here we followed Microsoft's ToRA paper and prompted GPT-4 to produce solutions in the ToRA format with code execution feedback. Fine-tuning on this data yields a reasoning agent that can solve mathematical problems by combining natural language reasoning with the use of a Python REPL to compute intermediate results (see figure below).
Figure from the ToRA paper describing the tool-integrated reasoning format we used to train our models.
For both stages, we used "full fine-tuning", where all model weights are updated during backpropagation. In other words, we did not use parameter-efficient techniques such as LoRA or DoRA, since there is no extensive experimental evidence that they can match the performance of full fine-tuning. We used the "packing" feature of TRL's SFTTrainer to concatenate multiple samples into fixed-size blocks. All models were trained with gradient checkpointing enabled and sharded with DeepSpeed ZeRO-3 to ensure that the weights, gradients, and optimizer states fit in the available VRAM. The main hyperparameters used in each stage are listed in the table below, followed by a minimal training sketch.
| | Stage 1 | Stage 2 |
|---|---|---|
| learning rate | 2.0E-5 | 2.0E-5 |
| total batch size | 32 | 32 |
| block size | 2048 | 1024 |
| number of epochs | 3 | 4 |
| learning rate scheduler | cosine | cosine |
| warmup ratio | 0.1 | 0.1 |
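For readers who want to set up something similar, here is a minimal sketch of what a Stage 1 run could look like with TRL's SFTTrainer. It is an illustration under assumptions, not our exact training script: the dataset path and text column are placeholders, and some argument names vary slightly between TRL versions.

```python
# Minimal Stage 1 SFT sketch using TRL (hyperparameters from the table above).
# Launch with `accelerate launch --config_file deepspeed_zero3.yaml train_sft.py`
# to shard weights, gradients, and optimizer state with DeepSpeed ZeRO-3.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("path/to/numina-cot-dataset", split="train")  # placeholder path

args = SFTConfig(
    output_dir="numina-math-7b-cot",
    dataset_text_field="text",       # placeholder column holding "problem + CoT solution"
    learning_rate=2e-5,
    num_train_epochs=3,
    per_device_train_batch_size=4,   # 8 GPUs x 4 = total batch size of 32
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    gradient_checkpointing=True,
    bf16=True,
    packing=True,                    # concatenate samples into fixed-size blocks
    max_seq_length=2048,             # block size (called `max_length` in newer TRL)
    logging_steps=10,
)

trainer = SFTTrainer(
    model="deepseek-ai/deepseek-math-7b-base",
    args=args,
    train_dataset=dataset,
)
trainer.train()
```

The Stage 2 run follows the same pattern, with the TIR dataset, a block size of 1024, and 4 epochs.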
For our first submission, we used the DeepSeek 7B model fine-tuned only on Stage 1, but its performance was quite limited, with a best maj@32 score of just 8/50 on the public leaderboard. Abdur Rafae's public notebook prompted us to consider adding code execution to the training recipe. Initially we focused on the MMOS (Mix of Minimal Optimal Sets) dataset. We found that MMOS improved performance, but the best maj@32 score on the public leaderboard was still only 16/50; at the time we suspected this was because MMOS contains only single-turn solutions (i.e., the model generates a single Python program, which is not enough for the hard problems). We later realized that MMOS was a misnomer and that the Kaggle notebook was actually using the DeepSeekMath 7B RL model, which is capable of multi-step reasoning and code execution.
After this, we focused on producing a dataset similar to the one used by the DeepSeekMath Instruct/RL models; this, combined with the MuMath-Code recipe, led to significant improvements.
Let's take a look at how we constructed these datasets.
Required data
In constructing the datasets, we drew extensively on the methods of DeepSeekMath and related work, and extended them significantly. We produced fine-tuning datasets of several hundred thousand problem-solution pairs, covering topics from high school mathematics to competition-level mathematics. We will fully open source this dataset in the coming weeks, and we may also examine how our recipe scales to larger models. For more details on the dataset construction, see our upcoming technical report.
For this Progress Prize specifically, we built two datasets to fine-tune our model.
Chain of Thought
This dataset consists of several hundred thousand problems, each with a solution written in chain-of-thought style. The sources range from Chinese high school math exercises to US and international mathematics olympiad problems. The data were mainly collected from online exam paper PDFs and math discussion forums.
The processing steps were as follows:

- OCR the original PDFs.
- Segment them into problem-solution pairs.
- Translate them into English.
- Realign them into a chain-of-thought reasoning format.
- Format them to produce the final answer.
Tool-Integrated Reasoning
Tool-integrated reasoning (TIR) played a crucial role in this competition. However, collecting and annotating such data is both expensive and time-consuming. To address this, we selected about 60,000 problems from the Numina dataset, focusing on those with numerical answers, most of which are integers.
We then built a pipeline that uses GPT-4 to generate ToRA-like reasoning paths, executing the code and feeding back results until a complete solution is produced. We filtered out solutions whose final answer did not match the reference answer and repeated this process three times to ensure accuracy and consistency. This iterative approach allowed us to generate high-quality TIR data efficiently.
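To make the pipeline concrete, here is a rough sketch of what such a generation loop could look like. The system prompt, helper functions, and filtering logic are simplified stand-ins, not the exact ones we used.

```python
# Rough sketch of a ToRA-style generation loop with GPT-4 and code execution feedback.
# Prompts, helpers, and filtering are illustrative simplifications of our actual pipeline.
import re
import subprocess
from openai import OpenAI

client = OpenAI()

def run_code(code: str) -> str:
    """Execute a Python block in a subprocess and return stdout or the error."""
    proc = subprocess.run(["python", "-c", code], capture_output=True, text=True, timeout=30)
    return proc.stdout if proc.returncode == 0 else proc.stderr

def generate_tir_solution(problem: str, reference_answer: str, max_turns: int = 4) -> str | None:
    messages = [
        {"role": "system", "content": "Solve the problem by interleaving reasoning and Python code. "
                                      "Put code in ```python blocks and the final answer in \\boxed{}."},
        {"role": "user", "content": problem},
    ]
    transcript = ""
    for _ in range(max_turns):
        reply = client.chat.completions.create(model="gpt-4", messages=messages).choices[0].message.content
        transcript += reply
        boxed = re.findall(r"\\boxed\{(.*?)\}", reply)
        if boxed:  # complete solution: keep it only if the answer matches the reference
            return transcript if boxed[-1].strip() == reference_answer else None
        code_blocks = re.findall(r"```python\n(.*?)```", reply, re.DOTALL)
        if not code_blocks:
            return None
        output = run_code(code_blocks[-1])  # feed execution output (or traceback) back to the model
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user", "content": f"```output\n{output}\n```"})
        transcript += f"\n```output\n{output}\n```\n"
    return None
```

The `reference_answer` check corresponds to the filtering step described above; surviving transcripts become Stage 2 training samples.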
For reference, here are the MATH benchmark scores of our Stage 1 model, NuminaMath-7B-CoT, and Stage 2 model, NuminaMath-7B-TIR, compared with other open and proprietary models.
| Model | MATH (%) |
|---|---|
| **Chain-of-thought reasoning** | |
| GPT-4 (2023) | 42.5 |
| GPT-4o | 76.6 |
| Claude 3.5 Sonnet | 71.1 |
| DeepSeekMath-7B-Instruct | 46.8 |
| DeepSeekMath-7B-RL | 51.7 |
| NuminaMath-7B-CoT | 56.3 |
| **Tool-integrated reasoning** | |
| DeepSeekMath-7B-Instruct | 57.4 |
| DeepSeekMath-7B-RL | 58.8 |
| NuminaMath-7B-TIR | 68.2 |
Performance of each model on the MATH benchmark. Unless explicitly stated, all scores are obtained with zero-shot greedy decoding.
Taming high variance with Self-Consistent Tool-Integrated Reasoning (SC-TIR)
As other participants noted, the competition posed several challenges in both model submission and evaluation:
- The evaluation API serves the problems in random order, so strategies like early stopping produce high variance: a submission might hit many hard problems first, leaving little time for the rest (and vice versa).
- Most recent innovations in LLM inference target the latest GPUs, so standard methods such as Flash Attention 2 are not applicable to T4 GPUs; similarly, newer data types like bfloat16 are not supported on these older GPUs. This prompted us to explore post-training quantization methods such as AWQ and GPTQ.
Initially, we used Abdur Rafae's public notebook for our submissions, but found the high variance to be a serious problem. To address it, we took a different approach based on tool-integrated reasoning:
- Copy each problem N times to form a batch for vLLM, where N can be thought of as the number of candidates for majority voting.
- Sample from and decode each of these N inputs until a complete block of Python code is generated.
- Execute each Python code block and append its output, including the stack trace (if any), after the code.
- Repeat this M times to obtain N generations of depth M, allowing the model to self-correct code errors using the stack trace. If a sample fails to produce sensible output (e.g., an incomplete code block), it is discarded.
- Post-process the candidate answers and use majority voting to select the final answer.
Our winning submission used N=48 and M=4. Since increasing either parameter did not improve performance, we chose the smallest values that satisfied the time constraints. In effect, this algorithm augments the self-consistency of CoT with tool-integrated reasoning (shown below). A minimal sketch of the decoding loop follows.
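The sketch below is an illustration only: the helper functions, sampling parameters, and stop sequence are assumptions made for clarity, not our exact Kaggle inference code.

```python
# Minimal sketch of the SC-TIR decoding loop with vLLM (not the exact competition code).
import re
import subprocess
from collections import Counter
from vllm import LLM, SamplingParams

def extract_code(text: str) -> str | None:
    """Return the last ```python ...``` block, if any."""
    blocks = re.findall(r"```python\n(.*?)```", text, re.DOTALL)
    return blocks[-1] if blocks else None

def extract_answer(text: str) -> str | None:
    """Return the last \\boxed{...} answer, if any."""
    answers = re.findall(r"\\boxed\{(.*?)\}", text)
    return answers[-1] if answers else None

def execute_python(code: str) -> str:
    """Run a code block in a subprocess and capture stdout or the traceback."""
    proc = subprocess.run(["python", "-c", code], capture_output=True, text=True, timeout=30)
    return proc.stdout if proc.returncode == 0 else proc.stderr

N, M = 48, 4  # candidates per problem and tool-use rounds, as in our winning submission
llm = LLM(model="AI-MO/NuminaMath-7B-TIR")
params = SamplingParams(temperature=0.8, max_tokens=1024, stop=["```output"])  # illustrative settings

def solve(problem: str) -> str | None:
    candidates = [problem] * N  # replicate the problem N times for majority voting
    answers = []
    for _ in range(M):
        outputs = llm.generate(candidates, params)
        next_round = []
        for prompt, out in zip(candidates, outputs):
            text = prompt + out.outputs[0].text
            answer = extract_answer(text)
            if answer is not None:          # a complete solution was produced
                answers.append(answer)
                continue
            code = extract_code(text)
            if code is None:                # incomplete generation: drop this candidate
                continue
            result = execute_python(code)   # feed the execution output (or traceback) back in
            next_round.append(text + f"```output\n{result}\n```\n")
        candidates = next_round
        if not candidates:
            break
    # majority vote over the candidate answers
    return Counter(answers).most_common(1)[0][0] if answers else None
```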
We found that our SC-TIR algorithm produced more robust results with significantly reduced variance, both on our internal evaluation sets and on the public leaderboard.
One technical detail worth mentioning is that we found it useful to quantize the models to 8-bit precision. There were three reasons for this:
- Uploading models to Kaggle Hub was very slow, and compressing them doubled the upload speed.
- T4 GPUs don't support bfloat16, and converting to float16 would cause the model to lose performance. Converting to float32 is not possible because it exceeds the available GPU memory.
- In addition, a 16-bit model consumes about 32 GB of VRAM just to load the weights. On 2xT4 GPUs this would have required careful handling of the KV cache to run fast, and we found it worthwhile to trade a little model precision for speed.
We used AutoGPTQ together with a calibration dataset to quantize our model. In practice, this led to a small drop in accuracy but provided the best compromise given the constraints that the Kaggle platform imposes on model evaluation.
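For reference, quantizing with AutoGPTQ can look roughly like the sketch below. The calibration texts and quantization settings shown here are illustrative, not our exact configuration.

```python
# Minimal sketch of 8-bit GPTQ quantization with AutoGPTQ.
# Calibration texts are placeholders; in practice, use representative math problems.
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_id = "AI-MO/NuminaMath-7B-TIR"
tokenizer = AutoTokenizer.from_pretrained(model_id)

quantize_config = BaseQuantizeConfig(bits=8, group_size=128, desc_act=False)  # illustrative settings
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)

# Calibration samples: a few representative prompts tokenized into model inputs.
calibration_texts = ["What is the remainder when 2^10 is divided by 7?"]  # placeholder
examples = [tokenizer(text, return_tensors="pt") for text in calibration_texts]

model.quantize(examples)
model.save_quantized("numinamath-7b-tir-gptq-8bit")
```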
Avoiding the Curse of Overfitting
Overfitting to the public leaderboard is a common risk in Kaggle competitions, and even more so when the test set contains only 50 problems. In addition, the rules allow at most two submissions per day, which made a robust internal validation setup critical to our development pace. According to the AIMO team, the test problems are of intermediate difficulty, between AMC12 and AIME level, and every answer is an integer.
To guide model selection, we used four internal validation sets to measure model performance on math questions of varying difficulty. To avoid potential data contamination in the base model, we selected questions from AMC12 (2022, 2023) and AIME (2022, 2023, 2024) to create two internal validation datasets.
- AMC (83 problems). We took all problems from AMC12 2022 and AMC12 2023 and kept those with integer answers, which yielded a set of 83 problems. This validation set is meant to mimic the private test set on Kaggle, since we knew from the competition description that the problems are at least this difficult. We found our models could solve about 60-65% of these problems. To measure variance, we ran each evaluation with 5-10 different seeds and typically observed fluctuations of about 1-3% with our SC-TIR algorithm.
- AIME (90 problems). We took all problems from AIME 2022, AIME 2023, and AIME 2024 to measure how well our models handle hard problems and to observe the most common failure modes. As above, each evaluation used 5-10 seeds to measure variance.
Because the AMC/AIME validation sets are small, model performance on them is susceptible to noise, much like on the public leaderboard. To better assess our models, we also evaluated them on a subset of the MATH test set (which contains 5,000 problems). We kept only problems with integer answers, to simplify majority voting and to mimic olympiad-style evaluation. This gave us two additional validation sets:
- MATH Level 4 (754 questions)
- MATH Level 5 (721 questions)
Using these four validation sets, we could select the most promising models at different stages of training and narrow down the choice of hyperparameters. We found that combining small but representative validation sets with larger ones was useful in this AIMO competition, since every submission is subject to sampling randomness.
Other ideas we've tried
As mentioned above, we tried a few other approaches during the competition that we eventually dropped in favor of the MuMath-Code recipe:
- Training a pure CoT model and evaluating it using majority voting
- Training an MMOS model to solve problems with a single step of Python
We also tried applying Kahneman-Tversky Optimization (KTO) to new completions sampled from the SFT model, with an approach somewhat similar to OrcaMath, namely:
- Sample 4 completions per problem with the SFT model, interleaving reasoning and code execution. We used the Stage 2 SFT dataset as prompts.
- Extract the answers and compare them to the labeled answers. If correct, mark the sample as positive, otherwise mark it as negative.
- Apply KTO to the SFT model on this dataset (a minimal sketch of this step follows the list).
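The sketch below shows how that final step could be set up with TRL's KTOTrainer. The dataset path, hyperparameters, and argument names are illustrative (and may vary with your TRL version), not our exact configuration.

```python
# Minimal KTO sketch with TRL's KTOTrainer (illustrative values, not our exact setup).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import KTOConfig, KTOTrainer

model_id = "path/to/stage2-sft-model"  # placeholder for the Stage 2 SFT checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Each row: {"prompt": problem, "completion": sampled solution, "label": answer_was_correct}
dataset = load_dataset("path/to/kto-dataset", split="train")  # placeholder path

args = KTOConfig(
    output_dir="numina-math-7b-kto",
    beta=0.1,            # assumption: strength of the implicit KL penalty
    learning_rate=5e-7,  # assumption: a typical KTO learning rate, not our exact value
    per_device_train_batch_size=4,
)

trainer = KTOTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    tokenizer=tokenizer,  # `processing_class` in newer TRL versions
)
trainer.train()
```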
We found that this form of on-policy KTO produced a slightly better model than the SFT model (a few percentage points better on internal evaluations), scoring 27/50 on the public leaderboard.
One nice feature of KTO is that you can track the implicit reward during training, which is a great help for debugging. For example, the figure below shows one of our successful training runs, where the reward for correct answers increases over the course of training while the reward for incorrect answers is suppressed.
But, due to time constraints, we did not end up applying this method to the final SFT model. If we had done it, we might have gotten 1-2 more questions right!
We also tried applying our SFT recipe to larger models such as InternLM-20B, CodeLlama-33B, and Mixtral-8x7B, but found that (a) the DeepSeek 7B model is hard to beat thanks to its continued pretraining on math, and (b) inference was very slow on 2xT4 GPUs, and we hit a number of mysterious timeouts whose root cause we could not track down.
We also ran an unsuccessful experiment with reinforcement learning, specifically the PPO and REINFORCE-Leave-One-Out (RLOO) algorithms, combined with code execution feedback and shaped rewards for writing code and for producing correct or incorrect answers. We applied this to the DeepSeekMath 7B RL model. While we saw some promising reward curves, we did not see any significant improvement in performance. Given that online methods like RLOO are bottlenecked by text generation and slow to iterate on, we abandoned reinforcement learning in favor of KTO.
On the inference side, we also ran the following experiments:
- Using a static KV cache and torch compilation. We found these could speed up generation in native transformers code by 2-3x, but we hit a variety of mysterious errors on Kaggle's T4 GPUs, mostly due to the lack of support for model sharding with torch compilation in accelerate (a sketch of the basic setup follows this list).
- Various model merging techniques, such as DARE, TIES, and WARP. Here we used mergekit to merge the SFT and KTO models, or to merge the SFT model with the public DeepSeekMath models. Overall, we found that these merges led to significant regressions on our internal evaluations, and we did not have time to explore this more deeply.
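As referenced in the first bullet above, the static KV cache plus torch compilation setup follows the standard transformers pattern; a rough sketch is shown below, with an illustrative model choice and generation settings rather than our exact Kaggle code.

```python
# Minimal sketch of static KV cache + torch compilation in transformers
# (assumes a recent transformers version and a model architecture that supports static caching).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AI-MO/NuminaMath-7B-TIR"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

# Use a static KV cache so the forward pass has fixed shapes, then compile it.
model.generation_config.cache_implementation = "static"
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

inputs = tokenizer("What is the remainder when 2^10 is divided by 7?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```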
The Future of Numina - Seeking Contributors and Partners!
Following Numina's initial success in winning the 2024 AIMO Progress Prize, our ambitions have grown: we aim to advance the development of both artificial and human intelligence in mathematics. You can learn more about our initiative on our website, and feel free to drop us a line at contact@.
Numina aims to preserve the open nature of mathematics, and we welcome talented people and supporters from around the world who want to push mathematics further with the help of AI!
Acknowledgements
We thank Thomas Wolf and Leandro von Werra for making the collaboration between Numina and Hugging Face possible. We also thank Hugo Larcher for his help in our use of the Hugging Face GPU cluster, Colin Raffel for his suggestions on model merging methods, and Omar Sanseviero for his feedback on blog posts.
We would also like to thank General Catalyst and the Beijing International Center for Mathematical Research at Peking University for their support since the beginning of this project.
Finally, we would like to thank the AIMO team for launching such an exciting and inspiring competition!
Original article (in English): /blog/winning-aimo-progress-prize
Original authors: Yann Fleureau, Li Jia, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Costa Huang, Kashif Rasul
Translator: Matrix Yao (Yao Weifeng), Deep Learning Engineer at Intel, working on applying transformer-family models to multi-modal data and on training and inference of large-scale models.