Link to Hugging Face repo: http://huggingface.co/sarapatel/llama31-8b-grpo-gsm8k-run1

Link to Wandb training report: https://api.wandb.ai/links/saranshpatel-abes-engineering-college-ghaziabad/9dasolv9

TL;DR

What is GRPO, and why does it matter?

In January 2025, DeepSeek released a paper titled DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. The paper introduced DeepSeek-R1-Zero — a model trained purely with reinforcement learning: no supervised fine-tuning, no human-labeled reasoning chains. The central finding was that the model spontaneously learned to allocate more thinking time to hard problems. They called this the "aha moment."


The reinforcement learning algorithm behind this is called Group Relative Policy Optimization (GRPO), introduced earlier by the DeepSeek team in their math reasoning work. Unlike standard RLHF which requires training a separate reward model on human preferences, GRPO is simpler — it generates a group of completions for each prompt, scores them, and uses the relative scores within that group to update the model.
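The group-relative scoring step described above can be sketched in a few lines. This is a minimal, illustrative implementation (the function name and the binary reward values are my own, not from the paper or any library): each completion's reward is normalized against the mean and standard deviation of its own group, so "advantage" simply means "better or worse than the group average."

```python
# Hypothetical sketch of GRPO's group-relative advantage computation.
# Names and reward values are illustrative, not from a specific library.

def group_relative_advantages(rewards):
    """Normalize each reward against its group's mean and std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]

# Example: 4 completions sampled for one GSM8K prompt,
# scored 1.0 if the final answer is correct, 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
# Correct completions get a positive advantage, incorrect ones a negative one.
```

Because the baseline is just the group mean, no separate value model needs to be trained — that is the main simplification over PPO.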

A Comparison of GRPO and PPO

Group Relative Policy Optimization (GRPO) (Shao et al., 2024) is the reinforcement learning algorithm that we adopt to train DeepSeek-R1-Zero and DeepSeek-R1. It was originally proposed to simplify the training process and reduce the resource consumption of Proximal Policy Optimization (PPO) (Schulman et al., 2017), which is widely used in the RL stage of LLMs (Ouyang et al., 2022). The figure below gives an overall comparison between GRPO and PPO.

image.png

Fig: Demonstration of PPO and GRPO. GRPO foregoes the value model, instead estimating the advantages from group scores.

For each question $q$, GRPO samples a group of outputs $\{o_1, o_2, \cdots, o_G\}$ from the old policy $\pi_{\theta_{old}}$ and then optimizes the policy model $\pi_\theta$ by maximizing the following objective:

$$
\mathcal{J}_{GRPO}(\theta) = \mathbb{E}\left[q \sim P(Q),\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O \mid q)\right]
\frac{1}{G}\sum_{i=1}^{G}\left(\min\left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{old}}(o_i \mid q)}A_i,\ \operatorname{clip}\left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{old}}(o_i \mid q)},\ 1-\varepsilon,\ 1+\varepsilon\right)A_i\right) - \beta\, \mathbb{D}_{KL}\left(\pi_\theta \,\|\, \pi_{ref}\right)\right)
$$

where $\varepsilon$ and $\beta$ are hyper-parameters, and $A_i$ is the advantage, computed from the group of rewards $\{r_1, r_2, \ldots, r_G\}$:

$$
A_i = \frac{r_i - \operatorname{mean}(\{r_1, r_2, \cdots, r_G\})}{\operatorname{std}(\{r_1, r_2, \cdots, r_G\})}
$$

Setup: