This blog focuses on giving you an intuition for why Group Relative Policy Optimization (GRPO) can fine-tune LLMs even though its reward function doesn't have to be differentiable.
Deep learning models—like the ones powering LLMs or your favorite image generator—are trained using gradient-based optimization algorithms, stuff like gradient descent. These algorithms are basically the GPS for training: they figure out how to adjust the model’s parameters (those millions of tiny weights) to make the loss—the “how wrong are we?” score—as small as possible. To do that, they need to calculate the gradient of the loss function with respect to those parameters. Think of the gradient as a little arrow saying, “Nudge this weight up a bit, and the loss goes down,” or “Tweak that one down, it’s messing things up.”
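To make that concrete, here's a minimal sketch of one gradient descent step in PyTorch (the toy model, data, and learning rate are made up purely for illustration):

```python
import torch

# A toy "model": a single linear layer with a handful of weights.
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Made-up inputs and targets, just so there is something to fit.
x = torch.randn(8, 4)
y = torch.randn(8, 1)

# Forward pass: compute the "how wrong are we?" score.
loss = torch.nn.functional.mse_loss(model(x), y)

# Backward pass: autograd computes d(loss)/d(parameter) for every weight,
# i.e. the little arrows telling us which way to nudge each one.
loss.backward()

# Take one step against the gradient to shrink the loss, then clear gradients.
optimizer.step()
optimizer.zero_grad()
```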
Now, here’s the catch: gradients only exist if the loss function is differentiable. That just means it’s smooth enough that we can measure how it changes when we tweak the inputs—no sudden jumps or breaks where the slope goes haywire.
If the loss function isn’t differentiable, gradient-based optimization—the backbone of deep learning—falls apart. Those little arrows we rely on? Gone. Without them, we can’t tell the model which way to tweak its weights, and it’s stuck, unable to learn. Sure, there are derivative-free optimization methods out there—like random search or evolutionary algorithms. But those methods are way less efficient, especially when you’re dealing with the crazy high-dimensional parameter spaces in deep learning models.
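To see the failure mode, here's a small sketch (again with made-up numbers) where the "loss" is a step-like, rule-based score. PyTorch will happily run backward, but the gradients carry no information:

```python
import torch

model = torch.nn.Linear(4, 1)
x = torch.randn(8, 4)
out = model(x)

# A step-like, rule-based score: round each output to the nearest integer and sum.
# The function is flat almost everywhere, so its derivative is zero wherever it exists.
hard_score = torch.round(out).sum()
hard_score.backward()

print(model.weight.grad)  # all zeros: no arrow pointing toward a better model

# A hard boolean check like (out > 0.5).float().sum() is even worse: it detaches
# from the autograd graph entirely, and calling .backward() on it raises an error.
```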
Here’s something that might trip you up: in Group Relative Policy Optimization (GRPO), reward functions don’t have to be differentiable—think stuff like the length of a generated response or whether it even contains an answer. So how are we still using GRPO to fine-tune LLMs, which live and breathe gradients?
TLDR:
“The advantage (normalized reward) acts as a constant scaling factor in the GRPO Loss”
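To preview what that means (a deliberately simplified form that ignores GRPO's clipping and KL terms), the gradient of the GRPO-style objective looks roughly like this:

$$
\nabla_\theta J(\theta) \approx \mathbb{E}\left[\, A_i \, \nabla_\theta \log \pi_\theta(o_i \mid q) \,\right],
\qquad
A_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}
$$

Here $r_i$ is the reward for response $o_i$ in a group of $G$ responses sampled for the same prompt $q$. We never differentiate through $r_i$; the gradient flows only through $\log \pi_\theta$, which is always differentiable, and $A_i$ just scales it as a plain number.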
GRPO, cooked up by the DeepSeek team, is a twist on reinforcement learning (RL) that’s all about efficiency and stability when fine-tuning LLMs. It builds on the idea of PPO—using human feedback to guide the model—but it’s got some unique tricks up its sleeve, especially with how it handles rewards. Let’s break it down.
In GRPO, the LLM is your policy ($\pi_\theta$), spitting out responses based on prompts. You've got a reward model ($R_\phi$) trained on human feedback, or a plain rule-based reward function, that scores those responses.
Unlike traditional RL methods like PPO, GRPO skips the critic (a separate model estimating future rewards), which cuts compute costs by about 50%. Instead, it uses a clever way to judge how good a response is by comparing it to a group of other responses for the same prompt. That's where the "group relative" part comes in.
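Here's a minimal sketch of that group-relative idea (illustrative only: no clipping, no KL penalty, and dummy log-probabilities standing in for a real LLM forward pass). The rewards come from an arbitrary non-differentiable function, get normalized within the group, and then act as plain scaling constants on the differentiable log-probabilities:

```python
import torch

def score(response: str) -> float:
    # Any rule-based, non-differentiable reward: here, a made-up length check.
    return 1.0 if len(response.split()) <= 20 else 0.0

# A group of G sampled responses for one prompt (placeholders for illustration).
responses = ["short answer", "a much longer rambling answer " * 5, "medium length answer here"]

# Log-probabilities of those responses under the current policy.
# In a real setup these come from the LLM's forward pass; here they are dummy
# values that require gradients so we can watch backprop work.
log_probs = torch.tensor([-4.2, -9.8, -6.1], requires_grad=True)

# 1. Score each response with the non-differentiable reward function.
rewards = torch.tensor([score(r) for r in responses])

# 2. Group-relative advantage: normalize rewards within the group.
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# 3. The advantages carry no gradient history (detach just makes that explicit),
#    so the gradient flows only through log_probs. Better-than-average responses
#    get pushed up, worse-than-average ones get pushed down.
loss = -(advantages.detach() * log_probs).mean()
loss.backward()

print(log_probs.grad)  # well-defined gradients, even though score() is not differentiable
```

Because the advantages are just numbers, it makes no difference how `score()` computes them; gradient descent only needs the log-probabilities to be differentiable.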
Here's where it gets interesting: the reward function in GRPO doesn't need to be differentiable. It could be something simple and rule-based, like:
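One possible version, with rules invented purely for illustration:

```python
def reward(response: str) -> float:
    """Rule-based reward: nothing here is differentiable, and it doesn't need to be."""
    score = 0.0
    # Reward responses that actually contain a final answer.
    if "answer:" in response.lower():
        score += 1.0
    # Reward brevity: prefer responses under 100 words.
    if len(response.split()) <= 100:
        score += 0.5
    return score
```

GRPO only ever uses these scores as numbers to rank responses within a group and scale their log-probabilities, so if-statements, string matching, or even calls to external checkers are all fair game.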