Open In Colab

What’s RLHF and why the hype?

ChatGPT has taken the world by storm and for a good reason - it’s an extremely powerful tool equipped not only with vast knowledge, but also with an eloquence that appeals to the human user. Large language models (LLMs) are not restricted to ChatGPT of course, but it’s no accident that OpenAI’s tool is the most popular one. Quality-wise, it is easier to converse with than most other LLMs.

Why is ChatGPT that good? One of the core reasons is a novel approach used in training it, called Reinforcement Learning from Human Feedback, or RLHF for short. As the name suggests, it’s a technique for improving the model’s conversational capabilities by involving humans in the training process, and it helped elevate ChatGPT to its current approachable form. What’s more, ChatGPT’s few worthy competitors - Bard, Claude and Llama - all use RLHF as well.

RLHF can be summarised as a way of guiding the model’s learning by having humans directly rate its outputs; the model then adjusts itself to receive the highest rating possible. Across multiple iterations, this makes its responses more user friendly.

I’d like to dedicate this tutorial to introducing you to the RLHF method, but before we dive deeper, let’s cover some theoretical basics necessary to understand what’s going on - supervised learning, reinforcement learning and how an LLM is created from scratch.

Crash course through basics

Supervised learning vs Reinforcement learning

When we train machine learning models, in our case neural nets, we can do so through unsupervised, supervised or reinforcement learning methods, the latter two being relevant for this tutorial.

Supervised learning occurs when we have a dataset with ground truth labels. Our model learns to predict those labels. The difference between its predictions and the ground truth is recorded as the loss. An algorithm known as gradient descent allows us to minimize this loss. Thus, across multiple iterations, the net adjusts itself and becomes better at predicting the labels.
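
To make this concrete, here is a minimal sketch of a supervised training loop in PyTorch. The tiny model, the random toy data and the hyperparameters are purely illustrative - the point is just to show the cycle of predicting labels, measuring the loss and letting gradient descent shrink it.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                                  # a tiny stand-in "network"
loss_fn = nn.CrossEntropyLoss()                           # gap between predictions and labels
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)   # plain gradient descent

inputs = torch.randn(32, 10)                              # toy dataset: 32 samples, 10 features
labels = torch.randint(0, 2, (32,))                       # toy ground-truth labels

for step in range(100):                                   # multiple iterations
    predictions = model(inputs)
    loss = loss_fn(predictions, labels)                   # the recorded loss
    optimizer.zero_grad()
    loss.backward()                                       # gradients of the loss w.r.t. the weights
    optimizer.step()                                      # adjust the net to predict labels better
```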

Reinforcement learning is a method that can be used when we don’t have the luxury of a labeled dataset, but we still know the end goal that we want to achieve. We allow our model a set of actions it can perform. Its goal is to learn a policy of using those actions to achieve the goal given to it. Just like in supervised learning, the model tries multiple times. At each step it receives a reward which varies depending on how well the model is meeting its goal.

Similarly to supervised learning, we have an algorithm that aids us in updating the model into superior versions. In this case we’re not minimizing a loss, but maximizing the reward. The most commonly used algorithm is Proximal Policy Optimization (PPO). While I won’t discuss it here at length, its purpose is not only to update the model towards the maximum reward, but also to keep the updates as modest as possible. Large updates are undesired, since they can lead to unstable training.
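
To give a rough idea of how PPO keeps updates modest, here is a hedged sketch of its clipped objective. The tensors and the clipping value are illustrative, and a real PPO trainer adds value-function and entropy terms on top of this.

```python
import torch

def ppo_surrogate_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    # Ratio between the updated policy and the old one for the same actions.
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages                                         # push towards higher reward
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages  # but not too far at once
    # Maximizing the reward = minimizing the negative of the (pessimistic) objective.
    return -torch.min(unclipped, clipped).mean()
```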

The reinforcement learning method can be compared to placing a robot in a sandbox, allowing it a set of moves and letting it move freely, waving its arms and legs at leisure. The robot regularly receives candy: the closer its hands are to the ground and the faster they move, the sweeter the candy. Eventually the robot will start digging to get the best candy possible.

LLM from scratch

When you hear about training an LLM, you might first think about the extremely resource-intensive process that painstakingly trains the model on gigantic amounts of data. While this is a good intuition, it is only step one; later on we have additional steps that forge the capable but still rough model into a conversation partner.

Steps:

1) Pretraining

A supervised learning example. You train your model on the previously mentioned gigantic amounts of largely unfiltered text. Since it’s supervised learning, the model will be predicting a label. In this case the model takes in a certain amount of text as input and tries to predict the next word (with that word being the label). During this phase the LLM learns to understand human language and fundamental concepts. It’s also by far the most resource-heavy step. A short code sketch of this next-word objective follows the example below.

Model output example in response to a prompt (in italics):

The best movie of all time coming soon on DVD.
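
As a hedged sketch, this is roughly what the next-word objective looks like in code, using the Hugging Face transformers library. GPT-2 is used here only because it is small and public; the prompt and the number of generated tokens are arbitrary.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "The best movie of all time"
inputs = tokenizer(text, return_tensors="pt")

# Passing the input ids as labels makes the model compute the next-word prediction loss.
outputs = model(**inputs, labels=inputs["input_ids"])
print(outputs.loss)                          # the supervised loss minimized during pretraining

# Continuation of the prompt, i.e. repeatedly predicting the next word.
generated = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(generated[0]))
```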

2) Supervised finetuning

Another supervised learning step. The pretrained model has learned to do its job - to predict the next word in a sentence. That’s not quite what we’re after. Thus, we use a technique known as finetuning to further train the model to our liking, namely to converse instead of just finishing sentences. This time a more modest dataset is enough for the already pretrained model, since we only need to adjust it a little.

Since the required dataset sizes are now more manageable, we can afford to provide higher quality, filtered data. Our new dataset should consist of conversation examples, as well as non-biased and non-opinionated data, to prune unwanted behavior from the model.

In practice, the finetuning happens by freezing (preventing from changing) most of the model’s weights, except for the last few layers. Only those are trained, which adjusts the model to act in a more preferable way (a code sketch of this freezing trick follows the example below).

Model output example in response to a prompt:

The best movie of all time

Released in 2009, “Avatar” by James Cameron can be regarded as the best movie of all time due to it grossing $2,923,706,026, the highest amount out of all movies in history.
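
Here is a minimal sketch of the freezing trick described above: keep most weights fixed and only train the last few layers. The exact layer names depend on the architecture; `model.transformer.h` matches GPT-2 and is used here purely as an illustration.

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")

for param in model.parameters():           # freeze everything first
    param.requires_grad = False

for block in model.transformer.h[-2:]:     # unfreeze only the last two transformer blocks
    for param in block.parameters():
        param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")   # only a small fraction of the model
```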

3) RLHF

We already have a capable, conversational model. But some of its outputs are kind of dry, maybe not what consumers would expect. Maybe sometimes the model outputs too little information and sometimes too much? This is adjusted through RLHF (expanded upon in a later section) by incorporating human preferences into the model’s behavior. RLHF is a reinforcement learning step.

Model output example in response to a prompt:

The best movie of all time

Certainly! The term “best movie of all time” carries an intrinsic bias that makes it a purely subjective choice, varying from person to person. However, if we’re looking for a quantifiable score to answer your question, we could use box office returns. As of my latest knowledge update of 29th November 2023, the highest grossing film was James Cameron’s “Avatar”, released in 2009. Its box office return was an impressive $2,923,706,026, potentially earning it the spot as “the best movie of all time”, at least in terms of box office returns.

Nowadays it is standard practice to skip some of the above steps when creating an LLM. Trained models are readily available on the internet, so it’s more convenient and affordable to grab an already existing model and finetune it to your use case, either through supervised finetuning, RLHF or both.

RLHF - So, what’s new?

Reward model

We’ve discussed how regular reinforcement learning works and we know that we have a fairly capable conversational LLM at our disposal. So how exactly does RLHF work and what’s the new gimmick it introduces?

As mentioned earlier, in reinforcement learning the key factor is the reward we’re feeding to our model. The reward is produced by the reward function, normally hard-coded by humans and designed according to their understanding of the goal they want the model to accomplish. In RLHF’s case, our goal is to make the model more user friendly in conversation. So, uh…how do we put that into an equation?

There are a few clearly defined aspects of the conversation that we could rate in the reward function. But that wouldn’t truly capture the subtle meaning behind “Is this output good for human eyes?”. With such a loose (and often very subjective) metric, we have to get a little more creative with how we build our reward function.

To capture the subtlety of the goal, we’re going to use a second neural network (on top of the LLM we already have). This net will output the reward for our LLM, just like a regular reward function would. Its inputs will be prompts and the textual responses generated by our LLM (or by a wider array of different LLMs; the techniques vary). Although RLHF is overall a reinforcement learning method, the new net introduced here learns in a supervised way. This means that to learn, the model has to compare the predicted reward against some ground truth (a label).

What would be the ground truth in this case? The ratings given by actual humans, of course - and that’s where the slow and costly human aspect of RLHF comes in. To prepare a dataset for training our reward model, we need to generate a substantial amount of responses from the net and then have real people read and rate them according to their preferences.

Figure: reward model training overview (reward-model.png)

Source: https://huggingface.co/blog/rlhf
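
As a hedged illustration of this idea, here is a minimal reward model sketch: a small pretrained backbone with a single scalar output, trained in a supervised way against human ratings. The base model, the separator, the toy example and the rating value are assumptions made purely for illustration; real setups often compare pairs of responses instead of regressing on a single score, as the article notes the techniques vary.

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
reward_model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=1      # a single scalar: the predicted reward
)

# Toy "dataset": a prompt plus an LLM response, paired with a human rating (the label).
texts = ["The best movie of all time [SEP] Avatar, because it grossed the most."]
human_ratings = torch.tensor([[0.8]])            # ground truth provided by human labelers

inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
predicted_reward = reward_model(**inputs).logits # shape: (batch, 1)

loss = nn.MSELoss()(predicted_reward, human_ratings)  # supervised loss against the human rating
loss.backward()                                  # from here on, ordinary gradient descent
```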

Actual finetuning

Good news - the change in reward function is the last of the new concepts in RLHF. Once the reward model is trained, we use it as a substitute reward function to finetune the conversational LLM. We freeze most of the layers and then apply reinforcement learning to have the model attain the maximum possible reward.
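
To make the loop concrete, here is a heavily simplified sketch of this finetuning step. The reward model’s scoring is wrapped in a hypothetical reward_model_score helper, and a plain REINFORCE-style policy-gradient update stands in for PPO so the core idea stays visible; the model name, prompt and hyperparameters are illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
policy = AutoModelForCausalLM.from_pretrained("gpt2")   # the conversational LLM being finetuned
# (the layer freezing from the supervised finetuning step could be applied here as well)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-5)

prompt = tokenizer("The best movie of all time", return_tensors="pt")

# 1) Sample a response from the current policy.
response = policy.generate(**prompt, max_new_tokens=20, do_sample=True)

# 2) Score it with the previously trained reward model.
#    reward_model_score is a hypothetical helper returning a plain float.
reward = reward_model_score(tokenizer.decode(response[0]))

# 3) Recompute the negative log-likelihood of the sampled tokens and scale it by the reward:
#    the response is reinforced in proportion to its reward (a negative reward would discourage it).
#    (Simplified: the prompt tokens are treated as part of the response.)
outputs = policy(response, labels=response)
loss = outputs.loss * reward
optimizer.zero_grad()
loss.backward()
optimizer.step()
```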

Since we’re discussing reinforcement learning, you might be wondering how the reward we’re generating plays into the PPO optimization method. One of the core quantities that needs to be calculated in order to find the most advantageous policy is the so-called advantage function:

\[\text{Advantage}(s, a) = Q(s, a) - V(s)\]

Where:
s - the state of our model
a - the action the model takes
Q(s, a) - the reward received after taking action a in state s
V(s) - the expected reward in state s

The equation can be understood as “how much advantage does the model gain after executing a certain action?”.
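
For example (with made-up numbers), if in the current state the model expects a reward of V(s) = 5, but after taking a particular action it actually receives Q(s, a) = 7, the advantage is 7 - 5 = 2: the action was better than expected, so the policy is nudged to take it more often. A negative advantage would nudge the policy the other way.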

Weaknesses

While RLHF is clearly a powerful new method of guiding the model towards superior outputs and less harmful content, it comes with a few weaknesses. They can be collectively summarised as the human factor.

Having humans label lots of training data is slow and expensive. RLHF does not scale well if you wish to use it to teach your model complex behavior. It’s also important to remember that the choice of labeler team is critical for the results you’re going to get. Ratings given by the labelers can vary wildly for the same LLM output, and more often than not they will carry some bias inherent to their author. This variability can skew the learning results.

RLHF is a fairly recent technique and is under active development. At the time of writing this short tutorial there are multiple different methods of creating the neural network rating model, each one trying to squeeze out additional precision. Here we’ve covered just one of them, although the modifications don’t stray far from the idea presented. That said, if RLHF continues advancing at the current rate, it’s very likely that at least some of its current weaknesses will be addressed in the near future.