Training a Human-in-the-Loop Reward Model
To further improve language models’ performance, researchers have tried to bring Reinforcement Learning into language model training. Progress on this topic was slow, however, and many professionals considered it unpromising because it is hard for a machine to assess the quality of natural-language output. Although RLHF (Reinforcement Learning from Human Feedback) was proposed by OpenAI and DeepMind researchers years ago, for a long time it showed no results in actual products. OpenAI then applied RLHF in InstructGPT to fine-tune a relatively small GPT-3 model that outperformed the original, much larger GPT-3, showcasing the strength of RLHF. Later, ChatGPT brought RLHF to the forefront.
In the classic reinforcement learning framework, an agent continually optimizes its policy based on reward signals given by the environment. Take a chatbot as an example: the language model acts as the agent and outputs text (the action) based on the user’s input context (the environment). So what defines the reward? As mentioned above, only humans can reliably evaluate the quality of the output text, so people must act as part of the reward function; this is the human feedback. The policy must be updated continuously, but clearly people cannot sit there and score every output all the time. A deep learning model is therefore trained to imitate how humans evaluate output quality. This is the Reward Model (RM), as shown in the figure.
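To make the agent–environment–reward mapping concrete, here is a minimal Python sketch (not from the original article; `policy_lm` and `reward_model` are hypothetical objects) showing where the reward model sits in a single interaction step:

```python
def rlhf_step(policy_lm, reward_model, prompt: str) -> float:
    """One agent-environment interaction: generate text, then score it.

    `policy_lm` stands in for the language model (the agent) and
    `reward_model` for a trained preference model acting as the reward function.
    """
    response = policy_lm.generate(prompt)          # action taken by the agent
    reward = reward_model.score(prompt, response)  # scalar proxy for human judgment
    # In a full training loop this reward would drive a policy-gradient
    # update (e.g. PPO) on policy_lm's parameters.
    return reward
```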
The Reward Model focuses on learning human preferences and is therefore also called a preference model. Its principal goal is a scoring model that takes a piece of text and outputs a scalar reward. This numeric reward represents the human preference for the given input and output. The key requirement is that the model outputs a scalar, so that it can work seamlessly with existing RL algorithms. In most cases, the RM is initialized from another language model or trained from scratch as a Transformer.
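As a hedged illustration of such a scoring model, the PyTorch sketch below places a one-unit value head on top of a Transformer encoder so that a token sequence maps to a single scalar. The layer sizes and the last-token pooling are illustrative assumptions, not details from the article:

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Sketch of a preference model: a Transformer encoder whose pooled
    representation is projected to one scalar reward per sequence."""

    def __init__(self, vocab_size: int = 50257, d_model: int = 768, n_layers: int = 6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.value_head = nn.Linear(d_model, 1)  # maps pooled features to one scalar

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        h = self.encoder(self.embed(token_ids))      # (batch, seq, d_model)
        pooled = h[:, -1, :]                         # use the last token's hidden state
        return self.value_head(pooled).squeeze(-1)   # (batch,) scalar rewards
```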
OpenAI takes prompts previously submitted by users through the GPT API and has the initial language model generate new texts for them, forming prompt-generation pairs. Human annotators then rank the texts generated by the initial LM. An intuitive approach would be to have humans score each output directly, but this is hardly practical: different people apply different scoring criteria, which makes the absolute scores deviate from the true quality. Ranking, by contrast, compares the quality of multiple model outputs and yields a better-regularized dataset. There are several ways to rank the output texts. One of the more successful ones is to have annotators compare outputs produced by two language models for the same prompt, and then derive a relative ranking of models and outputs with a method such as the Elo rating system, which can be normalized into the scalar reward signal we need.
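One commonly used way to turn such pairwise comparisons into a training signal for the reward model is a Bradley-Terry style loss that pushes the preferred output's score above the rejected one's. The snippet below is a sketch under that assumption, not a quote of OpenAI's implementation:

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(chosen_rewards: torch.Tensor,
                             rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Penalize the model whenever the human-preferred output does not
    receive a higher scalar reward than the less-preferred one."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Hypothetical usage with the RewardModel sketched above:
# loss = pairwise_preference_loss(reward_model(chosen_ids), reward_model(rejected_ids))
# loss.backward()
```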
At this point, the two preconditions for the RLHF system are met: an initial language model and a trained reward model. The next step is to use RL to further fine-tune the language model.
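In common RLHF setups (typically with PPO), the reward passed to the RL algorithm is the reward-model score minus a KL penalty that keeps the fine-tuned policy close to the initial language model. The sketch below illustrates this widely used shaping as an assumption; `beta` is an illustrative coefficient, not a value from the article:

```python
import torch

def kl_shaped_reward(rm_score: torch.Tensor,
                     policy_logprob: torch.Tensor,
                     ref_logprob: torch.Tensor,
                     beta: float = 0.1) -> torch.Tensor:
    """Reward-model score minus a KL penalty toward the reference (initial) LM.

    rm_score:        (batch,) scalar rewards from the reward model
    policy_logprob:  (batch, seq) per-token log-probs under the fine-tuned policy
    ref_logprob:     (batch, seq) per-token log-probs under the frozen initial LM
    """
    kl_penalty = policy_logprob - ref_logprob        # per-token log-ratio estimate
    return rm_score - beta * kl_penalty.sum(dim=-1)  # (batch,) shaped rewards
```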