# Training the Human-in-the-Loop Reward Learning Model

To further improve the performance of language models, researchers have tried to apply reinforcement learning (RL) to them. Progress was slow, and many practitioners considered the direction unpromising, because it is hard for a machine to assess the quality of natural-language output. Although DeepMind proposed RLHF (Reinforcement Learning from Human Feedback) as a training method early on, for a long time it produced no visible results in actual products. In InstructGPT, OpenAI used RLHF to fine-tune a small GPT-3 model and obtained better results than the original, much larger GPT-3, which showcased the strength of RLHF. Later, ChatGPT brought RLHF to the forefront.

<figure><img src="/files/lrRX46uzMCrxXw4DoCVJ" alt=""><figcaption></figcaption></figure>

In the original reinforcement learning framework, an agent continually optimizes its policy based on the reward signals given by the environment. Taking a chatbot as an example, the language model acts as the agent and outputs text (the action) based on the user's input context (the environment). What, then, defines the reward? As mentioned above, only humans can evaluate the quality of the output text, so people have to be part of the reward function. This is what "human feedback" means. The policy has to be updated continually, but people clearly cannot be on hand to score every output. A deep learning model is therefore trained to imitate how humans evaluate output quality. This is the Reward Model (RM), as shown in the figure.

<figure><img src="/files/y7DctPGvxzdQsiOKAcWV" alt=""><figcaption></figcaption></figure>
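The agent, action, and reward loop described above can be made concrete with a toy sketch. Everything in it is an assumption made for illustration: the three candidate responses, the hand-written `TOY_REWARD` table standing in for human feedback, and the exact (non-sampled) policy-gradient update, which is only feasible because the action set is tiny. A real system samples text from a language model instead.

```python
import math

# Hypothetical candidate responses (the agent's possible actions) for one prompt.
CANDIDATES = [
    "I don't know.",
    "Paris is the capital of France.",
    "asdf qwerty",
]

# Hypothetical reward table standing in for human feedback on each response.
TOY_REWARD = [0.2, 1.0, 0.0]


def softmax(logits):
    """Turn unnormalized preferences into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_step(logits, rewards, lr=0.3):
    """One exact gradient-ascent step on the expected reward of a softmax policy."""
    probs = softmax(logits)
    expected = sum(p * r for p, r in zip(probs, rewards))
    # For a softmax policy: d E[r] / d logit_i = p_i * (r_i - E[r]).
    return [l + lr * p * (r - expected) for l, p, r in zip(logits, probs, rewards)]


logits = [0.0, 0.0, 0.0]  # start from a uniform policy
for _ in range(500):
    logits = policy_gradient_step(logits, TOY_REWARD)

probs = softmax(logits)
best = CANDIDATES[probs.index(max(probs))]
print(best)  # prints "Paris is the capital of France."
```

The policy shifts probability toward the response with the highest reward, which is the behavior the reward signal is meant to induce. Replacing the fixed table with a learned scoring model is what removes the need for a human to score every output.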

## Reward Model Training Framework

The Reward Model focuses on learning human preferences and is therefore also called a preference model. The principal goal is to obtain a scoring model that takes a sequence of text and outputs a scalar reward. This numeric reward represents the human preference for a given input and output. The key requirement is that the output be a scalar, so that the model works seamlessly with existing RL algorithms. In most cases, an RM is either built on another language model or is a Transformer trained from scratch.
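As a minimal sketch of "text in, scalar out", the example below trains a tiny linear scorer on preference pairs. It is not the architecture described above: the bag-of-words features stand in for a Transformer, the pairwise logistic loss `-log(sigmoid(r_preferred - r_rejected))` is one common training objective that this page does not specify, and the preference data and all function names are hypothetical.

```python
import math


def featurize(prompt, response, vocab):
    """Bag-of-words counts over the concatenated prompt and response."""
    tokens = (prompt + " " + response).lower().split()
    return [float(tokens.count(word)) for word in vocab]


def reward(weights, prompt, response, vocab):
    """Scalar reward for a prompt/response pair (dot product of weights and features)."""
    features = featurize(prompt, response, vocab)
    return sum(w * f for w, f in zip(weights, features))


def train_reward_model(preferences, vocab, lr=0.1, epochs=100):
    """Fit weights so that preferred responses score higher than rejected ones."""
    weights = [0.0] * len(vocab)
    for _ in range(epochs):
        for prompt, preferred, rejected in preferences:
            good = featurize(prompt, preferred, vocab)
            bad = featurize(prompt, rejected, vocab)
            margin = sum(w * (g - b) for w, g, b in zip(weights, good, bad))
            # Gradient of -log(sigmoid(margin)) w.r.t. the weights is
            # -(1 - sigmoid(margin)) * (good - bad); step in the opposite direction.
            step = lr * (1.0 - 1.0 / (1.0 + math.exp(-margin)))
            weights = [w + step * (g - b) for w, g, b in zip(weights, good, bad)]
    return weights


# Hypothetical human preference data: (prompt, preferred response, rejected response).
PREFERENCES = [
    ("What is 2+2?", "The answer is 4.", "I refuse to answer."),
    ("Name a color.", "Blue is a color.", "I refuse to answer."),
]

VOCAB = sorted({
    token
    for prompt, preferred, rejected in PREFERENCES
    for text in (prompt, preferred, rejected)
    for token in text.lower().split()
})

weights = train_reward_model(PREFERENCES, VOCAB)
good_score = reward(weights, "What is 2+2?", "The answer is 4.", VOCAB)
bad_score = reward(weights, "What is 2+2?", "I refuse to answer.", VOCAB)
print(good_score > bad_score)  # True: the preferred response gets the higher reward
```

Because `reward` returns a single float, its output can be handed directly to an RL algorithm as the reward signal, which is the property the paragraph above emphasizes.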

To build the training data, OpenAI takes prompts that users previously submitted through the GPT API and uses the initial language model to generate new texts, forming prompt-generation pairs. Human trainers then rank the text generated by the initial LM. The original idea was to have humans score these outputs directly, but this is hardly practical: different people apply different scoring criteria, so the scores they assign can deviate from the true quality of the text. Ranking avoids this problem, because comparing the quality of multiple model outputs yields a better-regularized dataset. There are several ways to rank the output texts. One of the more successful is to have users compare the outputs that two language models produce for the same prompt, and then derive a relative ranking of the models and outputs with a method such as the Elo rating system, so that the ranking can be normalized into the scalar reward signal we need.
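The step from pairwise comparisons to a scalar signal can be sketched with the standard Elo update. The starting rating of 1000, the K-factor of 32, the final zero-mean, unit-variance normalization, and the comparison data are assumptions chosen for illustration, not values taken from OpenAI's pipeline.

```python
def expected_win(rating_a, rating_b):
    """Elo-predicted probability that A is preferred over B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def elo_ratings(comparisons, initial=1000.0, k=32.0):
    """Update ratings once per (winner, loser) comparison, in order."""
    ratings = {}
    for winner, loser in comparisons:
        ratings.setdefault(winner, initial)
        ratings.setdefault(loser, initial)
        # The less expected the win, the larger the rating transfer.
        surprise = 1.0 - expected_win(ratings[winner], ratings[loser])
        ratings[winner] += k * surprise
        ratings[loser] -= k * surprise
    return ratings


def to_scalar_rewards(ratings):
    """Standardize ratings to zero mean and unit variance."""
    values = list(ratings.values())
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    if std == 0.0:
        return {name: 0.0 for name in ratings}
    return {name: (value - mean) / std for name, value in ratings.items()}


# Hypothetical human judgments for one prompt: (preferred output, rejected output).
COMPARISONS = [
    ("output_a", "output_b"),
    ("output_a", "output_c"),
    ("output_b", "output_c"),
]

ratings = elo_ratings(COMPARISONS)
rewards = to_scalar_rewards(ratings)
ranking = sorted(rewards, key=rewards.get, reverse=True)
print(ranking)  # ['output_a', 'output_b', 'output_c']
```

Annotators only ever answer "which one is better?", yet each output ends up with a number on a shared scale. That sidesteps the inconsistency between individual scoring criteria described above.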

At this point, the two preconditions for the RLHF system (an initial language model and a reward model) are met, and the next step is to use RL to further fine-tune the language model.

