Training Human-in-the-loop Reward Learning model

To further improve language models’ performance, researchers have tried to introduce Reinforcement Learning into language model training. Progress on this topic was slow, and many practitioners considered it unpromising, because it is hard for a machine to assess the quality of natural language output. Although DeepMind had long since proposed the RLHF (Reinforcement Learning from Human Feedback) training method, it had not produced results in actual products. OpenAI then fine-tuned a small GPT-3 model with RLHF in InstructGPT and achieved better results than the original, much larger GPT-3, showcasing the strength of RLHF. ChatGPT later brought RLHF to the forefront.

In the standard reinforcement learning framework, an Agent continually optimizes its policy based on the reward signals given by the environment. Taking a chatbot as an example, the language model acts as the Agent and outputs text (the action) based on the user’s input context (the environment). So what defines the Reward? As mentioned above, only humans can reliably evaluate the quality of the output text, so people must act as part of the Reward function; this is what is known as human feedback. The policy must be updated continuously, but people clearly cannot sit there scoring the output text all the time. Instead, a deep learning model is trained to imitate how humans evaluate output quality. This is the Reward Model (RM), as shown in the figure below.

Figure: Reward Model Training Framework
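
To make the loop concrete, here is a minimal Python sketch of one step of that feedback cycle. All names are hypothetical placeholders rather than a real API: `generate` stands for the policy language model and `score` for the learned reward model that approximates human judgment.

```python
from typing import Callable

def rlhf_step(
    generate: Callable[[str], str],       # policy LM: prompt -> response text (hypothetical)
    score: Callable[[str, str], float],   # reward model: (prompt, response) -> scalar (hypothetical)
    user_prompt: str,
) -> tuple[str, float]:
    # The Agent (language model) acts on the environment (the user's input context).
    response = generate(user_prompt)
    # Human feedback is approximated by the reward model, which returns one scalar.
    reward = score(user_prompt, response)
    # A standard RL algorithm such as PPO would use this scalar to update the
    # policy; that update is the subject of the next section.
    return response, reward

# Toy usage with trivial stand-ins:
response, reward = rlhf_step(lambda p: p + " ...", lambda p, r: 0.0, "Hello, Yuri")
```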

The Reward Model focuses on learning human preferences and is therefore also called a preference model. The principal goal is to obtain a scoring model that takes a piece of text (a prompt together with a candidate response) and outputs a scalar reward. This numeric reward represents how well the output matches human preferences for the given input. The key point is that the model must output a scalar reward so that it can work seamlessly with existing RL algorithms. In most cases, an RM is built on top of another language model or trained from scratch as a Transformer.
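
As an illustration of what a scalar reward head on top of a language model can look like, below is a minimal PyTorch sketch. It assumes a Hugging-Face-style Transformer backbone that returns `last_hidden_state`; the class and attribute names are illustrative, not the exact architecture used by OpenAI.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Preference model: encodes a prompt plus response and outputs one scalar reward."""

    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        # `backbone` is assumed to be a pretrained Transformer that returns
        # hidden states of shape (batch, seq_len, hidden_size).
        self.backbone = backbone
        # A single linear "value head" maps the sequence summary to a scalar.
        self.value_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        # Summarize the sequence with the hidden state of the last non-padding token.
        last_token = attention_mask.sum(dim=1) - 1
        summary = hidden[torch.arange(hidden.size(0)), last_token]
        return self.value_head(summary).squeeze(-1)  # shape: (batch,)
```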

OpenAI takes prompts previously submitted by users through the GPT API and uses the initial language model to generate new text for each of them, producing prompt-generation pairs. Human trainers then rank the texts generated by the initial LM. An obvious alternative would be to have humans score the outputs directly, but this is hardly practical: different people apply different scoring criteria, which causes the absolute scores to drift from the true quality. Ranking, by contrast, only asks annotators to compare the quality of multiple model outputs, which yields a better-regularized dataset. There are several ways to rank the output texts; a successful one is to have annotators compare outputs produced by two language models for the same prompt and then derive a relative ranking of the models and outputs, for example with an Elo rating system, so that the rankings can be standardized into the scalar reward signal we need.
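
One common way to turn such pairwise rankings into a training signal for the reward model is a pairwise logistic loss, which pushes the scalar reward of the preferred response above that of the rejected one. The sketch below shows this widely used formulation as an assumption for illustration; the Elo-style normalization mentioned above is an alternative route to the same kind of scalar signal.

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss for reward-model training.

    `reward_chosen` / `reward_rejected` hold the scalar rewards the model assigned
    to the human-preferred and the less-preferred response for the same prompt.
    Minimizing the loss widens the margin between the two.
    """
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Example with dummy scores for a batch of three ranked pairs:
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.4, 0.9, 1.5])
loss = pairwise_ranking_loss(chosen, rejected)
```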

At this point, the two preconditions for the RLHF system (a pretrained language model and a trained reward model) are met, and the next step is to use RL to further fine-tune the language model.
