Reinforcement Learning Fine-tuning
Although much of the industry had all but declared that reinforcement learning does not apply to language models, many institutions and researchers have kept exploring the feasibility of using reinforcement learning to fine-tune all or part of a language model's parameters. OpenAI is the most notable one: ChatGPT adopted PPO, the mature, state-of-the-art reinforcement learning algorithm that OpenAI itself proposed, to fine-tune its language model. To date, PPO remains the only RL algorithm that has been successfully adopted for language models. Let's see how this fine-tuning process is described from the perspective of an RL algorithm.
The Policy is a language model that accepts a prompt and returns a sequence of text (or, more precisely, a probability distribution over text). The Policy's Action Space is all the tokens in the language model's vocabulary (usually around 50,000). The Observation Space is all possible input token sequences (so the state space is on the order of vocabulary size ^ input sequence length). The Reward function is determined both by the RM and by the constraint on policy shift described above; a code sketch of how these pieces combine appears after the list below. The whole process looks roughly like this:
➪Sample a prompt from the training set.
➪Generate a text sequence from the original language model and a text sequence from the current fine-tuned iteration of the language model.
➪Feed the text generated by the current policy into the RM to obtain a scalar reward.
➪Compare the current policy's output distribution with the original model's; usually the KL divergence is used to calculate the difference between them. This KL term acts as a constraint on how far the policy can drift, preventing the model from fabricating text that makes no sense but is still capable of deceiving the RM.
➪Combine the RM reward and the KL penalty to create the final reward for the RL update. Moreover, when training InstructGPT, OpenAI also mixes gradients from the original pre-training data into the update.
➪Update the policy online by maximizing the return on the current batch of data, just as standard PPO does.
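
To make the reward computation concrete, here is a minimal sketch of how the per-token reward is often assembled in RLHF-style pipelines, assuming the per-token log-probabilities of the generated response under the current policy and under the frozen original model are already available. All names here (`combined_reward`, `beta`, `rm_score`) are illustrative, not taken from OpenAI's implementation.

```python
# Minimal sketch: combine the RM score and a per-token KL penalty into rewards.
# In practice the log-probs come from real models; here they are toy numbers.
import numpy as np

def combined_reward(policy_logprobs: np.ndarray,
                    ref_logprobs: np.ndarray,
                    rm_score: float,
                    beta: float = 0.02) -> np.ndarray:
    """Per-token rewards for one generated response.

    policy_logprobs: log pi_RL(token_t | context) for each generated token
    ref_logprobs:    log pi_ref(token_t | context) from the frozen original model
    rm_score:        scalar score of the whole response from the reward model
    beta:            weight of the KL penalty keeping the policy near the reference
    """
    # Single-sample estimate of the per-token KL divergence between the two models.
    kl_penalty = policy_logprobs - ref_logprobs
    rewards = -beta * kl_penalty
    # The scalar RM reward is commonly credited to the final generated token;
    # PPO's advantage estimation then spreads it back over the sequence.
    rewards[-1] += rm_score
    return rewards

# Toy usage with made-up log-probabilities for a 5-token response.
policy_lp = np.array([-2.1, -1.7, -0.9, -1.3, -0.5])
ref_lp = np.array([-2.3, -1.9, -1.5, -1.2, -0.8])
print(combined_reward(policy_lp, ref_lp, rm_score=1.4))
```

PPO then treats these per-token rewards like any other environment reward when estimating advantages and updating the policy.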
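
For reference, the InstructGPT paper writes this combined objective (RM score, KL penalty against the supervised fine-tuned reference model, plus the extra pre-training term) roughly as

$$
\text{objective}(\phi) = \mathbb{E}_{(x,y)\sim D_{\pi_\phi^{\mathrm{RL}}}}\left[ r_\theta(x, y) - \beta \log\frac{\pi_\phi^{\mathrm{RL}}(y\mid x)}{\pi^{\mathrm{SFT}}(y\mid x)} \right] + \gamma\, \mathbb{E}_{x\sim D_{\mathrm{pretrain}}}\left[ \log \pi_\phi^{\mathrm{RL}}(x) \right]
$$

where π_φ^RL is the policy being fine-tuned, π^SFT is the frozen reference model, r_θ is the reward model, and β and γ weight the KL penalty and the pre-training term.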
As the PPO algorithm iterates and human trainers continue to refine the Reward function, the language model keeps evolving, much like AlphaGo, and eventually achieves impressive results.