Reinforcement Learning Fine-tuning
Although much of the industry was long skeptical that reinforcement learning could be applied to language models, many institutions and researchers have explored using RL to fine-tune all or part of a language model's parameters. OpenAI is the most notable example: ChatGPT adopted PPO, a mature reinforcement learning algorithm proposed by OpenAI itself, for language model fine-tuning. To date, PPO remains the best-known RL algorithm to be successfully applied to language models. Let's see how this fine-tuning process looks from the perspective of the RL algorithm.
The Policy is a language model that accepts a Prompt and returns a text sequence (or, more precisely, a probability distribution over texts). The Action Space of the Policy is the set of all tokens in the language model's vocabulary (typically around 50,000). The Observation Space is the set of all possible input token sequences, so the state space is roughly vocabulary size ^ sequence length. The Reward function combines the score from the RM with the constraint on policy shift described above. The whole process looks roughly like this:
➪ Update the policy online by maximizing the expected Return on the current batch of data, just as standard PPO does.
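The two core quantities in this loop can be sketched in a few lines of plain Python: the KL-penalized reward that the policy maximizes, and PPO's clipped surrogate objective for a single action (token). This is a minimal illustration under common RLHF assumptions; all function and parameter names (`rlhf_reward`, `ppo_clip_objective`, `beta`, `eps`) are placeholders, not the actual ChatGPT implementation.

```python
import math

def rlhf_reward(rm_score, logprobs_policy, logprobs_ref, beta=0.02):
    """Sequence-level reward: the RM's score minus a KL penalty that keeps
    the fine-tuned policy close to the frozen pre-RL reference model."""
    # Per-token log-prob differences approximate the KL divergence
    kl = sum(lp - lr for lp, lr in zip(logprobs_policy, logprobs_ref))
    return rm_score - beta * kl

def ppo_clip_objective(logp_new, logp_old, advantage, eps=0.2):
    """PPO's clipped surrogate for one token: the probability ratio is
    clipped so a single update cannot move the policy too far."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    # Take the pessimistic (minimum) of the unclipped and clipped terms
    return min(ratio * advantage, clipped * advantage)
```

Maximizing `ppo_clip_objective` over the rewards produced by `rlhf_reward` is what "update online like a normal PPO" means here: the clipping keeps each policy update small, and the KL term keeps the language model from drifting away from its pretrained behavior.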
As the PPO algorithm iterates and human trainers continue to refine the Reward function, the language model keeps improving, much like AlphaGo did, and can eventually achieve impressive results.