Model of ChatGPT
Starting with GPT and BERT, pre-trained language models have followed a two-stage paradigm: pre-training a large model under self-supervision and then fine-tuning it for specific downstream tasks. GPT excels at Natural Language Generation because it uses a unidirectional Transformer decoder, while BERT is geared toward Natural Language Understanding because it uses a bidirectional Transformer encoder. When OpenAI released GPT-2, it received only a mediocre response. Most insiders saw BERT as more promising: it was open-sourced early, backed by the influential Google, and easy to put into production quickly, which is what business-oriented AI application companies hoped for. In hindsight, that preference foreshadowed BERT's later falling behind.
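To make the architectural contrast concrete, here is a minimal sketch (assuming PyTorch) of the attention masks behind the two designs; the sequence length and printing are purely illustrative:

```python
import torch

seq_len = 5

# GPT-style decoder: causal (lower-triangular) mask, so position i can attend
# only to positions <= i -- the right inductive bias for left-to-right generation.
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()

# BERT-style encoder: full mask, so every position attends to every other one --
# the right bias for understanding tasks that need context from both sides.
bidirectional_mask = torch.ones(seq_len, seq_len).bool()

print(causal_mask.int())
print(bidirectional_mask.int())
```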
The capability of such a two-stage language model is confined to a single domain: the translation model can only translate, the fill-in-the-blank model can only fill in blanks, the summarization model can only generate summaries, and so on. Fine-tuning a separate model on each domain's data is an inefficient way to handle real tasks. To move toward a generic language model closer to the human mind, GPT-2 introduced more tasks into pre-training. Its innovation was to cast supervised tasks as self-supervised language modeling, so the resulting model performed well on downstream tasks without being trained specifically for them; its capability was greatly extended. Yet its alignment was still relatively poor, and fine-tuning was needed in practical applications. GPT-2 is now credited with helping lay the foundation for zero-shot learning. To improve alignment, GPT-3 trains a much larger model on more data. It also leans on in-context learning, i.e., writing prompts that resemble natural human language to tell the model what to do, which further strengthens its zero-shot ability. In short, language models are becoming larger and larger.
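To illustrate in-context learning, here is a minimal sketch using the Hugging Face transformers library; the small public gpt2 checkpoint stands in for the much larger GPT-3 models, so its translations will be poor, but the zero-shot versus few-shot prompt formats mirror those in the GPT-3 paper:

```python
from transformers import pipeline

# Small public GPT-2 checkpoint as a stand-in for the (closed) GPT-3 models.
generator = pipeline("text-generation", model="gpt2")

# Zero-shot: the task is described in natural language, with no examples.
zero_shot = "Translate English to French: cheese =>"

# Few-shot (in-context learning): a handful of demonstrations precede the
# query; the model is never gradient-updated, it only conditions on the prompt.
few_shot = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "plush giraffe => girafe en peluche\n"
    "cheese =>"
)

for prompt in (zero_shot, few_shot):
    out = generator(prompt, max_new_tokens=5, do_sample=False)
    print(out[0]["generated_text"])
```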
A chart in the GPT-3 paper shows that zero-shot performance depends heavily on model scale. It is fair to say that since GPT-3, pushing language models forward has been out of reach for ordinary people without massive compute resources, and Natural Language Processing has entered the era of Large Language Models (LLMs). But that does not mean we cannot understand or learn from their ideas.
ChatGPT likewise relies on an LLM for a cold start, as shown in the following figure:
Only a small amount of human-augmented text was involved in the initial fine-tuning, a tiny fraction of the data used to train the language model. Hence, this fine-tuning step is probably optional when ChatGPT initializes its language model.
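For concreteness, here is a minimal sketch of this kind of supervised fine-tuning on human-written demonstrations, assuming PyTorch and Hugging Face transformers; the demonstrations, checkpoint, and hyperparameters are placeholders, not ChatGPT's actual data or recipe:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder demonstrations standing in for human-augmented texts.
demos = [
    "User: What causes tides?\nAssistant: The gravitational pull of the moon and sun.",
    "User: Summarize photosynthesis.\nAssistant: Plants turn light, water, and CO2 into sugar and oxygen.",
]

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for epoch in range(2):  # toy number of passes
    for text in demos:
        batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
        # Standard causal-LM objective: labels are the inputs, shifted internally.
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```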
Although a well-designed LLM performs well in both capability and alignment, a language model built only by pre-training, or by fine-tuning on additional supervised texts, is ultimately unable to cope with the complexity of real natural language environments. In practice, such models often exhibit the following shortcomings:
Providing useless responses: not following the user's actual requests and giving irrelevant answers.
Fabricating content: inventing implausible content based solely on the probability distribution of words (see the sketch after this list).
Lack of interpretability: making it hard to understand how the model arrives at a particular decision, which undermines trust in its outputs.
Biased and harmful content: learning from biased data and producing unfair or harmful outputs.
Weak consistency in interaction: struggling to generate long texts or maintain a coherent context.
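The fabrication problem follows directly from how decoding works: the model ranks continuations by probability, not by truth. Here is a minimal sketch of inspecting that next-token distribution, assuming Hugging Face transformers; the prompt is deliberately about a made-up fact, yet the model still offers confident continuations:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# A leading prompt about a nonexistent event: the model will continue it
# anyway, because decoding only follows word probabilities, not facts.
inputs = tokenizer("The first person to walk on Mars was", return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # distribution over the next token
probs = torch.softmax(logits, dim=-1)
top = torch.topk(probs, 5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(idx)!r}: {p:.3f}")  # plausible, not factual
```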