A Brief History: from GPT-1 to GPT-3

These are my reading notes on "Developing Apps with GPT-4 and ChatGPT".

In this section, we will introduce the evolution of the OpenAI GPT models from GPT-1 to GPT-3.

GPT-1

In mid-2018, OpenAI published a paper titled “Improving Language Understanding by Generative Pre-Training” by Alec Radford et al., in which they introduced the Generative Pre-trained Transformer, also known as GPT-1.

The full name of GPT is Generative Pre-trained Transformer.

Before GPT-1, the common approach to building high-performance NLP neural models relied on supervised learning, which requires large amounts of manually labeled data. However, the need for large amounts of well-annotated data limited the performance of these techniques, because such datasets are both difficult and expensive to produce.

The authors of GPT-1 proposed a new learning process that introduces an unsupervised pre-training step. In this step, no labeled data is needed. Instead, the model is trained to predict the next token.

The GPT-1 model was pre-trained on the BookCorpus dataset, which contains the text of approximately 11,000 unpublished books.

In the unsupervised learning phase, the model learned to predict the next token in the texts of the BookCorpus dataset.

However, because the model was small, it was unable to perform complex tasks without fine-tuning.

To adapt the model to a specific target task, a second supervised learning step, called fine-tuning, was performed on a small set of manually labeled data.

The process of fine-tuning allowed the parameters learned in the initial pre-training phase to be modified to fit the task at hand better.

In contrast to other NLP neural models, GPT-1 showed remarkable performance on several NLP tasks using only a small amount of manually labeled data for fine-tuning.

NOTE

GPT-1 was trained in two stages:


Stage 1: Unsupervised Pre-training
Goal: To learn general language patterns and representations.
Method: The model is trained to predict the next token in the sentence.
Data: A large unlabeled text dataset.
Type of Learning: Unsupervised learning – no manual labels are needed.
Outcome: The model learns a strong general understanding of language, but it’s not yet specialized for specific tasks (e.g., sentiment analysis or question answering).


Stage 2: Supervised Fine-tuning
Goal: To adapt the pre-trained model to a specific downstream task.
Method: The model is further trained on a small labeled dataset.
Type of Learning: Supervised learning – the data includes input-output pairs (e.g., a sentence and its sentiment label).
Outcome: The model’s parameters are fine-tuned so it performs better on that particular task.


Summary:
  • Pre-training teaches the model how language works (general knowledge).
  • Fine-tuning teaches the model how to perform a specific task (specialized skills).

A good analogy would be:
The model first reads lots of books to become knowledgeable (pre-training), and then takes a short course to learn a particular job (fine-tuning).
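
To make the two stages more concrete, here is a minimal sketch in PyTorch (not from the book; the model, data, and heads are simplified placeholders) contrasting the next-token-prediction objective of pre-training with the supervised objective of fine-tuning:

```python
import torch
import torch.nn as nn

# A toy language model: embedding -> LSTM -> output heads.
# (A real GPT uses a stack of transformer decoder blocks; an LSTM keeps the sketch short.)
class TinyLM(nn.Module):
    def __init__(self, vocab_size=1000, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lstm = nn.LSTM(d_model, d_model, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)   # used in pre-training
        self.cls_head = nn.Linear(d_model, 2)           # added for fine-tuning (e.g., sentiment)

    def forward(self, tokens):
        hidden, _ = self.lstm(self.embed(tokens))
        return hidden

model = TinyLM()

# --- Stage 1: unsupervised pre-training (predict the next token) ---
tokens = torch.randint(0, 1000, (8, 32))           # fake batch of token ids (unlabeled text)
hidden = model(tokens)
logits = model.lm_head(hidden[:, :-1])              # predict token t+1 from the prefix up to t
targets = tokens[:, 1:]                             # the "labels" are just the shifted input
lm_loss = nn.functional.cross_entropy(logits.reshape(-1, 1000), targets.reshape(-1))

# --- Stage 2: supervised fine-tuning (e.g., sentiment classification) ---
labeled_tokens = torch.randint(0, 1000, (8, 32))    # small labeled dataset
labels = torch.randint(0, 2, (8,))                  # manually annotated sentiment labels
hidden = model(labeled_tokens)
cls_logits = model.cls_head(hidden[:, -1])          # classify from the last position
ft_loss = nn.functional.cross_entropy(cls_logits, labels)

print(f"pre-training loss: {lm_loss.item():.3f}, fine-tuning loss: {ft_loss.item():.3f}")
```

Note how the second stage reuses the parameters learned in the first stage and only adds a small task-specific head, which is why a small labeled dataset is enough.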

The architecture of GPT-1 was similar to the decoder of the original transformer, introduced in 2017, and it had 117 million parameters.

This first GPT model paved the way for future models with larger datasets and more parameters to take better advantage of the potential of the transformer architectures.

GPT-2

In early 2019, OpenAI proposed GPT-2, a scaled-up version of the GPT-1 model, increasing the number of parameters and the size of the training dataset tenfold.

This new version had 1.5 billion parameters and was trained on 40 GB of text.

In November 2019, OpenAI released the full version of the GPT-2 language model.

GPT-2 is publicly available and can be downloaded from Hugging Face or GitHub.

GPT-2 showed that training a larger language model on a larger dataset improves its ability to understand tasks and allows it to outperform the state of the art on many of them.

GPT-3

GPT-3 was released by OpenAI in June 2020.

The main differences between GPT-2 and GPT-3 are the size of the model and the quantity of data used for the training.

GPT-3 is a much larger model, with 175 billion parameters, allowing it to capture more complex patterns.

In addition, GPT-3 is trained on a more extensive dataset.

This includes Common Crawl, a large web archive containing text from billions of web pages, as well as other sources such as Wikipedia.

This training dataset, which includes content from websites, books, and articles, allows GPT-3 to develop a deeper understanding of the language and context.

As a result, GPT-3 improved performance on a variety of linguistic tasks.

GPT-3 eliminates the need for a fine-tuning step that was mandatory for its predecessors.

NOTE

How GPT-3 eliminates the need for fine-tuning:

GPT-3 is trained on a massive amount of data, and it’s much larger than GPT-1 and GPT-2 – with 175 billion parameters.
Because of the scale, GPT-3 learns very strong general language skills during pre-training alone.


Instead of fine-tuning, GPT-3 uses:
  1. Zero-shot learning
    Just give it a task description in plain text – no example needed.
  2. One-shot learning
    Give it one example in the prompt to show what kind of answer you want.
  3. Few-shot learning
    Give it a few examples in the prompt, and it learns the pattern on the fly.

So in short:

GPT-3 doesn’t need fine-tuning because it can understand and adapt to new tasks just by seeing a few examples in the input prompt — thanks to its massive scale and powerful pre-training.
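
As an illustration (not from the book), the sketch below shows the three prompting styles with made-up translation examples; the chat-completions call assumes the openai Python SDK's v1-style client and a placeholder model name:

```python
from openai import OpenAI  # assumes the openai Python SDK (v1-style client) is installed

# Zero-shot: only a task description, no examples.
zero_shot = "Translate the following English sentence into French: 'Where is the station?'"

# One-shot: a single worked example shows the expected format.
one_shot = (
    "Translate English to French.\n"
    "English: Good morning. -> French: Bonjour.\n"
    "English: Where is the station? -> French:"
)

# Few-shot: a handful of examples lets the model pick up the pattern on the fly.
few_shot = (
    "Translate English to French.\n"
    "English: Good morning. -> French: Bonjour.\n"
    "English: Thank you very much. -> French: Merci beaucoup.\n"
    "English: See you tomorrow. -> French: À demain.\n"
    "English: Where is the station? -> French:"
)

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # placeholder model name
    messages=[{"role": "user", "content": few_shot}],
)
print(response.choices[0].message.content)
```

The only thing that changes between the three styles is the prompt text itself; the model weights are never updated.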


GPT-3 is indeed capable of handling many tasks without traditional fine-tuning, but that doesn’t mean it completely lacks support for fine-tuning or never uses it.

GPT-3’s default approach: Few-shot / Zero-shot Learning

What makes GPT-3 so impressive is that it can:

  • Perform tasks without retraining (fine-tuning)
  • Learn through prompts alone

Does GPT-3 support fine-tuning?

Yes! OpenAI eventually provided a fine-tuning API for GPT-3, which is useful in scenarios like:

  • When you have domain-specific data (e.g., legal, medical).

  • When you want the model to maintain a consistent tone or writing style.

  • When you need a stable and structured output format (e.g., JSON).

  • When prompt engineering isn’t sufficient.
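
As a rough illustration (not from the book), here is what submitting a fine-tuning job could look like with the openai Python SDK's v1-style client; the file name, model name, and training examples are placeholders, and the exact methods and supported models may differ between SDK versions:

```python
import json
from openai import OpenAI  # assumes the openai Python SDK (v1-style client) is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1. Prepare a small training file. For GPT-3-era completion models the format
#    was JSON Lines with prompt/completion pairs (placeholder examples below).
examples = [
    {"prompt": "Classify the sentiment: 'I love this phone.' ->", "completion": " positive"},
    {"prompt": "Classify the sentiment: 'The battery died in an hour.' ->", "completion": " negative"},
]
with open("training_data.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")

# 2. Upload the file and start a fine-tuning job (model name is a placeholder;
#    check the current documentation for which base models can be fine-tuned).
training_file = client.files.create(file=open("training_data.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=training_file.id, model="davinci-002")
print(job.id, job.status)
```

Once the job finishes, the resulting model can be called by its own model ID, just like a base model.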


To summarize:
  1. Does GPT-3 need fine-tuning?
    Usually no – few-shot/zero-shot learning is enough for most tasks.

  2. Does GPT-3 support fine-tuning?
    Yes, especially useful for domain-specific or high-requirement tasks.
