Document Intelligence: OCR + RocketQA + LayoutXLM <RocketQA>

Going through RocketQA this time: in my reading, the paper is about improvements to the training of passage retrieval. It says little about the overall framework of coarse retrieval + re-ranking (built on the dual-encoder architecture); that is covered in other papers.

I previously skimmed it while working on the document-intelligence feature:

Document Intelligence: OCR + RocketQA + LayoutXLM <LayoutLMv2>

Recently I have been reading about RAG, which touches on retrieval and ranking, so I am revisiting this paper. Corrections for any gaps or mistakes are welcome and appreciated.

Notes as follows:

RocketQA is an optimized training approach for dense passage retrieval (DPR), built to support open-domain question answering (ODQA) systems.

1. Abstract & Introduction


It is difficult to effectively train a dual-encoder for dense passage retrieval due to the following three major challenges:

First, there exists the discrepancy between training and inference for the dual-encoder retriever.
During inference, the retriever needs to identify positive (or relevant) passages for each question from a large collection containing millions of candidates.
However, during training, the model is learned to estimate the probabilities of positive passages in a small candidate set for each question, due to the limited memory of a single GPU (or other device).

To reduce such a discrepancy, previous work tried to design specific mechanisms for selecting a few hard negatives from the top-k retrieved candidates. However, it suffers from the false negative issue due to the following challenge.

Second, there might be a large number of unlabeled positives.

Third, it is expensive to acquire large-scale training data for open-domain QA.


RocketQA adopts a series of optimization strategies: cross-batch negatives, denoised hard negatives, and data augmentation.

These target two problems during training: the shortage of negative samples, and the presence of many false negatives.


First, RocketQA introduces cross-batch negatives. Compared to in-batch negatives, this increases the number of available negatives for each question during training, and alleviates the discrepancy between training and inference.

Second, RocketQA introduces denoised hard negatives. It aims to remove false negatives from the top-ranked results retrieved by a retriever, and derive more reliable hard negatives.

Third, RocketQA leverages large-scale unsupervised data “labeled” by a cross-encoder (as shown in Figure 1b) for data augmentation.

Though inefficient, the cross-encoder architecture has been found to be more capable than the dual-encoder architecture in both theory and practice.

Therefore, we utilize a cross-encoder to generate high quality pseudo labels for unlabeled data which are used to train the dual-encoder retriever.

[Figure 1: (a) the dual-encoder architecture; (b) the cross-encoder architecture]


2. Related work

2.1 Passage retrieval for open-domain QA

Recently, researchers have utilized deep learning to improve traditional passage retrievers, including:

  • document expansions,
  • question expansions,
  • term weight estimation.

Different from the above term-based approaches, dense passage retrieval has been proposed to represent both questions and documents as dense vectors (i.e., embeddings), typically in a dual-encoder architecture (as shown in Figure 1a).

[Figure 1a: the dual-encoder architecture]

Existing approaches can be divided into two categories:
(1) self-supervised pre-training for retrieval.
(2) fine-tuning pre-trained language models on labeled data.

Our work follows the second class of approaches, which show better performance with less cost.

2.2 Passage re-ranking for open-domain QA

Based on the retrieved passages from a first-stage retriever, BERT-based rerankers have recently been applied to retrieval-based question answering and search-related tasks, and yield substantial improvements over the traditional methods.

Although effective to some extent, these re-rankers employ the cross-encoder architecture (as shown in Figure 1b), which is impractical to apply to every passage in a corpus for a given question.

Re-rankers with lightweight interaction based on the representations of dense retrievers have also been studied. However, these techniques still rely on a separate retriever to provide candidates and representations.

As a comparison, we focus on developing dual-encoder based retrievers.

3. Approach

3.1 Task Description

The task of open-domain QA is described as follows.
Given a natural language question, a system is required to answer it based on a large collection of documents.

Let $C$ denote the corpus, consisting of $N$ documents.

We split the $N$ documents into $M$ passages, denoted by $p_1, p_2, \ldots, p_M$,

where each passage $p_i$ can be viewed as an $l$-length sequence of tokens $p_i^{(1)}, p_i^{(2)}, \ldots, p_i^{(l)}$.

Given a question $q$, the task is to find a passage $p_i$ among the $M$ candidates,

and extract a span $p_i^{(s)}, p_i^{(s+1)}, \ldots, p_i^{(e)}$ from $p_i$ that can answer the question.

In this paper, we mainly focus on developing a dense retriever to retrieve the passages that contain the answer.


Is the passage length $l$ the same value for every passage?

See Section 4.1.3:

4.1.3 Implementation Details

1. Maximal length

We set the maximum length of questions and passages as 32 and 128, respectively.
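
So $l$ is best read as a maximum, not one shared constant: questions and passages are truncated (and usually padded) to 32 and 128 tokens. A minimal preprocessing sketch, assuming a HuggingFace-style tokenizer purely for illustration (the paper itself initializes from ERNIE 2.0):

```python
from transformers import AutoTokenizer

# Stand-in checkpoint for illustration; RocketQA initializes from ERNIE 2.0.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")

# Questions and passages are truncated/padded to 32 and 128 tokens respectively.
q_enc = tok("what is dense passage retrieval?",
            max_length=32, truncation=True, padding="max_length")
p_enc = tok("RocketQA is an optimized training approach to dense passage retrieval ...",
            max_length=128, truncation=True, padding="max_length")
```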


3.2 The Dual-Encoder Architecture

We develop our passage retriever based on the typical dual-encoder architecture, as illustrated in Figure 1a.

First, a dense passage retriever uses an encoder $E_p(\cdot)$ to obtain the $d$-dimensional real-valued vectors (a.k.a. embeddings) of passages.

Then, an index of passage embeddings is built for retrieval.

At query time, another encoder $E_q(\cdot)$ is applied to embed the input question into a $d$-dimensional real-valued vector, and the $k$ passages whose embeddings are closest to the question's will be retrieved.

The similarity between the question $q$ and a candidate passage $p$ can be computed as the dot product of their vectors:

$$\mathrm{sim}(q, p) = E_q(q)^{\top} \cdot E_p(p) \qquad (1)$$

In practice, the separation of question encoding and passage encoding is desirable, so that the dense representations of all passages can be precomputed for efficient retrieval.
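
A minimal sketch of that offline/online split; `encode_passage` and `encode_question` are hypothetical wrappers around the two trained encoders:

```python
import numpy as np

# Offline: embed every passage once; this matrix (or a FAISS index over it)
# is what makes retrieval efficient at query time.
passage_embs = np.stack([encode_passage(p) for p in passages])  # shape (M, d)

# Online: embed the question and score all passages by dot product (Equation 1).
q_emb = encode_question(question)        # shape (d,)
scores = passage_embs @ q_emb            # shape (M,)
top_k = np.argsort(-scores)[:10]         # indices of the 10 best passages
```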

Here, we adopt two independent neural networks initialized from pre-trained LMs for the two encoders $E_q(\cdot)$ and $E_p(\cdot)$ separately,

and take the representations at the first token (e.g., [CLS] symbol in BERT) as the output for encoding.


Why use the representation of the [CLS] token as the encoding output? Briefly: BERT is built on the Transformer, and the [CLS] token at the start of the sequence can attend to every other token, so its representation can summarize the meaning of the whole sequence.

For details, see:
https://blog.csdn.net/sdsasaAAS/article/details/142926242
https://blog.csdn.net/weixin_45947938/article/details/144232649
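
To make the [CLS] point concrete, a sketch using HuggingFace `transformers` with a BERT checkpoint as a stand-in (RocketQA itself initializes from ERNIE 2.0):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = AutoModel.from_pretrained("bert-base-uncased")

inputs = tok("what is dense passage retrieval?", return_tensors="pt")
with torch.no_grad():
    out = enc(**inputs)

# The hidden state of the first token ([CLS]) is used as the sequence embedding.
cls_emb = out.last_hidden_state[:, 0]    # shape: (1, hidden_size)
```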


3.2.1 Training

Formally, given a question $q_i$ together with its positive passage $p_i^+$ and $m$ negative passages $\{p_{i,j}^-\}_{j=1}^m$, we minimize the loss function:

$$\mathcal{L}\left(q_i, p_i^+, \{p_{i,j}^-\}_{j=1}^m\right) = -\log \frac{\exp\left(\mathrm{sim}(q_i, p_i^+)\right)}{\exp\left(\mathrm{sim}(q_i, p_i^+)\right) + \sum_{j=1}^{m} \exp\left(\mathrm{sim}(q_i, p_{i,j}^-)\right)} \qquad (2)$$

where we aim to optimize the negative log likelihood of the positive passage against a set of $m$ negative passages.

Ideally, we should take all the negative passages in the whole collection into consideration in Equation 2.

However, it is computationally infeasible to consider a large number of negative samples for a question, and hence $m$ is practically set to a small number that is far less than $M$.

As will be discussed later, both the number and the quality of negatives affect the final performance of passage retrieval.
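
A PyTorch sketch of Equation 2 for a single question; this is a reconstruction for illustration, not the authors' code:

```python
import torch
import torch.nn.functional as F

def nll_of_positive(q_emb, pos_emb, neg_embs):
    """q_emb: (d,), pos_emb: (d,), neg_embs: (m, d) -> scalar loss (Equation 2)."""
    pos_score = (q_emb * pos_emb).sum().view(1)    # sim(q, p+)
    neg_scores = neg_embs @ q_emb                  # sim(q, p-_j) for j = 1..m
    logits = torch.cat([pos_score, neg_scores])    # shape (1 + m,)
    return -F.log_softmax(logits, dim=0)[0]        # NLL of the positive passage
```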

3.2.2 Inference

In our implementation, we use FAISS (Facebook AI Similarity Search) to index the dense representations of all passages.

Specifically, we use IndexFlatIP for indexing and exact maximum inner product search for querying.

  • FAISS is a library for efficient similarity search and clustering of dense vectors, particularly well suited to fast similarity search over large datasets.

  • IndexFlatIP is a flat FAISS index: it stores all vectors directly and, at query time, computes the inner product between the query vector and every stored vector. IP stands for Inner Product, so IndexFlatIP fits applications whose similarity measure is an inner product (or cosine similarity, if the vectors are normalized).

  • Maximum inner product search is retrieval under the inner-product similarity: for a given query vector, it finds the stored vectors with the largest inner product. This is especially useful in information retrieval and recommendation, which revolve around computing similarity between vectors.

Combining IndexFlatIP with maximum inner product search makes it possible to efficiently find the passages most similar to a given query within a large text collection.

For larger datasets, it may be worth considering FAISS's more efficient index types, such as cluster-based indexes (e.g., IndexIVFPQ) or graph-based indexes (e.g., IndexHNSW), to speed up search while keeping quality high.

I haven't used FAISS myself, so this part wasn't intuitive to me.
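
For reference, the basic IndexFlatIP flow looks like this; random vectors stand in for real passage embeddings:

```python
import faiss                        # pip install faiss-cpu
import numpy as np

d = 768                             # embedding dimension
passage_embs = np.random.rand(100_000, d).astype("float32")

index = faiss.IndexFlatIP(d)        # flat index: exact maximum inner product search
index.add(passage_embs)             # store every passage vector as-is

query = np.random.rand(1, d).astype("float32")
scores, ids = index.search(query, 5)   # top-5 passage ids by inner product
```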

3.3 Optimized Training Approach

There are three major challenges in training the dual-encoder based retriever:

  • the training and inference discrepancy,
  • the existence of unlabeled positives,
  • limited training data.

3.3.1 Cross-batch Negatives

Assume that there are $B$ questions in a mini-batch on a single GPU, and each question has one positive passage.
[Figure 2: the comparison of traditional in-batch negatives and cross-batch negatives when trained on multiple GPUs, where $A$ is the number of GPUs and $B$ is the number of questions in each mini-batch.]

With $A$ GPUs (or mini-batches), we can indeed obtain $A \times B - 1$ negatives for a given question, which is approximately $A$ times as many as the original number of in-batch negatives.

In this way, we can use more negatives in the training objective of Equation 2, so that the results are expected to be improved.
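
The paper does not show code for this; on multi-GPU setups, cross-batch negatives are commonly realized with an all-gather of passage embeddings, as in this PyTorch sketch:

```python
import torch
import torch.distributed as dist

def cross_batch_passage_embs(local_embs: torch.Tensor) -> torch.Tensor:
    """local_embs: (B, d) on each of A GPUs -> (A*B, d) shared across GPUs."""
    world_size = dist.get_world_size()
    gathered = [torch.zeros_like(local_embs) for _ in range(world_size)]
    dist.all_gather(gathered, local_embs)     # all_gather itself does not carry gradients
    gathered[dist.get_rank()] = local_embs    # re-insert local embs to keep their gradient path
    return torch.cat(gathered, dim=0)         # each question now sees A*B - 1 negatives
```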

3.3.2 Denoised Hard Negatives

Because human-annotated labels are limited, there exist many unlabeled positives; hence the earlier practice:

To obtain hard negatives, a straightforward method is to select the top-ranked passages (excluding the labeled positive passages) as negative samples.

This method easily yields false negatives;

To address that:

We first train a cross-encoder.

Then, when sampling hard negatives from the top-ranked passages retrieved by a dense retriever, we select only the passages that are predicted as negatives by the cross-encoder with high confidence scores.

The selected top-retrieved passages can be considered as denoised samples that are more reliable to use as hard negatives.
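
A sketch of the denoising step; `cross_encoder_score` is a hypothetical helper returning $M_C$'s confidence that a (question, passage) pair is relevant, and 0.1 is the threshold reported in Section 4.1.3:

```python
def denoised_hard_negatives(question, top_retrieved, labeled_positives,
                            cross_encoder_score, neg_threshold=0.1):
    """Keep only top-retrieved passages the cross-encoder confidently predicts as negative."""
    hard_negatives = []
    for passage in top_retrieved:
        if passage in labeled_positives:
            continue                          # never sample a labeled positive
        if cross_encoder_score(question, passage) < neg_threshold:
            hard_negatives.append(passage)    # low score -> reliable hard negative
    return hard_negatives
```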

3.3.3 Data Augmentation

The third strategy aims to alleviate the issue of limited training data.

Since the cross-encoder is more powerful in measuring the similarity between questions and passages, we utilize it to annotate unlabeled questions for data augmentation.

Specifically, we incorporate a new collection of unlabeled questions, while reusing the passage collection.

Then, we use the learned cross-encoder to predict the passage labels for the new questions.

To ensure the quality of the automatically labeled data, we only select the predicted positive and negative passages with high confidence scores estimated by the cross-encoder.

Finally, the automatically labeled data is used as augmented training data to learn the dual encoder.
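
A sketch of the pseudo-labeling step, with the confidence thresholds from Section 4.1.3 (scores above 0.9 become positives, below 0.1 become negatives); `cross_encoder_score` is the same hypothetical helper as above:

```python
def pseudo_label(question, top_retrieved, cross_encoder_score,
                 pos_threshold=0.9, neg_threshold=0.1):
    """Automatically label retrieved passages for an unlabeled question."""
    positives, negatives = [], []
    for passage in top_retrieved:
        score = cross_encoder_score(question, passage)
        if score > pos_threshold:
            positives.append(passage)
        elif score < neg_threshold:
            negatives.append(passage)
    return positives, negatives    # mid-confidence passages are discarded
```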

3.4 The Training Procedure

[Algorithm: the optimized training procedure]
Require:
Let $C$ denote a collection of passages.
$Q_L$ is a set of questions that have corresponding labeled passages in $C$,
$Q_U$ is a set of questions that have no corresponding labeled passages.
$D_L$ is a dataset consisting of $C$ and $Q_L$,
$D_U$ is a dataset consisting of $C$ and $Q_U$.

STEP 1:
Train a dual-encoder $M_D^{(0)}$ by using cross-batch negatives on $D_L$.

STEP 2:
Train a cross-encoder $M_C$ on $D_L$.

  • The positives used for training the cross-encoder are from the original training set $D_L$,
  • while the negatives are randomly sampled from the top-k passages (excluding the labeled positive passages) retrieved by $M_D^{(0)}$ from $C$ for each question $q \in D_L$.

This design is to let the cross-encoder adjust to the distribution of the results retrieved by the dual-encoder, since the cross-encoder will be used in the following two steps for optimizing the dual-encoder.

STEP 3:
Train a dual-encoder $M_D^{(1)}$ by further introducing denoised hard negative sampling on $D_L$.

For each question $q \in D_L$, the hard negatives are sampled from the top passages retrieved by $M_D^{(0)}$ from $C$,

and only the passages that are predicted as negatives by the cross-encoder $M_C$ with high confidence scores will be selected.

STEP 4:
Construct pseudo training data $D_U$ by using $M_C$ to label the top-k passages retrieved by $M_D^{(1)}$ from $C$ for each question $q \in D_U$,

and then train a dual-encoder $M_D^{(2)}$ on both the manually labeled training data $D_L$ and the automatically augmented training data $D_U$.
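
Putting the four steps together, a high-level sketch of the procedure; every helper here (`train_dual_encoder`, `train_cross_encoder`, `retrieve_top_k`, plus the `denoised_hard_negatives` / `pseudo_label` routines sketched above) is a hypothetical stand-in, not the released implementation:

```python
def rocketqa_training(D_L, Q_U, C, k=100):
    # STEP 1: dual-encoder with cross-batch negatives on the labeled data.
    M_D0 = train_dual_encoder(D_L, negatives="cross-batch")

    # STEP 2: cross-encoder; its negatives come from M_D0's top-k retrievals.
    cand = {q: retrieve_top_k(M_D0, q, C, k) for q in D_L.questions}
    M_C = train_cross_encoder(D_L, negative_pool=cand)

    # STEP 3: dual-encoder trained with denoised hard negatives.
    hard = {q: denoised_hard_negatives(q, cand[q], D_L.positives[q], M_C.score)
            for q in D_L.questions}
    M_D1 = train_dual_encoder(D_L, hard_negatives=hard)

    # STEP 4: pseudo-label unlabeled questions, then train the final dual-encoder.
    D_U = {q: pseudo_label(q, retrieve_top_k(M_D1, q, C, k), M_C.score)
           for q in Q_U}
    return train_dual_encoder(D_L, extra_data=D_U)   # M_D2
```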


My own understanding:

First, use the human-labeled dataset $D_L$ to train a retrieval model, the dual-encoder $M_D^{(0)}$.

Next, train a classification model, the cross-encoder $M_C$, which ultimately separates positive from negative samples. Its positives come from $D_L$; its negatives come from the top-k passages (excluding the labeled positive passages) returned by $M_D^{(0)}$.

Then, train the retrieval model dual-encoder $M_D^{(1)}$. Its added hard negatives still come from the top-k passages (excluding the labeled positive passages) returned by $M_D^{(0)}$, but with a filter applied: only those that the cross-encoder from step 2 predicted as negatives are kept.

This screens out some of the unlabeled positives that would slip in if the top-k passages (excluding the labeled positive passages) from $M_D^{(0)}$ were used directly.

After that, feed the questions in $D_U$ to $M_D^{(1)}$ to get the top-k passages, and feed these to $M_C$ to produce labels.

Finally, use the human-labeled $D_L$ together with the pseudo-labeled $D_U$ to train the final retrieval model, the dual-encoder $M_D^{(2)}$.


Calling $M_C$ a binary classification model is not quite accurate; read together with Section 4.1.3, it is also a retrieval model:

4.1 Experimental Setup

4.1.3 Implementation Details

1. Pre-trained LMs

The dual-encoder is initialized with the parameters of ERNIE 2.0 base, and the cross-encoder is initialized with ERNIE 2.0 large.

2. Denoised hard negatives and data augmentation

We use the cross-encoder for both denoising hard negatives and data augmentation.

Specifically, we select the top retrieved passages with scores less than 0.1 as negatives and those with scores higher than 0.9 as positives.

We manually evaluated the selected data, and the accuracy was higher than 90%.

3. The number of positives and negatives

When training the cross-encoders, the ratios of the number of positives to the number of negatives are 1:4 and 1:1 on MSMARCO and NQ, respectively.

The negatives used for training cross-encoders are randomly sampled from the top-1000 and top-100 passages retrieved by the dual-encoder $M_D^{(0)}$ on MSMARCO and NQ, respectively.

When training the dual-encoders in the last two steps ($M_D^{(1)}$ and $M_D^{(2)}$), we set the ratios of the number of positives to the number of hard negatives as 1:4 and 1:1 on MSMARCO and NQ, respectively.


