Learning Multi-Modal Representations by Watching Hundreds of Surgical Video Lectures | Literature Express: Latest Paper Sharing

Title

Learning multi-modal representations by watching hundreds of surgical video lectures

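Introduction

Recent advances in surgical computer vision have begun to pave the way for a new generation of AI-assisted support systems in the operating room (OR) (Maier-Hein et al., 2017, 2022; Ward et al., 2021; Mascagni et al., 2022; Madani et al., 2020; Yuan et al., 2021). The field has made significant progress, moving from coarse surgical workflow recognition (Blum et al., 2008, 2010; Padoy et al., 2012; Twinanda et al., 2016; Dergachyova et al., 2016) to fine-grained surgical scene understanding through surgical action triplets (Nwoye et al., 2022), pixel-level scene segmentation (Allan et al., 2019; Alapatt et al., 2021), and surgical scene reconstruction (Wang et al., 2022; Pfeiffer et al., 2019; Rivoir et al., 2021). However, current progress has three main limitations. First, these methods largely focus on building task-specific, fully supervised deep learning models, which require substantial effort from clinical experts to generate labeled ground-truth data. Second, their effectiveness has mainly been validated on a limited number of single-center, procedure-specific surgical video datasets, which are insufficient to cover the intricate details of the full surgical workflow (Eisenmann et al., 2022). Third, these methods do not explicitly integrate the rich semantics of natural language text into their design. Natural language text, which covers a broad range of visual concepts, can serve as a natural supervisory signal for vision models, ensuring high generality and usability across many downstream tasks. Methods that can scale to multiple downstream tasks using natural language supervision with minimal annotation, while fully exploiting large-scale multi-procedure videos, will play an important role in broadening the adoption of these methods.

In general computer vision, multi-modal representation learning that combines visual information with free-form natural text (Radford et al., 2021; Miech et al., 2020) is emerging as a viable alternative that avoids the need to collect labeled training data for each downstream task (Radford et al., 2021; Jia et al., 2021). These methods aim to learn a low-dimensional joint latent space by pre-training two parallel encoders, one for vision and one for text, on large-scale paired visual-textual inputs. The latent space shared by the two modalities supports zero-shot transfer learning: the pre-trained visual and text encoders can adapt to different downstream tasks without fine-tuning on task-specific labels. This breakthrough has produced impressive results in many general computer vision applications, including zero-shot image classification (Radford et al., 2021), image captioning (Nukrai et al., 2022), semantic image retrieval (Sain et al., 2023), and text-to-shape generation (Sanghi et al., 2022).

Given this remarkable progress in multi-modal representation learning, a natural question arises: can such high-level joint representations be learned for surgical computer vision? If so, this would be an important step forward for surgical data science (Maier-Hein et al., 2022). With such representations, we could not only perform existing surgical video analysis tasks without task-specific labels, such as surgical workflow recognition from coarse to fine granularity (Twinanda et al., 2016; Nwoye et al., 2022), but also open new avenues for scalable intelligent cognitive assistance in the operating room. This includes vision-language applications such as surgical visual question answering (Seenivasan et al., 2022), surgical report generation (Xu et al., 2021b), and interactive communication between clinicians and surgical devices.

This work introduces SurgVLP (Surgical Vision-Language Pre-training), a deep learning method for large-scale multi-modal representation learning in surgical computer vision. Developing such a method comes with its own challenges. One of the main obstacles is the lack of large-scale multi-modal, multi-procedure surgical datasets, compared with the millions of multi-modal visual-text pairs available in general computer vision (Radford et al., 2021; Grauman et al., 2022; Miech et al., 2019). For example, the recently developed Ego4D dataset (Grauman et al., 2022) collects 3000 hours of activity videos and narrates them manually. Because collecting and annotating surgical videos requires a great deal of human effort, this approach is hard to replicate in the surgical domain.

As our first contribution, we propose to use surgical video lectures available through open surgical e-learning platforms, such as WebSurg (Websurg, 2023) and EAES (EAES, 2023), and online video-sharing platforms, such as YouTube (YouTube, 2023), for visual-textual multi-modal learning. In contrast to manually annotated medical imaging reports (Chen et al., 2022a) or surgical instructions (Rojas-Muñoz et al., 2020), we propose to use unprocessed and potentially noisy audio as the main source of supervision for multi-modal representation learning. We leverage recent advances in automatic speech recognition (ASR) (Mehrish et al., 2023) to transcribe the lecture audio into sentences and associate them with the corresponding video clips, building a large number of video clip-text pairs, as shown in Fig. 1. The resulting Surgical Video Lecture (SVL) dataset contains diverse descriptions of surgical events, instrument usage, and anatomical status across a variety of surgical procedures, and therefore provides sufficient supervision for surgical multi-modal representation learning.

However, multi-modal representation learning on the SVL dataset faces several linguistic challenges. First, the surgical concepts described in these videos rely on domain-specific knowledge and scientific terminology that is uncommon in general computer vision. For example, "grasp the neck of the gallbladder and retract it toward the left lower quadrant to open Calot's triangle" and "dissect above the imaginary line of safety joining Rouviere's sulcus and the base of liver segment IV" are typical procedure-specific descriptions in video lectures on laparoscopic cholecystectomy. In addition, a surgical video clip and its corresponding text may be semantically misaligned. A lecturer describing a procedure may digress from the current case to recall a similar case with a bleeding event, even though no bleeding is shown in the associated video. These videos also exhibit long-range dependencies. For example, a lecturer may comment on the importance of adequate dissection for a tension-free anastomosis even though the dissection step was shown at the beginning of the procedure or has been edited out. Finally, although recent ASR models (Chen et al., 2022b; Radford et al., 2023) transcribe everyday speech effectively, their performance in surgical settings is unsatisfactory because of the surgery-specific linguistic challenges described above. For example, the state-of-the-art Whisper ASR model (Radford et al., 2023) understands sentence structure and common vocabulary but struggles with surgery-specific terms (e.g., it transcribes "jejunostomy" as "egenostomy"). Commercial medical-specific solutions, such as AWS (2023), transcribe medical terms better but often fail to capture the overall structure and boundaries of sentences.

We propose two key techniques for surgery-specific multi-modal representation learning. First, we use text transcriptions from two noisy but complementary ASR systems, namely Whisper (Radford et al., 2023) and AWS (2023), to obtain an improved supervisory signal for learning, as shown in Fig. 1, which mitigates the individual limitations and inaccuracies of each system. Second, we propose a new contrastive learning objective that uses the dual text transcriptions from the ASR systems together with the corresponding video clips. The objective encourages the embeddings of a video clip and of its two transcriptions to lie close together in the joint latent space. In this way, the learned multi-modal representations preserve the semantics shared by the noisy ASR transcriptions and fuse visual and textual information more effectively.

To demonstrate the representational capability of the learned joint latent space, we introduce several surgical vision-language tasks as a multi-modal evaluation benchmark: text-based video retrieval, temporal activity grounding, and video captioning. Text-based video retrieval associates a given text query with video clips from a collection, whereas temporal activity grounding localizes a given text query to a specific clip within a full video. These two tasks test how well the joint latent space captures the underlying relationship between surgical visual information and its textual description. Video captioning generates a caption for a given surgical video clip. Because it is a generative task, it requires a text decoder to produce coherent text output. We propose a way to build a text decoder and attach it to our pre-trained encoders, so that the pre-trained model can be repurposed as a video captioner. The whole process requires only text data to train the text decoder. We show that our method improves significantly over baseline methods on all vision-language tasks.

Next, we evaluate the robustness and adaptability of our method on unseen surgical datasets and tasks. Specifically, we examine its performance on traditional vision-only surgical tasks, including surgical tool, phase, and action triplet recognition (Twinanda et al., 2016; Nwoye et al., 2022). We evaluate our method in a zero-shot transfer setting by converting the class labels (tools, phases, or action triplets) into text and classifying video frames according to the similarity between the visual and textual latent vectors. The results show that the generic surgical concepts learned through joint multi-modal representations from a variety of procedures help on a specific procedure such as laparoscopic cholecystectomy. To the best of our knowledge, this is the first work to show that surgical tools, phases, and action triplets can be recognized without annotations through self-supervised multi-modal pre-training. Although our zero-shot performance lags behind fully supervised baselines, particularly on tasks that require fine-grained anatomical reasoning, the results highlight the potential of SurgVLP as a foundational backbone for reducing annotation costs in downstream tasks. Finally, we conduct extensive ablation studies to clarify the role of the different components of our method and their effect on the results.

The contributions of our work can be summarized in four key aspects:

- We propose to use the knowledge in surgical video lectures, available through open surgical e-learning platforms, for visual-textual multi-modal representation learning. To this end, we introduce a large-scale Surgical Video Lecture (SVL) dataset containing 1.4k surgical procedure videos.
- We propose to use text transcriptions from two complementary ASR systems, Whisper and AWS, to strengthen representation learning by addressing the linguistically inaccurate sentences these systems produce.
- We propose a novel contrastive learning objective that uses the dual text transcriptions from the ASR systems and the corresponding video clips, encouraging their embeddings to lie close together in the joint latent space.
- We demonstrate the zero-shot transfer capability of the proposed framework on multiple vision-language and vision-only tasks.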
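The construction of video clip-text pairs from two ASR outputs can be pictured with a small sketch. This is only an illustration under stated assumptions: the `Segment` type, the `build_pairs` helper, and the rule of attaching the Whisper sentence with the largest temporal overlap are all hypothetical, since this summary does not spell out the pairing procedure.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    """One ASR sentence with its timestamps in seconds (hypothetical type)."""
    start: float
    end: float
    text: str


def overlap(a: Segment, b: Segment) -> float:
    """Length in seconds of the temporal intersection of two segments."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))


def build_pairs(aws: List[Segment], whisper: List[Segment]) -> List[dict]:
    """Assumed rule: each AWS sentence defines a clip, and the Whisper
    sentence with the largest temporal overlap becomes the second text view."""
    pairs = []
    for a in aws:
        best: Optional[Segment] = None
        for w in whisper:
            if overlap(a, w) > 0.0 and (best is None or overlap(a, w) > overlap(a, best)):
                best = w
        pairs.append({
            "clip": (a.start, a.end),
            "aws": a.text,
            "whisper": best.text if best else None,  # None when nothing overlaps
        })
    return pairs


# Toy timestamps and texts, not taken from the SVL dataset.
aws = [Segment(0.0, 4.0, "aws sentence one"), Segment(4.0, 9.0, "aws sentence two")]
whisper = [Segment(0.5, 3.5, "whisper sentence one"), Segment(5.0, 8.0, "whisper sentence two")]
pairs = build_pairs(aws, whisper)
```

Each resulting record carries one clip window and up to two text views, which is the shape of input the dual-transcription contrastive objective needs.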

Abstract

Recent advancements in surgical computer vision applications have been driven by vision-only models, which do not explicitly integrate the rich semantics of language into their design. These methods rely on manually annotated surgical videos to predict a fixed set of object categories, limiting their generalizability to unseen surgical procedures and downstream tasks. In this work, we put forward the idea that the surgical video lectures available through open surgical e-learning platforms can provide effective vision and language supervisory signals for multi-modal representation learning without relying on manual annotations. We address the surgery-specific linguistic challenges present in surgical video lectures by employing multiple complementary automatic speech recognition systems to generate text transcriptions. We then present a novel method, SurgVLP (Surgical Vision Language Pre-training), for multi-modal representation learning. SurgVLP constructs a new contrastive learning objective to align video clip embeddings with the corresponding multiple text embeddings by bringing them together within a joint latent space. To effectively demonstrate the representational capability of the learned joint latent space, we introduce several vision-and-language surgical tasks and evaluate various vision-only tasks specific to surgery, e.g., surgical tool, phase, and triplet recognition. Extensive experiments across diverse surgical procedures and tasks demonstrate that the multi-modal representations learned by SurgVLP exhibit strong transferability and adaptability in surgical video analysis. Furthermore, our zero-shot evaluations highlight SurgVLP's potential as a general-purpose foundation model for surgical workflow analysis, reducing the reliance on extensive manual annotations for downstream tasks, and facilitating adaptation methods such as few-shot learning to build a scalable and data-efficient solution for various downstream surgical applications.

Conclusion

The expensive and laborious process of creating manually annotated datasets has been a main hindrance in developing scalable surgical computer vision AI systems. In this work, we argue that surgical video lectures available through open surgical e-learning platforms can provide a wealth of multi-modal knowledge to train a scalable system for multi-modal representation learning. We have harnessed this knowledge by creating a multi-modal and multi-procedural dataset comprising 1.4k surgical video lectures. In order to derive effective supervisory signals without manual annotations, we utilize the recent advancements in automatic speech recognition (ASR) systems to transcribe the audio from these videos into textual descriptions. This automated process has resulted in a visual-textual multi-modal surgical dataset consisting of descriptions of surgical events, instrument usage, and anatomical status across various surgical procedures.

In order to tackle the surgery-specific linguistic challenges inherently present in these videos, we utilize text transcriptions from two complementary ASR models, namely Whisper and AWS. The AWS model captures specific surgical terms, whereas the Whisper model captures the overall sentence structure. By combining the complementary knowledge of these two systems, we overcome inherent limitations and inaccuracies present in each ASR system. We then propose a novel contrastive learning objective for multi-modal representation learning. Our approach, called SurgVLP, learns effective multi-modal representations by bringing the embeddings of multiple text transcriptions and of the video clip into close proximity in the joint latent space.

To demonstrate the efficacy of the learned joint latent space, we present a range of vision-and-language tasks tailored for surgical computer vision. These tasks include text-based video retrieval, temporal activity grounding, and video captioning, serving as benchmarks for evaluating the multi-modal representation capability of SurgVLP. We demonstrate that the learned multi-modal representations are not only useful for these vision-and-language tasks but can also be seamlessly applied to traditional vision-only surgical downstream tasks. We show promising results on these vision-only surgical tasks, namely surgical tool, phase, and triplet recognition, without using any manual annotations.

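6.1. Discussion

6.1.1. Future work

This work shows that the proposed SurgVLP outperforms the state-of-the-art method from general computer vision (Radford et al., 2021) in zero-shot performance. This strong performance is attributable to the large-scale surgical vision-language dataset we built and to the pre-training strategy that uses multiple text views. However, zero-shot adaptation of SurgVLP uses no supervision from labeled data, so its performance still falls short of fully supervised methods (Twinanda et al., 2016; Czempiel et al., 2020).

For practical applications, one potential improvement is to fine-tune the multi-modal representations of the pre-trained SurgVLP with full supervision on a small amount of labeled data, adapting them to downstream tasks. Specifically, the dual-branch architecture of SurgVLP can encode domain-specific textual knowledge while capturing detailed visual patterns in surgical scenes (Kan et al., 2023). The feature extractors of fully supervised methods (Twinanda et al., 2016; Czempiel et al., 2020) could therefore be enhanced with complementary information from the text side.

Another direction for future research is to explore "cheaper" self-supervisory signals in the visual and textual modalities. Representative work includes building external knowledge bases (Shen et al., 2022) and retrieval-augmented vision-language pre-training (Xie et al., 2023). Given the recent rise of large language models (Touvron et al., 2023) and the clinical knowledge they encode, exploring their use by mining the knowledge in these models could help bridge the domain gap.

In addition, the current work ignores the hierarchical structure inherent in surgical videos. To address this, hierarchical multi-modal pre-training could be introduced to further improve performance on surgical downstream tasks that require long temporal context for prediction.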

Figure

Fig. 1. Examples of video clip-text pairs from the SVL dataset. Each pair consists of a video clip and its corresponding transcript. We generate transcripts for hundreds of surgical video lectures using two ASR systems, i.e., AWS Medical Transcribe (AWS, 2023) and Whisper (Radford et al., 2023). The transcripts usually illustrate the essential concepts of surgical anatomies, instruments, and events. We use large-scale video clip-text pairs to learn joint multi-modal representations.

Fig. 2. Pipeline of the proposed SurgVLP. Panel (a) shows examples of video clip-text pairs and how they are constructed. We have two text views and pair them with video clips of random lengths. Panel (b) presents the contrastive learning objective with AWS sentences and Whisper sentences. SurgVLP uses the Info-NCE and MIL-NCE losses for AWS and Whisper sentences, respectively. Panel (c) illustrates how downstream tasks are performed in the zero-shot setting. The vision-and-language tasks, e.g., text-based video retrieval and temporal activity grounding, are shown at the top and the vision-only tasks at the bottom.
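The two losses named in the Fig. 2 caption can be sketched in a few lines of plain Python. This is a simplified, video-to-text-only illustration rather than the authors' implementation: the temperature value, the toy embeddings, the choice of negatives (sentences belonging to other clips in the batch), and the plain sum of the two terms are all assumptions.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def logsumexp(values):
    """Numerically stable log(sum(exp(values)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def info_nce(video, texts, temperature=0.1):
    """InfoNCE: clip i has exactly one positive sentence, texts[i]."""
    losses = []
    for i, v in enumerate(video):
        logits = [dot(v, t) / temperature for t in texts]
        losses.append(logsumexp(logits) - logits[i])
    return sum(losses) / len(losses)


def mil_nce(video, text_bags, temperature=0.1):
    """MIL-NCE: clip i has a bag of candidate positives, text_bags[i];
    sentences in the other bags act as negatives."""
    losses = []
    for i, v in enumerate(video):
        positives = [dot(v, t) / temperature for t in text_bags[i]]
        everything = [dot(v, t) / temperature for bag in text_bags for t in bag]
        losses.append(logsumexp(everything) - logsumexp(positives))
    return sum(losses) / len(losses)


# Toy unit-norm embeddings standing in for encoder outputs.
video = [[1.0, 0.0], [0.0, 1.0]]
aws_texts = [[1.0, 0.0], [0.0, 1.0]]
whisper_bags = [[[1.0, 0.0], [0.8, 0.6]], [[0.0, 1.0]]]

# Assumed combination: an unweighted sum of the two terms.
total_loss = info_nce(video, aws_texts) + mil_nce(video, whisper_bags)
```

With a bag of size one, `mil_nce` reduces to `info_nce`, which makes explicit that MIL-NCE is the multiple-positive generalization of the same objective.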

Fig. 3. Text-only training for video captioning. We use the learned joint embedding space, in which text is encoded in a representation close to those of its corresponding video clips. During training, we train the text decoder to generate captions from text embeddings. During inference, video clips are fed to the visual encoder, and the resulting visual embeddings are fed to the text decoder to generate the captions.

Fig. 4. Qualitative results of text-based video retrieval on the SVL-Retrieval dataset using SurgVLP's learned joint multi-modal representations. For each language query, we retrieve 3 video clips from the repository. The ground-truth video clip is framed in green. In these examples, it always appears among the top-3 results.
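Top-k retrieval of this kind is usually summarized with Recall@K. The sketch below assumes a precomputed query-by-clip similarity matrix and one ground-truth clip per query; the function name and the toy numbers are illustrative and not taken from the paper.

```python
def recall_at_k(similarity, ground_truth, k):
    """Fraction of queries whose ground-truth clip is among the k most
    similar clips. similarity[q][c] scores clip c for query q."""
    hits = 0
    for q, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if ground_truth[q] in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy scores: 3 text queries against a repository of 4 clips.
similarity = [
    [0.9, 0.1, 0.2, 0.3],  # ground truth (clip 0) is ranked first
    [0.2, 0.5, 0.8, 0.1],  # ground truth (clip 1) is ranked second
    [0.6, 0.5, 0.1, 0.4],  # ground truth (clip 2) is ranked last
]
ground_truth = [0, 1, 2]

r_at_1 = recall_at_k(similarity, ground_truth, k=1)
r_at_3 = recall_at_k(similarity, ground_truth, k=3)
```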

Fig. 5. Textual-visual activation maps from different sentence queries. The first row shows the ground truth. The second row shows the predicted activation map along the time axis for the raw sentence. The third row shows the newly generated activation maps conditioned on modified sentences. When the whole sentence is decomposed into sub-sentences, the SurgVLP approach generates a focused textual-visual activation map for the sentence with clear and less ambiguous words. This shows that SurgVLP responds to specific surgical terms rather than general terminology.

Fig. 6. Textual-visual activation maps of the SurgVLP model, computed on two language queries from the SVL-Retrieval testing set. The language queries are shown at the top of the figure, and the first row shows the ground-truth activation map. The second and third rows show the activation maps of SurgVLP trained with one text view, i.e., AWS texts and Whisper texts, respectively. The last row shows that when the SurgVLP model is trained on both AWS and Whisper texts, it generates more concrete activation maps with less noise.

Fig. 7. Qualitative results of temporal activity grounding. We show the grounding results of two videos with three language queries. Each set of images represents a video clip. We show the top-2 grounded clips for the given text queries. Video clips framed in green are the ground truth for the given text. #1: top-1 grounded result. #2: top-2 grounded result.

Fig. 8. Caption results from text-only training for video captioning. Random: randomly initialized SurgVLP. CLIP (Radford et al., 2021): publicly available joint embedding space from the OpenAI pre-trained CLIP model. SurgVLP shows more reliable captioning results, with more overlap with the ground-truth sentence. The SurgVLP approach can also generate detailed captions that mention the surgical instrument, e.g., "pledgets" in the top row, last column.

Fig. 9. Effect of our designed contextual prompts on the zero-shot transfer to vision-only downstream tasks. Our contextual prompts outperform their counterparts by encoding more specific action and anatomy information, thus boosting phase recognition and instrument-verb recognition.

Fig. 10. Text architecture selection. We calculate the cosine similarity score between the transcript texts from ASR and the pre-segment texts from metadata to measure which text encoder best retains the semantic information shared by these two texts.

Table

Table 1 Comparison of transcriptions generated by AWS and Whisper ASR systems.

Table 2 Manually designed contextual prompts for the class names of the surgical phase and tool recognition tasks. The main action of scissors is cutting, but this action can be performed by many other instruments, such as the hook. Therefore, we use "I use scissors" as the context prompt for the "Scissors" class.
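Zero-shot recognition with such prompts amounts to picking the class whose prompt embedding is most similar to the frame embedding. The sketch below is hedged accordingly: only the "I use scissors" prompt comes from the caption above, the "Hook" prompt is a hypothetical placeholder, and the three-dimensional vectors are toy stand-ins for the outputs of the pre-trained text and visual encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm


PROMPTS = {
    "Scissors": "I use scissors",  # prompt quoted in the Table 2 caption
    "Hook": "I use hook",          # hypothetical prompt, for illustration only
}


def zero_shot_classify(frame_embedding, prompt_embeddings):
    """Return the class whose prompt embedding is closest to the frame."""
    return max(prompt_embeddings,
               key=lambda name: cosine(frame_embedding, prompt_embeddings[name]))


# Toy vectors; in practice each prompt in PROMPTS would be passed through the
# text encoder and each video frame through the visual encoder.
prompt_embeddings = {"Scissors": [1.0, 0.1, 0.0], "Hook": [0.0, 1.0, 0.2]}
frame_embedding = [0.9, 0.2, 0.1]
predicted = zero_shot_classify(frame_embedding, prompt_embeddings)
```

Taking the single best class suits phase recognition, where one phase is active at a time. Tool recognition is multi-label, so a per-class score or threshold would be the natural variant there.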

Table 3 Comparison of different datasets in this work. Human: whether the dataset requires intervention by human annotators. SVL-Caption and SVL-Retrieval require partial intervention because texts are not annotated from scratch by human annotators.

Table 4 Ablation studies. We conduct three sets of experiments to demonstrate the effect of key designs in our approach: multiple text views, clips of random lengths, and frame sampling from the video clip. Video-AWS pairs: model trained with one AWS text view; video-Whisper pairs: model trained with one Whisper text view; video-AWS-Whisper triples: model trained with both text views. Random: selecting a video clip with a duration randomly chosen from the range of 2 to 10 s.
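The "Random" setting can be illustrated with a short sampler. Only the 2 to 10 s range comes from the caption; centring the clip on a sentence timestamp, drawing the duration uniformly, and clamping to the video bounds are assumptions made for this sketch.

```python
import random


def sample_clip(center, video_duration, rng, min_len=2.0, max_len=10.0):
    """Return (start, end) in seconds for a clip of random duration,
    centred on `center` when possible and clamped to the video bounds."""
    length = rng.uniform(min_len, max_len)
    start = max(0.0, center - length / 2.0)
    end = min(video_duration, start + length)
    return start, end


rng = random.Random(0)  # seeded so the sketch is reproducible
start, end = sample_clip(center=50.0, video_duration=100.0, rng=rng)
```

Near the start or end of a video the clamped clip can be off-centre or shorter than the drawn duration. A real pipeline would need to decide whether to re-sample in that case.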

Table 5 Comparison of different methods in text-based video retrieval and temporal activity grounding tasks.

Table 6 SVL-Retrieval dataset. We show the categorical tags of the videos in the SVL-Retrieval testing set. Each video can belong to multiple categories, reflecting the diverse range of surgical procedures included in the testing set.

Table 7 Quantitative results of text-only training for video captioning. We report 6 conventional metrics to measure the similarity between the generated text and the ground-truth text. Our proposed SurgVLP significantly outperforms previous work, especially for ROUGE, which requires an accurate representation of not only individual words but also their correct order.
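The sensitivity of ROUGE to word order is easiest to see in the ROUGE-L variant, which scores the longest common subsequence (LCS) of candidate and reference. The sketch below assumes whitespace tokenization and an unweighted F1; the caption does not say which ROUGE variant or tooling the authors used, so this is only an illustration of the idea.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 on whitespace tokens (precision and recall weighted equally)."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "the surgeon clips the cystic duct"
exact = rouge_l_f1("the surgeon clips the cystic duct", reference)
shuffled = rouge_l_f1("the cystic duct clips the surgeon", reference)
```

The shuffled caption contains exactly the same words as the reference but scores lower than the exact match, because only an in-order subsequence counts toward the LCS.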

Table 8 Zero-shot tool recognition on Cholec80. T1: grasper; T2: bipolar; T3: hook; T4: scissor; T5: clipper; T6: irrigator; T7: specimen bag. Fully-supervised: ResNet50 model with full supervision.

Table 9 Zero-shot phase recognition on Cholec80. P1: preparation; P2: Calot triangle dissection; P3: clipping and cutting; P4: gallbladder dissection; P5: gallbladder packing; P6: cleaning and coagulation; P7: gallbladder extraction. F1-Score is used as the evaluation metric. Fully-supervised: ResNet50 model with full supervision.

Table 10 Zero-shot triplet recognition results. We report the average precision for each component and the combination of the components. i: instrument, v: verb, t: target, iv: instrument-verb, it: instrument-target, ivt: instrument-verb-target triplet.
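Average precision for one class can be computed from per-frame scores and binary labels as follows. This is a minimal, non-interpolated version with toy inputs; the official evaluation tooling for triplet recognition may treat ties, interpolation, and averaging across classes differently.

```python
def average_precision(scores, labels):
    """Non-interpolated AP: mean of the precision values measured at the
    rank of each positive sample, after sorting by descending score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


# Toy example: 4 frames scored for one class, 2 of which contain it.
scores = [0.9, 0.8, 0.7, 0.6]
labels = [1, 0, 1, 0]
ap = average_precision(scores, labels)
```

Averaging this value over all classes of a component (for example all instrument classes) gives the mean AP of that component.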