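MSTNet: Multi-scale Spatial-aware Transformer with Multi-instance Learning for Diabetic Retinopathy Classification | Literature Express: Latest Literature on Deep Learning Medical AI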

Title

題目

MSTNet: Multi-scale spatial-aware transformer with multi-instance learning for diabetic retinopathy classification

MSTNet：用于糖尿病視網膜病變分類的多尺度空間感知 Transformer 與多實例學習方法

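Introduction

Diabetic retinopathy (DR) is one of the major complications of diabetes. It affects millions of people worldwide, seriously threatens vision, and can lead to blindness in adults (Heald et al., 2020; Teo et al., 2021). About one third of people with diabetes are affected by DR, and the number is rising. Early detection and timely treatment are essential for preventing vision loss, but early symptoms are usually inconspicuous and diagnosis requires advanced medical equipment and specialist expertise, so many patients are already at an advanced stage when they are diagnosed. Developing efficient and accurate methods for early DR diagnosis and grading is therefore a major challenge in medical imaging.

Advances in medical imaging have made fundus photography the core tool for DR diagnosis. Clinically, DR is usually graded into five severity levels (see Fig. 1): normal, mild, moderate, severe, and proliferative. As DR progresses, its pathological features interact in complex ways and have combined effects. For example, microaneurysms are a sign of capillary wall damage; their rupture and leakage cause dysfunction of the blood-retinal barrier, which in turn causes retinal edema and fluid exudation (Liu, 2022). Ruptured microaneurysms lead to intraretinal hemorrhage, and lipids and proteins leaking from the blood form hard exudates that aggravate retinal damage (Zhao et al., 2017; Han et al., 2015). These changes interact and jointly increase DR severity, especially in the macular region, and ultimately worsen visual impairment (Wang et al., 2009; Yan et al., 2011). The dynamic relationships among these pathological features are therefore crucial to DR diagnosis and treatment, but assessing them is time-consuming and requires extensive expertise.

Automating DR grading and diagnosis through computer-aided diagnosis has become a research hotspot (Li et al., 2022; Wang et al., 2021b). Deep learning, and convolutional neural networks (CNNs) in particular, has made significant progress in medical image analysis: it learns autonomously from large numbers of labeled images and reaches or even exceeds the accuracy of human experts in DR grading and diagnosis (Li et al., 2019b). For example, Liu et al. (2019) proposed WP-CNN, which strengthens CNNs through ensemble learning over multiple weighted paths and improves the accuracy of referable-DR detection; Gayathri et al. (2020) designed a dedicated CNN that extracts key features from retinal images to improve automatic DR detection and classification; and Hemalakshmi et al. (2021) introduced a CNN-RBF framework that combines preprocessing, feature extraction, and the fusion of a CNN with a radial basis function (RBF) model to significantly improve the accuracy of retinal disease classification. However, existing CNNs fall short when handling complex spatial interactions and long-range dependencies, especially in the fine-grained examination of fundus images in which lesions are scattered, and this may reduce the accuracy of diagnosing DR type and severity. CNNs may fail to fully capture the correlations among different lesion regions and their global context, both of which are essential for assessing DR accurately.

The Vision Transformer (ViT) (Dosovitskiy et al., 2020), derived from the Transformer architecture used in natural language processing, captures long-range dependencies and global information, which is an advantage in complex medical image analysis that requires holistic understanding. For example, Wu et al. (2021a) used ViT to identify the different stages of DR, demonstrating the effectiveness of a pure attention mechanism in medical image analysis; Bi et al. (2023) proposed the MIL-ViT framework, which combines the global semantic analysis of ViT with the local feature extraction of multiple instance learning (MIL) to address retinal disease classification; and Yang et al. (2024) proposed TMIL, which introduces MIL into a Transformer model to avoid information loss in high-resolution retinal imaging and sets a new benchmark for DR classification.

Despite these advantages, the Transformer architecture has limitations in medical image analysis. First, Transformers lack the spatially local inductive biases and prior knowledge inherent to CNNs and need larger datasets to learn effectively, which is a major obstacle in medical imaging, where samples are costly and slow to collect. Second, unlike natural images, in which the subject usually occupies a large, contiguous pixel region, lesions in medical images are mostly tiny and may be scattered, so relying on a single classification head, as ViT does, may be insufficient (Yu et al., 2021) and may overlook subtle features and related information. Third, the single-scale features used by both CNNs and Transformers can lose information that is visible at other scales of observation, which is also inconsistent with how humans perceive and recognize objects.

To address these problems, we propose a novel Multi-scale Spatial-aware Transformer Network (MSTNet) for DR classification. The model encodes image patches at different scales and integrates spatial local information to build a dual-pathway architecture:

1. Spatial-aware module (SAM): SAM is introduced at the input stage and consists of a spatial feature extractor and a feature injector. The extractor uses the standard convolutional structure of a ResNet backbone (He et al., 2016) to extract features at three different resolutions, building a spatial feature bank. The injector uses cross-attention to embed spatial local information from the feature bank into the model's dual-pathway feature sequences, mitigating the Transformer's inherent lack of spatial local information.
2. Multi-scale encoder (MSE): To strengthen feature interaction between the two pathways, each pathway uses a predefined classification token (CLS token) as the exchange medium inside the MSE, and CLS interaction attention (CIA) exchanges and fuses global information across pathways.
3. Multiple instance learning (MIL): To capture subtle but critical lesion regions more precisely, the model does not rely on the final CLS token alone. It also applies MIL, using a gated attention mechanism to aggregate the feature instances within and across layers into a comprehensive bag representation, which is finally cross-concatenated with the CLS tokens of the two pathways to form the final feature vector used for classification.

Experiments and visual analysis on four public datasets show that MSTNet outperforms existing state-of-the-art methods. The main contributions are as follows:

- We propose a novel Multi-scale Spatial-aware Transformer Network (MSTNet) whose dual-pathway architecture and interaction attention mechanism significantly strengthen feature representation and analysis for fundus images.
- We design a spatial-aware module that mines structural prior knowledge within the image, effectively mitigating the Transformer's lack of spatial local information, reducing the model's dependence on large-scale training datasets, and improving generalization.
- We introduce multiple instance learning (MIL) to increase the model's sensitivity to key lesion regions in retinal images, making the assessment of critical image details more accurate and reliable.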
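The gated-attention MIL aggregation mentioned in the introduction can be illustrated with a small sketch. It follows the commonly used gated attention pooling form, in which each instance receives a score w · (tanh(V h) * sigmoid(U h)) and the scores are normalized with a softmax. The digest does not give MSTNet's exact formulation, so the function name `gated_attention_pool`, the parameter names `V`, `U`, `w`, and all dimensions below are assumptions chosen for illustration, and the weights are random rather than learned.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a (rows x cols) nested-list matrix by a length-cols vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_attention_pool(instances, V, U, w):
    """Aggregate instance vectors into one bag vector (illustrative only).

    score_k = w . (tanh(V h_k) * sigmoid(U h_k)); weights = softmax(scores).
    Returns the weighted sum of the instances and the attention weights.
    """
    scores = []
    for h in instances:
        tanh_branch = [math.tanh(v) for v in matvec(V, h)]
        gate_branch = [sigmoid(u) for u in matvec(U, h)]
        scores.append(sum(wi * t * g for wi, t, g in zip(w, tanh_branch, gate_branch)))
    # Numerically stable softmax over the instance scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(instances[0])
    bag = [sum(a * h[j] for a, h in zip(weights, instances)) for j in range(dim)]
    return bag, weights


def random_matrix(rows, cols, rng):
    """Random stand-in for learned parameters or token features."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Hypothetical sizes: 6 patch-token instances of dimension 8, hidden size 4.
rng = random.Random(0)
feature_dim, hidden_dim, num_instances = 8, 4, 6
V = random_matrix(hidden_dim, feature_dim, rng)
U = random_matrix(hidden_dim, feature_dim, rng)
w = [rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)]
instances = random_matrix(num_instances, feature_dim, rng)

bag, weights = gated_attention_pool(instances, V, U, w)
print("attention weights:", [round(a, 3) for a in weights])
```

In a trained model the attention weights would indicate which patch tokens contribute most to the bag representation, which is the property the article relies on for highlighting small lesion regions.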

Abstract

摘要

Diabetic retinopathy (DR), the leading cause of vision loss among diabetic adults worldwide, underscores the importance of early detection and timely treatment using fundus images to prevent vision loss. However, existing deep learning methods struggle to capture the correlation and contextual information of subtle lesion features with the current scale of dataset. To this end, we propose a novel Multi-scale Spatial-aware Transformer Network (MSTNet) for DR classification. MSTNet encodes information from image patches at varying scales as input features, constructing a dual-pathway backbone network comprised of two Transformer encoders of different sizes to extract both local details and global context from images. To fully leverage structural prior knowledge, we introduce a Spatial-aware Module (SAM) to capture spatial local information within the images. Furthermore, considering the differences between medical and natural images, specifically that regions of interest in medical images often lack distinct subjectivity and continuity, we employ a Multiple Instance Learning (MIL) strategy to aggregate features from diverse regions, thereby enhancing correlation to subtle lesion areas. Ultimately, a cross-fusion classifier integrates dual-pathway features to produce the final classification result. We evaluate MSTNet on four public DR datasets, including APTOS2019, RFMiD2020, Messidor, and IDRiD. Extensive experiments demonstrate that MSTNet exhibits superior diagnostic and grading accuracy, achieving improvements of up to 2.0% in terms of ACC and 1.2% in terms of F1 score, highlighting its effectiveness in accurately assessing fundus images.

糖尿病視網膜病變（DR）是全球糖尿病成年人視力喪失的主要原因，這凸顯了利用眼底圖像進行早期檢測和及時治療以預防視力喪失的重要性。然而，現有深度學習方法在當前數據集規模下難以捕捉細微病變特征的相關性和上下文信息。為此，我們提出了一種用于DR分類的新型多尺度空間感知Transformer網絡（MSTNet）。MSTNet將不同尺度的圖像補丁信息編碼為輸入特征，構建了由兩個不同尺寸的Transformer編碼器組成的雙路徑主干網絡，以提取圖像中的局部細節和全局上下文。為了充分利用結構先驗知識，我們引入了空間感知模塊（SAM）來捕捉圖像內的空間局部信息。此外，考慮到醫學圖像與自然圖像的差異，特別是醫學圖像中的感興趣區域往往缺乏明顯的主體性和連續性，我們采用多實例學習（MIL）策略來聚合不同區域的特征，從而增強與細微病變區域的相關性。最終，通過交叉融合分類器整合雙路徑特征以生成最終分類結果。我們在四個公共DR數據集（包括APTOS2019、RFMiD2020、Messidor和IDRiD）上對MSTNet進行了評估。大量實驗表明，MSTNet表現出卓越的診斷和分級準確性，ACC最高提升2.0%，F1分數最高提升1.2%，突出了其在準確評估眼底圖像方面的有效性。

Method

方法

This section details the methodologies employed in our study. It is structured around three pivotal components: the spatial-aware module, the multi-scale encoder, and the cross-fusion classifier. Each of these components plays a critical role in our model’s ability to effectively process and analyze complex retinal images.

本節詳細介紹了本研究中使用的方法。它圍繞三個關鍵組件展開：空間感知模塊、多尺度編碼器和交叉融合分類器。這些組件中的每一個都在我們的模型有效處理和分析復雜視網膜圖像的能力中發揮著關鍵作用。
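As a rough illustration of how a spatial-aware module could inject feature-bank information into a pathway's token sequence through cross-attention, the sketch below uses the pathway tokens as queries and the feature-bank entries as keys and values, then adds the attended result back to the tokens. This is a simplified single-head version with no learned projections. The function name `inject_spatial_features` and the scaling factor `gamma` are hypothetical, and the actual MSTNet injector may differ.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def inject_spatial_features(tokens, bank, gamma=0.5):
    """Single-head cross-attention with a residual connection (illustrative).

    tokens: pathway token vectors, used as queries.
    bank:   spatial feature-bank vectors, used as keys and values.
    gamma:  hypothetical scale on the injected signal (an assumption).
    Returns tokens + gamma * attention(tokens, bank, bank).
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    updated = []
    for query in tokens:
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in bank]
        attn = softmax(scores)
        context = [sum(a * value[j] for a, value in zip(attn, bank)) for j in range(dim)]
        updated.append([t + gamma * c for t, c in zip(query, context)])
    return updated


# Toy example: two 2-dimensional tokens attend over a three-entry feature bank.
tokens = [[1.0, 0.0], [0.0, 1.0]]
bank = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
injected = inject_spatial_features(tokens, bank)
print(injected)
```

Because the injection is a residual addition, the token sequence keeps its length and dimension, so the downstream Transformer layers need no structural change, which is consistent with the article's remark that SAM improves performance without altering the original structure.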

Conclusion

結論

This study explores the classification of DR from fundus images by introducing MSTNet, which is designed for precise DR grading and diagnosis. To enhance the model’s performance on small-scale datasets without altering its original structure, we infused spatial local information into both scale pathways of the model through SAM, compensating for the inherent lack of prior knowledge due to architectural constraints. Additionally, to more accurately capture subtle and critical lesion areas, we employed a gated attention-based MIL strategy, which aggregates different lesion regions and enriches the final feature representation to make it more comprehensive. Extensive experiments across multiple public datasets, including APTOS2019, RFMiD2020, Messidor, and IDRiD, demonstrated our model’s effectiveness in DR detection and grading tasks, achieving improvements of up to 2.0% in terms of ACC and 1.2% in terms of F1 score, highlighting its exceptional capability.

本研究通過引入MSTNet模型探索了基于眼底圖像的糖尿病視網膜病變（DR）分類方法，該模型旨在實現精準的DR分級與診斷。為了在不改變模型原有結構的前提下提升其在小規模數據集上的性能，我們通過空間感知模塊（SAM）將空間局部信息注入模型的雙尺度路徑，彌補了因架構限制導致的先驗知識固有缺失。此外，為了更準確地捕捉細微但關鍵的病變區域，我們采用了基于門控注意力的多實例學習（MIL）策略，該策略通過聚合不同病變區域來豐富最終的特征表示，使其更具全面性。在APTOS2019、RFMiD2020、Messidor和IDRiD等多個公開數據集上的大量實驗表明，我們的模型在DR檢測和分級任務中表現出色，準確率（ACC）提升幅度最高達2.0%，F1分數提升1.2%，凸顯了其卓越的性能。

Results

結果

For our investigation, we used four publicly accessible DR datasets as our foundational data: APTOS2019, RFMiD2020, Messidor, and IDRiD. To ensure the robustness of our study and maintain consistency in data processing, we divided each dataset into training, validation, and testing subsets in an 8:1:1 ratio.

APTOS2019 (APTOS 2019 blindness detection, 2019): The APTOS 2019 blindness detection challenge was initiated by the Asia Pacific Tele-Ophthalmology Society (APTOS) in 2019 with the goal of using fundus photography to identify DR and assess its severity. The collection includes 3662 fundus photographs, categorized into stages of DR ranging from no DR (grade 0) to proliferative DR (grade 4).

RFMiD2020 (Retinal Image Analysis for Multi-Disease Detection Challenge, 2020): The Retinal Fundus Multi-disease Image Dataset (RFMiD) was created specifically for the 2020 challenge on automated detection of retinal diseases. It includes 1900 fundus images, each annotated with one or more of 45 distinct retinal conditions. Our study mainly uses the DR-related annotations in this dataset.

Messidor (Messidor Dataset, 2004): The Messidor dataset, contributed by several French hospitals, contains over 1200 fundus photographs captured with a range of digital fundus cameras. Each photograph is annotated to indicate the presence and severity level of DR, organized into five distinct stages.

IDRiD (Indian Diabetic Retinopathy Image Dataset (IDRiD), 2018): The Indian Diabetic Retinopathy Image Dataset (IDRiD) is the first dataset released specifically for the detailed analysis and classification of diabetic retinopathy and associated conditions. It comprises 516 high-quality fundus photographs that offer detailed visual data. Because this dataset is small, our study uses a binary classification approach on it, distinguishing between the absence and presence of DR.

在本研究中，我們采用了四個與糖尿病視網膜病變（DR）相關的公開數據集作為基礎數據：APTOS 2019、RFMiD2020、Messidor和IDRiD。為確保研究的穩健性和數據處理的一致性，我們將每個數據集按8:1:1的比例劃分為訓練集、驗證集和測試集。具體數據集介紹如下：

1. APTOS2019（APTOS 2019致盲性檢測數據集，2019）：APTOS 2019致盲性檢測挑戰賽由亞太遠程眼科學會（APTOS）于2019年發起，旨在利用眼底攝影技術識別DR并評估其嚴重程度。該數據集包含3662張眼底照片，覆蓋從無DR（0級）到增殖性DR（4級）的不同階段。
2. RFMiD2020（視網膜圖像多疾病檢測挑戰賽數據集，2020）：視網膜眼底多疾病圖像數據集（RFMiD）專為2020年視網膜疾病自動檢測挑戰賽設計，包含1900張眼底圖像，每張圖像標注了45種不同視網膜疾病中的一種或多種。本研究主要利用該數據集中與DR相關的標注信息。
3. Messidor（Messidor數據集，2004）：Messidor數據集由多家法國醫院提供，包含1200余張通過不同數碼眼底相機拍攝的眼底照片。每張照片均經過細致標注，標明DR的存在及嚴重程度，分為五個不同階段。
4. IDRiD（印度糖尿病視網膜病變圖像數據集，2018）：印度糖尿病視網膜病變圖像數據集（IDRiD）是首個專門用于DR及其相關病癥詳細分析和分類的數據集，包含516張高質量眼底照片，提供了詳細的視覺數據。由于該數據集規模較小，本研究采用二分類方法，區分無DR和有DR狀態。
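The 8:1:1 split and the binary relabeling described above can be sketched in a few lines. The article does not say whether the split was random or stratified, which seed was used, or how rounding was handled, so the shuffled split, the seed, and the choice to give leftover samples to the test subset are all assumptions, as are the helper names `split_8_1_1` and `to_binary_label`. The binary mapping assumes that grade 0 means no DR, as stated for APTOS2019.

```python
import random


def split_8_1_1(items, seed=0):
    """Shuffle items and split them into train/val/test at roughly 8:1:1.

    Integer arithmetic avoids floating-point rounding surprises; any
    remainder goes to the test subset (an assumption, not from the paper).
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 8 // 10
    n_val = n // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


def to_binary_label(grade):
    """Map a DR grade to 0 (no DR) or 1 (any DR), assuming grade 0 is no DR."""
    return 0 if grade == 0 else 1


# Example with image indices standing in for the 3662 APTOS2019 photographs.
train, val, test = split_8_1_1(range(3662))
print(len(train), len(val), len(test))
```

A stratified split that preserves the grade distribution in each subset would be a natural alternative for imbalanced DR grades, but the article does not indicate which approach was used.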

Figure

圖

Fig. 1. Analysis of DR severity progression from Normal to Proliferative and lesion correlation on the APTOS2019 dataset.

圖1. APTOS2019數據集上從正常到增殖性糖尿病視網膜病變（DR）嚴重程度進展及病變相關性分析。

Fig. 2. The overall architecture of the proposed method. It is mainly composed of three parts: the spatial-aware module, the multi-scale encoder, and the cross-fusion classifier. In the spatial-aware module, an extractor extracts spatial features at different scales to form a feature bank, and an injector then injects the feature-bank information into the dual-pathway features. In the multi-scale encoder, the feature sequence of each pathway passes through the corresponding Transformer layer, and MIL aggregation is applied to that layer's output sequence to produce a per-layer aggregated feature. After the encoders are stacked, the per-layer aggregated features are in turn aggregated into a single representation for each pathway. Subsequently, the output feature of each pathway is cross-concatenated with the aggregated representation from the alternate pathway. Final classification is performed by harmonizing the outputs of the two classification heads.

圖2. 所提方法的整體架構。其主要由三部分組成：空間感知模塊、多尺度編碼器和交叉融合分類器。在空間感知模塊中，通過特征提取器提取不同尺度的空間特征以形成特征庫，然后通過注入器將特征庫中的信息注入雙路徑特征。在多尺度編碼器中，不同路徑的特征序列通過相應的Transformer層，隨后對該層輸出序列應用多實例學習（MIL）聚合，得到逐層的聚合特征。堆疊多個編碼器后，各層的聚合特征再被聚合為每條路徑的整體表示。隨后，將每條路徑的輸出特征與來自另一路徑的聚合表示進行交叉級聯。最終通過兩個分類頭的協調輸出執行分類。

Fig. 3. The confusion matrix of the classification results from APTOS2019 and RFMiD2020 datasets.

圖3. APTOS2019和RFMiD2020數據集分類結果的混淆矩陣

Fig. 4. Classification visualization on the APTOS2019 (first row) and RFMiD2020 (second row) datasets. From left to right: the input image, the output of ResNet34, the output of MIL-ViT, and the output of the proposed model.

圖4. APTOS2019數據集（第一行）和RFMiD2020數據集（第二行）的分類可視化結果。從左到右依次為直接輸入圖像、ResNet34輸出、MIL-ViT輸出和所提模型輸出。

Fig. 5. Comparative classification results of different models for each category on the APTOS2019 and RFMiD2020 datasets. The DR0-DR4 labels on the left side of the figure represent predictions of the correct category.

圖5. 不同模型在APTOS2019和RFMiD2020數據集上各類別分類結果對比。圖左側DR0-DR4標簽表示正確類別的預測結果。