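The ULS23 Challenge: A Baseline Model and Benchmark Dataset for 3D Universal Lesion Segmentation in Computed Tomography | Literature Digest: Latest Deep Learning Medical AI Papers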

Title

The ULS23 challenge: A baseline model and benchmark dataset for 3D universal lesion segmentation in computed tomography

Introduction

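The number of CT examinations performed each year continues to grow (Masjedi et al., 2020), increasing the workload of radiologists (McDonald et al., 2015). The global cancer burden is projected to rise by 47% in 2040 compared with 2020 (Sung et al., 2021), and oncological radiology is expected to add significantly to these growing demands. Cancer patients typically undergo multiple imaging examinations during treatment and subsequent disease monitoring (Rehani et al., 2020). Interest in early cancer detection through imaging is also increasing (Crosby et al., 2022; Adams et al., 2023). Computer assistance during reading may help radiologists cope with this growing workload.

With minimal guidance from a radiologist, automatic segmentation models can reduce the time burden and inter-observer variability associated with manually annotating lesions in oncological scans. A lesion can be selected for segmentation with a single click (Tang et al., 2020), a bounding-box annotation (Mazurowski et al., 2023; Ma et al., 2024), or a detection model (Yan et al., 2019). Longitudinal measurements derived from the segmented lesions can be analyzed according to clinical guidelines such as the Response Evaluation Criteria in Solid Tumors (RECIST) (Eisenhauer et al., 2009; Schwartz et al., 2016). Automatic 3D segmentation of lesions also enables more complex analyses, such as distinguishing lesion subtypes using radiomics features (Gillies et al., 2016). Furthermore, registration algorithms can propagate segmented lesions to follow-up scans, which saves considerable time in subsequent examinations (Hering et al., 2024).

Over the past decade, AI-driven automatic tumor segmentation models have made significant progress on specific high-interest lesion types such as those in the liver (Bilic et al., 2023), kidneys (Heller et al., 2023), or lungs (Pedrosa et al., 2021). However, segmentation models for a wider range of lesion types, particularly in the frequently examined thorax-abdomen region, remain comparatively understudied. Developing such universal lesion segmentation (ULS) models requires diverse training datasets. Although ULS has been studied before (see Section 2.4), most of this work relies heavily on the DeepLesion dataset. That dataset contains only lesion diameter annotations on a single axial slice, which makes it poorly suited to developing 3D segmentation models. In addition, the ground-truth segmentation masks that previous studies (Cai et al., 2018; Tang et al., 2020) used for evaluation on this dataset have not been made public, which limits reproducibility. ULS models are also rarely released publicly, which hinders their integration into researchers' annotation workflows and their further clinical evaluation.

### Contributions

In light of this, we launched the ULS23 challenge, with the following contributions:

- Advancing model performance by collecting a large and diverse training dataset. We introduce two new datasets for pancreatic and bone lesions, which are traditionally difficult to segment, and consolidate 10 public datasets that include a lesion segmentation component into one easily accessible data repository.
- Improving the reproducibility of ULS research by establishing a reliable benchmark on a carefully curated test set of clinically relevant lesions from two Dutch medical centers.
- Giving the research community access to a state-of-the-art ULS model by developing and publicly releasing our baseline semi-supervised ULS model.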

Abstract

Size measurements of tumor manifestations on follow-up CT examinations are crucial for evaluating treatment outcomes in cancer patients. Efficient lesion segmentation can speed up these radiological workflows. While numerous benchmarks and challenges address lesion segmentation in specific organs like the liver, kidneys, and lungs, the larger variety of lesion types encountered in clinical practice demands a more universal approach. To address this gap, we introduced the ULS23 benchmark for 3D universal lesion segmentation in chest-abdomen-pelvis CT examinations. The ULS23 training dataset contains 38,693 lesions across this region, including challenging pancreatic, colon and bone lesions. For evaluation purposes, we curated a dataset comprising 775 lesions from 284 patients. Each of these lesions was identified as a target lesion in a clinical context, ensuring diversity and clinical relevance within this dataset. The ULS23 benchmark is publicly accessible at https://uls23.grand-challenge.org, enabling researchers worldwide to assess the performance of their segmentation methods. Furthermore, we have developed and publicly released our baseline semi-supervised 3D lesion segmentation model. This model achieved an average Dice coefficient of 0.703 ± 0.240 on the challenge test set. We invite ongoing submissions to advance the development of future ULS models.

Method

In conjunction with the ULS23 challenge, we developed a baseline model using the challenge dataset and the LNDb data. To assist participants in preparing their algorithms for the challenge infrastructure, we released the model weights, training code and algorithm container. Additionally, the algorithm can be accessed on the Grand Challenge platform, where users can upload their own data to be segmented.

[...] could erroneously yield a low measurement error. Finally, a subset of lesions is included multiple times during evaluation of the validation and test set, using randomly sampled lesion foreground voxels as the center locations. This results in slight variations in the scan context for each cropped volume of interest (VOI). We check whether the model outputs similar predictions for these different click locations by comparing the Sørensen–Dice coefficient of the re-aligned segmentation masks. To encourage model robustness to variations in click-point location, a 10% weight is assigned to this segmentation consistency score (SCS). While it is secondary to segmentation performance, in cases where two models perform similarly, the more robust model that maintains performance across different input variations should be preferred.
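The consistency check described above can be illustrated with a minimal sketch. It assumes each prediction is stored as a set of crop-local voxel indices together with the crop's origin in scan coordinates, and that the consistency score is the mean pairwise Dice over the re-aligned predictions. The text does not state how the pairwise comparisons are aggregated, so the averaging, the helper names (`dice`, `realign`, `consistency_score`) and the data layout are all assumptions made for illustration, not the challenge's evaluation code.

```python
from itertools import combinations

Voxel = tuple[int, int, int]  # (z, y, x) index


def dice(a: set[Voxel], b: set[Voxel]) -> float:
    """Sørensen-Dice coefficient of two voxel sets: 2|A∩B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty masks agree
    return 2 * len(a & b) / (len(a) + len(b))


def realign(mask: set[Voxel], crop_origin: Voxel) -> set[Voxel]:
    """Shift crop-local voxel indices back into scan coordinates."""
    oz, oy, ox = crop_origin
    return {(z + oz, y + oy, x + ox) for z, y, x in mask}


def consistency_score(predictions: list[tuple[set[Voxel], Voxel]]) -> float:
    """Mean pairwise Dice over re-aligned predictions (assumed aggregation)."""
    if len(predictions) < 2:
        raise ValueError("need at least two predictions of the same lesion")
    aligned = [realign(mask, origin) for mask, origin in predictions]
    pairs = list(combinations(aligned, 2))
    return sum(dice(a, b) for a, b in pairs) / len(pairs)


# Two crops of the same lesion, centered on click points one voxel apart in x.
pred_a = ({(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)}, (10, 20, 30))
pred_b = ({(0, 0, 1), (0, 0, 2), (0, 1, 1), (0, 1, 2)}, (10, 20, 29))
print(consistency_score([pred_a, pred_b]))
```

In this toy case both crops describe the same four voxels once shifted back into scan coordinates, so a model that is insensitive to the click location scores perfectly.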

Conclusion

This paper presents the ULS23 challenge, establishing the first public benchmark for the evaluation of 3D universal lesion segmentation models on computed tomography scans. We introduce novel training data for bone and pancreas lesions, for which only limited public data was previously available. The challenge training dataset features a unique combination of fully and partially annotated data. To demonstrate the potential of this combined dataset, we developed a strategy for predicting 3D pseudo-masks from the partially annotated 2D data, allowing for their inclusion in 3D model development. Using this approach, we iteratively trained a semi-supervised ULS model that leverages the entire training dataset. For evaluation purposes, we assembled a high-quality and diverse test set of lesions that were selected for RECIST measurement in clinical practice. By focusing on clinically relevant target lesions, our benchmark is tightly integrated with the practical requirements of radiologists. Our scaled-up, semi-supervised model achieves a Dice score of 0.703 ± 0.240 on this test set, compared to a Dice score of 0.651 ± 0.253 for a standard, automatically configured nnUnet. The model weights, data processing code and evaluation scripts are publicly released to ensure transparency and reproducibility. Future work will include a meta-analysis of the methods developed by challenge participants and an assessment of how these models could reduce reading times for oncological scans. Subsequent iterations of the challenge can explore various aspects of ULS model development, such as prioritizing segmentation performance over inference speed, expanding evaluation on rare lesion types, or including different imaging modalities.

Results

Table 2 shows the mean Dice and HD95 scores with standard deviation across the various model configurations for both the held-out training data and the challenge test set. We report the results for each lesion type from the fully annotated data. Additionally, for the test set, we include results for the lesion types that were not seen during fully supervised training (PSUP).

Fig. 4 shows the long- and short-axis axial measurement error distributions for the different lesion types in the held-out training data as predicted by the semi-supervised model. On the test set it achieves an overall ChallengeScore of 0.729, consisting of a mean Dice score of 0.703 ± 0.240, long-axis SMAPE of 11.2% ± 15.8%, short-axis SMAPE of 12.0% ± 15.9%, and consistency Dice score of 0.787 ± 0.252. Fig. 5 shows the measurement error distribution for the full test set, and for those lesion types which were or were not contained in the fully annotated training data.
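As a rough check on how the four reported components could combine into the ChallengeScore, the sketch below computes a weighted sum. Only the 10% weight on the consistency score is stated in this text. The 80% weight on Dice and 5% on each axis error, the use of `1 - SMAPE` so that higher is better, and the bounded form of the per-lesion error `|p - r| / (|p| + |r|)` are assumptions made for illustration. The function names are hypothetical.

```python
def sape(pred_mm: float, ref_mm: float) -> float:
    """Symmetric absolute percentage error of one axis measurement.

    Assumed bounded form in [0, 1]: |pred - ref| / (|pred| + |ref|).
    """
    denom = abs(pred_mm) + abs(ref_mm)
    return 0.0 if denom == 0 else abs(pred_mm - ref_mm) / denom


def challenge_score(
    dice: float,
    smape_long: float,
    smape_short: float,
    consistency: float,
    weights: tuple[float, float, float, float] = (0.80, 0.05, 0.05, 0.10),
) -> float:
    """Weighted sum of the four components (weights other than 0.10 are assumed)."""
    w_dice, w_long, w_short, w_scs = weights
    return (
        w_dice * dice
        + w_long * (1.0 - smape_long)   # errors are inverted so higher is better
        + w_short * (1.0 - smape_short)
        + w_scs * consistency
    )


# Component means reported for the baseline model on the test set.
score = challenge_score(dice=0.703, smape_long=0.112, smape_short=0.120, consistency=0.787)
print(f"{score:.4f}")
```

Under these assumed weights the reported means combine to roughly 0.73, which is close to the reported ChallengeScore of 0.729. A small gap is expected if the official score is computed per lesion rather than from the rounded component means.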

Figure

Fig. 1. Histograms depicting the long- and short-axis measurements in millimeters for various lesion types in the fully annotated training data reveal notable trends. Kidney and colon lesions tend to be larger on average. Lymph nodes, pancreas, and colon lesions exhibit a greater disparity between their long- and short-axis sizes, indicating that these lesions are more often non-spherical.

Fig. 2. Examples of GrabCut pseudo-masks. From left to right, a kidney lesion, mediastinal lymph node, subcutaneous mass, and lung lesion. Note how GrabCut tends to oversegment (orange mask ■) into healthy tissues compared to the reference measurements (purple lines ■). Lung lesions are visualized using Window Level: −500 HU, Window Width: 1400 HU. Lesions outside the lungs with WL: 350 WW: 40.
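For readers unfamiliar with CT display windows, the lung setting quoted in the Fig. 2 caption (level −500 HU, width 1400 HU) corresponds to a displayed range of −1200 HU to 200 HU. The sketch below shows the usual simplified mapping from a window level and width to a clipped, normalized intensity. The function names are hypothetical, and this is the common linear approximation, not the code used to render the figures.

```python
def window_bounds(level_hu: float, width_hu: float) -> tuple[float, float]:
    """Lower and upper HU limits of a display window."""
    half = width_hu / 2.0
    return level_hu - half, level_hu + half


def apply_window(hu: float, level_hu: float, width_hu: float) -> float:
    """Clip a HU value to the window and rescale it to [0, 1] for display."""
    lo, hi = window_bounds(level_hu, width_hu)
    clipped = min(max(hu, lo), hi)
    return (clipped - lo) / (hi - lo)


# Lung window from the caption: level -500 HU, width 1400 HU.
print(window_bounds(-500, 1400))
# Air (about -1000 HU) stays dark but visible; dense bone saturates to white.
print(apply_window(-1000, -500, 1400), apply_window(1000, -500, 1400))
```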

Fig. 3. Training pipeline for the semi-supervised baseline model. (a) In the first training iteration a nnUnet is pretrained using the 2D GrabCut masks generated from the partially annotated data, and then fine-tuned on the fully annotated data. (b) In the second training iteration a different nnUnet is pretrained using the predicted 3D pseudo-masks for the partially annotated data and then fine-tuned using the fully annotated data.

Fig. 4. Boxplots of the long- and short-axis measurement errors for the baseline model on the different lesion types in the held-out training data. SAPE = Symmetric Average Prediction Error.

Fig. 5. Boxplots of the long- and short-axis measurement errors for the baseline model on the test set. The fully supervised types are lung, liver, kidney, colon, pancreas, bone lesions and lymph nodes. Partially supervised lesion types are those included in the partially annotated data, e.g. adrenal, ovary, subcutaneous. SAPE = Symmetric Absolute Percentage Error.

Fig. 6. Ground truth (orange outline ■) and baseline model predictions (purple outline ■) on axial slices from the test set. The 3D Dice score for each lesion is included in the top-left corner. The lesions visualized are: (a) spleen lesion (b) lesion in the abdominal wall (c) adrenal lesion (d) abdominal lymph node (e) liver lesion (f) lung lesion (g) mediastinal lymph node (h) kidney lesion (i) pericardial lesion. Lung lesions are visualized using Window Level: −500 HU, Window Width: 1400 HU. Lesions outside the lungs with WL: 350 WW: 40.

Fig. A.7. Age, sex and scanner manufacturer characteristics of the novel training data and the test set. For 3 series of the Radboudumc-Bone dataset and 13 series of the Radboudumc-Pancreas dataset the metadata could not be recovered.

Fig. A.8. Study date and scan spacing distributions for the series included in the two novel training datasets.

Fig. A.9. Plots of the Dice score vs the long- and short-axis measurement error for the baseline model on the different lesion types in the held-out training data. SAPE = Symmetric Absolute Percentage Error.

Fig. A.10. Plots of the Dice score vs the long- and short-axis measurement error for the baseline model on the test data, split on lesion types seen in the fully annotated data versus those in the partially annotated data. SAPE = Symmetric Absolute Percentage Error.

Fig. A.11. Plots of the sorted pairwise difference in Dice score between the up-scaled residual encoder nnUnet trained with semi-supervision and the self-configured nnUnet. The left graphs contain the lesion types in the test set covered by the fully annotated training data, the right graphs contain the scores for the lesion types only present in the partially annotated data. A negative score indicates the segmentation performance of the regular nnUnet was better for that case, a positive score indicates the semi-supervised nnUnet scored higher for that case. The black vertical line indicates 50% of the lesions, the orange line denotes where the score changed from negative to positive. TTA = Test Time Augmentations.

Fig. A.12. Plots of the sorted pairwise difference in Dice score between the up-scaled residual encoder nnUnet trained with and without semi-supervision. The left graphs contain the lesion types in the test set covered by the fully annotated training data, the right graphs contain the scores for the lesion types only present in the partially annotated data. A negative score indicates the segmentation performance of the fully supervised nnUnet was better for that case, a positive score indicates the semi-supervised nnUnet scored higher for that case. The black vertical line indicates 50% of the lesions, the orange line denotes where the score changed from negative to positive. TTA = Test Time Augmentations.

Table

Table 1. Overview of the data used in the ULS23 challenge. The LNDb data licence does not allow repackaging their data, so it is not released as part of the training archive (de Grauw et al., 2023a). Instead, we release the code for participants to prepare the lesion VOIs for this dataset themselves (de Grauw et al., 2023b).

Table 2. Segmentation performance comparison on the 10% held-out training data per lesion type and the test set. Best results per category are highlighted in bold. HD95 represents the 95th percentile of the Hausdorff distance, measured in millimeters. For the individual lesion types in the test set, * indicates there were ≤ 20 lesions of this type in the test set. The exact distribution of lesion types is not provided to participants. FSUP = fully supervised lesion types (i.e. Kidney - Colon). PSUP = lesion types present in the partially supervised training data.

Table A.3. Segmentation performance comparison with test time augmentation on the 10% held-out training data per lesion type and the test set. Best results per category are highlighted in bold. HD95 represents the 95th percentile of the Hausdorff distance, measured in millimeters. For the individual lesion types in the test set, * indicates there were ≤ 20 lesions of this type in the test set. The exact distribution of lesion types is not provided to participants. FSUP = fully supervised lesion types (i.e. Kidney - Colon). PSUP = lesion types present in the partially supervised training data.

Table A.4. Hyperparameters of the baseline model (nnUnet-ResEnc+SS) and the intensity properties used for data normalization, calculated from the pretraining data.