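ACOUSLIC-AI challenge report: Fetal abdominal circumference measurement on blind-sweep ultrasound data from low-income countries | Literature Express: Medical Imaging Algorithm Paper Sharing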

Title

ACOUSLIC-AI challenge report: Fetal abdominal circumference measurement on blind-sweep ultrasound data from low-income countries

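Introduction

Fetal growth restriction (FGR) affects up to 10% of pregnancies and is a critical factor contributing to perinatal mortality and morbidity (Bernstein et al., 2000; Gardosi et al., 1992; Martins et al., 2020; Unterscheider et al., 2014). FGR is strongly associated with stillbirth and can also lead to preterm labor, posing risks to the mother (Lawn et al., 2011, 2023). The condition typically arises when maternal, fetal, and placental factors prevent the fetus from reaching its genetic growth potential (Martins et al., 2020). Prenatal ultrasound is the standard method for assessing fetal size through biometric measurements, including the abdominal circumference (AC) and other parameters used to calculate the estimated fetal weight (EFW; Salomon et al., 2019), such as biparietal diameter, head circumference, and femur length. Measurements smaller than expected may indicate FGR, a condition associated with approximately 60% of fetal deaths (Lawn et al., 2011). Diagnosing FGR relies on repeated measurements of the fetal AC, the EFW, or both. These measurements must be taken at least twice, with a minimum interval of two weeks between them, for a reliable diagnosis (Morris et al., 2024), unless extreme values are observed, in which case a single AC or EFW measurement may be sufficient to confirm FGR (Gordijn et al., 2016; Lees et al., 2020).

In low-resource settings, however, routine biometric obstetric ultrasound for AC measurement is limited by the high cost of sonography equipment and the scarcity of trained sonographers. It has been proposed that, in these settings, novice operators acquire obstetric data using low-cost ultrasound devices and standardized blind-sweep protocols (Abuhamad et al., 2016; DeStigter et al., 2011; Self et al., 2022; van den Heuvel et al., 2018a). In a blind-sweep acquisition protocol, the operator performs standardized sweep motions without looking at the ultrasound image. The protocol produces a sequence of two-dimensional ultrasound frames captured as the probe moves along a specific trajectory over the gravid abdomen. Unlike conventional clinical ultrasound examinations, in which an experienced sonographer searches for the standard plane to take biometric measurements (Yasrab et al., 2022), blind-sweep data pose a unique set of challenges. Although data acquired this way are of lower quality and may not contain the standard plane, they have proven sufficient for biometric measurements (van den Heuvel et al., 2019).

A growing body of literature focuses on using artificial intelligence (AI) to automate prenatal assessment tasks on free-hand ultrasound sequences acquired with standardized protocols, removing the need for expert ultrasound interpretation. These tasks include fetal biometry (Arroyo et al., 2022; van den Heuvel et al., 2019), gestational age estimation (Gomes et al., 2022; Lee et al., 2023; Pokaprakarn et al., 2022; van den Heuvel et al., 2019; Viswanathan et al., 2024), and pregnancy risk detection (Arroyo et al., 2022; Gleed et al., 2023a,b; Gomes et al., 2022; Maraci et al., 2017; Schilpzand et al., 2022; Self et al., 2020). As shown by Gomes et al. (2022) and Schilpzand et al. (2022), these AI solutions can potentially be embedded in mobile devices, providing a complete, offline, low-cost, and portable solution suited to resource-limited settings.

To the best of our knowledge, no previous study has investigated automatic fetal AC measurement on blind-sweep data. Płotka et al. (2022) introduced an AI model that measures the AC and other fetal biometric parameters from standard planes automatically selected in ultrasound videos, but these images were acquired by experienced sonographers who were instructed to ensure that each imaging sequence contained the standard plane required for accurate measurement. In a subsequent study, the authors extended their method to automatically estimate fetal weight from abdominal ultrasound videos (Płotka et al., 2023). As in their earlier work, this method was developed and tested on abdominal videos in which sonographers deliberately included the AC standard plane. Both studies show that fetal AC can be estimated from ultrasound sweep data and that this measurement is sufficient for monitoring FGR, but their reliance on sonographers to include a predefined standard plane may limit the generalizability of these models in resource-constrained settings.

To fill this gap in the literature and explore the potential to improve early detection and management of FGR in low-income countries, we organized the ACOUSLIC-AI (Abdominal Circumference Operator-agnostic UltraSound measurement in Low-Income Countries) challenge. Unlike previous challenges such as HC18, A-AFMA, and PS-FH-AOP (Lu et al., 2022), which focused on ultrasound imaging data acquired in clinical settings, ACOUSLIC-AI is the first challenge to use blind-sweep data for a fetal biometry task. The challenge aims to develop and benchmark AI models for automatically measuring fetal AC from blind-sweep data acquired by novice operators, with the ultimate goal of improving access to prenatal care in resource-limited regions. Participants were tasked with identifying the frame most suitable for measuring fetal AC and providing a binary segmentation mask of the abdomen on the ultrasound image corresponding to the selected frame. Training data from three Public Health Units (PHUs) in Sierra Leone were annotated by experienced readers and made publicly available. Private validation and test sets, containing data from two PHUs in Tanzania and a European hospital, were annotated by expert readers and provided through the Grand-Challenge platform. The challenge design ensures that algorithms and code are publicly available and reusable. In this article, we present the results of the three top-performing AI models from the ACOUSLIC-AI challenge following the Biomedical Image Analysis ChallengeS (BIAS) methodology (Maier-Hein et al., 2020) and evaluate their performance against clinical standards.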

Abstract

Fetal growth restriction, affecting up to 10% of pregnancies, is a critical factor contributing to perinatal mortality and morbidity. Ultrasound measurements of the fetal abdominal circumference (AC) are a key aspect of monitoring fetal growth. However, the routine practice of biometric obstetric ultrasounds is limited in low-resource settings due to the high cost of sonography equipment and the scarcity of trained sonographers. To address this issue, we organized the ACOUSLIC-AI (Abdominal Circumference Operator-agnostic UltraSound measurement in Low-Income Countries) challenge to investigate the feasibility of automatically estimating fetal AC from blind-sweep ultrasound scans acquired by novice operators using low-cost devices. Training data collected from three Public Health Units (PHUs) in Sierra Leone are made publicly available. Private validation and test sets, containing data from two PHUs in Tanzania and a European hospital, are provided through the Grand-Challenge platform. All sets were annotated by experienced readers. Sixteen international teams participated in this challenge, with six teams submitting to the Final Test Phase. In this article, we present the results of the three top-performing AI models from the ACOUSLIC-AI challenge, which are publicly accessible. We evaluate their performance in fetal abdomen frame selection, segmentation, and abdominal circumference measurement, and compare their performance against clinical standards for fetal AC measurement. Clinical comparisons demonstrated that the limits of agreement (LoA) for A2 in fetal AC measurements are comparable to the interobserver LoA reported in the literature. The algorithms developed as part of the ACOUSLIC-AI challenge provide a benchmark for future algorithms on the selection and segmentation of fetal abdomen frames to further minimize fetal abdominal circumference measurement variability.

Results

This section presents the results of the top three solutions submitted during the Final Test Phase (see Section 4.1): A1, A2, and A3. In accordance with the challenge rules, each team submitted a single entry to this phase. These solutions achieved ranking score metric values (mean ± SD) of 0.71 ± 0.54, 0.65 ± 0.52, and 0.64 ± 0.52, respectively. Sections 6.1, 6.2, and 6.3 respectively evaluate their performance in fetal abdomen frame selection, fetal abdomen segmentation, and fetal abdominal circumference measurement on the test set. Furthermore, Section 6.4 analyses their accuracy in abdominal circumference measurement compared to clinical standards and examines the agreement between blind-sweep AC reference measurements and those obtained in clinical practice.

Figure

Fig. 1. The Obstetric Sweep Protocol (OSP). (a) OSP sweep trajectories. In orange, transverse sweeps (1–3); in violet, sagittal sweeps (4–6). (b) Example visualization (using ITK-SNAP for display) of ultrasound sweep data acquired following the OSP protocol. To the right, the top and bottom views each display a stack of six grey rectangles with smaller black ones interspersed between them. This is a visual representation of the sweep image data, showing the individual frames in a compact manner. The main image panel, located in the top left corner, displays a single frame in the stack.

Fig. 2. Frame selection results for algorithms A1, A2 and A3 on the test set (n = 282). (a) Weighted frame selection score (WFSS) counts and percentages per value. WFSS is a custom metric designed to assess the accuracy of frame selection, with higher scores awarded for correctly identified clinically relevant frames. A value of 0 is given if the selected frame was not annotated, 0.6 if the selected frame was annotated as sub-optimal while optimal frames were available, and 1 if the selected frame was among the best available according to the annotations. (b) Distance to the nearest annotated frame within the same sweep (in frames; provided for comparison only and not used in the challenge metrics). Cases without any annotations in the same sweep were excluded. The total number of valid cases per algorithm is indicated as n. The red triangles indicate mean distances.
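The three-level WFSS rule in the Fig. 2 caption can be sketched as a small scoring function. This is only an illustration of the rule as worded in the caption, not the challenge's evaluation code; the `FrameLabel` enum and the `wfss` function name are hypothetical.

```python
from enum import Enum
from typing import Optional


class FrameLabel(Enum):
    """Hypothetical annotation labels for a frame in a sweep."""
    OPTIMAL = "optimal"
    SUBOPTIMAL = "suboptimal"


def wfss(selected: Optional[FrameLabel], optimal_available: bool) -> float:
    """Score one case using the rule stated in the Fig. 2 caption.

    selected: label of the frame the algorithm chose, or None if that
        frame carries no annotation.
    optimal_available: whether the case has any frame annotated as optimal.
    """
    if selected is None:
        return 0.0  # the selected frame was not annotated
    if selected is FrameLabel.SUBOPTIMAL and optimal_available:
        return 0.6  # sub-optimal choice although an optimal frame existed
    return 1.0  # the selected frame is among the best available
```

Under this reading, a sub-optimal frame still scores 1 when the case has no optimal frame, because it is then among the best available.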

Fig. 3. Results for the segmentation of the fetal abdomen for algorithms A1, A2 and A3 on the test set (n = 282). (a) Soft Dice score: overlap is computed on the nearest annotated frame within the same sweep, within 15 frames of distance, weighted by the distance to the frame. (b) Dice score: computed on the nearest annotated frame within the same sweep, within 15 frames of distance (provided for comparison only and not used in the challenge metrics). Cases with predictions located more than 15 frames away from the nearest annotation or in sweeps without any annotations were excluded. The total number of valid cases per algorithm is indicated as n. The red triangles indicate mean values.
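To make the Fig. 3 setup concrete, the sketch below finds the nearest annotated frame within the 15-frame cut-off and computes a plain Dice overlap between two binary masks. The caption does not give the distance weighting used for the soft Dice score, so it is deliberately left out. The function names are hypothetical, and returning 1.0 for two empty masks is a convention assumed here.

```python
from typing import Optional, Sequence

MAX_FRAME_DISTANCE = 15  # cut-off stated in the Fig. 3 caption


def nearest_annotated(selected: int, annotated: Sequence[int]) -> Optional[int]:
    """Return the closest annotated frame index within the cut-off, else None."""
    in_range = [f for f in annotated if abs(f - selected) <= MAX_FRAME_DISTANCE]
    return min(in_range, key=lambda f: abs(f - selected)) if in_range else None


def dice(pred: Sequence[Sequence[int]], ref: Sequence[Sequence[int]]) -> float:
    """Dice overlap 2|P∩R| / (|P| + |R|) for two binary masks of equal shape."""
    p = [v for row in pred for v in row]
    r = [v for row in ref for v in row]
    intersection = sum(1 for a, b in zip(p, r) if a and b)
    total = sum(1 for a in p if a) + sum(1 for b in r if b)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total
```

A case for which `nearest_annotated` returns `None` corresponds to the exclusions described in the caption.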

Fig. 4. Fetal abdominal circumference measurement results for algorithms A1, A2 and A3 on the test set (n = 282). (a) Normalized absolute error in abdominal circumference (scale-independent, with values from 0–1). (b) Signed abdominal circumference error (for comparison only; not used in the challenge metrics). Cases without a reference measurement in the same sweep were excluded for this metric. The total number of valid cases per algorithm is indicated as n. The red triangles indicate mean values.
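The Fig. 4 caption says only that the normalized absolute error is scale-independent and lies between 0 and 1. One normalization with those properties, assumed here for illustration and not confirmed by the text, divides the absolute error by the larger of the two measurements. The function names are hypothetical.

```python
def signed_error_mm(ac_pred_mm: float, ac_ref_mm: float) -> float:
    """Signed AC error: positive when the algorithm over-estimates."""
    return ac_pred_mm - ac_ref_mm


def normalized_absolute_error(ac_pred_mm: float, ac_ref_mm: float) -> float:
    """ASSUMED normalization: |pred - ref| / max(pred, ref).

    For non-negative inputs the result is unit-free and bounded by [0, 1],
    which matches the properties named in the caption.
    """
    denom = max(ac_pred_mm, ac_ref_mm)
    if denom <= 0:
        return 0.0  # both measurements are zero, so there is no error
    return abs(ac_pred_mm - ac_ref_mm) / denom
```

For example, a prediction of 270 mm against a 300 mm reference has a signed error of -30 mm and a normalized absolute error of 0.1 under this assumption.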

Fig. 5. Comparison of fetal abdominal frame selection and segmentation by algorithms A1, A2 and A3 against OSP reference annotated frames in randomly selected cases from the African cohort test set. The leftmost three columns correspond to algorithm A1, the middle three to A2, and the rightmost three to A3. For each algorithm, the columns display, from left to right: the selected frame, the selected frame with an overlay of the fetal abdomen segmentation, and the most suitable OSP reference frame with its corresponding ellipse. From top to bottom, the rows present comparisons at increasing frame distances (0, 2, and 15) from the nearest annotated frame.

Fig. 6. Abdominal circumference (AC) measurements for each case in the Radboudumc cohort test set data (n = 78). (a) Comparison of raw OSP reference measurements with clinical measurements obtained during standard care on the same day. In the clinical setting, sonographers perform multiple measurements (Rumc measurements) and select one as the accepted value (Rumc ground-truth). Raw ellipse measurements are depicted as OSP reference measurements. (b) Comparison of AC measurements from algorithms A1, A2, and A3 with reference measurements from OSP and the clinical ground truth. The OSP reference values correspond to the sweep average AC derived from reference annotations.
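The Fig. 6 caption refers to ellipse-based reference measurements and to a sweep average AC. As a rough sketch of how a circumference could be derived from an annotated ellipse, the code below uses Ramanujan's perimeter approximation. The source does not say which formula was used, so this choice, the semi-axis inputs in millimetres, and the function names are all assumptions.

```python
import math
from statistics import mean
from typing import Sequence, Tuple


def ellipse_circumference_mm(semi_major_mm: float, semi_minor_mm: float) -> float:
    """Approximate ellipse perimeter with Ramanujan's first formula (assumed)."""
    a, b = semi_major_mm, semi_minor_mm
    return math.pi * (3 * (a + b) - math.sqrt((3 * a + b) * (a + 3 * b)))


def sweep_average_ac_mm(ellipses: Sequence[Tuple[float, float]]) -> float:
    """Average the circumference over all annotated ellipses of one sweep."""
    return mean(ellipse_circumference_mm(a, b) for a, b in ellipses)
```

For a circle (equal semi-axes) the approximation reduces exactly to 2πr, which gives a quick sanity check.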

Fig. 7. Summary of challenge outcomes and clinical performance for algorithms A1, A2 and A3. The left panel presents the mean values for the WFSS, Dice Soft, and NAE AC metrics used in the challenge evaluation for the whole test set (African and Radboudumc cohorts). These metrics respectively assess performance in fetal abdomen frame selection, abdomen segmentation, and abdominal circumference measurement. The right panel shows the average absolute differences in fetal abdominal circumference measurements for each algorithm relative to the clinical ground truth for cases in the Radboudumc cohort, expressed in millimetres (mm) and percentages (%). Percentage differences were calculated as a proportion of the mean measurement to account for fetal size variability. The best-performing algorithms for each metric are highlighted in orange.
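The percentage difference in the Fig. 7 caption can be read in more than one way. The sketch below assumes that "the mean measurement" is the mean of the two values being compared (the algorithm's AC and the clinical ground truth), which is a common convention but is not confirmed by the caption. The function name is hypothetical.

```python
from typing import Tuple


def abs_diff_mm_and_percent(ac_algo_mm: float, ac_clinical_mm: float) -> Tuple[float, float]:
    """Return the absolute AC difference in mm and as a % of the pair mean.

    ASSUMPTION: the percentage is taken relative to the mean of the two
    measurements being compared, so that larger fetuses are not penalized
    for proportionally similar errors.
    """
    diff_mm = abs(ac_algo_mm - ac_clinical_mm)
    pair_mean = (ac_algo_mm + ac_clinical_mm) / 2.0
    return diff_mm, 100.0 * diff_mm / pair_mean
```

With illustrative values of 290 mm and 310 mm, this gives a 20 mm difference, or about 6.7% of the 300 mm mean.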

Fig. 8. Comparison of fetal abdominal frame selection and segmentation by algorithms A1, A2, and A3 against OSP reference annotated frames and clinical standard planes from randomly selected cases in the Radboudumc cohort test set. The leftmost four columns correspond to algorithm A1, the middle four to A2, and the rightmost four to A3. For each algorithm, the columns display, from left to right: the selected frame, the selected frame with an overlay of the fetal abdomen segmentation, the most suitable OSP reference frame with its corresponding ellipse, and the AC standard plane acquired in clinical practice. From top to bottom, the rows present comparisons at increasing frame distances (0, 2, and 5) from the nearest annotated frame. Images corresponding to the standard plane were zoomed in prior to capture, as is common practice during clinical assessments.

Table

Table 1. Description of the ACOUSLIC-AI datasets. SL: Sierra Leone, TZ: Tanzania, Rumc: Radboudumc in the Netherlands. R1, R2, CN and AB denote the four different readers annotating the data.

Table 2. Measurement variability (intra-observer) and measurement agreement (inter-observer and inter-modality) for fetal abdominal circumference measurements in the Radboudumc cohort test set (n = 78). LoA denotes the ±1.96 SD limits of agreement. AAD denotes Average Absolute Difference. Measurement variability assesses the absolute differences between measurements in cases with multiple observations. This analysis includes raw reference measurements from OSP data and multiple same-day measurements obtained during standard clinical care (Rumc). Measurement agreement evaluates the differences between the average raw OSP measurements derived from blind-sweep data and the average Rumc measurements taken in the standard plane.
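The two agreement statistics named in the table captions, the ±1.96 SD limits of agreement (LoA) and the average absolute difference (AAD), can be computed from paired measurements as sketched below. The function names are hypothetical, and the sample standard deviation (n - 1 denominator) is an assumption, since the source does not say which estimator was used.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def limits_of_agreement(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float, float]:
    """Return (lower LoA, mean difference, upper LoA) for paired measurements."""
    diffs = [x - y for x, y in zip(a, b)]
    bias = mean(diffs)
    half_width = 1.96 * stdev(diffs)  # assumed: sample SD of the differences
    return bias - half_width, bias, bias + half_width


def average_absolute_difference(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean of |a_i - b_i| over all paired measurements."""
    return mean(abs(x - y) for x, y in zip(a, b))


# Made-up AC values in mm, for illustration only (not data from the paper).
osp = [100.0, 102.0, 98.0, 104.0]
clinical = [101.0, 100.0, 99.0, 100.0]
lower, bias, upper = limits_of_agreement(osp, clinical)
aad = average_absolute_difference(osp, clinical)
```

Here the mean difference captures systematic offset between the two measurement sources, while the LoA width reflects how much individual differences scatter around that offset.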

Table 3. Comparison of algorithm performance in fetal abdominal circumference measurements on the Radboudumc cohort (n = 78). LoA denotes the ±1.96 SD limits of agreement. AAD denotes Average Absolute Difference. The top section compares the algorithms' measurements to the mean OSP reference measurements on the predicted sweep, while the bottom section evaluates their performance against the Radboudumc clinical ground-truth. N denotes the total number of valid predicted circumferences considered for each analysis: when comparing to the OSP reference, only circumference measurements with a reference measurement in the same sweep were considered; when comparing to the Radboudumc ground-truth, all predicted circumference measurements were considered (missing values correspond to cases where the algorithms found no good frame for measurement). The best performance for each metric is highlighted in bold.