Title
MonoPCC: Photometric-invariant cycle constraint for monocular depth estimation of endoscopic images
Introduction
Monocular endoscopy is a key medical imaging tool for gastrointestinal diagnosis and surgery, but it typically offers a narrow field of view (FOV). 3D scene reconstruction helps enlarge the FOV and, through registration with preoperative computed tomography (CT), enables more advanced applications such as surgical navigation. Depth estimation from monocular endoscopic images is a prerequisite for reconstructing 3D structures, but it is extremely challenging due to the lack of ground-truth (GT) depth labels. The typical solution for monocular depth estimation relies on self-supervised learning, whose core idea is a photometric constraint between real and warped images. Specifically, two convolutional neural networks (CNNs) are built: one called DepthNet and the other PoseNet. The two CNNs estimate the depth map of each image and the camera pose change between every two adjacent images, respectively; based on these, a source frame in the endoscopic video can be projected into 3D space and warped to the target view of another frame. DepthNet and PoseNet are jointly optimized to minimize the photometric loss, which is essentially the pixel-wise difference between the warped and target images.

However, the light source is fixed on the endoscope and moves with the camera, causing significant brightness fluctuations between the source and target frames. The problem can be further aggravated by non-Lambertian reflection under close-up observation, as shown in Fig. 1(a)-(b). As a result, brightness differences dominate the discrepancy between the target image and the warped source image (as shown in Fig. 1(b)-(c)), misleading the photometric constraint in self-supervised learning.

Many efforts have been made to enhance the reliability of the photometric constraint under brightness fluctuations. An intuitive solution is to calibrate the brightness of endoscopic video frames in advance, using either a linear intensity transformation (Ozyoruk et al., 2021) or a trained appearance-flow model (Shao et al., 2022). However, the former only addresses global brightness inconsistency, and the latter increases training difficulty by introducing heavy computation. Moreover, the reliability of the appearance-flow model cannot always be guaranteed due to its weak self-supervision, which may cause erroneous modification of regions unrelated to brightness changes.

This paper aims to resolve the bottleneck of brightness inconsistency without relying on any auxiliary model. Our inspiration comes from a recent method named TC-Depth (Ruhkamp et al., 2021), which introduces cycle warping to address occlusion. TC-Depth warps the target image to the source view and then back to itself to identify occluded pixels, based on the assumption that occluded pixels cannot return precisely to their original positions. We find that such cycle warping can naturally overcome brightness inconsistency and, compared to source-to-target warping alone, yields a more reliable warped image, as shown in Fig. 1(b)-(d). However, directly applying cycle warping usually fails in the photometric constraint, because (1) the two bilinear interpolations of cycle warping over-blur the image, and (2) the depth and pose estimation networks are actively learned, which makes the intermediate warping unstable and hard to converge.

Based on the above analysis, we propose MonoPCC, a monocular depth estimation method based on a photometric-invariant cycle constraint, which adopts the idea of cycle warping but improves it significantly so that the photometric constraint becomes invariant to inconsistent brightness. Specifically, MonoPCC starts from the target image and obtains a cycle-warped target image along a closed-loop path (target-source-target), which inherits the consistent brightness of the original target image. To make such cycle warping effective in the photometric constraint, MonoPCC employs a learning-free structure transplant module (STM) based on the fast Fourier transform (FFT) to minimize the negative impact of blurring. STM restores the structural details lost in the intermediate warped image by "borrowing" the phase-frequency part from the source image. Furthermore, instead of sharing network weights between the target-source and source-target warping paths, MonoPCC connects the two paths with an exponential moving average (EMA) strategy to stabilize the intermediate results of the first path.

In summary, our main contributions are as follows:

1. We propose MonoPCC, which makes the photometric constraint invariant to brightness changes by simply adopting a cycle form of warping, eliminating the misleading effect of inconsistent brightness in self-supervised learning.

2. We introduce two key enabling techniques, i.e., the structure transplant module (STM) and EMA-based stable training. STM restores the image details lost to interpolation, and EMA stabilizes the forward warping. Together, they guarantee effective training of MonoPCC under cycle warping.

3. Comprehensive and extensive experiments are conducted on four public endoscopic datasets, i.e., SCARED (Allan et al., 2021), SimCol3D (Rau et al., 2023), SERV-CT (Edwards et al., 2022), and Hamlyn (Mountney et al., 2010; Stoyanov et al., 2010; Pratt et al., 2010), as well as a public natural dataset, KITTI (Geiger et al., 2012). Comparison results with eight state-of-the-art methods demonstrate the superiority of MonoPCC, which reduces the absolute relative error on the four endoscopic datasets by 7.27%, 9.38%, 9.90%, and 3.17%, respectively, showing a strong ability to resist brightness inconsistency during training. Moreover, comparison results on KITTI further verify the competitiveness of MonoPCC even in natural scenes, where brightness changes are usually insignificant.
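To make the photometric constraint concrete, the following is a minimal sketch in the notation common to Monodepth2-style self-supervised pipelines; the symbols (K, T, D, alpha) and the SSIM+L1 form of the loss are standard conventions assumed here, not details quoted from the paper:

```latex
% A target pixel p_t is lifted to 3D using the estimated depth D_t(p_t),
% then reprojected into the source view with the estimated relative
% pose T_{t->s} and the camera intrinsics K:
\[ p_s \sim K \, T_{t \to s} \, D_t(p_t) \, K^{-1} p_t \]
% Sampling I_s at p_s gives the warped image I_{s->t}; DepthNet and
% PoseNet are jointly trained to minimize the photometric loss, commonly
% a weighted mix of SSIM and L1 terms:
\[ \mathcal{L}_{ph} = \frac{\alpha}{2}\bigl(1 - \mathrm{SSIM}(I_t, I_{s\to t})\bigr)
   + (1-\alpha)\,\bigl\lVert I_t - I_{s\to t} \bigr\rVert_1 \]
```

Under brightness fluctuations, the residual in this loss is dominated by lighting rather than geometry, which is exactly the failure mode MonoPCC targets.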
Abstract
Photometric constraint is indispensable for self-supervised monocular depth estimation. It involves warping a source image onto a target view using estimated depth and pose, and then minimizing the difference between the warped and target images. However, the endoscopic built-in light causes significant brightness fluctuations, and thus makes the photometric constraint unreliable. Previous efforts only mitigate this by relying on extra models to calibrate image brightness. In this paper, we propose MonoPCC to address the brightness inconsistency radically by reshaping the photometric constraint into a cycle form. Instead of only warping the source image, MonoPCC constructs a closed loop consisting of two opposite forward–backward warping paths: from target to source and then back to target. Thus, the target image finally receives an image cycle-warped from itself, which naturally makes the constraint invariant to brightness changes. Moreover, MonoPCC transplants the source image's phase-frequency into the intermediate warped image to avoid structure loss, and also stabilizes the training via an exponential moving average (EMA) strategy to avoid frequent changes in the forward warping. The comprehensive and extensive experimental results on five datasets demonstrate that our proposed MonoPCC shows great robustness to the brightness inconsistency, and exceeds other state-of-the-art methods by reducing the absolute relative error by 7.27%, 9.38%, 9.90% and 3.17% on four endoscopic datasets, respectively; superior results on the outdoor dataset verify the competitiveness of MonoPCC for the natural scenario.
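Written out, the cycle constraint compares the target image with a version of itself sent around the loop; below is a sketch with assumed notation (W for a warping operator, pe for the photometric error), not the paper's own symbols:

```latex
% Forward path: warp the target image into the source view, where STM
% restores its structure using the real source image I_s; backward
% path: warp the restored intermediate image back to the target view.
\[ I_{t \to s \to t} = \mathcal{W}_{s \to t}\bigl(\mathrm{STM}\bigl(\mathcal{W}_{t \to s}(I_t),\, I_s\bigr)\bigr) \]
% Both sides originate from I_t, so their brightness matches by
% construction and only geometric errors are penalized:
\[ \mathcal{L}_{PCC} = pe\bigl(I_t,\, I_{t \to s \to t}\bigr) \]
```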
Method
Fig. 2 illustrates the pipeline of MonoPCC, which consists of both forward and backward warping paths in the training phase. We first explain how to warp images for self-supervised learning in Section 3.1, and then detail the photometric-invariant principle of MonoPCC in Section 3.2, as well as its two key enabling techniques in Section 3.3 and Section 3.4, i.e., the structure transplant module (STM) for avoiding detail loss and EMA between the two paths for stabilizing the training.
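As a concrete illustration of the phase-transplant idea behind STM, here is a minimal single-channel NumPy sketch; the function name and interface are ours for illustration, not the authors' code:

```python
import numpy as np

def structure_transplant(warped: np.ndarray, source: np.ndarray) -> np.ndarray:
    """Learning-free structure transplant, sketched for one gray channel.

    Keeps the amplitude spectrum of the blurred intermediate warped image
    (its brightness/appearance) and borrows the phase spectrum of the real
    source image (its structural details), following the STM idea of
    restoring structure lost to the two bilinear interpolations.
    """
    spec_warped = np.fft.fft2(warped)            # FFT of intermediate warped image
    spec_source = np.fft.fft2(source)            # FFT of the real source image
    amplitude = np.abs(spec_warped)              # amplitude: appearance, brightness
    phase = np.angle(spec_source)                # phase: scene structure
    recombined = amplitude * np.exp(1j * phase)  # transplant the source phase
    return np.real(np.fft.ifft2(recombined))     # back to the image domain
```

Because the module is a fixed FFT operation with no learnable weights, it adds no trainable parameters to the pipeline.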
Conclusion
Self-supervised monocular depth estimation is challenging for endoscopic scenes due to the severe negative impact of brightness fluctuations on the photometric constraint. In this paper, we propose a cycle-form warping to naturally overcome the brightness inconsistency of endoscopic images, and develop MonoPCC for robust monocular depth estimation by using a re-designed photometric-invariant cycle constraint. To make the cycle-form warping effective in the photometric constraint, MonoPCC is equipped with two enabling techniques, i.e., the structure transplant module (STM) and the exponential moving average (EMA) strategy. STM alleviates image detail degradation to validate the backward warping, which uses the result of forward warping as input. EMA bridges the learning of network weights in the forward and backward warping, and stabilizes the intermediate warped image to ensure effective convergence. The comprehensive and extensive comparisons with eight state-of-the-art methods on five public datasets, i.e., SCARED, SimCol3D, SERV-CT, Hamlyn, and KITTI, demonstrate that MonoPCC achieves superior performance by decreasing the absolute relative error by 7.27%, 9.38%, 9.90% and 3.17% on the four endoscopic datasets, respectively, and shows competitiveness even in the natural scenario. Additionally, two ablation studies are conducted to confirm the effectiveness of the three developed modules and the advancement of MonoPCC over other similar techniques against brightness fluctuations.

Limitations. The current pipeline relies on a single frame to infer the depth map. Since each prediction is made independently, the model lacks perception of temporal consistency, meaning that the depth values at the same location may vary over time. This temporal inconsistency can lead to artifacts, such as overlapping tissue surfaces, as shown in the 3D reconstruction visualization in Fig. 12. Furthermore, our method is primarily designed for static endoscopic scenes. In dynamic scenarios involving tissue deformation, MonoPCC may not perform effectively. This is because the depth values at corresponding positions between source–target paired images can change locally, making it difficult to establish the cycle warping path consistently.

Potential Future Application. In this paper, we have demonstrated the effectiveness of the PCC strategy for self-supervised monocular depth estimation in endoscopic images. We believe that our framework can be seamlessly integrated into other related tasks, such as stereo matching (Shi et al., 2023) and metric depth estimation (Wei et al., 2022, 2024), both of which face challenges due to brightness fluctuations. Additionally, in the field of NeRF-based scene reconstruction, several depth-prior-assisted methods (Wang et al., 2022; Li et al., 2024; Huang et al., 2024) utilize estimated depth to guide model training. Therefore, our depth estimator, designed specifically for endoscopic scenes, could also enhance the performance of such downstream tasks.
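For reference, the EMA bridging described above typically amounts to the following update; this is a hedged PyTorch-style sketch with an assumed momentum value, not the authors' implementation:

```python
import torch

@torch.no_grad()
def ema_update(ema_net: torch.nn.Module, online_net: torch.nn.Module,
               momentum: float = 0.99) -> None:
    """Let the frozen forward-path (target->source) networks slowly track
    the actively trained backward-path networks, so the intermediate
    warped image changes smoothly during training.
    The momentum value 0.99 is an assumption for illustration."""
    for p_ema, p_online in zip(ema_net.parameters(), online_net.parameters()):
        # p_ema <- momentum * p_ema + (1 - momentum) * p_online
        p_ema.mul_(momentum).add_(p_online, alpha=1.0 - momentum)
```

Calling ema_update once per training step keeps the forward-path weights a slowly moving average of the learned weights, which is what stabilizes the first half of the cycle.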
Figure
Fig. 1. (a)–(b) are the source $I_s$ and target $I_t$ frames. (c) is the warped image from the source to the target. (d) is the cycle-warped image along the target–source–target path for reliable photometric constraint. Box contour colors distinguish different brightness patterns.
Fig. 2. The training pipeline of MonoPCC, which consists of forward and backward cascaded warping paths bridged by two enabling techniques, i.e., the structure transplant module (STM) and exponential moving average (EMA). The training has two phases, i.e., warm-up to initialize the network weights for reasonable forward warping, and follow-up to resist the brightness changes. Different box contour colors code different brightness patterns. ? means concatenation.
Fig. 3. Details of STM, which utilizes the phase-frequency of the source image $I_s$ to replace that of the intermediate warped image $I_{t\to s}$ to avoid image detail loss.
Fig. 4. The auxiliary perception constraint by backward warping the encoding feature maps instead of raw images.
Fig. 5. The Abs Rel error maps of comparison methods on SCARED and SimCol3D, with close-up details highlighted. The regions of interest (ROIs) are outlined with red dashed lines, and the OpenCV Jet colormap is used for visualization.
Fig. 6. The Abs Rel error maps of comparison methods on SERV-CT and Hamlyn, with close-up details highlighted. The regions of interest (ROIs) are outlined with red dashed lines, and the OpenCV Jet colormap is used for visualization.
Fig. 7. The Abs Rel error maps of seven ablation variants, covering the effectiveness of three components, with close-up details highlighted. (a)–(g) correspond to the 1st to the 7th rows in Table 4. The regions of interest (ROIs) are outlined with red dashed lines, and the OpenCV Jet colormap is used for visualization.
Fig. 8. The Abs Rel error maps of MonoPCC and other similar modules against photometric inconsistency, with close-up details highlighted. (a)–(d) correspond to the 1st to the 4th rows in Table 5. The regions of interest (ROIs) are outlined with red dashed lines, and the OpenCV Jet colormap is used for visualization.
Fig. 9. An example of created brightness perturbation. From left to right: the original image, the globally perturbed image ($\gamma = 1.2$), and the image perturbed both globally and locally (bright spots). The color-coded maps above them describe the subtractive difference between each perturbed image and the original one.
Fig. 10. The Abs Rel errors of different methods trained on the two brightness-perturbed copies of SCARED and the original SCARED.
Fig. 11. The qualitative pose estimation comparison based on two SCARED trajectories.
Fig. 12. Qualitative comparison results on the 3D scene reconstruction based on the estimated depth maps of two methods. The three sequences are selected from SCARED.
Table
Table 1 Evaluation metrics of monocular depth estimation, where $N$ refers to the number of valid pixels in the depth maps, and $\hat{d}_i$ and $d^{*}_i$ denote the estimated and GT depth of the $i$-th pixel, respectively. The Iverson bracket $[\cdot]$ yields 1 if the statement is true, otherwise 0.
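For concreteness, the definitions these symbols enter are shown below; since the table body is not reproduced here, the exact metric list is an assumption based on the standard depth-estimation benchmark protocol:

```latex
% Absolute relative error and root-mean-square error over valid pixels:
\[ \mathrm{Abs\,Rel} = \frac{1}{N}\sum_{i=1}^{N}\frac{\lvert \hat{d}_i - d^{*}_i \rvert}{d^{*}_i}
   \qquad
   \mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\bigl(\hat{d}_i - d^{*}_i\bigr)^{2}} \]
% Threshold accuracy, using the Iverson bracket from the caption:
\[ \delta = \frac{1}{N}\sum_{i=1}^{N}\Bigl[\max\Bigl(\frac{\hat{d}_i}{d^{*}_i},\,
   \frac{d^{*}_i}{\hat{d}_i}\Bigr) < 1.25 \Bigr] \]
```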
Table 2 Quantitative comparison results on SCARED and SimCol3D. The best results are marked in bold and the second-best underlined. The paired p-values between MonoPCC and others are all less than 0.05.
Table 3 Quantitative comparison results on SERV-CT and Hamlyn. The best results are marked in bold and the second-best underlined. The paired p-values between MonoPCC and others are all less than 0.05, except for $\delta$ on SERV-CT compared to MonoViT.
Table 4 The rows except the first one are the comparison results of the five variants and the complete MonoPCC, which are all cycle-constrained. The first row is the backbone MonoViT using the regular non-cycle constraint.
Table 5 Comparison results of different techniques for addressing the brightness fluctuations in self-supervised learning. The last row is the technique used in MonoPCC.
Table 6 Quantitative comparison results on KITTI. The best results are marked in bold and the second-best ones are underlined.
Table 7 Quantitative comparison results (Absolute Trajectory Error) of pose estimation on two trajectories of SCARED. The best results are marked in bold and the second-best underlined.