EfficientVMamba: Atrous Selective Scan for Light Weight Visual Mamba論文精讀(逐段解析)

論文地址:https://arxiv.org/abs/2403.09977
CVPR 2024

Abstract. Prior efforts in light-weight model development mainly centered on CNN and Transformer-based designs, yet faced persistent challenges. CNNs, adept at local feature extraction, compromise resolution, while Transformers offer global reach but escalate computational demands to $\mathcal{O}(N^2)$. This ongoing trade-off between accuracy and efficiency remains a significant hurdle. Recently, state space models (SSMs), such as Mamba, have shown outstanding performance and competitiveness in various tasks such as language modeling and computer vision, while reducing the time complexity of global information extraction to $\mathcal{O}(N)$. Inspired by this, this work proposes to explore the potential of visual state space models in light-weight model design and introduces a novel efficient model variant dubbed EfficientVMamba. Concretely, our EfficientVMamba integrates an atrous-based selective scan approach by efficient skip sampling, constituting building blocks designed to harness both global and local representational features. Additionally, we investigate the integration between SSM blocks and convolutions, and introduce an efficient visual state space block combined with an additional convolution branch, which further elevates the model performance. Experimental results show that EfficientVMamba scales down the computational complexity while yielding competitive results across a variety of vision tasks. For example, our EfficientVMamba-S with 1.3G FLOPs improves Vim-Ti with 1.5G FLOPs by a large margin of 5.6% accuracy on ImageNet. Code is available at: https://github.com/TerryPei/EfficientVMamba .

【翻譯】摘要。先前在輕量級模型開發方面的努力主要集中在基于CNN和Transformer的設計上,但仍面臨持續的挑戰。CNN擅長局部特征提取但會損害分辨率,而Transformer提供全局覆蓋但計算成本急劇上升至$\mathcal{O}(N^2)$,形成了一個顯著的障礙。最近,狀態空間模型(SSMs),如Mamba,在語言建模和計算機視覺等各種任務中表現出卓越的性能和競爭力,同時將全局信息提取的時間復雜度降低到$\mathcal{O}(N)$。受此啟發,本工作提出探索視覺狀態空間模型在輕量級模型設計中的潛力,并引入了一種名為EfficientVMamba的新型高效模型變體。具體而言,我們的EfficientVMamba通過高效跳躍采樣集成了基于空洞的選擇性掃描方法,構成了旨在利用全局和局部表征特征的構建塊。此外,我們研究了SSM塊與卷積之間的集成,并引入了一個與額外卷積分支相結合的高效視覺狀態空間塊,進一步提升了模型性能。實驗結果表明,EfficientVMamba在縮小計算復雜度的同時,在各種視覺任務中產生了具有競爭力的結果。例如,我們的EfficientVMamba-S僅需1.3G FLOPs,相比需要1.5G FLOPs的Vim-Ti在ImageNet上提升了5.6%的準確率。代碼可在以下鏈接獲取:https://github.com/TerryPei/EfficientVMamba。

【解析】當我們要構建一個既準確又高效的視覺模型時,傳統的兩大主流架構都有各自的明顯缺點。CNN雖然在提取圖像的局部特征方面表現出色(比如邊緣、紋理等),但這種局部性導致它在處理需要全局理解的任務時力不從心,而且為了獲得更大的感受野往往需要增加網絡深度,這又會降低特征圖的分辨率。Transformer架構恰恰相反,它的自注意力機制天生具備全局建模能力,可以讓圖像中任意兩個位置的像素直接交互,但這種全連接的特性導致計算復雜度隨輸入長度的平方增長,在高分辨率圖像上變得極其昂貴。

狀態空間模型為這個困境提供了一個很好的解決方案。它其實源自控制理論,最初用于描述動態系統的狀態演化過程。在深度學習的語境下,SSM可以被理解為一種序列建模工具,它通過維護一個隱藏狀態來捕獲序列中的長程依賴關系。關鍵的突破在于,SSM能夠以線性復雜度$\mathcal{O}(N)$實現全局信息的傳播和聚合,這打破了Transformer中二次復雜度的桎梏。Mamba作為SSM的一個重要實現,通過引入選擇性機制和硬件友好的算法設計,進一步提升了這種架構的實用性。

EfficientVMamba的核心創新體現在兩個方面:一是提出了基于空洞卷積思想的選擇性掃描策略,通過跳躍采樣減少需要處理的token數量,在保持全局感受野的同時大幅降低計算開銷;二是設計了一種雙路徑架構,將SSM的全局建模能力與卷積的局部特征提取優勢有機結合,通過通道注意力機制實現兩者的動態融合。

Keywords: Light-weight Architecture · Efficient Network · State Space Model

【翻譯】關鍵詞:輕量級架構 高效網絡 狀態空間模型

1 Introduction

Convolutional networks, exemplified by models such as ResNet [14], Inception [38, 39], and EfficientNet [41], and Transformer-based networks, such as Swin-Transformer [25], BEiT [1], and ResFormer [59], have been extensively applied to visual tasks including image classification, detection, and segmentation, achieving remarkable results. Recently, Mamba [7], a network based on state-space models (SSMs) [9-11, 20, 31], has demonstrated competitive performance to Transformers [47] in sequence modeling tasks such as language modeling. Inspired by this, some works [22, 24, 33, 55, 61] are pioneering in introducing SSMs into vision tasks. Among these methods, VMamba [24] stands out by introducing an SS2D method to preserve 2D spatial dependencies by scanning images from multiple directions.

【翻譯】卷積網絡,以ResNet [14]、Inception [38, 39]和EfficientNet [41]等模型為代表,以及基于Transformer的網絡,如SwinTransformer [25]、Beit [1]和Resformer [59],已被廣泛應用于包括圖像分類、檢測和分割在內的視覺任務,取得了顯著的成果。最近,基于狀態空間模型(SSMs)[9-11, 20, 31]的網絡Mamba [7]在語言建模等序列建模任務中展現出了與Transformers [47]相競爭的性能。受此啟發,一些工作[22, 24, 33, 55, 61]開始將SSMs引入視覺任務。在這些方法中,Vmamba [24]通過引入SS2D方法來保持2D空間依賴性,通過從多個方向掃描圖像而脫穎而出。

【解析】這段話說明了當前視覺任務的主流方法和最新進展。在深度學習的視覺領域,長期以來存在兩大主流架構:CNN和Transformer。CNN通過卷積操作能夠很好地捕獲圖像的局部特征和空間結構,而Transformer則通過自注意力機制實現了全局信息的建模能力。但是,這兩種架構都有各自的局限性——CNN難以建模長程依賴關系,而Transformer的計算復雜度過高。狀態空間模型作為一種新興的序列建模方法,最初在自然語言處理領域取得了成功,特別是Mamba模型在保持全局建模能力的同時大幅降低了計算復雜度。因此激發了研究者將SSM引入計算機視覺領域的興趣。Vmamba是這一方向的重要代表,它通過SS2D(二維選擇性掃描)方法解決了如何將原本設計用于一維序列的SSM適配到二維圖像數據的關鍵問題。

However, the impressive performance achieved by these various architectures usually comes from the scaling up of model sizes, making it a critical challenge to apply them on resource-constrained devices. In pursuit of light-weight models, many studies have been conducted to reduce the resource consumption of vision models while keeping competitive performance. Early works on efficient CNNs mainly focus on narrowing the original convolutional block with efficient group convolutions [4, 15, 34], light skipping connections [12, 27], etc. More recently, owing to the remarkable success of bringing the global representation ability of Transformers into vision tasks, some works have been proposed to reduce the computation complexity of ViTs [17, 25, 46, 49] and to fuse ViTs with CNNs in light-weight models [19, 28, 46]. However, the lightening of ViTs is usually obtained at the loss of the global capture capability of self-attention. Due to the $\mathcal{O}(N^2)$ time complexity of global self-attention, its computation and memory costs increase dramatically at large resolutions. As a result, existing efficient ViT methods have to perform local self-attention within partitioned windows [17, 25, 46], or only conduct global self-attention in deeper stages with low resolutions [19, 28]. This awkward trade-off and rollback of ViTs to CNNs hinders the ability to improve light-weight models further.

【翻譯】然而,這些各種架構所取得的令人印象深刻的性能通常來自于模型規模的擴大,這使得在資源受限的設備上應用它們成為一個關鍵挑戰。為了追求輕量級模型,許多研究已經進行,以減少視覺模型的資源消耗,同時保持競爭性能。早期關于高效CNN的工作主要集中在通過高效的分組卷積[4, 15, 34]、輕量跳躍連接[12, 27]等來縮小原始卷積塊。而最近,由于將Transformers的全局表示能力引入視覺任務取得了顯著成功,一些工作被提出來降低ViTs的計算復雜度[17, 25, 46, 49],并將ViTs與CNNs融合到輕量級模型中[19, 28, 46]。然而,ViTs的輕量化通常是以失去自注意力中的全局捕獲能力為代價獲得的。由于全局自注意力的$\mathcal{O}(N^2)$時間復雜度,其計算和內存成本在大分辨率下急劇增加。因此,現有的高效ViT方法必須在分割的窗口內執行局部自注意力[17, 25, 46],或只在具有低分辨率的深層階段進行全局自注意力[19, 28]。ViTs向CNNs的尷尬權衡和回退阻礙了進一步改進輕量級模型的能力。

【解析】這段話點出了當前輕量級模型設計面臨的難題。模型性能的提升往往依賴于參數量和計算復雜度的增加,但這與移動設備、邊緣計算等實際應用場景的資源限制產生了根本性矛盾。在輕量化的演進過程中,CNN領域的研究相對成熟,通過分組卷積、深度可分離卷積、跳躍連接等技術手段實現了效率與性能的較好平衡。Transformer在視覺任務中的成功主要源于其強大的全局建模能力,但這種能力的代價是二次復雜度的自注意力計算。當圖像分辨率增加時,這種復雜度會呈現爆炸式增長,使得模型在資源受限環境下難以實用。為了解決這個問題,現有的輕量化ViT方法不得不做出妥協:要么將注意力限制在局部窗口內(如Swin Transformer, https://blog.csdn.net/weixin_46248968/article/details/149156228?spm=1001.2014.3001.5502),要么只在網絡的深層使用全局注意力(此時特征圖分辨率已經較低)。這種妥協實際上是在向CNN的設計理念倒退,違背了Transformer全局建模的初衷,也限制了輕量級模型進一步突破的可能性。

Fig. 1: Lightweight Model Performance Comparison on ImageNet. EfficientVMamba outperforms previous work across various model variants in terms of both accuracy and computational complexity.

【翻譯】圖1:ImageNet上輕量級模型性能比較。EfficientVMamba在各種模型變體中,在準確率和計算復雜度方面都優于之前的工作。

In this paper, recalling the previously mentioned linear scaling complexity of SSMs, we are inspired to obtain efficient global capture ability in light-weight vision models by involving SSMs in the model design. Its outstanding performance is demonstrated in Figure 1. We achieve this by first introducing a skip-sampling mechanism, which reduces the number of tokens that need to be scanned in the spatial dimension and saves several times the computation cost in the sequence modeling of SSMs while keeping the global receptive field among tokens, as illustrated in Figure 2. On the other hand, acknowledging that convolutions provide a more efficient way for feature extraction in cases where only local representations suffice, we introduce a convolution branch to supplement the original global SSM branch, and perform feature fusion of them through the channel attention module SE [16]. Finally, for an optimal allocation of the capabilities of various block types, we construct our network with global SSM blocks in the shallow and high-resolution layers, while adopting efficient convolution blocks (MobileNetV2 blocks [34]) in the deeper layers. The final network, achieving efficient SSM computation and efficient integration of convolutions, has showcased significant improvements compared to previous CNN- and ViT-based light-weight models through our experiments on image classification, object detection, and semantic segmentation tasks.

【翻譯】在本文中,回顧前面提到的SSMs中的線性縮放復雜度,我們受到啟發,通過將SSMs引入模型設計來在輕量級視覺模型中獲得高效的全局捕獲能力。其卓越性能在圖1中得到了展示。我們通過首先引入跳躍采樣機制來實現這一點,該機制減少了在空間維度中需要掃描的標記數量,并在保持標記間全局感受野的同時節省了SSMs序列建模中數倍的計算成本,如圖2所示。另一方面,認識到卷積在僅需要局部表示的情況下提供了更高效的特征提取方式,我們引入了一個卷積分支來補充原始的全局SSM分支,并通過通道注意力模塊SE [16]對它們進行特征融合。最后,為了優化各種塊類型的能力分配,我們在淺層和高分辨率層構建了具有全局SSM塊的網絡,而在深層采用了高效的卷積塊(MobileNetV2塊[34])。最終的網絡實現了高效的SSM計算和卷積的高效集成,通過我們在圖像分類、目標檢測和語義分割任務上的實驗,與之前基于CNN和ViT的輕量級模型相比展現了顯著的改進。

【解析】這段話指出了EfficientVMamba的設計思路。作者從SSM的線性復雜度特性出發,提出了一個多層次的解決方案。首先是跳躍采樣機制,通過有策略地跳過某些token來減少計算量,但同時保持全局感受野不變。這就像在保持能看到整個畫面的前提下,選擇性地關注一些關鍵點,從而大幅降低計算開銷。其次是雙分支架構的設計,SSM擅長全局建模但在局部特征提取上可能不如卷積高效,而卷積在局部特征提取上有天然優勢。通過將兩者結合并用通道注意力進行動態權重分配,實現了優勢互補。最后是網絡層次化設計,這其實是進一步優化了特征提取過程:在淺層高分辨率階段,全局信息的建模更為重要,而在深層低分辨率階段,局部特征的進一步抽象和整合更加關鍵。這種分層設計提高效率的同時,更符合視覺特征處理的層次化規律。

Fig. 2: Illustration of efficient 2D scan methods (ES2D). (a.) VMamba [24] employs the SS2D method in vision tasks, traversing entire row or column axes, which incurs heavy computational resources. (b.) We present an efficient 2D scanning method, ES2D, which organizes patches by omitting sampling steps, and then proceeds with an intra-group traversal (with a skipping step of 2 in the Figure). The proposed scan approach reduces computational demands ($4N\rightarrow N$) while preserving global feature maps (e.g., each group contains eye-related patches).

【翻譯】圖2:高效2D掃描方法(ES2D)的圖示。(a.) Vmamba [24]在視覺任務中采用SS2D方法,遍歷整個行或列軸,這會產生大量計算資源開銷。(b.) 我們提出了一種高效的2D掃描方法ES2D,通過省略采樣步驟來組織補丁,然后進行組內遍歷(圖中跳躍步長為2)。所提出的掃描方法在保持全局特征圖的同時減少了計算需求($4N\rightarrow N$)(例如,每組都包含與眼部相關的補丁)。

In summary, the contributions of this paper are as follows.

  1. We propose an atrous-based selective scanning strategy, which is realized through novel skip sampling and regrouping of patches in the spatial receptive field. The strategy refines the building blocks to efficiently extract global dependencies while reducing computation complexity ($\mathcal{O}(N)\to\mathcal{O}(N/p^{2})$ with step $p$).
  2. We introduce a dual-pathway module that combines our efficient scanning strategy for global feature capture and a convolution branch for efficient local feature extraction, along with a channel attention module to balance the integration of both global and local features. Besides, we propose a better allocation of SSM and CNN blocks by promoting SSMs in early stages with high resolutions for better global capture, while adopting CNNs in low resolutions for better efficiency.
  3. We conduct extensive experiments on image classification, object detection, and semantic segmentation tasks. The results and illustration shown in Figure 1 demonstrate that, our EfficientVMamba effectively reduces the FLOPs of the models while achieving significant performance improvements compared to existing light-weight models.

【翻譯】總結而言,本文的貢獻如下:

  1. 我們提出了一種基于空洞的選擇性掃描策略,通過在空間感受野中新穎的跳躍采樣和重新分組補丁來實現。該策略優化了構建塊,以高效提取全局依賴關系,同時降低計算復雜度(從$\mathcal{O}(N)$到$\mathcal{O}(N/p^{2})$,步長為$p$)。

  2. 我們引入了一個雙路徑模塊,結合我們的高效掃描策略進行全局特征捕獲和卷積分支進行高效局部特征提取,以及一個通道注意力模塊來平衡全局和局部特征的集成。此外,我們提出了SSM和CNN塊的更好分配,通過在高分辨率的早期階段推廣SSMs以獲得更好的全局捕獲,而在低分辨率中采用CNNs以獲得更好的效率。

  3. 我們在圖像分類、目標檢測和語義分割任務上進行了廣泛的實驗。圖1中顯示的結果和說明表明,我們的EfficientVMamba有效降低了模型的FLOPs,同時與現有輕量級模型相比實現了顯著的性能改進。

【解析】EfficientVMamba的三個貢獻點。第一個貢獻是算法層面的創新,空洞掃描策略的數學表達$\mathcal{O}(N)\to\mathcal{O}(N/p^{2})$說明了計算復雜度的顯著降低——當跳躍步長為$p$時,計算量會按$p^2$的比例減少,這在大分辨率圖像處理中具有巨大價值。第二個貢獻是架構層面的創新,雙路徑設計考慮了不同計算模式特點的高效利用,而層次化的塊分配策略則改善了視覺特征處理過程。第三個貢獻驗證了理論設計的實際效果,跨多個視覺任務的實驗結果證明了其方法的通用性,也展示了在效率和性能雙重指標上的優越性。可以說,EfficientVMamba是輕量級視覺模型的優秀成果。

2 Related Work

2.1 輕量級視覺模型

In recent years, the realm of vision tasks has been predominantly governed by Convolutional Neural Networks (CNNs) and Visual Transformer (ViT) architectures. The focus on making these architectures lightweight to enhance efficiency has emerged as a pragmatic and promising direction in research. For CNNs, notable advancements have been made in improving image classification accuracy, as evidenced by the development of influential architectures like ResNet [ 14 ], RegNet [ 35 ], and DenseNet [ 18 ]. These advancements have set new benchmarks in accuracy but also introduced a need for lightweight architectures [ 51 , 52 ]. This need has been addressed through various factorization-based methods, making CNNs more mobile-friendly. For instance, separable convolutions introduced by Xception have been instrumental in this regard, leading to the development of state-of-the-art lightweight CNNs, such as MobileNets [ 15 ], ShuffleNetv2 [ 27 ], ESPNetv2 [ 29 ], MixConv [ 42 ], MNASNet [ 40 ], and GhostNets [ 13 ]. These models are not only versatile but also relatively simpler to train. Following CNNs, Transformers have gained significant traction in various vision tasks, such as image classification, object detection, and autonomous driving, rapidly becoming the mainstream approach. The lightweight versions of Transformers have been achieved through diverse methods. On the training front, sophisticated data augmentation strategies and techniques like Mixup [ 60 ], CutMix [ 58 ], and RandAugment [ 6 ] have been employed, as seen in models like CaiT [ 45 ] and DeiTIII [ 44 ], which demonstrate exceptional performance without the need for large proprietary datasets. From the architectural design perspective, efforts have been concentrated on optimizing self-attention input resolution and devising attention mechanisms that incur lower computational costs. Innovations like PVT-v1 [ 49 ]'s emulation of CNN’s feature map pyramid, Swin-T [ 25 ] and LightViT [ 17 ]'s hierarchical feature map and shifted-window mechanisms, and the introduction of (multi-scale) deformable attention modules in Deformable DETR [ 62 ] exemplify these advancements. There is also NAS for ViTs [ 37 ].

【翻譯】近年來,視覺任務領域主要由卷積神經網絡(CNNs)和視覺Transformer(ViT)架構主導。專注于使這些架構輕量化以提高效率已成為研究中一個實用且有前景的方向。對于CNNs,在提高圖像分類準確性方面取得了顯著進步,這體現在ResNet [14]、RegNet [35]和DenseNet [18]等有影響力架構的發展上。這些進步在準確性方面設立了新的基準,但也引入了對輕量級架構的需求[51, 52]。這種需求已通過各種基于分解的方法得到解決,使CNNs更適合移動設備。例如,Xception引入的可分離卷積在這方面發揮了重要作用,導致了最先進的輕量級CNNs的發展,如MobileNets [15]、ShuffleNetv2 [27]、ESPNetv2 [29]、MixConv [42]、MNASNet [40]和GhostNets [13]。這些模型不僅多功能,而且訓練相對簡單。繼CNNs之后,Transformers在各種視覺任務中獲得了顯著關注,如圖像分類、目標檢測和自動駕駛,迅速成為主流方法。Transformers的輕量級版本通過多種方法實現。在訓練方面,采用了復雜的數據增強策略和技術,如Mixup [60]、CutMix [58]和RandAugment [6],在CaiT [45]和DeiTIII [44]等模型中可以看到,這些模型在不需要大型專有數據集的情況下展現出卓越性能。從架構設計角度來看,努力集中在優化自注意力輸入分辨率和設計產生較低計算成本的注意力機制上。PVT-v1 [49]模擬CNN特征圖金字塔、Swin-T [25]和LightViT [17]的層次化特征圖和移位窗口機制,以及在Deformable DETR [62]中引入(多尺度)可變形注意力模塊等創新都體現了這些進步。還有針對ViTs的NAS [37]。

【解析】早期的ResNet、DenseNet等模型雖然在精度上取得突破,但參數量和計算量的增長使得實際部署變得困難。為解決這個問題,研究者們開發了多種分解技術,其中可分離卷積是重要的突破之一。可分離卷積將標準卷積分解為深度卷積和逐點卷積兩個步驟,大幅降低了參數量和計算量。基于這一思想,MobileNet系列、ShuffleNet系列等輕量級模型應運而生,它們在保持相當精度的同時將模型大小壓縮到了可以在移動設備上運行的程度。Transformer在視覺領域的輕量化則面臨著不同的挑戰。研究者從兩個維度來解決這個問題:訓練策略的優化和架構設計的改進。在訓練方面,通過更好的數據增強技術可以讓模型在較小的數據集上達到更好的性能,從而降低了對大型專有數據集的依賴。在架構方面,研究者們設計了各種巧妙的注意力機制變種,比如分層注意力、局部窗口注意力、可變形注意力等,這些設計在保持Transformer全局建模能力的同時顯著降低了計算復雜度。神經架構搜索(NAS)技術的引入進一步推動了這一發展,它能夠自動發現更優的網絡架構設計。

2.2 狀態空間模型

The State Space Model (SSM) [9-11, 20, 31] is a family of architectures that encapsulates a sequence-to-sequence transformation and has the potential to handle tokens with long dependencies, but it is challenging to train due to its high computational and memory usage. Nevertheless, recent works [7-9, 11, 36] have enabled deep State Space Models to become progressively more competitive with CNNs and Transformers. In particular, S4 [9] employs a Normal Plus Low-Rank (NPLR) representation to efficiently compute the convolution kernel by leveraging the Woodbury identity for matrix inversion. Mamba [7] then enhances SSMs with input-specific parameterization and a scalable, hardware-optimized algorithm, achieving simpler design and superior efficiency in processing long sequences for language and genomics. Following the success of SSMs, there has been a surge in applying the framework to computer vision tasks. S4ND [30] first introduces SSM blocks into vision tasks, facilitating the modeling of visual data across 1D, 2D, and 3D as continuous signals. VMamba [24] pioneers a Mamba-based vision backbone with a cross-scan module to address the direction-sensitivity issue arising from the differences between 1D sequences and multi-channel images. Similarly, Vim [61] introduces an efficient state space model for vision tasks by leveraging bidirectional state space modeling for data-dependent global visual context without image-specific biases. The impressive performance of the Mamba backbone in various vision tasks has inspired a wave of research [2, 22, 33, 48] focusing on adapting Mamba-based models for specialized vision applications. Recent works like Vm-unet [33], U-Mamba [22], and SegMamba [55] have adapted Mamba-based backbones for medical image segmentation, integrating unique features such as a U-shaped architecture in Vm-unet, an encoder-decoder framework in U-Mamba, and whole-volume feature modeling in SegMamba. In the domain of graph representation, GraphMamba [48] integrates Graph Guided Message Passing (GMB) with Message Passing Neural Networks (MPNN) within the Graph GPS architecture, which enhances the training and contextual filtration of graph embeddings. Furthermore, GMNs [2] present a comprehensive framework that encompasses tokenization, optional positional or structural encoding, localized encoding, sequencing of tokens, and utilizes a series of bidirectional Mamba layers for processing graphs.

【翻譯】狀態空間模型(SSM)[9-11, 20, 31]是一類架構族,封裝了序列到序列的變換,具有處理長依賴標記的潛力,但由于其高計算和內存使用量而難以訓練。然而,最近的工作[7-9,11,36]使深度狀態空間模型在與CNN和Transformer的競爭中變得越來越具有競爭力。特別是,S4 [9]采用正態加低秩(NPLR)表示,通過利用Woodbury恒等式進行矩陣求逆來高效計算卷積核。然后Mamba [7]通過輸入特定的參數化和可擴展的硬件優化算法增強了SSMs,在處理語言和基因組學的長序列方面實現了更簡單的設計和卓越的效率。隨著SSM的成功,將該框架應用于計算機視覺任務的研究激增。S4ND [30]首先將SSM塊引入視覺任務,促進了將1D、2D和3D視覺數據建模為連續信號。Vmamba [24]開創了基于mamba的視覺骨干網絡,采用交叉掃描模塊來解決1D序列和多通道圖像差異產生的方向敏感性問題。類似地,Vim [61]通過利用雙向狀態空間建模為視覺任務引入了高效的狀態空間模型,實現了數據依賴的全局視覺上下文而無圖像特定偏差。Mamba骨干網絡在各種視覺任務中的卓越性能激發了一波研究浪潮[2, 22, 33, 33, 48],專注于將基于Mamba的模型適配到專門的視覺應用中。最近的工作如Vm-unet [33]、U-Mamba [22]和SegMamba [55]已將基于Mamba的骨干網絡適配用于醫學圖像分割,集成了獨特的特征,如Vm-unet中的U形架構、U-Mamba中的編碼器-解碼器框架,以及SegMamba中的全體積特征建模。在圖表示領域,GraphMamba [48]在Graph GPS架構中集成了圖引導消息傳遞(GMB)和消息傳遞神經網絡(MPNN),這增強了圖嵌入的訓練和上下文過濾。此外,GMNs [2]提出了一個綜合框架,包括標記化、可選的位置或結構編碼、局部編碼、標記排序,并利用一系列雙向Mamba層來處理圖。

【解析】狀態空間模型本質上是一種特殊的序列建模架構,它最初來源于控制理論中的狀態空間概念。在深度學習中,SSM的核心思想是通過隱藏狀態來建模序列數據的長期依賴關系。與RNN和LSTM相比,SSM具有更強的數學理論基礎和更好的長序列處理能力,但傳統SSM的訓練確實存在計算復雜度高的問題。S4模型的突破在于引入了NPLR(正態加低秩)分解技術,這種技術巧妙地利用了矩陣的結構特性,通過Woodbury矩陣求逆恒等式將原本復雜的矩陣運算轉化為更高效的形式。Mamba進一步優化了這個過程,不僅在算法層面進行了改進,還考慮了硬件實現的效率,特別是針對現代GPU的并行計算特性進行了專門的優化。當SSM開始應用到計算機視覺領域時,面臨的主要挑戰是如何將原本設計用于一維序列的模型適配到二維圖像數據。S4ND通過將圖像視為連續信號的多維擴展解決了這個問題,而Vmamba則提出了更為精妙的交叉掃描機制,通過多個方向的掃描來確保圖像的二維結構信息不會丟失。

3 預備知識

3.1 State Space Models (S4)

State Space Models (SSMs) are a general family of sequence models used in deep learning that are influenced by systems capable of mapping one-dimensional sequences in a continuous manner. These models transform an input $D$-dimensional sequence $x(t)\in\mathbb{R}^{L\times D}$ into an output sequence $y(t)\in\mathbb{R}^{L\times D}$ by utilizing a learnable latent state $h(t)\in\mathbb{R}^{N\times D}$ that is not directly observable. The mapping process can be denoted as:

【翻譯】狀態空間模型(SSMs)是深度學習中使用的序列模型的一個通用族群,受能夠以連續方式映射一維序列的系統影響。這些模型通過利用不可直接觀察的可學習潛在狀態$h(t)\in\mathbb{R}^{N\times D}$,將輸入的$D$維序列$x(t)\in\mathbb{R}^{L\times D}$轉換為輸出序列$y(t)\in\mathbb{R}^{L\times D}$。映射過程可以表示為:

【解析】在傳統的序列建模中,模型需要直接處理輸入序列到輸出序列的映射,這在處理長序列時會遇到梯度消失和計算復雜度高的問題。SSM引入了一個中間的"隱藏狀態"概念,這個隱藏狀態就像是系統的內部記憶,它能夠捕獲和保持序列中的重要信息。這個隱藏狀態$h(t)$的維度是$N\times D$,其中$N$是狀態維度,$D$是特征維度。通過這種設計,模型不再需要直接處理復雜的長距離依賴關系,而是通過狀態的連續演化來間接實現這種建模能力。這種方法的優勢在于它能夠以線性的計算復雜度處理任意長度的序列,這是相對于Transformer的二次復雜度的巨大優勢。

$$h'(t) = A h(t) + B x(t), \qquad y(t) = C h(t),$$

where $A\in\mathbb{R}^{N\times N}$, $B\in\mathbb{R}^{D\times N}$, and $C\in\mathbb{R}^{D\times N}$.

【翻譯】其中$A\in\mathbb{R}^{N\times N}$、$B\in\mathbb{R}^{D\times N}$和$C\in\mathbb{R}^{D\times N}$。

【解析】這三個矩陣是SSM的核心參數,它們定義了系統的動態行為。矩陣$A$是狀態轉移矩陣,它控制著隱藏狀態如何隨時間演化,可以理解為系統的"記憶衰減"機制。矩陣$B$是輸入矩陣,它決定了當前輸入如何影響隱藏狀態的更新。矩陣$C$是輸出矩陣,它控制如何從隱藏狀態中提取有用信息來生成最終輸出。這種設計將復雜的序列建模問題分解為三個相對獨立的子問題:狀態演化、輸入處理和輸出生成。通過學習這三個矩陣的參數,模型能夠自適應地發現序列數據中的模式和規律。

Discretization. Discretization aims to convert the continuous differential equations into discrete functions, aligning the model to the input signal's sampling frequency for more efficient computation [10]. Following the work [11], the continuous parameters $(A, B)$ can be discretized by the zero-order hold (ZOH) method into the discrete parameters $(\bar{A}, \bar{B})$ with a time step $\Delta$:

【翻譯】離散化。離散化旨在將連續微分方程轉換為離散函數,使模型與輸入信號的采樣頻率對齊以實現更高效的計算[10]。根據工作[11],連續參數$(A, B)$可以通過零階保持(ZOH)方法離散化為具有時間步長$\Delta$的離散參數$(\bar{A}, \bar{B})$:

【解析】在實際的深度學習應用中,我們處理的都是離散的數據序列,而不是連續的信號。因此需要將連續時間的狀態空間模型轉換為離散時間版本。零階保持方法是信號處理中的一種標準技術,它假設在每個采樣間隔內,輸入信號保持常數值。這種方法的核心思想是在時間步長$\Delta$內,系統的輸入保持不變,從而可以通過積分得到精確的離散化形式。這種離散化不僅保持了原始連續系統的數學性質,還使得模型能夠直接處理數字化的序列數據,同時保證了數值計算的穩定性和效率。

$$\bar{A}=\exp(\Delta A), \qquad \bar{B}=(\Delta A)^{-1}(\exp(\Delta A)-I)\,\Delta B.$$

where $\bar{A}\in\mathbb{R}^{N\times N}$, $\bar{B}\in\mathbb{R}^{D\times N}$, and $\bar{C}\in\mathbb{R}^{D\times N}$.

【翻譯】其中$\bar{A}\in\mathbb{R}^{N\times N}$、$\bar{B}\in\mathbb{R}^{D\times N}$和$\bar{C}\in\mathbb{R}^{D\times N}$。

【解析】這個公式保證了即使在離散化后,系統仍然保持其原有的穩定性和收斂性質。矩陣$\bar{A}$通過矩陣指數來計算,這確保了狀態轉移的平滑性。而$\bar{B}$的計算涉及到矩陣逆和矩陣指數的組合,這個公式實際上是對連續時間內輸入對狀態影響的精確積分結果。
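為了更直觀地理解上述離散化過程,下面給出一個最小的NumPy/SciPy演示(狀態維度、矩陣取值與步長均為隨意假設的小規模示例,并非論文或Mamba的官方實現),把ZOH公式和離散遞推各算一遍:

```python
import numpy as np
from scipy.linalg import expm

# 假設的小規模示例:狀態維度 N=4,單通道輸入(列向量約定 h' = A h + B x)
N = 4
rng = np.random.default_rng(0)
A = -np.diag(rng.uniform(0.5, 1.5, size=N))   # 負對角矩陣,保證系統穩定,便于演示
B = rng.standard_normal((N, 1))
C = rng.standard_normal((1, N))
Delta = 0.1                                    # 時間步長 Δ

# 零階保持(ZOH)離散化:A_bar = exp(ΔA),B_bar = (ΔA)^{-1}(exp(ΔA) - I) ΔB
A_bar = expm(Delta * A)
B_bar = np.linalg.inv(Delta * A) @ (A_bar - np.eye(N)) @ (Delta * B)

# 離散遞推:h_k = A_bar h_{k-1} + B_bar x_k,y_k = C h_k
L = 16
x = rng.standard_normal(L)                     # 長度為 L 的輸入序列
h = np.zeros((N, 1))
ys = []
for x_k in x:
    h = A_bar @ h + B_bar * x_k
    ys.append((C @ h).item())
print(np.round(ys[:4], 4))                     # 輸出序列的前幾個值
```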

To simplify calculations, the repeated application of Equation 2 can be efficiently performed simultaneously using a global convolution approach.

【翻譯】為了簡化計算,方程2的重復應用可以使用全局卷積方法同時高效地執行。

【解析】這里提到的全局卷積方法是SSM計算效率的關鍵突破。傳統的遞歸計算需要逐步計算每個時間步的狀態,這種順序計算無法充分利用現代GPU的并行計算能力。通過將遞歸形式轉換為卷積形式,我們可以利用高度優化的卷積算子來并行處理整個序列。這種轉換不僅大幅提升了計算速度,還使得SSM能夠享受到深度學習框架中針對卷積操作的各種優化技術。

$$y = x \circledast \overline{K}, \qquad \text{with } \overline{K} = \left(C\overline{B},\ C\overline{A}\,\overline{B},\ \ldots,\ C\overline{A}^{L-1}\overline{B}\right),$$

where $\circledast$ denotes the convolution operation, and $\overline{K}\in\mathbb{R}^{L}$ is the SSM kernel.

【翻譯】其中$\circledast$表示卷積操作,$\overline{K}\in\mathbb{R}^{L}$是SSM核。

【解析】SSM核$\overline{K}$是整個狀態空間模型的核心,它將復雜的遞歸計算壓縮成了一個簡單的卷積核。這個核的每個元素都對應著不同時間步長的影響權重,從$C\overline{B}$(當前時間步)到$C\overline{A}^{L-1}\overline{B}$(最遠的歷史時間步)。通過這種方式,模型能夠在一次卷積操作中同時考慮所有歷史信息對當前輸出的影響。這種設計的好處在于它將時間維度上的復雜依賴關系轉化為空間維度上的卷積操作,從而實現了計算效率和建模能力的完美平衡。這也解釋了為什么SSM能夠以線性復雜度處理長序列——因為卷積操作的復雜度是線性的。
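沿用上面示例中的 `A_bar`、`B_bar`、`C` 與輸入序列 `x`,下面的示意代碼先按公式構造SSM核$\overline{K}$,再用一次因果卷積得到整個輸出序列,可以驗證它與逐步遞推的結果完全一致(僅為演示等價性的草圖):

```python
# 構造 SSM 核:K_bar = (C B_bar, C A_bar B_bar, ..., C A_bar^{L-1} B_bar)
K_bar = np.zeros(L)
Ak_B = B_bar.copy()                  # 逐步累乘得到 A_bar^k B_bar
for k in range(L):
    K_bar[k] = (C @ Ak_B).item()
    Ak_B = A_bar @ Ak_B

# 因果卷積:y_t = Σ_{k<=t} K_bar[k] * x[t-k],一次并行得到全部輸出
y_conv = np.convolve(x, K_bar)[:L]
print(np.allclose(y_conv, ys))       # True:卷積形式與遞推形式等價
```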

3.2 Selective State Space Models (S6)

Mamba [7] improves the performance of SSMs by introducing Selective State Space Models (S6), allowing the continuous parameters to vary with the input. This enhances selective information processing across sequences and extends the discretization process with a selection mechanism:

【翻譯】Mamba [7] 通過引入選擇性狀態空間模型(S6)來改善SSM的性能,允許連續參數隨輸入變化,增強了跨序列的選擇性信息處理,通過選擇機制擴展了離散化過程:

【解析】這里介紹了Mamba模型的核心創新點。傳統的SSM使用固定的參數矩陣來處理所有輸入,就像用一把萬能鑰匙去開所有的鎖。但Mamba意識到這種"一刀切"的方法并不高效,因為不同的輸入內容應該有不同的處理方式。比如在處理一張圖片時,如果圖片中有重要的物體,我們希望模型能夠重點關注這些區域;如果某些區域是背景噪聲,我們希望模型能夠適當忽略。S6的核心思想就是讓模型的參數能夠根據當前的輸入內容動態調整,實現"因材施教"的效果。這種選擇性機制使得模型能夠更智能地決定哪些信息需要重點處理,哪些可以輕度處理,從而大幅提升了模型的表達能力和計算效率。

$$\bar{B}=s_{B}(x), \qquad \bar{C}=s_{C}(x), \qquad \Delta=\tau_{A}(\mathrm{Parameter}+s_{A}(x)),$$

where $s_{B}(x)$ and $s_{C}(x)$ are linear functions that project the input $x$ into an $N$-dimensional space, while $s_{A}(x)$ broadens a $D$-dimensional linear projection to the necessary dimensions. In terms of visual tasks, VMamba proposed the 2D Selective Scan (SS2D) [24], which maintains the integrity of 2D image structures by scanning four directed feature sequences. Each sequence is processed independently within an S6 block and then combined to form a comprehensive 2D feature map.

【翻譯】其中$s_{B}(x)$和$s_{C}(x)$是將輸入$x$投影到$N$維空間的線性函數,而$s_{A}(x)$將$D$維線性投影擴展到必要的維度。在視覺任務方面,VMamba提出了二維選擇性掃描(SS2D)[24],通過掃描四個方向的特征序列來保持2D圖像結構的完整性。每個序列在S6塊內獨立處理,然后組合形成綜合的2D特征圖。

【解析】這組公式展示了選擇性機制的具體實現方式。與傳統SSM中固定的參數矩陣不同,這里的$\bar{B}$、$\bar{C}$和$\Delta$都變成了輸入$x$的函數,說明這些關鍵參數現在能夠根據輸入內容進行自適應調整。函數$s_B(x)$和$s_C(x)$負責處理輸入到狀態和狀態到輸出的映射關系,它們通過線性變換將輸入特征投影到合適的維度空間。而$s_A(x)$則控制時間步長的選擇,這個參數特別重要,因為它決定了模型在處理序列時的"采樣密度"。當遇到重要信息時,模型可以選擇更小的時間步長來精細處理;當遇到不那么重要的信息時,可以選擇更大的時間步長來快速跳過。對于視覺任務,VMamba將這種一維的選擇性掃描擴展到二維空間,通過四個方向(通常是從左到右、從右到左、從上到下、從下到上)的掃描來全面捕獲圖像的空間信息。這種設計保證了模型既能處理局部細節,又能建立全局的空間關聯,最終通過融合四個方向的信息來構建完整的特征表示。
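下面用PyTorch給出選擇性參數化的一個極簡草圖(只示意$\bar{B}$、$\bar{C}$、$\Delta$如何由輸入的線性投影產生;維度、$\tau$的取法等均為常見假設,并非Mamba的官方實現):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveParams(nn.Module):
    """由輸入 x (batch, L, D) 生成逐 token 的 B、C、Delta。"""
    def __init__(self, d_model: int, d_state: int):
        super().__init__()
        self.s_B = nn.Linear(d_model, d_state)      # s_B(x):投影到 N 維
        self.s_C = nn.Linear(d_model, d_state)      # s_C(x):投影到 N 維
        self.s_A = nn.Linear(d_model, 1)            # s_A(x):時間步長的輸入相關偏移
        self.delta_param = nn.Parameter(torch.zeros(1))  # 公式中的 Parameter 項

    def forward(self, x):
        B = self.s_B(x)                             # (batch, L, N),逐 token 變化
        C = self.s_C(x)                             # (batch, L, N)
        # τ_A 這里取 softplus,保證步長 Δ 為正(常見做法,非唯一選擇)
        Delta = F.softplus(self.delta_param + self.s_A(x))   # (batch, L, 1)
        return B, C, Delta

params = SelectiveParams(d_model=64, d_state=16)
x = torch.randn(2, 196, 64)                         # 2 個樣本、196 個 patch token
B, C, Delta = params(x)
print(B.shape, C.shape, Delta.shape)                # [2,196,16] [2,196,16] [2,196,1]
```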

4 Method

In order to design light-weight models that are friendly to resource-limited devices, we propose EfficientVMamba, which is summarized in Figure 3. We introduce an efficient selective scan approach to reduce the computational complexity in Section 4.1, and build a block considering both global and local feature extraction with the integration of SSMs and CNNs in Section 4.2. Regarding the design of the architecture, Section 4.4 then offers an in-depth look at various architectural variations tailored to different model sizes.

【翻譯】為了設計對資源受限設備友好的輕量化模型,我們提出了EfficientVMamba,如圖3所示。我們在4.1節中引入了一種高效的選擇性掃描方法來降低計算復雜度,并在4.2節中構建了一個同時考慮全局和局部特征提取的塊,該塊集成了SSM和CNN。關于架構設計,4.4節深入探討了針對不同模型大小量身定制的各種架構變體。

【解析】作者提出的解決方案包含三個關鍵層面:首先是算法層面的創新,通過改進選擇性掃描機制來降低計算復雜度;其次是架構層面的融合,將SSM的全局建模能力與CNN的局部特征提取優勢相結合,實現優勢互補;最后是系統層面的優化,針對不同的應用場景和硬件約束提供多種模型規格。

4.1 Efficient 2D Scanning (ES2D)

In deep neural networks, downsampling via pooling or strided convolution is employed to broaden the receptive field at a lower computational cost; however, this comes at the expense of spatial resolution. Previous works [46, 57] demonstrate that applying an atrous-based strategy benefits broadening the receptive field without sacrificing resolution. Inspired by this observation and aiming to alleviate and lighten the computational complexity of selective scanning, we propose an efficient 2D scanning (ES2D) method to scale down the visual selective scan block (SS2D) via skip sampling of the patches on the feature map. Given an input feature map $X\in\mathbb{R}^{C\times H\times W}$, instead of cross-scanning all patches, we skip-scan patches with a step size $p$ and partition them into selected spatial dimensional features $\{O_{i}\}_{i=1}^{4}$:

【翻譯】在深度神經網絡中,通過池化或帶步長的卷積進行下采樣可以以較低的計算成本擴大感受野;然而,這是以犧牲空間分辨率為代價的。先前的工作[46, 57]證明了應用基于空洞(atrous)的策略有利于在不犧牲分辨率的情況下擴大感受野。受此觀察啟發,并旨在減輕和簡化選擇性掃描的計算復雜度,我們提出了一種高效的2D掃描(ES2D)方法,通過對特征圖中的補丁進行跳躍采樣來縮減視覺選擇性掃描塊(SS2D)。給定輸入特征圖$X\in\mathbb{R}^{C\times H\times W}$,我們不是交叉掃描整個補丁,而是以步長$p$跳躍掃描補丁并分割成選定的空間維度特征$\{O_{i}\}_{i=1}^{4}$:

【解析】這段話介紹了ES2D方法的思路。傳統的下采樣方法雖然能夠減少計算量,但會損失圖像的細節信息,這對于需要精確空間信息的視覺任務來說是不可接受的。空洞卷積技術提供了一個解決方案,它通過在卷積核中插入空洞來擴大感受野,同時保持輸出分辨率不變。ES2D方法借鑒了這種思想,但應用在狀態空間模型的掃描機制上。與傳統SSM需要逐個處理所有空間位置不同,ES2D采用跳躍采樣策略,按照固定步長ppp選擇性地處理特征圖中的部分位置。這種策略的巧妙之處在于它不是隨機丟棄信息,而是有規律地采樣,確保在減少計算量的同時盡可能保留重要的空間結構信息。通過將完整的特征圖分解為四個不同方向的子特征圖,ES2D能夠在降低計算復雜度的同時維持對空間信息的全面捕獲。

$$O_{i} \xleftarrow{\text{scan}} X[:,\, m::p,\, n::p],$$

$$\{\tilde{O}_{i}\}_{i=1}^{4} \leftarrow \operatorname{SS2D}(\{O_{i}\}_{i=1}^{4}),$$

$$Y[:,\, m::p,\, n::p] \xleftarrow{\text{merge}} \tilde{O}_{i},$$

$$(m, n) = \left(\left\lfloor \frac{1}{2} + \frac{1}{2}\sin\Big(\frac{\pi}{2}(i-2)\Big)\right\rfloor,\ \left\lfloor \frac{1}{2} + \frac{1}{2}\cos\Big(\frac{\pi}{2}(i-2)\Big)\right\rfloor\right),$$

where $O_{i}, \tilde{O}_{i}\in\mathbb{R}^{C\times\frac{H}{p}\times\frac{W}{p}}$, and the operation $[:,\, m::p,\, n::p]$ represents slicing the matrix for each channel, starting at $m$ on the height ($H$) and $n$ on the width ($W$), skipping every $p$ steps. The process decomposes the full scanning method into both local and global sparse forms. Skip sampling for local receptive fields reduces computational complexity by selectively scanning smaller patches of the feature map. With a step size $p$, we sample $(C, H/p, W/p)$ patches at intervals of $p$, compared to $(C, H, W)$ in SS2D, decreasing the number of tokens processed from $N$ to $\frac{N}{p^{2}}$ for each scan and merge operation, which improves feature extraction efficiency. Re-grouping for global spatial feature maps in ES2D involves combining the processed patches to reconstruct the global structure of the feature map. This integration captures broader contextual information, balancing local detail and global context in feature extraction. Accordingly, our design is intended to streamline the scanning and merging modules while maintaining the essential benefit of global integration in the state-space architecture, with the aim of ensuring that the feature extraction remains comprehensive on the spatial axis.

【翻譯】其中$O_{i},\tilde{O}_{i}\in\mathbb{R}^{C\times\frac{H}{p}\times\frac{W}{p}}$,操作$[:, m::p, n::p]$表示對每個通道的矩陣進行切片,從高度($H$)上的$m$和寬度($W$)上的$n$開始,每$p$步跳躍一次。該過程將完全掃描方法分解為局部和全局稀疏形式。局部感受野的跳躍采樣通過選擇性掃描特征圖的較小補丁來降低計算復雜度。使用步長$p$,我們以$p$的間隔采樣$(C, H/p, W/p)$的補丁,與SS2D中的$(C, H, W)$相比,將每次掃描和合并操作處理的標記數量從$N$減少到$\frac{N}{p^{2}}$,這提高了特征提取效率。ES2D中全局空間特征圖的重新分組涉及組合處理過的補丁以重建特征圖的全局結構。這種集成捕獲更廣泛的上下文信息,在特征提取中平衡局部細節和全局上下文。因此,我們的設計旨在簡化掃描和合并模塊,同時保持狀態空間架構中全局集成的基本優勢,目標是確保特征提取在空間軸上保持全面性。

【解析】這組公式展示了ES2D方法的具體實現細節。第一個公式說明如何從原始特征圖中按照特定模式提取子區域,這里$(m,n)$的計算方式為四次掃描給出不同的起始位置。當$i$從1到4變化時,四個起點對應$p=2$時特征圖上交錯的四個子網格,即偏移$(0,0)$、$(0,1)$、$(1,0)$和$(1,1)$,確保了對整個特征圖的均勻覆蓋。第二個公式將提取的子區域輸入到標準的SS2D模塊中進行處理,這保證了狀態空間模型的核心功能得以保留。第三個公式則將處理后的結果重新組裝回原始的空間布局。計算復雜度由此大幅降低:原本需要處理$N=H\times W$個位置,現在每次掃描只需要處理$\frac{N}{p^2}$個位置,當$p=2$時,計算量就降低到原來的四分之一。但這種降低不是簡單的信息丟失,而是通過智能的采樣和重組策略,確保重要的空間關系得以保留。重組過程通過將四個方向的處理結果融合,能夠重建出接近完整分辨率的特征表示,從而在效率和性能之間找到了良好的平衡點。
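ES2D的"采樣-處理-合并"骨架可以用下面的NumPy草圖直觀表示($p=2$;為避免取整公式的歧義,這里直接枚舉四個起始偏移;SS2D部分用恒等映射占位,并非論文官方實現):

```python
import numpy as np

def es2d_skeleton(X, p=2):
    """X: (C, H, W)。以步長 p 跳躍采樣出 p*p 組子特征,分別處理后按原位置合并。"""
    Y = np.zeros_like(X)
    offsets = [(0, 0), (0, 1), (1, 0), (1, 1)]     # p=2 時四個交錯子網格的起點
    for m, n in offsets:
        O_i = X[:, m::p, n::p]                      # 采樣:每組形狀 (C, H/p, W/p)
        O_i_tilde = O_i                             # 占位:此處應執行 SS2D 選擇性掃描
        Y[:, m::p, n::p] = O_i_tilde                # 合并:寫回原空間位置
    return Y

X = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)
print(np.allclose(es2d_skeleton(X), X))  # True:四組采樣恰好無重疊地覆蓋全部位置
# 每次掃描處理的 token 數由 N=H*W 降為 N/p^2,即 p=2 時只有原來的四分之一
```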

4.2 Efficient Visual State Space Block (EVSS)

Based on the efficient selective scan approach, we introduce the Efficient Visual State Space (EVSS) block, which is designed to synergistically merge global and local feature representations while maintaining computational efficiency. It leverages an SE-enhanced ES2D for global information capture and a convolutional branch tailored to extract critical local features, with both branches undergoing a subsequent Squeeze-Excitation (SE) block [16]. The ES2D module aims to efficiently abstract global contextual information by implementing the intelligent skipping mechanism presented in Section 4.1. It selectively scans the map with a step size $p$, reducing redundancy without sacrificing the representational quality of the global context in the resultant spatial dimensional features. In parallel, empirical evidence concurs that convolutional operations offer a more proficient approach to feature extraction, particularly in scenarios where local representations are adequate. The added convolutional branch concentrates on discerning fine-grained local details through a $3\times3$ convolution of stride 1. The subsequent SE block adaptively recalibrates the features, allowing the network to automatically re-balance the local and global receptive fields on the feature map.

【翻譯】基于高效選擇性掃描方法,我們引入了高效視覺狀態空間(EVSS)塊,該塊旨在協同融合全局和局部特征表示,同時保持計算效率。它利用經SE增強的ES2D進行全局信息捕獲,以及一個專門用于提取關鍵局部特征的卷積分支,兩個分支都經過后續的壓縮激勵(SE)塊[16]。ES2D模塊旨在通過實施4.1節中提出的智能跳躍機制來高效地抽象全局上下文信息。它以步長$p$選擇性地掃描特征圖,在不犧牲結果空間維度特征中全局上下文表示質量的情況下減少冗余。與此同時,經驗證據表明卷積操作在特征提取方面提供了更熟練的方法,特別是在局部表示足夠的場景中。我們添加的卷積分支專注于通過步長為1的$3\times3$卷積來識別細粒度的局部細節。后續的SE塊自適應地重新校準特征,允許網絡自動重新平衡特征圖上的局部和全局感受野。

【解析】EVSS塊是作者提出的核心創新模塊,傳統的網絡架構往往偏向于某一種特征提取方式,要么專注于局部細節(如CNN),要么擅長全局建模(如Transformer),很難同時兼顧兩者的優勢。EVSS塊通過雙分支并行設計巧妙地解決了這個問題。第一個分支使用改進的ES2D模塊,它繼承了狀態空間模型強大的全局序列建模能力,能夠捕獲圖像中遠距離像素之間的依賴關系,這對于理解圖像的整體結構和語義至關重要。第二個分支采用傳統的$3\times3$卷積操作,專門負責提取局部紋理、邊緣等細節特征,這些特征對于圖像的精確識別不可或缺。更重要的是,每個分支后面都配備了SE注意力機制,能夠根據輸入內容的特點動態調整全局和局部特征的重要性權重。當圖像內容需要更多全局理解時(比如場景識別),SE模塊會增強全局分支的輸出;當需要關注細節時(比如紋理分析),則會強化局部分支的貢獻。這種自適應的特征融合機制使得EVSS塊能夠根據不同的視覺任務和輸入內容自動調整其行為模式,實現真正的智能化特征提取。

Fig. 3: Architecture overview of EfficientVMamba. We highlight our contributions with corresponding colors in the figure. (1) ES2D (Section 4.1): atrous-based selective scanning strategy via skip sampling and regrouping in the spatial space. (2) EVSS (Section 4.2): the EVSS block merges global and local feature extraction with modified ES2D and convolutional approaches enhanced by Squeeze-Excitation blocks for refined dual-pathway feature representation. (3) Inverted Fusion (Section 4.3): Inverted Fusion places local-representation modules in deep layers, deviating from traditional designs by utilizing EVSS blocks early for global representation and inverted residual blocks later for local feature extraction.

【翻譯】圖3:EfficientVMamba的架構概覽。我們在圖中用相應的顏色突出顯示了我們的貢獻。(1)ES2D(4.1節):通過在空間中進行跳躍采樣和重新分組的基于空洞的選擇性掃描策略。(2)EVSS(4.2節):EVSS塊將全局和局部特征提取與修改的ES2D和卷積方法相結合,通過壓縮激勵塊增強,以實現精細的雙路徑特征表示。(3)倒置融合(4.3節):倒置融合將局部表示模塊放置在深層,通過在早期利用EVSS塊進行全局表示和在后期使用倒置殘差塊進行局部特征提取,偏離了傳統設計。

【解析】這個架構圖展示了EfficientVMamba的整體設計思路和三個主要創新點。從圖中可以看出,作者采用了一種與傳統輕量級網絡截然不同的設計策略。傳統的輕量級網絡通常在前面幾層使用計算效率高的卷積操作來快速降低特征圖尺寸,然后在后面幾層引入全局建模模塊。但EfficientVMamba反其道而行之,在網絡的前期就引入了具有全局建模能力的EVSS塊,這樣做的好處是能夠在高分辨率特征圖上就開始建立全局的空間關聯,為后續的特征處理奠定良好的基礎。而在網絡的后期,當特征圖尺寸已經較小時,使用計算高效的倒置殘差塊來進行局部特征的精細化處理。這種"倒置"的設計理念充分利用了狀態空間模型計算復雜度為線性的優勢,使得在高分辨率下進行全局建模變得可行。同時,ES2D的跳躍采樣策略進一步降低了計算成本,使得這種設計在實際應用中具備了可行性。

The outputs of the respective SE blocks are combined via element-wise summation to construct the EVSS’s output and the dual pathway could be denoted as:

【翻譯】各自SE塊的輸出通過逐元素求和組合來構建EVSS的輸出,雙路徑可以表示為:

【解析】這里描述的是EVSS塊中兩個并行分支的融合機制。逐元素求和是最簡單也是最有效的特征融合方式之一,不需要額外的參數,計算成本極低,但能夠有效地整合來自不同路徑的信息,保持特征的維度和空間結構不變,同時允許兩個分支的特征在每個空間位置上進行直接的信息交換和增強。相比于其他融合方式如拼接或復雜的注意力機制,逐元素求和既保證了計算效率,又避免了參數量的增加,有利于輕量級網絡的設計。

$$X^{l+1}=\mathrm{SE}(\mathrm{ES2D}(X^{l}))+\mathrm{SE}(\mathrm{Conv}(X^{l})),$$

where $X^{l}$ represents the feature map of the $l$-th layer and $\mathrm{SE}(\cdot)$ is the Squeeze-Excitation operation. With each pathway utilizing an SE block, the EVSS ensures that the respective features of global and local information are dynamically re-balanced to emphasize the most salient features. This fusion aims to preserve the integrity of both the expansive global perspective and the intricate local specifics, facilitating a comprehensive feature representation.

【翻譯】其中$X^{l}$表示第$l$層的特征圖,$\mathrm{SE}(\cdot)$是壓縮激勵操作。通過每個路徑都使用SE塊,EVSS確保全局和局部信息的相應特征被動態重新平衡,以強調最顯著的特征。這種融合旨在保持廣闊的全局視角和復雜的局部細節的完整性,促進全面的特征表示。

【解析】ES2D分支負責捕獲全局上下文信息,它通過跳躍采樣策略在保持計算效率的同時獲得長距離的空間依賴關系。Conv分支則專注于局部特征提取,使用標準的卷積操作來識別紋理、邊緣等局部模式。兩個分支分別通過SE模塊進行特征重要性的自適應調節,這種設計確保了網絡能夠根據輸入內容的特點來動態調整全局和局部特征的貢獻權重。最終的逐元素相加操作不僅實現了特征融合,更重要的是創建了一種互補的特征表示,其中全局信息為局部特征提供上下文指導,而局部細節為全局理解提供精確的基礎。這種雙向的信息增強機制使得EVSS塊能夠產生既具有全局一致性又富含局部細節的綜合特征表示,為后續的視覺任務提供了高質量的特征基礎。SE模塊的引入進一步增強了這種協同效應,通過通道級別的注意力機制來突出最重要的特征通道,抑制噪聲和冗余信息。
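公式中的雙路徑結構大致對應下面的PyTorch草圖(ES2D分支以外部注入的模塊占位,SE為標準通道注意力寫法;通道數、縮減比例均為假設值,并非論文官方實現):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """標準 Squeeze-Excitation:全局平均池化 -> 降維/升維 -> 按通道加權。"""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.gate(x)

class EVSSBlock(nn.Module):
    def __init__(self, channels: int, es2d: nn.Module):
        super().__init__()
        self.es2d = es2d                                        # 全局分支:ES2D 選擇性掃描
        self.conv = nn.Conv2d(channels, channels, 3, padding=1) # 局部分支:stride 1 的 3x3 卷積
        self.se_global = SEBlock(channels)
        self.se_local = SEBlock(channels)

    def forward(self, x):
        # X^{l+1} = SE(ES2D(X^l)) + SE(Conv(X^l)):逐元素相加融合全局與局部特征
        return self.se_global(self.es2d(x)) + self.se_local(self.conv(x))

block = EVSSBlock(channels=48, es2d=nn.Identity())  # 演示時用 Identity 占位 ES2D
print(block(torch.randn(1, 48, 56, 56)).shape)      # torch.Size([1, 48, 56, 56])
```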

4.3 Inverted Insertion of EfficientNet Blocks(EfficientNet塊的倒置插入)

As a well-established consensus, convolutional operations are more computationally efficient than global-based blocks such as Transformers. Prior light-weight works have predominantly employed computation-efficient convolutions in the former stages to scale down the token numbers and reduce computational complexity, subsequently integrating global-based blocks (e.g., Transformers with a computational complexity of $\mathcal{O}(N^2)$) to capture global context in the latter stages. For example, MobileViT [28] adopts pure MobileNetV2 blocks in the first two downsampling stages, while only integrating self-attention operations in the latter stages at low resolutions. EfficientFormer [19] introduces two types of base blocks: the convolution-based blocks with local pooling are used in the first three stages, and the transformer-like self-attention blocks are only leveraged in the last stage.

【翻譯】作為一個已確立的共識,卷積操作的計算效率比基于全局的塊(如Transformer)更高效。先前的輕量級工作主要在前期階段采用計算高效的卷積來縮減標記數量以減少計算復雜度,隨后在后期階段集成基于全局的塊(例如,具有$\mathcal{O}(N^2)$計算復雜度的Transformer)來捕獲全局上下文。例如,MobileViT [28]在前兩個下采樣階段采用純MobileNetV2塊,僅在低分辨率的后期階段集成自注意力操作。EfficientFormer [19]引入了兩種類型的基礎塊,基于卷積的塊與局部池化在前三個階段使用,而類似transformer的自注意力塊僅在最后階段使用。

【解析】卷積操作的計算復雜度相對較低,特別是在處理高分辨率特征圖時,而全局建模模塊(如Transformer的自注意力機制)的計算復雜度往往是平方級別的,在高分辨率下會產生巨大的計算開銷。因此,幾乎所有的輕量級網絡都采用了"前卷積后全局"的設計策略:在網絡的前幾層使用高效的卷積操作快速降低特征圖的空間分辨率,減少后續處理的數據量,然后在較低分辨率的特征圖上應用計算密集的全局建模模塊。MobileViT和EfficientFormer就是這種設計理念的典型代表,它們在網絡的早期階段大量使用MobileNet的深度可分離卷積或普通卷積來進行特征提取和尺寸壓縮,只在網絡的最后幾層引入自注意力機制來建立全局的特征關聯。這種設計在計算資源受限的環境下是合理的,但也限制了網絡在高分辨率下進行全局建模的能力。

However, the observation is the opposite for Mamba-based blocks. In the SSM framework, the computational complexity of global representation is $\mathcal{O}(N)$, indicating that placing local-representation modules at either the front or the back of the stage sequence could be reasonable. Through empirical observation in Table 6, we found that positioning these local-representation modules towards the latter layers of the network yields better results. This discovery significantly deviates from the design principles of previous CNN-based and Transformer-based lightweight models, hence we call it inverted insertion. Consequently, our designed $L$-stage architecture is an inverted insertion of EfficientNet blocks (MobileNetV2 blocks with SE modules), which utilizes EVSS blocks (Section 4.2) in the former two stages to capture global representations and Inverted Residual blocks $\mathrm{InRes}(\cdot)$ [34] in the subsequent stages to extract local feature maps:

【翻譯】然而,在基于Mamba的塊上觀察結果是相反的。在SSM框架中,全局表示的計算復雜度是$\mathcal{O}(N)$,這表明將局部表示模塊放置在階段的前面或后面都可能是合理的。通過表6中的經驗觀察,我們發現將這些局部表示模塊定位在階段的后層會產生更好的結果。這一發現顯著偏離了先前基于CNN和基于Transformer的輕量級模型的設計原則,因此我們稱之為倒置插入。因此,我們設計的$L$階段架構是EfficientNet塊(帶有SE模塊的MobileNetV2塊)的倒置插入,它在前兩個階段利用EVSS塊(4.2節)來捕獲全局表示,在后續階段使用倒置殘差塊$\mathrm{InRes}(\cdot)$[34]來提取局部特征圖:

【解析】狀態空間模型的最大優勢在于其線性的計算復雜度O(N)\mathcal O(N)O(N),線性復雜度說明即使在高分辨率的特征圖上進行全局建模,計算成本也是可以接受的,這就為網絡設計提供了全新的可能性。作者通過大量實驗發現,將全局建模模塊放在網絡前期、局部特征提取模塊放在網絡后期,能夠獲得更好的性能表現。這種"倒置"的設計,背后的原理在于:在網絡前期,特征圖分辨率較高,包含豐富的細節信息,此時進行全局建模能夠更好地建立像素間的長距離依賴關系,為后續的特征處理提供全局的上下文指導;而在網絡后期,特征圖已經經過多次抽象,空間尺寸較小,此時使用高效的卷積操作進行局部特征的精細化處理更為合適。

$$X^{l+1}=\begin{cases}\mathrm{EVSS}(X^{l}) & \text{if } X^{l}\in\{\text{stage 1},\ \text{stage 2}\};\\ \mathrm{InRes}(X^{l}) & \text{otherwise},\end{cases}$$

【解析】這個分段函數清晰地定義了EfficientVMamba的倒置插入策略。在網絡的前兩個階段(stage1和stage2),使用EVSS塊進行特征處理,這些階段對應于較高分辨率的特征圖,EVSS塊中的ES2D組件能夠高效地進行全局特征建模,建立長距離的空間依賴關系。在其余階段,使用倒置殘差塊(InRes)進行局部特征提取,這些階段的特征圖分辨率較低,使用計算高效的卷積操作更為合適。不同網絡深度處有著不同的特征特性:淺層特征富含空間細節信息,適合進行全局關聯建模;深層特征已經高度抽象,更適合進行局部模式的精細化識別。這種設計充分利用了狀態空間模型和卷積操作各自的優勢,在不同的網絡階段發揮最適合的特征處理能力。

where $X^{l}$ is the feature map of the $l$-th layer. The inverted insertion design of using shortcuts directly between the bottlenecks is considerably more memory efficient [34].

【翻譯】其中$X^{l}$是第$l$層的特征圖。使用瓶頸間直接快捷連接的倒置插入設計在內存方面相當高效[34]。

【解析】倒置殘差塊的核心設計理念是在低維瓶頸層之間建立快捷連接,而不是在高維的擴展層之間。這種設計的內存優勢:首先,快捷連接建立在維度較小的特征圖之間,需要保存的中間結果更少,顯著降低了內存占用;其次,這種設計允許在前向傳播過程中更早地釋放一些中間特征圖的內存,提高了內存的使用效率;最后,在反向傳播過程中,梯度可以更直接地傳播,減少了需要緩存的中間梯度信息。這種內存高效的設計對于資源受限的移動設備和邊緣計算場景尤為重要,結合EVSS塊的全局建模能力和倒置殘差塊的內存效率,EfficientVMamba實現了性能和效率的良好平衡。
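把上面的分段函數落到網絡搭建上,大致相當于下面的草圖(沿用上面的EVSSBlock;倒置殘差塊只寫成最簡的"1x1升維-3x3深度卷積-1x1降維"形式,且為突出塊分配邏輯省略了下采樣與通道變化,均為示意性假設):

```python
import torch.nn as nn

def inverted_residual(channels: int, expand: int = 4) -> nn.Module:
    """簡化的 MobileNetV2 倒置殘差塊(示意,未含 SE 與殘差連接)。"""
    hidden = channels * expand
    return nn.Sequential(
        nn.Conv2d(channels, hidden, 1), nn.ReLU6(inplace=True),   # 1x1 升維
        nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden),   # 3x3 深度卷積
        nn.ReLU6(inplace=True),
        nn.Conv2d(hidden, channels, 1))                            # 1x1 線性降維

def make_stages(num_stages: int = 4, channels: int = 48) -> nn.Sequential:
    stages = []
    for s in range(num_stages):
        if s < 2:   # stage1、stage2:高分辨率下用 EVSS 做線性復雜度的全局建模
            stages.append(EVSSBlock(channels, es2d=nn.Identity()))
        else:       # 深層低分辨率階段:用高效卷積塊做局部特征提取
            stages.append(inverted_residual(channels))
    return nn.Sequential(*stages)
```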

4.4 模型變體(Model Variants)

To sufficiently demonstrate the effectiveness of our proposed model, we detail architectural variants rooted in plain structures as referenced in [61]. These variants are designated as EfficientVMamba-T, EfficientVMamba-S, and EfficientVMamba-B, shown in Table 1, corresponding to different scales of the model. EfficientVMamba-T is the most lightweight with 6M parameters, followed by EfficientVMamba-S with 11M, and EfficientVMamba-B being the most complex with 33M. In terms of computational load, measured in FLOPs, the models exhibit a parallel increase with 0.8G for EfficientVMamba-T, 1.3G for EfficientVMamba-S, and 4.0G for EfficientVMamba-B, correlating directly with their complexity and feature size.

【翻譯】為了充分證明我們提出模型的有效性,我們詳細介紹了基于[61]中所引用的簡單結構的架構變體。這些變體被命名為EfficientVMamba-T、EfficientVMamba-S和EfficientVMamba-B,如表1所示,對應于模型的不同規模。EfficientVMamba-T是最輕量級的,具有6M參數,其次是EfficientVMamba-S的11M參數,而EfficientVMamba-B是最復雜的,具有33M參數。在以FLOPs衡量的計算負載方面,模型呈現平行增長,EfficientVMamba-T為0.8G,EfficientVMamba-S為1.3G,EfficientVMamba-B為4.0G,這與它們的復雜性和特征大小直接相關。

Table 1: Model variants of EfficientVMamba.

【翻譯】表1:EfficientVMamba的模型變體。

5 Experiments

To rigorously evaluate the performance of our diverse model variants, we demonstrate the results of the image classification task in Section 5.1, investigate object detection performance in Section 5.2, and explore image semantic segmentation in Section 5.3. In Section 5.4, we further pursue an ablation study to comprehensively examine the effects of atrous selective scanning, the impact of SSM-Conv fusion blocks, and the implications of incorporating convolution blocks at different stages of the models.

【翻譯】為了嚴格評估我們多樣化模型變體的性能,我們在第5.1節展示了圖像分類任務的結果,在第5.2節研究了目標檢測性能,在第5.3節探索了圖像語義分割。在第5.4節中,我們進一步進行了消融研究,以全面檢驗空洞選擇性掃描的效果、SSM-Conv融合塊的影響,以及在模型不同階段引入卷積塊的意義。

5.1 ImageNet圖像分類

Training strategies. Following previous works [24, 25, 43, 61], we train our models for 300 epochs with a base batch size of 1024 and an AdamW optimizer; a cosine annealing learning rate schedule is adopted with an initial value of $10^{-3}$ and a 20-epoch warmup. For training data augmentation, we use random cropping, AutoAugment [5] with policy rand-m9-mstd0.5, and random erasing of pixels with a probability of 0.25 on each image; then a MixUp [60] strategy with ratio 0.2 is adopted in each batch. An exponential moving average of the model is adopted with decay rate 0.9999.

【翻譯】訓練策略。遵循先前的工作[24, 25, 43, 61],我們使用基礎批次大小為1024的AdamW優化器訓練我們的模型300個epochs,采用余弦退火學習率調度,初始值為$10^{-3}$,并進行20個epoch的預熱。對于訓練數據增強,我們使用隨機裁剪、帶有策略rand-m9-mstd0.5的AutoAugment[5],以及在每張圖像上以0.25的概率隨機擦除像素,然后在每個批次中采用比例為0.2的MixUp[60]策略。采用衰減率為0.9999的模型指數移動平均。
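正文描述的"初始學習率$10^{-3}$、20個epoch預熱、余弦退火"調度,可以用PyTorch自帶的調度器近似拼出,如下草圖所示(模型為占位假設,僅示意調度組合方式,并非論文官方訓練腳本):

```python
import torch
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

model = torch.nn.Linear(10, 10)      # 占位模型
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)   # 初始學習率 1e-3

epochs, warmup = 300, 20
scheduler = SequentialLR(
    optimizer,
    schedulers=[
        LinearLR(optimizer, start_factor=1e-3, total_iters=warmup),  # 20 epoch 線性預熱
        CosineAnnealingLR(optimizer, T_max=epochs - warmup),         # 其余 epoch 余弦退火
    ],
    milestones=[warmup])

for epoch in range(epochs):
    # train_one_epoch(model, optimizer)  # 訓練一個 epoch(略)
    scheduler.step()
```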

Tiny Models ($\mathrm{FLOPs(G)}\in[0,1]$). In the pursuit of efficiency, the results of tiny models are shown in Table 2. EfficientVMamba-T achieves state-of-the-art performance with a Top-1 accuracy of 76.5%, rivalling its counterparts that demand higher computational costs. With a modest expenditure of only 0.8 GFLOPs, our model surpasses PVTv2-B0 by a 6% margin in accuracy and outperforms MobileViT-XS by 1.7%, all with less computational demand.

【翻譯】微型模型($\mathrm{FLOPs(G)}\in[0,1]$)。在追求效率的過程中,微型模型的結果如表2所示。EfficientVMamba-T以76.5%的Top-1準確率實現了最先進的性能,與需要更高計算成本的同類模型相匹敵。僅需0.8 GFLOPs的適度開銷,我們的模型在準確率上以6%的優勢超越PVTv2-B0,并以1.7%的優勢超越MobileViT-XS,所有這些都是在更低的計算需求下實現的。

Small Models ($\mathrm{FLOPs(G)}\in[1,2]$). Our model, EfficientVMamba-S, exhibits a significant improvement in accuracy, achieving a Top-1 accuracy of 78.7%. This represents a substantial increase over DeiT-Ti and MobileViT-S, which achieve 72.2% and 78.4% respectively. Notably, EfficientVMamba-S maintains this high accuracy level with computational efficiency, requiring only 1.3 GFLOPs, which is on par with DeiT-Ti and lower than MobileViT-S's 2.0 GFLOPs.

【翻譯】小型模型($\mathrm{FLOPs(G)}\in[1,2]$)。我們的模型EfficientVMamba-S在準確率上表現出顯著改進,實現了78.7%的Top-1準確率。這相對于分別達到72.2%和78.4%的DeiT-Ti和MobileViT-S來說是大幅提升。值得注意的是,EfficientVMamba-S在保持高準確率水平的同時具有計算效率,僅需要1.3 GFLOPs,這與DeiT-Ti相當,并且低于MobileViT-S的2.0 GFLOPs。

Table 2: Comparison of different backbones on ImageNet-1K classification.

【翻譯】表2:不同骨干網絡在ImageNet-1K分類上的比較。

Base Models ($\mathrm{FLOPs(G)}\in[4,5]$). EfficientVMamba-B achieves an impressive Top-1 accuracy of 81.8%, surpassing DeiT-S by 2% and Vim-S by 1.5%, as indicated in the third group of Table 2. This base model demonstrates the feasibility of coupling a substantial parameter count of 33M with a modest computational demand of 4.0 GFLOPs. In comparison, VMamba-T, with a similar parameter count of 22M, requires a higher 5.6 GFLOPs.

【翻譯】基礎模型($\mathrm{FLOPs(G)}\in[4,5]$)。EfficientVMamba-B實現了令人印象深刻的81.8%的Top-1準確率,如表2第三組所示,超越DeiT-S 2%,超越Vim-S 1.5%。這個基礎模型證明了將33M的大量參數數量與4.0 GFLOPs的適度計算需求相結合的可行性。相比之下,具有類似22M參數數量的VMamba-T需要更高的5.6 GFLOPs。

5.2 Object Detection

Training strategies. We evaluate the efficacy of our EfficientVMamba model on object detection tasks on the MSCOCO 2017 [21] dataset. Our evaluation framework relies on the mmdetection library [3]. For comparisons with lightweight backbones, we follow PVT [49] in using RetinaNet as the detector and adopt the 1× training schedule. For comparisons with larger backbones, our experiments follow the hyperparameter settings detailed in Swin [25]. We use the AdamW optimization method to refine the weights of our networks pre-trained on ImageNet-1K for durations of 12 and 36 epochs. We apply drop path rates of 0.2% across the board for the EfficientVMamba-T/S/B variants. The learning rate begins at $1\times10^{-5}$ and is decreased tenfold at epochs 9 and 11. Multi-scale training and random flipping are implemented during training with a batch size of 16, adhering to standard procedures for evaluating object detection systems.

【翻譯】訓練策略。我們在MSCOCO 2017 [21]數據集上評估EfficientVMamba模型在目標檢測任務中的有效性。我們的評估框架依賴于mmdetection庫[3]。對于與輕量級骨干網絡的比較,我們遵循PvT [49]使用RetinaNet作為檢測器并采用1×訓練計劃。而對于與較大骨干網絡的比較,我們的實驗遵循Swin [25]中詳述的超參數設置。我們使用AdamW優化方法在ImageNet-1K上細化預訓練網絡的權重,訓練持續12和36個epoch。我們對EfficientVMamba-T/S/B變體全面應用0.2%的drop path率。學習率從1e-5開始,在第9和第11個epoch時降低十倍。在訓練過程中實施多尺度訓練和隨機翻轉,批次大小為16,遵循評估目標檢測系統的標準程序。

Table 3: COCO detection results on RetinaNet.

【翻譯】表3:RetinaNet上的COCO檢測結果。

Results. We summarize the results of the RetinaNet detector in Table 3. Remarkably, each variant competitively reduces model size while simultaneously exhibiting a performance enhancement. The EfficientVMamba-T model stands out with 13M parameters and an AP of 37.5%, higher by 5.7% compared to ResNet-18, which has 21.3M parameters. The performance of EfficientVMamba-T also surpasses PVTv1-Tiny by 0.8% while matching it in terms of parameter count. EfficientVMamba-S, with only 19M parameters, achieves a commendable AP of 39.1%, outstripping the larger ResNet-50 model, which shows a lower AP of 36.3% despite having 37.7M parameters. In the higher echelons, EfficientVMamba-B, which boasts 44M parameters, secures an AP of 42.8%, signifying a significant lead over both ResNet-101 and ResNeXt101-32x4d, highlighting the efficiency of our models even with a smaller parameter footprint. Notably, PVTv2-b0 with 13M parameters achieves an AP of 37.2%, which EfficientVMamba-T closely follows, indicating competitive performance with a similar parameter budget. For comparisons with other backbones on Mask R-CNN, see the Appendix.

【翻譯】結果。我們在表3中總結了RetinaNet檢測器的結果。值得注意的是,每個變體在競爭性地減小尺寸的同時都表現出性能提升。EfficientVMamba-T模型以13M參數和37.5%的AP表現突出,相比具有21.3M參數的ResNet-18高出5.7%。EfficientVMamba-T的性能也超越PVTv1-Tiny 0.8%,同時在參數數量上與其匹配。EfficientVMamba-S僅用19M參數就實現了令人稱贊的39.1%的AP,超越了更大的ResNet50模型,后者盡管有37.7M參數但AP僅為36.3%。在更高層次上,擁有44M參數的EfficientVMamba-B獲得了42.8%的AP,相對于ResNet101和ResNeXt101-32x4d都有顯著領先,突出了我們模型即使在較小參數占用下的效率。值得注意的是,具有13M參數的PVTv2-b0實現了37.2%的AP,EfficientVMamba-T緊隨其后,表明在相似參數預算下具有競爭性能。關于與其他骨干網絡在Mask R-CNN上的比較,請參見附錄。

5.3 Semantic Segmentation

Training strategies. Aligning with the VMamba [24] settings, we integrate an UperHead into the pre-trained model structure. Utilizing the AdamW optimizer, we initiate the learning rate at $6\times10^{-5}$. The fine-tuning stage consists of 160k iterations, using a batch size of 16. While the standard input resolution stands at $512\times512$, we also conduct experiments with $640\times640$ inputs and apply multi-scale (MS) testing to broaden our evaluation.

【翻譯】訓練策略。與VMamba [24]設置保持一致,我們將UperHead集成到預訓練模型結構中。使用AdamW優化器,我們將學習率初始化為$6\times10^{-5}$。微調階段包含160k次迭代,使用批次大小為16。雖然標準輸入分辨率為$512\times512$,我們也使用$640\times640$輸入進行實驗,并應用多尺度(MS)測試來擴大我們的評估范圍。

Results. The EfficientVMamba-T model yields mIoUs of 38.9% (SS) and 39.3% (MS), approaching ResNet-50's 42.1% mIoU with far fewer parameters. EfficientVMamba-S achieves 41.5% (SS) and 42.1% (MS) mIoUs, bettering DeiT-S + MLN despite having a lower computational footprint. EfficientVMamba-B reaches 46.5% (SS) and 47.3% (MS), outperforming the heavier VMamba-S. These findings attest to the EfficientVMamba series' balance of accuracy and computational efficiency in semantic segmentation.

【翻譯】結果。EfficientVMamba-T模型產生38.9%(SS)和39.3%(MS)的mIoU,以少得多的參數接近ResNet-50的42.1% mIoU。EfficientVMamba-S實現了41.5%(SS)和42.1%(MS)的mIoU,盡管計算占用更低,但仍優于DeiT-S+MLN。EfficientVMamba-B達到46.5%(SS)和47.3%(MS),超越了更重的VMamba-S。這些發現證明了EfficientVMamba系列在語義分割中準確性和計算效率的平衡。

Table 4: Results of semantic segmentation on ADE20K using UperNet [53]. We measure the mIoU with single-scale (SS) and multi-scale (MS) testing on the val set. The FLOPs are measured with an input size of $512\times2048$. MLN: multi-level neck.

【翻譯】表4:使用UperNet [53]在ADE20K上的語義分割結果。我們在驗證集上使用單尺度(SS)和多尺度(MS)測試來測量mIoU。FLOPs是在輸入大小為512×2048512\times2048512×2048時測量的。MLN:多級頸部。

Table 5: Ablation Analysis: Evaluating the Efficacy of Enhanced Spatially Selective Dilatation (ES2D), Assessing the Synergistic Effect of Convolutional Branch Fusion Enhanced with Squeeze-and-Excitation (SE) Techniques, and Investigating the Role of Inverted Residual Blocks in Model Performance. For comparison with the baseline VMamba, we adjust the dimensions and number of layers of it to match the FLOPs.

【翻譯】表5:消融分析:評估增強空間選擇性擴張(ES2D)的功效,評估通過擠壓激勵(SE)技術增強的卷積分支融合的協同效應,并研究倒殘差塊在模型性能中的作用。為了與基線VMamba進行比較,我們調整其維度和層數以匹配FLOPs。

5.4 Ablation Study

Effect of atrous selective scan. We conduct experiments to validate the efficacy of the atrous selective scan in Table 5. The upgrade from SS2D to ES2D significantly reduces the computational complexity of the tiny variant to 0.8 GFLOPs while retaining a competitive accuracy of 73.6%, a 1.5% improvement. Similarly, in the case of the base variant, the model utilizing ES2D not only reduces the GFLOPs from VMamba-B's 4.2 to 4.0 but also exhibits an increase in accuracy from 80.2% to 80.9%. The results suggest that the incorporation of ES2D in our EfficientVMamba models is one of the key factors in reducing computational complexity through skip sampling while preserving the global receptive field to keep competitive performance. The reduction of GFLOPs also reveals the potency of ES2D in maintaining, and even enhancing, model accuracy while significantly reducing computational overhead, demonstrating its viability for resource-constrained scenarios.

【翻譯】空洞選擇性掃描的效果。我們在表5中通過實驗驗證空洞選擇性掃描的有效性。從SS2D升級到ES2D顯著將tiny變體的計算復雜度降低到0.8 GFLOPs,同時保持了73.6%的競爭性準確率,提升了1.5%。同樣,在base變體的情況下,利用ES2D的模型不僅將GFLOPs從VMamba-B的4.2降低到4.0,還表現出準確率從80.2%增加到80.9%。結果表明,在我們的EfficientVMamba模型中納入ES2D是通過跳躍采樣實現計算復雜度降低同時保持全局感受野以保持競爭性能的關鍵因素之一。GFLOPs的減少也揭示了ES2D在保持甚至增強模型準確性的同時顯著減少計算開銷的效力,證明了其在資源受限場景中的可行性。

Table 6: Comparisons of injecting convolution blocks at different stages on ImageNet dataset. We use EfficientVMamba-T in the experiments.

【翻譯】表6:在ImageNet數據集上不同階段注入卷積塊的比較。我們在實驗中使用EfficientVMamba-T。

Effect of the SSM-Conv fusion block. The integration of a convolutional branch followed by an SE block enhances the performance of our model. For the tiny variant, adding the local fusion feature extraction improves accuracy from 73.6% to 75.1%. In the case of EfficientVMamba-B, introducing the fusion mechanism increases accuracy from 80.9% to 81.2%. The observed performance gains reveal that the additional convolutional branch enhances local feature extraction. By integrating the fusion, the models likely benefit from a more diversified feature set that captures a wider range of spatial details, improving the model's ability to generalize and thus boosting accuracy. This suggests that the strategic addition of such branches can effectively enhance the model's performance by providing a comprehensive and more nuanced receptive field over the input feature map.

【翻譯】SSM-Conv融合塊的效果。集成一個后跟SE塊的卷積分支增強了我們模型的性能。對于tiny變體,添加局部融合特征提取將準確率從73.6%提高到75.1%。在EfficientVMamba-B的情況下,引入融合機制將準確率從80.9%增加到81.2%。觀察到的性能提升表明額外的卷積分支增強了局部特征提取。通過集成融合,模型可能受益于更多樣化的特征集,捕獲更廣泛的空間細節,提高模型的泛化能力,從而提升準確率。這表明這種分支的戰略性添加可以通過提供輸入特征圖的全面而更細致的感受野有效增強模型的性能。

Comparisons of injecting convolution blocks at different stages. In this paper, we obtain an interesting observation that our SSM-based block, EVSS, is more beneficial in the early stages of the network. In contrast, previous works on light-weight ViTs usually inject convolution blocks in the early stages and adopt Transformer blocks in the deep stages. As shown in Table 6, we compare the performance of injecting convolution blocks at different stages of EfficientVMamba-T, and the results indicate that adopting Inverted Residual blocks in the deep stages performs better than adopting them in the early stages. An explanation for the opposite phenomena between our light-weight VSSMs and ViTs is that the self-attention in Transformers has higher computational complexity and thus its computation at high resolutions is inefficient, while SSMs, tailored for efficient modeling of long sequences, are more efficient and beneficial for capturing information globally at high resolutions.

【翻譯】在不同階段注入卷積塊的比較。在本文中,我們得到了一個有趣的觀察,即我們基于SSM的塊EVSS在網絡的早期階段更有益。相比之下,以往關于輕量級ViTs的工作通常在早期階段注入卷積塊,在深層階段采用Transformer塊。如表6所示,我們比較了在EfficientVMamba-T的不同階段注入卷積塊的性能,結果表明,在深層階段采用倒殘差塊比在早期階段表現更好。我們的輕量級VSSMs和ViTs之間相反現象的解釋是,Transformers中的自注意力具有更高的計算復雜度,因此其在高分辨率下的計算效率低下;而SSMs專為長序列的高效建模而定制,在高分辨率下全局捕獲信息更高效且更有益。

6 Conclusion

This paper proposed EfficientVMamba, a lightweight state-space network architecture that adeptly combines the strengths of global and local information extraction, addressing the trade-off between model accuracy and computational efficiency. By incorporating an atrous-based selective scan with efficient skip sampling, EfficientVMamba ensures comprehensive global receptive field coverage while minimizing computational load. The integration of this scanning approach with a convolutional branch, followed by optimization through a Squeeze-and-Excitation module, allows for a robust re-balancing of global and local features. Additionally, the innovative use of inverted residual insertion further refines the model's multi-layer stages, enhancing its depth and effectiveness. Experimental results affirm that EfficientVMamba not only scales down computational complexity to $\mathcal{O}(N)$ but also delivers competitive performance across various vision tasks. The achievements of EfficientVMamba highlight its potential as a formidable framework in the evolution of lightweight, efficient, and general-purpose visual models.

【翻譯】本文提出了EfficientVMamba,一種輕量級狀態空間網絡架構,巧妙地結合了全局和局部信息提取的優勢,解決了模型準確性和計算效率之間的權衡。通過采用基于空洞的選擇性掃描和高效跳躍采樣,EfficientVMamba確保了全面的全局感受野覆蓋,同時最小化計算負載。將這種掃描方法與卷積分支集成,然后通過擠壓激勵模塊進行優化,允許對全局和局部特征進行穩健的重新平衡。此外,倒殘差插入的創新使用進一步細化了模型的多層階段,增強了其深度和有效性。實驗結果證實,EfficientVMamba不僅將計算復雜度縮減到$\mathcal{O}(N)$,還在各種視覺任務中提供了競爭性能。EfficientVMamba的成就突出了其作為輕量級、高效和通用視覺模型演進中強大框架的潛力。

Table 7: Object detection and instance segmentation results on COCO val set.

【翻譯】表7:COCO驗證集上的目標檢測和實例分割結果。

Appendix

Comparison with other backbones on Mask R-CNN

We also investigate the performance of our EfficientVMamba as a lightweight backbone under Mask R-CNN training schedules, as shown in Table 7. For the Mask R-CNN 1× schedule, our EfficientVMamba-T model, with 11M parameters and 60G FLOPs, achieves an Average Precision (AP) of 35.6%. This is 1.6% higher than ResNet-18, which has 31M parameters and 207G FLOPs. EfficientVMamba-S, with a greater number of parameters at 31M and 197G FLOPs, reaches an AP of 39.3%, which is 0.5% above the ResNet-50 model with 44M parameters and 260G FLOPs. Our largest model, EfficientVMamba-B, shows a superior AP of 43.7% with 53M parameters and a reduced computational requirement of 252G FLOPs, outperforming VMamba-T by 2.8%. Under the Mask R-CNN 3× MS schedule, EfficientVMamba-T maintains an AP of 38.3%, surpassing ResNet-18 by 1.4%. The small variant records an AP of 41.5%, a 0.5% improvement over PVT-T with a similar parameter count. Finally, EfficientVMamba-B achieves an AP of 45.0%, a notable advancement of 2.2% over VMamba-T.

【翻譯】我們還研究了EfficientVMamba作為Mask R-CNN調度中輕量級骨干網絡的性能動態,如表7所示。對于Mask R-CNN 1×調度,我們的EfficientVMamba-T模型具有11M參數和60G FLOPs,實現了35.6%的平均精度(AP)。這比具有31M參數和207G FLOPs的ResNet-18高1.6%。EfficientVMamba-S具有更多參數31M和197G FLOPs,達到39.3%的AP,比具有44M參數和260G FLOPs的ResNet-50模型高0.5%。我們最大的模型EfficientVMamba-B顯示出卓越的43.7% AP,具有53M參數和更低的252G FLOPs計算需求,超越VMamba-T 2.8%。在Mask R-CNN 3×MS調度方面,EfficientVMamba-T保持38.3%的AP,超越ResNet-18的性能1.4%。小變體記錄了41.5%的AP,比具有相似參數數量的PVT-T提高0.5%。最后,EfficientVMamba-B實現了45.0%的AP,表明比VMamba-T顯著提升2.2%。

Comparison with the MobileNetV2 backbone

We compare variant architectures and reveal a significant performance difference based on integrating our innovative block, EVSS, versus Inverted Residual (InRes) blocks at specific stages. The results in Table 8 show that using InRes consistently across all stages, in both the tiny and base variants, achieves good performance, with the base variant notably reaching an accuracy of 81.4%. When EVSS is applied across all stages (mirroring the uniform-block strategy of MobileNetV2 [34]), we observe a slight decrease in accuracy for both variants, suggesting a nuanced balance between architectural consistency and computational efficiency. Our fusion approach, which combines EVSS in the initial stages with InRes in the later stages, enhances accuracy to 76.5% and 81.8% for the tiny and base variants, respectively. This strategy benefits from the early-stage efficiency of EVSS and the later-stage convolutional capabilities of InRes, optimizing network performance by leveraging the strengths of both block types under limited computational resources.

【翻譯】我們比較了不同的架構變體,揭示了基于我們創新的EVSS塊與倒殘差(InRes)塊在特定階段集成的顯著性能差異。表8中的結果顯示,在tiny和base變體的所有階段一致使用InRes都能取得良好性能,其中base變體顯著達到了81.4%的準確率。當EVSS應用于所有階段時(MobileNetV2的策略[34]),我們觀察到兩個變體的準確率都略有下降,這表明架構一致性和計算效率之間存在微妙的平衡。我們的融合方法在初始階段結合EVSS,在后期階段結合InRes,將tiny和base變體的準確率分別提升到76.5%和81.8%。這種策略受益于EVSS的早期階段效率和InRes的高級階段卷積能力,從而通過在有限的計算資源下利用兩種塊類型的優勢來優化網絡性能。
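For concreteness, the three arrangements compared in the table below can be written as per-stage block-type lists. The dictionary keys and the four-stage granularity are illustrative assumptions for exposition, not the repository's configuration format:

```python
# Per-stage block layouts compared in the ablation (stage 1 -> stage 4).
layouts = {
    "all_inres": ["InRes", "InRes", "InRes", "InRes"],  # uniform conv, MobileNetV2-style
    "all_evss":  ["EVSS",  "EVSS",  "EVSS",  "EVSS"],   # uniform SSM blocks
    "hybrid":    ["EVSS",  "EVSS",  "InRes", "InRes"],  # paper's choice: SSM early, conv late
}
for name, stages in layouts.items():
    print(f"{name:10s}: {' -> '.join(stages)}")
```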

Table 8: Comparisons with MobileNetV2 (all stages composed of EVSS) on the ImageNet dataset. We assess both tiny and base models.

【翻譯】表8:MobileNetV2(所有階段均由EVSS組成)在ImageNet數據集上的比較。我們在ImageNet上評估tiny和base模型。

Limitations

Visual state space models operating with linear-time complexity $\mathcal{O}(N)$ in sequence length demonstrate marked improvements, particularly on high-resolution downstream tasks, in contrast to prior CNN-based and Transformer-based models. However, the computational design of SSMs is inherently more intricate than both convolutional and self-attention mechanisms, which complicates efficient parallel processing. There remains promising potential for future work on optimizing the computational efficiency and scalability of visual state space models (SSMs).

【翻譯】相對于序列長度具有線性時間復雜度O(N)\mathcal O(N)O(N)的視覺狀態空間模型展現出顯著的增強效果,特別是在高分辨率下游任務中,這與之前基于CNN和Transformer的模型形成對比。然而,SSMs的計算設計本質上比卷積和自注意力機制表現出更高的計算復雜性,這使得高效并行處理的性能變得復雜。在優化視覺狀態空間模型(SSMs)的計算效率和可擴展性方面,仍有很大的研究潛力。

【解析】SSMs通過狀態傳播機制實現了線性復雜度,既能處理長序列又能保持全局建模能力,這在高分辨率視覺任務中特別有價值,因為圖像被展開成序列后往往非常長。然而,這段話也指出了SSMs當前面臨的主要問題:雖然理論復雜度更優,但實際計算過程比傳統方法更加復雜。其實這種復雜性主要體現在狀態更新的遞歸性質上,使得并行化變得困難。
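The parallelization difficulty mentioned above comes from the sequential dependence in the state update. Below is a deliberately naive sketch of the recurrence h_t = A·h_{t-1} + B·x_t, y_t = C·h_t with fixed scalar parameters; real selective SSMs make A, B, C input-dependent and rely on associative-scan algorithms plus fused kernels to parallelize. The sketch only illustrates why a straightforward implementation cannot batch over the time dimension the way a convolution or an attention matmul can:

```python
import torch

def selective_scan_sequential(x, A, B, C):
    """Naive SSM recurrence h_t = A*h_{t-1} + B*x_t, y_t = C*h_t.
    The dependence of h_t on h_{t-1} forces a loop over t; parallel
    implementations replace it with an associative (parallel) scan."""
    B_, L, D = x.shape          # batch, sequence length, state size
    h = x.new_zeros(B_, D)
    ys = []
    for t in range(L):          # strictly sequential over time steps
        h = A * h + B * x[:, t]         # state update (toy elementwise version)
        ys.append((C * h).sum(-1))      # readout y_t
    return torch.stack(ys, dim=1)

y = selective_scan_sequential(torch.randn(2, 8, 4), A=0.9, B=1.0, C=1.0)
print(y.shape)  # torch.Size([2, 8])
```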
