YOLOv11: AN OVERVIEW OF THE KEY ARCHITECTURAL ENHANCEMENTS (Object Detection Paper Close Reading, Paragraph by Paragraph)
Paper: https://www.arxiv.org/abs/2410.17725
Rahima Khanam and Muhammad Hussain
Released by Ultralytics
arXiv preprint, October 2024
The paper itself is fairly brief; its core improvements are:
- The C3k2 efficient feature extraction mechanism. C3k2 is a refinement of the C2f module. Its main change is to optionally (configured as True or False; with False it degenerates to C2f, with True it uses C3k) substitute C3k blocks for the standard Bottleneck as the internal feature-processing unit, retaining C2f's efficiency while providing stronger feature extraction. C3k in turn refines C3: where C3 stacks Bottlenecks with fixed kernels (1×1 and 3×3), C3k allows the kernel size k to be adjusted. C3's classic 1×1 + 3×3 combination is chosen for computational efficiency and a relatively small memory footprint, whereas C3k lets k vary to obtain different receptive fields: different k values capture features at different scales, so the kernel size can be tuned to the needs of the task.
- A new C2PSA module after the SPPF module. C2PSA combines channel splitting with position-sensitive attention. Its processing flow: a 1×1 convolution first expands the input channels to twice the original number; the expanded features are split along the channel dimension into two equal parts, one kept directly as a skip connection while the other is processed by several PSABlocks applying position-sensitive attention (each PSABlock contains multi-head self-attention and a feed-forward network, with residual connections); finally the processed features are concatenated with the retained part, and a 1×1 convolution compresses the result back to the original channel count. C2PSA strengthens the features' global context awareness while maintaining computational efficiency, attending more effectively to the important regions of an image. A code sketch of this flow follows below.
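The following minimal PyTorch sketch illustrates the split-attend-merge flow just described. It is modeled on the description above rather than on the official Ultralytics source, so the module names, channel ratio, head count, and FFN width here are illustrative assumptions:

```python
import torch
import torch.nn as nn

class PSABlock(nn.Module):
    """Position-sensitive attention unit: multi-head self-attention plus a
    feed-forward network, each wrapped in a residual connection."""
    def __init__(self, c, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(c, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Conv2d(c, 2 * c, 1), nn.SiLU(), nn.Conv2d(2 * c, c, 1))

    def forward(self, x):
        b, c, h, w = x.shape
        t = x.flatten(2).transpose(1, 2)        # (B, H*W, C) token view
        t = t + self.attn(t, t, t)[0]           # residual self-attention
        x = t.transpose(1, 2).view(b, c, h, w)  # back to the spatial layout
        return x + self.ffn(x)                  # residual feed-forward

class C2PSA(nn.Module):
    """Channel split + position-sensitive attention, per the flow above."""
    def __init__(self, c, n=1):
        super().__init__()
        self.cv1 = nn.Conv2d(c, 2 * c, 1)       # 1x1 expansion to 2x channels
        self.m = nn.Sequential(*(PSABlock(c) for _ in range(n)))
        self.cv2 = nn.Conv2d(2 * c, c, 1)       # 1x1 compression back to c

    def forward(self, x):
        a, b = self.cv1(x).chunk(2, dim=1)      # skip half / attention half
        return self.cv2(torch.cat((a, self.m(b)), dim=1))

x = torch.randn(1, 64, 20, 20)
print(C2PSA(64, n=2)(x).shape)                  # torch.Size([1, 64, 20, 20])
```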
ABSTRACT
This study presents an architectural analysis of YOLOv11, the latest iteration in the YOLO (You Only Look Once) series of object detection models. We examine the model's architectural innovations, including the introduction of the C3k2 (Cross Stage Partial with kernel size 2) block, SPPF (Spatial Pyramid Pooling - Fast), and C2PSA (Convolutional block with Parallel Spatial Attention) components, which contribute to improving the model's performance in several ways, such as enhanced feature extraction. The paper explores YOLOv11's expanded capabilities across various computer vision tasks, including object detection, instance segmentation, pose estimation, and oriented object detection (OBB). We review the model's performance improvements in terms of mean Average Precision (mAP) and computational efficiency compared to its predecessors, with a focus on the trade-off between parameter count and accuracy. Additionally, the study discusses YOLOv11's versatility across different model sizes, from nano to extra-large, catering to diverse application needs from edge devices to high-performance computing environments. Our research provides insights into YOLOv11's position within the broader landscape of object detection and its potential impact on real-time computer vision applications.
Keywords: Automation; Computer Vision; YOLO; YOLOv11; Object Detection; Real-Time Image Processing; YOLO version comparison
1 Introduction
Computer vision, a rapidly advancing field, enables machines to interpret and understand visual data [1]. A crucial aspect of this domain is object detection [2], which involves the precise identification and localization of objects within images or video streams [3]. Recent years have witnessed remarkable progress in algorithmic approaches to address this challenge [4].
A pivotal breakthrough in object detection came with the introduction of the You Only Look Once (YOLO) algorithm by Redmon et al. in 2015 [5]. This innovative approach, as its name suggests, processes the entire image in a single pass to detect objects and their locations. YOLO’s methodology diverges from traditional two-stage detection processes by framing object detection as a regression problem [5]. It employs a single convolutional neural network to simultaneously predict bounding boxes and class probabilities across the entire image [6], streamlining the detection pipeline compared to more complex traditional methods.
【Analysis】Traditional detection methods usually proceed in two steps: first propose regions likely to contain objects, then classify those regions. This is accurate but computationally heavy and slow. YOLO takes a completely different strategy: it treats the entire detection process as a regression problem, predicting object coordinates and classes directly from image pixels. By dividing the image into a grid, with each cell responsible for predicting any objects it contains, the whole network needs to run only once to produce all detections. This single-pass processing greatly improves detection speed and laid the foundation for real-time object detection applications.
YOLOv11 is the latest iteration in the YOLO series, building upon the foundation established by YOLOv1. Unveiled at the YOLO Vision 2024 (YV24) conference, YOLOv11 represents a significant leap forward in real-time object detection technology. This new version introduces substantial enhancements in both architecture and training methodologies, pushing the boundaries of accuracy, speed, and efficiency.
YOLOv11’s innovative design incorporates advanced feature extraction techniques, allowing for more nuanced detail capture while maintaining a lean parameter count. This results in improved accuracy across a diverse range of computer vision (CV) tasks, from object detection to classification. Furthermore, YOLOv11 achieves remarkable gains in processing speed, substantially enhancing real-time performance capabilities.
【Analysis】Since the move to anchor-free designs, the YOLO series' improvements reflect an important trend: with the overall YOLO paradigm now largely fixed, progress must come from smarter architectural design rather than simple parameter or module stacking (that is, refining what each module contributes). Traditionally, improving model performance meant adding more layers and parameters, at a steep computational cost. YOLOv11 instead introduces more efficient feature extraction modules that capture richer image information with fewer parameters.
In the following sections, this paper will provide a comprehensive analysis of YOLOv11’s architecture, exploring its key components and innovations. We will examine the evolution of YOLO models, leading up to the development of YOLOv11. The study will delve into the model’s expanded capabilities across various CV tasks, including object detection, instance segmentation, pose estimation, and oriented object detection. We will also review YOLOv11’s performance improvements in terms of accuracy and computational efficiency compared to its predecessors, with a particular focus on its versatility across different model sizes. Finally, we will discuss the potential impact of YOLOv11 on real-time CV applications and its position within the broader landscape of object detection technologies.
2 Evolution of YOLO models
Table 1 illustrates the progression of YOLO models from their inception to the most recent versions. Each iteration has brought significant improvements in object detection capabilities, computational efficiency, and versatility in handling various CV tasks.
【Analysis】From the original YOLOv1 onward, each new version has been more than a simple performance bump: it fixed concrete problems in its predecessor while introducing new technical ideas. Early versions focused mainly on balancing detection accuracy against speed; later versions turned to more complex scenarios such as multi-scale and small-object detection. The Ultralytics releases freed YOLO from being a pure object detector, turning it into a unified framework that also handles segmentation, pose estimation, OBB, and other vision tasks.
Table 1: YOLO: Evolution of models
This evolution showcases the rapid advancement in object detection technologies, with each version introducing novel features and expanding the range of supported tasks. From the original YOLO’s groundbreaking single-stage detection to YOLOv10’s NMS-free training, the series has consistently pushed the boundaries of real-time object detection.
【Analysis】YOLOv10's NMS-free training lets the network output final results directly, without extra post-processing, which further improves inference efficiency.
The latest iteration, YOLO11, builds upon this legacy with further enhancements in feature extraction, efficiency, and multi-task capabilities. Our subsequent analysis will delve into YOLO11’s architectural innovations, including its improved backbone and neck structures, and its performance across various computer vision tasks such as object detection, instance segmentation, and pose estimation.
3 What is YOLOv11?
The evolution of the YOLO algorithm reaches new heights with the introduction of YOLOv11 [16], representing a significant advancement in real-time object detection technology. This latest iteration builds upon the strengths of its predecessors while introducing novel capabilities that expand its utility across diverse CV applications.
YOLOv11 distinguishes itself through its enhanced adaptability, supporting an expanded range of CV tasks beyond traditional object detection. Notable among these are pose estimation and instance segmentation, broadening the model’s applicability in various domains. YOLOv11’s design focuses on balancing power and practicality, aiming to address specific challenges across various industries with increased accuracy and efficiency.
This latest model demonstrates the ongoing evolution of real-time object detection technology, pushing the boundaries of what’s possible in CV applications. Its versatility and performance improvements position YOLOv11 as a significant advancement in the field, potentially opening new avenues for real-world implementation across diverse sectors.
4 Architectural footprint of YOLOv11
The YOLO framework revolutionized object detection by introducing a unified neural network architecture that simultaneously handles both bounding box regression and object classification tasks [17]. This integrated approach marked a significant departure from traditional two-stage detection methods, offering end-to-end training capabilities through its fully differentiable design.
【Analysis】This again highlights the traditional two-stage approach: first generate candidate regions, then classify them and refine their locations. YOLO fuses the two tasks into a single neural network that predicts object locations and classes directly from the input image. "Fully differentiable design" means every component of the network can be jointly optimized with gradient descent, making training simpler and more efficient and avoiding the staged training and manual tuning that traditional methods require.
At its core, the YOLO architecture consists of three fundamental components. First, the backbone serves as the primary feature extractor, utilizing convolutional neural networks to transform raw image data into multi-scale feature maps. Second, the neck component acts as an intermediate processing stage, employing specialized layers to aggregate and enhance feature representations across different scales. Third, the head component functions as the prediction mechanism, generating the final outputs for object localization and classification based on the refined feature maps.
【Analysis】The backbone extracts basic visual features from raw pixels, progressively abstracting the image through stacked convolutions, from low-level edges and textures up to high-level semantics. The neck addresses the key problem of detecting objects at different scales: the same object class can appear at many sizes, so features from different levels must be fused, via upsampling, downsampling, and feature concatenation. The head handles the final prediction, mapping the processed features to concrete detections: spatial position, size, and class probabilities. This modular design lets each component be optimized or replaced independently.
Building on this established architecture, YOLO11 extends and enhances the foundation laid by YOLOv8, introducing architectural innovations and parameter optimizations to achieve superior detection performance as illustrated in Figure 1. The following sections detail the key architectural modifications implemented in YOLO11:
Figure 1: Key architectural modules in YOLO11.
4.1 Backbone
The backbone is a crucial component of the YOLO architecture, responsible for extracting features from the input image at multiple scales. This process involves stacking convolutional layers and specialized blocks to generate feature maps at various resolutions.
4.1.1 Convolutional Layers
YOLOv11 maintains a structure similar to its predecessors, utilizing initial convolutional layers to downsample the image. These layers form the foundation of the feature extraction process, gradually reducing spatial dimensions while increasing the number of channels. A significant improvement in YOLO11 is the introduction of the C3k2 block, which replaces the C2f block used in previous versions [18]. The C3k2 block is a more computationally efficient implementation of the Cross Stage Partial (CSP) Bottleneck. It employs two smaller convolutions instead of one large convolution, as seen in YOLOv8 [13]. The “k2” in C3k2 indicates a smaller kernel size, which contributes to faster processing while maintaining performance.
4.1.2 SPPF and C2PSA
YOLO11 retains the Spatial Pyramid Pooling - Fast (SPPF) block from previous versions but introduces a new Cross Stage Partial with Spatial Attention (C2PSA) block after it [18]. The C2PSA block is a notable addition that enhances spatial attention in the feature maps. This spatial attention mechanism allows the model to focus more effectively on important regions within the image. By pooling features spatially, the C2PSA block enables YOLO11 to concentrate on specific areas of interest, potentially improving detection accuracy for objects of varying sizes and positions.
【Analysis】SPPF is essentially a multi-scale feature fusion technique: pooling at different effective sizes captures context at different scales, which matters for detecting objects of varying size. The new C2PSA module builds on this by adding spatial attention. The core idea of spatial attention is to let the network learn which regions of the image matter more and which matter less. A plain convolution weights every location equally, but in real images different regions carry very different amounts of information. C2PSA computes an importance weight for each spatial position so the network can concentrate computation and attention on regions that actually contain objects or salient features. This helps in difficult scenes, for example when objects are partially occluded, sit in cluttered backgrounds, or vary widely in size. Spatial pooling strengthens this further: rather than looking at single pixels, it considers the feature distribution of local regions, giving a better sense of object structure and spatial context.
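For reference, below is a minimal sketch of an SPPF block in its commonly used form: three chained 5×5 max-pools, whose stacked receptive fields (effectively 5, 9, and 13) emulate the parallel multi-size pooling of the original SPP at lower cost. Normalization and activations are omitted for brevity, so this is a simplification rather than a production implementation:

```python
import torch
import torch.nn as nn

class SPPF(nn.Module):
    """Spatial Pyramid Pooling - Fast: chained max-pools gather context at
    several effective scales, then a 1x1 conv fuses the concatenated maps."""
    def __init__(self, c1, c2, k=5):
        super().__init__()
        c_ = c1 // 2
        self.cv1 = nn.Conv2d(c1, c_, 1)      # reduce channels first
        self.m = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
        self.cv2 = nn.Conv2d(c_ * 4, c2, 1)  # fuse the four branches

    def forward(self, x):
        y = [self.cv1(x)]
        for _ in range(3):
            y.append(self.m(y[-1]))          # each pass widens the receptive field
        return self.cv2(torch.cat(y, dim=1))
```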
4.2 Neck
The neck combines features at different scales and transmits them to the head for prediction. This process typically involves upsampling and concatenation of feature maps from different levels, enabling the model to capture multi-scale information effectively.
【Analysis】The neck plays the key linking role in the overall architecture. While extracting features, the backbone produces feature maps at several resolutions: shallow maps keep more spatial detail but carry less semantics, while deep maps are semantically rich but spatially coarse. The neck's core task is to fuse these complementary features. Upsampling restores low-resolution maps to higher resolution (by interpolation, transposed convolution, and similar operations) so that deep semantic information reaches finer spatial positions; concatenation joins feature maps from different sources along the channel dimension so the network can access several kinds of features at once. This multi-scale fusion strategy is especially important for detection because real-world objects come in many sizes: small objects need high-resolution detail for accurate localization, while large objects rely more on high-level semantics for correct classification. Through the neck, the network obtains a single representation that keeps both spatial precision and semantic richness, giving the detection head the best possible input.
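A minimal illustration of the upsample-and-concatenate fusion described here; the tensor shapes are purely illustrative:

```python
import torch
import torch.nn.functional as F

# A deep, semantically rich map (p5) is upsampled to the resolution of a
# shallower, spatially detailed map (p4) and joined along the channel axis.
p4 = torch.randn(1, 512, 40, 40)    # shallower level: finer spatial detail
p5 = torch.randn(1, 1024, 20, 20)   # deeper level: stronger semantics

p5_up = F.interpolate(p5, scale_factor=2, mode="nearest")  # 20x20 -> 40x40
fused = torch.cat((p4, p5_up), dim=1)                      # (1, 1536, 40, 40)
print(fused.shape)
```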
4.2.1 C3k2 Block
YOLO11 introduces a significant change by replacing the C2f block in the neck with the C3k2 block. The C3k2 block is designed to be faster and more efficient, enhancing the overall performance of the feature aggregation process. After upsampling and concatenation, the neck in YOLO11 incorporates this improved block, resulting in enhanced speed and performance [18].
【Analysis】C2f performed well in earlier versions but left room to reduce the computational cost of feature aggregation, i.e., the process of combining features of different resolutions and semantic levels, which must balance efficiency against information retention. With a better-optimized internal structure and computation path, C3k2 finishes the same quality of feature processing in less time. Since upsampling and concatenation are themselves compute-intensive, a more efficient block after them noticeably speeds up the whole neck and, in turn, the network's real-time performance.
4.2.2 Attention Mechanism
A notable addition to YOLO11 is its increased focus on spatial attention through the C2PSA module. This attention mechanism enables the model to concentrate on key regions within the image, potentially leading to more accurate detection, especially for smaller or partially occluded objects. The inclusion of C2PSA sets YOLO11 apart from its predecessor, YOLOv8, which lacks this specific attention mechanism [18].
【Analysis】A conventional convolutional network treats every pixel location equally, which is not how human vision works: we naturally concentrate on the important parts of a scene. C2PSA learns a spatial weight map that tells the network which regions deserve attention. For small objects, which occupy few pixels and are easily drowned out by large background regions, spatial attention helps the network find and emphasize the regions they occupy. For occluded objects, attention focuses on the visible parts, so even partial evidence guides the network toward inferring the presence of the full object.
4.3 Head
The head of YOLOv11 is responsible for generating the final predictions in terms of object detection and classification. It processes the feature maps passed from the neck, ultimately outputting bounding boxes and class labels for objects within the image.
【Analysis】The head is the last stage of the detection network and largely determines the final detection quality. After the backbone and neck, the original image has been turned into a feature representation rich in both semantic and spatial information; the head converts this abstract representation into concrete, usable detections. The conversion involves two prediction tasks: regression, which predicts the precise location of each object as a bounding box (center coordinates plus width and height), and classification, which decides the category of each detection (person, car, animal, and so on). The head must keep accuracy high while staying fast, since it directly affects the real-time performance of the whole system.
4.3.1 C3k2 Block
In the head section, YOLOv11 utilizes multiple C3k2 blocks to efficiently process and refine the feature maps. The C3k2 blocks are placed in several pathways within the head, functioning to process multi-scale features at different depths. The C3k2 block exhibits flexibility depending on the value of the c3k parameter:
- When c3k=False, the C3k2 module behaves similarly to the C2f block, utilizing a standard bottleneck structure.
- When c3k=True, the bottleneck structure is replaced by the C3k module, which allows for deeper and more complex feature extraction.
Key characteristics of the C3k2 block:
- Faster processing: The use of two smaller convolutions reduces the computational overhead compared to a single large convolution, leading to quicker feature extraction.
- Parameter efficiency: C3k2 is a more compact version of the CSP bottleneck, making the architecture more efficient in terms of the number of trainable parameters.
Another notable addition is the C3k block, which offers enhanced flexibility by allowing customizable kernel sizes. The adaptability of C3k is particularly useful for extracting more detailed features from images, contributing to improved detection accuracy.
【Analysis】Customizable kernel sizes make the architecture more flexible and adaptive. The 3×3 kernel is the conventional default, but different features and targets are best captured with different receptive fields: small kernels excel at fine details such as edges and textures, while large kernels better capture global structure. Because C3k lets the kernel size be tuned per task, the network can match its receptive field to the features it needs to extract: smaller kernels preserve spatial precision for small objects, larger kernels widen the receptive field for large objects or global context. This adaptivity helps a single architecture cope better with multi-scale detection.
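To make the c3k switch and the adjustable kernel concrete, here is a self-contained PyTorch sketch of the C3k2/C3k structure as described in this section: C2f-style dense aggregation, with C3k blocks substituted for plain bottlenecks when c3k=True. Channel ratios, defaults, and names are assumptions and may differ from the official Ultralytics code:

```python
import torch
import torch.nn as nn

class Conv(nn.Module):
    """Conv-BatchNorm-SiLU (CBS) unit used by all blocks below."""
    def __init__(self, c1, c2, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Bottleneck(nn.Module):
    """Residual bottleneck; kernels default to 3x3 but are adjustable,
    which is the property C3k exploits."""
    def __init__(self, c, k=(3, 3)):
        super().__init__()
        self.cv1 = Conv(c, c, k[0])
        self.cv2 = Conv(c, c, k[1])

    def forward(self, x):
        return x + self.cv2(self.cv1(x))

class C3k(nn.Module):
    """C3 variant whose stacked bottlenecks use a tunable kernel size k."""
    def __init__(self, c, n=2, k=3):
        super().__init__()
        c_ = c // 2
        self.cv1, self.cv2 = Conv(c, c_), Conv(c, c_)
        self.m = nn.Sequential(*(Bottleneck(c_, (k, k)) for _ in range(n)))
        self.cv3 = Conv(2 * c_, c)

    def forward(self, x):
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))

class C3k2(nn.Module):
    """C2f-style block: inner units are C3k when c3k=True, else bottlenecks."""
    def __init__(self, c1, c2, n=1, c3k=False):
        super().__init__()
        self.c = c2 // 2
        self.cv1 = Conv(c1, 2 * self.c)
        self.m = nn.ModuleList(C3k(self.c) if c3k else Bottleneck(self.c) for _ in range(n))
        self.cv2 = Conv((2 + n) * self.c, c2)

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))
        y.extend(m(y[-1]) for m in self.m)       # dense C2f-style aggregation
        return self.cv2(torch.cat(y, dim=1))

x = torch.randn(1, 64, 40, 40)
print(C3k2(64, 64, n=2, c3k=False)(x).shape)     # behaves like C2f
print(C3k2(64, 64, n=2, c3k=True)(x).shape)      # C3k inner units
```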
4.3.2 CBS Blocks
The head of YOLOv11 includes several CBS (Convolution-BatchNorm-SiLU) [19] layers after the C3k2 blocks. These layers further refine the feature maps by:
- Extracting relevant features for accurate object detection.
- Stabilizing and normalizing the data flow through batch normalization.
- Utilizing the Sigmoid Linear Unit (SiLU) activation function for non-linearity, which improves model performance.

CBS blocks serve as foundational components in both feature extraction and the detection process, ensuring that the refined feature maps are passed to the subsequent layers for bounding box and classification predictions.
【Analysis】CBS is the standard Conv-BatchNorm-SiLU composite building block.
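Spelled out on its own, the CBS unit is just the three operations named above in sequence (the same pattern as the Conv helper in the earlier sketch); the kernel size and stride defaults here are illustrative:

```python
import torch
import torch.nn as nn

class CBS(nn.Module):
    """Convolution -> BatchNorm -> SiLU, the basic unit stacked in the head."""
    def __init__(self, c1, c2, k=3, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, padding=k // 2, bias=False)  # feature extraction
        self.bn = nn.BatchNorm2d(c2)  # stabilizes and normalizes the data flow
        self.act = nn.SiLU()          # smooth non-linearity

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

print(CBS(64, 128)(torch.randn(1, 64, 40, 40)).shape)  # torch.Size([1, 128, 40, 40])
```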
4.3.3 Final Convolutional Layers and Detect Layer
Each detection branch ends with a set of Conv2D layers, which reduce the features to the required number of outputs for bounding box coordinates and class predictions. The final Detect layer consolidates these predictions, which include:
- Bounding box coordinates for localizing objects in the image.
- Objectness scores that indicate the presence of objects.
- Class scores for determining the class of the detected object.
【Analysis】This passage describes the final output structure of the YOLO detection head. It is simply the last Conv2D layers, whose role is to map the high-dimensional feature representation to the concrete numeric outputs.
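A schematic sketch of such a final prediction stage. This is illustrative only: it mirrors the outputs listed above (box coordinates, an objectness score, and class scores per spatial location), not the exact Ultralytics head, whose decoupled, anchor-free design differs in detail:

```python
import torch
import torch.nn as nn

class DetectBranch(nn.Module):
    """Schematic final layers of one detection branch: 3x3 refinement convs,
    then 1x1 convs that reduce features to the per-location outputs."""
    def __init__(self, c, nc=80):
        super().__init__()
        self.box = nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.SiLU(),
                                 nn.Conv2d(c, 4, 1))       # box: x, y, w, h
        self.obj = nn.Conv2d(c, 1, 1)                      # objectness score
        self.cls = nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.SiLU(),
                                 nn.Conv2d(c, nc, 1))      # class scores

    def forward(self, x):
        # (B, 4 + 1 + nc, H, W): every spatial location emits one prediction
        return torch.cat((self.box(x), self.obj(x), self.cls(x)), dim=1)

x = torch.randn(1, 256, 20, 20)
print(DetectBranch(256)(x).shape)  # torch.Size([1, 85, 20, 20]) for nc=80
```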
5 Key Computer Vision Tasks Supported by YOLO11
YOLO11 supports a diverse range of CV tasks, showcasing its versatility and power in various applications. Here’s an overview of the key tasks:
- Object Detection: YOLO11 excels in identifying and localizing objects within images or video frames, providing bounding boxes for each detected item [20]. This capability finds applications in surveillance systems, autonomous vehicles, and retail analytics, where precise object identification is crucial [21].
- Instance Segmentation: Going beyond simple detection, YOLO11 can identify and separate individual objects within an image down to the pixel level [20]. This fine-grained segmentation is particularly valuable in medical imaging for precise organ or tumor delineation, and in manufacturing for detailed defect detection [21].
- Image Classification: YOLOv11 is capable of classifying entire images into predetermined categories, making it ideal for applications like product categorization in e-commerce platforms or wildlife monitoring in ecological studies [21].
- Pose Estimation: The model can detect specific key points within images or video frames to track movements or poses. This capability is beneficial for fitness tracking applications, sports performance analysis, and various healthcare applications requiring motion assessment [21].
- Oriented Object Detection (OBB): YOLO11 introduces the ability to detect objects with an orientation angle, allowing for more precise localization of rotated objects. This feature is especially valuable in aerial imagery analysis, robotics, and warehouse automation tasks where object orientation is crucial [21].
- Object Tracking: It identifies and traces the path of objects in a sequence of images or video frames [21]. This real-time tracking capability is essential for applications such as traffic monitoring, sports analysis, and security systems.
Table 2 outlines the YOLOv11 model variants and their corresponding tasks. Each variant is designed for specific use cases, from object detection to pose estimation. Moreover, all variants support core functionalities like inference, validation, training, and export, making YOLOv11 a versatile tool for various CV applications.
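As a usage note, all variants in Table 2 are exposed through the same Ultralytics Python API. A minimal example, assuming the ultralytics package is installed and using its published weight and dataset names:

```python
from ultralytics import YOLO

# Task variants swap the weight file: yolo11n-seg.pt, yolo11n-pose.pt,
# yolo11n-obb.pt, yolo11n-cls.pt, and so on.
model = YOLO("yolo11n.pt")                 # nano detection variant

results = model("bus.jpg")                 # inference
model.val(data="coco8.yaml")               # validation
model.train(data="coco8.yaml", epochs=3)   # training on a small sample dataset
model.export(format="onnx")                # export
```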
6 Advancements and Key Features of YOLOv11
YOLOv11 represents a significant advancement in object detection technology, building upon the foundations laid by its predecessors, YOLOv9 and YOLOv10, which were introduced earlier in 2024. This latest iteration from Ultralytics showcases enhanced architectural designs, more sophisticated feature extraction techniques, and refined training methodologies. The synergy of YOLOv11’s rapid processing, high accuracy, and computational efficiency positions it as one of the most formidable models in Ultralytics’ portfolio to date [22]. A key strength of YOLOv11 lies in its refined architecture, which facilitates the detection of subtle details even in challenging scenarios. The model’s improved feature extraction capabilities allow it to identify and process a broader range of patterns and intricate elements within images. Compared to earlier versions, YOLOv11 introduces several notable enhancements:
Table 2: YOLOv11 Model Variants and Tasks
- Enhanced precision with reduced complexity: The YOLOv11m variant achieves superior mean Average Precision (mAP) scores on the COCO dataset while utilizing 22% fewer parameters than its YOLOv8m counterpart, demonstrating improved computational efficiency without compromising accuracy [23].
- Versatility in CV tasks: YOLOv11 exhibits proficiency across a diverse array of CV applications, including pose estimation, object recognition, image classification, instance segmentation, and oriented bounding box (OBB) detection [23].
- Optimized speed and performance: Through refined architectural designs and streamlined training pipelines, YOLOv11 achieves faster processing speeds while maintaining a balance between accuracy and computational efficiency [23].
- Streamlined parameter count: The reduction in parameters contributes to faster model performance without significantly impacting the overall accuracy of YOLOv11 [22].
- Advanced feature extraction: YOLOv11 incorporates improvements in both its backbone and neck architectures, resulting in enhanced feature extraction capabilities and, consequently, more precise object detection [23].
- Contextual adaptability: YOLOv11 demonstrates versatility across various deployment scenarios, including cloud platforms, edge devices, and systems optimized for NVIDIA GPUs [23].
The YOLOv11 model demonstrates significant advancements in both inference speed and accuracy compared to its predecessors. In the benchmark analysis, YOLOv11 was compared against several of its predecessors, ranging from YOLOv5 [24] to more recent variants such as YOLOv10. As presented in Figure 2, YOLOv11 consistently outperforms these models, achieving superior mAP on the COCO dataset while maintaining a faster inference rate [25].
The performance comparison graph depicted in Figure 2 offers several key insights. The YOLOv11 variants (11n, 11s, 11m, and 11x) form a distinct performance frontier, with each model achieving higher COCO mAP 50-95 scores at their respective latency points. Notably, YOLOv11x achieves approximately 54.5% mAP 50-95 at 13 ms latency, surpassing all previous YOLO iterations. The intermediate variants, particularly YOLOv11m, demonstrate exceptional efficiency by achieving comparable accuracy to larger models from previous generations while requiring significantly less processing time.
A particularly noteworthy observation is the performance leap in the low-latency regime (2-6 ms), where YOLOv11s maintains high accuracy (approximately 47% mAP 50-95) while operating at speeds previously associated with much less accurate models. This represents a crucial advancement for real-time applications where both speed and accuracy are critical. The improvement curve of YOLOv11 also shows better scaling characteristics across its model variants, suggesting more efficient utilization of additional computational resources compared to previous generations.
Figure 2: Benchmarking YOLOv11 Against Previous Versions [23]
7 Discussion
YOLO11 marks a significant leap forward in object detection technology, building upon its predecessors while introducing innovative enhancements. This latest iteration demonstrates remarkable versatility and efficiency across various CV tasks.
- Efficiency and Scalability: YOLO11 introduces a range of model sizes, from nano to extra-large, catering to diverse application needs. This scalability allows for deployment in scenarios ranging from resource-constrained edge devices to high-performance computing environments. The nano variant, in particular, showcases impressive speed and efficiency improvements over its predecessor, making it ideal for real-time applications.
- Architectural Innovations: The model incorporates novel architectural elements, such as the C3k2 block, SPPF, and C2PSA, that enhance its feature extraction and processing capabilities. These enhancements allow the model to better analyze and interpret complex visual information, potentially leading to improved detection accuracy across various scenarios.
- Multi-Task Proficiency: YOLO11’s versatility extends beyond object detection, encompassing tasks such as instance segmentation, image classification, pose estimation, and oriented object detection. This multi-faceted approach positions YOLO11 as a comprehensive solution for diverse CV challenges.
- Enhanced Attention Mechanisms: A key advancement in YOLO11 is the integration of sophisticated spatial attention mechanisms, particularly the C2PSA component. This feature enables the model to focus more effectively on critical regions within an image, enhancing its ability to detect and analyze objects. The improved attention capability is especially beneficial for identifying complex or partially occluded objects, addressing a common challenge in object detection tasks. This refinement in spatial awareness contributes to YOLO11’s overall performance improvements, particularly in challenging visual environments.
- Performance Benchmarks: Comparative analyses reveal YOLO11’s superior performance, particularly in its smaller variants. The nano model, despite a slight increase in parameters, demonstrates enhanced inference speed and frames per second (FPS) compared to its predecessor. This improvement suggests that YOLO11 achieves a favorable balance between computational efficiency and detection accuracy.
- Implications for Real-World Applications: The advancements in YOLO11 have significant implications for various industries. Its improved efficiency and multi-task capabilities make it particularly suitable for applications in autonomous vehicles, surveillance systems, and industrial automation. The model’s ability to perform well across different scales also opens up new possibilities for deployment in resource-constrained environments without compromising on performance.
8 Conclusion
YOLOv11 represents a significant advancement in the field of CV, offering a compelling combination of enhanced performance and versatility. This latest iteration of the YOLO architecture demonstrates marked improvements in accuracy and processing speed, while simultaneously reducing the number of parameters required. Such optimizations make YOLOv11 particularly well-suited for a wide range of applications, from edge computing to cloud-based analysis.
The model’s adaptability across various tasks, including object detection, instance segmentation, and pose estimation, positions it as a valuable tool for diverse applications such as emotion detection [26], healthcare [27], and various other industries [17]. Its seamless integration capabilities and improved efficiency make it an attractive option for businesses seeking to implement or upgrade their CV systems. In summary, YOLOv11’s blend of enhanced feature extraction, optimized performance, and broad task support establishes it as a formidable solution for addressing complex visual recognition challenges in both research and practical applications.