Computer Vision: Parsing and Converting Classic Dataset Formats (VOC, YOLO, COCO), with Code


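In computer vision (CV), whether the task is object detection, image classification or something else, it is essential to know how to handle datasets in different formats and to understand the key metrics involved in training. This article looks at three classic annotation formats (VOC, YOLO and COCO): how each one is laid out, how to parse it in Python, and how to convert VOC annotations to YOLO.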

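1. VOC Format

VOC (Visual Object Classes) is an annotation format widely used for object detection, best known from the PASCAL VOC challenge. It stores the position and class of every object in an image in an XML file.

File structure and content

Each image has one XML file, which holds basic information about the image plus the position and class label of every object in it. A typical VOC XML file looks like this: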

<annotation>
    <folder>images</folder>
    <filename>000001.jpg</filename>
    <size>
        <width>500</width>
        <height>375</height>
        <depth>3</depth>
    </size>
    <object>
        <name>dog</name>
        <pose>Left</pose>
        <truncated>1</truncated>
        <difficult>0</difficult>
        <bndbox>
            <xmin>263</xmin>
            <ymin>211</ymin>
            <xmax>324</xmax>
            <ymax>339</ymax>
        </bndbox>
    </object>
    <object>
        <name>person</name>
        <pose>Unspecified</pose>
        <truncated>0</truncated>
        <difficult>0</difficult>
        <bndbox>
            <xmin>159</xmin>
            <ymin>59</ymin>
            <xmax>281</xmax>
            <ymax>287</ymax>
        </bndbox>
    </object>
</annotation>

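Key elements

<folder>: name of the folder that contains the image.
<filename>: image file name.
<size>: image dimensions: width, height and depth (the number of channels, usually 3 for an RGB image).
<object>: one block per object; there can be several. Each block contains:
- <name>: class name of the object.
- <pose>: the viewpoint from which the object was captured (for example Left or Unspecified).
- <truncated>: whether the object is cut off (for example, part of it lies outside the image).
- <difficult>: whether the object is hard to recognize (the PASCAL VOC evaluation ignores objects marked difficult).
- <bndbox>: bounding box coordinates:
  - <xmin>, <ymin>: absolute pixel coordinates of the top-left corner.
  - <xmax>, <ymax>: absolute pixel coordinates of the bottom-right corner.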

Python example for VOC data

The following simple example reads and parses a VOC XML file and extracts the object information from it:

from xml.etree import ElementTree


def parse_voc_xml(file_path):
    """Parse one VOC XML file and return (width, height, objects)."""
    tree = ElementTree.parse(file_path)
    root = tree.getroot()

    # Image size
    img_width = int(root.find("size/width").text)
    img_height = int(root.find("size/height").text)

    # One dict per <object>: class name plus [xmin, ymin, xmax, ymax] in pixels
    objects = []
    for obj in root.findall("object"):
        name = obj.find("name").text
        xmin = int(obj.find("bndbox/xmin").text)
        ymin = int(obj.find("bndbox/ymin").text)
        xmax = int(obj.find("bndbox/xmax").text)
        ymax = int(obj.find("bndbox/ymax").text)
        objects.append({"name": name, "bbox": [xmin, ymin, xmax, ymax]})

    return img_width, img_height, objects


# Example usage
file_path = "path/to/voc_annotation.xml"
width, height, objs = parse_voc_xml(file_path)
print(f"Image width: {width}, height: {height}")
for obj in objs:
    print(obj)

This snippet extracts the image size, plus the class and bounding box of every object, from a given VOC XML file.
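The parser above keeps only the class name and the box. As a minimal sketch, the variant below also reads the <truncated> and <difficult> flags and lets the caller skip difficult objects. The function name `parse_voc_objects`, the `skip_difficult` parameter and the inline sample XML (where the person is marked difficult) are assumptions made for illustration; they are not part of any VOC toolkit.

```python
from xml.etree import ElementTree

# Hypothetical sample, shortened from the annotation shown earlier;
# the person is marked difficult here purely to exercise the filter.
SAMPLE_XML = """
<annotation>
  <size><width>500</width><height>375</height><depth>3</depth></size>
  <object>
    <name>dog</name><truncated>1</truncated><difficult>0</difficult>
    <bndbox><xmin>263</xmin><ymin>211</ymin><xmax>324</xmax><ymax>339</ymax></bndbox>
  </object>
  <object>
    <name>person</name><truncated>0</truncated><difficult>1</difficult>
    <bndbox><xmin>159</xmin><ymin>59</ymin><xmax>281</xmax><ymax>287</ymax></bndbox>
  </object>
</annotation>
""".strip()


def parse_voc_objects(xml_text, skip_difficult=False):
    """Return a list of object dicts that include the truncated/difficult flags."""
    root = ElementTree.fromstring(xml_text)
    objects = []
    for obj in root.findall("object"):
        # findtext returns the default when the tag is missing.
        difficult = int(obj.findtext("difficult", default="0"))
        truncated = int(obj.findtext("truncated", default="0"))
        if skip_difficult and difficult:
            continue
        box = obj.find("bndbox")
        bbox = [int(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")]
        objects.append({
            "name": obj.findtext("name"),
            "bbox": bbox,
            "truncated": bool(truncated),
            "difficult": bool(difficult),
        })
    return objects


all_objects = parse_voc_objects(SAMPLE_XML)
easy_objects = parse_voc_objects(SAMPLE_XML, skip_difficult=True)
print(len(all_objects), len(easy_objects))  # 2 1
```

Whether to drop difficult objects depends on the training setup, so the sketch leaves that choice to the caller.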

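2. YOLO Format

YOLO (You Only Look Once) is a popular object detection algorithm that uses its own annotation format to describe the position and class of the objects in an image. Unlike VOC or COCO, the YOLO format stores the annotations for each image in a plain text (.txt) file, where each entry holds an object's class ID and the position of its bounding box.

File structure and content

Each image has one matching text file. Every line in that file describes one object and contains five values:

Class ID (cls_id)
x coordinate of the bounding box center (x_center)
y coordinate of the bounding box center (y_center)
Width of the bounding box (w)
Height of the bounding box (h)

The four box values are relative: fractions of the image width or height (decimals between 0 and 1), not absolute pixel values. Here is a simple example.

Suppose a 640x480 image contains two objects, a dog and a person. The matching YOLO annotation file might look like this: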

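0 0.500000 0.600000 0.250000 0.300000 # dog
1 0.300000 0.200000 0.100000 0.150000 # person

The first line says that "dog" has class ID 0; its box center sits at 50% of the image width and 60% of the image height, and the box covers 25% of the image width and 30% of the image height.
The second line says that "person" has class ID 1; its box center sits at 30% of the image width and 20% of the image height, and the box covers 10% of the image width and 15% of the image height.
The trailing # notes are for explanation only; a real label file normally holds just the five numbers on each line.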
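To make the relative values concrete, the sketch below scales a YOLO line back to pixel corner coordinates for a 640x480 image. The helper name `yolo_line_to_pixel_box` is hypothetical, and rounding to whole pixels is an assumption of this sketch rather than part of the format.

```python
def yolo_line_to_pixel_box(line, img_width, img_height):
    """Convert one YOLO label line to (cls_id, xmin, ymin, xmax, ymax) in pixels."""
    # Drop anything after '#', because the example lines carry explanatory notes.
    fields = line.split("#")[0].split()
    cls_id = int(fields[0])
    x_center, y_center, w, h = (float(v) for v in fields[1:5])

    # Scale the relative values back to pixels.
    box_w = w * img_width
    box_h = h * img_height
    xmin = x_center * img_width - box_w / 2
    ymin = y_center * img_height - box_h / 2
    return cls_id, round(xmin), round(ymin), round(xmin + box_w), round(ymin + box_h)


dog_box = yolo_line_to_pixel_box("0 0.500000 0.600000 0.250000 0.300000 # dog", 640, 480)
person_box = yolo_line_to_pixel_box("1 0.300000 0.200000 0.100000 0.150000 # person", 640, 480)
print(dog_box)     # (0, 240, 216, 400, 360)
print(person_box)  # (1, 160, 60, 224, 132)
```

The dog's box is therefore 160 pixels wide and 144 pixels high, centered at pixel (320, 288).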

Python example for YOLO data

The following simple example shows how to convert a VOC annotation to YOLO format and how to read YOLO-format data.

from xml.etree import ElementTree


def voc_to_yolo(voc_file_path, output_file_path, label2idx):
    """Convert one VOC XML annotation file into a YOLO label file."""
    tree = ElementTree.parse(voc_file_path)
    root = tree.getroot()
    img_width = int(root.find("size/width").text)
    img_height = int(root.find("size/height").text)

    with open(output_file_path, "w") as f:
        for obj in root.findall("object"):
            name = obj.find("name").text
            cls_id = label2idx[name]
            xmin = int(obj.find("bndbox/xmin").text)
            ymin = int(obj.find("bndbox/ymin").text)
            xmax = int(obj.find("bndbox/xmax").text)
            ymax = int(obj.find("bndbox/ymax").text)

            # Box center: average the two corners (in pixels), then divide by
            # the image size to get a relative value between 0 and 1.
            x_center = (xmin + xmax) / 2.0 / img_width
            y_center = (ymin + ymax) / 2.0 / img_height
            # Box size relative to the image size.
            width = (xmax - xmin) / img_width
            height = (ymax - ymin) / img_height

            line = f"{cls_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}\n"
            f.write(line)


# Example usage
label2idx = {"dog": 0, "person": 1}
voc_file_path = "path/to/voc_annotation.xml"
output_file_path = "path/to/output.txt"
voc_to_yolo(voc_file_path, output_file_path, label2idx)

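In YOLO format the bounding box is stored in relative coordinates rather than absolute pixel values. Specifically, x_center and y_center are the box center as fractions of the image width and height (from 0 to 1), and w and h are the box width and height as fractions of the image width and height.

Explanation of the formulas

x_center = (xmin + xmax) / 2.0 / img_width
y_center = (ymin + ymax) / 2.0 / img_height

Computing the relative coordinates of the box center

Compute the absolute coordinates of the box center:
- (xmin + xmax) / 2.0: the average of the x coordinates of the top-left and bottom-right corners, which is the x coordinate of the box center in pixels.
- (ymin + ymax) / 2.0: the average of the y coordinates of the top-left and bottom-right corners, which is the y coordinate of the box center in pixels.
Convert to relative coordinates:
- / img_width: dividing the center's x coordinate by the image width gives a fraction between 0 and 1. For example, if the center's x coordinate is 320 pixels and the image is 640 pixels wide, x_center is 320 / 640 = 0.5.
- / img_height: dividing the center's y coordinate by the image height gives a fraction between 0 and 1. For example, if the center's y coordinate is 240 pixels and the image is 480 pixels high, y_center is 240 / 480 = 0.5.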

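Example

Suppose a 640x480 image contains one object with the following bounding box coordinates:

xmin = 100
ymin = 150
xmax = 300
ymax = 350

Applying the formulas above:

Compute the absolute coordinates of the box center:
- x_center_abs = (100 + 300) / 2.0 = 200
- y_center_abs = (150 + 350) / 2.0 = 250
Convert to relative coordinates:
- x_center_rel = 200 / 640 = 0.3125
- y_center_rel = 250 / 480 ≈ 0.5208

So in the YOLO annotation file, the entry for this object might look like this:

0 0.3125 0.5208 0.3125 0.4167

Where:

0 is the class ID.
0.3125 is the x coordinate of the box center as a fraction of the image width.
0.5208 is the y coordinate of the box center as a fraction of the image height.
0.3125 is the box width as a fraction of the image width ((300 - 100) / 640 = 200 / 640 = 0.3125).
0.4167 is the box height as a fraction of the image height ((350 - 150) / 480 = 200 / 480 ≈ 0.4167).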
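The arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch that applies the same formulas to the numbers above; the function name `voc_box_to_yolo` is an assumption introduced here, and class ID 0 is taken from the example.

```python
def voc_box_to_yolo(xmin, ymin, xmax, ymax, img_width, img_height):
    """Turn pixel corner coordinates into relative (x_center, y_center, w, h)."""
    x_center = (xmin + xmax) / 2.0 / img_width
    y_center = (ymin + ymax) / 2.0 / img_height
    width = (xmax - xmin) / img_width
    height = (ymax - ymin) / img_height
    return x_center, y_center, width, height


values = voc_box_to_yolo(100, 150, 300, 350, 640, 480)
# Every value should be a fraction of the image size.
assert all(0.0 <= v <= 1.0 for v in values)

yolo_line = "0 " + " ".join(f"{v:.6f}" for v in values)
print(yolo_line)  # 0 0.312500 0.520833 0.312500 0.416667
```

With six decimal places the line reads 0.520833 and 0.416667, which round to the 0.5208 and 0.4167 used in the text.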

Reading YOLO-format data

def read_yolo_annotations(file_path):
    """Read one YOLO label file and return a list of annotation dicts."""
    annotations = []
    with open(file_path, "r") as f:
        for line in f:
            parts = line.strip().split()
            if not parts:
                # Skip blank lines instead of failing on parts[0].
                continue
            cls_id = int(parts[0])
            x_center, y_center, w, h = map(float, parts[1:5])
            annotations.append({"cls_id": cls_id, "bbox": [x_center, y_center, w, h]})
    return annotations


# Example usage
file_path = "path/to/yolo_annotation.txt"
annotations = read_yolo_annotations(file_path)
for annotation in annotations:
    print(annotation)

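With the examples above you can convert VOC annotations to YOLO format and read YOLO-format data, which is useful when preparing a training set or analyzing a dataset.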

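3. COCO Format

COCO (Common Objects in Context) is an annotation format widely used in computer vision, especially for object detection, segmentation and keypoint detection. It stores images and their annotations in a JSON file, is highly structured, and supports complex multi-object annotation.

File structure and content

A COCO JSON file usually contains these main sections:

images: basic information about each image.
annotations: information about each object or region in the images.
categories: the definitions of all possible object classes.

Here is a simplified example of a COCO JSON file: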

{
    "images": [
        {"id": 0, "width": 640, "height": 480, "file_name": "000000000009.jpg"}
    ],
    "annotations": [
        {
            "id": 1,
            "image_id": 0,
            "category_id": 1,
            "bbox": [100, 150, 200, 200],
            "area": 40000,
            "iscrowd": 0
        },
        {
            "id": 2,
            "image_id": 0,
            "category_id": 2,
            "bbox": [300, 200, 100, 150],
            "area": 15000,
            "iscrowd": 0
        }
    ],
    "categories": [
        {"id": 1, "name": "person", "supercategory": "person"},
        {"id": 2, "name": "dog", "supercategory": "animal"}
    ]
}

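images: each element describes one image: its ID, width, height and file name.
annotations: each element describes one object: its position (the bounding box bbox), its area, whether it is a crowd (iscrowd), and so on.
categories: defines all possible object classes and their IDs.

Key fields

bbox: the bounding box in the form [x, y, width, height], where x and y are the absolute pixel coordinates of the top-left corner and width and height are the box width and height, also in pixels.
area: the area of the object. For object detection this is usually the bounding box area (width times height); in the official COCO annotations it is the area of the segmentation mask.
iscrowd: marks whether the annotation covers a crowd of objects (for example, a group of people standing together). 1 means a crowd; 0 means a single object.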
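Because COCO stores a corner plus a size while VOC stores two corners, a small conversion helps when comparing the two. The sketch below applies it to the two boxes from the JSON example and checks that width times height matches the stored area values; the helper names `coco_bbox_to_corners` and `bbox_area` are assumptions made for illustration.

```python
def coco_bbox_to_corners(bbox):
    """[x, y, width, height] -> [xmin, ymin, xmax, ymax], all in pixels."""
    x, y, w, h = bbox
    return [x, y, x + w, y + h]


def bbox_area(bbox):
    """Area of a COCO-style box: width times height."""
    _, _, w, h = bbox
    return w * h


person_bbox = [100, 150, 200, 200]
dog_bbox = [300, 200, 100, 150]

print(coco_bbox_to_corners(person_bbox), bbox_area(person_bbox))  # [100, 150, 300, 350] 40000
print(coco_bbox_to_corners(dog_bbox), bbox_area(dog_bbox))        # [300, 200, 400, 350] 15000
```

The first box has the same corners as the VOC-style box used in the YOLO worked example (xmin = 100, ymin = 150, xmax = 300, ymax = 350).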

Python example for COCO data

The following simple example reads and parses a COCO JSON file and extracts the object information from it:

import json


def parse_coco_json(file_path):
    """Parse a COCO JSON file and return one flat dict per annotation."""
    with open(file_path, "r") as f:
        data = json.load(f)

    # Index images and categories by ID for quick lookup.
    images = {img["id"]: img for img in data["images"]}
    categories = {cat["id"]: cat for cat in data["categories"]}

    annotations = []
    for ann in data["annotations"]:
        image_info = images[ann["image_id"]]
        category_info = categories[ann["category_id"]]
        annotation = {
            "image_id": ann["image_id"],
            "filename": image_info["file_name"],
            "category_id": ann["category_id"],
            "category_name": category_info["name"],
            "bbox": ann["bbox"],
            "area": ann["area"],
        }
        annotations.append(annotation)
    return annotations


# Example usage
file_path = "path/to/coco_annotation.json"
annotations = parse_coco_json(file_path)
for annotation in annotations:
    print(annotation)

Output:

{'image_id': 0, 'filename': '000000000009.jpg', 'category_id': 1, 'category_name': 'person', 'bbox': [100, 150, 200, 200], 'area': 40000}
{'image_id': 0, 'filename': '000000000009.jpg', 'category_id': 2, 'category_name': 'dog', 'bbox': [300, 200, 100, 150], 'area': 15000}
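The same idea carries over to converting COCO annotations into YOLO lines. The sketch below is one possible approach under stated assumptions, not a standard tool: the function name `coco_to_yolo_lines` is hypothetical, and category IDs are remapped to 0-based indices in ascending ID order, so here person becomes 0 and dog becomes 1 (unlike the label2idx mapping in the YOLO section).

```python
from collections import defaultdict


def coco_to_yolo_lines(data):
    """Return {file_name: [yolo_line, ...]} for a COCO-style dict."""
    images = {img["id"]: img for img in data["images"]}
    # COCO category IDs need not start at 0 or be contiguous, so remap them
    # to 0-based indices in ascending ID order (an assumption of this sketch).
    cat_ids = sorted(cat["id"] for cat in data["categories"])
    cat_to_idx = {cat_id: idx for idx, cat_id in enumerate(cat_ids)}

    lines = defaultdict(list)
    for ann in data["annotations"]:
        img = images[ann["image_id"]]
        x, y, w, h = ann["bbox"]  # top-left corner plus size, in pixels
        x_center = (x + w / 2.0) / img["width"]
        y_center = (y + h / 2.0) / img["height"]
        rel_w = w / img["width"]
        rel_h = h / img["height"]
        cls_idx = cat_to_idx[ann["category_id"]]
        lines[img["file_name"]].append(
            f"{cls_idx} {x_center:.6f} {y_center:.6f} {rel_w:.6f} {rel_h:.6f}"
        )
    return dict(lines)


# The simplified COCO example from this section, as a Python dict.
sample = {
    "images": [{"id": 0, "width": 640, "height": 480, "file_name": "000000000009.jpg"}],
    "annotations": [
        {"id": 1, "image_id": 0, "category_id": 1, "bbox": [100, 150, 200, 200]},
        {"id": 2, "image_id": 0, "category_id": 2, "bbox": [300, 200, 100, 150]},
    ],
    "categories": [{"id": 1, "name": "person"}, {"id": 2, "name": "dog"}],
}

result = coco_to_yolo_lines(sample)
for file_name, rows in result.items():
    print(file_name)
    for row in rows:
        print(row)
# 000000000009.jpg
# 0 0.312500 0.520833 0.312500 0.416667
# 1 0.546875 0.572917 0.156250 0.312500
```

Returning lines grouped by file name keeps the sketch free of file I/O; writing each list to a .txt file named after its image is left to the caller.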