Towards Open World Object Detection: Paper Overview

Paper: https://arxiv.org/abs/2103.02603
Code: https://github.com/JosephKJ/OWOD

Towards Open World Object Detection


Abstract

Humans have a natural instinct to identify unknown object instances in their environments. The intrinsic curiosity about these unknown instances aids in learning about them, when the corresponding knowledge is eventually available. This motivates us to propose a novel computer vision problem called: 'Open World Object Detection', where a model is tasked to: 1) identify objects that have not been introduced to it as 'unknown', without explicit supervision to do so, and 2) incrementally learn these identified unknown categories without forgetting previously learned classes, when the corresponding labels are progressively received. We formulate the problem, introduce a strong evaluation protocol and provide a novel solution, which we call ORE: Open World Object Detector, based on contrastive clustering and energy based unknown identification. Our experimental evaluation and ablation studies analyse the efficacy of ORE in achieving Open World objectives. As an interesting by-product, we find that identifying and characterising unknown instances helps to reduce confusion in an incremental object detection setting, where we achieve state-of-the-art performance, with no extra methodological effort. We hope that our work will attract further research into this newly identified, yet crucial research direction.

1. Introduction

Deep learning has accelerated progress in Object Detection research 2, 3, 4, 5, 6, where a model is tasked to identify and localise objects in an image. All existing approaches work under a strong assumption that all the classes to be detected would be available at the training phase. Two challenging scenarios arise when we relax this assumption: 1) A test image might contain objects from unknown classes, which should be classified as unknown. 2) As and when information (labels) about such identified unknowns becomes available, the model should be able to incrementally learn the new classes. Research in developmental psychology 7, 8 finds that the ability to identify what one doesn't know is key in sparking curiosity. Such curiosity fuels the desire to learn new things 9, 10. This motivates us to propose a new problem in which a model should be able to identify instances of unknown objects as unknown and subsequently learn to recognise them when training data progressively arrives, in a unified way. We call this problem setting Open World Object Detection.
The number of classes annotated in standard vision datasets like Pascal VOC 11 and MS-COCO 12 is very low (20 and 80 respectively) when compared to the infinite number of classes present in the open world. Recognising an unknown as an unknown requires strong generalization. Scheirer et al. 13 formalise this as the Open Set classification problem. Henceforth, various methodologies (using one-vs-rest SVMs and deep learning models) have been formulated to address this challenging setting. Bendale et al. 14 extend Open Set to an Open World classification setting by additionally updating the image classifier to recognise the identified new unknown classes. Interestingly, as seen in Fig. 1, Open World object detection has remained unexplored, owing to the difficulty of the problem setting.


Figure 1: Open World Object Detection (F) is a novel problem that has not been formally defined or addressed so far. Though related to Open Set and Open World classification, Open World Object Detection poses its own unique challenges, which, when addressed, improve the practicality of object detectors.

The advances in Open Set and Open World image classification cannot be trivially adapted to Open Set and Open World object detection, because of a fundamental difference in the problem setting: the object detector is trained to detect unknown objects as background. Instances of many unknown classes would already have been introduced to the object detector along with known objects. As they are not labelled, these unknown instances are explicitly learned as background while training the detection model. Dhamija et al. 15 find that even with this extra training signal, state-of-the-art object detectors produce false positive detections, where unknown objects end up being classified as one of the known classes, often with very high probability. Miller et al. 16 propose to use dropout sampling to get an estimate of the uncertainty of the object detection predictions. This is the only peer-reviewed research work in the open set object detection literature. Our proposed Open World Object Detection goes a step further to incrementally learn the new classes, once they are detected as unknown and an oracle provides labels for the objects of interest among all the unknowns. To the best of our knowledge, this has not been tried in the literature.
The Open World Object Detection setting is much more natural than the existing closed-world, static-learning setting. The world is diverse and dynamic in the number, type and configurations of novel classes. It would be naive to assume that all the classes to expect at inference are seen during training. Practical deployments of detection systems in robotics, self-driving cars, plant phenotyping, healthcare and surveillance cannot afford to have complete knowledge on what classes to expect at inference time, while being trained in-house. The most natural and realistic behavior that one can expect from an object detection algorithm deployed in such settings would be to confidently predict an unknown object as unknown, and known objects into the corresponding classes. As and when more information about the identified unknown classes becomes available, the system should be able to incorporate them into its existing knowledge base. This would define a smart object detection system, and ours is an effort towards achieving this goal. The key contributions of our work are:

  • We introduce a novel problem setting, Open World Object Detection, which models the real-world more closely.
  • We develop a novel methodology, called ORE, based on contrastive clustering, an unknown-aware proposal network and energy based unknown identification to address the challenges of open world detection.
  • We introduce a comprehensive experimental setting, which helps to measure the open world characteristics of an object detector, and benchmark ORE on it against competitive baseline methods.
  • As an interesting by-product, the proposed methodology achieves state-of-the-art performance on Incremental Object Detection, even though not primarily designed for it.

2. Related Work

Open Set Classification: The open set setting considers the knowledge acquired through the training set to be incomplete; thus new, unknown classes can be encountered during testing. Scheirer et al. 17 developed open set classifiers in a one-vs-rest setting to balance performance against the risk of labeling a sample far from the known training examples (termed open space risk). Follow-up works 18, 19 extended the open set framework to the multi-class classifier setting with probabilistic models to account for the fading classifier confidences on unknown classes.
Bendale and Boult 20 identified unknowns in the feature space of deep networks and used a Weibull distribution to estimate the open set risk (called the OpenMax classifier). A generative version of OpenMax was proposed in 21 by synthesizing novel class images. Liu et al. 22 considered a long-tailed recognition setting where majority, minority and unknown classes coexist. They developed a metric learning framework to identify unseen classes as unknown. In a similar spirit, several dedicated approaches target detecting out-of-distribution samples 23 or novelties 24. Recently, self-supervised learning 25 and unsupervised learning with reconstruction 26 have been explored for open set recognition. However, while these works can recognize unknown instances, they cannot dynamically update themselves in an incremental fashion over multiple training episodes. Further, our energy based unknown detection approach has not been explored before.
Open World Classification: 14 first proposed the open world setting for image recognition. Instead of a static classifier trained on a fixed set of classes, they proposed a more flexible setting in which knowns and unknowns coexist. The model can recognize both types of objects and adaptively improve itself when new labels for the unknowns are provided. Their approach extends the Nearest Class Mean classifier to operate in an open world setting by re-calibrating the class probabilities to balance open space risk. 27 studies open world face identity learning, while 28 proposed to use an exemplar set of seen classes to match against a new sample, rejecting it in case of a low match with all previously known classes. However, they do not test on image classification benchmarks, and instead study product classification in e-commerce applications.
Open Set Detection: Dhamija et al. 15 formally studied the impact of the open set setting on popular object detectors. They noticed that state-of-the-art object detectors often classify unknown classes with high confidence as seen classes. This is despite the fact that the detectors are explicitly trained with a background class 29, 2, 30 and/or apply one-vs-rest classifiers to model each class 31, 5. A dedicated body of work 16, 32, 33 focuses on developing measures of (spatial and semantic) uncertainty in object detectors to reject unknown classes. E.g., 16, 32 use Monte Carlo Dropout 34 sampling in an SSD detector to obtain uncertainty estimates. These methods, however, cannot incrementally adapt their knowledge in a dynamic world.

3. Open World Object Detection

Let us formalise the definition of Open World Object Detection in this section. At any time $t$, we consider the set of known object classes as $\mathcal{K}^t = \{1, 2, .., C\} \subset \mathbb{N}^+$, where $\mathbb{N}^+$ denotes the set of positive integers. In order to realistically model the dynamics of the real world, we also assume that there exists a set of unknown classes $\mathcal{U} = \{C+1, ...\}$, which may be encountered during inference. The known object classes $\mathcal{K}^t$ are assumed to be labelled in the dataset $\mathcal{D}^t = \{X^t, Y^t\}$, where $X$ and $Y$ denote the input images and labels respectively. The input image set comprises $M$ training images, $X^t = \{I_1, ..., I_M\}$, and the associated object labels for each image form the label set $Y^t = \{Y_1, ..., Y_M\}$. Each $Y_i = \{y_1, y_2, .., y_K\}$ encodes a set of $K$ object instances with their class labels and locations, i.e., $y_k = [l_k, x_k, y_k, w_k, h_k]$, where $l_k \in \mathcal{K}^t$ and $x_k, y_k, w_k, h_k$ denote the bounding box centre coordinates, width and height respectively.
The Open World Object Detection setting considers an object detection model $\mathcal{M}_C$ that is trained to detect all the previously encountered $C$ object classes. Importantly, the model $\mathcal{M}_C$ is able to identify a test instance belonging to any of the known $C$ classes, and can also recognize a new or unseen class instance by classifying it as unknown, denoted by the label zero (0). The unknown set of instances $U^t$ can then be forwarded to a human user, who can identify $n$ new classes of interest (among a potentially large number of unknowns) and provide their training examples. The learner incrementally adds the $n$ new classes and updates itself to produce an updated model $\mathcal{M}_{C+n}$ without retraining from scratch on the whole dataset. The known class set is also updated: $\mathcal{K}^{t+1} = \mathcal{K}^t \cup \{C+1, ..., C+n\}$. This cycle continues over the life of the object detector, where it adaptively updates itself with new knowledge. The problem setting is illustrated in the top row of Fig. 2.
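As an illustration only, the bookkeeping described above can be sketched in a few lines of Python; the class names and structure here are hypothetical, not taken from the OWOD codebase:

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    """One object instance y_k = [l_k, x_k, y_k, w_k, h_k]."""
    label: int        # l_k in K^t, or 0 for "unknown"
    cx: float         # bounding-box centre x
    cy: float         # bounding-box centre y
    w: float          # width
    h: float          # height

@dataclass
class OpenWorldState:
    known: set = field(default_factory=set)   # K^t = {1, .., C}

    def predict_label(self, raw_label: int) -> int:
        # Known classes keep their label; anything else is reported as
        # unknown (label 0) rather than forced into a known class.
        return raw_label if raw_label in self.known else 0

    def incorporate(self, new_classes: set) -> None:
        # K^{t+1} = K^t ∪ {C+1, ..., C+n}, once an oracle labels the
        # unknowns of interest.
        self.known |= new_classes

state = OpenWorldState(known={1, 2, 3})
assert state.predict_label(2) == 2      # known class
assert state.predict_label(7) == 0      # unseen class -> unknown
state.incorporate({4, 5})
assert state.predict_label(4) == 4      # learned incrementally
```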

Figure 2: Approach overview. Top row: At each incremental learning step, the model identifies unknown objects (denoted by '?'), which are progressively labelled (shown as blue circles) and added to the existing knowledge base (green circles). Bottom row: Our Open World Object Detection model identifies potential unknown objects using an energy-based classification head and an unknown-aware RPN. Further, contrastive learning in the feature space produces discriminative clusters, and new classes can be added flexibly in a continual manner without forgetting previously learned classes.

4. ORE: Open World Object Detector

A successful approach for Open World Object Detection should be able to identify unknown instances without explicit supervision, and should not forget earlier classes when labels of the identified novel instances are presented to the model to upgrade its knowledge (without retraining from scratch). We propose a solution, ORE, which addresses both these challenges in a unified manner.
Neural networks are universal function approximators 35, which learn a mapping between an input and the output through a series of hidden layers. The latent representation learned in these hidden layers directly controls how each function is realised. We hypothesise that learning a clear discrimination between classes in the latent space of object detectors could have a two-fold effect. First, it helps the model to identify how the feature representation of an unknown instance differs from the known instances, which helps identify an unknown instance as a novelty. Second, it facilitates learning feature representations for the new class instances without overlapping with the previous classes in the latent space, which helps towards incrementally learning without forgetting. The key component that helps us realise this is our proposed contrastive clustering in the latent space, which we elaborate on in Sec. 4.1.
To optimally cluster the unknowns using contrastive clustering, we need supervision on what an unknown instance is. It is infeasible to manually annotate even a small subset of the potentially infinite set of unknown classes. To counter this, we propose an auto-labelling mechanism based on the Region Proposal Network 3 to pseudo-label unknown instances, as explained in Sec. 4.2. The inherent separation of the auto-labelled unknown instances in the latent space helps our energy based classification head to differentiate between known and unknown instances. As elucidated in Sec. 4.3, we find that the Helmholtz free energy is higher for unknown instances.
Fig. 2 shows a high-level architectural overview of ORE. We choose Faster R-CNN 3 as the base detector, as Dhamija et al. 15 found that it has better open set performance when compared against the one-stage RetinaNet detector 5 and the objectness based YOLO detector 6. Faster R-CNN 3 is a two-stage object detector. In the first stage, a class-agnostic Region Proposal Network (RPN) proposes potential regions which might contain an object, from the feature maps coming from a shared backbone network. The second stage classifies and adjusts the bounding box coordinates of each proposed region. The features generated by the residual block in the Region of Interest (RoI) head are contrastively clustered, and the RPN and the classification head are adapted to auto-label and identify unknowns respectively. We explain each of these coherent constituent components in the following subsections.

4.1. Contrastive Clustering

Class separation in the latent space would be an ideal characteristic for an Open World methodology to identify unknowns. A natural way to enforce this would be to model it as a contrastive clustering problem, where instances of the same class are forced to remain close-by, while instances of dissimilar classes are pushed far apart.
For each known class $i \in \mathcal{K}^t$, we maintain a prototype vector $p_i$. Let $f_c \in \mathbb{R}^d$ be a feature vector that is generated by an intermediate layer of the object detector, for an object of class $c$. We define the contrastive loss as follows:

$$L_{cont}(f_c) = \sum_{i=0}^{C} \ell(f_c, p_i), \quad \text{where} \tag{1}$$
$$\ell(f_c, p_i) = \begin{cases} \mathcal{D}(f_c, p_i) & i = c \\ \max\{0, \Delta - \mathcal{D}(f_c, p_i)\} & \text{otherwise} \end{cases}$$
where $\mathcal{D}$ is any distance function and $\Delta$ defines how close a similar and dissimilar item can be. Minimizing this loss would ensure the desired class separation in the latent space.
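As a concrete illustration, Eqn. (1) can be sketched in a few lines of numpy; the Euclidean distance for $\mathcal{D}$ and the margin value are illustrative choices:

```python
import numpy as np

def contrastive_loss(f_c, prototypes, c, delta=1.0):
    """Eqn. (1): pull f_c toward its class prototype p_c, and push it at
    least `delta` away from every other prototype. Euclidean distance is
    used for D here; the paper allows any distance function."""
    loss = 0.0
    for i, p_i in enumerate(prototypes):
        d = np.linalg.norm(f_c - p_i)
        if i == c:
            loss += d                        # attract to own prototype
        else:
            loss += max(0.0, delta - d)      # hinge: repel other prototypes
    return loss

prototypes = [np.array([0.0, 0.0]),   # p_0: unknown class
              np.array([5.0, 0.0]),   # p_1
              np.array([0.0, 5.0])]   # p_2
f = np.array([4.9, 0.1])              # feature of a class-1 instance
# Close to p_1 and farther than delta from the rest -> small loss.
assert contrastive_loss(f, prototypes, c=1) < 0.5
```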
The mean of the feature vectors corresponding to each class is used to create the set of class prototypes: $\mathcal{P} = \{p_0, \cdots, p_C\}$. Maintaining each prototype vector is a crucial component of ORE. As the whole network is trained end-to-end, the class prototypes should also gradually evolve, as the constituent features change gradually (since stochastic gradient descent updates the weights by a small step in each iteration). We maintain a fixed-length queue $q_i$ per class for storing the corresponding features. A feature store $\mathcal{F}_{store} = \{q_0, \cdots, q_C\}$ stores the class-specific features in the corresponding queues. This is a scalable approach for keeping track of how the feature vectors evolve with training, as the number of feature vectors that are stored is bounded by $C \times Q$, where $Q$ is the maximum size of each queue.
Algorithm 1 provides an overview of how the class prototypes are managed while computing the clustering loss. We start computing the loss only after a certain number of burn-in iterations ($I_b$) are completed. This allows the initial feature embeddings to mature enough to encode class information. Thereafter, we compute the clustering loss using Eqn. 1. After every $I_p$ iterations, a set of new class prototypes $\mathcal{P}_{new}$ is computed (line 8). Then the existing prototypes $\mathcal{P}$ are updated by weighing $\mathcal{P}$ and $\mathcal{P}_{new}$ with a momentum parameter $\eta$. This allows the class prototypes to evolve gradually while keeping track of the previous context. The computed clustering loss is added to the standard detection loss and back-propagated to learn the network end-to-end.
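A minimal sketch of the bookkeeping in Algorithm 1, assuming a bounded per-class queue and a momentum update; the hyper-parameter names ($Q$, $\eta$) follow the text, but the implementation details are our assumptions:

```python
import numpy as np

class FeatureStore:
    """Per-class feature queues F_store = {q_0 .. q_C} with momentum-updated
    class prototypes P = {p_0 .. p_C}, as described in Sec. 4.1."""

    def __init__(self, num_classes, dim, Q=20, eta=0.9):
        self.queues = [[] for _ in range(num_classes)]   # q_i per class
        self.prototypes = np.zeros((num_classes, dim))   # p_i per class
        self.Q, self.eta = Q, eta

    def push(self, cls, feature):
        q = self.queues[cls]
        q.append(np.asarray(feature, dtype=float))
        if len(q) > self.Q:           # total storage bounded by C x Q
            q.pop(0)

    def update_prototypes(self):
        # P <- eta * P + (1 - eta) * P_new, so prototypes evolve gradually
        # while keeping track of the previous context.
        for cls, q in enumerate(self.queues):
            if q:
                p_new = np.mean(q, axis=0)
                self.prototypes[cls] = (self.eta * self.prototypes[cls]
                                        + (1 - self.eta) * p_new)

store = FeatureStore(num_classes=3, dim=2, Q=2, eta=0.5)
store.push(1, [2.0, 0.0])
store.push(1, [4.0, 0.0])
store.push(1, [6.0, 0.0])             # evicts [2, 0]; queue keeps Q=2 items
store.update_prototypes()
assert np.allclose(store.prototypes[1], [2.5, 0.0])  # 0.5*0 + 0.5*mean([4, 6])
```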

4.2. Auto-labelling Unknowns with RPN

While computing the clustering loss with Eqn. 1, we contrast the input feature vector $f_c$ against the prototype vectors, which include a prototype for unknown objects too ($c \in \{0, 1, .., C\}$, where 0 refers to the unknown class). This would require unknown object instances to be labelled with an unknown ground-truth class, which is not practically feasible owing to the arduous task of re-annotating all instances of each image in already-annotated large-scale datasets.
As a surrogate, we propose to automatically label some of the objects in the image as potential unknown objects. For this, we rely on the fact that the Region Proposal Network (RPN) is class agnostic. Given an input image, the RPN generates a set of bounding box predictions for foreground and background instances, along with the corresponding objectness scores. We label those proposals that have a high objectness score, but do not overlap with a ground-truth object, as potential unknown objects. Simply put, we select the top-k background region proposals, sorted by objectness score, as unknown objects. This seemingly simple heuristic achieves good performance, as demonstrated in Sec. 5.
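The heuristic above can be sketched as follows; the IoU threshold used to decide "does not overlap" is an assumed implementation detail, not specified in the text:

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def auto_label_unknowns(proposals, scores, gt_boxes, k=1, iou_thresh=0.5):
    """Pseudo-label as 'unknown' the top-k proposals, sorted by objectness
    score, that do not overlap any ground-truth box."""
    order = np.argsort(scores)[::-1]            # highest objectness first
    unknowns = []
    for idx in order:
        if all(iou(proposals[idx], g) < iou_thresh for g in gt_boxes):
            unknowns.append(int(idx))
        if len(unknowns) == k:
            break
    return unknowns

proposals = [[0, 0, 10, 10], [50, 50, 60, 60], [51, 51, 61, 61]]
scores = [0.9, 0.8, 0.7]
gt_boxes = [[1, 1, 9, 9]]                       # overlaps proposal 0 only
assert auto_label_unknowns(proposals, scores, gt_boxes, k=1) == [1]
```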

4.3. Energy Based Unknown Identifier

Given the features ($f \in F$) in the latent space $F$ and their corresponding labels $l \in L$, we seek to learn an energy function $E(F, L)$. Our formulation is based on Energy Based Models (EBMs) 36, which learn a function $E(\cdot)$ to estimate the compatibility between the observed variables $F$ and the possible set of output variables $L$ using a single output scalar, i.e., $E(f): \mathbb{R}^d \rightarrow \mathbb{R}$. The intrinsic capability of EBMs to assign low energy values to in-distribution data, and vice versa, motivates us to use an energy measure to characterise whether a sample is from an unknown class.
Specifically, we use the Helmholtz free energy formulation, where the energies for all values in $L$ are combined:
$$E(f) = -T \log \int_{l'} \exp\left(-\frac{E(f, l')}{T}\right), \tag{2}$$
where $T$ is the temperature parameter. There exists a simple relation between the network outputs after the softmax layer and the Gibbs distribution of class-specific energy values 37. This can be formulated as
$$p(l|f) = \frac{\exp\left(\frac{g_l(f)}{T}\right)}{\sum_{i=1}^{C} \exp\left(\frac{g_i(f)}{T}\right)} = \frac{\exp\left(-\frac{E(f, l)}{T}\right)}{\exp\left(-\frac{E(f)}{T}\right)}, \tag{3}$$
where $p(l|f)$ is the probability density for a label $l$ and $g_l(f)$ is the $l^{th}$ classification logit of the classification head $g(\cdot)$. Using this correspondence, we define the free energy of our classification models in terms of their logits as follows:
$$E(f; g) = -T \log \sum_{i=1}^{C} \exp\left(\frac{g_i(f)}{T}\right). \tag{4}$$
The above equation provides a natural way to transform the classification head of the standard Faster R-CNN 3 into an energy function. Owing to the clear separation that we enforce in the latent space with contrastive clustering, we see a clear separation in the energy levels of the known class data points and the unknown data points, as illustrated in Fig. 3. In light of this trend, we model the distributions of the known and unknown energy values, $\xi_{kn}(f)$ and $\xi_{unk}(f)$, with a set of shifted Weibull distributions. These distributions were found to fit the energy data of a small held-out validation set (with both known and unknown instances) very well, when compared to Gamma, Exponential and Normal distributions. The learned distributions can be used to label a prediction as unknown if $\xi_{kn}(f) < \xi_{unk}(f)$.
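Eqn. (4) reduces to a log-sum-exp over the classification logits, which is straightforward to compute; the following sketch (with made-up logit values) also checks the softmax correspondence of Eqn. (3):

```python
import numpy as np

def free_energy(logits, T=1.0):
    """Eqn. (4): E(f; g) = -T log sum_i exp(g_i(f)/T), computed with the
    usual max-shift for numerical stability."""
    z = np.asarray(logits, dtype=float) / T
    m = z.max()
    return -T * (m + np.log(np.sum(np.exp(z - m))))

# A confident (peaked) set of logits yields low free energy, while a flat,
# uncertain one yields higher energy -- the separation that Fig. 3 builds on.
confident = free_energy([12.0, 0.0, 0.0])
uncertain = free_energy([0.5, 0.4, 0.3])
assert confident < uncertain

# Eqn. (3): with T = 1, the softmax probabilities recover from the same
# logits as p(l|f) = exp(g_l(f) + E(f; g)), and they sum to one.
probs = np.exp(np.array([12.0, 0.0, 0.0]) + confident)
assert abs(probs.sum() - 1.0) < 1e-9
```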

Figure 3: The energy values of the known and unknown data points exhibit clear separation, as shown above. We fit a Weibull distribution to each of them, and use these distributions to identify unseen known and unknown samples, as explained in Sec. 4.3.
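A self-contained sketch of the identification rule in Sec. 4.3 on synthetic energies; the coarse grid-search Weibull fit and the synthetic energy values are our assumptions, not the paper's procedure:

```python
import numpy as np

def weibull_pdf(x, k, lam, loc):
    """Density of a shifted Weibull (shape k, scale lam, location loc)."""
    z = np.maximum(np.asarray(x, dtype=float) - loc, 1e-12) / lam
    return (k / lam) * z ** (k - 1) * np.exp(-z ** k)

def fit_weibull(samples):
    """Fit a shifted Weibull by a coarse grid search over the shape k, with
    the closed-form maximum-likelihood scale for each k."""
    x = np.asarray(samples, dtype=float)
    loc = x.min() - 1e-3          # shift so all samples have positive support
    z = x - loc
    best = None
    for k in np.linspace(0.3, 10.0, 200):
        lam = np.mean(z ** k) ** (1.0 / k)   # MLE scale given shape k
        ll = np.sum(np.log(np.maximum(weibull_pdf(x, k, lam, loc), 1e-300)))
        if best is None or ll > best[0]:
            best = (ll, k, lam, loc)
    return best[1:]

# Synthetic stand-ins for the held-out validation energies of Fig. 3:
# knowns sit at low free energy, unknowns at higher free energy.
rng = np.random.default_rng(0)
e_known = -8.0 + rng.weibull(2.0, 500)
e_unknown = -2.0 + rng.weibull(2.0, 500)
kn = fit_weibull(e_known)       # parameters of xi_kn
unk = fit_weibull(e_unknown)    # parameters of xi_unk

def is_unknown(energy):
    # Label a prediction unknown when xi_kn(f) < xi_unk(f).
    return weibull_pdf(energy, *kn) < weibull_pdf(energy, *unk)

assert not is_unknown(-7.2)     # near the known mode
assert is_unknown(-1.2)         # near the unknown mode
```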

4.4. Alleviating Forgetting

After the identification of unknowns, an important requisite for an open world detector is the ability to learn new classes when labelled examples of some of the unknown classes of interest are provided. Importantly, the training data for the previous tasks will not be present at this stage, since retraining from scratch is not a feasible solution. Training with only the new class instances will lead to catastrophic forgetting 38, 39 of the previous classes. We note that a number of involved approaches have been developed to alleviate such forgetting, including methods based on parameter regularization 40, 41, 42, 43, exemplar replay 44, 45, 46, 47, dynamically expanding networks 48, 49, 50 and meta-learning 51, 52.
We build on recent insights from 53, 54, 55, which compare the importance of example replay against other, more complex solutions. Specifically, Prabhu et al. 53 revisit the progress made by complex continual learning methodologies and show that a greedy exemplar selection strategy for replay in incremental learning consistently outperforms the state-of-the-art methods by a large margin. Knoblauch et al. 54 develop a theoretical justification for the unwarranted power of replay methods: they prove that an optimal continual learner solves an NP-hard problem and requires infinite memory. Storing a few examples and replaying them has also been found effective in the related few-shot object detection setting by Wang et al. 55. This motivates us to use a relatively simple methodology in ORE to mitigate forgetting: we store a balanced set of exemplars and finetune the model on these after each incremental step. At each point, we ensure that a minimum of $N_{ex}$ instances for each class are present in the exemplar set.
?我們借鑒了53、54、55的最新研究成果,這些研究對比了樣本回放策略與其他復雜解決方案的重要性。具體而言,Prabhu等人53回溯了復雜持續學習方法的進展,表明增量學習中貪心樣本選擇策略的回放方法始終以顯著優勢超越最先進方法。Knoblauch等人54從理論上論證了回放方法出乎意料的效能,證明最優持續學習器需要解決NP難問題且需無限內存。Wang等人55在相關的小樣本目標檢測場景中發現,存儲少量樣本并進行回放具有顯著效果。這些發現促使我們在ORE(開放世界目標檢測器)中采用相對簡單的方法來緩解遺忘問題:即存儲平衡的樣本集,并在每個增量步驟后對這些樣本進行模型微調。我們始終確保樣本集中每類至少保留 $N_{ex}$ 個實例。
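The balanced exemplar-set construction described above can be sketched as a simple greedy pass over the training images. This is only an illustrative sketch, assuming each image is given as an (image_id, class_labels) pair; the function name `select_exemplars` and the data layout are hypothetical, not taken from the official OWOD code.

```python
import random

def select_exemplars(dataset, n_ex=50, seed=0):
    """Greedily keep images until every class has at least n_ex instances.

    dataset: list of (image_id, class_labels) pairs, where class_labels
    lists the classes of the object instances annotated in that image.
    """
    random.seed(seed)
    classes = {c for _, labels in dataset for c in labels}
    counts = {c: 0 for c in classes}   # instances collected per class
    exemplars = []
    pool = list(dataset)
    random.shuffle(pool)
    for image_id, labels in pool:
        # keep the image only if it still helps an under-represented class
        if any(counts[c] < n_ex for c in labels):
            exemplars.append(image_id)
            for c in labels:
                counts[c] += 1
        if all(v >= n_ex for v in counts.values()):
            break                      # every class is covered
    return exemplars, counts
```

After each incremental task, the model would then be finetuned only on the images kept in `exemplars`.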

5. Experiments and Results 實驗與結果

?We propose a comprehensive evaluation protocol to study the performance of an open world detector to identify unknowns, detect known classes and progressively learn new classes when labels are provided for some unknowns.
?我們提出了一套全面的評估方案,用于研究開放世界檢測器在以下方面的性能:識別未知類別、檢測已知類別,以及在部分未知樣本獲得標注時逐步學習新類別。

5.1. Open World Evaluation Protocol 開放世界評估協議

?Data split: We group classes into a set of tasks $\mathcal{T} = \{T_1, \cdots, T_t, \cdots\}$. All the classes of a specific task are introduced to the system at a point of time $t$. While learning $T_t$, all the classes of $\{T_\tau : \tau \le t\}$ will be treated as known and $\{T_\tau : \tau > t\}$ will be treated as unknown. For a concrete instantiation of this protocol, we consider classes from Pascal VOC 11 and MS-COCO 12. We group all VOC classes and data as the first task $T_1$. The remaining 60 classes of MS-COCO 12 are grouped into three successive tasks with semantic drifts (see Tab. 1). All images which correspond to the above split from the Pascal VOC and MS-COCO train sets form the training data. For evaluation, we use the Pascal VOC test split and the MS-COCO val split. 1k images from the training data of each task are kept aside for validation. Data splits and code can be found at https://github.com/JosephKJ/OWOD.
?數據劃分:我們將類別分組為一系列任務集合 $\mathcal{T} = \{T_1, \cdots, T_t, \cdots\}$。特定任務的所有類別將在時間點 $t$ 引入系統。在學習 $T_t$ 時,$\{T_\tau : \tau \le t\}$ 的所有類別將被視為已知類別,而 $\{T_\tau : \tau > t\}$ 將被視為未知類別。為具體說明該方案,我們采用 Pascal VOC 11 和 MS-COCO 12 的類別數據。將所有 VOC 類別及數據歸為第一個任務 $T_1$,MS-COCO 12 剩余的60個類別按語義偏移分為三個連續任務(見表1)。來自 Pascal VOC 和 MS-COCO 訓練集的上述劃分圖像構成訓練數據。評估時使用 Pascal VOC 測試集劃分和 MS-COCO 驗證集劃分。每個任務的訓練數據中保留1千張圖像作為驗證集。數據劃分與代碼詳見 https://github.com/JosephKJ/OWOD。
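Under this protocol, the known/unknown status of a class while learning task $T_t$ is purely a function of which task introduced it. A minimal sketch (the class-to-task mapping and the class names below are illustrative only):

```python
def class_status(task_of_class, current_task):
    """While learning task T_t, classes from tasks tau <= t are 'known'
    and classes from future tasks tau > t are 'unknown'."""
    return {c: ("known" if tau <= current_task else "unknown")
            for c, tau in task_of_class.items()}
```

For example, while learning Task 2, the VOC classes from Task 1 and the newly introduced Task 2 classes are known, while Task 3 and Task 4 classes are still unknown.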

表1:該表格展示了擬議的開放世界評估方案中的任務構成。顯示了每個任務的語義內容及各數據劃分中的圖像數量和實例(物體)數量。

?Evaluation metrics: Since an unknown object easily gets confused with a known object, we use the Wilderness Impact (WI) metric 15 to explicitly characterise this behaviour:
?評估指標:由于未知物體容易被誤認為已知物體,我們使用荒野影響(WI)指標15來明確描述這種行為。
$$Wilderness\ Impact\ (WI) = \frac{P_\mathcal{K}}{P_{\mathcal{K}\,\cup\,\mathcal{U}}} - 1, \tag{5}$$
where $P_\mathcal{K}$ refers to the precision of the model when evaluated on known classes and $P_{\mathcal{K}\,\cup\,\mathcal{U}}$ is the precision when evaluated on known and unknown classes, measured at a recall level $R$ ($0.8$ in all experiments). Ideally, WI should be low, since the precision must not drop when unknown objects are added to the test set. Besides WI, we also use the Absolute Open-Set Error (A-OSE) 16 to report the number of unknown objects that get wrongly classified as any of the known classes. Both WI and A-OSE implicitly measure how effective the model is in handling unknown objects.
其中,$P_\mathcal{K}$ 表示模型在已知類別上的精確率,$P_{\mathcal{K}\,\cup\,\mathcal{U}}$ 為模型在已知與未知類別聯合測試集上的精確率(所有實驗的召回率 $R$ 固定為 $0.8$)。理想情況下,WI值應較小,因為當測試集加入未知對象時,精確率不應下降。除WI外,我們還采用絕對開放集誤差(A-OSE)16來統計被錯誤分類為任何已知類別的未知對象數量。WI和A-OSE均隱式衡量模型處理未知對象的有效性。
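Both metrics are straightforward to compute once the precision values and the per-detection labels are available. The helpers below are a sketch of Eqn. 5 and of the A-OSE count; the input formats are assumptions for illustration, not taken from the paper's evaluation code.

```python
def wilderness_impact(p_known, p_known_and_unknown):
    """WI = P_K / P_{K ∪ U} - 1, measured at a fixed recall level (0.8 in the paper)."""
    return p_known / p_known_and_unknown - 1.0

def absolute_open_set_error(matches):
    """A-OSE: count of unknown objects classified as some known class.

    matches: list of (ground_truth_label, predicted_label) pairs.
    """
    return sum(1 for gt, pred in matches
               if gt == "unknown" and pred != "unknown")
```

If adding unknowns leaves the precision untouched ($P_{\mathcal{K}\,\cup\,\mathcal{U}} = P_\mathcal{K}$), WI is 0; any drop in precision caused by unknowns makes WI positive.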
?In order to quantify the incremental learning capability of the model in the presence of new labeled classes, we measure the mean Average Precision (mAP) at an IoU threshold of $0.5$ (consistent with the existing literature 56, 57).
?為了量化模型在面對新增標注類別時的增量學習能力,我們采用交并比(IoU)閾值為 $0.5$ 時的平均精度均值(mAP)作為評估指標(與現有文獻56, 57的測評標準保持一致)。
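For reference, a detection counts as a true positive for mAP when its IoU with a same-class ground-truth box reaches the threshold. A minimal IoU helper, assuming the common (x1, y1, x2, y2) corner convention for boxes:

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def is_true_positive(pred_box, gt_box, threshold=0.5):
    """Match criterion used when accumulating AP at IoU 0.5."""
    return iou(pred_box, gt_box) >= threshold
```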

5.2. Implementation Details 實施細節

?ORE re-purposes the standard Faster R-CNN 3 object detector with a ResNet-50 58 backbone. To handle a variable number of classes in the classification head, following incremental classification methods 51, 52, 44, 46, we assume a bound on the maximum number of classes to expect, and modify the loss to take into account only the classes of interest. This is done by setting the classification logits of the unseen classes to a large negative value ($-v$), thus making their contribution to the softmax negligible ($e^{-v} \to 0$).
?ORE采用了標準的Faster R-CNN3目標檢測器,并配備ResNet-5058主干網絡。為處理分類頭中可變類別數量的挑戰,該方法遵循增量分類技術51, 52, 44, 46的通用處理策略:預先設定最大預期類別數上限,并通過修改損失函數使其僅關注當前階段的待識別類別。具體實現時將未見類別的分類logit設為較大的負值($-v$),使其對softmax函數的貢獻趨近于零($e^{-v} \to 0$)。
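The logit-masking trick can be illustrated with a tiny softmax example. This is a sketch of the idea only, in plain Python rather than the detector's actual loss code; `v` is the assumed large positive constant whose negation masks the unseen classes.

```python
import math

def softmax_with_masked_unseen(logits, seen, v=1e9):
    """Set logits of not-yet-seen classes to -v so that their softmax
    weight e^{-v} underflows to 0 and they never influence the loss."""
    masked = [z if i in seen else -v for i, z in enumerate(logits)]
    m = max(masked)                       # for numerical stability
    exps = [math.exp(z - m) for z in masked]
    total = sum(exps)
    return [e / total for e in exps]
```

With a 4-way head but only classes {0, 1} introduced so far, the probabilities of classes 2 and 3 collapse to exactly zero, so the cross-entropy loss is computed over the introduced classes alone.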
?The 2048-dim feature vector which comes from the last residual block in the RoI Head is used for contrastive clustering. The contrastive loss (defined in Eqn. 1) is added to the standard Faster R-CNN classification and localization losses and jointly optimised. While learning a task $T_i$, only the classes that are part of $T_i$ will be labelled. While testing $T_i$, all the classes that were previously introduced are labelled along with the classes in $T_i$, and all classes of future tasks will be labelled 'unknown'. For the exemplar replay, we empirically choose $N_{ex} = 50$. We do a sensitivity analysis on the size of the exemplar memory in Sec. 6. Further implementation details are provided in the supplementary.
?來自RoI Head最后一個殘差塊的2048維特征向量被用于對比聚類。對比損失(公式1定義)被添加到標準的Faster R-CNN分類和定位損失中,并進行聯合優化。在學習任務 $T_i$ 時,只有屬于 $T_i$ 的類別會被標注。在測試 $T_i$ 時,所有先前引入的類別將與 $T_i$ 中的類別一起被標注,而未來任務的所有類別將被標記為 'unknown'。對于樣本回放,我們根據經驗選擇 $N_{ex} = 50$。我們在第6節中對樣本記憶庫的規模進行了敏感性分析。更多實現細節見補充材料。
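The contrastive clustering loss referenced above (Eqn. 1 of the paper) pulls an RoI feature towards its own class prototype and pushes it at least a margin away from every other prototype. The sketch below is a simplified, hedged reconstruction in plain Python, not the paper's implementation; the margin name `delta` and the plain Euclidean distance are assumptions.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_cluster_loss(feature, label, prototypes, delta=1.0):
    """Hinge-style clustering loss: the distance to the own-class
    prototype, plus max(0, delta - distance) for every other prototype."""
    loss = 0.0
    for i, proto in enumerate(prototypes):
        d = euclidean(feature, proto)
        loss += d if i == label else max(0.0, delta - d)
    return loss
```

The loss is zero only when the feature sits on its own prototype and all other prototypes are at least `delta` away, which is what produces the well-separated clusters discussed in Sec. 6.4.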

5.3. Open World Object Detection Results 開放世界目標檢測結果

?Table 2 shows how ORE compares against Faster R-CNN on the proposed open world evaluation protocol. An 'Oracle' detector has access to all known and unknown labels at any point, and serves as a reference. After learning each task, the WI and A-OSE metrics are used to quantify how unknown instances are confused with any of the known classes. We see that ORE has significantly lower WI and A-OSE scores, owing to an explicit modeling of the unknown. When unknown classes are progressively labelled in Task 2, we see that the performance of the baseline detector on the known set of classes (quantified via mAP) significantly deteriorates from $56.16\%$ to $4.076\%$. The proposed balanced finetuning is able to restore the previous-class performance to a respectable level ($51.09\%$) at the cost of increased WI and A-OSE, whereas ORE is able to achieve both goals: detect known classes and comprehensively reduce the effect of the unknown. A similar trend is seen when Task 3 classes are added. WI and A-OSE scores cannot be measured for Task 4 because of the absence of any unknown ground-truths. We report qualitative results in Fig. 4 and the supplementary section, along with failure case analysis. We conduct extensive sensitivity analysis in Sec. 6 and the supplementary section.
?表2展示了ORE在提出的開放世界評估協議上與Faster R-CNN的對比情況。"Oracle"檢測器在任何時刻都能獲取所有已知和未知標簽,作為參考基準。每學習完一個任務后,采用WI和A-OSE指標量化未知實例與已知類別的混淆程度。實驗表明,由于對未知類的顯式建模,ORE的WI和A-OSE得分顯著更低。當任務2中逐步標記未知類別時,基線檢測器在已知類別上的性能(通過mAP衡量)從 $56.16\%$ 急劇下降至 $4.076\%$。提出的平衡微調方法以WI和A-OSE上升為代價,將已知類別性能恢復至可觀水平($51.09\%$),而ORE則能同時實現兩個目標:準確檢測已知類別并全面降低未知類別影響。在添加任務3類別時也觀察到類似趨勢。由于任務4不存在任何未知真實標簽,無法測量WI和A-OSE得分。我們在圖4和補充材料中給出了定性結果及失敗案例分析,并在第6節和補充材料中進行了全面的敏感性分析。

表2:此處展示ORE在開放世界目標檢測任務中的表現。Wilderness Impact(WI)和Absolute Open-Set Error(A-OSE)量化ORE處理未知類別(灰色背景)的能力,而Mean Average Precision(mAP)衡量其檢測已知類別(白色背景)的性能。可見ORE在所有指標上均持續超越基于Faster R-CNN的基線方法。具體評估指標的詳細分析與說明請參閱第5.3節。


表3:我們在三種不同設置下將ORE與最先進的增量目標檢測器進行對比。分別向已在10、15和19個類別上訓練好的檢測器(藍色背景所示)引入來自Pascal VOC 2007數據集11的10個、5個及最后一個新增類別。ORE無需調整方法即可在所有設置中取得優異表現,詳情請參閱第5.4節。


圖4:ORE在任務1訓練后的預測結果。模型尚未接觸“大象”、“蘋果”、“香蕉”、“斑馬”和“長頸鹿”等類別,因此這些對象被成功歸類為“unknown”。該方法將其中一只“長頸鹿”誤判為“馬”,顯示出ORE的局限性。

5.4. Incremental Object Detection Results 增量目標檢測結果

?We find an interesting consequence of the ability of ORE to distinctly model unknown objects: it performs favourably against the state-of-the-art on the incremental object detection (iOD) task (Tab. 3). This is because ORE reduces the confusion of an unknown object being classified as a known object, which lets the detector incrementally learn the true foreground objects. We use the standard protocol 56, 57 used in the iOD domain to evaluate ORE, where groups of classes (10, 5 and the last class) from Pascal VOC 2007 11 are incrementally learned by a detector trained on the remaining set of classes. Remarkably, ORE is used as-is, without any change to the methodology introduced in Sec. 4. Ablating contrastive clustering (CC) and energy based unknown identification (EBUI), we find that removing them results in reduced performance compared to standard ORE.
?我們發現ORE對未知物體獨特建模能力帶來一個有趣的結果:在增量式物體檢測(iOD)任務中,其性能顯著優于現有最佳方法(表3)。這是因為ORE減少了未知物體被誤分類為已知物體的情況,使檢測器能逐步學習真實的前景物體。我們采用iOD領域標準協議56, 57來評估ORE,該方法通過已在其他類別上訓練的檢測器,逐步學習Pascal VOC 200711中的類別組(10類、5類和最后一類)。值得注意的是,此處使用的ORE完全保持第4節所述方法未做任何改動。通過消融對比聚類(CC)和基于能量的未知標識(EBUI)發現,移除這些組件會導致性能低于標準ORE。

6. Discussions and Analysis 討論與分析

6.1 Ablating ORE Components: 消融ORE組件

?To study the contribution of each of the components in ORE, we design careful ablation experiments (Tab. 4). We consider the setting where Task 1 is introduced to the model. The auto-labelling methodology (referred to as ALU), combined with energy based unknown identification (EBUI), performs better together (row 5) than using either of them separately (rows 3 and 4). Adding contrastive clustering (CC) to this configuration gives the best performance in handling unknowns (row 7), measured in terms of WI and A-OSE. There is no severe performance drop in known-class detection (mAP metric) as a side effect of unknown identification. In row 6, we see that EBUI is a critical component whose absence increases the WI and A-OSE scores. Thus, each component in ORE has a critical role to play in unknown identification.
?為研究ORE模型中各組成部分的貢獻,我們設計了嚴謹的消融實驗(表4)。實驗設定為向模型引入任務1的情況。自動標注方法(簡稱ALU)與基于能量的未知樣本識別(EBUI)聯合使用(第5行)的表現優于單獨使用任一方法(第3、4行)。在此配置中加入對比聚類(CC)后,通過WI和A-OSE指標衡量,獲得了最佳的未知樣本處理效果(第7行)。作為未知識別的附加效應,已知類別的檢測性能(mAP指標)未出現顯著下降。第6行數據顯示,EBUI是關鍵組件,缺失該組件將導致WI和A-OSE指標上升。因此,ORE中的每個組件對未知樣本識別都發揮著不可替代的作用。

表4:我們仔細分析了ORE的各個組成部分。CC、ALU和EBUI分別指“對比聚類”、“未知類別自動標注”和“基于能量的未知標識器”。更多細節請參閱第6.1節。

6.2 Sensitivity Analysis on Exemplar Memory Size: 范例記憶庫規模的敏感性分析

?Our balanced finetuning strategy requires storing exemplar images with at least $N_{ex}$ instances per class. We vary $N_{ex}$ while learning Task 2 and report the results in Table 5. We find that balanced finetuning is very effective in improving the accuracy of the previously known classes, even with just a minimum of $10$ instances per class. However, we find that increasing $N_{ex}$ to large values does not help, and at the same time adversely affects how unknowns are handled (evident from the WI and A-OSE scores). Hence, by validation, we set $N_{ex}$ to $50$ in all our experiments, which is a sweet spot that balances performance on known and unknown classes.
?我們的平衡微調策略要求存儲每類至少 $N_{ex}$ 個示例圖像。在學習任務2的過程中,我們調整 $N_{ex}$ 值并在表5中報告結果。研究發現,即使每類僅保留 $10$ 個樣本實例,平衡微調對提升已知類準確率也非常有效。但值得注意的是,過度增大 $N_{ex}$ 值不僅無益,反而會影響模型處理未知類的性能(這一點從WI和A-OSE評分可明顯看出)。經實驗驗證,我們最終在所有測試中將 $N_{ex}$ 設定為 $50$,這是兼顧已知類與未知類性能的最佳平衡點。

表5:該表顯示了敏感性分析。大幅增加 $N_{ex}$ 會損害未知樣本上的性能,而少量圖像對于緩解遺忘至關重要(最佳行以綠色標出)。

6.3 Comparison with an Open Set Detector: 與開放式集檢測器的比較

?The mAP values of the detector when it is evaluated on closed set data (trained and tested on Pascal VOC 2007) and open set data (where the test set contains an equal number of unknown images from MS-COCO) help to measure how the detector handles unknown instances. Ideally, there should not be a performance drop. We compare ORE against the recent open set detector proposed by Miller et al. 16. We find from Tab. 6 that the drop in performance of ORE is much lower than that of 16, owing to the effective modelling of unknown instances.
?檢測器在閉集數據(基于Pascal VOC 2007訓練測試)和開集數據(測試集包含等量MS-COCO未知圖像)上評估的mAP值,可衡量其處理未知實例的能力。理想情況下性能不應下降。我們將ORE與Miller等人16提出的最新開集檢測器進行對比。從表6可見,由于對未知實例的有效建模,ORE的性能下降幅度遠低于16

表6:與開放集物體檢測器的性能對比。ORE能夠顯著減少mAP值下降幅度。

6.4 Clustering loss and t-SNE 59 visualization: 聚類損失和t-SNE59可視化

?We visualise the quality of the clusters that are formed while training with the contrastive clustering loss (Eqn. 1) for Task 1. We see nicely formed clusters in Fig. 5 (a). Each number in the legend corresponds to one of the $20$ classes introduced in Task 1; label $20$ denotes the unknown class. Importantly, we see that the unknown instances also get clustered, which reinforces the quality of the auto-labelled unknowns used in contrastive clustering. In Fig. 5 (b), we plot the contrastive clustering loss against training iterations, where we see a gradual decrease, indicative of good convergence.
?我們可視化在任務1中使用對比聚類損失(公式1)訓練時形成的聚類質量。圖5(a)顯示出良好形成的聚類簇。圖例中的每個數字對應任務1中引入的 $20$ 個類別之一,標簽 $20$ 表示未知類別。值得注意的是,我們看到未知實例同樣形成了聚類,這印證了對比聚類中自動標注未知樣本的質量。圖5(b)繪制了對比聚類損失隨訓練迭代的變化曲線,可見損失值逐漸下降,表明模型具有良好的收斂性。

圖5:(a) 潛在空間中的不同聚類。(b) 我們的對比損失確保這種聚類穩定收斂。

7. Conclusion 結論

?The vibrant object detection community has pushed the performance benchmarks on standard datasets by a large margin. The closed-set nature of these datasets and evaluation protocols, hampers further progress. We introduce Open World Object Detection, where the object detector is able to label an unknown object as unknown and gradually learn the unknown as the model gets exposed to new labels. Our key novelties include an energy-based classifier for unknown detection and a contrastive clustering approach for open world learning. We hope that our work will kindle further research along this important and open direction.
?充滿活力的目標檢測研究界已在標準數據集上將性能基準大幅提升。然而,這些數據集和評估協議的封閉性阻礙了進一步突破。我們提出"開放世界目標檢測"新范式,當檢測器遇到未知物體時能夠將其標記為"未知",并在模型接觸新標簽后逐步學習識別這些未知對象。該研究的關鍵創新包括:基于能量的未知對象分類器,以及采用對比聚類方法的開放世界學習框架。我們期望這項工作能在這個重要且開放的研究方向上點燃更多探索火花。

Acknowledgements 致謝

?We thank TCS for supporting KJJ through its PhD fellowship; MBZUAI for a start-up grant; VR starting grant (2016-05543) and DST, Govt of India, for partly supporting this work through IMPRINT program (IMP/2019/000250). We thank our anonymous reviewers for their valuable feedback.

References


  1. Manoj Acharya, Tyler L. Hayes, and Christopher Kanan. RODEO: Replay for online object detection. In The British Machine Vision Conference, 2020.

  2. Ross Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.

  3. Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.

  4. Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.

  5. Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017.

  6. Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.

  7. John A Meacham. Wisdom and the context of knowledge: Knowing that one doesn't know. On the Development of Developmental Psychology, 8:111–134, 1983.

  8. Mario Livio. Why?: What Makes Us Curious. Simon and Schuster, 2017.

  9. Susan Engel. Children's need to know: Curiosity in schools. Harvard Educational Review, 81(4):625–645, 2011.

  10. Brian Grazer and Charles Fishman. A Curious Mind: The Secret to a Bigger Life. Simon and Schuster, 2016.

  11. Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The PASCAL visual object classes (VOC) challenge. IJCV, 88(2):303–338, 2010.

  12. Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740–755. Springer, 2014.

  13. Walter J Scheirer, Anderson de Rezende Rocha, Archana Sapkota, and Terrance E Boult. Toward open set recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(7):1757–1772, 2012.

  14. Abhijit Bendale and Terrance Boult. Towards open world recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1893–1902, 2015.

  15. Akshay Dhamija, Manuel Gunther, Jonathan Ventura, and Terrance Boult. The overlooked elephant of object detection: Open set. In The IEEE Winter Conference on Applications of Computer Vision, pages 1021–1030, 2020.

  16. Dimity Miller, Lachlan Nicholson, Feras Dayoub, and Niko Sünderhauf. Dropout sampling for robust object detection in open-set conditions. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1–7. IEEE, 2018.

  17. Walter J Scheirer, Anderson de Rezende Rocha, Archana Sapkota, and Terrance E Boult. Toward open set recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 7(35):1757–1772, 2013.

  18. Lalit P Jain, Walter J Scheirer, and Terrance E Boult. Multi-class open set recognition using probability of inclusion. In European Conference on Computer Vision, pages 393–409. Springer, 2014.

  19. Walter J Scheirer, Lalit P Jain, and Terrance E Boult. Probability models for open set recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(11):2317–2324, 2014.

  20. Abhijit Bendale and Terrance E Boult. Towards open set deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1563–1572, 2016.

  21. Zongyuan Ge, Sergey Demyanov, Zetao Chen, and Rahil Garnavi. Generative OpenMax for multi-class open set classification. In British Machine Vision Conference 2017. British Machine Vision Association and Society for Pattern Recognition, 2017.

  22. Ziwei Liu, Zhongqi Miao, Xiaohang Zhan, Jiayun Wang, Boqing Gong, and Stella X Yu. Large-scale long-tailed recognition in an open world. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2537–2546, 2019.

  23. Shiyu Liang, Yixuan Li, and R Srikant. Enhancing the reliability of out-of-distribution image detection in neural networks. In International Conference on Learning Representations, 2018.

  24. Stanislav Pidhorskyi, Ranya Almohsen, and Gianfranco Doretto. Generative probabilistic novelty detection with adversarial autoencoders. In Advances in Neural Information Processing Systems, pages 6822–6833, 2018.

  25. Pramuditha Perera, Vlad I. Morariu, Rajiv Jain, Varun Manjunatha, Curtis Wigington, Vicente Ordonez, and Vishal M. Patel. Generative-discriminative feature representations for open-set recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.

  26. Ryota Yoshihashi, Wen Shao, Rei Kawakami, Shaodi You, Makoto Iida, and Takeshi Naemura. Classification-reconstruction learning for open-set recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.

  27. Federico Pernici, Federico Bartoli, Matteo Bruni, and Alberto Del Bimbo. Memory based online learning of deep representations from video streams. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2324–2334, 2018.

  28. Hu Xu, Bing Liu, Lei Shu, and P Yu. Open-world learning and application to product classification. In The World Wide Web Conference, pages 3413–3419, 2019.

  29. Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1137–1149, 2016.

  30. Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.

  31. Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.

  32. Dimity Miller, Feras Dayoub, Michael Milford, and Niko Sünderhauf. Evaluating merging strategies for sampling-based uncertainty techniques in object detection. In 2019 International Conference on Robotics and Automation (ICRA), pages 2348–2354. IEEE, 2019.

  33. David Hall, Feras Dayoub, John Skinner, Haoyang Zhang, Dimity Miller, Peter Corke, Gustavo Carneiro, Anelia Angelova, and Niko Sünderhauf. Probabilistic object detection: Definition and evaluation. In The IEEE Winter Conference on Applications of Computer Vision, pages 1031–1040, 2020.

  34. Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059, 2016.

  35. Kurt Hornik, Maxwell Stinchcombe, Halbert White, et al. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.

  36. Yann LeCun, Sumit Chopra, Raia Hadsell, M Ranzato, and F Huang. A tutorial on energy-based learning. Predicting Structured Data, 1(0), 2006.

  37. Weitang Liu, Xiaoyun Wang, John Owens, and Sharon Yixuan Li. Energy-based out-of-distribution detection. Advances in Neural Information Processing Systems, 33, 2020.

  38. Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation, volume 24, pages 109–165. Elsevier, 1989.

  39. Robert M French. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128–135, 1999.

  40. Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European Conference on Computer Vision (ECCV), pages 139–154, 2018.

  41. James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.

  42. Zhizhong Li and Derek Hoiem. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2935–2947, 2018.

  43. Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 3987–3995. JMLR.org, 2017.

  44. Arslan Chaudhry, Marc'Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient lifelong learning with A-GEM. In ICLR, 2019.

  45. Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. iCaRL: Incremental classifier and representation learning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5533–5542. IEEE, 2017.

  46. David Lopez-Paz and Marc'Aurelio Ranzato. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pages 6467–6476, 2017.

  47. Francisco M Castro, Manuel J Marín-Jiménez, Nicolás Guil, Cordelia Schmid, and Karteek Alahari. End-to-end incremental learning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 233–248, 2018.

  48. Arun Mallya and Svetlana Lazebnik. PackNet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7765–7773, 2018.

  49. Joan Serrà, Dídac Surís, Marius Miron, and Alexandros Karatzoglou. Overcoming catastrophic forgetting with hard attention to the task. arXiv preprint arXiv:1801.01423, 2018.

  50. Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.

  51. Jathushan Rajasegaran, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, and Mubarak Shah. iTAML: An incremental task-agnostic meta-learning approach. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13588–13597, 2020.

  52. Joseph KJ and Vineeth Nallure Balasubramanian. Meta-consolidation for continual learning. Advances in Neural Information Processing Systems, 33, 2020.

  53. Ameya Prabhu, Philip HS Torr, and Puneet K Dokania. GDumb: A simple approach that questions our progress in continual learning. In European Conference on Computer Vision, pages 524–540. Springer, 2020.

  54. Jeremias Knoblauch, Hisham Husain, and Tom Diethe. Optimal continual learning has perfect memory and is NP-hard. arXiv preprint arXiv:2006.05188, 2020.

  55. Xin Wang, Thomas E Huang, Trevor Darrell, Joseph E Gonzalez, and Fisher Yu. Frustratingly simple few-shot object detection. arXiv preprint arXiv:2003.06957, 2020.

  56. Konstantin Shmelkov, Cordelia Schmid, and Karteek Alahari. Incremental learning of object detectors without catastrophic forgetting. In Proceedings of the IEEE International Conference on Computer Vision, pages 3400–3409, 2017.

  57. Can Peng, Kun Zhao, and Brian C. Lovell. Faster ILOD: Incremental learning for object detectors based on Faster R-CNN. Pattern Recognition Letters, 140:109–115, 2020.

  58. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

  59. Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

