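This post analyzes the research hotspots of about 50 papers posted in August 2025 under arXiv's Computer Vision and Pattern Recognition category, to help readers quickly get up to speed on recent frontier techniques and research directions in the field.
arXiv is one of the world's most influential open e-print platforms. It was funded by the U.S. National Science Foundation and the U.S. Department of Energy, was founded at Los Alamos National Laboratory in the United States, and is now operated and maintained by Cornell University. arXiv covers computer science, physics, mathematics, quantitative finance, and other disciplines. A growing number of researchers now post their latest results on arXiv before formal publication, which has greatly promoted exchange and sharing across the global research community.
This post was written by Xu Dongzhou (許東舟) and reviewed by Huang Xingyu (黃星宇) and Qiu Xue (邱雪).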
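1. Computer Vision and Pattern Recognition
Computer vision and pattern recognition hold a central place in computer science and artificial intelligence, and the two fields support each other and develop together. Computer vision aims to let computers automatically extract information from data such as images and video and to understand scenes and objects; typical tasks include object detection, image segmentation, pose estimation, and 3D reconstruction. Pattern recognition focuses on extracting features from data and building discriminative or generative models for decisions such as classification, clustering, matching, or anomaly detection.
As these technologies mature, they are steadily spreading into many industries. They are widely used in established tasks such as face recognition, logistics sorting, and traffic management, and they also lay the groundwork for frontier technologies such as embodied intelligence, autonomous driving, medical image analysis, and AIGC.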
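2. Hotspot Analysis
This post analyzes 50 recent papers in Computer Vision and Pattern Recognition posted on arXiv in August 2025. Figure 1 is a word cloud generated from the research topics that appear in the titles of all papers in this issue. Table 1 lists all 50 papers (in chronological order). To further bring out this issue's research hotspots, Table 2 tallies the 10 topic terms that occur most often in the paper titles, as a reference on research directions for researchers in related fields.
Figure 1  Word cloud of research hotspots in Computer Vision and Pattern Recognition, August 2025 issue
Table 1  Titles of the 50 papers in Computer Vision and Pattern Recognition, August 2025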
No. | Paper / project title |
1 | LongSplat: Robust Unposed 3D Gaussian Splatting for Casual Long Videos |
2 | Beyond Simple Edits: Composed Video Retrieval with Dense Modifications |
3 | Distilled-3DGS: Distilled 3D Gaussian Splatting |
4 | GeoSAM2: Unleashing the Power of SAM2 for 3D Part Segmentation |
5 | InfiniteTalk: Audio-driven Video Generation for Sparse-Frame Video Dubbing |
6 | Backdooring Self-Supervised Contrastive Learning by Noisy Alignment |
7 | Online 3D Gaussian Splatting Modeling with Novel View Selection |
8 | ResPlan: A Large-Scale Vector-Graph Dataset of 17,000 Residential Floor Plans |
9 | Self-Supervised Sparse Sensor Fusion for Long Range Perception |
10 | Physics-Based 3D Simulation for Synthetic Data Generation and Failure Analysis in Packaging Stability Assessment |
11 | OmViD: Omni-supervised active learning for video action detection |
12 | ROVR-Open-Dataset: A Large-Scale Depth Dataset for Autonomous Driving |
13 | RotBench: Evaluating Multimodal Large Language Models on Identifying Image Rotation |
14 | ViT-FIQA: Assessing Face Image Quality using Vision Transformers |
15 | DIME-Net: A Dual-Illumination Adaptive Enhancement Network Based on Retinex and Mixture-of-Experts |
16 | PhysGM: Large Physical Gaussian Model for Feed-Forward 4D Synthesis |
17 | SCRNet: Spatial-Channel Regulation Network for Medical Ultrasound Image Segmentation |
18 | Forecasting Smog Events Using ConvLSTM: A Spatio-Temporal Approach for Aerosol Index Prediction in South Asia |
19 | In-hoc Concept Representations to Regularise Deep Learning in Medical Imaging |
20 | RICO Two: Realistic Benchmarks and an In-Depth Analysis for Incremental Learning in Object Detection |
21 | RED.AI Id-Pattern: First Results of Stone Deterioration Patterns with Multi-Agent Systems |
22 | SAGA: Learning Signal-Aligned Distributions for Improved Text-to-Image Generation |
23 | Self-Aware Adaptive Alignment: Enabling Accurate Perception for Intelligent Transportation Systems |
24 | Unsupervised Urban Tree Biodiversity Mapping from Street-Level Imagery Using Spatially-Aware Visual Clustering |
25 | Timestep-Compressed Attack on Spiking Neural Networks through Timestep-Level Backpropagation |
26 | A Fully Transformer Based Multimodal Framework for Explainable Cancer Image Segmentation Using Radiology Reports |
27 | VisionLaw: Inferring Interpretable Intrinsic Dynamics from Visual Observations via Bilevel Optimization |
28 | Shape-from-Template with Generalised Camera |
29 | MR6D: Benchmarking 6D Pose Estimation for Mobile Robots |
30 | Mitigating Cross-Image Information Leakage in LVLMs for Multi-Image Tasks |
31 | Enhancing Targeted Adversarial Attacks on Large Vision-Language Models through Intermediate Projector Guidance |
32 | Hierarchical Vision-Language Retrieval of Educational Metaverse Content in Agriculture |
33 | Diversity-enhanced Collaborative Mamba for Semi-supervised Medical Image Segmentation |
34 | HumanPCR: Probing MLLM Capabilities in Diverse Human-Centric Scenes |
35 | DeH4R: A Decoupled and Hybrid Method for Road Network Graph Extraction |
36 | OmniTry: Virtual Try-On Anything without Masks |
37 | DiffIER: Optimizing Diffusion Models with Iterative Error Reduction |
38 | RCGNet: RGB-based Category-Level 6D Object Pose Estimation with Geometric Guidance |
39 | TalkVid: A Large-Scale Diversified Dataset for Audio-Driven Talking Head Synthesis |
40 | Two-Factor Authentication Smart Entryway Using Modified LBPH Algorithm |
41 | PersonaVlog: Personalized Multimodal Vlog Generation with Multi-Agent Collaboration and Iterative Self-Correction |
42 | Unleashing Semantic and Geometric Priors for 3D Scene Completion |
43 | Towards Efficient Vision State Space Models via Token Merging |
44 | Bridging Clear and Adverse Driving Conditions |
45 | Temporal-Conditional Referring Video Object Segmentation with Noise-Free Text-to-Video Diffusion Model |
46 | Generative Model-Based Feature Attention Module for Video Action Analysis |
47 | The 9th AI City Challenge |
48 | Learnable SMPLify: A Neural Solution for Optimization-Free Human Pose Inverse Kinematics |
49 | DictAS: A Framework for Class-Generalizable Few-Shot Anomaly Segmentation via Dictionary Lookup |
50 | Color Spike Data Generation via Bio-inspired Neuron-like Encoding with an Artificial Photoreceptor Layer |
Table 2  Top 10 high-frequency keywords
Keyword | Occurrences |
Image | 8 |
Segmentation | 6 |
3D | 6 |
Video | 6 |
Generation | 5 |
Gaussian / Gaussian Splatting | 4 |
LVLM / Vision-Language / VL | 4 |
Large Language Model / LLM | 3 |
Multimodal | 3 |
Pose | 3 |
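The post does not say how the keyword counts were produced. As a rough sketch only, one plausible way to build such a tally is to count how many titles mention each keyword at least once. The keyword-to-pattern mapping and the function name `count_keyword_titles` below are illustrative assumptions, not the authors' actual procedure, and the sample uses just five titles copied from Table 1.

```python
import re
from collections import Counter

# A handful of titles copied from Table 1; a real tally would use all 50.
TITLES = [
    "LongSplat: Robust Unposed 3D Gaussian Splatting for Casual Long Videos",
    "Distilled-3DGS: Distilled 3D Gaussian Splatting",
    "GeoSAM2: Unleashing the Power of SAM2 for 3D Part Segmentation",
    "ViT-FIQA: Assessing Face Image Quality using Vision Transformers",
    "MR6D: Benchmarking 6D Pose Estimation for Mobile Robots",
]

# Hypothetical keyword -> regex mapping (an assumption, not the authors' rules).
KEYWORD_PATTERNS = {
    "Image": r"\bimages?\b",
    "Segmentation": r"\bsegmentation\b",
    "3D": r"\b3d\b",
    "Video": r"\bvideos?\b",
    "Gaussian": r"\bgaussian\b",
    "Pose": r"\bpose\b",
}


def count_keyword_titles(titles, patterns):
    """Count, for each keyword, how many titles mention it at least once."""
    counts = Counter()
    for title in titles:
        for keyword, pattern in patterns.items():
            if re.search(pattern, title, flags=re.IGNORECASE):
                counts[keyword] += 1
    return counts


counts = count_keyword_titles(TITLES, KEYWORD_PATTERNS)
for keyword, n in counts.most_common():
    print(f"{keyword}\t{n}")
```

The resulting numbers depend heavily on the matching rules: for instance, whether "Imagery" or "Imaging" count toward "Image", or whether an acronym such as "3DGS" counts toward "3D". A sketch like this will therefore not necessarily reproduce the figures in Table 2.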
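3. Summary
Judging from the high-frequency keywords in this issue's arXiv Computer Vision and Pattern Recognition papers (see Table 2), the research hotspots show the following characteristics and trends: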
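The top keyword in this issue is "Image" (8 occurrences), which shows that images remain at the core of computer vision research. Image segmentation, image generation, object detection, and the construction of multimodal language models all depend on in-depth analysis and modeling of this basic element.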
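Next come "Segmentation", "3D", and "Video", tied for second place with 6 occurrences each. This reflects three important directions. First, segmentation remains a key problem in vision research and is an indispensable component in settings ranging from medical imaging to multimodal models. Second, interest in 3D vision remains high, with work covering 3D reconstruction, 3D segmentation, and 3D scene modeling, all of which have strong practical value. Third, video has become one of the new hotspots: from generation to retrieval to action analysis, both academia and industry are paying close attention to dynamic scenes.
"Generation" (5 occurrences) follows closely, showing that generative methods matter across image, video, and 3D modeling work. "Gaussian / Gaussian Splatting" appears 4 times, which suggests that this technique is becoming one of the most popular approaches in 3D modeling.
The frequent appearance of "LVLM / Vision-Language" (vision-language models, 4 occurrences) and "Large Language Model / LLM" (3 occurrences) reflects the rapid development of cross-modal and large-scale pretrained models. How to build more robust alignment between vision and language, and how to use large models to improve the generalization of vision tasks, are emerging as new research trends.
In addition, "Multimodal" and "Pose" each appear 3 times. Multimodal models emphasize the interaction and unified modeling of information across modalities, and are commonly seen in the fusion of multi-source data such as visual and textual inputs. Pose estimation, for its part, shows important practical value in scenarios such as human-computer interaction, virtual reality, and action recognition.
Overall, this issue's research hotspots center on image and video analysis, segmentation and 3D modeling, generative methods, and cross-modal applications of large models. As Gaussian splatting, diffusion models, and vision-language models continue to develop, computer vision is steadily moving closer to real-world applications. Future research can be expected to keep exploring generative vision, vision-language fusion, and general-purpose multimodal large models in greater depth.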