【AAAI 2025】 Local Conditional Controlling for Text-to-Image Diffusion Models

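Contents

- Abstract
- Keywords
- Authors and Research Team
- Project Page
- 01 Open Problems in the Research Area
- 02 Core Problem Addressed by the Paper
- 03 Key Solution
- 04 Main Contributions
- 05 Related Work
- 06 Implementation Details
- 07 Experimental Design
- 08 Results and Comparison
- 09 Ablation Findings
- 10 Future Directions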

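Abstract

This paper tackles local control for text-to-image diffusion models and proposes a training-free method that operates at inference time. Existing global control methods (e.g., ControlNet) cannot flexibly constrain a specific region, and naively supplying a local condition leads to “local control dominance” (Fig. 2), where alignment with the text prompt in non-control regions is neglected. The authors design three modules: Regional Discriminate Loss (RDLoss, Eq. 5), Focused Token Response (FTR, Eq. 8), and Feature Mask Constraint (FMC, Eq. 9). RDLoss updates the latent by maximizing the attention gap between local and non-local regions, FTR suppresses weakly responding tokens to reduce duplication, and FMC masks ControlNet features so they do not leak outside the control region. Experiments on COCO and Attend-Condition show close alignment with both the local condition and the text prompt (FID 21.86±0.48, CLIP T2T 0.801±0.006), addressing the core challenges of structural distortion and missing concepts under local control (Figs. 1, 5).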

Keywords

Text-to-Image Diffusion, Local Control, Attention Modulation, Diffusion Model, Controllable Generation

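Authors and Research Team

The work is a collaboration among the State Key Lab of CAD&CG at Zhejiang University, Fabu Inc., Tencent, and other institutions.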

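Project Page

No public code is listed for the paper, but it states that the method builds on the Stable Diffusion and ControlNet frameworks; experimental details are given in Sections 4.1-4.3.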

01 Open Problems in the Research Area

Existing text-to-image diffusion models with global controls (e.g., ControlNet) lack fine-grained localization. Applying a local condition directly causes “local control dominance” (Fig. 2): the model over-focuses on the controlled region and neglects the objects that the text prompt asks for in non-control areas. Global control methods (Section 2) also fail to balance structural fidelity and text consistency in localized regions, leading to concept omission or duplication (Table 1, Fig. 5). For example, with the prompt “a cat and a dog at the seaside”, ControlNet generates only the locally controlled dog structure and ignores the cat entirely (Fig. 2, left).

02 Core Problem Addressed by the Paper

The paper addresses local control dominance in text-to-image generation: how to enforce a user-defined local condition (e.g., a canny edge map of a cat) while keeping the non-control regions (e.g., dog, seaside) aligned with the text prompt. Existing methods either ignore non-control concepts (Fig. 2) or introduce artifacts due to feature inconsistency (Section 3.5). In Fig. 1, for example, global control with a mask still fails to generate a non-control region that matches the “toy car” prompt, whereas the proposed method satisfies both the local structure and the global text.

03 Key Solution

Three inference-stage techniques (Fig. 3):

Regional Discriminate Loss (RDLoss) (Eq. 5): Maximizes the attention discrepancy between local and non-local regions for the control concept $C_{\text{control}}^t$ (Eq. 4), guiding latent updates to regenerate ignored objects. For the “dog” token, for example, the attention maximum in the non-control region is forced above the maximum inside the local region (Fig. 2, right).
Focused Token Response (FTR) (Eq. 8): Suppresses weak attention scores via token-wise max suppression, which weakens the attention of weakly responding patches and reduces object duplication. In the “coffee cup + teddy bear” scene, the background tokens that are not the maximum response are suppressed (Fig. 4).
Feature Mask Constraint (FMC) (Eq. 9): Applies the control mask to the ControlNet output features, so that blank features from non-control regions do not degrade quality. The post reports that FMC lowers LPIPS, citing 12.67±1.03 (Table 1).
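To make the first item more concrete, here is a minimal sketch of control concept matching followed by a region-discriminating loss on toy attention maps. It is written under assumptions: the function names, the nested-list layout `attn[token][y][x]`, and the exact loss form `1 - gap` are hypothetical illustrations, not the paper's Eq. 4 or Eq. 5. The only ideas taken from the text are that the control concept is chosen by its attention sum inside the control region, and that the loss rewards a control token peaking inside the region and every other token peaking outside it.

```python
# Toy sketch (assumed form, not the paper's exact equations).
# attn maps a token to a 2D attention map; mask is 1 inside the control region.

def region_values(attn_map, mask, inside):
    """Collect attention values inside (or outside) the binary control mask."""
    return [
        a
        for row_a, row_m in zip(attn_map, mask)
        for a, m in zip(row_a, row_m)
        if bool(m) == inside
    ]


def match_control_concept(attn, mask):
    """Pick the token whose attention summed over the control region is largest."""
    sums = {tok: sum(region_values(amap, mask, True)) for tok, amap in attn.items()}
    return max(sums, key=sums.get)


def rd_loss(attn, mask, control_token):
    """Hypothetical regional discriminate loss; lower means better separation."""
    loss = 0.0
    for tok, amap in attn.items():
        local_max = max(region_values(amap, mask, True))
        nonlocal_max = max(region_values(amap, mask, False))
        if tok == control_token:
            gap = local_max - nonlocal_max   # control concept should peak inside
        else:
            gap = nonlocal_max - local_max   # other concepts should peak outside
        loss += 1.0 - gap
    return loss


mask = [[1, 0], [1, 0]]  # left column is the control region

# "Local control dominance": both tokens attend mostly to the control region.
dominated = {"cat": [[0.9, 0.1], [0.7, 0.2]], "dog": [[0.6, 0.3], [0.5, 0.2]]}
# Desired state: "dog" attends mostly outside the control region.
balanced = {"cat": [[0.9, 0.1], [0.7, 0.2]], "dog": [[0.1, 0.8], [0.2, 0.6]]}

control = match_control_concept(dominated, mask)
loss_dominated = rd_loss(dominated, mask, control)  # about 1.6
loss_balanced = rd_loss(balanced, mask, control)    # about 0.7
```

In an actual pipeline the attention maps would come from the UNet cross-attention layers and the loss would be differentiated with respect to the latent; the toy version only shows why the loss is lower once the non-control concept has moved out of the control region.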

04 Main Contributions

New Task: Defines “local control” as region-specific structural guidance combined with a text prompt (Fig. 1), filling the gap between global control and local editing.
Training-free Solution: Three modules address dominance, duplication, and feature inconsistency at inference time, with no retraining or extra data (Section 3).
Empirical Validation: Reports the best results among the compared baselines on COCO (FID 21.86) and Attend-Condition (CLIP T2T 0.804) (Table 1, Fig. 5). For example, the Inpainting baseline reaches an IOU of only 0.51, versus 0.57 for the proposed method (Table 2).

05 Related Work

Global Control: ControlNet (Zhang et al. 2023) and T2I-Adapter (Mou et al. 2023) enable global structural guidance but cannot constrain a specific region; in Fig. 5, the “plane” generated by ControlNet drifts away from the designated region.
Compositional Generation: Attend-and-Excite (Chefer et al. 2023) refines attention for multiple concepts but lacks spatial constraints, which leads to overlapping objects (Section 2).
Local Editing: Inpainting methods (Lugmayr et al. 2022) post-process a global result, causing inconsistency between control and non-control regions (Section 4.2); in Fig. 5 (right), the “table” structure produced by Inpainting is blurry.

06 Implementation Details

Control Concept Matching (Eq. 4): Selects $C_{\text{control}}^t$ via the attention sum inside the local region; at early timesteps the choice is stabilized with $Count_{\text{max}}$ ($\beta=0.8$ works best, Fig. 8b). In the “cat + dog” scene, for example, the concept that dominates the local region is selected dynamically.
RDLoss Update (Eq. 7): Gradient-based latent adjustment using the attention max-distance, with $\alpha_t$ scaling the step size (Section 3.3). For non-control tokens (e.g., “seaside”), the non-local attention maximum is forced above the local one.
FTR Suppression (Eq. 8): Scales non-max tokens in cross-attention by $\gamma=0.1$ (Fig. 3), reducing patch overlap, such as the feature overlap between “coffee cup” and “teddy bear” (Fig. 4).
FMC Integration (Eq. 9): Masks ControlNet features at the UNet blocks, avoiding artifacts caused by blank features in non-control regions (Fig. 7c vs. f).
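The FTR and FMC steps are simple enough to sketch on toy data. The snippet below is an illustration under assumptions rather than the paper's Eq. 8 and Eq. 9: the function names and the list layouts (`attn[patch][token]` and `control_feats[channel][y][x]`) are hypothetical, no renormalization is applied after suppression, and the mask is assumed to be already resized to the feature resolution. Only the value $\gamma=0.1$ and the idea of masking the ControlNet features come from the text.

```python
# Toy sketch of FTR and FMC (assumed forms, plain Python lists).

def focused_token_response(attn, gamma=0.1):
    """For each patch, keep the strongest token response and scale the rest by gamma.

    attn[p][k] is the cross-attention score of patch p for text token k.
    Tokens tied for the maximum are all kept unchanged.
    """
    out = []
    for scores in attn:
        top = max(scores)
        out.append([s if s == top else gamma * s for s in scores])
    return out


def feature_mask_constraint(control_feats, mask):
    """Zero ControlNet features outside the control region.

    control_feats[c][y][x] are the features added to a UNet block;
    mask[y][x] is 1 inside the control region and 0 elsewhere.
    """
    return [
        [[f * m for f, m in zip(row_f, row_m)] for row_f, row_m in zip(chan, mask)]
        for chan in control_feats
    ]


# Two patches, three text tokens.
attn = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
sharpened = focused_token_response(attn)

# One feature channel on a 2x2 grid; the left column is the control region.
feats = [[[1.0, 2.0], [3.0, 4.0]]]
mask = [[1, 0], [1, 0]]
masked = feature_mask_constraint(feats, mask)
```

After suppression each patch is dominated by a single token, which is the mechanism the post credits with reducing duplicated objects; after masking, only features inside the control region are passed on to the UNet.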

07 Experimental Design

Datasets: COCO-5k (validation) and Attend-Condition (11 object+animal pairs, Section 4.1), such as the “cake + teddy bear” scene in Fig. 4.
Baselines: ControlNet (global control), T2I-Adapter (lightweight control), Noise-Mask (masked noise blending, Eq. 10), Feature-Mask (FMC only), and Inpainting (inpainting as post-processing).
Metrics: FID (image quality), CLIP Score (text-image alignment), CLIP T2T (alignment between a caption of the generated image and the original prompt), IOU (segmentation-based localization), and LPIPS (local fidelity).
Ablation: RDLoss, FTR, and FMC are tested on COCO-canny (Table 2), confirming the key roles of RDLoss (+0.018 CLIP Score) and FMC (-1.82 FID).
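Of the metrics listed above, IOU is the easiest to illustrate directly. The sketch below computes intersection over union for two binary masks; how the paper obtains the predicted mask (for example, which segmentation model is run on the generated image) is not stated in this post, so the function name, the inputs, and the convention for two empty masks are assumptions made purely for illustration.

```python
def mask_iou(pred, target):
    """Intersection over union of two equally sized binary masks (nested lists)."""
    inter = 0
    union = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks are treated as a perfect match (an assumed convention).
    return inter / union if union else 1.0


pred = [[1, 1, 0], [0, 1, 0]]    # hypothetical mask segmented from a generated image
target = [[1, 0, 0], [0, 1, 1]]  # hypothetical mask of the intended control region
iou = mask_iou(pred, target)     # 2 shared pixels out of 4 covered pixels
```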

08 Results and Comparison

Quantitative (Table 1): On COCO, the proposed method achieves the lowest FID (21.86) and the highest CLIP T2T (0.801), outperforming Inpainting (FID 25.72) and ControlNet (CLIP T2T 0.782). On Attend-Condition, its CLIP T2T reaches 0.804, well above the 0.700 of T2I-Adapter.
Qualitative (Fig. 5-6): Baselines fail in multi-object scenarios (e.g., ControlNet generates only the “plane” in “building+plane” and ignores the building), while the proposed method preserves every concept with structural fidelity. In Fig. 6, the baselines show artifacts in the “frog” control region, whereas the local edges of the “lion” produced by the proposed method are sharp and well aligned.
Localization (IOU 0.57): Local conditions (e.g., the cat canny map in Fig. 1) are placed accurately without leaking into non-control regions. Noise-Mask, by comparison, reaches an IOU of only 0.37; the post attributes the improved spatial consistency to FMC.

09 Ablation Findings

RDLoss (Table 2): Improves CLIP T2T by a reported 0.036 (baseline vs. RDLoss+FMC), supporting its role in regenerating ignored objects. The endpoints quoted for the same comparison, 0.750→0.802, would correspond to a gain of 0.052, so the two figures do not agree. In Fig. 7b (RDLoss only), the “dog” is correctly generated in the non-control region.
FMC (Fig. 7c vs. f): Reduces FID by 1.82 (23.65→21.83) by mitigating feature inconsistency, which supports the need for the feature constraint, but slightly lowers IOU (reported as -0.22) because of the mask constraint.
FTR (Table 2): Enhances object distinction, reducing duplication in “train+dog” scenes (Fig. 7d vs. f). Without FTR, the beach texture generated for “seaside” shows repeated patches.

10 Future Directions

Multi-condition Support: Extend to multi-modal local controls (e.g., depth + edge).
Real-time Inference: Optimize gradient-based updates (Eq. 7) for faster generation.
Dynamic Masking: Adaptive mask refinement during denoising, improving boundary fidelity (Fig. 1 control region edges).
Cross-dataset Generalization: Validate on complex scenes (e.g., cityscapes) beyond COCO.
