How to set up long-term monitoring of the thread pool in a .NET Core web application

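Setting up long-term thread pool monitoring for a .NET Core web application is a systematic effort that covers code integration, metric collection, storage, visualization, and alerting.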

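Core idea

Thread pool monitoring does not stand alone. It only becomes meaningful when analyzed alongside the application's overall performance metrics, such as request volume, response time, and error rate. The goal is that when performance degrades, you can quickly tell whether the thread pool is responsible and locate the root cause.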

Step 1: Define the core metrics to monitor

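The System.Threading.ThreadPool class exposes the following key values as static properties (ThreadCount, PendingWorkItemCount, and CompletedWorkItemCount are all available from .NET Core 3.0 onward):

Thread count
- ThreadPool.ThreadCount: the total number of threads currently in the thread pool, busy and idle alike. This is the most fundamental metric.
Work item queue length
- ThreadPool.PendingWorkItemCount: the number of work items queued and waiting to be processed. Runtimes that lack this property have no direct equivalent: subtracting the result of ThreadPool.GetAvailableThreads(out workerThreads, out ioThreads) from ThreadPool.GetMaxThreads yields only the number of busy threads, which is an indirect hint and not a queue length. A growing backlog is the most direct signal of a performance problem.
Completed work items
- ThreadPool.CompletedWorkItemCount: the total number of work items completed since startup. Its rate of change gives throughput.
Thread injection rate
- Watch how fast ThreadCount grows. Slow growth is normal, but a steep rise in a short window (a "thread storm") suggests that many work items are blocking.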
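
As a quick way to inspect these values before wiring up any exporter, the following sketch reads them in a single call. ThreadPoolSnapshot and Capture are hypothetical names introduced here for illustration; the ThreadPool members themselves are the real static API and assume .NET Core 3.0 or later.

```csharp
using System.Threading;

// Hypothetical helper (not part of .NET): captures the thread pool values
// discussed above in one call.
public static class ThreadPoolSnapshot
{
    public static (int ThreadCount, long Pending, long Completed, int BusyWorkers, int MaxWorkers) Capture()
    {
        // Max minus available gives the worker threads currently in use.
        ThreadPool.GetMaxThreads(out int maxWorkers, out _);
        ThreadPool.GetAvailableThreads(out int availableWorkers, out _);

        return (
            ThreadPool.ThreadCount,             // threads that currently exist
            ThreadPool.PendingWorkItemCount,    // queued, not yet started
            ThreadPool.CompletedWorkItemCount,  // cumulative since startup
            maxWorkers - availableWorkers,
            maxWorkers);
    }
}
```

Calling ThreadPoolSnapshot.Capture() twice a few seconds apart and comparing the two results gives a rough view of throughput (change in Completed) and of thread injection (change in ThreadCount).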

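In addition, these application metrics must be correlated with the thread pool metrics:

Application layer: requests per second (RPS), 95th/99th percentile response time, and error rate (especially 5xx).
System layer: CPU usage and memory usage.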

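Step 2: Choose a technology stack and implement monitoring (three mainstream options)

Pick one option, or combine several, depending on your organization's existing stack and how much complexity you can take on.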

Option 1: ASP.NET Core metrics with Prometheus + Grafana (first choice for cloud-native deployments)

This is a widely used approach in cloud-native environments.

Expose a metrics endpoint:

Install the prometheus-net.AspNetCore NuGet package.
In Program.cs, enable metric collection and expose the endpoint:

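using Prometheus;

var builder = WebApplication.CreateBuilder(args);
// ... other service configuration
var app = builder.Build();

// Collect ASP.NET Core HTTP request metrics
app.UseHttpMetrics();
// Expose the metrics endpoint for Prometheus to scrape.
// It is served on the same address and port the application listens on.
app.MapMetrics("/metrics");
// ... other middleware configuration
app.Run();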

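Add custom thread pool metrics:

prometheus-net collects many metrics automatically, and depending on the version it may already forward some runtime counters. Defining the thread pool metrics yourself keeps their names and sampling interval under your control.
Create a background service, ThreadPoolMetricsService.cs: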

using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;
using Prometheus;

public class ThreadPoolMetricsService : BackgroundService
{
    private readonly Gauge _threadCountGauge;
    private readonly Gauge _pendingWorkItemGauge;
    private readonly Counter _completedWorkItemCounter;

    public ThreadPoolMetricsService()
    {
        _threadCountGauge = Metrics.CreateGauge(
            "dotnet_threadpool_thread_count", "Number of thread pool threads");
        _pendingWorkItemGauge = Metrics.CreateGauge(
            "dotnet_threadpool_pending_work_item_count", "Number of pending work items");
        _completedWorkItemCounter = Metrics.CreateCounter(
            "dotnet_threadpool_completed_work_item_total", "Total number of work items completed");
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            // Refresh the metric values every 5 seconds.
            // All three ThreadPool properties require .NET Core 3.0 or later.
            _threadCountGauge.Set(ThreadPool.ThreadCount);
            _pendingWorkItemGauge.Set(ThreadPool.PendingWorkItemCount);
            // IncTo raises the counter to the cumulative total reported by the runtime.
            _completedWorkItemCounter.IncTo(ThreadPool.CompletedWorkItemCount);

            await Task.Delay(TimeSpan.FromSeconds(5), stoppingToken);
        }
    }
}

Register the service in Program.cs, before calling builder.Build(): builder.Services.AddHostedService<ThreadPoolMetricsService>();

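Deploy and configure the infrastructure:
- Deploy Prometheus: configure scrape_configs to scrape your web application's /metrics endpoint.
- Deploy Grafana: add Prometheus as a data source.
Create a Grafana dashboard:
- Write PromQL queries to visualize the metrics, for example:
  - dotnet_threadpool_thread_count
  - rate(dotnet_threadpool_completed_work_item_total[5m]) (throughput)
  - dotnet_threadpool_pending_work_item_count
- Put the thread pool metrics on the same dashboard as http_requests_received_total (request volume), http_request_duration_seconds (latency), and similar metrics so they can be analyzed together.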
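
For the Prometheus side, a minimal scrape configuration might look like the fragment below. The job name, the scrape interval, and the my-web-app:8080 target are assumptions made for illustration; replace them with the address your application actually listens on.

```yaml
# prometheus.yml (fragment): a sketch, not a complete configuration
scrape_configs:
  - job_name: "dotnet-web-app"        # hypothetical job name
    metrics_path: /metrics            # matches app.MapMetrics("/metrics")
    scrape_interval: 15s              # assumed; should not exceed the windows used in queries
    static_configs:
      - targets: ["my-web-app:8080"]  # hypothetical host:port of the application
```

Because the background service refreshes its values every 5 seconds, a scrape interval shorter than that would only return repeated samples.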

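Option 2: Application Insights (first choice in the Azure ecosystem)

If you are already on Azure, Application Insights offers out-of-the-box integration, although custom thread pool metrics take a little extra setup.

Integrate the Application Insights SDK:
- Install the Microsoft.ApplicationInsights.AspNetCore NuGet package.
- Enable it in Program.cs: builder.Services.AddApplicationInsightsTelemetry();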

Send custom metrics:

Use the TelemetryClient.GetMetric method to send custom metrics. Create a similar hosted service (a BackgroundService, which implements IHostedService):

using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.ApplicationInsights;
using Microsoft.Extensions.Hosting;

public class ThreadPoolMetricsService : BackgroundService
{
    private readonly TelemetryClient _telemetryClient;

    public ThreadPoolMetricsService(TelemetryClient telemetryClient)
    {
        _telemetryClient = telemetryClient;
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            _telemetryClient.GetMetric("ThreadPool.ThreadCount").TrackValue(ThreadPool.ThreadCount);
            _telemetryClient.GetMetric("ThreadPool.PendingWorkItemCount").TrackValue(ThreadPool.PendingWorkItemCount);

            // CompletedWorkItemCount is a cumulative counter, while App Insights
            // metrics are mostly snapshot values. Consider computing the difference
            // between samples and sending it as a rate.

            // App Insights aggregates over longer periods, so frequent sampling is unnecessary.
            await Task.Delay(TimeSpan.FromSeconds(30), stoppingToken);
        }
    }
}

Register the service: builder.Services.AddHostedService<ThreadPoolMetricsService>();
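
The code comment about sending a rate instead of the cumulative CompletedWorkItemCount could be realized roughly as follows. CounterRateCalculator is a hypothetical helper written for this illustration, not part of the Application Insights SDK; it only turns successive readings of a cumulative counter into a per-second rate.

```csharp
using System;

// Hypothetical helper: converts successive readings of a cumulative counter
// into a per-second rate.
public sealed class CounterRateCalculator
{
    private long _lastValue;
    private DateTimeOffset _lastTimestamp;
    private bool _hasSample;

    // Returns the rate since the previous call, or null when no rate can be
    // computed (first sample, non-positive interval, or counter reset).
    public double? Next(long currentValue, DateTimeOffset now)
    {
        double? rate = null;

        if (_hasSample)
        {
            double elapsedSeconds = (now - _lastTimestamp).TotalSeconds;
            long delta = currentValue - _lastValue;

            if (elapsedSeconds > 0 && delta >= 0)
            {
                rate = delta / elapsedSeconds;
            }
        }

        // Remember this reading as the baseline for the next call.
        _lastValue = currentValue;
        _lastTimestamp = now;
        _hasSample = true;
        return rate;
    }
}
```

Inside the sampling loop, the service would call Next(ThreadPool.CompletedWorkItemCount, DateTimeOffset.UtcNow) and, when the result is not null, pass it to GetMetric(...).TrackValue(...) under a metric name of your choosing, for example the hypothetical "ThreadPool.CompletedWorkItemsPerSecond".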

View in the Azure portal:
- On the Metrics page of your Application Insights resource, you can find your custom metrics and build charts from them.

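Option 3: Diagnostic tools and event sources (for deep diagnostics)

For long-term monitoring, the options above are a better fit. When you need to dig into a complicated thread pool problem, however, .NET EventCounters and the dotnet-counters tool are invaluable.

Ad hoc diagnostics:
- On the production server, use the dotnet-counters command for real-time monitoring: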
```
dotnet-counters monitor --name <your-process-name> --counters System.Runtime
```
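  This shows thread pool changes in real time, which makes it well suited to observation during load tests or incidents.
Listen to an EventSource in code:
- The runtime publishes thread pool counters through the System.Runtime EventSource (detailed thread pool events come from the Microsoft-Windows-DotNETRuntime provider). You can write an EventListener that collects them over the long term and forwards them to your monitoring system (such as Elasticsearch), but this is usually more complex and only worth it for very specific needs.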
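
To make the EventListener route more concrete, here is a minimal in-process sketch. RuntimeCounterListener, its callback parameter, and the 5-second interval are assumptions for illustration; forwarding the values to a backend is left to the callback. It relies on the System.Runtime counters whose names start with "threadpool-" (thread count, queue length, completed items).

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Tracing;

// Hypothetical listener: subscribes to System.Runtime EventCounters and
// reports thread pool counters through a callback.
public sealed class RuntimeCounterListener : EventListener
{
    private readonly Action<string, double> _onCounter;

    public RuntimeCounterListener(Action<string, double> onCounter) => _onCounter = onCounter;

    protected override void OnEventSourceCreated(EventSource source)
    {
        if (source.Name == "System.Runtime")
        {
            // Ask the runtime to publish counter values every 5 seconds.
            EnableEvents(source, EventLevel.Informational, EventKeywords.All,
                new Dictionary<string, string?> { ["EventCounterIntervalSec"] = "5" });
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        if (eventData.EventName != "EventCounters" || eventData.Payload is null) return;

        foreach (object? item in eventData.Payload)
        {
            if (item is IDictionary<string, object> counter
                && TryRead(counter, out string name, out double value)
                && name.StartsWith("threadpool-", StringComparison.Ordinal))
            {
                // The base constructor can raise callbacks before the field is set.
                _onCounter?.Invoke(name, value);
            }
        }
    }

    // Polling counters report "Mean"; incrementing counters report "Increment".
    public static bool TryRead(IDictionary<string, object> counter, out string name, out double value)
    {
        name = counter.TryGetValue("Name", out object? n) ? (n as string ?? "") : "";
        value = 0;
        if (name.Length == 0) return false;

        if (counter.TryGetValue("Mean", out object? raw) || counter.TryGetValue("Increment", out raw))
        {
            value = Convert.ToDouble(raw);
            return true;
        }
        return false;
    }
}
```

Creating one instance at startup, for example new RuntimeCounterListener((name, value) => Console.WriteLine($"{name}={value}")), and keeping it alive for the lifetime of the process is enough to start receiving values.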

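Step 3: Set up alerting

The purpose of monitoring is to catch problems early. Set alerts on the key metrics:

Work item queue backlog alert:
- Rule: dotnet_threadpool_pending_work_item_count > X sustained for more than Y minutes.
- Notes: derive the threshold X from your application's baseline. Even a sustained backlog of 5-10 items can indicate a problem. This is the alert that deserves the most attention.
Abnormal thread count growth alert:
- Rule: delta(dotnet_threadpool_thread_count[5m]) > Z.
- Notes: delta() returns the change of a gauge over the window, so this fires when the thread count grows by more than Z (for example, 10) within 5 minutes, which may mean a "thread storm" is under way.
Thread count ceiling alert:
- Rule: dotnet_threadpool_thread_count approaches your theoretical or configured maximum.
- Notes: guards against thread exhaustion bringing the application to a complete standstill.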
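
Expressed as Prometheus alerting rules, the first two alerts might look like the sketch below. The group name, alert names, severity labels, and the concrete thresholds (10 items, 5 minutes, 10 threads) are placeholders standing in for X, Y, and Z; they should come from your own baseline.

```yaml
# threadpool-alerts.yml (fragment): a sketch with placeholder thresholds
groups:
  - name: dotnet-threadpool            # hypothetical group name
    rules:
      - alert: ThreadPoolQueueBacklog
        expr: dotnet_threadpool_pending_work_item_count > 10   # X = 10 (placeholder)
        for: 5m                                                # Y = 5 minutes (placeholder)
        labels:
          severity: warning
        annotations:
          summary: "Thread pool work item queue is backing up"
      - alert: ThreadPoolThreadCountSurge
        expr: delta(dotnet_threadpool_thread_count[5m]) > 10   # Z = 10 (placeholder)
        labels:
          severity: warning
        annotations:
          summary: "Thread pool thread count is growing unusually fast"
```

The ceiling alert is left out because its threshold depends on the maximum configured for your process.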

Alerting tools:

Prometheus-based: use Alertmanager.
Azure: use Application Insights alerts or Azure Monitor alerts.
Other: use Grafana's built-in alerting, or integrate with PagerDuty, OpsGenie, and similar services.

Summary and best practices

Step	Recommended approach	Tools
1. Code integration	Custom BackgroundService	`prometheus-net` or `Application Insights SDK`
2. Data collection	Pull model	Prometheus
	Push model	Application Insights
3. Visualization	Custom dashboards	Grafana (with Prometheus) or the Azure portal
4. Alerting	Threshold-based	Prometheus Alertmanager or Azure alerts
5. Deep diagnostics	Ad hoc tools	`dotnet-counters`, `dotnet-dump`

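Best practices:

Establish a baseline: run your application under normal load and record the normal ranges for thread count, queue length, and related metrics. Alert thresholds should be derived from these baselines.
Correlate: never look at thread pool metrics in isolation. When the thread pool queue backs up, always check CPU usage, response time, and error logs at the same time. A backlog under high CPU and a backlog under low CPU have very different causes (likely compute-bound work versus waiting on I/O).
Watch long-term trends: observe how the thread count evolves over time. A healthy application handling steady-state load should have a relatively stable thread count. A sustained slow rise may point to a small resource leak (for example, unclosed database connections) or to poorly scheduled tasks.
Log context: when an alert fires, make sure your application logs already contain enough context (which requests were being processed at the time, whether there was a burst of exceptions). This makes troubleshooting far easier.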

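With the options above, you can build a robust, production-oriented long-term monitoring system for the thread pool and keep your .NET Core web application running reliably.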