Uncovering unique patterns: a guide to the significant terms aggregation in Elasticsearch

By: Alexander Dávila, Elastic

Learn how to use the significant terms aggregation to discover insights in your data.

Further reading: Elasticsearch: significant terms aggregation


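In Elasticsearch, the significant terms aggregation goes beyond finding the most common terms: it surfaces values that are statistically unusual in a dataset. This lets us uncover valuable insights and non-obvious patterns. The response of a significant terms aggregation contains two useful fields:

- bg_count (background count): the number of documents found in the parent (background) dataset
- doc_count: the number of documents found in the result (foreground) dataset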

For example, in a phone sales dataset, we can look for significant terms in the iPhone 16 sales like this:

GET phone_sales_analysis/_search
{"size": 0,"query": {"term": {"phone_model": {"value": "iPhone 16"}}},"aggs": {"significant_cities": {"significant_terms": {"field": "city_region","size": 1}}}
}

The response then gives us:

{"aggregations": {"significant_cities": {"doc_count": 122,"bg_count": 424,"buckets": [{"key": "Houston","doc_count": 12,"score": 0.1946481360617346,"bg_count": 14}]}}
}

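Houston is not among the top ten cities in the overall dataset, nor is it the top-selling city for the iPhone 16. However, the significant terms aggregation shows that the iPhone 16 is purchased disproportionately often in this city compared with the rest of the dataset. Let's take a closer look at the data:

At the top level:
- doc_count: 122. The query matched 122 documents in total.
- bg_count: 424. The background set (all sales documents) contains 424 documents.
In the Houston bucket:
- doc_count: 12. Houston appears 12 times in the 122 query results.
- bg_count: 14. Houston appears 14 times in the 424 documents of the background dataset.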

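This tells us that out of 424 total purchases, only 14 were made in Houston, or 3.3%. But if we look only at iPhone 16 sales, 12 of the 122 were made in Houston, or 9.8%, roughly 3 times the share in the overall dataset. That is what makes it "significant"!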
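The arithmetic behind those percentages can be reproduced in a few lines. This is only an illustrative sketch: the helper name `share` is hypothetical, and the counts are copied from the response shown earlier.

```python
def share(count: int, total: int) -> float:
    """Return count as a percentage of total."""
    return 100.0 * count / total


# Counts copied from the significant_cities response.
fg_total, bg_total = 122, 424    # iPhone 16 sales, all sales
houston_fg, houston_bg = 12, 14  # Houston documents in each set

fg_share = share(houston_fg, fg_total)  # Houston's share of iPhone 16 sales
bg_share = share(houston_bg, bg_total)  # Houston's share of all sales
lift = fg_share / bg_share              # how over-represented Houston is

print(f"background: {bg_share:.1f}%  foreground: {fg_share:.1f}%  lift: {lift:.1f}x")
# background: 3.3%  foreground: 9.8%  lift: 3.0x
```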

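Here is how this analysis looks in a visualization: total sales per city_region.

We can see that Houston has 14 sales, which ranks it 14th in the dataset by sales volume.

Now, if we apply a filter to look only at iPhone 16 sales, Houston has 12 sales, making it the city with the 2nd-highest sales for this model: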

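Understanding the significant terms aggregation

According to the Elastic documentation, the significant terms aggregation:

"(finds) terms that have undergone a significant change in popularity measured between a foreground and background set."

This means it uses statistical measures to compare how frequent a term is in a subset of the data (the foreground set) with how frequent it is in the whole parent dataset (the background set). As a result, the score reflects statistical significance rather than the raw number of times a term appears in the data.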
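To make the scoring concrete, here is a minimal sketch of the JLH heuristic, which is the documented default for significant terms. It assumes the default is in use, since the requests in this article do not set a different heuristic. The function name `jlh_score` is hypothetical and this is not Elasticsearch's actual implementation.

```python
def jlh_score(doc_count: int, fg_total: int, bg_count: int, bg_total: int) -> float:
    """Sketch of JLH: (fg% - bg%) * (fg% / bg%).

    doc_count / fg_total is the term's frequency in the foreground set,
    bg_count / bg_total is its frequency in the background set.
    """
    fg = doc_count / fg_total
    bg = bg_count / bg_total
    if fg <= bg:
        # Assumption of this sketch: terms that are not more frequent
        # in the foreground than in the background score 0.
        return 0.0
    return (fg - bg) * (fg / bg)


# Houston bucket: 12 of 122 foreground docs, 14 of 424 background docs.
print(round(jlh_score(12, 122, 14, 424), 4))  # 0.1946
```

With the Houston counts, the result agrees with the `score` value in the bucket shown earlier, which suggests the default heuristic was indeed the one applied.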

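The main differences between the significant terms aggregation and the regular terms aggregation are:

- Significant terms compares a subset of the data (the foreground) against a background set, while the terms aggregation works only on the dataset returned by the query.
- The terms aggregation returns the most common terms in the dataset, while significant terms ignores common terms and finds the terms that make the subset unique.
- Significant terms can have a larger performance impact, because it needs to read data from disk rather than from memory, as the terms aggregation does.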

Practical application (consumer behavior analysis)

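Preparing the data for analysis

For this analysis, we generated a synthetic phone sales dataset that includes prices, phone specifications, buyer demographics, and feedback. We also generated embeddings from the user feedback so that we can run semantic queries later. We use the multilingual e5 small model, which is available out of the box in Elasticsearch.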

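To use this dataset in Elasticsearch:

- Upload the CSV file (which can be downloaded here) with Kibana's Upload data files feature.
- Set up a semantic field named "embedding" that uses the multilingual-e5-small model, as shown in this blog.
- Finish the import with the default field types (all fields are keyword except purchase_date and user_feedback). Make sure the index is named phone_sales_analysis so that the queries shown in this article can be run as-is.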

The main goal of this analysis is to find out "What's different about the iPhone 16 buyers versus other segments of the population?" and to use the answer as a basis for marketing segmentation.

Here is a sample document from the dataset:

{"customer_type": "Returning","user_feedback": "I have to say, quality is great for the price. The battery life is really good.","upgrade_frequency": "2 years","storage_capacity": "256GB","occupation": "Technology & Data","color": "Phantom Black","gender": "Male","price_paid": 899,"previous_brand_loyalty": "Mixed","location_type": "Urban","phone_model": "Samsung Galaxy S24","city_region": "San Francisco Bay Area","@timestamp": "2024-03-15T00:00:00.000-05:00","income_bracket": "75000-100000","purchase_channel": "Online","feedback_sentiment": "positive","education_level": "Bachelor","embedding": "I have to say, quality is great for the price. The battery life is really good.","customer_id": "C001","purchase_date": "2024-03-15","age": 34,"trade_in_model": "iPhone 13"
}

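Understanding demographic patterns

Here, we analyze the overall population and compare it with the interesting findings from a significant terms aggregation on iPhone 16 users.

General patterns

To understand general purchasing patterns, we can aggregate on different fields across all documents. To keep things simple, we will focus on the occupation of the people who bought phones. This can be done with the following request to Elasticsearch: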

GET phone_sales_analysis/_search
{"aggs": {"occupation_distribution": {"terms": {"size": 5,"field": "occupation"}}},"size": 0
}

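This tells us that the main occupations in the dataset (ordered by number of records) are:

Patterns for iPhone 16 users

To understand what is different about the people who bought an iPhone 16, we can run a terms aggregation on the same field and add a filter to the query that selects only those buyers, as follows: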

GET phone_sales_analysis/_search
{"query": {"term": {"phone_model": "iPhone 16"}},"aggs": {"occupation_distribution": {"terms": {"size": 5,"field": "occupation"}}},"size": 0
}

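So, for iPhone 16 users, the main occupations are:

We can see that the occupation distribution of iPhone 16 users differs from that of users of other phone models. Let's use Kibana to visualize these results easily:

In this chart, we can see that the trend for the iPhone 16 differs from the trend for the overall population.

We can skip this whole analysis and directly run a significant terms aggregation to see what sets iPhone 16 users apart from the general population: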

GET phone_sales_analysis/_search
{"query": {"term": {"phone_model": "iPhone 16"}},"aggs": {"occupation_distribution": {"significant_terms": {"size": 5,"field": "occupation"}}},"size": 0
}
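The three occupation requests in this section differ only in whether a query is present and in which aggregation type is used. A small sketch makes that explicit. The helper name `occupation_request` is hypothetical, and it only builds the request body as a Python dict; sending it to a cluster is left out.

```python
import json
from typing import Optional


def occupation_request(agg_type: str, phone_model: Optional[str] = None) -> dict:
    """Build a search body that aggregates on the `occupation` field.

    agg_type is either "terms" or "significant_terms".
    phone_model, if given, restricts the foreground set with a term query.
    """
    if agg_type not in ("terms", "significant_terms"):
        raise ValueError(f"unsupported aggregation type: {agg_type}")
    body = {
        "size": 0,
        "aggs": {
            "occupation_distribution": {
                agg_type: {"field": "occupation", "size": 5}
            }
        },
    }
    if phone_model is not None:
        body["query"] = {"term": {"phone_model": phone_model}}
    return body


# The same body as the significant terms request above.
print(json.dumps(occupation_request("significant_terms", "iPhone 16"), indent=2))
```

Calling `occupation_request("terms")` gives the whole-dataset request, and `occupation_request("terms", "iPhone 16")` gives the filtered one.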

In short, we get this response:

Values of occupations for the iPhone 16	doc_count	bg_count
occupation_distribution (top level)	122	424
Medical & Healthcare bucket	45	57

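The response makes it clear that the number of people working in medical and healthcare among iPhone 16 users is unusual (that is, significant!) compared with the general population. Let's look at what the numbers in the response mean:

At the top level:
- doc_count: 122. The query matched 122 documents in total.
- bg_count: 424. The background set (all sales documents) contains 424 documents.
In the Medical & Healthcare bucket:
- doc_count: 45. "Medical & Healthcare" appears in 45 of the 122 query results.
- bg_count: 57. "Medical & Healthcare" appears in 57 of the 424 documents in the background dataset.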

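Of the 424 buyers, 57 work in medical and healthcare, or 13.44%. But among the 122 iPhone 16 buyers, 45 work in that field, or 36.89%. This means that the probability of finding a medical and healthcare professional among iPhone 16 users is about 2.7 times (36.89% / 13.44%) the probability in the general population!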

We can use the same approach to analyze other fields (age, location, income bracket, and so on) to learn more about what makes iPhone 16 users unique.
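One way to do that is to put several significant terms aggregations into a single request, one per field. The sketch below only builds the request body. The helper name is hypothetical, and the field list is an assumption based on the keyword fields visible in the sample document.

```python
def significant_terms_for_fields(fields, phone_model="iPhone 16", size=5):
    """Build one search body with a significant_terms aggregation per field."""
    return {
        "size": 0,
        # Foreground set: only documents for the given phone model.
        "query": {"term": {"phone_model": phone_model}},
        "aggs": {
            f"significant_{field}": {
                "significant_terms": {"field": field, "size": size}
            }
            for field in fields
        },
    }


# Fields taken from the sample document; adjust to your own mapping.
body = significant_terms_for_fields(["city_region", "income_bracket", "location_type"])
print(sorted(body["aggs"]))
```

Each aggregation in the response can then be read the same way as the occupation example, by comparing doc_count with bg_count in every bucket.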

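Consumer segmentation

We can use the significant terms aggregation to extract insights about the relationships between products, categories, and customer segments. To do this, we first build a parent aggregation on the category of interest, then use significant terms and regular terms sub-aggregations to find what is distinctive about that category and compare it with what most people in that category use.

For example, let's see what people in a few occupational fields prefer:

- To keep the analysis clear, we limit the search to 3 occupational fields: ["Administrative & Support", "Technology & Data", "Medical & Healthcare"]
- On the aggregation side, we start with a terms aggregation on occupation
- Add a first sub-aggregation: a terms aggregation on phone model, to find which phone models the users in each occupational field buy
- Add a second sub-aggregation: a significant terms aggregation on phone model, to find which phone models each occupational field especially favors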

GET phone_sales_analysis/_search
{"query": {"terms": {"occupation": ["Administrative & Support","Technology & Data","Medical & Healthcare"]}},"aggs": {"occupations": {"terms": {"size": 15,"field": "occupation"},"aggs": {"general_models": {"terms": {"field": "phone_model"}},"significant_models": {"significant_terms": {"field": "phone_model"}}}}},"size": 0
}

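Let's break down the aggregation results:

Occupation: Administrative & Support

Terms aggregation:

Significant terms aggregation:

From this table, we can infer that the trend for this occupation does not differ significantly from the trend for the overall population.

Occupation: Technology & Data

Terms aggregation:

Significant terms aggregation:

Total documents: 424

Documents in this occupation: 71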

phone model	doc_count (this model in this occupation)	bg_count (this model in all the documents)	% in all the documents	% in this occupation
Google Pixel 8	12	22	5.19%	16.90%
OnePlus 11	9	14	3.30%	12.68%
OnePlus 12 Pro	3	3	0.71%	4.23%
Google Pixel 8 Pro	9	21	4.95%	12.68%
Nothing Phone 2	5	8	1.89%	7.04%
Samsung Galaxy Z Fold5	4	6	1.42%	5.63%
OnePlus 12	8	20	4.72%	11.27%

Occupation: Medical & Healthcare

Terms aggregation:

Significant terms aggregation:

Total documents: 424

Documents in this occupation: 57

phone model	doc_count (this model in this occupation)	bg_count (this model in all the documents)	% in all the documents	% in this occupation
iPhone 16	45	122	28.77%	78.95%
iPhone 15 Pro Max	3	13	3.07%	5.26%
iPhone 15	7	40	9.43%	12.28%
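
The two percentage columns in these tables are not part of the response; they can be derived from doc_count and bg_count. The sketch below assumes that each occupation bucket carries a `significant_models` object with its own top-level doc_count and bg_count, in the same shape as the top-level responses shown earlier. The helper name `model_rows` is hypothetical, and the sample bucket is filled in by hand from the Medical & Healthcare table.

```python
def model_rows(occupation_bucket: dict) -> list:
    """Turn one occupation bucket into table rows with percentage columns."""
    sig = occupation_bucket["significant_models"]
    fg_total = sig["doc_count"]  # documents in this occupation
    bg_total = sig["bg_count"]   # documents in the whole background set
    rows = []
    for bucket in sig["buckets"]:
        rows.append({
            "phone_model": bucket["key"],
            "doc_count": bucket["doc_count"],
            "bg_count": bucket["bg_count"],
            "pct_all": round(100 * bucket["bg_count"] / bg_total, 2),
            "pct_occupation": round(100 * bucket["doc_count"] / fg_total, 2),
        })
    return rows


# Hand-built from the Medical & Healthcare table (assumed response shape).
medical = {
    "key": "Medical & Healthcare",
    "significant_models": {
        "doc_count": 57,
        "bg_count": 424,
        "buckets": [
            {"key": "iPhone 16", "doc_count": 45, "bg_count": 122},
            {"key": "iPhone 15 Pro Max", "doc_count": 3, "bg_count": 13},
            {"key": "iPhone 15", "doc_count": 7, "bg_count": 40},
        ],
    },
}

for row in model_rows(medical):
    print(row)
```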

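Let's see what story this data tells us:

- Medical and healthcare professionals favor the iPhone 16 and, overall, lean strongly toward Apple phones.
- Technology and data professionals prefer high-end Android phones, though not necessarily Samsung ones. There is also a sizable iPhone trend in this category.
- Administrative and support professionals prefer Samsung and Google phones, but show no clear, distinctive trend.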

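Significant terms aggregation and hybrid search

Hybrid search combines text search and semantic search results to provide a better search experience. In this context, the significant terms aggregation can provide insight into the results of a context-aware search by answering the question "What is special about this set of results compared with all documents?"

To demonstrate this, let's see which phone models are over-represented when users talk about "good performance":

- Build a semantic query that looks for the user feedback closest to the input "good performance" in the embedding field
- At the same time, run a text search with the same terms on the user_feedback text field
- Add a significant terms aggregation to find the phone models that appear more often in these results than in the whole dataset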

GET phone_sales_analysis/_search
{"retriever": {"rrf": {"retrievers": [{"standard": {"query": {"bool": {"must": [{"match": {"user_feedback": {"query": "good performance","operator": "and"}}}]}}}},{"standard": {"query": {"semantic": {"field": "embedding","query": "good performance"}}}}],"rank_window_size": 20}},"aggs": {"Models": {"significant_terms": {"field": "phone_model"}}}
}

Let's look at an example of a matching document:

This is the response we get:

{"took": 388,"timed_out": false,"_shards": {"total": 1,"successful": 1,"skipped": 0,"failed": 0},"hits": {"total": {"value": 20,"relation": "eq"},"max_score": 0.016393442,"hits": [...]},"aggregations": {"Models": {"doc_count": 20,"bg_count": 424,"buckets": [{"key": "iPhone 15","doc_count": 5,"score": 0.4125,"bg_count": 40}]}}
}
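
As a side note on the max_score in this response: with reciprocal rank fusion, each retriever contributes 1 / (rank_constant + rank) for a document it returns. The sketch below assumes the documented default rank_constant of 60, since the request does not set one. The function name `rrf_score` is hypothetical, and the reading of the top hit is an inference from the number, not something the response states.

```python
def rrf_score(ranks, rank_constant=60):
    """Sum of reciprocal ranks for one document.

    ranks holds the document's 1-based rank in each retriever that
    returned it; retrievers that did not return it contribute nothing.
    """
    return sum(1.0 / (rank_constant + rank) for rank in ranks)


# A document ranked first by exactly one of the two retrievers.
print(rrf_score([1]))     # about 0.016393, in line with max_score above
# For comparison, a document ranked first by both retrievers.
print(rrf_score([1, 1]))  # about 0.032787
```

Under these assumptions, the reported max_score is consistent with the top hit being ranked first by only one of the two retrievers.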

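This tells us that while the iPhone 15 appears 40 times in the 424 total documents (9.4%), it appears 5 times in the 20 documents matched by the hybrid search for "good performance" (25%). We can therefore conclude that, when users talk about "good performance", an iPhone 15 is about 2.7 times as likely to appear as it would be by chance.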