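Elasticsearch Queries: Disjunction Max Query

Introduction

The Disjunction Max Query (dis_max), also known as the best_fields matching strategy, is meant for searches where the query keywords appear in more than one field. It takes the highest score from any single field as the document's final score, which tends to produce a more sensible ranking.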

Indexing the data

Two sample documents:

docId: 1
title: kubernetes docker
content: java spring python

docId: 2
title: java python go
content: java scala

POST test01/doc/_bulk
{ "index" : { "_id" : "1" } }
{ "title" : "kubernetes docker", "content": "java spring python" }
{ "index" : { "_id" : "2" } }
{ "title" : "java python go", "content": "java scala" }

Querying the data

GET test01/_search?
{"query": {"bool": {"should": [{"match": {"title": "java spring"}},{"match": {"content": "java spring"}}]}}
}

The result:

{"took" : 2,"timed_out" : false,"_shards" : {"total" : 6,"successful" : 6,"skipped" : 0,"failed" : 0},"hits" : {"total" : 2,"max_score" : 0.5753642,"hits" : [{"_index" : "test01","_type" : "doc","_id" : "2","_score" : 0.5753642,"_source" : {"title" : "java python go","content" : "java scala"}},{"_index" : "test01","_type" : "doc","_id" : "1","_score" : 0.5753642,"_source" : {"title" : "kubernetes docker","content" : "java spring python"}}]}
}

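As you can see, both documents get the same score. Judging by content, the document with id=1 should rank first, because a single field matches both keywords, but the default scoring strategy can end up placing id=2 ahead of id=1.

How it works

Under Elasticsearch's default scoring strategy, the score of a bool query is the sum of the scores of all matching should clauses. The walkthrough below simplifies the scoring; real scoring is more complex, but the general idea is the same:

For id=1, the title has no hits, but the content matches 2 keywords, so the score is 2.

For id=2, the title matches 1 keyword and the content also matches 1 keyword, so the final score is also 2.

That is why both documents end up with the same score.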
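The simplified walkthrough above can be reproduced in a few lines of Python. This is only a sketch of the simplified model: the helper names field_score and bool_should_score are made up for illustration, and counting matched terms is an assumption that stands in for Elasticsearch's real relevance scoring.

```python
# Simplified model (an assumption): a field's score is the number of
# distinct query terms it contains. Real Elasticsearch scoring is more complex.
DOCS = {
    "1": {"title": "kubernetes docker", "content": "java spring python"},
    "2": {"title": "java python go", "content": "java scala"},
}


def field_score(text, query):
    """Count how many distinct query terms appear in the field text."""
    terms = set(text.split())
    return sum(1 for term in set(query.split()) if term in terms)


def bool_should_score(doc, query, fields=("title", "content")):
    """Sum the per-field scores, the way a bool/should query adds clause scores."""
    return sum(field_score(doc[field], query) for field in fields)


for doc_id, doc in DOCS.items():
    print(doc_id, bool_should_score(doc, "java spring"))
```

Both documents come out with a score of 2 in this model, which mirrors the tie seen in the real result.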

The dis_max query

A dis_max query changes the matching mechanism: the highest single-field score becomes the final score.

GET test01/_search?
{"query": {"dis_max": {"queries": [{"match": {"title": "java spring"}},{"match": {"content": "java spring"}}]}}
}

The result:

{"took" : 4,"timed_out" : false,"_shards" : {"total" : 6,"successful" : 6,"skipped" : 0,"failed" : 0},"hits" : {"total" : 2,"max_score" : 0.5753642,"hits" : [{"_index" : "test01","_type" : "doc","_id" : "1","_score" : 0.5753642,"_source" : {"title" : "kubernetes docker","content" : "java spring python"}},{"_index" : "test01","_type" : "doc","_id" : "2","_score" : 0.2876821,"_source" : {"title" : "java python go","content" : "java scala"}}]}
}

The result now matches expectations: id=1 ranks first, with twice the score of id=2.
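In the same simplified term-counting model used in the analysis section, dis_max amounts to replacing the sum with a max. The sketch below assumes that model; field_score and dis_max_score are hypothetical helper names, not Elasticsearch APIs.

```python
# Simplified model (an assumption): a field's score is the number of
# distinct query terms it contains. Real Elasticsearch scoring is more complex.
DOCS = {
    "1": {"title": "kubernetes docker", "content": "java spring python"},
    "2": {"title": "java python go", "content": "java scala"},
}


def field_score(text, query):
    """Count how many distinct query terms appear in the field text."""
    terms = set(text.split())
    return sum(1 for term in set(query.split()) if term in terms)


def dis_max_score(doc, query, fields=("title", "content")):
    """Keep only the best single-field score (dis_max with tie_breaker 0)."""
    return max(field_score(doc[field], query) for field in fields)


for doc_id, doc in DOCS.items():
    print(doc_id, dis_max_score(doc, "java spring"))
```

In this model id=1 scores 2 and id=2 scores 1, the same 2:1 ratio as the real scores 0.5753642 and 0.2876821.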

The tie_breaker parameter

The previous result matched expectations. Now consider a different case, still using dis_max:


GET test01/_search?
{"query": {"dis_max": {"queries": [{"match": {"title": "python scala"}},{"match": {"content": "python scala"}}]}}
}

The result:

"hits" : [{"_index" : "test01","_type" : "doc","_id" : "2","_score" : 0.2876821,"_source" : {"title" : "java python go","content" : "java scala"}},{"_index" : "test01","_type" : "doc","_id" : "1","_score" : 0.2876821,"_source" : {"title" : "kubernetes docker","content" : "java spring python"}}]

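The two scores are equal again. In practice we would want the document with id = 2 to score higher, because it matches in more than one field, but the dis_max scoring mechanism ignores the contribution of the other fields. This calls for further tuning. dis_max accepts a tie_breaker parameter for this purpose; its default value is 0. Once tie_breaker is set, dis_max works as follows:

Take the relevance score of the highest-scoring matching clause.
Multiply the score of every other matching clause by the tie_breaker value.
Add the highest score to the multiplied scores of the other clauses to get the final score used for ranking.

The improved query: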
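The three steps above can be sketched in Python. The function name dis_max_with_tie_breaker is hypothetical, and the per-clause scores in the example are an assumption inferred from the scores reported in this article (each matching clause is taken to score 0.2876821), not values read from Elasticsearch.

```python
def dis_max_with_tie_breaker(clause_scores, tie_breaker=0.0):
    """Best clause score plus tie_breaker times the sum of the other matching clauses."""
    matching = [score for score in clause_scores if score > 0]
    if not matching:
        return 0.0
    best = max(matching)
    others = sum(matching) - best
    return best + tie_breaker * others


# Assumed per-clause scores [title, content] for the query "python scala".
doc1_clauses = [0.0, 0.2876821]        # only the content clause matches
doc2_clauses = [0.2876821, 0.2876821]  # both clauses match

print(dis_max_with_tie_breaker(doc1_clauses, tie_breaker=0.4))
print(dis_max_with_tie_breaker(doc2_clauses, tie_breaker=0.4))
```

With tie_breaker left at 0 both documents tie at 0.2876821. With 0.4, the document that matches in two fields rises to roughly 1.4 times that value, while the single-field match stays unchanged.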

GET test01/_search?
{"query": {"dis_max": {"queries": [{"match": {"title": "python scala"}},{"match": {"content": "python scala"}}],"tie_breaker": 0.4}}
}

Query result:

"hits" : {"total" : 2,"max_score" : 0.40275493,"hits" : [{"_index" : "test01","_type" : "doc","_id" : "2","_score" : 0.40275493,"_source" : {"title" : "java python go","content" : "java scala"}},{"_index" : "test01","_type" : "doc","_id" : "1","_score" : 0.2876821,"_source" : {"title" : "kubernetes docker","content" : "java spring python"}}]}

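The result now matches our expectations: id = 2 scores 0.40275493 and ranks ahead of id = 1 at 0.2876821.

Summary

A dis_max query gives best_fields matching, which works better in some specialized search scenarios. On its own, though, dis_max ignores the score contribution of every field other than the best one, and that all-or-nothing behavior is not always the best strategy. Pairing it with the tie_breaker parameter lets the non-best-field clauses contribute at a reduced weight, which yields the final tuned ranking.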
