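Table of Contents
- 5. Language Processing and Autocompletion
- 5.1 Custom Corpus
- 5.1.1 Corpus Mapping OpenAPI
- 5.1.2 Corpus Document OpenAPI
- 5.2 Product Search and Autocompletion
- 5.2.1 Chinese-Character Completion OpenAPI
- 5.2.2 Pinyin Completion OpenAPI
- 5.3 Product Search and Language Processing
- 5.3.1 What Is Language Processing (Spelling Correction)
- 5.3.2 Language Processing OpenAPI
- 5.4 Summary
- 6. Product Recommendation for E-commerce Platforms
- 6.1 What Is Search Recommendation
- 6.2 Product Recommendation OpenAPI
- 7. Metric Aggregation and Drill-Down Analysis
- 7.1 Metric Aggregation and Its Categories
- 7.2 Metric Aggregation and Drill-Down Design
- 7.2.1 Building the Base Framework
- 7.2.2 Single-Value Analysis API Design
- 7.2.3 Multi-Value Analysis API Design
- 8. Log Instrumentation and Hot Search Terms for E-commerce Platforms
- 8.1 What Is Hot Search
- 8.2 Extracting Hot Searches
- 8.3 Log Instrumentation
- 8.4 Persisting the Data
- 8.5 Hot Search OpenAPI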
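5. Language Processing and Autocompletion
Target result
The final result resembles the JD.com search box: suggestions are returned while the user types, and typing pinyin or pinyin initials produces the same suggestions.
Typing Chinese characters
Typing pinyin
Typing initials
5.1 Custom Corpus
5.1.1 Corpus Mapping OpenAPI
Index mapping OpenAPI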
- Define the index (mapping) interface
/**
 * Index operations interface
 */
public interface ElasticsearchIndexService {
    // Create an index together with its mapping
    public boolean addIndexAndMapping(CommonEntity commonEntity) throws Exception;
}
- Define the index (mapping) implementation
@Override
public boolean addIndexAndMapping(CommonEntity commonEntity) throws Exception {
    // Build the create-index request
    CreateIndexRequest request = new CreateIndexRequest(commonEntity.getIndexName());
    // Parameters supplied by the downstream caller
    Map<String, Object> map = commonEntity.getMap();
    for (Map.Entry<String, Object> entry : map.entrySet()) {
        // Apply the "settings" section when it is a non-empty map
        if ("settings".equals(entry.getKey()) && entry.getValue() instanceof Map
                && ((Map) entry.getValue()).size() > 0) {
            request.settings((Map) entry.getValue());
        }
        // Apply the "mapping" section when it is a non-empty map
        if ("mapping".equals(entry.getKey()) && entry.getValue() instanceof Map
                && ((Map) entry.getValue()).size() > 0) {
            request.mapping((Map) entry.getValue());
        }
    }
    // Client for index-level operations
    IndicesClient indicesClient = client.indices();
    // Execute the request and report whether the cluster acknowledged it
    CreateIndexResponse response = indicesClient.create(request, RequestOptions.DEFAULT);
    return response.isAcknowledged();
}
- Add the controller
/**
 * Index operations controller
 */
@RestController
@RequestMapping("v1/indices")
public class ElasticsearchIndexController {
    private static final Logger logger = LoggerFactory.getLogger(ElasticsearchIndexController.class);

    @Autowired
    ElasticsearchIndexService elasticsearchIndexService;

    @PostMapping(value = "/add")
    public ResponseData addIndexAndMapping(@RequestBody CommonEntity commonEntity) {
        // Response returned to the downstream caller
        ResponseData rData = new ResponseData();
        if (StringUtils.isEmpty(commonEntity.getIndexName())) {
            rData.setResultEnum(ResultEnum.param_isnull);
            return rData;
        }
        // Whether the index (mapping) was created
        boolean isSuccess = false;
        try {
            // Create the index and mapping through the service
            isSuccess = elasticsearchIndexService.addIndexAndMapping(commonEntity);
            // Fill in the result (the payload type is inferred and autoboxed)
            rData.setResultEnum(isSuccess, ResultEnum.success, null);
            logger.info(TipsEnum.create_index_success.getMessage());
        } catch (Exception e) {
            // Print to the console and log the failure
            e.printStackTrace();
            logger.error(TipsEnum.create_index_fail.getMessage());
            // Build the error response
            rData.setResultEnum(ResultEnum.error);
        }
        return rData;
    }
}
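- Create the mapping
http://127.0.0.1:8888/v1/indices/add
Parameters
Custom analyzer ik_pinyin_analyzer (a combination of the ik tokenizer and the pinyin filter)
Tip
The pinyin plugin must be installed before the mapping is created.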
{"indexName": "product_completion_index","map": {"settings": {"number_of_shards": 1,"number_of_replicas": 2,"analysis": {"analyzer": {"ik_pinyin_analyzer": {"type": "custom","tokenizer": "ik_smart","filter": "pinyin_filter"}},"filter": {"pinyin_filter": {"type": "pinyin","keep_first_letter": true,"keep_separate_first_letter": false,"keep_full_pinyin": true,"keep_original": true,"limit_first_letter_length": 16,"lowercase": true,"remove_duplicated_term": true}}}},"mapping": {"properties": {"name": {"type": "keyword"},"searchkey": {"type": "completion","analyzer": "ik_pinyin_analyzer"}}}}
}
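Everything under settings is index configuration; the parameters are passed through dynamically and follow the DSL syntax.
Everything under mapping describes the mapped fields; the parameters are passed through dynamically and follow the DSL syntax.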
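Option | Description |
---|---|
keep_first_letter | When enabled, keeps the initials as one term, e.g. 劉德華 > ldh. Default: true |
keep_separate_first_letter | When enabled, keeps each initial as a separate term, e.g. 劉德華 > l,d,h. Default: false. Note: results may be too fuzzy because such terms are very frequent |
limit_first_letter_length | Maximum length of the first_letter result. Default: 16 |
keep_full_pinyin | When enabled, keeps each full syllable, e.g. 劉德華 > [liu, de, hua]. Default: true |
keep_joined_full_pinyin | When enabled, keeps the joined syllables, e.g. 劉德華 > [liudehua]. Default: false |
keep_none_chinese | Keeps non-Chinese letters and digits in the result. Default: true |
keep_none_chinese_together | Keeps non-Chinese letters together. Default: true, e.g. DJ音樂家 -> DJ,yin,yue,jia; when set to false, DJ音樂家 -> D,J,yin,yue,jia. Note: keep_none_chinese must be enabled first |
keep_none_chinese_in_first_letter | Keeps non-Chinese letters in the first-letter term, e.g. 劉德華AT2016 -> ldhat2016. Default: true |
keep_none_chinese_in_joined_full_pinyin | Keeps non-Chinese letters in the joined full pinyin, e.g. 劉德華2016 -> liudehua2016. Default: false |
none_chinese_pinyin_tokenize | Splits non-Chinese letters into separate pinyin terms when they form pinyin. Default: true, e.g. liudehuaalibaba13zhuanghan -> liu,de,hua,a,li,ba,ba,13,zhuang,han. Note: keep_none_chinese and keep_none_chinese_together must be enabled first |
keep_original | When enabled, also keeps the original input. Default: false |
lowercase | Lowercases non-Chinese letters. Default: true |
trim_whitespace | Default: true |
remove_duplicated_term | When enabled, removes duplicate terms to save index space, e.g. de的 > de. Default: false. Note: position-related queries may be affected |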
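To make the options above more concrete, the following is a minimal sketch of the token shapes they produce. It is not the plugin's implementation: the class PinyinTokenSketch and its methods are hypothetical, and the syllables for 劉德華 are hard-coded instead of being looked up from the characters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: the real pinyin plugin performs the
// character-to-syllable lookup itself and supports many more options.
class PinyinTokenSketch {

    // keep_first_letter: initials joined into one term,
    // capped at limit_first_letter_length characters
    static String firstLetters(List<String> syllables, int limit) {
        StringBuilder sb = new StringBuilder();
        for (String s : syllables) {
            if (sb.length() >= limit) {
                break;
            }
            sb.append(s.charAt(0));
        }
        return sb.toString();
    }

    // keep_separate_first_letter: one term per initial
    static List<String> separateFirstLetters(List<String> syllables) {
        List<String> out = new ArrayList<>();
        for (String s : syllables) {
            out.add(String.valueOf(s.charAt(0)));
        }
        return out;
    }

    // keep_joined_full_pinyin: all syllables joined into one term
    static String joinedFullPinyin(List<String> syllables) {
        return String.join("", syllables);
    }

    public static void main(String[] args) {
        // Assumed transliteration of the example name used in the table
        List<String> syllables = Arrays.asList("liu", "de", "hua");
        System.out.println(firstLetters(syllables, 16));      // ldh
        System.out.println(separateFirstLetters(syllables));  // [l, d, h]
        System.out.println(syllables);                        // [liu, de, hua] (keep_full_pinyin)
        System.out.println(joinedFullPinyin(syllables));      // liudehua
    }
}
```

With the filter settings used in this article (keep_first_letter and keep_full_pinyin enabled), both the initials and the individual syllables are indexed, which is what lets "xm" and "xiaomi" find the same suggestions later on.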
Response
5.1.2 Corpus Document OpenAPI
- Define the bulk-insert interface
// Bulk-insert documents
public RestStatus bulkAndDoc(CommonEntity commonEntity) throws Exception;
- Define the bulk-insert implementation
@Override
public RestStatus bulkAndDoc(CommonEntity commonEntity) throws Exception {
    // Build the bulk request against the target index
    BulkRequest bulkRequest = new BulkRequest(commonEntity.getIndexName());
    // Add one index request per document supplied by the downstream caller
    for (int i = 0; i < commonEntity.getList().size(); i++) {
        bulkRequest.add(new IndexRequest().source(XContentType.JSON,
                SearchTools.mapToObjectGroup(commonEntity.getList().get(i))));
    }
    // Execute the bulk insert
    BulkResponse bulkResponse = client.bulk(bulkRequest, RequestOptions.DEFAULT);
    return bulkResponse.status();
}
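The article does not show SearchTools.mapToObjectGroup. Since IndexRequest.source(XContentType, Object...) takes alternating field names and values, a helper of this kind presumably flattens each document map into such an array. The sketch below is a hypothetical stand-in under that assumption (class name, parameter type, and sample values are all illustrative), not the author's code.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for SearchTools.mapToObjectGroup, whose source
// is not included in the article.
class MapToObjectGroupSketch {

    // Flattens {k1=v1, k2=v2} into [k1, v1, k2, v2], the key/value
    // layout expected by a varargs "field, value, field, value" method.
    static Object[] mapToObjectGroup(Map<String, Object> map) {
        Object[] pairs = new Object[map.size() * 2];
        int i = 0;
        for (Map.Entry<String, Object> e : map.entrySet()) {
            pairs[i++] = e.getKey();
            pairs[i++] = e.getValue();
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Illustrative document with the two fields of the corpus mapping
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("searchkey", "adidas outlet");
        doc.put("name", "adidas");
        // prints [searchkey, adidas outlet, name, adidas]
        System.out.println(Arrays.toString(mapToObjectGroup(doc)));
    }
}
```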
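Official documentation
According to the official documentation, source(XContentType, Object...) expects its arguments as alternating field names and values.
That is why SearchTools.mapToObjectGroup above converts each map into an array.
- Define the bulk-insert controller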
@PostMapping(value = "/batch")
public ResponseData bulkAndDoc(@RequestBody CommonEntity commonEntity) {
    // Response returned to the downstream caller
    ResponseData rData = new ResponseData();
    if (StringUtils.isEmpty(commonEntity.getIndexName()) || CollectionUtils.isEmpty(commonEntity.getList())) {
        rData.setResultEnum(ResultEnum.param_isnull);
        return rData;
    }
    // Status of the bulk operation
    RestStatus result = null;
    try {
        // Run the bulk insert through the service
        result = elasticsearchDocumentService.bulkAndDoc(commonEntity);
        // Fill in the result (the payload type is inferred and autoboxed)
        rData.setResultEnum(result, ResultEnum.success, null);
        logger.info(TipsEnum.batch_create_doc_success.getMessage());
    } catch (Exception e) {
        // Print to the console and log the failure
        e.printStackTrace();
        logger.error(TipsEnum.batch_create_doc_fail.getMessage());
        // Build the error response
        rData.setResultEnum(ResultEnum.error);
    }
    return rData;
}
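- Run the bulk insert
http://127.0.0.1:8888/v1/docs/batch
Parameters
The request defines 23 suggest entries (小米手機 appears twice to check de-duplication).
Tip
After the aggregation chapters, this corpus is maintained through log instrumentation and data persistence.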
{"indexName": "product_completion_index","list": [{"searchkey": "小米手機","name": "小米(MI)"},{"searchkey": "小米10","name": "小米(MI)"},{"searchkey": "小米電視","name": "小米(MI)"},{"searchkey": "小米路由器","name": "小米(MI)"},{"searchkey": "小米9","name": "小米(MI)"},{"searchkey": "小米手機","name": "小米(MI)"},{"searchkey": "小米耳環","name": "小米(MI)"},{"searchkey": "小米8","name": "小米(MI)"},{"searchkey": "小米10Pro","name": "小米(MI)"},{"searchkey": "小米筆記本","name": "小米(MI)"},{"searchkey": "小米攝像頭","name": "小米(MI)"},{"searchkey": "小米電飯煲","name": "小米(MI)"},{"searchkey": "小米充電寶","name": "小米(MI)"},{"searchkey": "adidas男鞋","name": "adidas男鞋"},{"searchkey": "adidas女鞋","name": "adidas女鞋"},{"searchkey": "adidas外套","name": "adidas外套"},{"searchkey": "adidas褲子","name": "adidas褲子"},{"searchkey": "adidas官方旗艦店","name": "adidas官方旗艦店"},{"searchkey": "阿迪達斯襪子","name": "阿迪達斯襪子"},{"searchkey": "阿迪達斯外套","name": "阿迪達斯外套"},{"searchkey": "阿迪達斯運動鞋","name": "阿迪達斯運動鞋"},{"searchkey": "耐克外套","name": "耐克外套"},{"searchkey": "耐克運動鞋","name": "耐克運動鞋"}]
}
Response
Inspect the indexed documents
GET product_completion_index/_search
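5.2 Product Search and Autocompletion
- Term suggester: analyzes the input text into tokens and offers term suggestions for each token
- Phrase suggester: builds on the term suggester and also considers the relationship between multiple terms
- Completion suggester: designed mainly for autocompletion ("auto completion") scenarios
- Context suggester: the context-aware suggester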
GET product_completion_index/_search
{"from": 0,"size": 100,"suggest": {"czbk-suggest": {"prefix": "小米","completion": {"field": "searchkey","size": 20,"skip_duplicates": true}}}
}
5.2.1 Chinese-Character Completion OpenAPI
- Define the autocompletion interface
// Autocompletion (completion suggester)
public List<String> cSuggest(CommonEntity commonEntity) throws Exception;
- Define the autocompletion implementation
@Override
public List<String> cSuggest(CommonEntity commonEntity) throws Exception {
    // Result list
    List<String> suggestList = new ArrayList<>();
    // Completion suggestion builder for the target field
    CompletionSuggestionBuilder completionSuggestionBuilder =
            SuggestBuilders.completionSuggestion(commonEntity.getSuggestFileld());
    // Text typed by the user
    completionSuggestionBuilder.prefix(commonEntity.getSuggestValue());
    // Drop duplicate suggestions
    completionSuggestionBuilder.skipDuplicates(true);
    // Number of suggestions to return
    completionSuggestionBuilder.size(commonEntity.getSuggestCount());
    // Build the search request
    SearchRequest searchRequest = new SearchRequest()
            .indices(commonEntity.getIndexName())
            .source(new SearchSourceBuilder()
                    .sort(new ScoreSortBuilder().order(SortOrder.DESC))
                    .suggest(new SuggestBuilder().addSuggestion("czbk-suggest", completionSuggestionBuilder)));
    // Execute the search
    SearchResponse response = client.search(searchRequest, RequestOptions.DEFAULT);
    // Read the completion suggestion from the response
    CompletionSuggestion completionSuggestion = response.getSuggest().getSuggestion("czbk-suggest");
    // Options of the first entry
    List<CompletionSuggestion.Entry.Option> optionList = completionSuggestion.getEntries().get(0).getOptions();
    // Collect the suggestion texts
    if (!CollectionUtils.isEmpty(optionList)) {
        optionList.forEach(item -> suggestList.add(item.getText().toString()));
    }
    return suggestList;
}
- Define the autocompletion controller
@GetMapping(value = "/csuggest")
public ResponseData cSuggest(@RequestBody CommonEntity commonEntity) {
    // Response returned to the downstream caller
    ResponseData rData = new ResponseData();
    if (StringUtils.isEmpty(commonEntity.getIndexName())
            || StringUtils.isEmpty(commonEntity.getSuggestFileld())
            || StringUtils.isEmpty(commonEntity.getSuggestValue())) {
        rData.setResultEnum(ResultEnum.param_isnull);
        return rData;
    }
    // Suggestions returned by the service
    List<String> result = null;
    try {
        // Run the completion suggester through the service
        result = elasticsearchDocumentService.cSuggest(commonEntity);
        // Fill in the result together with the number of suggestions
        rData.setResultEnum(result, ResultEnum.success, result.size());
        logger.info(TipsEnum.csuggest_get_doc_success.getMessage());
    } catch (Exception e) {
        // Print to the console and log the failure
        e.printStackTrace();
        logger.error(TipsEnum.csuggest_get_doc_fail.getMessage());
        // Build the error response
        rData.setResultEnum(ResultEnum.error);
    }
    return rData;
}
- Verify the autocompletion call
http://192.168.150.7:6666/v1/docs/csuggest
or
http://localhost:6666/v1/docs/csuggest
Parameters
{"indexName": "product_completion_index","suggestFileld": "searchkey","suggestValue": "小米","suggestCount": 13
}
- indexName: index name
- suggestFileld: field used for autocompletion
- suggestValue: keyword typed by the user
- suggestCount: number of suggestions to return (JD.com returns 13)
Response
{"code": "200","desc": "操作成功!","data": ["小米10","小米10Pro","小米8","小米9","小米充電寶","小米手機","小米攝像頭","小米電視","小米電飯煲","小米筆記本","小米耳環","小米路由器"],"count": 12
}
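Autocompletion removes duplicates automatically: 小米手機 was indexed twice but is returned only once, so 12 suggestions come back.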
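The behaviour of prefix plus skip_duplicates can be sketched without Elasticsearch. The example below is only an in-memory approximation under stated assumptions: the real completion suggester works on an in-memory FST built from analyzed input, while this hypothetical PrefixCompletionSketch scans a plain list, and the ASCII corpus entries are illustrative stand-ins for the product names above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical in-memory approximation of "prefix" + "skip_duplicates";
// it does not reproduce Elasticsearch's FST-based lookup or its scoring.
class PrefixCompletionSketch {

    // Returns up to `size` distinct entries starting with `prefix`,
    // in the order they appear in the corpus.
    static List<String> suggest(List<String> corpus, String prefix, int size) {
        Set<String> seen = new LinkedHashSet<>();
        for (String entry : corpus) {
            if (seen.size() >= size) {
                break;
            }
            if (entry.startsWith(prefix)) {
                seen.add(entry); // a Set silently ignores repeated entries
            }
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // Illustrative corpus with one deliberate duplicate
        List<String> corpus = Arrays.asList(
                "mi phone", "mi 10", "mi tv", "mi phone", "adidas coat");
        // prints [mi phone, mi 10, mi tv]
        System.out.println(suggest(corpus, "mi", 13));
    }
}
```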
5.2.2 Pinyin Completion OpenAPI
Query 【小米】 using pinyin
http://localhost:8888/v1/docs/csuggest
Parameters
// Full pinyin
{"indexName": "product_completion_index","suggestFileld": "searchkey","suggestValue": "xiaomi","suggestCount": 13
}
// Full pinyin, separated by a space
{"indexName": "product_completion_index","suggestFileld": "searchkey","suggestValue": "xiao mi","suggestCount": 13
}
// Initials
{"indexName": "product_completion_index","suggestFileld": "searchkey","suggestValue": "xm","suggestCount": 13
}
These queries only work when the pinyin plugin is installed.
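5.3 Product Search and Language Processing
5.3.1 What Is Language Processing (Spelling Correction)
Scenario
For example, the misspelled input 【adidaas官方旗艦店】 should be corrected to 【adidas官方旗艦店】.
5.3.2 Language Processing OpenAPI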
GET product_completion_index/_search
{"suggest": {"czbk-suggestion": {"text": "adidaas官方旗艦店","phrase": {"field": "name","size": 13}}}
}
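Suggesters of this kind pick candidate terms that are a small string distance away from the input. The sketch below uses plain Levenshtein distance as a stand-in to show why "adidaas" resolves to "adidas". It is not the scoring Elasticsearch applies, and the class EditDistanceSketch, the dictionary, and the maxEdits value are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of edit-distance based correction; the real
// phrase suggester also weighs candidates with a language model.
class EditDistanceSketch {

    // Classic dynamic-programming Levenshtein distance (two rolling rows).
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the dictionary term closest to the input, or null when no
    // term is within maxEdits edits.
    static String closest(String input, List<String> dictionary, int maxEdits) {
        String best = null;
        int bestDistance = maxEdits + 1;
        for (String term : dictionary) {
            int d = levenshtein(input, term);
            if (d < bestDistance) {
                bestDistance = d;
                best = term;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> dictionary = Arrays.asList("nike", "adidas", "adibas");
        System.out.println(levenshtein("adidaas", "adidas"));   // 1
        System.out.println(closest("adidaas", dictionary, 2));  // adidas
    }
}
```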
- Define the spelling-correction interface
// Spelling correction
public String pSuggest(CommonEntity commonEntity) throws Exception;
- Define the spelling-correction implementation
@Override
public String pSuggest(CommonEntity commonEntity) throws Exception {
    // Corrected text; empty when there is no suggestion
    String pSuggestString = "";
    // Phrase suggestion builder for the target field
    PhraseSuggestionBuilder phraseSuggestionBuilder =
            new PhraseSuggestionBuilder(commonEntity.getSuggestFileld());
    // Text to correct
    phraseSuggestionBuilder.text(commonEntity.getSuggestValue());
    // Only the best suggestion is needed
    phraseSuggestionBuilder.size(1);
    // Build the search request
    SearchRequest searchRequest = new SearchRequest()
            .indices(commonEntity.getIndexName())
            .source(new SearchSourceBuilder()
                    .sort(new ScoreSortBuilder().order(SortOrder.DESC))
                    .suggest(new SuggestBuilder().addSuggestion("czbk-suggest", phraseSuggestionBuilder)));
    // Execute the search
    SearchResponse response = client.search(searchRequest, RequestOptions.DEFAULT);
    // Read the phrase suggestion from the response
    PhraseSuggestion phraseSuggestion = response.getSuggest().getSuggestion("czbk-suggest");
    // Options of the first entry
    List<PhraseSuggestion.Entry.Option> optionList = phraseSuggestion.getEntries().get(0).getOptions();
    // Take the first option as the correction
    if (!CollectionUtils.isEmpty(optionList)) {
        pSuggestString = optionList.get(0).getText().toString();
    }
    return pSuggestString;
}
- Define the spelling-correction controller
@GetMapping(value = "/psuggest")
public ResponseData pSuggest(@RequestBody CommonEntity commonEntity) {
    // Response returned to the downstream caller
    ResponseData rData = new ResponseData();
    if (StringUtils.isEmpty(commonEntity.getIndexName())
            || StringUtils.isEmpty(commonEntity.getSuggestFileld())
            || StringUtils.isEmpty(commonEntity.getSuggestValue())) {
        rData.setResultEnum(ResultEnum.param_isnull);
        return rData;
    }
    // Corrected text returned by the service
    String result = null;
    try {
        // Run the phrase suggester through the service
        result = elasticsearchDocumentService.pSuggest(commonEntity);
        // Fill in the result (the payload type is inferred and autoboxed)
        rData.setResultEnum(result, ResultEnum.success, null);
        logger.info(TipsEnum.psuggest_get_doc_success.getMessage());
    } catch (Exception e) {
        // Print to the console and log the failure
        e.printStackTrace();
        logger.error(TipsEnum.psuggest_get_doc_fail.getMessage());
        // Build the error response
        rData.setResultEnum(ResultEnum.error);
    }
    return rData;
}
- Verify the language-processing call
http://192.168.150.7:6666/v1/docs/psuggest
or
http://localhost:6666/v1/docs/psuggest
Parameters
{"indexName": "product_completion_index","suggestFileld": "name","suggestValue": "adidaas官方旗艦店" }
- indexName: index name
- suggestFileld: field used for spelling correction
- suggestValue: text to correct
Response
{"code": "200","desc": "操作成功!","data": "adidas官方旗艦店" }
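5.4 Summary
- Keep a dedicated search-term corpus that is separate from the business index, so the corpus can be maintained and upgraded independently.
- Query the corpus with the analyzed input and any other search conditions, and return a handful of records (JD.com returns 13, Taobao/Tmall 10, Baidu 4).
- To improve accuracy, suggestions are usually prefix searches.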
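6. Product Recommendation for E-commerce Platforms
6.1 What Is Search Recommendation
Example: the user enters the keywords 【阿迪達斯 耐克 外套 運動鞋 襪子】
The site then replies with a message such as: "Woof~ No products were found for “阿迪達斯 耐克 外套 運動鞋 襪子”. Here are products related to “阿迪達斯耐克運動鞋”, or try:"
6.2 Product Recommendation OpenAPI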
GET product_completion_index/_search
{"suggest": {"czbk-suggestion": {"text": "阿迪達斯 耐克 外套 運動鞋 襪子","term": {"field": "name","min_word_length": 2,"string_distance": "ngram","analyzer": "ik_smart"}}}
}
For caveats, see the official documentation:
https://www.elastic.co/guide/en/elasticsearch/reference/7.4/search-suggesters.html#term-suggester
- Define the search-recommendation interface
// Search recommendation
public String tSuggest(CommonEntity commonEntity) throws Exception;
- Define the search-recommendation implementation
@Override
public String tSuggest(CommonEntity commonEntity) throws Exception {
    // Recommended text; empty when there is no suggestion
    String tSuggestString = "";
    // Term suggestion builder for the target field
    TermSuggestionBuilder termSuggestionBuilder =
            SuggestBuilders.termSuggestion(commonEntity.getSuggestFileld());
    // Keywords typed by the user
    termSuggestionBuilder.text(commonEntity.getSuggestValue());
    // Analyzer used to tokenize the input
    termSuggestionBuilder.analyzer("ik_smart");
    // Minimum length a term must have to be considered
    termSuggestionBuilder.minWordLength(2);
    // String-distance algorithm used to compare terms
    termSuggestionBuilder.stringDistance(TermSuggestionBuilder.StringDistanceImpl.NGRAM);
    // Build the search request
    SearchRequest searchRequest = new SearchRequest()
            .indices(commonEntity.getIndexName())
            .source(new SearchSourceBuilder()
                    .sort(new ScoreSortBuilder().order(SortOrder.DESC))
                    .suggest(new SuggestBuilder().addSuggestion("czbk-suggest", termSuggestionBuilder)));
    // Execute the search
    SearchResponse response = client.search(searchRequest, RequestOptions.DEFAULT);
    // Read the term suggestion from the response
    TermSuggestion termSuggestion = response.getSuggest().getSuggestion("czbk-suggest");
    // Options of the first entry
    List<TermSuggestion.Entry.Option> optionList = termSuggestion.getEntries().get(0).getOptions();
    // Take the first option as the recommendation
    if (!CollectionUtils.isEmpty(optionList)) {
        tSuggestString = optionList.get(0).getText().toString();
    }
    return tSuggestString;
}
- Define the search-recommendation controller
@GetMapping(value = "/tsuggest")
public ResponseData tSuggest(@RequestBody CommonEntity commonEntity) {
    // Response returned to the downstream caller
    ResponseData rData = new ResponseData();
    if (StringUtils.isEmpty(commonEntity.getIndexName())
            || StringUtils.isEmpty(commonEntity.getSuggestFileld())
            || StringUtils.isEmpty(commonEntity.getSuggestValue())) {
        rData.setResultEnum(ResultEnum.param_isnull);
        return rData;
    }
    // Recommendation returned by the service
    String result = null;
    try {
        // Run the term suggester through the service
        result = elasticsearchDocumentService.tSuggest(commonEntity);
        // Fill in the result (the payload type is inferred and autoboxed)
        rData.setResultEnum(result, ResultEnum.success, null);
        logger.info(TipsEnum.tsuggest_get_doc_success.getMessage());
    } catch (Exception e) {
        // Print to the console and log the failure
        e.printStackTrace();
        logger.error(TipsEnum.tsuggest_get_doc_fail.getMessage());
        // Build the error response
        rData.setResultEnum(ResultEnum.error);
    }
    return rData;
}
- Verify the search-recommendation call
http://127.0.0.1:8888/v1/docs/tsuggest
Parameters
{"indexName": "product_completion_index","suggestFileld": "name","suggestValue": "阿迪達斯 耐克 外套 運動鞋 襪子" }
- indexName: index name
- suggestFileld: field used for the recommendation lookup
- suggestValue: keywords typed by the user
Response
{"code": "200","desc": "操作成功!","data": "阿迪達斯外套" }
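7. Metric Aggregation and Drill-Down Analysis
7.1 Metric Aggregation and Its Categories
What is a metric aggregation?
Aggregation analysis is a core database feature: it computes aggregate values over the data set matched by a query, such as the maximum, minimum, sum, or average of a field (or of a computed expression).
As a search engine that doubles as a data store, ES offers equally powerful aggregation capabilities.
Aggregations that compute metrics such as max, min, sum, and avg over a data set are called metric aggregations in ES.
Metric aggregations fall into two categories, single-value and multi-value:
- Single-value analysis outputs one result:
min, max, avg, sum, cardinality (cardinality counts the distinct values of a field, comparable to DISTINCT in MySQL)
- Multi-value analysis outputs several results:
stats, extended_stats, percentiles, percentile_ranks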
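The difference between the two categories can be illustrated in plain Java. This is only a sketch over in-memory arrays with made-up sample values, and the class MetricSketch is hypothetical; note also that the cardinality aggregation in ES is approximate, while the distinct count here is exact.

```java
import java.util.Arrays;
import java.util.DoubleSummaryStatistics;

// Hypothetical in-memory counterpart of the metric aggregations above.
class MetricSketch {

    // Multi-value result: count, min, max, avg, and sum in one pass,
    // comparable to what the "stats" aggregation returns.
    static DoubleSummaryStatistics stats(double[] prices) {
        return Arrays.stream(prices).summaryStatistics();
    }

    // Single-value result: number of distinct values (exact here).
    static long cardinality(String[] values) {
        return Arrays.stream(values).distinct().count();
    }

    public static void main(String[] args) {
        double[] prices = {10, 20, 30, 40};          // illustrative prices
        String[] stores = {"a", "b", "a"};           // illustrative store names
        DoubleSummaryStatistics s = stats(prices);
        System.out.println(s.getCount());            // 4
        System.out.println(s.getMin());              // 10.0
        System.out.println(s.getMax());              // 40.0
        System.out.println(s.getAverage());          // 25.0
        System.out.println(s.getSum());              // 100.0
        System.out.println(cardinality(stores));     // 2
    }
}
```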
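7.2 Metric Aggregation and Drill-Down Design
Official documentation
Syntax
"aggregations" : {
    "<aggregation_name>" : {                                <!-- name of the aggregation -->
        "<aggregation_type>" : {                            <!-- type of the aggregation -->
            <aggregation_body>                              <!-- body: which fields to aggregate -->
        }
        [,"meta" : { [<meta_data_body>] } ]?                <!-- metadata -->
        [,"aggregations" : { [<sub_aggregation>]+ } ]?      <!-- sub-aggregations defined inside the aggregation -->
    }
    [,"<aggregation_name_2>" : { ... } ]*                   <!-- name of a further aggregation -->
}
OpenAPI design goals and principles:
- Abstract DSL calls and syntax to a high degree, with dynamically designed parameters
- The Open API supports hundreds of call combinations through a result converter:
query, constant_score, match / match_all / filter / sort / size / from / highlight / _source / includes
- Share common logic across calls to strengthen the API's business-processing capability
- Preserve the usage of the native API and its parameters
7.2.1 Building the Base Framework
7.2.2 Single-Value Analysis API Design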
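- Avg (average)
The average of the price values extracted from the aggregated documents.
avg aggregation over all documents (DSL):
POST product_list/_search {"size": 0,"aggs": {"czbk": {"avg": {"field": "price"}}} }
This computes the average over all documents.
"size": 0 means only the aggregation result is returned, with no document hits; to return 50 hits as well, set size to 50.
aggs: declares an aggregation.
czbk: a user-chosen name; the aggregated result is returned under this key.
Result: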
{"took" : 1662,"timed_out" : false,"_shards" : {"total" : 1,"successful" : 1,"skipped" : 0,"failed" : 0},"hits" : {"total" : {"value" : 10000,"relation" : "gte"},"max_score" : null,"hits" : [ ]},"aggregations" : {"czbk" : {"value" : 920.1535462724372}} }
OpenAPI query parameter design
{"indexName": "product_list","map": {"size": 0,"aggs": {"czbk": {"avg": {"field": "price"}}}} }
Aggregating over filtered documents
POST product_list/_search {"size": 0,"query": {"match": {"onelevel": "手機通訊"}},"aggs": {"czbk": {"avg": {"field": "price"}}} }
Result:
{"took" : 159,"timed_out" : false,"_shards" : {"total" : 1,"successful" : 1,"skipped" : 0,"failed" : 0},"hits" : {"total" : {"value" : 10000,"relation" : "gte"},"max_score" : null,"hits" : [ ]},"aggregations" : {"czbk" : {"value" : 314.77633210684854}} }
OpenAPI query parameter design
{"indexName": "product_list","map": {"size": 0,"query": {"match": {"onelevel": "手機通訊"}},"aggs": {"czbk": {"avg": {"field": "price"}}}} }
Computing the average with a script:
ES scripts are written in Painless, a safe and efficient scripting language that runs on the JVM.
# All documents
POST product_list/_search?size=0 {"aggs": {"czbk": {"avg": {"script": {"source": "doc.evalcount.value"}}}} }
Result: "value" : 599929.110015995
# With a filter
POST product_list/_search?size=0 {"query": {"match": {"onelevel": "手機通訊"}},"aggs": {"czbk": {"avg": {"script": {"source": "doc.evalcount"}}}} }
Result: "value" : 600055.6935087288
OpenAPI query parameter design
{"indexName": "product_list","map": {"size": 0,"aggs": {"czbk": {"avg": {"script": {"source": "doc.evalcount"}}}}} }
Summary:
avg (average)
1. avg over all documents
2. avg over filtered documents
3. script-based avg over all documents
4. script-based avg over filtered documents
Code
// Average
if (m.getValue() instanceof ParsedAvg) {
    map.put("value", ((ParsedAvg) m.getValue()).getValue());
}
Verify the call
http://localhost:6666/v1/analysis/metric/agg
or
http://localhost:5555/v1/analysis/metric/agg
- Max (maximum)
Computes the maximum of the numeric values extracted from the aggregated documents.
All documents
POST product_list/_search {"size": 0,"aggs": {"czbk": {"max": {"field": "price"}}} }
Result: "value" : 1.0E8
OpenAPI query parameter design
{"indexName": "product_list","map": {"size": 0,"aggs": {"czbk": {"max": {"field": "price"}}}} }
Filtered documents
POST product_list/_search {"size": 0,"query": {"match": {"onelevel": "手機通訊"}},"aggs": {"czbk": {"max": {"field": "price"}}} }
Result: "value" : 2474000.0
OpenAPI query parameter design
{"indexName": "product_list","map": {"size": 0,"query": {"match": {"onelevel": "手機通訊"}},"aggs": {"czbk": {"max": {"field": "price"}}}} }
Result: "value" : 2474000.0
Code
// Maximum
if (m.getValue() instanceof ParsedMax) {
    map.put("value", ((ParsedMax) m.getValue()).getValue());
}
Verify the call
http://localhost:6666/v1/analysis/metric/agg
or
http://localhost:5555/v1/analysis/metric/agg
- Min (minimum)
Computes the minimum of the numeric values extracted from the aggregated documents.
All documents
POST product_list/_search {"size": 0,"aggs": {"czbk": {"min": {"field": "price"}}} }
Result: "value" : 0.0
OpenAPI query parameter design
{"indexName": "product_list","map": {"size": 0,"aggs": {"czbk": {"min": {"field": "price"}}}} }
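Filtered documents
POST product_list/_search {"size": 1,"query": {"match": {"onelevel": "手機通訊"}},"aggs": {"czbk": {"min": {"field": "price"}}} }
Result: "value" : 0.0
size is set to 1 here so that a document is returned alongside the aggregation, which makes it possible to look at data whose price is 0.
OpenAPI query parameter design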
{"indexName": "product_list","map": {"size": 1,"query": {"match": {"onelevel": "手機通訊"}},"aggs": {"czbk": {"min": {"field": "price"}}}} }
Code
// Minimum
if (m.getValue() instanceof ParsedMin) {
    map.put("value", ((ParsedMin) m.getValue()).getValue());
}
Verify the call
http://localhost:6666/v1/analysis/metric/agg
or
http://localhost:5555/v1/analysis/metric/agg
- Sum (total)
Sum over the filtered documents
POST product_list/_search {"size": 0,"query": {"constant_score": {"filter": {"match": {"threelevel": "手機"}}}},"aggs": {"czbk": {"sum": {"field": "price"}}} }
Result: "value" : 9.652872986812243E8
OpenAPI query parameter design
{"indexName": "product_list","map": {"size": 0,"query": {"constant_score": {"filter": {"match": {"threelevel": "手機"}}}},"aggs": {"czbk": {"sum": {"field": "price"}}}} }
Code
// Sum
if (m.getValue() instanceof ParsedSum) {
    map.put("value", ((ParsedSum) m.getValue()).getValue());
}
Verify the call
http://localhost:6666/v1/analysis/metric/agg
or
http://localhost:5555/v1/analysis/metric/agg
- Cardinality (distinct values)
Cardinality aggregation. It is a single-value metric: based on a value of each document (a specific field, or a value computed by a script), it counts the number of distinct values (a de-duplicated count, approximate in ES), comparable to DISTINCT in SQL or MySQL.
All documents
POST product_list/_search {"size": 0,"aggs": {"czbk": {"cardinality": {"field": "storename.keyword"}}} }
Result: "value" : 103169
OpenAPI query parameter design
{"indexName": "product_list","map": {"size": 0,"aggs": {"czbk": {"cardinality": {"field": "storename.keyword"}}}} }
Filtered documents
POST product_list/_search {"size": 0,"query": {"constant_score": {"filter": {"match": {"threelevel": "手機"}}}},"aggs": {"czbk": {"cardinality": {"field": "storename.keyword"}}} }
OpenAPI query parameter design
{"indexName": "product_list","map": {"size": 0,"query": {"constant_score": {"filter": {"match": {"threelevel": "手機"}}}},"aggs": {"czbk": {"cardinality": {"field": "storename.keyword"}}}} }
Code
// Distinct values
if (m.getValue() instanceof ParsedCardinality) {
    map.put("value", ((ParsedCardinality) m.getValue()).getValue());
}
Verify the call
http://localhost:6666/v1/analysis/metric/agg
or
http://localhost:5555/v1/analysis/metric/agg
7.2.3 Multi-Value Analysis API Design
- Stats Aggregation
Stats aggregation. It is a multi-value metric: based on a value of each document (a specific numeric field, or a value computed by a script), it returns five statistics (min, max, sum, count, avg).
All documents
POST product_list/_search {"size": 0,"aggs": {"czbk": {"stats": {"field": "price"}}} }
Response
"aggregations" : {"czbk" : {"count" : 5072448,"min" : 0.0,"max" : 1.0E8,"avg" : 920.1535462724372,"sum" : 4.667431015482532E9}}
OpenAPI query parameter design
{"indexName": "product_list","map": {"size": 0,"aggs": {"czbk": {"stats": {"field": "price"}}}} }
Filtered documents
POST product_list/_search {"size": 0,"query": {"constant_score": {"filter": {"match": {"threelevel": "手機"}}}},"aggs": {"czbk": {"stats": {"field": "price"}}} }
OpenAPI query parameter design
{"indexName": "product_list","map": {"size": 0,"query": {"constant_score": {"filter": {"match": {"threelevel": "手機"}}}},"aggs": {"czbk": {"stats": {"field": "price"}}}} }
Code
// Stats
if (m.getValue() instanceof ParsedStats) {
    ParsedStats stats = (ParsedStats) m.getValue();
    map.put("count", stats.getCount());
    map.put("min", stats.getMin());
    map.put("max", stats.getMax());
    map.put("avg", stats.getAvg());
    map.put("sum", stats.getSum());
}
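Verify the call
http://localhost:6666/v1/analysis/metric/agg
or
http://localhost:5555/v1/analysis/metric/agg
- Extended stats
Extended stats aggregation. It is a multi-value metric that returns four more statistics than stats: the sum of squares, the variance, the standard deviation, and the interval of the mean plus/minus two standard deviations.
All documents
POST product_list/_search {"size": 0,"aggs": {"czbk": {"extended_stats": {"field": "price"}}} }
Response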
"aggregations" : {"czbk" : {"count" : 5072448,"min" : 0.0,"max" : 1.0E8,"avg" : 920.1535462724372,"sum" : 4.667431015482532E9,"sum_of_squares" : 2.0182209454063148E16,"variance" : 3.9779441210362864E9,"variance_population" : 3.9779441210362864E9,"variance_sampling" : 3.9779449052621484E9,"std_deviation" : 63070.94514145389,"std_deviation_population" : 63070.94514145389,"std_deviation_sampling" : 63070.951358467304,"std_deviation_bounds" : {"upper" : 127062.04382918023,"lower" : -125221.73673663534,"upper_population" : 127062.04382918023,"lower_population" : -125221.73673663534,"upper_sampling" : 127062.05626320705,"lower_sampling" : -125221.74917066217}}}
- sum_of_squares: sum of squares
- variance: variance
- std_deviation: standard deviation
- std_deviation_bounds: interval around the mean bounded by the standard deviation
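How these four values relate to each other can be sketched in a few lines. The example assumes the population variance (sum of squares divided by the count, minus the squared mean) and bounds of the mean plus/minus sigma standard deviations with sigma = 2; the class ExtendedStatsSketch and the sample values are hypothetical.

```java
// Hypothetical in-memory counterpart of the extended_stats fields above.
class ExtendedStatsSketch {

    // Returns {sumOfSquares, populationVariance, stdDeviation, lowerBound, upperBound}.
    static double[] extendedStats(double[] values, double sigma) {
        double sum = 0;
        double sumOfSquares = 0;
        for (double v : values) {
            sum += v;
            sumOfSquares += v * v;
        }
        double n = values.length;
        double avg = sum / n;
        // Population variance: E[x^2] - (E[x])^2
        double variance = sumOfSquares / n - avg * avg;
        double stdDeviation = Math.sqrt(variance);
        return new double[] {
            sumOfSquares,
            variance,
            stdDeviation,
            avg - sigma * stdDeviation,
            avg + sigma * stdDeviation
        };
    }

    public static void main(String[] args) {
        double[] values = {2, 4, 4, 4, 5, 5, 7, 9}; // illustrative sample, mean 5
        double[] r = extendedStats(values, 2);
        System.out.println(r[0]); // 232.0 (sum of squares)
        System.out.println(r[1]); // 4.0   (variance)
        System.out.println(r[2]); // 2.0   (standard deviation)
        System.out.println(r[3]); // 1.0   (lower bound)
        System.out.println(r[4]); // 9.0   (upper bound)
    }
}
```

The same relation can be checked against the all-documents response: 920.1535 + 2 × 63070.9451 ≈ 127062.04, which matches the reported upper bound.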
OpenAPI query parameter design
{"indexName": "product_list","map": {"size": 0,"aggs": {"czbk": {"extended_stats": {"field": "price"}}}} }
Filtered documents
POST product_list/_search {"size": 1,"query": {"constant_score": {"filter": {"match": {"threelevel": "手機"}}}},"aggs": {"czbk": {"extended_stats": {"field": "price"}}} }
Response
"aggregations" : {"czbk" : {"count" : 340378,"min" : 0.0,"max" : 2474000.0,"avg" : 2835.927406240193,"sum" : 9.652872986812243E8,"sum_of_squares" : 6.06065362437439E13,"variance" : 1.7001407710991383E8,"variance_population" : 1.7001407710991383E8,"variance_sampling" : 1.7001457659747353E8,"std_deviation" : 13038.944631752749,"std_deviation_population" : 13038.944631752749,"std_deviation_sampling" : 13038.963785419206,"std_deviation_bounds" : {"upper" : 28913.81666974569,"lower" : -23241.961857265305,"upper_population" : 28913.81666974569,"lower_population" : -23241.961857265305,"upper_sampling" : 28913.854977078605,"lower_sampling" : -23242.00016459822}}}
OpenAPI query parameter design
{"indexName": "product_list","map": {"size": 1,"query": {"constant_score": {"filter": {"match": {"threelevel": "手機"}}}},"aggs": {"czbk": {"extended_stats": {"field": "price"}}}} }
Code
ParsedStats is the parent class of ParsedExtendedStats, so an extended-stats result also passes the ParsedStats check.
Because the checks are independent if statements that write the same keys, their order does not need to change.
// Extended stats
if (m.getValue() instanceof ParsedExtendedStats) {
    ParsedExtendedStats ext = (ParsedExtendedStats) m.getValue();
    map.put("count", ext.getCount());
    map.put("min", ext.getMin());
    map.put("max", ext.getMax());
    map.put("avg", ext.getAvg());
    map.put("sum", ext.getSum());
    map.put("sum_of_squares", ext.getSumOfSquares());
    map.put("variance", ext.getVariance());
    map.put("std_deviation", ext.getStdDeviation());
    map.put("upper", ext.getStdDeviationBound(ExtendedStats.Bounds.UPPER));
    map.put("lower", ext.getStdDeviationBound(ExtendedStats.Bounds.LOWER));
}
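Verify the call
http://localhost:6666/v1/analysis/metric/agg
or
http://localhost:5555/v1/analysis/metric/agg
- Percentiles
Percentiles aggregation. It is a multi-value metric: it orders the values of the given field (or script) from smallest to largest, accumulates the share of documents at each value (as a percentage of all matched documents), and returns the value found at each requested percentage. By default it returns the values at the [1, 5, 25, 50, 75, 95, 99] percentiles.
These are the percentiles people are most commonly interested in.
All documents
POST product_list/_search {"size": 0,"aggs": {"czbk": {"percentiles": {"field": "price"}}} }
Response
"aggregations" : {"czbk" : {"values" : {"1.0" : 0.0,"5.0" : 14.99999272133453,"25.0" : 58.76038168571048,"50.0" : 139.47447505232998,"75.0" : 388.59368606915626,"95.0" : 3634.3835145207904,"99.0" : 12547.450833578012}}}
OpenAPI query parameter design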
{"indexName": "product_list","map": {"size": 0,"aggs": {"czbk": {"percentiles": {"field": "price"}}}} }
Filtered documents
POST product_list/_search {"size": 0,"query": {"constant_score": {"filter": {"match": {"threelevel": "手機"}}}},"aggs": {"czbk": {"percentiles": {"field": "price"}}} }
OpenAPI query parameter design
{"indexName": "product_list","map": {"size": 0,"query": {"constant_score": {"filter": {"match": {"threelevel": "手機"}}}},"aggs": {"czbk": {"percentiles": {"field": "price"}}}} }
Code
// Percentiles
if (m.getValue() instanceof ParsedTDigestPercentiles) {
    for (Iterator<Percentile> iterator = ((ParsedTDigestPercentiles) m.getValue()).iterator(); iterator.hasNext(); ) {
        Percentile p = iterator.next();
        // key: percentile, value: the field value at that percentile
        map.put(p.getPercent(), p.getValue());
    }
}
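Verify the call
http://localhost:6666/v1/analysis/metric/agg
or
http://localhost:5555/v1/analysis/metric/agg
- Percentile ranks
Percentile ranks aggregation: percentile_ranks is a closely related metric. The percentiles metric reports the value below which a given percentage of documents fall; percentile_ranks works the other way round and reports the percentage of documents at or below a given value.
All documents
Compute the percentage of documents priced within 15 yuan and within 30 yuan.
Tips:
The statistics change as the data changes.
The thresholds 15 and 30 here play the same role as the 200 in an SLA check; only the compared field differs.
POST product_list/_search {"size": 0,"aggs": {"czbk": {"percentile_ranks": {"field": "price","values": [15,30]}}} }
Response
Documents priced within 15 yuan account for about 4.89%.
Documents priced within 30 yuan account for about 12.73%.
"aggregations" : {"czbk" : {"values" : {"15.0" : 4.89331591488828,"30.0" : 12.732247823263487}}}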
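What a percentile rank means can be shown with an exact computation over a small array. This is only a sketch: ES estimates these values with an approximation algorithm (TDigest by default), so its results on real data may differ slightly from an exact count. The class PercentileRankSketch and the sample prices are hypothetical.

```java
// Hypothetical exact counterpart of percentile_ranks for a small data set.
class PercentileRankSketch {

    // Percentage of values that are less than or equal to the threshold.
    static double percentileRank(double[] values, double threshold) {
        if (values.length == 0) {
            return Double.NaN;
        }
        int atOrBelow = 0;
        for (double v : values) {
            if (v <= threshold) {
                atOrBelow++;
            }
        }
        return 100.0 * atOrBelow / values.length;
    }

    public static void main(String[] args) {
        // Illustrative prices, not taken from the product_list index
        double[] prices = {5, 12, 15, 20, 28, 30, 45, 60, 99, 150};
        System.out.println(percentileRank(prices, 15)); // 30.0
        System.out.println(percentileRank(prices, 30)); // 60.0
    }
}
```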
OpenAPI query parameter design
{"indexName": "product_list","map": {"size": 0,"aggs": {"czbk": {"percentile_ranks": {"field": "price","values": [15,30]}}}} }
Filtered documents
POST product_list/_search {"size": 0,"query": {"constant_score": {"filter": {"match": {"threelevel": "手機"}}}},"aggs": {"czbk": {"percentile_ranks": {"field": "price","values": [15,30]}}} }
OpenAPI query parameter design
{"indexName": "product_list","map": {"size": 0,"query": {"constant_score": {"filter": {"match": {"threelevel": "手機"}}}},"aggs": {"czbk": {"percentile_ranks": {"field": "price","values": [15,30]}}}} }
Code
// Percentile ranks
if (m.getValue() instanceof ParsedTDigestPercentileRanks) {
    for (Iterator<Percentile> iterator = ((ParsedTDigestPercentileRanks) m.getValue()).iterator(); iterator.hasNext(); ) {
        Percentile p = iterator.next();
        // key: the threshold value, value: the percentage of documents at or below it
        map.put(p.getValue(), p.getPercent());
    }
}
Verify the call
http://localhost:6666/v1/analysis/metric/agg
OR
http://localhost:5555/v1/analysis/metric/agg
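8. Log Instrumentation and Hot Search Terms for E-commerce Platforms
8.1 What Is Hot Search
Hot search terms on JD.com
8.2 Extracting Hot Searches
Hot-search analysis flow chart
8.3 Log Instrumentation
The configuration below applies to every service that needs instrumentation.
service-elasticsearch is used as the example here.
- Integrating Log4j2 with Spring Cloud
Compared with other logging frameworks, Log4j2 loses log data less often; thanks to the Disruptor, it is said to outperform Logback and others by more than 10x in multithreaded environments; and it uses the concurrency features introduced in JDK 1.5 to reduce deadlocks.
Exclude the default Logback integration.
Spring Cloud integrates Logback by default, so Logback must first be excluded in pom.xml:
<!-- Exclude the default Logback integration (Spring Cloud integrates Logback by default) -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-web</artifactId>
    <exclusions>
        <exclusion>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-logging</artifactId>
        </exclusion>
    </exclusions>
</dependency>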
- Add the Log4j2 starter dependency
<!-- Log4j2 starter -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-log4j2</artifactId>
</dependency>
<!-- Ring buffer (Disruptor) that Log4j2 depends on -->
<dependency>
    <groupId>com.lmax</groupId>
    <artifactId>disruptor</artifactId>
    <version>3.4.2</version>
</dependency>
- Set the configuration file
If the configuration file has a custom name, it must be declared in application.yml.
Edit the configuration in Nacos:
logging:
  config: classpath:log4j2-dev.xml
- Configuration file template
<Configuration>
    <Appenders>
        <Socket name="Socket" host="192.168.150.7" port="4567">
            <JsonLayout compact="true" eventEol="true" />
        </Socket>
    </Appenders>
    <Loggers>
        <Root level="info">
            <AppenderRef ref="Socket"/>
        </Root>
    </Loggers>
</Configuration>
As the configuration shows, a Socket appender is used to send the logged messages to Logstash.
Note that the Socket appender must be referenced by a logger below it (here the Root logger); otherwise nothing is sent to Logstash.
The host is the address of the Logstash server, and the port must match the one configured in Logstash.
- Log instrumentation
private void getClientConditions(CommonEntity commonEntity, SearchSourceBuilder searchSourceBuilder) {
    // Iterate over the query conditions supplied by the downstream caller
    for (Map.Entry<String, Object> m : commonEntity.getMap().entrySet()) {
        if (StringUtils.isNotEmpty(m.getKey()) && m.getValue() != null) {
            String key = m.getKey();
            String value = String.valueOf(m.getValue());
            // Build the query part of the DSL request body
            searchSourceBuilder.query(QueryBuilders.matchQuery(key, value));
            // Instrumentation point: log the keyword so Logstash can collect it
            logger.info("search for the keyword:" + value);
        }
    }
}
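- Create the index
The index below stores the keywords users enter; the data is later processed through aggregation and finally written into the corpus.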
PUT es-log/ {"mappings": {"properties": {"@timestamp": {"type": "date"},"host": {"type": "text"},"searchkey": {"type": "keyword"},"port": {"type": "long"},"loggerName": {"type": "text"}}} }
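8.4 Persisting the Data
- Configure Logstash.conf
There are two ways to connect to Logstash:
(1) a Socket connection
(2) a GELF connection
Expose port 4567 of the Logstash container; see the reference documentation.
- Run a full-text search
http://localhost:8888/v1/docs/mquery
Parameters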
{"pageNumber": 1,"pageSize": 3,"indexName": "product_list","highlight": "productname","map": {"productname": "小米"}
}
- Check whether data has arrived
GET es-log/_search
{"from": 0,"size": 200,"query": {"match_all": {}}
}
Response
"hits" : [{"_index" : "es-log","_type" : "_doc","_id" : "H94AKpQB5vqCNWpIYHYT","_score" : 1.0,"_source" : {"host" : "192.168.150.1","loggerName" : "com.xin.service.impl.ElasticsearchDocumentServiceImpl","@timestamp" : "2025-01-03T02:30:55.118Z","searchkey" : "小米","port" : 54544}},{"_index" : "es-log","_type" : "_doc","_id" : "ZdgAKpQBrYxtVgSQgvHB","_score" : 1.0,"_source" : {"host" : "192.168.150.1","loggerName" : "com.xin.service.impl.ElasticsearchDocumentServiceImpl","@timestamp" : "2025-01-03T02:31:04.021Z","searchkey" : "小米","port" : 54544}}]
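8.5 Hot Search OpenAPI
Aggregation
Group the documents in the es-log index by search keyword, count how often each term occurs, and keep only the terms that meet the frequency threshold.
DSL implementation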
POST es-log/_search?size=0
{"aggs": {"czbk": {"terms": {"field": "searchkey","min_doc_count": 5,"size": 2,"order": {"_count": "desc"}}}}
}
Response
{"took" : 155,"timed_out" : false,"_shards" : {"total" : 1,"successful" : 1,"skipped" : 0,"failed" : 0},"hits" : {"total" : {"value" : 14,"relation" : "eq"},"max_score" : null,"hits" : [ ]},"aggregations" : {"czbk" : {"doc_count_error_upper_bound" : 0,"sum_other_doc_count" : 0,"buckets" : [{"key" : "華為","doc_count" : 7},{"key" : "小米","doc_count" : 7}]}}
}
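The effect of min_doc_count, size, and the _count ordering in this terms aggregation can be sketched over an in-memory list of logged keywords. This is a rough equivalent under stated assumptions, not how ES executes the aggregation: the class HotWordsSketch and the ASCII keywords are hypothetical, and ties are broken by key in ascending order purely as a choice made for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical in-memory counterpart of the terms aggregation above.
class HotWordsSketch {

    // Counts each keyword, keeps those seen at least minDocCount times,
    // orders by count descending (ties by key ascending), and returns
    // at most `size` entries in that order.
    static Map<String, Long> hotWords(List<String> searches, long minDocCount, int size) {
        Map<String, Long> counts = searches.stream()
                .collect(Collectors.groupingBy(s -> s, Collectors.counting()));
        return counts.entrySet().stream()
                .filter(e -> e.getValue() >= minDocCount)
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Long>comparingByKey()))
                .limit(size)
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                        (a, b) -> a, LinkedHashMap::new));
    }

    public static void main(String[] args) {
        // Illustrative search log: two frequent keywords and one rare one
        List<String> searches = new ArrayList<>();
        searches.addAll(Collections.nCopies(7, "xiaomi"));
        searches.addAll(Collections.nCopies(7, "huawei"));
        searches.addAll(Collections.nCopies(2, "oppo"));
        // prints {huawei=7, xiaomi=7}
        System.out.println(hotWords(searches, 5, 2));
    }
}
```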
OpenAPI query parameter design
- Define the hot-words interface
// Get hot search terms
public Map<String, Long> hotWords(CommonEntity commonEntity) throws Exception;
- Define the hot-words implementation
@Override
public Map<String, Long> hotWords(CommonEntity commonEntity) throws Exception {
    // Result map; insertion order preserves the bucket order
    Map<String, Long> map = new LinkedHashMap<>();
    // Execute the query
    SearchResponse response = getSearchResponse(commonEntity);
    // Read the first aggregation in the response as a terms aggregation
    Terms termsAggData = response.getAggregations().get(
            response.getAggregations().getAsMap().entrySet().iterator().next().getKey());
    for (Terms.Bucket entry : termsAggData.getBuckets()) {
        if (entry.getKey() != null) {
            // key: the grouped field value (the search keyword)
            String key = entry.getKey().toString();
            // count: number of documents in the bucket
            Long count = entry.getDocCount();
            map.put(key, count);
        }
    }
    return map;
}
- Define the hot-words controller
@GetMapping(value = "/hotwords")
public ResponseData hotWords(@RequestBody CommonEntity commonEntity) {
    // Response returned to the caller
    ResponseData responseData = new ResponseData();
    if (StringUtils.isEmpty(commonEntity.getIndexName())) {
        responseData.setResultEnum(ResultEnum.param_isnull);
        return responseData;
    }
    // Hot words returned by the service
    Map<String, Long> result = null;
    try {
        result = analysisService.hotWords(commonEntity);
        // Fill in the result (the payload type is inferred and autoboxed)
        responseData.setResultEnum(result, ResultEnum.success, null);
        logger.info(TipsEnum.hotwords_get_doc_success.getMessage());
    } catch (Exception e) {
        // Print to the console and log the failure
        e.printStackTrace();
        logger.error(TipsEnum.hotwords_get_doc_fail.getMessage());
        // Build the error response
        responseData.setResultEnum(ResultEnum.error);
    }
    return responseData;
}
- Verify the call
Fetch the hot search terms from the analysis service
http://localhost:5555/v1/analysis/hotwords
Parameters
{"indexName": "es-log","map": {"aggs": {"per_count": {"terms": {"field": "searchkey","min_doc_count": 5,"size": 2,"order": {"_count": "desc"}}}}} }
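- field: the field to group by
- min_doc_count: minimum number of documents a term must appear in to count as a hot search term
- size: number of terms to return
- order: sort order (ascending or descending)
Response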
{"code": "200","desc": "操作成功!","data": {"華為": 7,"小米": 7} }