Table of Contents
- 1. Installing the Chinese analysis plugin
- Installing version 7.14.1 on Linux
- Test 1: ik_smart
- Test 2: ik_max_word
- 2. Elasticsearch's built-in analyzers
- 3. Installing the pinyin plugin (and using IK + pinyin)
- Configuring IK + pinyin analysis
1. Installing the Chinese analysis plugin
IK Analysis for Elasticsearch is a fairly popular open-source Chinese analysis plugin.
Project page: https://github.com/medcl/elasticsearch-analysis-ik
I originally intended to install it in one of these two ways:
.\bin\elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v5.4.2/elasticsearch-analysis-ik-5.4.2.zip
bin\elasticsearch-plugin install file:///C:\Users\jk\Desktop\elasticsearch-analysis-ik-5.4.2.zip
but both kept failing with:
ERROR: `elasticsearch` directory is missing in the plugin zip
So I fell back to the manual method: create a folder named ik under the plugins directory of the Elasticsearch installation and unzip elasticsearch-analysis-ik-5.4.2.zip into it. The project README in fact states that versions below 5.5.1 must be installed this way, by unzipping.
Reference: Installing and using the IK analyzer on Elasticsearch 5.x
Installing version 7.14.1 on Linux
Download: https://github.com/medcl/elasticsearch-analysis-ik/releases
[hadoop@node01 ~]$ elasticsearch-7.14.1/bin/elasticsearch-plugin install file:///mnt/elasticsearch-analysis-ik-7.14.1.zip
Restart the node after installing so the plugin is loaded. IK provides two analyzers:
ik_smart: performs the coarsest-grained segmentation.
ik_max_word: performs the finest-grained segmentation of the text.
Test 1: ik_smart
GET /_analyze
{
  "analyzer": "ik_smart",
  "text": "中華人民共和國"
}
Result:
{"tokens": [
  {"token": "中華人民共和國", "start_offset": 0, "end_offset": 7, "type": "CN_WORD", "position": 0}
]}
Test 2: ik_max_word
GET /_analyze
{
  "analyzer": "ik_max_word",
  "text": "中華人民共和國"
}
Result:
{"tokens": [
  {"token": "中華人民共和國", "start_offset": 0, "end_offset": 7, "type": "CN_WORD", "position": 0},
  {"token": "中華人民", "start_offset": 0, "end_offset": 4, "type": "CN_WORD", "position": 1},
  {"token": "中華", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 2},
  {"token": "華人", "start_offset": 1, "end_offset": 3, "type": "CN_WORD", "position": 3},
  {"token": "人民共和國", "start_offset": 2, "end_offset": 7, "type": "CN_WORD", "position": 4},
  {"token": "人民", "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 5},
  {"token": "共和國", "start_offset": 4, "end_offset": 7, "type": "CN_WORD", "position": 6},
  {"token": "共和", "start_offset": 4, "end_offset": 6, "type": "CN_WORD", "position": 7},
  {"token": "國", "start_offset": 6, "end_offset": 7, "type": "CN_CHAR", "position": 8}
]}
GET /_analyze
{
  "analyzer": "ik_max_word",
  "text": "I love you"
}
Result:
{"tokens": [
  {"token": "i", "start_offset": 0, "end_offset": 1, "type": "ENGLISH", "position": 0},
  {"token": "love", "start_offset": 2, "end_offset": 6, "type": "ENGLISH", "position": 1},
  {"token": "you", "start_offset": 7, "end_offset": 10, "type": "ENGLISH", "position": 2}
]}
Reference: https://blog.csdn.net/wenxindiaolong061/article/details/82562450
2. Elasticsearch's built-in analyzers
- standard analyzer
- simple analyzer
- whitespace analyzer
- language analyzers (analyzers for specific languages)
Example sentence: Set the shape to semi-transparent by calling set_trans(5)
How the different analyzers tokenize it:
- standard analyzer: set, the, shape, to, semi, transparent, by, calling, set_trans, 5 (standard is the default)
- simple analyzer: set, the, shape, to, semi, transparent, by, calling, set, trans
- whitespace analyzer: Set, the, shape, to, semi-transparent, by, calling, set_trans(5)
- language analyzer (for a specific language, e.g. english, the English analyzer): set, shape, semi, transpar, call, set_tran, 5
Analyzer test:
GET /_analyze
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog’s bone."
}
Result:
{"tokens": [
  {"token": "the", "start_offset": 0, "end_offset": 3, "type": "<ALPHANUM>", "position": 0},
  {"token": "2", "start_offset": 4, "end_offset": 5, "type": "<NUM>", "position": 1},
  {"token": "quick", "start_offset": 6, "end_offset": 11, "type": "<ALPHANUM>", "position": 2},
  {"token": "brown", "start_offset": 12, "end_offset": 17, "type": "<ALPHANUM>", "position": 3},
  {"token": "foxes", "start_offset": 18, "end_offset": 23, "type": "<ALPHANUM>", "position": 4},
  {"token": "jumped", "start_offset": 24, "end_offset": 30, "type": "<ALPHANUM>", "position": 5},
  {"token": "over", "start_offset": 31, "end_offset": 35, "type": "<ALPHANUM>", "position": 6},
  {"token": "the", "start_offset": 36, "end_offset": 39, "type": "<ALPHANUM>", "position": 7},
  {"token": "lazy", "start_offset": 40, "end_offset": 44, "type": "<ALPHANUM>", "position": 8},
  {"token": "dog’s", "start_offset": 45, "end_offset": 50, "type": "<ALPHANUM>", "position": 9},
  {"token": "bone", "start_offset": 51, "end_offset": 55, "type": "<ALPHANUM>", "position": 10}
]}
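The stemming behaviour of the english language analyzer listed earlier can be checked the same way; it is built in, so no index setup is needed:

```console
GET /_analyze
{
  "analyzer": "english",
  "text": "Set the shape to semi-transparent by calling set_trans(5)"
}
```

As listed above, this yields the stemmed terms set, shape, semi, transpar, call, set_tran and 5, with the stop words the, to and by removed.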
You can see that the input is split on whitespace and non-letter punctuation, uppercase is folded to lowercase (QUICK becomes quick), and stop words such as "the" are not removed. In the output, token is the term produced, start_offset is the starting character offset, end_offset is the ending offset, and position is the term's position in the token stream.
Configurable options:
Option | Description |
---|---|
max_token_length | The maximum token length. If a token exceeds this length, it is split at max_token_length intervals. Defaults to 255. |
stopwords | A pre-defined stop-word list such as _english_, or an array containing the stop words. Defaults to _none_. |
stopwords_path | The path to a file containing stop words. |
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english_analyzer": {
          "type": "standard",
          "max_token_length": 5,
          "stopwords": "_english_"
        }
      }
    }
  }
}
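If these settings are applied when creating an index (the index name my-index below is just an example), the customized analyzer can be exercised through _analyze; per the Elasticsearch standard-analyzer documentation, English stop words are dropped and tokens longer than five characters are split:

```console
GET /my-index/_analyze
{
  "analyzer": "my_english_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog’s bone."
}
```

The expected terms are [ 2, quick, brown, foxes, jumpe, d, over, lazy, dog’s, bone ]: "the" is removed as a stop word, and "jumped" is cut at the five-character limit.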
Different analyzers produce different tokenizations. The built-in analyzers are listed below; essentially none of them, Language Analyzers included, handle Chinese well, so Chinese segmentation requires installing a separate analyzer.
Analyzer | Description | Input | Result |
---|---|---|---|
standard | The standard analyzer is the default, used when none is specified. It provides grammar-based tokenization (following the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29) and works well for most languages. | The 2 QUICK Brown-Foxes jumped over the lazy dog’s bone. | [ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog’s, bone ] |
simple | The simple analyzer breaks text into tokens at any non-letter character (digits, whitespace, hyphens, apostrophes), discards the non-letter characters, and lowercases the tokens. | The 2 QUICK Brown-Foxes jumped over the lazy dog’s bone. | [ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ] |
whitespace | The whitespace analyzer breaks text into terms whenever it encounters whitespace. | The 2 QUICK Brown-Foxes jumped over the lazy dog’s bone. | [ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog’s, bone. ] |
stop | The stop analyzer is the same as the simple analyzer but adds stop-word removal; it uses the english stop-word list by default. | The 2 QUICK Brown-Foxes jumped over the lazy dog’s bone. | [ quick, brown, foxes, jumped, over, lazy, dog, s, bone ] |
keyword | Does not tokenize; returns the whole field as a single term. | The 2 QUICK Brown-Foxes jumped over the lazy dog’s bone. | [The 2 QUICK Brown-Foxes jumped over the lazy dog’s bone.] |
pattern | The pattern analyzer uses a regular expression to split text into terms. The expression should match token separators, not the tokens themselves, and defaults to \W+ (all non-word characters). | The 2 QUICK Brown-Foxes jumped over the lazy dog’s bone. | [ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ] |
Language analyzers: arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, etc. | A set of analyzers aimed at analyzing text in a specific language. | | |
The simplest Chinese analyzer to get started with is IK; alternatives include jieba and the HIT (Harbin Institute of Technology) analyzer, among others.
Analyzer | Description | Input | Result |
---|---|---|---|
ik_smart | IK's coarse-grained analyzer; supports custom dictionaries and remote dictionaries | 學如逆水行舟,不進則退 | [學如逆水行舟,不進則退] |
ik_max_word | IK's exhaustive analyzer; supports custom dictionaries and remote dictionaries | 學如逆水行舟,不進則退 | [學如逆水行舟,學如逆水,逆水行舟,逆水,行舟,不進則退,不進,則,退] |
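The custom dictionaries mentioned above are declared in config/IKAnalyzer.cfg.xml under the plugin directory. A minimal sketch (the .dic file names are illustrative; each file lists one word per line, and the paths are relative to the config directory):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- local extension dictionary (illustrative path) -->
    <entry key="ext_dict">custom/my_words.dic</entry>
    <!-- local extension stop-word dictionary (illustrative path) -->
    <entry key="ext_stopwords">custom/my_stopwords.dic</entry>
</properties>
```

Remote dictionaries are configured the same way through the remote_ext_dict and remote_ext_stopwords keys, which take URLs that IK polls for updates.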
References:
"9 kinds of Elasticsearch analyzers explained in one article" | Boxuegu
Chinese and English stop-word lists (stopword)
3. Installing the pinyin plugin (and using IK + pinyin)
When searching for products on Taobao, you will find that Chinese characters, pinyin, or a mixture of the two all return the results you want; this is implemented with a pinyin search plugin.
Releases: https://github.com/infinilabs/analysis-pinyin/releases
Pick the release whose version matches your ES version. Downloading the pre-built zip package is recommended; if you download the source package instead, you have to build the zip yourself with mvn clean package. Online install: elasticsearch-7.14.1/bin/elasticsearch-plugin install https://get.infini.cloud/elasticsearch/analysis-pinyin/7.14.1
Create an index with a custom pinyin analyzer:
PUT /medcl/
{
  "settings": {
    "analysis": {
      "analyzer": {
        "pinyin_analyzer": {
          "tokenizer": "my_pinyin"
        }
      },
      "tokenizer": {
        "my_pinyin": {
          "type": "pinyin",
          "keep_separate_first_letter": false,
          "keep_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "lowercase": true,
          "remove_duplicated_term": true
        }
      }
    }
  }
}
Test:
POST medcl/_analyze
{
  "analyzer": "pinyin_analyzer",
  "text": "劉德華"
}
Result:
{"tokens": [
  {"token": "liu", "start_offset": 0, "end_offset": 0, "type": "word", "position": 0},
  {"token": "劉德華", "start_offset": 0, "end_offset": 0, "type": "word", "position": 0},
  {"token": "ldh", "start_offset": 0, "end_offset": 0, "type": "word", "position": 0},
  {"token": "de", "start_offset": 0, "end_offset": 0, "type": "word", "position": 1},
  {"token": "hua", "start_offset": 0, "end_offset": 0, "type": "word", "position": 2}
]}
Configuring IK + pinyin analysis
Settings (these examples were originally written against ES 5.x; on ES 7.x a Content-Type header is required, so it is added here):
curl -XPUT "http://localhost:9200/medcl/" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {"tokenizer": "ik_max_word"},
        "pinyin_analyzer": {"tokenizer": "shopmall_pinyin"}
      },
      "tokenizer": {
        "shopmall_pinyin": {
          "keep_joined_full_pinyin": "true",
          "keep_first_letter": "true",
          "keep_separate_first_letter": "false",
          "lowercase": "true",
          "type": "pinyin",
          "limit_first_letter_length": "16",
          "keep_original": "true",
          "keep_full_pinyin": "true"
        }
      }
    }
  }
}'
Create the mapping (on ES 7.x mapping types and include_in_all no longer exist, so the name field is mapped directly):
curl -XPOST http://localhost:9200/medcl/_mapping -H 'Content-Type: application/json' -d'
{
  "properties": {
    "name": {
      "type": "text",
      "analyzer": "ik_max_word",
      "fields": {
        "pinyin": {
          "type": "text",
          "analyzer": "pinyin_analyzer"
        }
      }
    }
  }
}'
Index some test documents:
curl -XPOST http://localhost:9200/medcl/_doc -H 'Content-Type: application/json' -d'{"name":"劉德華"}'
curl -XPOST http://localhost:9200/medcl/_doc -H 'Content-Type: application/json' -d'{"name":"中華人民共和國國歌"}'
Pinyin search:
curl -XPOST "http://localhost:9200/medcl/_search?q=name.pinyin:liu"
curl -XPOST "http://localhost:9200/medcl/_search?q=name.pinyin:de"
curl -XPOST "http://localhost:9200/medcl/_search?q=name.pinyin:hua"
curl -XPOST "http://localhost:9200/medcl/_search?q=name.pinyin:ldh"
Chinese search:
curl -XPOST "http://localhost:9200/medcl/_search?q=name:劉"
curl -XPOST "http://localhost:9200/medcl/_search?q=name:劉德"
Note: classify the user's query with regular expressions into Chinese, pinyin, Chinese + pinyin, or Chinese + pinyin + digits + special characters, and search accordingly:
- For a pure Chinese query with no results, convert it to pinyin and search again; if the pinyin search also returns nothing, fall back to a fuzzy search; if that still returns nothing, consider showing recommendations instead
- For a pure pinyin query with no results, fall back to a fuzzy search
- For a Chinese + pinyin query, treat it as pinyin for now
- For pinyin + digits + special characters, treat it as pinyin for now
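One way to cover both the Chinese field and its pinyin sub-field in a single request (a sketch, not from the original article) is a multi_match query:

```console
GET /medcl/_search
{
  "query": {
    "multi_match": {
      "query": "ldh",
      "fields": ["name", "name.pinyin"]
    }
  }
}
```

Because the pinyin tokenizer above keeps full pinyin, first letters, and the original token, this kind of query can match 劉德華 whether the user types characters, full pinyin, or initials.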
Reference: Installing the Elasticsearch pinyin plugin (and using IK + pinyin)