Table of Contents
- 1. Installing the Chinese analysis plugin
- Installing version 7.14.1 on Linux
- Test 1: ik_smart
- Test 2: ik_max_word
- 2. Elasticsearch's built-in analyzers
- 3. Installing the pinyin plugin (and using IK + pinyin)
- Configuring IK + pinyin analysis
1. Installing the Chinese analysis plugin
IK Analysis for Elasticsearch is a fairly popular open-source Chinese analysis plugin.
Project page: https://github.com/medcl/elasticsearch-analysis-ik
I originally intended to install it in one of these two ways:
.\bin\elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v5.4.2/elasticsearch-analysis-ik-5.4.2.zip
bin\elasticsearch-plugin install file:///C:\Users\jk\Desktop\elasticsearch-analysis-ik-5.4.2.zip
but both kept failing with:
ERROR: `elasticsearch` directory is missing in the plugin zip
So I fell back to the manual method: create a folder named ik under the plugins directory of the Elasticsearch installation and unzip elasticsearch-analysis-ik-5.4.2.zip into it. The project README in fact states that versions below 5.5.1 must be installed this way, by unzipping.
Reference: Installing and using the IK analyzer on Elasticsearch 5.x
Installing version 7.14.1 on Linux
Download: https://github.com/medcl/elasticsearch-analysis-ik/releases
[hadoop@node01 ~]$ elasticsearch-7.14.1/bin/elasticsearch-plugin install file:///mnt/elasticsearch-analysis-ik-7.14.1.zip
Restart the node after installing so the plugin is loaded. IK provides two analyzers:
ik_smart: performs the coarsest-grained segmentation.
ik_max_word: performs the finest-grained segmentation of the text.
Test 1: ik_smart
GET /_analyze
{
  "analyzer": "ik_smart",
  "text": "中華人民共和國"
}
Result:
{"tokens": [
  {"token": "中華人民共和國", "start_offset": 0, "end_offset": 7, "type": "CN_WORD", "position": 0}
]}
Test 2: ik_max_word
GET /_analyze
{
  "analyzer": "ik_max_word",
  "text": "中華人民共和國"
}
Result:
{"tokens": [
  {"token": "中華人民共和國", "start_offset": 0, "end_offset": 7, "type": "CN_WORD", "position": 0},
  {"token": "中華人民", "start_offset": 0, "end_offset": 4, "type": "CN_WORD", "position": 1},
  {"token": "中華", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 2},
  {"token": "華人", "start_offset": 1, "end_offset": 3, "type": "CN_WORD", "position": 3},
  {"token": "人民共和國", "start_offset": 2, "end_offset": 7, "type": "CN_WORD", "position": 4},
  {"token": "人民", "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 5},
  {"token": "共和國", "start_offset": 4, "end_offset": 7, "type": "CN_WORD", "position": 6},
  {"token": "共和", "start_offset": 4, "end_offset": 6, "type": "CN_WORD", "position": 7},
  {"token": "國", "start_offset": 6, "end_offset": 7, "type": "CN_CHAR", "position": 8}
]}
GET /_analyze
{
  "analyzer": "ik_max_word",
  "text": "I love you"
}
Result:
{"tokens": [
  {"token": "i", "start_offset": 0, "end_offset": 1, "type": "ENGLISH", "position": 0},
  {"token": "love", "start_offset": 2, "end_offset": 6, "type": "ENGLISH", "position": 1},
  {"token": "you", "start_offset": 7, "end_offset": 10, "type": "ENGLISH", "position": 2}
]}
Reference: https://blog.csdn.net/wenxindiaolong061/article/details/82562450
2. Elasticsearch's built-in analyzers
- standard analyzer
- simple analyzer
- whitespace analyzer
- language analyzers (analyzers for specific languages)
Example sentence: Set the shape to semi-transparent by calling set_trans(5)
How the different analyzers tokenize it:
- standard analyzer: set, the, shape, to, semi, transparent, by, calling, set_trans, 5 (standard is the default)
- simple analyzer: set, the, shape, to, semi, transparent, by, calling, set, trans
- whitespace analyzer: Set, the, shape, to, semi-transparent, by, calling, set_trans(5)
- language analyzer (for a specific language, e.g. english, the English analyzer): set, shape, semi, transpar, call, set_tran, 5
Analyzer test:
GET /_analyze
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog’s bone."
}
Result:
{"tokens": [
  {"token": "the", "start_offset": 0, "end_offset": 3, "type": "<ALPHANUM>", "position": 0},
  {"token": "2", "start_offset": 4, "end_offset": 5, "type": "<NUM>", "position": 1},
  {"token": "quick", "start_offset": 6, "end_offset": 11, "type": "<ALPHANUM>", "position": 2},
  {"token": "brown", "start_offset": 12, "end_offset": 17, "type": "<ALPHANUM>", "position": 3},
  {"token": "foxes", "start_offset": 18, "end_offset": 23, "type": "<ALPHANUM>", "position": 4},
  {"token": "jumped", "start_offset": 24, "end_offset": 30, "type": "<ALPHANUM>", "position": 5},
  {"token": "over", "start_offset": 31, "end_offset": 35, "type": "<ALPHANUM>", "position": 6},
  {"token": "the", "start_offset": 36, "end_offset": 39, "type": "<ALPHANUM>", "position": 7},
  {"token": "lazy", "start_offset": 40, "end_offset": 44, "type": "<ALPHANUM>", "position": 8},
  {"token": "dog’s", "start_offset": 45, "end_offset": 50, "type": "<ALPHANUM>", "position": 9},
  {"token": "bone", "start_offset": 51, "end_offset": 55, "type": "<ALPHANUM>", "position": 10}
]}
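The stemming behaviour of the english language analyzer listed earlier can be checked the same way; it is built in, so no index setup is needed:

```console
GET /_analyze
{
  "analyzer": "english",
  "text": "Set the shape to semi-transparent by calling set_trans(5)"
}
```

As listed above, this yields the stemmed terms set, shape, semi, transpar, call, set_tran and 5, with the stop words the, to and by removed.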
You can see that the input is split on whitespace and non-letter punctuation, uppercase is folded to lowercase (QUICK becomes quick), and stop words such as "the" are not removed. In the output, token is the term produced, start_offset is the starting character offset, end_offset is the ending offset, and position is the term's position in the token stream.
Configurable options:
Option | Description |
---|---|
max_token_length | The maximum token length. If a token exceeds this length, it is split at max_token_length intervals. Defaults to 255. |
stopwords | A pre-defined stop-word list such as _english_, or an array containing the stop words. Defaults to _none_. |
stopwords_path | The path to a file containing stop words. |
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english_analyzer": {
          "type": "standard",
          "max_token_length": 5,
          "stopwords": "_english_"
        }
      }
    }
  }
}
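If these settings are applied when creating an index (the index name my-index below is just an example), the customized analyzer can be exercised through _analyze; per the Elasticsearch standard-analyzer documentation, English stop words are dropped and tokens longer than five characters are split:

```console
GET /my-index/_analyze
{
  "analyzer": "my_english_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog’s bone."
}
```

The expected terms are [ 2, quick, brown, foxes, jumpe, d, over, lazy, dog’s, bone ]: "the" is removed as a stop word, and "jumped" is cut at the five-character limit.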
Different analyzers produce different tokenizations. The built-in analyzers are listed below; essentially none of them, Language Analyzers included, handle Chinese well, so Chinese segmentation requires installing a separate analyzer.
Analyzer | Description | Input | Result |
---|---|---|---|
standard | The standard analyzer is the default, used when none is specified. It provides grammar-based tokenization (following the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29) and works well for most languages. | The 2 QUICK Brown-Foxes jumped over the lazy dog’s bone. | [ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog’s, bone ] |
simple | The simple analyzer breaks text into tokens at any non-letter character (digits, whitespace, hyphens, apostrophes), discards the non-letter characters, and lowercases the tokens. | The 2 QUICK Brown-Foxes jumped over the lazy dog’s bone. | [ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ] |
whitespace | The whitespace analyzer breaks text into terms whenever it encounters whitespace. | The 2 QUICK Brown-Foxes jumped over the lazy dog’s bone. | [ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog’s, bone. ] |
stop | The stop analyzer is the same as the simple analyzer but adds stop-word removal; it uses the english stop-word list by default. | The 2 QUICK Brown-Foxes jumped over the lazy dog’s bone. | [ quick, brown, foxes, jumped, over, lazy, dog, s, bone ] |
keyword | Does not tokenize; returns the whole field as a single term. | The 2 QUICK Brown-Foxes jumped over the lazy dog’s bone. | [The 2 QUICK Brown-Foxes jumped over the lazy dog’s bone.] |
pattern | The pattern analyzer uses a regular expression to split text into terms. The expression should match token separators, not the tokens themselves, and defaults to \W+ (all non-word characters). | The 2 QUICK Brown-Foxes jumped over the lazy dog’s bone. | [ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ] |
Language analyzers: arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, etc. | A set of analyzers aimed at analyzing text in a specific language. | | |
The simplest Chinese analyzer to get started with is IK; alternatives include jieba and the HIT (Harbin Institute of Technology) analyzer, among others.
Analyzer | Description | Input | Result |
---|---|---|---|
ik_smart | IK's coarse-grained analyzer; supports custom dictionaries and remote dictionaries | 學如逆水行舟,不進則退 | [學如逆水行舟,不進則退] |
ik_max_word | IK's exhaustive analyzer; supports custom dictionaries and remote dictionaries | 學如逆水行舟,不進則退 | [學如逆水行舟,學如逆水,逆水行舟,逆水,行舟,不進則退,不進,則,退] |
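The custom dictionaries mentioned above are declared in config/IKAnalyzer.cfg.xml under the plugin directory. A minimal sketch (the .dic file names are illustrative; each file lists one word per line, and the paths are relative to the config directory):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- local extension dictionary (illustrative path) -->
    <entry key="ext_dict">custom/my_words.dic</entry>
    <!-- local extension stop-word dictionary (illustrative path) -->
    <entry key="ext_stopwords">custom/my_stopwords.dic</entry>
</properties>
```

Remote dictionaries are configured the same way through the remote_ext_dict and remote_ext_stopwords keys, which take URLs that IK polls for updates.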
References:
"9 kinds of Elasticsearch analyzers explained in one article" | Boxuegu
Chinese and English stop-word lists (stopword)
3. Installing the pinyin plugin (and using IK + pinyin)
When searching for products on Taobao, you will find that Chinese characters, pinyin, or a mixture of the two all return the results you want; this is implemented with a pinyin search plugin.
Releases: https://github.com/infinilabs/analysis-pinyin/releases
Pick the release whose version matches your ES version. Downloading the pre-built zip package is recommended; if you download the source package instead, you have to build the zip yourself with mvn clean package. Online install: elasticsearch-7.14.1/bin/elasticsearch-plugin install https://get.infini.cloud/elasticsearch/analysis-pinyin/7.14.1
Create an index with a custom pinyin analyzer:
PUT /medcl/
{
  "settings": {
    "analysis": {
      "analyzer": {
        "pinyin_analyzer": {
          "tokenizer": "my_pinyin"
        }
      },
      "tokenizer": {
        "my_pinyin": {
          "type": "pinyin",
          "keep_separate_first_letter": false,
          "keep_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "lowercase": true,
          "remove_duplicated_term": true
        }
      }
    }
  }
}
Test:
POST medcl/_analyze
{
  "analyzer": "pinyin_analyzer",
  "text": "劉德華"
}
Result:
{"tokens": [
  {"token": "liu", "start_offset": 0, "end_offset": 0, "type": "word", "position": 0},
  {"token": "劉德華", "start_offset": 0, "end_offset": 0, "type": "word", "position": 0},
  {"token": "ldh", "start_offset": 0, "end_offset": 0, "type": "word", "position": 0},
  {"token": "de", "start_offset": 0, "end_offset": 0, "type": "word", "position": 1},
  {"token": "hua", "start_offset": 0, "end_offset": 0, "type": "word", "position": 2}
]}
Configuring IK + pinyin analysis
Settings (these examples were originally written against ES 5.x; on ES 7.x a Content-Type header is required, so it is added here):
curl -XPUT "http://localhost:9200/medcl/" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {"tokenizer": "ik_max_word"},
        "pinyin_analyzer": {"tokenizer": "shopmall_pinyin"}
      },
      "tokenizer": {
        "shopmall_pinyin": {
          "keep_joined_full_pinyin": "true",
          "keep_first_letter": "true",
          "keep_separate_first_letter": "false",
          "lowercase": "true",
          "type": "pinyin",
          "limit_first_letter_length": "16",
          "keep_original": "true",
          "keep_full_pinyin": "true"
        }
      }
    }
  }
}'
Create the mapping (on ES 7.x mapping types and include_in_all no longer exist, so the name field is mapped directly):
curl -XPOST http://localhost:9200/medcl/_mapping -H 'Content-Type: application/json' -d'
{
  "properties": {
    "name": {
      "type": "text",
      "analyzer": "ik_max_word",
      "fields": {
        "pinyin": {
          "type": "text",
          "analyzer": "pinyin_analyzer"
        }
      }
    }
  }
}'
Index some test documents:
curl -XPOST http://localhost:9200/medcl/_doc -H 'Content-Type: application/json' -d'{"name":"劉德華"}'
curl -XPOST http://localhost:9200/medcl/_doc -H 'Content-Type: application/json' -d'{"name":"中華人民共和國國歌"}'
Pinyin search:
curl -XPOST "http://localhost:9200/medcl/_search?q=name.pinyin:liu"
curl -XPOST "http://localhost:9200/medcl/_search?q=name.pinyin:de"
curl -XPOST "http://localhost:9200/medcl/_search?q=name.pinyin:hua"
curl -XPOST "http://localhost:9200/medcl/_search?q=name.pinyin:ldh"
Chinese search:
curl -XPOST "http://localhost:9200/medcl/_search?q=name:劉"
curl -XPOST "http://localhost:9200/medcl/_search?q=name:劉德"
Note: classify the user's query with regular expressions into Chinese, pinyin, Chinese + pinyin, or Chinese + pinyin + digits + special characters, and search accordingly:
- For a pure Chinese query with no results, convert it to pinyin and search again; if the pinyin search also returns nothing, fall back to a fuzzy search; if that still returns nothing, consider showing recommendations instead
- For a pure pinyin query with no results, fall back to a fuzzy search
- For a Chinese + pinyin query, treat it as pinyin for now
- For pinyin + digits + special characters, treat it as pinyin for now
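One way to cover both the Chinese field and its pinyin sub-field in a single request (a sketch, not from the original article) is a multi_match query:

```console
GET /medcl/_search
{
  "query": {
    "multi_match": {
      "query": "ldh",
      "fields": ["name", "name.pinyin"]
    }
  }
}
```

Because the pinyin tokenizer above keeps full pinyin, first letters, and the original token, this kind of query can match 劉德華 whether the user types characters, full pinyin, or initials.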
Reference: Installing the Elasticsearch pinyin plugin (and using IK + pinyin)