Custom Dictionaries and Stop Words in Elasticsearch

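In Elasticsearch, a custom dictionary is one of the main ways to improve tokenization quality, especially for Chinese or domain-specific text. This guide walks through setting one up with the IK analyzer.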

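Why use a custom dictionary?
Limits of the default analyzers:
The analyzers that ship with ES (such as the Standard Analyzer) handle Chinese poorly because they split the text into single characters. The IK analyzer supports Chinese, but its default word list may lack domain-specific terms (such as "大模型" (large model) or "元宇宙" (metaverse)).
Business requirements:
Keep technical terms intact (for example, "機器學習" (machine learning) should not be split into "機器" and "學習");
Recognize proper nouns such as brand names, personal names, and place names;
Handle internet slang and abbreviations (such as "yyds" and "內卷").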
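How do you configure a custom dictionary?
The steps below use the IK analyzer as the example.
Step 1: Create the dictionary file
In the plugins/ik/config directory under the ES installation directory, create a custom dictionary file (for example custom/mydict.dic) with one term per line, saved as UTF-8.
Step 2: Edit the configuration file
Edit plugins/ik/config/IKAnalyzer.cfg.xml and add the path of the custom dictionary: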

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- Extension dictionaries go here -->
    <entry key="ext_dict">custom/mydict.dic</entry>
    <!-- Extension stop-word dictionaries go here -->
    <entry key="ext_stopwords">custom/stopwords.dic</entry>
    <!-- Remote extension dictionaries go here -->
    <!-- <entry key="remote_ext_dict">words_location</entry> -->
    <!-- Remote extension stop-word dictionaries go here -->
    <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>

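Path rules:
Use forward slashes (/) inside a path, and semicolons (;) to separate multiple dictionary files;
Paths are relative to the ik/config directory (for example, custom/mydict.dic resolves to plugins/ik/config/custom/mydict.dic).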
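To double-check the path rules before restarting, a small script can read the config file and list where each dictionary is expected to live. The following is a minimal sketch, not part of IK itself: the function name `configured_dictionaries` and the second dictionary `custom/brands.dic` are made up for illustration, and it assumes entries are separated by semicolons as described above.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def configured_dictionaries(cfg_xml: bytes, config_dir: str) -> dict:
    """Map each config key to the dictionary paths it lists.

    Entries are split on ';' and resolved relative to config_dir.
    Commented-out entries are ignored by the XML parser.
    """
    root = ET.fromstring(cfg_xml)
    result = {}
    for entry in root.findall("entry"):
        raw = entry.text or ""
        paths = [p.strip() for p in raw.split(";") if p.strip()]
        result[entry.get("key")] = [Path(config_dir) / p for p in paths]
    return result


# Sample config; custom/brands.dic is a hypothetical second dictionary.
SAMPLE_CFG = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <entry key="ext_dict">custom/mydict.dic;custom/brands.dic</entry>
    <entry key="ext_stopwords">custom/stopwords.dic</entry>
</properties>
"""

if __name__ == "__main__":
    found = configured_dictionaries(SAMPLE_CFG, "plugins/ik/config")
    for key, paths in found.items():
        for path in paths:
            status = "exists" if path.exists() else "MISSING"
            print(key, path.as_posix(), status)
```

Against a real installation you would pass the bytes of IKAnalyzer.cfg.xml and the actual config directory instead of the sample values.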
My custom dictionary, mydict.dic, contains:

有限公司
有限責任公司
人工智能
許家印
前首富
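If the term list is maintained somewhere else (a spreadsheet, a database), the dictionary file can be generated rather than edited by hand. Here is a rough sketch under a few assumptions: the helper names are my own, the file is written as UTF-8 without a BOM, and blank or duplicate terms are dropped.

```python
from pathlib import Path


def build_dictionary_text(terms) -> str:
    """Return the file content: one term per line, blanks and duplicates dropped."""
    seen = set()
    lines = []
    for term in terms:
        term = term.strip()
        if term and term not in seen:
            seen.add(term)
            lines.append(term)
    return "\n".join(lines) + "\n"


def write_dictionary(path: str, terms) -> None:
    """Write the dictionary as UTF-8 (no BOM), creating parent directories."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(build_dictionary_text(terms), encoding="utf-8")
```

For the dictionary above, the call would look like `write_dictionary("plugins/ik/config/custom/mydict.dic", ["有限公司", "有限責任公司", "人工智能", "許家印", "前首富"])`.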

Step 3: Restart ES and verify

POST http://localhost:9200/_analyze
{
  "analyzer": "ik_smart",
  "text": "中國前首富許家印"
}
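The same check can be scripted. Below is a minimal sketch using only the Python standard library. It assumes ES is reachable at http://localhost:9200 without authentication, as in the request above. The function names are illustrative.

```python
import json
import urllib.request


def build_analyze_request(text: str, analyzer: str = "ik_smart",
                          base_url: str = "http://localhost:9200") -> urllib.request.Request:
    """Build the POST /_analyze request without sending it."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(text: str, analyzer: str = "ik_smart") -> list:
    """Send the request and return the token strings in order."""
    with urllib.request.urlopen(build_analyze_request(text, analyzer)) as resp:
        payload = json.load(resp)
    return [t["token"] for t in payload["tokens"]]
```

Calling `analyze("中國前首富許家印")` against a running node returns the token strings from the response.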

Result:

{
  "tokens": [
    {"token": "中國", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 0},
    {"token": "前首富", "start_offset": 2, "end_offset": 5, "type": "CN_WORD", "position": 1},
    {"token": "許家印", "start_offset": 5, "end_offset": 8, "type": "CN_WORD", "position": 2}
  ]
}

After emptying mydict.dic and restarting ES, so that no custom dictionary is in effect, the same request returns:

{
  "tokens": [
    {"token": "中國", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 0},
    {"token": "前", "start_offset": 2, "end_offset": 3, "type": "CN_CHAR", "position": 1},
    {"token": "首富", "start_offset": 3, "end_offset": 5, "type": "CN_WORD", "position": 2},
    {"token": "許", "start_offset": 5, "end_offset": 6, "type": "CN_CHAR", "position": 3},
    {"token": "家", "start_offset": 6, "end_offset": 7, "type": "CN_CHAR", "position": 4},
    {"token": "印", "start_offset": 7, "end_offset": 8, "type": "CN_CHAR", "position": 5}
  ]
}
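Comparing two responses by eye gets tedious once the text is longer. One way to see what the custom dictionary changed is to list the tokens that appear only when it is loaded. This is a small sketch with helper names of my own. The sample data is the two responses above, reduced to the "token" field.

```python
def token_strings(response: dict) -> list:
    """Extract the token strings from an _analyze response, in order."""
    return [t["token"] for t in response["tokens"]]


def terms_from_dictionary(with_dict: dict, without_dict: dict) -> list:
    """Tokens that appear only when the custom dictionary is loaded."""
    baseline = set(token_strings(without_dict))
    return [tok for tok in token_strings(with_dict) if tok not in baseline]


# The two responses shown above, abbreviated to the "token" field.
WITH_DICT = {"tokens": [{"token": "中國"}, {"token": "前首富"}, {"token": "許家印"}]}
WITHOUT_DICT = {"tokens": [{"token": "中國"}, {"token": "前"}, {"token": "首富"},
                           {"token": "許"}, {"token": "家"}, {"token": "印"}]}

if __name__ == "__main__":
    print(terms_from_dictionary(WITH_DICT, WITHOUT_DICT))
```

For this input the two dictionary entries "前首富" and "許家印" are reported, while "中國" is not, since it comes out the same in both responses.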

----------------------------------------------------------------------

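In Elasticsearch, custom dictionaries and stop words are two configurations with opposite purposes: a custom dictionary adds terms so that tokenization is more precise, while a stop-word list removes tokens that carry little information. The key differences and use cases follow.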

Key differences
Example comparison

Scenario: analyzing the text "我愛自然語言處理" (I love natural language processing)
Custom dictionary configuration:

<entry key="ext_dict">custom/nlp.dic</entry>

Contents of nlp.dic:

自然語言處理

Tokenization result:

["我", "愛", "自然語言處理"]  // "自然語言處理" is kept as a single token

Stop-word configuration:

<entry key="ext_stopwords">stopwords.dic</entry>

Contents of stopwords.dic:

我
的
了

Tokenization result:

["愛", "自然", "語言", "處理"]  // "我" is filtered out
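To make the contrast concrete, here is a toy model of the two settings. It is not IK's actual algorithm: it uses plain greedy longest-match segmentation, and the base word list ("自然", "語言", "處理") is an assumption standing in for the default dictionary. It shows that dictionary entries decide where tokens are cut, while stop words only remove tokens after the cut.

```python
def segment(text, dictionary, stopwords=frozenset()):
    """Greedy longest-match segmentation followed by stop-word filtering."""
    max_len = max((len(w) for w in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    # Stop words never change the cut points; they only drop tokens.
    return [t for t in tokens if t not in stopwords]


# Assumed stand-in for the default word list (not IK's real dictionary).
BASE_WORDS = {"自然", "語言", "處理"}
TEXT = "我愛自然語言處理"

if __name__ == "__main__":
    # Custom dictionary: the new entry changes the segmentation.
    print(segment(TEXT, BASE_WORDS | {"自然語言處理"}))
    # Stop words: same segmentation as the base, minus filtered tokens.
    print(segment(TEXT, BASE_WORDS, stopwords={"我", "的", "了"}))
```

Under these assumptions the first call keeps "自然語言處理" whole, and the second call drops "我" while leaving the rest split as before, matching the two results above.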

Use cases
