ElasticSearch Analyzers, Introduction and Testing: Standard, English, Chinese, and IK
- ElasticSearch Analyzers: Introduction and Testing
- 1. Standard Analyzer
- 2. English Analyzer
- 3. Chinese Analyzer
- 4. IK Analyzer
- Official Resources
- Summary
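This article uses ElasticSearch 7.17.9, chosen to match spring-boot-starter-parent 2.7.9.
ElasticSearch Analyzers: Introduction and Testing
ElasticSearch ships with several built-in analyzers for analyzing and tokenizing text. The analyzer is the core of text analysis: it determines how an input string is broken into individual terms (tokens). Different analyzers suit different languages and scenarios, such as Chinese or English. This article introduces the commonly used analyzers and how they apply.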
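To make the idea of "breaking a string into tokens" concrete, here is a minimal Python sketch of an analysis pipeline: a tokenizer followed by a chain of token filters. It is only an illustration of the concept. The function names (`simple_tokenizer`, `lowercase_filter`, `analyze`) are made up for this example, and the regex tokenizer is a rough stand-in, not the Unicode segmentation that ElasticSearch actually uses.

```python
import re
from typing import Callable, Iterable, List

# A token filter takes a list of tokens and returns a new list.
TokenFilter = Callable[[List[str]], List[str]]


def simple_tokenizer(text: str) -> List[str]:
    """Split on runs of word characters (a rough stand-in for a real tokenizer)."""
    return re.findall(r"\w+", text)


def lowercase_filter(tokens: List[str]) -> List[str]:
    """Lowercase every token."""
    return [t.lower() for t in tokens]


def analyze(
    text: str,
    tokenizer: Callable[[str], List[str]],
    filters: Iterable[TokenFilter],
) -> List[str]:
    """Run the tokenizer, then apply each token filter in order."""
    tokens = tokenizer(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


print(analyze("The quick brown fox", simple_tokenizer, [lowercase_filter]))
# prints ['the', 'quick', 'brown', 'fox']
```

Swapping the tokenizer or adding filters changes the resulting tokens, which is essentially what distinguishes the analyzers covered in the following sections.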
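1. Standard Analyzer
- Function: `standard` is ElasticSearch's default analyzer. It splits text according to the Unicode text segmentation standard, so it works for many languages. It drops most punctuation and lowercases every token. Stop-word removal is supported but disabled by default, which is why "the" is kept in the example output.
- Use: suits most general-purpose scenarios, especially mixed-language text or text with no special tokenization needs.
- Tokenization example:
  - Input: "The quick brown fox"
  - Output: ["the", "quick", "brown", "fox"]
Test by calling the analyze API from Dev Tools, the debugging console in Kibana (ElasticSearch's visual interface):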
# `standard` is the default ElasticSearch analyzer: Unicode-based tokenization, punctuation dropped, tokens lowercased. Stop words are not removed by default.
POST /_analyze
{
  "analyzer": "standard",
  "text": "The quick brown fox"
}
Result:
#! Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See https://www.elastic.co/guide/en/elasticsearch/reference/7.17/security-minimal-setup.html to enable security.
{"tokens" : [{"token" : "the","start_offset" : 0,"end_offset" : 3,"type" : "<ALPHANUM>","position" : 0},{"token" : "quick","start_offset" : 4,"end_offset" : 9,"type" : "<ALPHANUM>","position" : 1},{"token" : "brown","start_offset" : 10,"end_offset" : 15,"type" : "<ALPHANUM>","position" : 2},{"token" : "fox","start_offset" : 16,"end_offset" : 19,"type" : "<ALPHANUM>","position" : 3}]
}
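The same request can also be issued outside Kibana. The sketch below builds the `POST /_analyze` call with Python's standard library and pulls the token strings out of a response. The base URL `http://localhost:9200` and the helper names are assumptions for illustration, and the sample response is a shortened copy of the result shown above. Actually sending the request needs a reachable cluster, so that part is left as a comment.

```python
import json
import urllib.request


def build_analyze_request(
    analyzer: str,
    text: str,
    base_url: str = "http://localhost:9200",  # assumed local cluster address
) -> urllib.request.Request:
    """Build (but do not send) a POST /_analyze request."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_tokens(response_body: dict) -> list:
    """Return only the token strings from an _analyze response."""
    return [entry["token"] for entry in response_body["tokens"]]


# Shortened copy of the response shown above (first two tokens only).
SAMPLE_RESPONSE = {
    "tokens": [
        {"token": "the", "start_offset": 0, "end_offset": 3,
         "type": "<ALPHANUM>", "position": 0},
        {"token": "quick", "start_offset": 4, "end_offset": 9,
         "type": "<ALPHANUM>", "position": 1},
    ]
}

request = build_analyze_request("standard", "The quick brown fox")
print(request.get_method(), request.full_url)
print(extract_tokens(SAMPLE_RESPONSE))

# With a running cluster, the request could be sent like this:
# with urllib.request.urlopen(request) as response:
#     print(extract_tokens(json.load(response)))
```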
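2. English Analyzer
- Function: the `english` analyzer is built for English text. Beyond basic tokenization, it lowercases all tokens, removes English stop words, and applies stemming.
- Use: suits English text analysis, especially English search engines or English data processing.
- Tokenization example:
  - Input: "The quick brown fox"
  - Output: ["quick", "brown", "fox"] ("the" is removed as a stop word)
Test by calling the analyze API from Dev Tools, the debugging console in Kibana (ElasticSearch's visual interface):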
# The `english` analyzer is built for English text: besides basic tokenization it lowercases tokens, removes stop words, and applies stemming.
POST /_analyze
{
  "analyzer": "english",
  "text": "The quick brown fox"
}
Result:
#! Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See https://www.elastic.co/guide/en/elasticsearch/reference/7.17/security-minimal-setup.html to enable security.
{"tokens" : [{"token" : "quick","start_offset" : 4,"end_offset" : 9,"type" : "<ALPHANUM>","position" : 1},{"token" : "brown","start_offset" : 10,"end_offset" : 15,"type" : "<ALPHANUM>","position" : 2},{"token" : "fox","start_offset" : 16,"end_offset" : 19,"type" : "<ALPHANUM>","position" : 3}]
}
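Note that in the result above the first token, "quick", has position 1 rather than 0: the removed stop word still occupies position 0. The following Python sketch mimics that behaviour. It is a simplification under stated assumptions: `STOP_WORDS` is a small illustrative subset rather than the analyzer's full list, the regex is a stand-in for the real tokenizer, and stemming is left out entirely.

```python
import re

# Small illustrative subset of English stop words (an assumption,
# not the full list used by the `english` analyzer).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in"}


def analyze_with_stop_words(text):
    """Tokenize, lowercase, and drop stop words while keeping position gaps."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text)):
        term = match.group().lower()
        if term in STOP_WORDS:
            # The token is dropped, but its position number stays consumed.
            continue
        tokens.append({
            "token": term,
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        })
    return tokens


for token in analyze_with_stop_words("The quick brown fox"):
    print(token)
# {'token': 'quick', 'start_offset': 4, 'end_offset': 9, 'position': 1}
# {'token': 'brown', 'start_offset': 10, 'end_offset': 15, 'position': 2}
# {'token': 'fox', 'start_offset': 16, 'end_offset': 19, 'position': 3}
```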
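3. Chinese Analyzer
- Function: the `chinese` analyzer is nominally aimed at Chinese text. In this test, however, it does not perform dictionary-based word segmentation: it emits every Chinese character as a separate token.
- Use: intended for Chinese text processing such as Chinese search engines and Chinese corpora, but in practice it parses Chinese poorly.
- Tokenization example:
  - Input: "今天天氣很好"
  - Expected output: ["今天", "天氣", "很", "好"]
  - Actual output: ["今", "天", "天", "氣", "很", "好"]
Test by calling the analyze API from Dev Tools, the debugging console in Kibana (ElasticSearch's visual interface):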
# The `chinese` analyzer is nominally for Chinese text; in this test it splits the text into single characters rather than words.
POST /_analyze
{
  "analyzer": "chinese",
  "text": "今天天氣很好"
}
Result:
#! Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See https://www.elastic.co/guide/en/elasticsearch/reference/7.17/security-minimal-setup.html to enable security.
{"tokens" : [{"token" : "今","start_offset" : 0,"end_offset" : 1,"type" : "<IDEOGRAPHIC>","position" : 0},{"token" : "天","start_offset" : 1,"end_offset" : 2,"type" : "<IDEOGRAPHIC>","position" : 1},{"token" : "天","start_offset" : 2,"end_offset" : 3,"type" : "<IDEOGRAPHIC>","position" : 2},{"token" : "氣","start_offset" : 3,"end_offset" : 4,"type" : "<IDEOGRAPHIC>","position" : 3},{"token" : "很","start_offset" : 4,"end_offset" : 5,"type" : "<IDEOGRAPHIC>","position" : 4},{"token" : "好","start_offset" : 5,"end_offset" : 6,"type" : "<IDEOGRAPHIC>","position" : 5}]
}
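The gap between the expected and the actual output can be illustrated with a toy comparison in Python: splitting per character versus forward maximum matching against a word dictionary. This is a sketch for intuition only. `TOY_DICTIONARY` is a hypothetical two-word dictionary invented for this example, and neither function claims to be what ElasticSearch runs internally.

```python
def split_per_character(text):
    """Emit every non-space character as its own token."""
    return [ch for ch in text if not ch.isspace()]


def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation: take the longest dictionary word
    starting at the current index, falling back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until there is a hit.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy dictionary, just enough for this sentence.
TOY_DICTIONARY = {"今天", "天氣"}
text = "今天天氣很好"

print(split_per_character(text))
# prints ['今', '天', '天', '氣', '很', '好']
print(forward_max_match(text, TOY_DICTIONARY))
# prints ['今天', '天氣', '很', '好']
```

Word-level output like the second line requires a dictionary, which is what a dedicated Chinese analyzer plugin supplies.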
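4. IK Analyzer
- Official resource: IK Analyzer GitHub page
- Function: IK Analyzer is an open-source Chinese analyzer built specifically for Chinese text. It combines several Chinese word-segmentation techniques and supports both coarse-grained and fine-grained segmentation.
- Installation: it must be installed as an ElasticSearch plugin. It offers two segmentation strategies: smart mode (`ik_smart`) and fine-grained mode (`ik_max_word`).
- Tokenization example:
  - Input: "今天天氣不錯,適合出游"
  - ik_smart (fewest splits): ["今天天氣", "不錯", "適合", "出游"]
  - ik_max_word (finest splits): ["今天天氣", "今天", "天天", "天氣", "不錯", "適合", "合出", "出游"]
- Extension dictionaries: custom extension dictionaries are supported, so users can add specific words, industry terms, internet slang, and so on. (Related topics: installing the IK analyzer; configuring IK extension dictionaries, covering both the extension word dictionary and the extension stop-word dictionary.)
Test by calling the analyze API from Dev Tools, the debugging console in Kibana (ElasticSearch's visual interface):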
# `IK Analyzer`, ik_smart mode (fewest splits).
POST /_analyze
{
  "analyzer": "ik_smart",
  "text": "今天天氣不錯,適合出游"
}
Result:
#! Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See https://www.elastic.co/guide/en/elasticsearch/reference/7.17/security-minimal-setup.html to enable security.
{"tokens" : [{"token" : "今天天氣","start_offset" : 0,"end_offset" : 4,"type" : "CN_WORD","position" : 0},{"token" : "不錯","start_offset" : 4,"end_offset" : 6,"type" : "CN_WORD","position" : 1},{"token" : "適合","start_offset" : 7,"end_offset" : 9,"type" : "CN_WORD","position" : 2},{"token" : "出游","start_offset" : 9,"end_offset" : 11,"type" : "CN_WORD","position" : 3}]
}
Run the same test from Kibana Dev Tools with the `ik_max_word` mode:
# `IK Analyzer`, ik_max_word mode (finest splits).
POST /_analyze
{
  "analyzer": "ik_max_word",
  "text": "今天天氣不錯,適合出游"
}
Result:
#! Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See https://www.elastic.co/guide/en/elasticsearch/reference/7.17/security-minimal-setup.html to enable security.
{"tokens" : [{"token" : "今天天氣","start_offset" : 0,"end_offset" : 4,"type" : "CN_WORD","position" : 0},{"token" : "今天","start_offset" : 0,"end_offset" : 2,"type" : "CN_WORD","position" : 1},{"token" : "天天","start_offset" : 1,"end_offset" : 3,"type" : "CN_WORD","position" : 2},{"token" : "天氣","start_offset" : 2,"end_offset" : 4,"type" : "CN_WORD","position" : 3},{"token" : "不錯","start_offset" : 4,"end_offset" : 6,"type" : "CN_WORD","position" : 4},{"token" : "適合","start_offset" : 7,"end_offset" : 9,"type" : "CN_WORD","position" : 5},{"token" : "合出","start_offset" : 8,"end_offset" : 10,"type" : "CN_WORD","position" : 6},{"token" : "出游","start_offset" : 9,"end_offset" : 11,"type" : "CN_WORD","position" : 7}]
}
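The overlapping tokens in the finest-grained result can be understood as "every dictionary word found anywhere in the text". The Python sketch below reproduces the token list and offsets with a brute-force enumeration. It is only an intuition aid, not IK's actual algorithm, and `TOY_DICTIONARY` is a hypothetical dictionary containing just the words that appear in the result.

```python
def enumerate_dictionary_words(text, dictionary):
    """List every dictionary word occurring in the text, ordered by start
    offset and, for the same start, longest word first."""
    max_len = max(len(word) for word in dictionary)
    tokens = []
    for start in range(len(text)):
        for size in range(min(max_len, len(text) - start), 0, -1):
            candidate = text[start:start + size]
            if candidate in dictionary:
                tokens.append({
                    "token": candidate,
                    "start_offset": start,
                    "end_offset": start + size,
                })
    return tokens


# Hypothetical toy dictionary: only the words seen in the result above.
TOY_DICTIONARY = {"今天天氣", "今天", "天天", "天氣", "不錯", "適合", "合出", "出游"}
text = "今天天氣不錯,適合出游"

for token in enumerate_dictionary_words(text, TOY_DICTIONARY):
    print(token["token"], token["start_offset"], token["end_offset"])
# 今天天氣 0 4
# 今天 0 2
# 天天 1 3
# 天氣 2 4
# 不錯 4 6
# 適合 7 9
# 合出 8 10
# 出游 9 11
```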
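Official Resources
The official ElasticSearch documentation describes the different tokenizers and analyzers in detail and explains how to configure and use them:
- ElasticSearch analyzer reference (official documentation)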
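Summary
ElasticSearch provides several built-in analyzers that cover different languages and text formats. Choosing the right analyzer is essential for efficient search and analysis. Depending on the application, you can pick an analyzer such as `standard`, `chinese`, or `english`, or create a custom analyzer for specific text-analysis needs. If you have special requirements, look further into the analyzers' configuration options and extension mechanisms.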
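As a pointer for the custom-analyzer route, the following Python sketch builds an index settings body that defines one custom analyzer from the built-in `standard` tokenizer plus the `lowercase` and `stop` token filters. The analyzer name `my_custom_analyzer` is hypothetical, and the body would be sent when creating an index (for example `PUT /my-index`, also a made-up name). Treat it as a starting point under those assumptions rather than a recommended configuration.

```python
import json

# Settings body for index creation; names marked hypothetical are
# placeholders chosen for this example.
index_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_custom_analyzer": {  # hypothetical analyzer name
                    "type": "custom",
                    "tokenizer": "standard",
                    # Filters run in order: lowercase first, then stop words.
                    "filter": ["lowercase", "stop"],
                }
            }
        }
    }
}

print(json.dumps(index_settings, indent=2))
```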