spaCy Tutorial: A Guide to Core NLP Operations (Advanced)
📌 1. Recognizing numbers followed by a percent sign
import spacy

# Create a blank English pipeline
nlp = spacy.blank("en")

# Process the input text
doc = nlp("In 1990, more than 60% of people in East Asia were in extreme poverty. Now less than 4% are.")

# Iterate over the tokens in the doc
for token in doc:
    if token.like_num:  # does the token resemble a number?
        # Look at the next token, guarding against the end of the doc
        if token.i + 1 < len(doc):
            next_token = doc[token.i + 1]
            if next_token.text == "%":
                print("Percentage found:", token.text)
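With this sentence, the loop should print the two percentages, 60 and 4; "1990" is skipped because the token after it is a comma rather than "%".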
📌 2. Part-of-speech tagging and dependency parsing
import spacy

# Load the small English model
nlp = spacy.load("en_core_web_sm")

# Process the input text
doc = nlp("She ate the pizza")

# Print each token's part-of-speech tag (string label and integer ID)
for token in doc:
    print(token.text, token.pos_, token.pos)

# Print each token's dependency label and the head it depends on
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)
📌 3. Named entity recognition (NER)
# Print the named entities recognized in the doc and their labels
for ent in doc.ents:
    print(ent.text, ent.label_)
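"She ate the pizza" contains no named entities, so the loop above prints nothing; here is a minimal sketch with a sentence that does (the example text is our own choice):

doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. Apple/ORG, U.K./GPE, $1 billion/MONEY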
📌 4. Comparing models (POS tags and dependency labels)
# Assume doc_small and doc_medium are the same text processed by two different models
for i in range(len(doc_small)):
    print("Token:", doc_small[i])
    if doc_small[i].pos_ != doc_medium[i].pos_:
        print("POS tags differ:", doc_small[i].pos_, doc_medium[i].pos_)
    if doc_small[i].dep_ != doc_medium[i].dep_:
        print("Dependency labels differ:", doc_small[i].dep_, doc_medium[i].dep_)
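One way to produce the two docs, assuming both the small and medium English models are installed (the variable names and comparison text here are our own):

import spacy

nlp_small = spacy.load("en_core_web_sm")
nlp_medium = spacy.load("en_core_web_md")

text = "She ate the pizza"
doc_small = nlp_small(text)
doc_medium = nlp_medium(text)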
📌 5. Matching custom text patterns with the Matcher
Example 1: recognizing "buying something" constructions
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# Pattern: the lemma "buy", an optional determiner, then a noun
pattern = [{"LEMMA": "buy"}, {"POS": "DET", "OP": "?"}, {"POS": "NOUN"}]
matcher.add("BUY_ITEM", [pattern])

doc = nlp("I bought a smartphone. Now I'm buying apps.")
matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]
    print("Match:", span.text)
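The match_id returned with each match is the hash of the pattern name, so it can be resolved through the string store (a short sketch over the same doc):

for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], "->", doc[start:end].text)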
Example 2: matching "love + noun"
pattern = [{"LEMMA": "love", "POS": "VERB"}, {"POS": "NOUN"}]
matcher.add("LOVE_PATTERN", [pattern])doc = nlp("I loved vanilla but now I love chocolate more.")
matches = matcher(doc)for match_id, start, end in matches:span = doc[start:end]print("匹配:", span.text)
Example 3: matching version strings like "iOS 10" and "iOS 11"
text = """After the iOS update you won’t notice big changes. Most of iOS 11's layout remains the same as iOS 10."""matcher = Matcher(nlp.vocab)
pattern = [{"TEXT": "COLORS"}, {"IS_DIGIT": True}]
matcher.add("IOS_VERSION", [pattern])doc = nlp(text)
matches = matcher(doc)for match_id, start, end in matches:span = doc[start:end]print("識別出的版本:", span.text)
Example 4: matching "download + proper noun"
text = """I downloaded Fortnite on my laptop. Should I download WinZip too?"""pattern = [{"LEMMA": "download"}, {"POS": "PROPN"}]
matcher.add("DOWNLOAD_PATTERN", [pattern])doc = nlp(text)
matches = matcher(doc)for match_id, start, end in matches:print("下載內容:", doc[start:end].text)
Example 5: matching "adjective + noun"
text = "Features include a beautiful design, smart search, and voice responses."pattern = [{"POS": "ADJ"}, {"POS": "NOUN"}, {"POS": "NOUN", "OP": "?"}]
matcher.add("ADJ_NOUN", [pattern])doc = nlp(text)
matches = matcher(doc)for match_id, start, end in matches:print("形容詞短語:", doc[start:end].text)
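With this sentence the pattern should match spans such as "beautiful design" and "smart search"; the optional second noun also lets it capture an adjective followed by a two-noun compound.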
📌 6. Working with the vocabulary via Vocab and Lexeme
nlp = spacy.blank("en")

# Process a text so the string "hat" is added to the string store
doc = nlp("I bought a hat")

# Convert a string to its hash value
word_hash = nlp.vocab.strings["hat"]
print("Hash of the string 'hat':", word_hash)

# Convert the hash back to the string (only works for strings the store has seen)
word_text = nlp.vocab.strings[word_hash]
print("Reverse lookup:", word_text)

# Look up a lexeme to inspect the word's attributes
lexeme = nlp.vocab["tea"]
print(lexeme.text, lexeme.orth, lexeme.is_alpha)
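Hashes are one-way: asking the string store for a hash it has never seen raises an error (a minimal sketch; the hash value below is arbitrary):

try:
    print(nlp.vocab.strings[3197928453018144401])  # hash of a string this vocab has not seen
except KeyError as err:
    print("Unknown hash:", err)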
📌 7. Creating a Doc and Span by hand
In spaCy, the Doc is the central object for working with text. Instead of passing a string to nlp(), we can construct a document directly with the Doc class.
import spacy
from spacy.tokens import Doc, Span

nlp = spacy.blank("en")  # create a blank English pipeline

tokens = ["Hello", "world", "!"]
spaces = [True, False, False]  # whether each token is followed by a space

doc = Doc(nlp.vocab, words=tokens, spaces=spaces)
print("Constructed doc:", doc)
We can then use Span to create an entity span over the doc and give it a label:
span = Span(doc, 0, 2, label="GREETING")  # covers "Hello world"
doc.ents = [span]  # register the span as the doc's entities
print("Named entities:", doc.ents)
🛠 8. Inspecting and modifying pipeline components
A spaCy model is a pipeline made up of multiple components (such as the tagger and the entity recognizer).
print("管道組件名稱:", nlp.pipe_names)
print("組件詳細信息:", nlp.pipeline)
You can also add custom components to the pipeline, for example:
from spacy.language import Language

@Language.component("length_logger")
def log_doc_length(doc):
    print(f"Doc length: {len(doc)}")
    return doc

nlp.add_pipe("length_logger", first=True)  # insert as the first component
print("Pipeline after the change:", nlp.pipe_names)

doc = nlp("A sample sentence.")
🐛 9. A custom entity recognizer (based on lemma and part of speech)
We use the Matcher to find specific words ("moth", "fly", "mosquito") and mark each match as an entity with a Span:
import spacy
from spacy.language import Language
from spacy.matcher import Matcher
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
text = "Qantas flies all sorts of cargo! That includes moths, mosquitos, and even the occasional fly."

matcher = Matcher(nlp.vocab)
# One rule per insect lemma
for insect in ["moth", "fly", "mosquito"]:
    matcher.add("INSECT", [[{"LEMMA": insect, "POS": "NOUN"}]])

@Language.component("insect_finder")
def mark_insects(doc):
    matches = matcher(doc)
    doc.ents = [Span(doc, start, end, label="INSECT") for _, start, end in matches]
    return doc

nlp.add_pipe("insect_finder", after="ner")  # run right after the built-in NER
doc = nlp(text)
print("Insect entities:", [(ent.text, ent.label_) for ent in doc.ents])
🔍 10. Text vectors and similarity
spaCy's en_core_web_md and en_core_web_lg models ship with word vectors, which let us compare the similarity of words, spans, and documents:
nlp = spacy.load("en_core_web_md")doc1 = nlp("I like fast food")
doc2 = nlp("I like pizza")print("句子相似度:", doc1.similarity(doc2))doc = nlp("I like pizza and pasta")
print("詞語相似度(pizza vs pasta):", doc[2].similarity(doc[4]))
📚 11. Processing texts in batches with nlp.pipe
If you need to process a large number of texts, nlp.pipe() is the more efficient choice:
texts = ["First example!", "Second example."]
for doc in nlp.pipe(texts):
    print("Result:", doc)
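For large corpora you can also tune how many texts are buffered per batch (a sketch; 50 is an arbitrary value):

for doc in nlp.pipe(texts, batch_size=50):
    print("Result:", doc)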
🧩 12. Adding custom attributes to a Doc (context extensions)
With Doc.set_extension() you can register custom fields such as id or page_number:
from spacy.tokens import Doc

data = [
    ("This is a text", {"id": 1, "page_number": 15}),
    ("And another text", {"id": 2, "page_number": 16}),
]

# Register each extension only once; registering it again raises an error
try:
    Doc.set_extension("id", default=None)
    Doc.set_extension("page_number", default=None)
except ValueError:
    pass

# Attach the context to each doc
for doc, context in nlp.pipe(data, as_tuples=True):
    doc._.id = context["id"]
    doc._.page_number = context["page_number"]
    print(f"{doc.text} | ID: {doc._.id} | Page: {doc._.page_number}")
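Extensions can also be computed on the fly with a getter instead of a stored default (a sketch; the attribute name token_count is our own):

Doc.set_extension("token_count", getter=lambda doc: len(doc), force=True)
doc = nlp("How many tokens are here?")
print(doc._.token_count)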
🔧 13. Controlling which pipeline components run (select_pipes)
You can temporarily disable components to speed up processing or skip analyses you don't need:
text = """Chick-fil-A is an American fast food restaurant chain headquartered in
College Park, Georgia."""with nlp.select_pipes(disable=["tagger", "parser"]): # 臨時關閉組件doc = nlp(text)print("命名實體:", doc.ents)print("詞性標注(關閉 tagger 后):", [token.tag_ for token in doc])
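select_pipes can also be given the components to keep rather than the ones to drop (a sketch; tok2vec is kept here because other components in the trained pipelines may depend on its output):

with nlp.select_pipes(enable=["tok2vec", "ner"]):
    doc = nlp(text)
    print("Named entities:", doc.ents)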