spaCy Tutorial: A Guide to Core NLP Operations (Advanced)
📌 1. Recognizing numbers followed by a percent sign
import spacy

# Create a blank English pipeline
nlp = spacy.blank("en")

# Process the input text
doc = nlp("In 1990, more than 60% of people in East Asia were in extreme poverty. Now less than 4% are.")

# Iterate over the tokens in the doc
for token in doc:
    if token.like_num:  # does the token resemble a number?
        # Look at the next token, guarding against the end of the doc
        if token.i + 1 < len(doc):
            next_token = doc[token.i + 1]
            if next_token.text == "%":
                print("Percentage found:", token.text)
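With this sentence, the loop should print the two percentages, 60 and 4; "1990" is skipped because the token after it is a comma rather than "%".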
📌 2. Part-of-speech tagging and dependency parsing
import spacy

# Load the small English model
nlp = spacy.load("en_core_web_sm")

# Process the input text
doc = nlp("She ate the pizza")

# Print each token's part-of-speech tag (string label and integer ID)
for token in doc:
    print(token.text, token.pos_, token.pos)

# Print each token's dependency label and the head it depends on
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)
📌 3. Named entity recognition (NER)
# Print the named entities recognized in the doc and their labels
for ent in doc.ents:
    print(ent.text, ent.label_)
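"She ate the pizza" contains no named entities, so the loop above prints nothing; here is a minimal sketch with a sentence that does (the example text is our own choice):

doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. Apple/ORG, U.K./GPE, $1 billion/MONEY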
📌 4. Comparing models (POS tags and dependency labels)
# Assume doc_small and doc_medium are the same text processed by two different models
for i in range(len(doc_small)):
    print("Token:", doc_small[i])
    if doc_small[i].pos_ != doc_medium[i].pos_:
        print("POS tags differ:", doc_small[i].pos_, doc_medium[i].pos_)
    if doc_small[i].dep_ != doc_medium[i].dep_:
        print("Dependency labels differ:", doc_small[i].dep_, doc_medium[i].dep_)
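One way to produce the two docs, assuming both the small and medium English models are installed (the variable names and comparison text here are our own):

import spacy

nlp_small = spacy.load("en_core_web_sm")
nlp_medium = spacy.load("en_core_web_md")

text = "She ate the pizza"
doc_small = nlp_small(text)
doc_medium = nlp_medium(text)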
📌 5. Matching custom text patterns with the Matcher
Example 1: recognizing "buying something" constructions
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# Pattern: the lemma "buy", an optional determiner, then a noun
pattern = [{"LEMMA": "buy"}, {"POS": "DET", "OP": "?"}, {"POS": "NOUN"}]
matcher.add("BUY_ITEM", [pattern])

doc = nlp("I bought a smartphone. Now I'm buying apps.")
matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]
    print("Match:", span.text)
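The match_id returned with each match is the hash of the pattern name, so it can be resolved through the string store (a short sketch over the same doc):

for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], "->", doc[start:end].text)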
Example 2: matching "love + noun"
pattern = [{"LEMMA": "love", "POS": "VERB"}, {"POS": "NOUN"}]
matcher.add("LOVE_PATTERN", [pattern])doc = nlp("I loved vanilla but now I love chocolate more.")
matches = matcher(doc)for match_id, start, end in matches:span = doc[start:end]print("匹配:", span.text)
Example 3: matching version strings like "iOS 10" and "iOS 11"
text = """After the iOS update you won’t notice big changes. Most of iOS 11's layout remains the same as iOS 10."""matcher = Matcher(nlp.vocab)
pattern = [{"TEXT": "COLORS"}, {"IS_DIGIT": True}]
matcher.add("IOS_VERSION", [pattern])doc = nlp(text)
matches = matcher(doc)for match_id, start, end in matches:span = doc[start:end]print("識別出的版本:", span.text)
Example 4: matching "download + proper noun"
text = """I downloaded Fortnite on my laptop. Should I download WinZip too?"""pattern = [{"LEMMA": "download"}, {"POS": "PROPN"}]
matcher.add("DOWNLOAD_PATTERN", [pattern])doc = nlp(text)
matches = matcher(doc)for match_id, start, end in matches:print("下載內容:", doc[start:end].text)
Example 5: matching "adjective + noun"
text = "Features include a beautiful design, smart search, and voice responses."pattern = [{"POS": "ADJ"}, {"POS": "NOUN"}, {"POS": "NOUN", "OP": "?"}]
matcher.add("ADJ_NOUN", [pattern])doc = nlp(text)
matches = matcher(doc)for match_id, start, end in matches:print("形容詞短語:", doc[start:end].text)
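With this sentence the pattern should match spans such as "beautiful design" and "smart search"; the optional second noun also lets it capture an adjective followed by a two-noun compound.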
📌 6. Working with the vocabulary via Vocab and Lexeme
nlp = spacy.blank("en")

# Process a text so the string "hat" is added to the string store
doc = nlp("I bought a hat")

# Convert a string to its hash value
word_hash = nlp.vocab.strings["hat"]
print("Hash of the string 'hat':", word_hash)

# Convert the hash back to the string (only works for strings the store has seen)
word_text = nlp.vocab.strings[word_hash]
print("Reverse lookup:", word_text)

# Look up a lexeme to inspect the word's attributes
lexeme = nlp.vocab["tea"]
print(lexeme.text, lexeme.orth, lexeme.is_alpha)
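Hashes are one-way: asking the string store for a hash it has never seen raises an error (a minimal sketch; the hash value below is arbitrary):

try:
    print(nlp.vocab.strings[3197928453018144401])  # hash of a string this vocab has not seen
except KeyError as err:
    print("Unknown hash:", err)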
📌 7. Creating a Doc and Span by hand
In spaCy, the Doc is the central object for working with text. Instead of passing a string to nlp(), we can construct a document directly with the Doc class.
import spacy
from spacy.tokens import Doc, Span

nlp = spacy.blank("en")  # create a blank English pipeline

tokens = ["Hello", "world", "!"]
spaces = [True, False, False]  # whether each token is followed by a space

doc = Doc(nlp.vocab, words=tokens, spaces=spaces)
print("Constructed doc:", doc)
We can then use Span to create an entity span over the doc and give it a label:
span = Span(doc, 0, 2, label="GREETING")  # covers "Hello world"
doc.ents = [span]  # register the span as the doc's entities
print("Named entities:", doc.ents)
🛠 8. Inspecting and modifying pipeline components
A spaCy model is a pipeline made up of multiple components (such as the tagger and the entity recognizer).
print("管道組件名稱:", nlp.pipe_names)
print("組件詳細信息:", nlp.pipeline)
You can also add custom components to the pipeline, for example:
from spacy.language import Language

@Language.component("length_logger")
def log_doc_length(doc):
    print(f"Doc length: {len(doc)}")
    return doc

nlp.add_pipe("length_logger", first=True)  # insert as the first component
print("Pipeline after the change:", nlp.pipe_names)

doc = nlp("A sample sentence.")
🐛 9. A custom entity recognizer (based on lemma and part of speech)
We use the Matcher to find specific words ("moth", "fly", "mosquito") and mark each match as an entity with a Span:
import spacy
from spacy.language import Language
from spacy.matcher import Matcher
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
text = "Qantas flies all sorts of cargo! That includes moths, mosquitos, and even the occasional fly."

matcher = Matcher(nlp.vocab)
# One rule per insect lemma
for insect in ["moth", "fly", "mosquito"]:
    matcher.add("INSECT", [[{"LEMMA": insect, "POS": "NOUN"}]])

@Language.component("insect_finder")
def mark_insects(doc):
    matches = matcher(doc)
    doc.ents = [Span(doc, start, end, label="INSECT") for _, start, end in matches]
    return doc

nlp.add_pipe("insect_finder", after="ner")  # run right after the built-in NER
doc = nlp(text)
print("Insect entities:", [(ent.text, ent.label_) for ent in doc.ents])
🔍 10. Text vectors and similarity
spaCy's en_core_web_md and en_core_web_lg models ship with word vectors, which let us compare the similarity of words, spans, and documents:
nlp = spacy.load("en_core_web_md")doc1 = nlp("I like fast food")
doc2 = nlp("I like pizza")print("句子相似度:", doc1.similarity(doc2))doc = nlp("I like pizza and pasta")
print("詞語相似度(pizza vs pasta):", doc[2].similarity(doc[4]))
📚 11. Processing texts in batches with nlp.pipe
If you need to process a large number of texts, nlp.pipe() is the more efficient choice:
texts = ["First example!", "Second example."]
for doc in nlp.pipe(texts):
    print("Result:", doc)
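For large corpora you can also tune how many texts are buffered per batch (a sketch; 50 is an arbitrary value):

for doc in nlp.pipe(texts, batch_size=50):
    print("Result:", doc)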
🧩 12. Adding custom attributes to a Doc (context extensions)
With Doc.set_extension() you can register custom fields such as id or page_number:
from spacy.tokens import Doc

data = [
    ("This is a text", {"id": 1, "page_number": 15}),
    ("And another text", {"id": 2, "page_number": 16}),
]

# Register each extension only once; registering it again raises an error
try:
    Doc.set_extension("id", default=None)
    Doc.set_extension("page_number", default=None)
except ValueError:
    pass

# Attach the context to each doc
for doc, context in nlp.pipe(data, as_tuples=True):
    doc._.id = context["id"]
    doc._.page_number = context["page_number"]
    print(f"{doc.text} | ID: {doc._.id} | Page: {doc._.page_number}")
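Extensions can also be computed on the fly with a getter instead of a stored default (a sketch; the attribute name token_count is our own):

Doc.set_extension("token_count", getter=lambda doc: len(doc), force=True)
doc = nlp("How many tokens are here?")
print(doc._.token_count)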
🔧 13. Controlling which pipeline components run (select_pipes)
You can temporarily disable components to speed up processing or skip analyses you don't need:
text = """Chick-fil-A is an American fast food restaurant chain headquartered in
College Park, Georgia."""with nlp.select_pipes(disable=["tagger", "parser"]): # 臨時關閉組件doc = nlp(text)print("命名實體:", doc.ents)print("詞性標注(關閉 tagger 后):", [token.tag_ for token in doc])
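select_pipes can also be given the components to keep rather than the ones to drop (a sketch; tok2vec is kept here because other components in the trained pipelines may depend on its output):

with nlp.select_pipes(enable=["tok2vec", "ner"]):
    doc = nlp(text)
    print("Named entities:", doc.ents)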