1. Introduction
Python offers rich support for natural language processing libraries. From text processing (tokenizing text and determining its lemmas), through syntactic analysis (parsing text and assigning syntactic roles), up to semantic processing (for example, named entity recognition, sentiment analysis, and document classification), everything is provided by at least one library. So, where do you start?
The goal of this article is to provide an overview of relevant Python libraries for each core NLP task. The libraries are explained with brief descriptions, and concrete code snippets are given for the NLP tasks. Continuing my introductory NLP blog post, this article only covers libraries for the core NLP tasks of text processing, syntactic and semantic analysis, and document semantics. In addition, under the NLP utilities category, libraries for corpus management and datasets are presented.
The following libraries are covered:
- NLTK
- TextBlob
- Spacy
- SciKit Learn
- Gensim
This article originally appeared on the blog admantium.com.
2. Core NLP Tasks
2.1 Text Processing
Tasks: tokenization, lemmatization, stemming
The NLTK library provides a complete toolkit for text processing, including tokenization, stemming, and lemmatization.
from nltk.tokenize import sent_tokenize, word_tokenize

paragraph = '''Artificial intelligence was founded as an academic discipline in 1956, and in the years since it has experienced several waves of optimism, followed by disappointment and the loss of funding (known as an "AI winter"), followed by new approaches, success, and renewed funding. AI research has tried and discarded many different approaches, including simulating the brain, modeling human problem solving, formal logic, large databases of knowledge, and imitating animal behavior. In the first decades of the 21st century, highly mathematical and statistical machine learning has dominated the field, and this technique has proved highly successful, helping to solve many challenging problems throughout industry and academia.'''
sentences = []
for sent in sent_tokenize(paragraph):
    sentences.append(word_tokenize(sent))

sentences[0]
# ['Artificial', 'intelligence', 'was', 'founded', 'as', 'an', 'academic', 'discipline'
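The paragraph above also mentions stemming and lemmatization. As a minimal sketch, assuming the NLTK wordnet corpus has been downloaded, both can be applied to individual words:

from nltk import download as nltk_download
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk_download('wordnet')  # required once for the lemmatizer

# stemming cuts suffixes heuristically; lemmatization maps words to dictionary forms
PorterStemmer().stem('experienced')
# 'experienc'
WordNetLemmatizer().lemmatize('waves')
# 'wave'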
With TextBlob, the same text processing tasks are supported. It differs from NLTK in its higher-level semantic results and its easy-to-use data structures: parsing a sentence already produces rich semantic information.
from textblob import TextBlob

text = '''
Artificial intelligence was founded as an academic discipline in 1956, and in the years since it has experienced several waves of optimism, followed by disappointment and the loss of funding (known as an "AI winter"), followed by new approaches, success, and renewed funding. AI research has tried and discarded many different approaches, including simulating the brain, modeling human problem solving, formal logic, large databases of knowledge, and imitating animal behavior. In the first decades of the 21st century, highly mathematical and statistical machine learning has dominated the field, and this technique has proved highly successful, helping to solve many challenging problems throughout industry and academia.
'''

blob = TextBlob(text)
blob.ngrams()
#[WordList(['Artificial', 'intelligence', 'was']),
# WordList(['intelligence', 'was', 'founded']),
# WordList(['was', 'founded', 'as']),

blob.tokens
# WordList(['Artificial', 'intelligence', 'was', 'founded', 'as', 'an', 'academic', 'discipline', 'in', '1956', ',', 'and', 'in',
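TextBlob exposes lemmatization on its word objects as well. A short sketch, assuming the NLTK wordnet corpus is available:

from textblob import Word

# lemmatize a single word, here with a verb POS hint
Word('founded').lemmatize('v')
# 'found'

# or lemmatize every token of the blob at once
blob.words.lemmatize()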
With the modern NLP library Spacy, text processing is just the first step in a rich pipeline of mostly semantic tasks. Unlike the other libraries, it requires a model for the target language to be loaded first. Recent models are not heuristic, but artificial neural networks, especially transformers, which provide richer abstractions and can be combined better with other models.
import spacy
nlp = spacy.load('en_core_web_lg')

text = '''
Artificial intelligence was founded as an academic discipline in 1956, and in the years since it has experienced several waves of optimism, followed by disappointment and the loss of funding (known as an "AI winter"), followed by new approaches, success, and renewed funding. AI research has tried and discarded many different approaches, including simulating the brain, modeling human problem solving, formal logic, large databases of knowledge, and imitating animal behavior. In the first decades of the 21st century, highly mathematical and statistical machine learning has dominated the field, and this technique has proved highly successful, helping to solve many challenging problems throughout industry and academia.
'''

doc = nlp(text)
tokens = [token for token in doc]

print(tokens)
# [Artificial, intelligence, was, founded, as, an, academic, discipline
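The same pipeline output also covers lemmatization, since every token exposes a lemma_ attribute:

# each token carries its lemma, e.g. 'was' becomes 'be' and 'founded' becomes 'found'
print([(token.text, token.lemma_) for token in doc][:8])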
3. Syntactic Analysis
Tasks: parsing, part-of-speech tagging, noun phrase extraction
Starting with NLTK, all syntactic tasks are supported. Their output is provided as Python-native data structures, and it can always be displayed as simple text output.
from nltk.tokenize import word_tokenize
from nltk import pos_tag, RegexpParser

text = '''
Artificial intelligence was founded as an academic discipline in 1956, and in the years since it has experienced several waves of optimism, followed by disappointment and the loss of funding (known as an "AI winter"), followed by new approaches, success, and renewed funding. AI research has tried and discarded many different approaches, including simulating the brain, modeling human problem solving, formal logic, large databases of knowledge, and imitating animal behavior. In the first decades of the 21st century, highly mathematical and statistical machine learning has dominated the field, and this technique has proved highly successful, helping to solve many challenging problems throughout industry and academia.
'''

pos_tag(word_tokenize(text))
# [('Artificial', 'JJ'),
# ('intelligence', 'NN'),
# ('was', 'VBD'),
# ('founded', 'VBN'),
# ('as', 'IN'),
# ('an', 'DT'),
# ('academic', 'JJ'),
# ('discipline', 'NN'),
# noun chunk parser
# source: https://www.nltk.org/book_1ed/ch07.html
grammar = "NP: {<DT>?<JJ>*<NN>}"
parser = RegexpParser(grammar)
parser.parse(pos_tag(word_tokenize(text)))
#(S
# (NP Artificial/JJ intelligence/NN)
# was/VBD
# founded/VBN
# as/IN
# (NP an/DT academic/JJ discipline/NN)
# in/IN
# 1956/CD
TextBlob provides POS tags immediately when processing a text. With another method, a parse tree containing rich syntactic information is created.
from textblob import TextBlob

text = '''
Artificial intelligence was founded as an academic discipline in 1956, and in the years since it has experienced several waves of optimism, followed by disappointment and the loss of funding (known as an "AI winter"), followed by new approaches, success, and renewed funding. AI research has tried and discarded many different approaches, including simulating the brain, modeling human problem solving, formal logic, large databases of knowledge, and imitating animal behavior. In the first decades of the 21st century, highly mathematical and statistical machine learning has dominated the field, and this technique has proved highly successful, helping to solve many challenging problems throughout industry and academia.
'''
blob = TextBlob(text)

blob.tags
#[('Artificial', 'JJ'),
# ('intelligence', 'NN'),
# ('was', 'VBD'),
# ('founded', 'VBN'),

blob.parse()
# Artificial/JJ/B-NP/O
# intelligence/NN/I-NP/O
# was/VBD/B-VP/O
# founded/VBN/I-VP/O
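TextBlob also handles the noun phrase extraction task of this section directly, via a property of the blob; the exact phrases returned depend on the extractor implementation:

blob.noun_phrases
# e.g. WordList(['artificial intelligence', 'academic discipline', ...])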
The Spacy library uses transformer neural networks to support its syntactic tasks.
import spacy
nlp = spacy.load('en_core_web_lg')

# 'text' is the AI paragraph from the previous snippets
for token in nlp(text):
    print(f'{token.text:<20}{token.pos_:>5}{token.tag_:>5}')

# Artificial    ADJ  JJ
# intelligence  NOUN NN
# was           AUX  VBD
# founded       VERB VBN
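Noun phrase extraction is available in Spacy as well, through the noun_chunks iterator of a parsed document:

# iterate over flat noun phrases, e.g. 'Artificial intelligence', 'an academic discipline'
for chunk in nlp(text).noun_chunks:
    print(chunk.text)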
4. Semantic Analysis
Tasks: named entity recognition, word sense disambiguation, semantic role labeling
Semantic analysis is the area where NLP approaches begin to differ. When using NLTK, the generated syntactic information is looked up in dictionaries to identify, for example, named entities. As a consequence, entities may not be recognized when working with newer texts.
from nltk import download as nltk_download
from nltk.tokenize import word_tokenize
from nltk import pos_tag, ne_chunk

nltk_download('maxent_ne_chunker')
nltk_download('words')
text = '''
As of 2016, only three nations have flown crewed spacecraft: USSR/Russia, USA, and China. The first crewed spacecraft was Vostok 1, which carried Soviet cosmonaut Yuri Gagarin into space in 1961, and completed a full Earth orbit. There were five other crewed missions which used a Vostok spacecraft. The second crewed spacecraft was named Freedom 7, and it performed a sub-orbital spaceflight in 1961 carrying American astronaut Alan Shepard to an altitude of just over 187 kilometers (116 mi). There were five other crewed missions using Mercury spacecraft.
'''

# source: https://www.nltk.org/book_1ed/ch07.html
print(ne_chunk(pos_tag(word_tokenize(text))))
# (S
# As/IN
# of/IN
# [...]
# (ORGANIZATION USA/NNP)
# [...]
# which/WDT
# carried/VBD
# (GPE Soviet/JJ)
# cosmonaut/NN
# (PERSON Yuri/NNP Gagarin/NNP)
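For the word sense disambiguation task listed above, NLTK ships an implementation of the classic Lesk algorithm. A minimal sketch, assuming the wordnet corpus is downloaded; the example sentence and the returned synset are illustrative:

from nltk.wsd import lesk
from nltk.tokenize import word_tokenize

# pick the WordNet sense of 'orbit' that best overlaps with the sentence context
sentence = 'The spacecraft completed a full Earth orbit.'
print(lesk(word_tokenize(sentence), 'orbit'))
# e.g. Synset('orbit.n.01')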
The transformer models used by the Spacy library contain an implicit "timestamp": the time at which they were trained. This determines which texts the model consumed, and therefore which entities the model is able to recognize.
import spacy
nlp = spacy.load('en_core_web_lg')

text = '''
As of 2016, only three nations have flown crewed spacecraft: USSR/Russia, USA, and China. The first crewed spacecraft was Vostok 1, which carried Soviet cosmonaut Yuri Gagarin into space in 1961, and completed a full Earth orbit. There were five other crewed missions which used a Vostok spacecraft. The second crewed spacecraft was named Freedom 7, and it performed a sub-orbital spaceflight in 1961 carrying American astronaut Alan Shepard to an altitude of just over 187 kilometers (116 mi). There were five other crewed missions using Mercury spacecraft.
'''
doc = nlp(text)
for token in doc.ents:
    print(f'{token.text:<25}{token.label_:<15}')
# 2016 DATE
# only three CARDINAL
# USSR GPE
# Russia GPE
# USA GPE
# China GPE
# first ORDINAL
# Vostok 1 PRODUCT
# Soviet NORP
# Yuri Gagarin PERSON
5. Document Semantics
Tasks: text classification, topic modeling, sentiment analysis, toxicity detection
Sentiment analysis is also a task where NLP approaches differ: looking up word meanings in a lexicon versus learned word similarities encoded in word or document vectors.
TextBlob has built-in sentiment analysis that returns the polarity (the overall positive or negative connotation) and the subjectivity (the degree of personal opinion) of a text.
from textblob import TextBlob

text = '''
Artificial intelligence was founded as an academic discipline in 1956, and in the years since it has experienced several waves of optimism, followed by disappointment and the loss of funding (known as an "AI winter"), followed by new approaches, success, and renewed funding. AI research has tried and discarded many different approaches, including simulating the brain, modeling human problem solving, formal logic, large databases of knowledge, and imitating animal behavior. In the first decades of the 21st century, highly mathematical and statistical machine learning has dominated the field, and this technique has proved highly successful, helping to solve many challenging problems throughout industry and academia.
'''

blob = TextBlob(text)
blob.sentiment
# Sentiment(polarity=0.16180290297937355, subjectivity=0.42155589508530683)
Spacy does not include text classification out of the box, but it can be extended as a separate pipeline step. The code below is long and contains several Spacy-internal objects and data structures; a future article will explain this in more detail.
import spacy
from spacy.tokens import DocBin

# 'dataset' and 'get_text' are assumed to be defined by surrounding corpus-handling code

## train single label categorization from multi-label dataset
def convert_single_label(dataset, filename):
    db = DocBin()
    nlp = spacy.load('en_core_web_lg')
    for index, fileid in enumerate(dataset):
        cat_dict = {cat: 0 for cat in dataset.categories()}
        cat_dict[dataset.categories(fileid).pop()] = 1
        doc = nlp(get_text(fileid))
        doc.cats = cat_dict
        db.add(doc)
    db.to_disk(filename)

## load trained model and apply to text
nlp = spacy.load('textcat_multilabel_model/model-best')
text = dataset.raw(42)
doc = nlp(text)
estimated_cats = sorted(doc.cats.items(), key=lambda i: float(i[1]), reverse=True)

print(dataset.categories(42))
# ['orange']

print(estimated_cats)
# [('nzdlr', 0.998894989490509), ('money-supply', 0.9969857335090637), ... ('orange', 0.7344251871109009),
SciKit Learn is a general-purpose machine learning library that provides many clustering and classification algorithms. It works on numeric input only, so the text needs to be vectorized, for example with Gensim's pretrained word vectors, or with the built-in feature vectorizers. To give just one example, here is a snippet that converts raw text into word vectors and then applies KMeans clustering to them.
from sklearn.feature_extraction import DictVectorizer
from sklearn.cluster import KMeans

# 'dataset' is assumed to hold one feature dict per document
vectorizer = DictVectorizer(sparse=False)
x_train = vectorizer.fit_transform(dataset['train'])
kmeans = KMeans(n_clusters=8, random_state=0, n_init="auto").fit(x_train)

print(kmeans.labels_.shape)
# (8551, )

print(kmeans.labels_)
# [4 4 4 ... 6 6 6]
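Because the snippet above relies on an already prepared dataset of feature dicts, here is a self-contained variant of the same idea using the built-in TfidfVectorizer mentioned in the paragraph; the dataset choice and parameters are illustrative:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# vectorize raw newsgroup posts with TF-IDF, then cluster the sparse vectors
raw_texts = fetch_20newsgroups(subset='train').data
vectors = TfidfVectorizer(max_features=5000, stop_words='english').fit_transform(raw_texts)
kmeans = KMeans(n_clusters=20, random_state=0, n_init='auto').fit(vectors)

print(kmeans.labels_[:10])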
Finally, Gensim is a library specialized in topic classification of large-scale corpora. The following snippet loads a built-in dataset, vectorizes the tokens of each document, and runs the topic modeling algorithm LDA. On a CPU alone, this can take up to 15 minutes.
# source: https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.html,
#         https://radimrehurek.com/gensim/auto_examples/howtos/run_downloader_api.html
import logging
import gensim.downloader as api
from gensim.corpora import Dictionary
from gensim.models import LdaModel

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

docs = api.load('text8')
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

_ = dictionary[0]  # accessing one item initializes the dictionary's id2token mapping
id2word = dictionary.id2token

# Define and train the model
model = LdaModel(
    corpus=corpus,
    id2word=id2word,
    chunksize=2000,
    alpha='auto',
    eta='auto',
    iterations=400,
    num_topics=10,
    passes=20,
    eval_every=None
)

print(model.num_topics)
# 10

print(model.top_topics(corpus)[6])
# ([(4.201401e-06, 'done'),
# (4.1998064e-06, 'zero'),
# (4.1478743e-06, 'eight'),
# (4.1257395e-06, 'one'),
# (4.1166854e-06, 'two'),
# (4.085097e-06, 'six'),
# (4.080696e-06, 'language'),
# (4.050306e-06, 'system'),
# (4.041121e-06, 'network'),
# (4.0385708e-06, 'internet'),
# (4.0379923e-06, 'protocol'),
# (4.035399e-06, 'open'),
# (4.033435e-06, 'three'),
# (4.0334166e-06, 'interface'),
# (4.030141e-06, 'four'),
# (4.0283044e-06, 'seven'),
# (4.0163245e-06, 'no'),
# (4.0149207e-06, 'i'),
# (4.0072555e-06, 'object'),
# (4.007036e-06, 'programming')],
6. Utilities
6.1 Corpus Management
NLTK provides corpus readers for plain text, Markdown, and even Twitter feeds in JSON format. A reader is created by passing a file path, and it then provides basic statistics as well as iterators for processing all the files it finds.
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

corpus = PlaintextCorpusReader('wikipedia_articles', r'.*\.txt')

print(corpus.fileids())
# ['AI_alignment.txt', 'AI_safety.txt', 'Artificial_intelligence.txt', 'Machine_learning.txt', ...]

print(len(corpus.sents()))
# 47289

print(len(corpus.words()))
# 1146248
Gensim processes text files to form a word-vector representation of each document, which can then be used for its main use case, topic classification. The documents need to be handled by an iterator that wraps directory traversal, and the corpus is then built as a collection of word vectors. However, this corpus representation is difficult to externalize and reuse with other libraries. The following snippet is an excerpt from above: it loads the dataset included in Gensim and then creates the word-vector-based representation.
import gensim.downloader as api
from gensim.corpora import Dictionary

docs = api.load('text8')
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

print('Number of unique tokens: %d' % len(dictionary))
# Number of unique tokens: 253854

print('Number of documents: %d' % len(corpus))
# Number of documents: 1701
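One way to soften the externalization problem mentioned above is to persist the bag-of-words corpus in the Matrix Market exchange format via Gensim's MmCorpus; the filename is illustrative:

from gensim.corpora import MmCorpus

# serialize the corpus to disk and stream it back later
MmCorpus.serialize('text8_bow.mm', corpus)
corpus_loaded = MmCorpus('text8_bow.mm')

print(len(corpus_loaded))
# 1701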
7. Datasets
NLTK provides several ready-to-use datasets, for example Reuters news excerpts, European Parliament proceedings, and open books from the Gutenberg collection. See the complete list of datasets and models.
from nltk.corpus import reuters

print(len(reuters.fileids()))
# 10788

print(reuters.categories()[:43])
# ['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa', 'coconut', 'coconut-oil', 'coffee', 'copper', 'copra-cake', 'corn', 'cotton', 'cotton-oil', 'cpi', 'cpu', 'crude', 'dfl', 'dlr', 'dmk', 'earn', 'fuel', 'gas', 'gnp', 'gold', 'grain', 'groundnut', 'groundnut-oil', 'heat', 'hog', 'housing', 'income', 'instal-debt', 'interest', 'ipi', 'iron-steel', 'jet', 'jobs', 'l-cattle', 'lead', 'lei', 'lin-oil']
SciKit Learn includes datasets from newsgroups, real estate, and even IT intrusion detection; see the complete list. Here is a quick example using the newsgroups dataset.
from sklearn.datasets import fetch_20newsgroups
dataset = fetch_20newsgroups()
dataset.data[1]
# "From: guykuo@carson.u.washington.edu (Guy Kuo)\nSubject: SI Clock Poll - Final Call\nSummary: Final call for SI clock reports\nKeywords: SI,acceleration,clock,upgrade\nArticle-I.D.: shelley.1qvfo9INNc3s\nOrganization: University of Washington\nLines: 11\nNNTP-Posting-Host: carson.u.washington.edu\n\nA fair number of brave souls who upgraded their SI clock oscillator have\nshared their experiences for this poll.
8. Conclusion
For NLP projects in Python, there is an abundance of library choices. To help you get started, this article provided an NLP task-driven overview with compact library explanations and code snippets. Starting with text processing, you saw how to create tokens and lemmas from a text. Continuing with syntactic analysis, you learned how to generate part-of-speech tags and the grammatical structure of sentences. Arriving at semantics, recognizing named entities in a text, as well as text sentiment, can also be solved in a few lines of code. For the additional tasks of corpus management and accessing pre-structured datasets, you also saw library examples. In summary, this article should give you a good start into your next NLP project when working on core NLP tasks.
The evolution of NLP methods toward using neural networks, and especially large language models, has triggered the creation and adaptation of a whole new set of libraries, starting with text vectorization, neural network definition and training, and the application of language generation models, among other things. These models cover all advanced NLP tasks and will be covered in future articles.
Sebastian