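1. Introduction
Text summarization is the process of generating a short, fluent and, most importantly, accurate summary of a longer text document. The main idea behind automatic text summarization is to find a small subset of the most important information in the whole collection and present it in a human-readable format. As the volume of online text grows, automatic summarization methods can be very useful, because they let readers obtain the useful information in a short time.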
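2. Why automatic text summarization?
- Summaries reduce reading time.
- When researching documents, summaries make the selection process easier.
- Automatic summarization improves the effectiveness of indexing.
- Automatic summarization algorithms are less biased than human summarizers.
- Personalized summaries are useful in question-answering systems because they provide personalized information.
- Automatic or semi-automatic summarization systems enable commercial abstract services to increase the number of text documents they can process.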
3. Criteria for classifying text summarization
The figure below shows at least three classification criteria: 1) input type, 2) purpose, and 3) output type.
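3.1 By input type:
- Single document: the input is short. Many early summarization systems handled single-document summarization.
- Multi-document: the input can be arbitrarily long.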
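3.2 By purpose:
- Generic: the model makes no assumptions about the domain or content of the text to be summarized and treats all inputs as homogeneous. Most of the work done so far revolves around generic summarization.
- Domain-specific: the model uses domain knowledge to form a more accurate summary, for example when summarizing research papers in a particular field or biomedical literature.
- Query-based: the summary contains only information that answers a natural-language question about the input text.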
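To make the query-based idea concrete, here is a minimal sketch that ranks sentences by how many query words they share. The helper name `query_focused` and the exact-word matching rule are assumptions made for illustration; a real system would at least add stemming and stop-word handling.

```python
import re

def query_focused(sentences, query, k=1):
    # Hypothetical helper: rank sentences by the number of query terms they contain.
    terms = set(re.findall(r"\w+", query.lower()))

    def overlap(sentence):
        return len(terms & set(re.findall(r"\w+", sentence.lower())))

    # sorted() is stable, so ties keep their document order.
    ranked = sorted(sentences, key=overlap, reverse=True)
    # Drop sentences that share no word with the query.
    return [s for s in ranked[:k] if overlap(s) > 0]
```

Because matching is on exact words, "train" in the query does not match "trains" in a sentence; only literally shared words count.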
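3.3 By output type:
- Extractive: important sentences are selected from the input text to form the summary. Most summarization approaches today are extractive in nature.
- Abstractive: the model forms its own phrases and sentences to produce a more coherent summary, like one a human would write. This approach is certainly more appealing, but it is much harder than extractive summarization.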
4. How to do text summarization
- Text cleaning
- Sentence tokenization
- Word tokenization
- Word-frequency table
- Summarization
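The steps above can be sketched end to end with only the Python standard library. This is an illustrative outline, not the spaCy code used in the rest of this post: the tiny `STOP` set, the regex-based tokenization and the function name `summarize` are all assumptions chosen to keep the example self-contained.

```python
import re
from collections import Counter
from heapq import nlargest

# Tiny illustrative stop-word list (an assumption; spaCy's STOP_WORDS is far larger).
STOP = {"the", "a", "an", "is", "are", "and", "of", "to", "in", "it", "on"}

def summarize(text, ratio=0.3):
    # Sentence tokenization: naive split after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word tokenization plus cleaning: lowercase, keep letters, drop stop words.
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP]
    if not words:
        return ""
    # Word-frequency table, scaled so the most frequent word has weight 1.0.
    counts = Counter(words)
    top = max(counts.values())
    weights = {w: c / top for w, c in counts.items()}
    # Sentence score = sum of the weights of the words it contains.
    scores = {
        s: sum(weights.get(w, 0.0) for w in re.findall(r"[a-z']+", s.lower()))
        for s in sentences
    }
    # Keep the top share of sentences, but always at least one.
    k = max(1, int(len(sentences) * ratio))
    return " ".join(nlargest(k, scores, key=scores.get))
```

For example, `summarize("Cats sleep a lot. Cats chase mice and cats climb trees. Dogs bark.")` returns the middle sentence, because "cats" is the most frequent word and that sentence contains it twice.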
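4.1 Text cleaning:
# !pip install -U spacy
# !python -m spacy download en_core_web_sm
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
# Stop words to ignore when counting word frequencies
stopwords = list(STOP_WORDS)
# Load the small English pipeline and parse the input
# (`text` is the input document given later in this section)
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)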
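4.2 Word tokenization:
tokens = [token.text for token in doc]
print(tokens)
# Treat newlines like punctuation so they are not counted as words
punctuation = punctuation + '\n'
punctuation
# Count every token that is neither a stop word nor punctuation
# (keys keep the token's original casing)
word_frequencies = {}
for word in doc:
    if word.text.lower() not in stopwords:
        if word.text.lower() not in punctuation:
            if word.text not in word_frequencies.keys():
                word_frequencies[word.text] = 1
            else:
                word_frequencies[word.text] += 1
print(word_frequencies)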
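A similar table can be built more compactly with `collections.Counter`. The sketch below is an assumption-laden variant, not the code above: it takes a plain list of token strings rather than spaCy tokens, the function name `word_frequency_table` is made up for illustration, and it only skips single punctuation characters and newlines.

```python
from collections import Counter
from string import punctuation

def word_frequency_table(tokens, stopwords):
    # Skip single punctuation characters and bare newlines.
    skip = set(punctuation) | {"\n"}
    # Count the remaining tokens; keys keep their original casing.
    return Counter(
        t for t in tokens
        if t.lower() not in stopwords and t not in skip
    )
```

With spaCy, the token list could be obtained as `[token.text for token in doc]`, as in the listing above.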
4.3 Sentence tokenization:
# Normalize every count by the highest count, so weights lie in (0, 1]
max_frequency = max(word_frequencies.values())
max_frequency
for word in word_frequencies.keys():
    word_frequencies[word] = word_frequencies[word]/max_frequency
print(word_frequencies)
# Split the document into sentences
sentence_tokens = [sent for sent in doc.sents]
print(sentence_tokens)
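The normalization step divides each count by the largest count. A small worked sketch, assuming a plain dict of counts and a hypothetical helper name `normalize`, makes the arithmetic explicit and also guards against an empty table, where `max()` would otherwise raise `ValueError`:

```python
def normalize(freqs):
    # An empty table has no maximum, so return it unchanged.
    if not freqs:
        return {}
    top = max(freqs.values())
    # Divide every count by the largest count; the top word gets weight 1.0.
    return {word: count / top for word, count in freqs.items()}
```

For counts `{"tennis": 4, "players": 2, "court": 1}` this gives weights 1.0, 0.5 and 0.25.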
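4.4 Word-frequency table and sentence scores:
# Score each sentence as the sum of the normalized frequencies of its words.
# Tokens are matched by their lowercased text, so a word that appears in
# word_frequencies only in capitalized form (for example "Maria") is skipped.
sentence_scores = {}
for sent in sentence_tokens:
    for word in sent:
        if word.text.lower() in word_frequencies.keys():
            if sent not in sentence_scores.keys():
                sentence_scores[sent] = word_frequencies[word.text.lower()]
            else:
                sentence_scores[sent] += word_frequencies[word.text.lower()]
sentence_scores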
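Summing word weights tends to favor long sentences, since they simply contain more words. One possible variation, which is not part of the code above, is to average the weights instead. The sketch below assumes sentences are given as lists of token strings and that `weights` is keyed by lowercase words; `score_sentences` and its `normalize_by_length` flag are hypothetical names.

```python
def score_sentences(sentences, weights, normalize_by_length=False):
    # sentences: list of token lists; weights: lowercase word -> normalized frequency.
    scores = []
    for tokens in sentences:
        # Collect the weight of every token that has an entry in the table.
        hits = [weights[t.lower()] for t in tokens if t.lower() in weights]
        total = sum(hits)
        # Optionally average over the matched tokens to reduce the length bias.
        if normalize_by_length and hits:
            total /= len(hits)
        scores.append(total)
    return scores
```

With the default setting the function reproduces the plain sum; with `normalize_by_length=True` a short sentence made of frequent words can outrank a long one.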
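4.5 Summarizing the key information:
from heapq import nlargest
# Keep roughly 30% of the sentences
select_length = int(len(sentence_tokens)*0.3)
select_length
# Pick the highest-scoring sentences (returned in score order, not document order)
summary = nlargest(select_length, sentence_scores, key = sentence_scores.get)
summary
# Join the text of the selected sentences into a single string
final_summary = [sent.text for sent in summary]
summary = ' '.join(final_summary)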
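`nlargest` returns the selected sentences ordered by score, so the summary can read out of sequence. A possible refinement, sketched here under the assumption that sentences and scores are available as two parallel lists, is to select by index and then restore document order. The name `top_sentences_in_order` is hypothetical, and the `max(1, ...)` guard is an addition: with `int(n * 0.3)` alone, inputs of fewer than four sentences would yield an empty summary.

```python
from heapq import nlargest

def top_sentences_in_order(sentences, scores, ratio=0.3):
    # sentences: strings in document order; scores: parallel list of numbers.
    k = max(1, int(len(sentences) * ratio))
    # Pick the indices of the k best-scoring sentences...
    best = nlargest(k, range(len(sentences)), key=scores.__getitem__)
    # ...then sort the indices so the summary follows the original order.
    return [sentences[i] for i in sorted(best)]
```

The returned list can be joined with `' '.join(...)` exactly as in the listing above.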
Input document (assign `text` before running the code in 4.1):
text = """
Maria Sharapova has basically no friends as tennis players on the WTA Tour. The Russian player has no problems in openly speaking about it and in a recent interview she said: ‘I don’t really hide any feelings too much.
I think everyone knows this is my job here. When I’m on the courts or when I’m on the court playing, I’m a competitor and I want to beat every single person whether they’re in the locker room or across the net.
So I’m not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match.
I’m a pretty competitive girl. I say my hellos, but I’m not sending any players flowers as well. Uhm, I’m not really friendly or close to many players.
I have not a lot of friends away from the courts.’ When she said she is not really close to a lot of players, is that something strategic that she is doing? Is it different on the men’s tour than the women’s tour? ‘No, not at all.
I think just because you’re in the same sport doesn’t mean that you have to be friends with everyone just because you’re categorized, you’re a tennis player, so you’re going to get along with tennis players.
I think every person has different interests. I have friends that have completely different jobs and interests, and I’ve met them in very different parts of my life.
I think everyone just thinks because we’re tennis players we should be the greatest of friends. But ultimately tennis is just a very small part of what we do.
There are so many other things that we’re interested in, that we do.’
"""
4.6 Output (final summary):
I think just because you’re in the same sport doesn’t mean that you have to be friends with everyone just because you’re categorized, you’re a tennis player, so you’re going to get along with tennis players. Maria Sharapova has basically no friends as tennis players on the WTA Tour. I have friends that have completely different jobs and interests, and I’ve met them in very different parts of my life. I think everyone just thinks because we’re tennis players So I’m not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match. When she said she is not really close to a lot of players, is that something strategic that she is doing?
For the complete code, see my repository:
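5. Conclusion
This article gives a concise overview of the key steps that automatic text summarization requires.
Creating a dataset can be heavy work, and it is an often-overlooked part of learning data science that deserves attention in real projects. That, however, is a topic for another blog post.
阿努普·辛格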