In any machine learning task, cleaning or preprocessing the data is as important as model building, if not more so. When it comes to unstructured data such as text, this process becomes even more important.
1. Lower Casing
Lower casing is a common text preprocessing technique. The idea is to convert the input text to the same case so that 'text', 'Text', and 'TEXT' are all handled in the same way.
def lower_casing(self, text):
    return text.lower()
2. Removal of Punctuations
Another common text preprocessing technique is to remove punctuation from the text data. This is again a text standardization step that helps us treat "hurray" and "hurray!" in the same way.
# PUNCT_TO_REMOVE = """!"#$%&\'()*+,-./:;<=>?@[\\]^_{|}~`‘"""
def remove_punctuation(self, text):
    return text.translate(str.maketrans('', '', self.PUNCT_TO_REMOVE))
3. Removal of Stopwords
Stopwords are words that occur very frequently in a language, such as "the", "a", and so on. Most of the time they can be removed from the text, since they do not provide valuable information for downstream analysis. In cases like part-of-speech tagging, however, we should not remove them, because they carry very valuable information about the POS.
def remove_stopwords(self, text):
    """custom function to remove the stopwords"""
    return " ".join([word for word in str(text).split() if word not in self.STOPWORDS])
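The self.STOPWORDS set used above is assumed to come from NLTK's English stopword list, exactly as in the complete class at the end of this article; a minimal sketch:

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')                      # fetch the stopword corpus once
STOPWORDS = set(stopwords.words('english'))     # e.g. {'the', 'a', 'and', ...}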
4. Removal of Frequent Words
In the previous preprocessing step, we removed stopwords based on language-level information. However, if we have a domain-specific corpus, it may also contain some frequently occurring words that are not very important to us.
So this step removes the frequently occurring words from the given corpus. If we use something like tf-idf, this is handled for us automatically. The code below counts word frequencies; a removal sketch follows it.
from collections import Counter
cnt = Counter()
for text in df["text"].values:for word in text.split():cnt[word] += 1cnt.most_common(10)
5. Removal of Rare Words
This is very similar to the previous preprocessing step, except that here we remove the rare words from the corpus.
n_rare_words = 10
RAREWORDS = set([w for (w, wc) in cnt.most_common()[:-n_rare_words-1:-1]])
def remove_rarewords(text):
    """custom function to remove the rare words"""
    return " ".join([word for word in str(text).split() if word not in RAREWORDS])

df["text"] = df["text"].apply(lambda text: remove_rarewords(text))
6. Stemming
Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base, or root form.
For example, if the corpus contains the two words walks and walking, stemming strips their suffixes and reduces both to walk. But in another example with the two words console and consoling, the stemmer removes the suffixes and turns both into consol, which is not a valid English word.
Several stemming algorithms are available; one of the most famous and widely used is the Porter stemmer. We can use the nltk package to do this.
# self.stemmer = PorterStemmer()
def stem_words(self, text):
    return " ".join([self.stemmer.stem(word) for word in text.split()])
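A quick, hedged demonstration of the behaviour described above, calling NLTK's PorterStemmer directly on the example words:

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
# 'walks' and 'walking' both reduce to 'walk', while 'console' and
# 'consoling' both reduce to 'consol', which is not a valid English word.
print([stemmer.stem(word) for word in ["walks", "walking", "console", "consoling"]])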
7. Lemmatization
Lemmatization is similar to stemming in that it reduces an inflected word to its base form, but it differs in that it makes sure the root word (also known as the lemma) belongs to the language.
As a result, this process is usually slower than stemming. Depending on the speed requirement, we can therefore choose either stemming or lemmatization.
Let's use the WordNetLemmatizer from nltk to lemmatize our sentences.
# self.lemmatizer = WordNetLemmatizer()
def lemmatize_words(self, text):
    return " ".join([self.lemmatizer.lemmatize(word) for word in text.split()])
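By default, WordNetLemmatizer treats every word as a noun. The complete class below also includes a POS-aware variant (lemmatize_words_position), which usually yields better lemmas; a standalone sketch of the same idea, which additionally needs NLTK's POS-tagger data:

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')   # required by nltk.pos_tag

lemmatizer = WordNetLemmatizer()
wordnet_map = {"N": wordnet.NOUN, "V": wordnet.VERB, "J": wordnet.ADJ, "R": wordnet.ADV}

def lemmatize_words_position(text):
    # Map each Penn Treebank tag to a WordNet POS (defaulting to noun) before lemmatizing.
    pos_tagged_text = nltk.pos_tag(text.split())
    return " ".join([lemmatizer.lemmatize(word, wordnet_map.get(pos[0], wordnet.NOUN))
                     for word, pos in pos_tagged_text])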
8. Removal of Emojis
With the increasing use of social media platforms, the use of emojis in our day-to-day life has exploded. We may need to remove these emojis for some text analyses.
Thanks to the referenced code snippet, here is a helper function that removes emojis from our text.
# Reference : https://gist.github.com/slowkow/7a7f61f495e3dbb7e3d767f97bd7304b
def remove_emoji(string):
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002702-\U000027B0"
                               u"\U000024C2-\U0001F251"
                               "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', string)

remove_emoji("game is on 🔥🔥")
9. Removal of Emoticons
Reference for the EMOTICONS dictionary: https://github.com/NeelShah18/emot/blob/master/emot/emo_unicode.py
def remove_emoticons(text):
    emoticon_pattern = re.compile(u'(' + u'|'.join(k for k in EMOTICONS) + u')')
    return emoticon_pattern.sub(r'', text)

remove_emoticons("Hello :-)")
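The EMOTICONS dictionary is not defined in the snippet above; it is assumed to be copied from the emo_unicode.py file referenced above. A self-contained sketch with just a few illustrative entries:

import re

# Tiny illustrative subset of the EMOTICONS dictionary from the linked file;
# the keys are already regex-escaped patterns.
EMOTICONS = {
    r":-\)": "Happy face smiley",
    r":\)": "Happy face smiley",
    r":-\(": "Frown, sad",
    r":\(": "Frown, sad",
}

def remove_emoticons(text):
    emoticon_pattern = re.compile(u'(' + u'|'.join(k for k in EMOTICONS) + u')')
    return emoticon_pattern.sub(r'', text)

print(remove_emoticons("Hello :-)"))   # -> "Hello "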
10. Replacing or Removing HTTP URLs
def remove_urls(self, text):
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return url_pattern.sub(r'', text)

def replace_http_url(self, text, word="urladd"):
    return re.sub(r'https?://\S+|www\.\S+', word, text)
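A brief usage sketch with the same regex applied outside the class (the sample sentence is made up for illustration):

import re

sample = "docs at https://example.com and www.example.org"
print(re.sub(r'https?://\S+|www\.\S+', '', sample))        # URLs removed
print(re.sub(r'https?://\S+|www\.\S+', 'urladd', sample))  # URLs replaced by a placeholder token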
11. Replacing Email Addresses
def replace_email_id(self, text, word="emailadd"):
    return re.sub(r"([A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,})", word, text)
12. Replacing Digits
This is mainly done to reduce the dimensionality of the input.
def replace_digit(self, text, word="digitadd"):
    return re.sub(r'\d+', word, text)
13. Removing Extra Spaces and Line Breaks
def remove_extra_space(self, text):
    return re.sub(' +', ' ', text)

def remove_line_break_space(self, text):
    # return text.replace('\n', ' ').replace('\r', '')
    return " ".join([word for word in text.split()])
14. Extracting the Content Inside HTML Tags and Removing the Tags
def remove_html(self, text):
    return BeautifulSoup(text, features='html5lib').text
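A short usage sketch, assuming beautifulsoup4 and html5lib are installed (the HTML snippet is made up):

from bs4 import BeautifulSoup

html = "<div><p>Hello <b>world</b>!</p></div>"
print(BeautifulSoup(html, features='html5lib').text)   # -> "Hello world!"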
15. Expanding Contractions
# import library
import contractions
# contracted text
text = '''I'll be there within 5 min. Shouldn't you be there too? I'd love to see u there my dear. It's awesome to meet new friends.We've been waiting for this day for so long.'''

# creating an empty list
expanded_words = []
for word in text.split():
    # using contractions.fix to expand the shortened words
    expanded_words.append(contractions.fix(word))

expanded_text = ' '.join(expanded_words)
print('Original text: ' + text)
print('Expanded_text: ' + expanded_text)
Complete code
from bs4 import BeautifulSoup
import lxml
import re
from nltk.corpus import stopwords
import nltk
from nltk.stem.porter import PorterStemmer
from spellchecker import SpellChecker
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
import contractions


class text_preprocessing():
    PUNCT_TO_REMOVE = """!"#$%&\'()*+,-./:;<=>?@[\\]^_{|}~`‘"""

    def __init__(self):
        nltk.download('stopwords')
        self.STOPWORDS = set(stopwords.words('english'))
        self.stemmer = PorterStemmer()
        nltk.download('wordnet')
        # Tagger data is needed by nltk.pos_tag in lemmatize_words_position
        nltk.download('averaged_perceptron_tagger')
        self.lemmatizer = WordNetLemmatizer()
        self.spell = SpellChecker()
        self.wordnet_map = {"N": wordnet.NOUN, "V": wordnet.VERB, "J": wordnet.ADJ, "R": wordnet.ADV}

    def expand_contractions(self, text):
        expanded_text = contractions.fix(text)
        return expanded_text

    def lemmatize_words(self, text):
        return " ".join([self.lemmatizer.lemmatize(word) for word in text.split()])

    def lemmatize_words_position(self, text):
        pos_tagged_text = nltk.pos_tag(text.split())
        return " ".join([self.lemmatizer.lemmatize(word, self.wordnet_map.get(pos[0], wordnet.NOUN))
                         for word, pos in pos_tagged_text])

    def remove_punctuation(self, text):
        """custom function to remove the punctuation"""
        return text.translate(str.maketrans('', '', self.PUNCT_TO_REMOVE))

    def remove_space(self, text):
        return text.replace("_x000D_", " ")

    def remove_extra_space(self, text):
        return re.sub(' +', ' ', text)

    def remove_line_break_space(self, text):
        # return text.replace('\n', ' ').replace('\r', '')
        return " ".join([word for word in text.split()])

    def remove_html(self, text):
        return BeautifulSoup(text, features='html5lib').text

    def lower_casing(self, text):
        return text.lower()

    def remove_urls(self, text):
        url_pattern = re.compile(r'https?://\S+|www\.\S+')
        return url_pattern.sub(r'', text)

    def remove_stopwords(self, text):
        """custom function to remove the stopwords"""
        return " ".join([word for word in str(text).split() if word not in self.STOPWORDS])

    def stem_words(self, text):
        return " ".join([self.stemmer.stem(word) for word in text.split()])

    def remove_words(self, text, words):
        return " ".join([word for word in str(text).split() if word not in words])

    def correct_spellings(self, text):
        corrected_text = []
        misspelled_words = self.spell.unknown(text.split())
        for word in text.split():
            if word in misspelled_words:
                corrected_text.append(str(self.spell.correction(word)))
            else:
                corrected_text.append(str(word))
        if len(corrected_text) == 0:
            return ""
        return " ".join(corrected_text)

    def replace_http_url(self, text, word="urladd"):
        return re.sub(r'https?://\S+|www\.\S+', word, text)

    def replace_email_id(self, text, word="emailadd"):
        return re.sub(r"([A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,})", word, text)

    def replace_digit(self, text, word="digitadd"):
        return re.sub(r'\d+', word, text)


if __name__ == '__main__':
    text_preprocessing = text_preprocessing()
    text = """this text is for test"""
    text = text_preprocessing.replace_email_id(text)
    text = text_preprocessing.replace_http_url(text)
    text = text_preprocessing.replace_digit(text)
    text = text_preprocessing.expand_contractions(text)
    print(text)
    # text = text_preprocessing.remove_extra_space(text)
    # print('after removing extra space:', text)
    # old_text = text_preprocessing.remove_line_break_space(text)
    # print('old_text:', old_text)
    # text = text_preprocessing.lemmatize_words(old_text)
    # print("lemmatize_words_position:", text)
    # text = text_preprocessing.stem_words(old_text)
    # print("stem_words:", text)
Pandas text-preprocessing code
import pandas as pd
from pandas import DataFrame
from tabulate import tabulate
from nlp.text_preprocessing_util.text_preprocessing import *

base_dir = "C:/apps/ml_datasets"
text_preprocessing = text_preprocessing()


def get_dataset():
    data = pd.read_excel(base_dir + '/Support_email_category.xlsx', sheet_name='Emails')
    # data = pd.read_excel('../dataset/final.xlsx', sheetname='final')
    # data = data.apply(preprocess_data, axis=1)
    X_orig = data['Subject'].astype(str) + " " + data['Email content']
    y_orig = data['Priority']
    new_data = pd.DataFrame()
    new_data['X_orig'] = X_orig
    new_data['y_orig'] = y_orig
    new_data.to_excel(base_dir + '/raw_data.xlsx', index=None)
    return new_data


def lower_casing(data: DataFrame):
    data["X_orig"] = data["X_orig"].apply(lambda text: text_preprocessing.lower_casing(text))
    return data


def remove_space(data: DataFrame):
    data["X_orig"] = data["X_orig"].apply(lambda text: text_preprocessing.remove_space(text))
    return data


def remove_punctuation(data: DataFrame):
    data["X_orig"] = data["X_orig"].apply(lambda text: text_preprocessing.remove_punctuation(text))
    return data


def remove_stopwords(data: DataFrame):
    data["X_orig"] = data["X_orig"].apply(lambda text: text_preprocessing.remove_stopwords(text))
    return data


def correct_spellings(data: DataFrame):
    data["X_orig"] = data["X_orig"].apply(lambda text: text_preprocessing.correct_spellings(text))
    return data


def remove_freqwords(data: DataFrame):
    from collections import Counter
    cnt = Counter()
    for text in data["X_orig"].values:
        for word in text.split():
            cnt[word] += 1
    FREQWORDS = set([w for (w, wc) in cnt.most_common(10)])
    data["X_orig"] = data["X_orig"].apply(lambda text: text_preprocessing.remove_words(text, FREQWORDS))
    return data


def remove_rare_words(data: DataFrame):
    from collections import Counter
    cnt = Counter()
    for text in data["X_orig"].values:
        for word in text.split():
            cnt[word] += 1
    n_rare_words = 10
    RAREWORDS = set([w for (w, wc) in cnt.most_common()[:-n_rare_words - 1:-1]])
    print("rarewords:", RAREWORDS)
    data["X_orig"] = data["X_orig"].apply(lambda text: text_preprocessing.remove_words(text, RAREWORDS))
    return data


def stem_words(data: DataFrame):
    data["X_orig"] = data["X_orig"].apply(lambda text: text_preprocessing.stem_words(text))
    return data


def lemmatize_words(data: DataFrame):
    data["X_orig"] = data["X_orig"].apply(lambda text: text_preprocessing.lemmatize_words(text))
    return data


def lemmatize_words1(data: DataFrame):
    # The class method lemmatize_words_position already owns its lemmatizer and
    # wordnet_map, so only the text needs to be passed in.
    data["X_orig"] = data["X_orig"].apply(lambda text: text_preprocessing.lemmatize_words_position(text))
    return data


def remove_urls(data: DataFrame):
    data["X_orig"] = data["X_orig"].apply(lambda text: text_preprocessing.remove_urls(text))
    return data


def remove_html(data: DataFrame):
    data["X_orig"] = data["X_orig"].apply(lambda text: text_preprocessing.remove_html(text))
    return data


def process_abbreviated_words(data: DataFrame):
    # chat_words_conversion is assumed to be an extra method on the class above
    # (mapping chat abbreviations such as "u" -> "you"); it is not shown here.
    data["after X_orig"] = data["X_orig"].apply(lambda text: text_preprocessing.chat_words_conversion(text))
    return data


def texts_preprocessing(data: DataFrame):
    data = remove_space(data)
    data = remove_urls(data)
    data = remove_html(data)
    data = process_abbreviated_words(data)
    data = lower_casing(data)
    # print(tabulate(data.head(3)))
    data = remove_punctuation(data)
    data = remove_stopwords(data)
    data = remove_freqwords(data)
    data = remove_rare_words(data)
    # data = stem_words(data)
    print('before...')
    print(tabulate(data.head(3)))
    # data = lemmatize_words(data)
    data = lemmatize_words1(data)
    # data = correct_spellings(data)
    print('after...')
    print(tabulate(data.head(3)))
    return data


def save_file(data, file_name):
    data.to_excel(base_dir + '/' + file_name, index=None)


if __name__ == '__main__':
    data = get_dataset()
    data = texts_preprocessing(data)
    save_file(data, 'after_preprocessing.xlsx')