Building a Suicidal Tweet Classifier Using NLP

Suicide has long been one of the major causes of death worldwide. According to Wikipedia, suicide resulted in 828,000 deaths globally in 2015, up from 712,000 in 1990, which makes it the 10th leading cause of death worldwide. There is also increasing evidence that the Internet and social media can influence suicide-related behaviour. Using Natural Language Processing, a field of Machine Learning, I built a very simple suicidal ideation classifier that predicts whether a text is likely to be suicidal or not.

Data

I used a Twitter crawler that I found on GitHub and made a few changes to the code so that it removes hashtags, links, URLs, and symbols whenever it crawls data from Twitter. The data were crawled based on query parameters containing words and phrases such as:

Depressed, hopeless, Promise to take care of, I dont belong here, Nobody deserve me, I want to die, etc.

Because some of the texts were in no way related to suicide, I had to label the data manually, which came to about 8,200 rows of tweets. I also sourced more Twitter data and concatenated it with what I already had, which gave me enough data to train on.
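The crawler's cleanup step described above can be sketched roughly as follows. This is not the crawler's actual code: `clean_tweet` is a hypothetical helper, and it assumes that "removing hashtags" means dropping the whole tag rather than only the `#` sign.

```python
import re

def clean_tweet(text):
    """Hypothetical helper: strip URLs, hashtags and symbols from a tweet."""
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)   # links and URLs
    text = re.sub(r'#\w+', ' ', text)                     # whole hashtags (assumption)
    text = re.sub(r"[^A-Za-z0-9'\s]", ' ', text)          # remaining symbols
    return re.sub(r'\s+', ' ', text).strip()              # collapse whitespace

print(clean_tweet("I don't belong here #sad https://t.co/abc123 !!!"))
# I don't belong here
```

Apostrophes are kept here so that contractions such as "don't" survive; whether the original crawler did the same is not stated.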

Building the Model

Data Preprocessing

I imported the following libraries:

import pickle
import re
import numpy as np
import pandas as pd
from tqdm import tqdm
import nltk
nltk.download('stopwords')

I then wrote a function to clean the text data: it removes any HTML markup, keeps emoticon characters, removes non-word characters, and converts the text to lowercase.

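def preprocess_tweet(text):
    # Strip HTML markup.
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons such as :) ;-( =D before punctuation is removed.
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase and collapse runs of non-word characters into single spaces.
    lowercase_text = re.sub(r'[\W]+', ' ', text.lower())
    # Append the emoticons (minus the '-' nose); the extra space keeps
    # the first emoticon from being glued onto the last word.
    text = lowercase_text + ' ' + ' '.join(emoticons).replace('-', '')
    return text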
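As a quick illustration of what the emoticon pattern in this function picks up, here is a small standalone check. The sample sentence is made up for the example.

```python
import re

# Same emoticon pattern as in preprocess_tweet: eyes (: ; =),
# an optional nose (-), then a mouth: ) ( D or P.
EMOTICON = re.compile(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)')

sample = "Not okay today :-( but thanks =) ;P"
found = EMOTICON.findall(sample)
print(found)                               # [':-(', '=)', ';P']
print(' '.join(found).replace('-', ''))    # :( =) ;P
```

Because the mouth characters `D` and `P` are uppercase in the pattern, the match has to run on the text before it is lowercased.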

After that, I applied the preprocess_tweet function to the tweet dataset to clean the data.

tqdm.pandas()
df = pd.read_csv('data.csv')
df['tweet'] = df['tweet'].progress_apply(preprocess_tweet)

Then I split the text into tokens with the .split() method and used word stemming to reduce each word to its root form.

from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

Then I imported NLTK's stopwords corpus to remove stop words from the text.

from nltk.corpus import stopwords
stop = stopwords.words('english')

Testing the function on a single text:

[w for w in tokenizer_porter('a runner likes running and runs a lot') if w not in stop]

Output:

['runner', 'like', 'run', 'run', 'lot']

Vectorizer

For this project, I used the Hashing Vectorizer because it is data-independent: it maps tokens to feature columns with a hash function, so it uses very little memory, scales to large datasets, and does not store a vocabulary dictionary in memory. I then created a tokenizer function for the Hashing Vectorizer:

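def tokenizer(text):
    # Same cleaning steps as preprocess_tweet, followed by stemming
    # and stop-word removal.
    text = re.sub(r'<[^>]*>', '', text)
    # Match on the original casing so that :D and :P can be found,
    # and include ')' so that :) is kept as well.
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    text = re.sub(r'[\W]+', ' ', text.lower())
    text += ' ' + ' '.join(emoticons).replace('-', '')
    tokenized = [w for w in tokenizer_porter(text) if w not in stop]
    return tokenized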

Then I created the Hashing Vectorizer object.

from sklearn.feature_extraction.text import HashingVectorizer

vect = HashingVectorizer(decode_error='ignore', n_features=2**21,
                         preprocessor=None, tokenizer=tokenizer)
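To see why a hashing vectorizer needs no stored vocabulary, here is a minimal pure-Python sketch of the hashing trick. It is only an illustration: `hashed_features` is a hypothetical helper that uses MD5 for convenience, whereas scikit-learn's HashingVectorizer uses a different hash function (MurmurHash3) and also normalizes the resulting vectors.

```python
import hashlib

def hashed_features(tokens, n_features=2**21):
    """Map tokens to {column index: count} without storing a vocabulary."""
    counts = {}
    for token in tokens:
        digest = hashlib.md5(token.encode('utf-8')).digest()
        # The column is computed from the token itself, so no lookup table is needed.
        index = int.from_bytes(digest[:8], 'big') % n_features
        counts[index] = counts.get(index, 0) + 1
    return counts

a = hashed_features(['runner', 'like', 'run', 'run', 'lot'])
b = hashed_features(['run', 'lot'])
print(sum(a.values()))      # 5: every token is counted
print(set(b) <= set(a))     # True: a token always lands in the same column
```

The trade-off is that two different tokens can collide in the same column, and a column index cannot be mapped back to the word that produced it. A large `n_features` such as 2**21 keeps collisions rare.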

Model

For the model, I used scikit-learn's stochastic gradient descent classifier with a logistic loss, which amounts to logistic regression trained with SGD.

from sklearn.linear_model import SGDClassifier
# Logistic loss: named 'log_loss' in scikit-learn 1.1 and later ('log' in older versions)
clf = SGDClassifier(loss='log_loss', random_state=1)
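For intuition about what a classifier of this kind does, the following is a bare-bones sketch of logistic regression trained by stochastic gradient descent on toy data. It is not scikit-learn's implementation: SGDClassifier adds regularization and a learning-rate schedule by default, and the function names, learning rate, and data here are all made up for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(w, b, x):
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def sgd_logistic(X, y, epochs=200, lr=0.1, seed=1):
    """Train logistic regression one example at a time (plain SGD)."""
    rng = random.Random(seed)
    w, b = [0.0] * len(X[0]), 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)                        # visit examples in random order
        for i in order:
            error = sigmoid(score(w, b, X[i])) - y[i]   # gradient of the log loss
            w = [wj - lr * error * xj for wj, xj in zip(w, X[i])]
            b -= lr * error
    return w, b

# Toy data: feature 0 marks "risk" words, feature 1 marks neutral words.
X = [[1, 0], [1, 1], [0, 1], [0, 0]]
y = [1, 1, 0, 0]
w, b = sgd_logistic(X, y)
predictions = [int(sigmoid(score(w, b, x)) >= 0.5) for x in X]
print(predictions)   # [1, 1, 0, 0]
```

Because each update uses a single example, the same loop can keep learning from new batches of data later, which is the idea behind incremental training.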

Training and Validation

X = df["tweet"].to_list()
y = df['label']

I used 80% of the data for training and 20% for testing.

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,
                                                 y,
                                                 test_size=0.20,
                                                 random_state=0)
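Conceptually, a split like this shuffles the rows once with a fixed seed and holds out a share of them. The sketch below shows the idea in plain Python with placeholder data. `split_indices` is a hypothetical helper and will not reproduce the exact rows that train_test_split selects for the same seed.

```python
import random

def split_indices(n_rows, test_size=0.20, seed=0):
    """Shuffle row indices once and hold out a `test_size` share for testing."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]    # (train, test)

tweets = ['tweet %d' % i for i in range(10)]     # placeholder rows
labels = [i % 2 for i in range(10)]              # placeholder labels

train_idx, test_idx = split_indices(len(tweets))
train_rows = [(tweets[i], labels[i]) for i in train_idx]
test_rows = [(tweets[i], labels[i]) for i in test_idx]
print(len(train_rows), len(test_rows))           # 8 2
```

Fixing the seed makes the split repeatable, so the reported accuracy refers to the same held-out rows on every run.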

Then I transformed the text data to vectors with the Hashing Vectorizer we created earlier:

X_train = vect.transform(X_train)
X_test = vect.transform(X_test)

Finally, I fit the model to the training data:

classes = np.array([0, 1])
clf.partial_fit(X_train, y_train,classes=classes)

Let's test the accuracy on our test data:

print('Accuracy: %.3f' % clf.score(X_test, y_test))

Output:

Accuracy: 0.912
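For a classifier, clf.score reports plain accuracy, that is, the share of test rows predicted correctly. The sketch below spells that out and also counts the kinds of errors, since a missed positive and a false alarm both lower accuracy in the same way. The labels are made up for illustration and are not the article's data.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for labels 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    return {
        'tp': sum(t == 1 and p == 1 for t, p in pairs),
        'fp': sum(t == 0 and p == 1 for t, p in pairs),
        'fn': sum(t == 1 and p == 0 for t, p in pairs),
        'tn': sum(t == 0 and p == 0 for t, p in pairs),
    }

# Made-up labels for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 1, 1, 0, 0]
print('Accuracy: %.3f' % accuracy(y_true, y_pred))   # Accuracy: 0.800
print(confusion_counts(y_true, y_pred))              # {'tp': 3, 'fp': 1, 'fn': 1, 'tn': 5}
```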

I got an accuracy of 91%, which is fair enough. After that, I updated the model with the test data as well:

clf = clf.partial_fit(X_test, y_test)

Testing and Making Predictions

I passed the text “I’ll kill myself am tired of living depressed and alone” to the model.

label = {0:'negative', 1:'positive'}
example = ["I'll kill myself am tired of living depressed and alone"]
X = vect.transform(example)
print('Prediction: %s\nProbability: %.2f%%'
      %(label[clf.predict(X)[0]],np.max(clf.predict_proba(X))*100))

And I got the output:

Prediction: positive
Probability: 93.76%

And when I used the following text “It’s such a hot day, I’d like to have ice cream and visit the park”, I got the following prediction:

Prediction: negative
Probability: 97.91%
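The percentages above come from np.max(clf.predict_proba(X)), that is, the probability of whichever class the model picked. For a binary model with a logistic loss, that probability is the sigmoid of the model's decision score. The sketch below shows the relationship with made-up scores, which are not the scores behind the outputs shown here.

```python
import math

LABEL = {0: 'negative', 1: 'positive'}

def positive_probability(score):
    """Logistic (sigmoid) function: decision score -> P(class 1)."""
    return 1.0 / (1.0 + math.exp(-score))

def describe(score):
    p1 = positive_probability(score)
    predicted = int(p1 >= 0.5)
    confidence = max(p1, 1.0 - p1)    # probability of the predicted class
    return 'Prediction: %s\nProbability: %.2f%%' % (LABEL[predicted], confidence * 100)

print(describe(2.0))     # Prediction: positive / Probability: 88.08%
print(describe(-3.0))    # Prediction: negative / Probability: 95.26%
```

A reported probability close to 50% therefore means the score was near zero and the model was close to undecided.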

The model was able to predict accurately for both cases. And that's how you build a simple suicidal tweet classifier.

You can find the notebook I used for this article here

Thanks for reading 😊

Translated from: https://towardsdatascience.com/building-a-suicidal-tweet-classifier-using-nlp-ff6ccd77e971
