Twitter Dataset Processing
In the past decade, new forms of communication, such as microblogging and text messaging, have emerged and become ubiquitous. While there is no limit to the range of information conveyed by tweets and texts, these short messages are often used to share opinions and sentiments that people have about what is going on in the world around them.
Opinion mining (also known as sentiment analysis or emotion AI) refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice-of-the-customer materials such as reviews and survey responses, online and social media, and healthcare materials, for applications that range from marketing to customer service to clinical medicine.
Both lexicon-based and machine-learning-based approaches will be used for emoticon-based sentiment analysis. We start with machine-learning-based clustering; within the machine-learning approach we use both supervised and unsupervised learning methods. The collected Twitter data is given as input to the system, which classifies each tweet as positive, negative, or neutral, and also outputs the number of positive, negative, and neutral tweets for each emoticon separately. In addition, the overall polarity is determined on the basis of the polarity of the individual tweets.
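For the lexicon-based side, a minimal sketch of how a tweet's polarity could be scored with TextBlob is shown below; this is an illustration only, not the exact scoring used in our system, and the zero threshold is an assumption.

from textblob import TextBlob

def classify_tweet(text):
    # Lexicon-based polarity score in [-1.0, 1.0]; the sign decides the class.
    polarity = TextBlob(text).sentiment.polarity
    if polarity > 0:
        return "Positive"
    elif polarity < 0:
        return "Negative"
    return "Neutral"

print(classify_tweet("I love this phone :)"))   # expected: Positive
print(classify_tweet("The service was awful"))  # expected: Negative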
Collection of Data
To collect the Twitter data, we have to carry out a small data-mining process. As part of that process, we created our own application with the help of the Twitter API and used it to collect a large number of tweets. To do this, we first create a developer account and register our app. We then receive a consumer key and a consumer secret, which are used in the application settings; from the configuration page of the app we also obtain an access token and an access token secret, which give the application access to Twitter on behalf of the account. The process is divided into two sub-processes, discussed in the next subsection.
Accessing Twitter Data and Streaming
訪問Twitter數據并加強
To build the application and interact with Twitter's services, we use the REST API provided by Twitter through a Python-based client (Tweepy). The api object is then our entry point for most of the operations we can perform with Twitter, and it provides access to different types of data. In this way, we can easily collect tweets (and more) and store them in the system. By default, the data is in JSON format; we convert it to plain text (txt) for easier accessibility.
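As a rough sketch of that conversion, assuming the tweets were saved one JSON object per line (the file names here are illustrative):

import json

# Keep only the tweet text from line-delimited JSON and write it to a txt file.
with open('collected_tweets.json', 'r', encoding='utf-8') as src, \
     open('collected_tweets.txt', 'w', encoding='utf-8') as dst:
    for line in src:
        if line.strip():
            tweet = json.loads(line)
            dst.write(tweet.get('text', '') + '\n')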
In case we want to "keep the connection open" and gather all the upcoming tweets about a particular event, the streaming API is what we need. By extending and customizing the stream-listener process, we process the incoming data as it arrives. In this way we can gather a large number of tweets, which is especially useful for live events with worldwide coverage.
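A minimal sketch of such a customized stream listener with Tweepy (v3.x-style API; the tracked keyword and output file are placeholders) follows; the REST-based collection script we used comes after it.

import tweepy

class TweetListener(tweepy.StreamListener):
    # Append every incoming tweet (raw JSON) to a local file.
    def on_data(self, data):
        with open('live_tweets.json', 'a') as f:
            f.write(data)
        return True

    def on_error(self, status):
        # Returning False on a 420 status disconnects the stream (rate limiting).
        if status == 420:
            return False

# auth is the tweepy.OAuthHandler built from the credentials shown below:
# stream = tweepy.Stream(auth, TweetListener())
# stream.filter(track=['#WorldCup'], languages=['en'])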
# Twitter Sentiment Analysis
import sys
import csv
import tweepy
import matplotlib.pyplot as plt
from collections import Counter
from aylienapiclient import textapi

if sys.version_info[0] < 3:
    input = raw_input

## Twitter credentials (placeholders)
consumer_key = "------------"
consumer_secret = "------------"
access_token = "----------"
access_token_secret = "-----------"

## AYLIEN Text API credentials (placeholders)
application_id = "------------"
application_key = "------------"

## set up an instance of Tweepy
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

## set up an instance of the AYLIEN Text API
client = textapi.Client(application_id, application_key)

## search Twitter for something that interests you
query = input("What subject do you want to analyze for this example? \n")
number = input("How many Tweets do you want to analyze? \n")

results = api.search(
    lang="en",
    q=query + " -rt",
    count=number,
    result_type="recent"
)
print("--- Gathered Tweets \n")

## open a csv file to store the Tweets and their sentiment
file_name = 'Sentiment_Analysis_of_{}_Tweets_About_{}.csv'.format(number, query)
with open(file_name, 'w', newline='') as csvfile:
    csv_writer = csv.DictWriter(f=csvfile, fieldnames=["Tweet", "Sentiment"])
    csv_writer.writeheader()
    print("--- Opened a CSV file to store the results of your sentiment analysis... \n")

    ## tidy up the Tweets and send each to the AYLIEN Text API
    for c, result in enumerate(results, start=1):
        tweet = result.text
        tidy_tweet = tweet.strip().encode('ascii', 'ignore')
        if len(tweet) == 0:
            print('Empty Tweet')
            continue
        response = client.Sentiment({'text': tidy_tweet})
        csv_writer.writerow({'Tweet': response['text'], 'Sentiment': response['polarity']})
        print("Analyzed Tweet {}".format(c))
Data Pre-Processing and Cleaning
The data pre-processing step performs the necessary pre-processing and cleaning on the collected dataset. The previously collected dataset has several key attributes: text, the text of the tweet itself; created_at, the date of creation; favorite_count and retweet_count, the number of favourites and retweets; and favourited and retweeted, booleans stating whether the authenticated user (you) has favourited or retweeted this tweet. We apply an extensive set of pre-processing steps to decrease the size of the feature set and make it suitable for learning algorithms. The cleaning method is based on dictionary methods.
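For illustration, these key attributes could be pulled out of each Tweepy status object into a plain record before cleaning; this is a sketch, and tweets stands for whatever list of status objects was collected earlier.

def to_record(status):
    # Keep only the attributes we care about from a Tweepy status object.
    return {
        'text': status.text,
        'created_at': status.created_at,
        'favorite_count': status.favorite_count,
        'retweet_count': status.retweet_count,
        'favorited': status.favorited,
        'retweeted': status.retweeted,
    }

# records = [to_record(status) for status in tweets]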
Data obtained from Twitter usually contains a lot of HTML entities such as &lt;, &gt;, and &amp; that get embedded in the original data, so it is necessary to get rid of them. One approach is to remove them directly with specific regular expressions. Here, we use Python's HTML-parsing facilities, which can convert these entities back to their standard characters: for example, &lt; is converted to "<" and &amp; is converted to "&". After this, we remove the remaining special HTML characters and links. Decoding is the process of transforming information from complex symbols into simple, easier-to-understand characters; the collected data comes in different encodings such as Latin-1 and UTF-8.
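A sketch of this step under Python 3, where html.unescape plays the role of the older HTMLParser().unescape and a simple regular expression strips links:

import html
import re

def decode_and_strip(text):
    # Convert HTML entities back to characters, then drop URLs.
    text = html.unescape(text)                 # e.g. "&lt;" -> "<", "&amp;" -> "&"
    return re.sub(r'https?://\S+', '', text)   # remove links

print(decode_and_strip("Great match &amp; more at https://t.co/abc &lt;3"))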
The Twitter dataset also contains other information such as retweet markers, hashtags, usernames, and modified tweets. All of this is ignored and removed from the dataset.
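One way to strip these markers is with regular expressions, as in the sketch below (the patterns are illustrative rather than the exact ones used in our pipeline); the helper script we used for tokenization and dictionary lookups follows.

import re

def strip_twitter_markup(text):
    # Remove retweet markers, @usernames and #hashtags, then tidy whitespace.
    text = re.sub(r'^RT\s+', '', text)
    text = re.sub(r'@\w+', '', text)
    text = re.sub(r'#\w+', '', text)
    return re.sub(r'\s+', ' ', text).strip()

print(strip_twitter_markup("RT @user: loving the #WorldCup final!"))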
from nltk.corpus import wordnet
from nltk.corpus import words
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk import pos_tag, pos_tag_sents

# imports for bag of words
import numpy as np
# for regular expressions
import re
# TextBlob dependency
from textblob import TextBlob
from textblob import Word
# set to string
from ast import literal_eval
# project-local dependency
from sentencecounter import no_sentences, getline, gettempwords
import os


def getsysets(word):
    # Inspect the WordNet synsets of a word (diagnostics left commented out).
    syns = wordnet.synsets(word)  # wordnet from nltk.corpus will not work with textblob
    # print(syns[0].name())
    # print(syns[0].lemmas()[0].name())  # synset lemma names
    # print(syns[0].definition())        # definition
    # print(syns[0].examples())          # usage examples

# getsysets("good")


def getsynonyms(word):
    # Return the set of WordNet synonyms (lemma names) of a word.
    synonyms = []
    # antonyms = []
    for syn in wordnet.synsets(word):
        for lemma in syn.lemmas():
            synonyms.append(lemma.name())
            # if lemma.antonyms():
            #     antonyms.append(lemma.antonyms()[0].name())
    return set(synonyms)

# getsynonyms("good")


def extract_words(sentence):
    # Split a sentence into lower-cased word tokens, dropping ignored words.
    ignore_words = ['a']
    words = re.sub(r"[^\w]", " ", sentence).split()  # alternative: nltk.word_tokenize(sentence)
    words_cleaned = [w.lower() for w in words if w not in ignore_words]
    return words_cleaned


def tokenize_sentences(sentences):
    # Build a sorted vocabulary from a list of sentences.
    words = []
    for sentence in sentences:
        words.extend(extract_words(sentence))
    return sorted(set(words))


def bagofwords(sentence, words):
    # Frequency word count of a sentence against a fixed vocabulary.
    sentence_words = extract_words(sentence)
    bag = np.zeros(len(words))
    for sw in sentence_words:
        for i, word in enumerate(words):
            if word == sw:
                bag[i] += 1
    return np.array(bag)


def tokenizer(sentences):
    # Tokenize a string and print the sentence and word tokens.
    token = word_tokenize(sentences)
    print("#" * 100)
    print(sent_tokenize(sentences))
    print(token)
    print("#" * 100)
    return token

# sentences = ["Machine learning is great",
#              "Natural Language Processing is a complex field",
#              "Natural Language Processing is used in machine learning"]
# vocabulary = tokenize_sentences(sentences)
# print(vocabulary)
# tokenizer(sentences)


def createposfile(filename, word):
    # Write a word (plus newline) to the destination file for positive words.
    # filename = input("Enter destination file name in string format :")
    with open(filename, 'w') as f:
        f.writelines(word + '\n')


def createnegfile(filename, word):
    # Write a word to the destination file for negative words.
    # filename = input("Enter destination file name in string format :")
    with open(filename, 'w') as f:
        f.writelines(word)


def getsortedsynonyms(word):
    return sorted(getsynonyms(word))


def getlengthofarray(word):
    return len(getsortedsynonyms(word))


def readposfile():
    return open('list of positive words.txt')


def searchword(word, srcfile):
    # If a word (or, recursively, one of its synonyms) appears in the
    # positive-word list, record it; otherwise keep searching the synonyms.
    # if word in open('list of negative words.txt').read():
    #     createnegfile('destinationposfile.txt', word)
    if word in open('list of positive words.txt').read():
        createposfile('destinationnegfile.txt', word)
    else:
        for i in range(getlengthofarray(word)):
            searchword(getsortedsynonyms(word)[i], srcfile)
        with open(srcfile, 'w') as f:
            f.writelines(word)


print('#' * 50)
# searchword('lol', 'a.txt')
print(readposfile())
# tokenizer(sentences)
# getsynonyms('good')
# print(sorted(getsynonyms('good'))[2])  # pick one synonym (here the third) from the sorted set
print('#' * 50)
# print(len(getsortedsynonyms('bad')))
# createposfile('created.txt', 'lol')
# for word in word_tokenize(getline()):
#     searchword(word, 'a.txt')
Stop words are generally thought of as a single set of very common words. We do not want these words taking up space in our database, so we remove them using NLTK and a stop-word dictionary; the stop words are dropped because they are not useful. All punctuation marks should be dealt with according to their priority: for example, ".", ",", and "?" are important punctuation marks that should be retained, while others need to be removed. Finally, duplicate tweets are removed from the dataset. Sometimes it is better to remove duplicates based on a set of unique identifiers; for example, the chances of two transactions happening at the same time, with the same square footage, the same price, and the same build year, are close to zero.
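A minimal sketch of the stop-word and duplicate removal with NLTK, assuming the stopwords and punkt resources have been downloaded (nltk.download('stopwords'), nltk.download('punkt')); the set of punctuation kept here is an assumption that follows the priorities above.

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
keep_punct = {'.', ',', '?'}  # punctuation we choose to retain

def remove_stopwords(text):
    # Drop stop words and any punctuation we did not decide to keep.
    tokens = word_tokenize(text)
    return [t for t in tokens
            if t.lower() not in stop_words
            and (t.isalnum() or t in keep_punct)]

def dedupe(tweets):
    # Remove exact duplicate tweets while preserving order.
    seen = set()
    return [t for t in tweets if not (t in seen or seen.add(t))]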
Thank you for reading.
I hope you found this data cleaning guide helpful. Please leave any comments to let us know your thoughts.
To read the previous part of the series:
https://medium.com/@sayanmondal2098/sentimental-analysis-of-twitter-emoji-64432793b76f
Translated from: https://medium.com/swlh/twitter-data-cleaning-and-preprocessing-for-data-science-3ca0ea80e5cd