Twitter Dataset Processing
In the past decade, new forms of communication, such as microblogging and text messaging, have emerged and become ubiquitous. While there is no limit to the range of information conveyed by tweets and texts, these short messages are often used to share opinions and sentiments that people have about what is going on in the world around them.
Opinion mining (also known as sentiment analysis or emotion AI) refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice-of-the-customer materials such as reviews and survey responses, online and social media, and healthcare materials, for applications that range from marketing to customer service to clinical medicine.
Both lexicon-based and machine-learning-based approaches will be used for emoticon-based sentiment analysis. We start with machine-learning-based clustering; within the machine-learning approach we use both supervised and unsupervised learning methods. The collected Twitter data is given as input to the system, which classifies each tweet as positive, negative, or neutral, and also outputs the number of positive, negative, and neutral tweets for each emoticon separately. In addition, the overall polarity is determined on the basis of the polarity of the individual tweets.
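For the lexicon-based side, a minimal sketch of how a tweet's polarity could be scored with TextBlob is shown below; this is an illustration only, not the exact scoring used in our system, and the zero threshold is an assumption.

from textblob import TextBlob

def classify_tweet(text):
    # Lexicon-based polarity score in [-1.0, 1.0]; the sign decides the class.
    polarity = TextBlob(text).sentiment.polarity
    if polarity > 0:
        return "Positive"
    elif polarity < 0:
        return "Negative"
    return "Neutral"

print(classify_tweet("I love this phone :)"))   # expected: Positive
print(classify_tweet("The service was awful"))  # expected: Negative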
Collection of Data
To collect the Twitter data, we have to carry out a small data-mining process. As part of that process, we created our own application with the help of the Twitter API and used it to collect a large number of tweets. To do this, we first create a developer account and register our app. We then receive a consumer key and a consumer secret, which are used in the application settings; from the configuration page of the app we also obtain an access token and an access token secret, which give the application access to Twitter on behalf of the account. The process is divided into two sub-processes, discussed in the next subsection.
Accessing Twitter Data and Streaming
訪問Twitter數據并加強
To build the application and interact with Twitter's services, we use the REST API provided by Twitter through a Python-based client (Tweepy). The api object is then our entry point for most of the operations we can perform with Twitter, and it provides access to different types of data. In this way, we can easily collect tweets (and more) and store them in the system. By default, the data is in JSON format; we convert it to plain text (txt) for easier accessibility.
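As a rough sketch of that conversion, assuming the tweets were saved one JSON object per line (the file names here are illustrative):

import json

# Keep only the tweet text from line-delimited JSON and write it to a txt file.
with open('collected_tweets.json', 'r', encoding='utf-8') as src, \
     open('collected_tweets.txt', 'w', encoding='utf-8') as dst:
    for line in src:
        if line.strip():
            tweet = json.loads(line)
            dst.write(tweet.get('text', '') + '\n')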
In case we want to "keep the connection open" and gather all the upcoming tweets about a particular event, the streaming API is what we need. By extending and customizing the stream-listener process, we process the incoming data as it arrives. In this way we can gather a large number of tweets, which is especially useful for live events with worldwide coverage.
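A minimal sketch of such a customized stream listener with Tweepy (v3.x-style API; the tracked keyword and output file are placeholders) follows; the REST-based collection script we used comes after it.

import tweepy

class TweetListener(tweepy.StreamListener):
    # Append every incoming tweet (raw JSON) to a local file.
    def on_data(self, data):
        with open('live_tweets.json', 'a') as f:
            f.write(data)
        return True

    def on_error(self, status):
        # Returning False on a 420 status disconnects the stream (rate limiting).
        if status == 420:
            return False

# auth is the tweepy.OAuthHandler built from the credentials shown below:
# stream = tweepy.Stream(auth, TweetListener())
# stream.filter(track=['#WorldCup'], languages=['en'])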
# Twitter Sentiment Analysis
import sys
import csv
import tweepy
import matplotlib.pyplot as plt
from collections import Counter
from aylienapiclient import textapi

if sys.version_info[0] < 3:
    input = raw_input

## Twitter credentials (placeholders)
consumer_key = "------------"
consumer_secret = "------------"
access_token = "----------"
access_token_secret = "-----------"

## AYLIEN Text API credentials (placeholders)
application_id = "------------"
application_key = "------------"

## set up an instance of Tweepy
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

## set up an instance of the AYLIEN Text API
client = textapi.Client(application_id, application_key)

## search Twitter for something that interests you
query = input("What subject do you want to analyze for this example? \n")
number = input("How many Tweets do you want to analyze? \n")

results = api.search(
    lang="en",
    q=query + " -rt",
    count=number,
    result_type="recent"
)
print("--- Gathered Tweets \n")

## open a csv file to store the Tweets and their sentiment
file_name = 'Sentiment_Analysis_of_{}_Tweets_About_{}.csv'.format(number, query)
with open(file_name, 'w', newline='') as csvfile:
    csv_writer = csv.DictWriter(f=csvfile, fieldnames=["Tweet", "Sentiment"])
    csv_writer.writeheader()
    print("--- Opened a CSV file to store the results of your sentiment analysis... \n")

    ## tidy up the Tweets and send each to the AYLIEN Text API
    for c, result in enumerate(results, start=1):
        tweet = result.text
        tidy_tweet = tweet.strip().encode('ascii', 'ignore')
        if len(tweet) == 0:
            print('Empty Tweet')
            continue
        response = client.Sentiment({'text': tidy_tweet})
        csv_writer.writerow({'Tweet': response['text'], 'Sentiment': response['polarity']})
        print("Analyzed Tweet {}".format(c))
Data Pre-Processing and Cleaning
The data pre-processing step performs the necessary pre-processing and cleaning on the collected dataset. The previously collected dataset has several key attributes: text, the text of the tweet itself; created_at, the date of creation; favorite_count and retweet_count, the number of favourites and retweets; and favourited and retweeted, booleans stating whether the authenticated user (you) has favourited or retweeted this tweet. We apply an extensive set of pre-processing steps to decrease the size of the feature set and make it suitable for learning algorithms. The cleaning method is based on dictionary methods.
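For illustration, these key attributes could be pulled out of each Tweepy status object into a plain record before cleaning; this is a sketch, and tweets stands for whatever list of status objects was collected earlier.

def to_record(status):
    # Keep only the attributes we care about from a Tweepy status object.
    return {
        'text': status.text,
        'created_at': status.created_at,
        'favorite_count': status.favorite_count,
        'retweet_count': status.retweet_count,
        'favorited': status.favorited,
        'retweeted': status.retweeted,
    }

# records = [to_record(status) for status in tweets]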
Data obtained from Twitter usually contains a lot of HTML entities such as &lt;, &gt;, and &amp; that get embedded in the original data, so it is necessary to get rid of them. One approach is to remove them directly with specific regular expressions. Here, we use Python's HTML-parsing facilities, which can convert these entities back to their standard characters: for example, &lt; is converted to "<" and &amp; is converted to "&". After this, we remove the remaining special HTML characters and links. Decoding is the process of transforming information from complex symbols into simple, easier-to-understand characters; the collected data comes in different encodings such as Latin-1 and UTF-8.
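A sketch of this step under Python 3, where html.unescape plays the role of the older HTMLParser().unescape and a simple regular expression strips links:

import html
import re

def decode_and_strip(text):
    # Convert HTML entities back to characters, then drop URLs.
    text = html.unescape(text)                 # e.g. "&lt;" -> "<", "&amp;" -> "&"
    return re.sub(r'https?://\S+', '', text)   # remove links

print(decode_and_strip("Great match &amp; more at https://t.co/abc &lt;3"))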
The Twitter dataset also contains other information such as retweet markers, hashtags, usernames, and modified tweets. All of this is ignored and removed from the dataset.
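One way to strip these markers is with regular expressions, as in the sketch below (the patterns are illustrative rather than the exact ones used in our pipeline); the helper script we used for tokenization and dictionary lookups follows.

import re

def strip_twitter_markup(text):
    # Remove retweet markers, @usernames and #hashtags, then tidy whitespace.
    text = re.sub(r'^RT\s+', '', text)
    text = re.sub(r'@\w+', '', text)
    text = re.sub(r'#\w+', '', text)
    return re.sub(r'\s+', ' ', text).strip()

print(strip_twitter_markup("RT @user: loving the #WorldCup final!"))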
from nltk.corpus import wordnet
from nltk.corpus import words
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk import pos_tag, pos_tag_sents

# imports for bag of words
import numpy as np
# for regular expressions
import re
# TextBlob dependency
from textblob import TextBlob
from textblob import Word
# set to string
from ast import literal_eval
# project-local dependency
from sentencecounter import no_sentences, getline, gettempwords
import os


def getsysets(word):
    # Inspect the WordNet synsets of a word (diagnostics left commented out).
    syns = wordnet.synsets(word)  # wordnet from nltk.corpus will not work with textblob
    # print(syns[0].name())
    # print(syns[0].lemmas()[0].name())  # synset lemma names
    # print(syns[0].definition())        # definition
    # print(syns[0].examples())          # usage examples

# getsysets("good")


def getsynonyms(word):
    # Return the set of WordNet synonyms (lemma names) of a word.
    synonyms = []
    # antonyms = []
    for syn in wordnet.synsets(word):
        for lemma in syn.lemmas():
            synonyms.append(lemma.name())
            # if lemma.antonyms():
            #     antonyms.append(lemma.antonyms()[0].name())
    return set(synonyms)

# getsynonyms("good")


def extract_words(sentence):
    # Split a sentence into lower-cased word tokens, dropping ignored words.
    ignore_words = ['a']
    words = re.sub(r"[^\w]", " ", sentence).split()  # alternative: nltk.word_tokenize(sentence)
    words_cleaned = [w.lower() for w in words if w not in ignore_words]
    return words_cleaned


def tokenize_sentences(sentences):
    # Build a sorted vocabulary from a list of sentences.
    words = []
    for sentence in sentences:
        words.extend(extract_words(sentence))
    return sorted(set(words))


def bagofwords(sentence, words):
    # Frequency word count of a sentence against a fixed vocabulary.
    sentence_words = extract_words(sentence)
    bag = np.zeros(len(words))
    for sw in sentence_words:
        for i, word in enumerate(words):
            if word == sw:
                bag[i] += 1
    return np.array(bag)


def tokenizer(sentences):
    # Tokenize a string and print the sentence and word tokens.
    token = word_tokenize(sentences)
    print("#" * 100)
    print(sent_tokenize(sentences))
    print(token)
    print("#" * 100)
    return token

# sentences = ["Machine learning is great",
#              "Natural Language Processing is a complex field",
#              "Natural Language Processing is used in machine learning"]
# vocabulary = tokenize_sentences(sentences)
# print(vocabulary)
# tokenizer(sentences)


def createposfile(filename, word):
    # Write a word (plus newline) to the destination file for positive words.
    # filename = input("Enter destination file name in string format :")
    with open(filename, 'w') as f:
        f.writelines(word + '\n')


def createnegfile(filename, word):
    # Write a word to the destination file for negative words.
    # filename = input("Enter destination file name in string format :")
    with open(filename, 'w') as f:
        f.writelines(word)


def getsortedsynonyms(word):
    return sorted(getsynonyms(word))


def getlengthofarray(word):
    return len(getsortedsynonyms(word))


def readposfile():
    return open('list of positive words.txt')


def searchword(word, srcfile):
    # If a word (or, recursively, one of its synonyms) appears in the
    # positive-word list, record it; otherwise keep searching the synonyms.
    # if word in open('list of negative words.txt').read():
    #     createnegfile('destinationposfile.txt', word)
    if word in open('list of positive words.txt').read():
        createposfile('destinationnegfile.txt', word)
    else:
        for i in range(getlengthofarray(word)):
            searchword(getsortedsynonyms(word)[i], srcfile)
        with open(srcfile, 'w') as f:
            f.writelines(word)


print('#' * 50)
# searchword('lol', 'a.txt')
print(readposfile())
# tokenizer(sentences)
# getsynonyms('good')
# print(sorted(getsynonyms('good'))[2])  # pick one synonym (here the third) from the sorted set
print('#' * 50)
# print(len(getsortedsynonyms('bad')))
# createposfile('created.txt', 'lol')
# for word in word_tokenize(getline()):
#     searchword(word, 'a.txt')
Stop words are generally thought of as a single set of very common words. We do not want these words taking up space in our database, so we remove them using NLTK and a stop-word dictionary; the stop words are dropped because they are not useful. All punctuation marks should be dealt with according to their priority: for example, ".", ",", and "?" are important punctuation marks that should be retained, while others need to be removed. Finally, duplicate tweets are removed from the dataset. Sometimes it is better to remove duplicates based on a set of unique identifiers; for example, the chances of two transactions happening at the same time, with the same square footage, the same price, and the same build year, are close to zero.
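A minimal sketch of the stop-word and duplicate removal with NLTK, assuming the stopwords and punkt resources have been downloaded (nltk.download('stopwords'), nltk.download('punkt')); the set of punctuation kept here is an assumption that follows the priorities above.

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
keep_punct = {'.', ',', '?'}  # punctuation we choose to retain

def remove_stopwords(text):
    # Drop stop words and any punctuation we did not decide to keep.
    tokens = word_tokenize(text)
    return [t for t in tokens
            if t.lower() not in stop_words
            and (t.isalnum() or t in keep_punct)]

def dedupe(tweets):
    # Remove exact duplicate tweets while preserving order.
    seen = set()
    return [t for t in tweets if not (t in seen or seen.add(t))]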
Thank you for reading.
I hope you found this data cleaning guide helpful. Please leave any comments to let us know your thoughts.
To read the previous part of the series:
https://medium.com/@sayanmondal2098/sentimental-analysis-of-twitter-emoji-64432793b76f
Translated from: https://medium.com/swlh/twitter-data-cleaning-and-preprocessing-for-data-science-3ca0ea80e5cd