Twitter Data Analysis
If you write data science articles, or are trying to get started, knowing which topics are most popular is a big help in getting your articles read. Below are the steps to find those topics using R, followed by the results of the analysis. This article can also serve as an introduction to using an API and doing some basic text processing in R. Feel free to alter the code for other Twitter analyses, or skip to the end if you're only interested in the results.
The Twitter API
If you don't have a Twitter account, you need to create one. After that, head over to Twitter Developer. Once you have signed in with your new account, select the Apps menu, then Create an App. From there, fill out the form describing the app; most fields can be left blank, and only the app's name and details are needed.
Installing R Packages
We need three R packages for this project: rtweet, tidyverse, and tidytext. Install each one with:
install.packages("package-name")
Make sure you load each of these packages in your script with the library() function.
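As a minimal sketch of the install-and-load step, the snippet below checks which of the three packages are missing before installing anything. The helper name `missing_packages` is hypothetical, and the install and load calls are left commented out so nothing is downloaded by accident.

```r
# Hypothetical helper: return the entries of `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("rtweet", "tidyverse", "tidytext")
to_install <- missing_packages(needed)

# Uncomment to install whatever is missing, then load all three packages.
# if (length(to_install) > 0) install.packages(to_install)
# invisible(lapply(needed, library, character.only = TRUE))
```

Checking first avoids reinstalling packages every time the script runs.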
Collecting Tweets
To get tweets, we first need to generate a token. We can do that with the create_token() function from the rtweet package. Its arguments can be filled in from the information on your new app's page on the Twitter Developer site. All of these values are security-sensitive, so whenever you use an API, take the precaution of keeping tokens and keys private.
token <- create_token(app = "<app-name>",
                      consumer_key = "<consumer-key>",
                      consumer_secret = "<consumer-secret>",
                      access_token = "<access-token>",
                      access_secret = "<access-secret>")
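One common way to keep keys out of a script is to read them from environment variables. The sketch below assumes that approach; the helper `read_secret` and the variable names such as `TWITTER_CONSUMER_KEY` are hypothetical, and the create_token() call is shown only as a comment.

```r
# Hypothetical helper: read a secret from an environment variable and
# fail loudly if it is unset, rather than passing an empty string along.
read_secret <- function(name) {
  value <- Sys.getenv(name, unset = "")
  if (!nzchar(value)) {
    stop("Environment variable ", name, " is not set", call. = FALSE)
  }
  value
}

# The variable names below are assumptions; they could be defined in ~/.Renviron.
# token <- create_token(app = "<app-name>",
#                       consumer_key    = read_secret("TWITTER_CONSUMER_KEY"),
#                       consumer_secret = read_secret("TWITTER_CONSUMER_SECRET"),
#                       access_token    = read_secret("TWITTER_ACCESS_TOKEN"),
#                       access_secret   = read_secret("TWITTER_ACCESS_SECRET"))
```

With this layout the script can be shared or committed without exposing the credentials themselves.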
Now to grab tweets! Use the get_timeline() function to fetch up to the 3,200 most recent tweets of any public user. To get more than that in one query, you will have to pay.
tweets <- get_timeline("TDataScience", n=3200, token=token)
The variable tweets now holds 3,200 rows and 50 columns of data. We primarily care about the text, favorite_count, and retweet_count columns, but you can explore the other columns as well.
Removing Unwanted Phrases
We'd also like to remove some things that are common in this domain, mainly links and Twitter handles. The gsub() function replaces occurrences of a pattern with something else; here we remove links and handles by replacing them with empty strings. Below, we pass the text column from one gsub() call to the next with the pipe operator, removing the URLs and handles, and save the result in a new column called clean_text.
tweets$text %>%
  # drop links (this greedy pattern also drops any text that follows the link)
  gsub(" ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)", "", .) %>%
  # drop @handles
  gsub("@([A-Za-z0-9_]+)", "", .) -> tweets$clean_text
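To see what these patterns do, here is a small sketch on a made-up tweet. It compares the article's link pattern with an assumed stricter alternative (`url_strict`, not from the article) that stops at the next whitespace. The `clean` helper and the whitespace cleanup at the end are also illustrative additions.

```r
# The article's link pattern: greedy, so it removes everything from the
# link to the end of the string.
url_greedy <- " ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)"
# Assumed alternative: match only up to the next whitespace.
url_strict <- "\\s?(f|ht)tps?://\\S+"
handle     <- "@([A-Za-z0-9_]+)"

# Hypothetical helper: remove links and handles, then tidy the spacing.
clean <- function(x, url_pattern) {
  x <- gsub(url_pattern, "", x)
  x <- gsub(handle, "", x)
  trimws(gsub("\\s+", " ", x))
}

sample_tweet <- "Read https://example.com/post by @someone today"

clean(sample_tweet, url_greedy)  # text after the link is lost
clean(sample_tweet, url_strict)  # text after the link is kept
```

The greedy pattern is harmless when the link is the last thing in a tweet. The stricter one may be preferable when text follows the link.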
Removing Stop Words
Now we want to remove stop words from our clean_text column. Stop words are words that are not important to understanding the meaning of a passage. These are typically the most common words of a language, but there may be additional words you’d like to remove for a specific domain. In our task we might consider the word “data” to be trivial since almost every article will mention data.
Tokenization is the process of splitting text into sections. This could be by sentence, paragraph, word, or something else. In our case we split by word, which turns our table of data into a table with one row for each word in each tweet. Below I select the columns of interest and then tokenize the words.
tweets %>%
  select(clean_text, favorite_count, retweet_count) %>%
  # one row per word; text is lowercased and punctuation stripped by default
  unnest_tokens(word, clean_text)
This leaves us with a table in which the words "include", "training", and "operations", each now on its own row, were originally all part of the same row in the clean_text column.
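A toy example may make the reshaping easier to picture. The sketch below runs unnest_tokens() on two made-up rows (the text and counts are invented for illustration) and assumes the tibble and tidytext packages are installed.

```r
library(tibble)
library(tidytext)

# Two made-up "tweets" with their favorite counts.
toy <- tibble(
  clean_text     = c("Courses include training, and Operations!",
                     "Beautiful plotting"),
  favorite_count = c(10L, 25L)
)

# One row per word; the other columns are repeated for every word
# that came from the same tweet.
toy_words <- unnest_tokens(toy, word, clean_text)
toy_words
```

The first tweet becomes five rows and the second becomes two, and each row carries the favorite count of the tweet it came from.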

Now we do an anti-join with the stop_words data set from the tidytext package. This removes every word that appears in both the tweets data set and the stop_words data set (essentially, it removes the stop words from your data). The code for this is below.
tweets %>%
  select(clean_text, favorite_count, retweet_count) %>%
  unnest_tokens(word, clean_text) %>%
  # keep only the rows whose word is not in stop_words
  anti_join(stop_words, by = "word")
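If you want to drop domain-specific words such as "data" as well, one option is to extend the stop word list before the anti-join. The sketch below assumes dplyr, tibble, and tidytext are installed; the name `my_stop_words` and the "custom" lexicon label are illustrative choices, not part of the article's code.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Add a domain-specific stop word to the built-in list.
# The "custom" lexicon label is just an illustrative tag.
my_stop_words <- bind_rows(
  stop_words,
  tibble(word = "data", lexicon = "custom")
)

# Made-up tokenized words.
toy_words <- tibble(word = c("the", "data", "matplotlib", "of", "plotting"))

# Rows whose word appears in my_stop_words are dropped.
kept <- anti_join(toy_words, my_stop_words, by = "word")
kept
```

Only "matplotlib" and "plotting" should survive, because "the" and "of" are in the built-in list and "data" was added by hand.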
Getting Average Favorites and Retweets
Using the group_by() and summarise() functions from the tidyverse, we can get the median number of favorites and retweets for each word in our data set. I chose the median here to reduce the effect that a few outliers may have on our data.
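A quick sketch with made-up numbers shows why the median is the safer summary here: one viral tweet drags the mean far away from what a typical tweet gets, while the median barely moves.

```r
# Made-up favorite counts for tweets containing one word;
# a single viral tweet (500) sits among otherwise ordinary ones.
favorites <- c(3, 4, 5, 500)

mean_favorites   <- mean(favorites)    # 128, pulled up by the outlier
median_favorites <- median(favorites)  # 4.5, close to the typical tweet
```
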
tweets %>%
  select(clean_text, favorite_count, retweet_count) %>%
  unnest_tokens(word, clean_text) %>%
  anti_join(stop_words, by = "word") %>%
  group_by(word) %>%
  # per-word medians, plus how many times the word occurs
  summarise("Median Favorites" = median(favorite_count),
            "Median Retweets" = median(retweet_count),
            "Count" = n())
Filtering Results
We want to see the words with the most favorites and retweets, to inform what we should write about. But consider the case where a word is used only once. For example, if there is one really popular article out there about data science applied to the geography of Michigan, we wouldn't want this to bias us toward writing articles about Michigan. To fix this, I use the filter() function to drop rarely used words, keeping only those with a Count above 4, and then sort the results with the arrange() function. That leaves us with our final bit of code.
tweets %>%
  select(clean_text, favorite_count, retweet_count) %>%
  unnest_tokens(word, clean_text) %>%
  anti_join(stop_words, by = "word") %>%
  group_by(word) %>%
  summarise("Median Favorites" = median(favorite_count),
            "Median Retweets" = median(retweet_count),
            "Count" = n()) %>%
  # sort by median favorites, then drop words that occur 4 times or fewer
  arrange(desc(`Median Favorites`)) %>%
  filter(`Count` > 4)
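The effect of the count filter is easiest to see on made-up data. In the sketch below, "michigan" appears once with a very high favorite count, and the filter drops it before ranking. The data and the snake_case column names are illustrative, and dplyr and tibble are assumed to be installed.

```r
library(dplyr)
library(tibble)

# Made-up tokenized words: one viral "michigan" tweet, plus two words
# that each occur five times.
toy_words <- tibble(
  word           = c("michigan", rep("plotting", 5), rep("finance", 5)),
  favorite_count = c(300, 40, 50, 60, 70, 80, 60, 70, 72, 80, 90)
)

ranked <- toy_words %>%
  group_by(word) %>%
  summarise(median_favorites = median(favorite_count), count = n()) %>%
  filter(count > 4) %>%                 # "michigan" (count 1) is dropped
  arrange(desc(median_favorites))

ranked
```

Without the filter, "michigan" would top the ranking on the strength of a single tweet.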
Final Results
If you skipped to the end, or don't want to run this code yourself, here are the results. These are the common words in Towards Data Science tweets, ranked by the median number of favorites that tweets containing each word receive.
- website: 84.5 favorites
- finance: 72 favorites
- action: 66 favorites
- matplotlib: 59.5 favorites
- plotting: 57 favorites
- beautiful: 55.5 favorites
- portfolio: 51 favorites
- exploratory: 47 favorites
- github: 46.5 favorites
- comprehensive: 46 favorites
Note: I did have to remove the word "James" here, since apparently there is at least one very popular author with that name. Congratulations, James(es)!
My advice now would be to write an article titled "A Comprehensive Guide to Creating Beautiful Financial Plots with Matplotlib" or something like that. You can now repeat this process for any other Twitter user with a public account.
Translated from: https://towardsdatascience.com/the-most-popular-towards-data-science-article-topics-on-twitter-2ecc512dd041