by Sofia Godovykh
How I used machine learning to explore the differences between British and American literature
As I delved deeper into English literature to improve my own language skills, my interest was piqued: how do American and British English differ?
With this question in mind, the next step was to apply natural language processing and machine learning techniques to find concrete examples. I was curious to know whether it would be possible to train a classifier that could distinguish the two kinds of literary texts.
It is quite easy to distinguish texts written in different languages, since the cardinality of the intersection of their word sets (features, in machine-learning terms) is relatively small. Classifying texts by category (science, atheism, computer graphics, and so on) is a well-known “hello world” among text-classification tasks. I faced a more difficult task in comparing two dialects of the same language, as the texts share no common theme.
The most time-consuming stage of machine learning is data retrieval. For the training sample, I used texts from Project Gutenberg, which can be downloaded freely. As for the list of American and British authors, I used names of authors I found on Wikipedia.
One of the challenges I encountered was matching the author of a text to the right Wikipedia page. The site implements a good search by name, but since it doesn't allow data to be parsed, I proposed instead to use the files containing metadata. This meant solving a non-trivial name-matching task (Sir Arthur Ignatius Conan Doyle and Doyle, C. are the same person, but Doyle, M.E. is a different person), and I had to do so with a very high level of accuracy.
Instead, I chose to sacrifice sample size for the sake of high accuracy, and to save some time. As a unique identifier I chose an author's Wikipedia link, which was included in some of the metadata files. With these files I was able to acquire about 1,600 British and 2,500 American texts and use them to begin training my classifier.
For this project I used the sklearn package. The first step after the data collection and analysis stage is pre-processing, for which I used a CountVectorizer. A CountVectorizer takes text data as input and returns a vector of features as output. Next, I needed to calculate tf-idf (term frequency-inverse document frequency). A brief explanation of why I needed it and how it works:
For example, take the word “the” and count the number of its occurrences in a given text A. Suppose there are 100 occurrences, and the total number of words in the document is 1000.
Thus,
tf(“the”) = 100/1000 = 0.1
Next, take the word “sepal”, which occurs 50 times:
tf(“sepal”) = 50/1000 = 0.05
To calculate the inverse document frequency of these words, we take the logarithm of the ratio of the total number of texts to the number of texts containing at least one occurrence of the word. If there are 10,000 texts in all, and the word “the” appears in every one of them:
idf(“the”) = log(10000/10000) = 0
and
tf-idf(“the”) = idf(“the”) * tf(“the”) = 0 * 0.1 = 0
The word “sepal” is far rarer, found in only 5 of the texts. Therefore:
idf(“sepal”) = log(10000/5) ≈ 7.6, and tf-idf(“sepal”) = 7.6 * 0.05 = 0.38
Thus, the most frequent words carry less weight, while rarer, more specific ones carry more. If the word “sepal” occurs many times, we can assume the text is botanical. We cannot feed a classifier raw words, so we use the tf-idf measure instead.
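The arithmetic above can be reproduced in a few lines of plain Python (using the natural logarithm, which matches the numbers in the example):

```python
import math

def tf(occurrences, total_words):
    # Term frequency: share of the document occupied by the word.
    return occurrences / total_words

def idf(total_docs, docs_with_word):
    # Inverse document frequency: log of total docs over docs containing the word.
    return math.log(total_docs / docs_with_word)

# "the": 100 of 1000 words, appears in all 10,000 texts
print(tf(100, 1000) * idf(10_000, 10_000))   # 0.0

# "sepal": 50 of 1000 words, appears in only 5 texts
print(tf(50, 1000) * idf(10_000, 5))         # ~0.38
```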
After I had represented the data as a set of features, I needed to train the classifier. I was working with text data, which is sparse, so the best option was a linear classifier, which works well with large numbers of features.
First, I ran the CountVectorizer, TfidfTransformer, and SGDClassifier with their default parameters. By analyzing the plot of accuracy against sample size, where accuracy fluctuated from 0.6 to 0.85, I discovered that the classifier depended heavily on the particular sample used, and was therefore not very effective.
After receiving a list of the classifier weights, I noticed part of the problem: the classifier had been fed words like “of” and “he”, which we should have treated as noise. I could easily solve this by removing these words from the features, setting the CountVectorizer's stop_words parameter to stop_words = 'english' (or supplying your own custom list of stop words).
With the default stop words removed, I got an accuracy of 0.85. After that, I launched automatic parameter selection using GridSearchCV and achieved a final accuracy of 0.89. I may be able to improve this result with a larger training sample, but for now I have stuck with this classifier.
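Putting these pieces together, a minimal sketch of the training setup might look like the following; the toy corpus, labels, and parameter grid here are invented for illustration and are far smaller than the real 4,100-text sample:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Invented toy corpus: four "American" and four "British" snippets.
texts = [
    "the color of the parlor was gray",
    "dollars and labor in new york city",
    "the girl got home to boston finally",
    "maybe the doctor lives in washington",
    "the colour of the parlour was grey",
    "the honour of a lord in london",
    "a poor man learnt things by the sea",
    "dear sir shall we come round again",
]
labels = ["american"] * 4 + ["british"] * 4

pipeline = Pipeline([
    ("vect", CountVectorizer(stop_words="english")),  # drop noise words
    ("tfidf", TfidfTransformer()),
    ("clf", SGDClassifier(random_state=42)),
])

# Automatic parameter selection over a small grid.
params = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [1e-4, 1e-3],
}
search = GridSearchCV(pipeline, params, cv=2)
search.fit(texts, labels)
print(search.best_params_)
```

On the real data, the grid would of course be larger and cross-validation would use more folds.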
Now on to what interests me most: which words point to the origin of the text? Here’s a list of words, sorted in descending order of weight in the classifier:
American: dollars, new, york, girl, gray, american, carvel, color, city, ain, long, just, parlor, boston, honor, washington, home, labor, got, finally, maybe, hodder, forever, dorothy, dr
British: round, sir, lady, london, quite, mr, shall, lord, grey, dear, honour, having, philip, poor, pounds, scrooge, soames, things, sea, man, end, come, colour, illustration, english, learnt
While having fun with the classifier, I was able to single out the most “American” British authors and the most “British” American authors (a sneaky way to see how badly my classifier could perform).
The most “British” Americans:
- Frances Hodgson Burnett (born in England, moved to the USA at the age of 17, so I treat her as an American writer)
- Henry James (born in the USA, moved to England at the age of 33)
- Owen Wister (yes, the father of Western fiction)
- Mary Roberts Rinehart (called the American Agatha Christie for a reason)
- William McFee (another writer who moved to America at a young age)
The most “American” British:
- Rudyard Kipling (he lived in America for several years, and he wrote “American Notes”)
- Anthony Trollope (the author of “North America”)
- Frederick Marryat (a veteran of the Anglo-American War of 1812, whose “Narrative of the Travels and Adventures of Monsieur Violet in California, Sonara, and Western Texas” landed him in the American category)
- Arnold Bennett (the author of “Your United States: Impressions of a first visit”; one more gentleman who wrote travel notes)
- E. Phillips Oppenheim
And also the most “British” British and “American” American authors (because the classifier still works well):
Americans:
- Francis Hopkinson Smith
- Hamlin Garland
- George Ade
- Charles Dudley Warner
- Mark Twain
British:
- George Meredith
- Samuel Richardson
- John Galsworthy
- Gilbert Keith Chesterton
- Anthony Trollope (oh, hi)
I was inspired to do this work by this @TragicAllyHere tweet:
Well, wourds really matter, as I realised.
Translated from: https://www.freecodecamp.org/news/how-to-differentiate-between-british-and-american-literature-being-a-machine-learning-engineer-ac842662da1c/