Word embeddings
Natural language processing (NLP) is an old science that started in the 1950s. The Georgetown-IBM experiment in 1954 was a big step towards fully automated text translation: more than 60 Russian sentences were translated into English using simple reordering and replacement rules.
The statistical revolution in NLP started in the late 1980s. Instead of hand-crafting a set of rules, a large corpus of text was analyzed to create rules using statistical approaches. Different metrics were calculated for given input data, and predictions were made using decision trees or regression-based calculations.
Today, complex metrics are replaced by more holistic approaches that create better results and that are easier to maintain.
This post is about word embeddings, which is the first part of my machine learning for coders series (with more to follow!).
What are word embeddings?
Traditionally, in natural language processing (NLP), words were replaced with unique IDs to do calculations. Let’s take the following example:
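The original example is not reproduced here, so here is a minimal sketch of what such an ID mapping might look like (the words and IDs are chosen purely for illustration):

# Hypothetical example: every word gets a unique integer ID.
word_to_id_example = {'he': 0, 'is': 1, 'a': 2, 'king': 3}

sentence = 'he is a king'
ids = [word_to_id_example[word] for word in sentence.split()]
print(ids)  # [0, 1, 2, 3]

With this representation, two IDs being close to each other says nothing about the words being related.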
This approach has the disadvantage that you need to create a huge list of words and give each element a unique ID. Instead of using unique numbers for your calculations, you can also use vectors that represent their meaning, so-called word embeddings:
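Again, the original illustration is not reproduced here; a minimal sketch of such an embedding table might look like the following. The values are made up for demonstration, apart from the word "example", which the text below describes as (4 2 6):

# Illustrative embedding table (hypothetical values, except "example").
word_embeddings = {
    'example': [4, 2, 6],
    'word':    [5, 1, 7],   # hypothetical values
    'king':    [9, 3, 8],   # hypothetical values
}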
In this example, each word is represented by a vector. The length of a vector can be different. The bigger the vector is, the more context information it can store. Additionally, the calculation costs go up as vector size increases.
The element count of a vector is also called the number of vector dimensions. In the example above, the word "example" is expressed with (4 2 6), where 4 is the value of the first dimension, 2 of the second, and 6 of the third.
In more complex examples, there might be more than 100 dimensions that can encode a lot of information. Things like:
- gender,
- race,
- age,
- type of word

will be stored.
A word such as "one" expresses a quantity, just like "many" does. Therefore, the two vectors are closer to each other than to vectors of words that are used very differently.
Simplified: if vectors are similar, then the words are used in similar ways. For other NLP tasks this has a lot of advantages, because calculations can be made based on a single vector with only a few hundred parameters, in comparison to a huge dictionary with hundreds of thousands of IDs.
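A common way to measure how similar two embeddings are is cosine similarity. This is not taken from the linked notebook, just a small sketch with made-up vectors:

import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: close to 1.0 means the vectors point in a similar direction.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings: "one" and "many" both express quantity,
# so their vectors are similar; "run" is used differently.
one  = [0.9, 0.1, 0.2]
many = [0.8, 0.2, 0.3]
run  = [0.1, 0.9, 0.7]

print(cosine_similarity(one, many))  # close to 1.0
print(cosine_similarity(one, run))   # noticeably smaller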
Additionally, unknown words that were never seen before are no problem. You just need a good word embedding for the new word, and the calculations stay the same. The same applies to other languages. This is basically the magic of word embeddings, which enables things like fast learning, multi-language processing, and much more.
Creation of word embeddings
It’s very popular to extend the concept of word embeddings to other domains. For example, a movie rental platform can create movie embeddings and do calculations upon vectors instead of movie IDs.
But how do you create such embeddings?
There are various techniques out there, but all of them follow the key idea that the meaning of a word is defined by its usage.
Let’s say we have a set of sentences:
text_for_training = [
    'he is a king',
    'she is a queen',
    'he is a man',
    'she is a woman',
    'she is a daughter',
    'he is a son',
]
The sentences contain 10 unique words, and we want to create a word embedding for each word.
{
    0: 'he',
    1: 'a',
    2: 'is',
    3: 'daughter',
    4: 'man',
    5: 'woman',
    6: 'king',
    7: 'she',
    8: 'son',
    9: 'queen',
}
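One way to build such a word index from the training sentences is sketched below. The numbering here is alphabetical, so it can differ from the dictionary shown above:

# Collect the unique words and assign each one an ID.
words = {word for sentence in text_for_training for word in sentence.split()}
id_to_word = dict(enumerate(sorted(words)))
word_to_id = {word: idx for idx, word in id_to_word.items()}

print(len(word_to_id))  # 10 unique words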
There are various approaches for how to create embeddings out of them. Let's pick one of the most widely used approaches, called word2vec. The concept behind this technique uses a very simple neural network to create vectors that represent the meanings of words.
Let's start with the target word "king". It is used within the context of the masculine pronoun "he". Context in this example simply means being part of the same sentence. The same applies to "queen" and "she". It also makes sense to take the same approach for more generic words: the word "he" can be the target word and "is" the context word.
If we do this for every combination, we can actually get simple word embeddings. More holistic approaches add more complexity and calculations, but they are all based on this approach.
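Continuing with the text_for_training list from above, a simple way to generate such (context, target) pairs is to combine every pair of distinct words that share a sentence. This is a simplification; word2vec implementations normally use a sliding window of fixed size:

# Generate (context, target) training pairs from words that share a sentence.
training_pairs = []
for sentence in text_for_training:
    tokens = sentence.split()
    for target in tokens:
        for context in tokens:
            if context != target:
                training_pairs.append((context, target))

print(training_pairs[:3])  # [('is', 'he'), ('a', 'he'), ('king', 'he')]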
To use a word as an input for a neural network, we need a vector. We can encode a word's unique ID as a vector by putting a 1 at the word's position in our dictionary and keeping every other index at 0. This is called a one-hot encoded vector:
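A small sketch of one-hot encoding, reusing the word_to_id dictionary built above:

import numpy as np

def one_hot(word, word_to_id):
    # A vector of zeros with a single 1 at the word's index.
    vector = np.zeros(len(word_to_id))
    vector[word_to_id[word]] = 1.0
    return vector

print(one_hot('he', word_to_id))  # ten elements, a single 1 at the index of 'he'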
Between the input and the output is a single hidden layer. This layer contains as many elements as the word embedding should have. The more elements word embeddings have, the more information they can store.
You might think: then just make it very big. But we have to consider that an embedding has to be stored for every existing word, which quickly adds up to a considerable amount of data. Additionally, bigger embeddings mean a lot more calculations for neural networks that use them.
In our example, we will just use 5 as an embedding vector size.
The magic of neural networks lies in what's in between the layers, called weights. They store information between layers, where each node of the previous layer is connected with each node of the next layer.
Each connection between the layers is a so-called parameter. These parameters contain the important information of the neural network. 100 parameters in total, 50 between the input and hidden layer and 50 between the hidden and output layer, are initialized with random values and adjusted by training the model.
In this example, all of them are initialized with 0.1 to keep it simple. Let’s think through an example training round, also called an epoch:
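This is not the exact code from the linked notebook; the following is a minimal numpy sketch of that single forward pass, reusing word_to_id and one_hot from above, with a vocabulary of 10 words, an embedding size of 5, and every weight set to 0.1:

import numpy as np

vocab_size, embedding_size = 10, 5

# 50 weights between the input and hidden layer, 50 between the hidden and
# output layer, all initialized with 0.1 to keep the example simple.
W1 = np.full((vocab_size, embedding_size), 0.1)
W2 = np.full((embedding_size, vocab_size), 0.1)

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Forward pass for the context word 'he': one-hot input -> hidden -> output.
x = one_hot('he', word_to_id)
hidden = x @ W1                # picks the row of W1 that belongs to 'he'
output = softmax(hidden @ W2)  # predicted probabilities over all 10 words

print(output)  # 0.1 for every word, so 'king' is not predicted yet

Because all weights start at the same value, the network assigns the same probability to every word, which is exactly the situation described next.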
At the end of the neural network's calculation, we don't get the expected output that tells us, for the given context "he", that the target is "king".
This difference between the result and the expected result is called the error of a network. By finding better parameter values, we can adjust the neural network to predict for future context inputs that deliver the expected target output.
The contents of our layer connections will change after we try to find better parameters that get us closer to our expected output vector. The error is minimized as soon as the network predicts correctly for different target and context words. The weights between the input and hidden layer will contain all our word embeddings.
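Again as a rough sketch rather than the notebook's actual code, one gradient-descent update for the pair (context "he", target "king") could look like this, continuing with W1, W2, softmax, and one_hot from above:

learning_rate = 0.05  # an arbitrary small learning rate for illustration

x = one_hot('he', word_to_id)    # context word as input
y = one_hot('king', word_to_id)  # expected target word

hidden = x @ W1
output = softmax(hidden @ W2)

# Backpropagation for a softmax output with cross-entropy loss.
error = output - y
grad_W2 = np.outer(hidden, error)
grad_W1 = np.outer(x, W2 @ error)

W1 -= learning_rate * grad_W1
W2 -= learning_rate * grad_W2

# After many such updates over all (context, target) pairs, each row of W1
# is the embedding of the corresponding word:
print(W1[word_to_id['he']])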
You can find the complete example with executable code here. You can create a copy and play with it if you press “Open in playground.”
If you are not familiar with notebooks, it’s pretty simple: it can be read from top to bottom, and you can click and edit the Python code directly.
By pressing "SHIFT+Enter," you can execute code snippets. Just make sure to start at the top: click in the first snippet and press SHIFT+Enter, wait a bit, press SHIFT+Enter again, and so on.
Conclusion
In a nutshell, word embeddings are used to make neural networks more flexible. They can be built using neural networks that have a certain task, such as predicting a target word for a given context word. The weights between the layers are parameters that are adjusted over time. Et voilà, there are your word embeddings.
I hope you enjoyed the article. If you like it and feel the need for a round of applause, follow me on Twitter.
I am a co-founder of our revolutionary journey platform called Explore The World. We are a young startup located in Dresden, Germany and will target the German market first. Reach out to me if you have feedback and questions about any topic.
Happy AI exploring :)
References
Wikipedia natural language processing
https://en.wikipedia.org/wiki/Natural_language_processing

Great paper about text classification created by co-founders of fastai
https://arxiv.org/abs/1801.06146

Google's state-of-the-art approach for NLP tasks
https://arxiv.org/abs/1810.04805