Demystifying Text Classification: An Introduction to Word Embeddings

Natural language processing (NLP) is an old science that started in the 1950s. The Georgetown-IBM experiment in 1954 was a big step towards fully automated text translation: more than 60 Russian sentences were translated into English using simple reordering and replacement rules.

The statistical revolution in NLP started in the late 1980s. Instead of hand-crafting a set of rules, a large corpus of text was analyzed to create rules using statistical approaches. Different metrics were calculated for the given input data, and predictions were made using decision trees or regression-based calculations.

Today, complex metrics have been replaced by more holistic approaches that produce better results and are easier to maintain.

This post is about word embeddings, which is the first part of my machine learning for coders series (with more to follow!).

What are word embeddings?

Traditionally, in natural language processing (NLP), words were replaced with unique IDs to do calculations. Let’s take the following example:

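The example in the original post was an image and is not preserved in this text version; a minimal sketch of the idea in Python, with illustrative words and IDs, looks like this:

# Each word in the vocabulary gets an arbitrary unique ID.
word_to_id = {'he': 0, 'is': 1, 'a': 2, 'king': 3}

# A sentence then becomes a list of IDs instead of strings.
sentence = ['he', 'is', 'a', 'king']
encoded = [word_to_id[word] for word in sentence]  # [0, 1, 2, 3]
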
This approach has the disadvantage that you will need to create a huge list of words and give each element a unique ID. Instead of using unique numbers for your calculations, you can also use vectors that represent their meaning, so-called word embeddings:

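The illustration here is likewise not reproduced; a hedged sketch of such embeddings follows, where only the vector (4, 2, 6) for the word “example” comes from the explanation below and the other values are invented:

embeddings = {
    'example': [4, 2, 6],  # the vector discussed in the next paragraph
    'king':    [1, 9, 8],  # invented values, for illustration only
    'queen':   [1, 8, 9],  # invented values, for illustration only
}
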
In this example, each word is represented by a vector. The length of the vectors can vary: the bigger a vector is, the more context information it can store. On the other hand, the calculation costs go up as the vector size increases.

The element count of a vector is also called the number of vector dimensions. In the example above, the word “example” is expressed as (4, 2, 6), where 4 is the value of the first dimension, 2 of the second, and 6 of the third.

In more complex examples, there might be more than 100 dimensions that can encode a lot of information. Things like:

  • gender,
  • race,
  • age,
  • type of word

will be stored.

A word such as “one”, for instance, denotes a quantity, just like “many”. Therefore, their two vectors are closer to each other than to the vectors of words that differ more in their usage.

Put simply: if two vectors are similar, then the words they represent are used in similar ways. For other NLP tasks, this has a lot of advantages, because calculations can be made on a single vector with only a few hundred parameters, in comparison to a huge dictionary with hundreds of thousands of IDs.

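Such closeness is commonly measured with cosine similarity. A minimal sketch, with invented example vectors:

import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: values near 1 mean similar direction.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Invented vectors: 'one' and 'many' both express a quantity.
one = np.array([0.9, 0.1, 0.4])
many = np.array([0.8, 0.2, 0.5])
king = np.array([0.1, 0.9, 0.2])

print(cosine_similarity(one, many))  # high: similar usage
print(cosine_similarity(one, king))  # lower: different usage
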
Additionally, unknown words that were never seen before are not a problem: you just need a good word embedding for the new word, and the calculations stay the same. The same applies to other languages. This is basically the magic of word embeddings that enables things like fast learning, multi-language processing, and much more.

Creation of word embeddings

It’s very popular to extend the concept of word embeddings to other domains. For example, a movie rental platform can create movie embeddings and do calculations upon vectors instead of movie IDs.

But how do you create such embeddings?

There are various techniques out there, but all of them follow the key idea that the meaning of a word is defined by its usage.

Let’s say we have a set of sentences:

text_for_training = [
    'he is a king',
    'she is a queen',
    'he is a man',
    'she is a woman',
    'she is a daughter',
    'he is a son'
]

The sentences contain 10 unique words, and we want to create a word embedding for each word.

{
    0: 'he', 1: 'a', 2: 'is', 3: 'daughter', 4: 'man',
    5: 'woman', 6: 'king', 7: 'she', 8: 'son', 9: 'queen'
}

There are various approaches for creating embeddings out of them. Let's pick one of the most widely used, called word2vec. The concept behind this technique is to use a very simple neural network to create vectors that represent the meanings of words.

Let's start with the target word “king”. It is used within the context of the masculine pronoun “he”. Context in this example simply means being part of the same sentence. The same applies to “queen” and “she”. It also makes sense to apply the same approach to more generic words: the word “he” can be the target word and “is” the context word.

If we do this for every combination, we can actually get simple word embeddings. More holistic approaches add more complexity and calculations, but they are all based on this approach.

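A minimal sketch of how such (target, context) pairs could be generated from the text_for_training sentences above (here the whole sentence acts as the context window, as described; the helper name is my own):

def make_pairs(sentences):
    # Yield a (target, context) pair for every word combination within a sentence.
    for sentence in sentences:
        words = sentence.split()
        for target in words:
            for context in words:
                if target != context:
                    yield target, context

pairs = list(make_pairs(text_for_training))
print(pairs[:3])  # [('he', 'is'), ('he', 'a'), ('he', 'king')]
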
To use a word as an input for a neural network, we need a vector. We can encode a word's unique ID as a vector by putting a 1 at the word's position in our dictionary and keeping every other index at 0. This is called a one-hot encoded vector:

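A minimal sketch, reusing the dictionary from above (inverted so that words map to IDs):

import numpy as np

word_to_id = {
    'he': 0, 'a': 1, 'is': 2, 'daughter': 3, 'man': 4,
    'woman': 5, 'king': 6, 'she': 7, 'son': 8, 'queen': 9
}

def one_hot(word, vocab_size=10):
    # A vector of zeros with a single 1 at the word's position.
    vector = np.zeros(vocab_size)
    vector[word_to_id[word]] = 1.0
    return vector

print(one_hot('king'))  # [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
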
Between the input and the output is a single hidden layer. This layer contains as many elements as the word embedding should have. The more elements word embeddings have, the more information they can store.

You might think: then just make it very big. But we have to consider that we need to store an embedding for every existing word, which quickly adds up to a considerable amount of data. Additionally, bigger embeddings mean a lot more calculations for neural networks that use them.

In our example, we will just use 5 as an embedding vector size.

The magic of neural networks lies in what's in between the layers, called weights. They store information between layers, where each node of the previous layer is connected with each node of the next layer.

Each connection between the layers is a so-called parameter. These parameters contain the important information of the neural network. Our 100 parameters (10 words × 5 embedding dimensions = 50 between the input and hidden layer, and another 50 between the hidden and output layer) are initialized with random values and adjusted by training the model.

In this example, all of them are initialized with 0.1 to keep it simple. Let’s think through an example training round, also called an epoch:

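The original post walks through this epoch with a diagram; a minimal numpy sketch of one forward pass under the same assumptions (10 words, embedding size 5, every weight initialized to 0.1), continuing the one-hot code above:

vocab_size, embedding_size = 10, 5

# 50 + 50 = 100 parameters, all set to 0.1 to keep it simple.
w_input_hidden = np.full((vocab_size, embedding_size), 0.1)
w_hidden_output = np.full((embedding_size, vocab_size), 0.1)

def softmax(x):
    # Turn raw scores into probabilities that sum to 1.
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Forward pass for the context word 'he'.
context = one_hot('he')
hidden = context @ w_input_hidden           # selects row 0: the embedding of 'he'
output = softmax(hidden @ w_hidden_output)  # one probability per target word

print(output)  # uniform 0.1 everywhere: the untrained network knows nothing yet
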
At the end of the neural network's calculation, we don't get the expected output that would tell us, for the given context “he”, that the target is “king”.

This difference between the result and the expected result is called the error of the network. By finding better parameter values, we can adjust the neural network so that future context inputs produce the expected target output.

The contents of our layer connections change as we search for better parameters that bring us closer to the expected output vector. The error is minimized once the network predicts correctly for the different target and context words. The weights between the input and hidden layer then contain all our word embeddings.

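A hedged sketch of such a training loop, continuing the code above (plain gradient descent on a cross-entropy loss; real word2vec implementations add further tricks such as negative sampling):

learning_rate = 0.5
target = one_hot('king')  # expected output for the context word 'he'

for epoch in range(50):
    hidden = context @ w_input_hidden
    output = softmax(hidden @ w_hidden_output)

    # Gradient of the cross-entropy loss with a softmax output layer.
    error = output - target
    grad_hidden = error @ w_hidden_output.T
    w_hidden_output -= learning_rate * np.outer(hidden, error)
    w_input_hidden -= learning_rate * np.outer(context, grad_hidden)

# Each row of w_input_hidden is now a word embedding; here is the one for 'he'.
# (With the symmetric 0.1 init all 5 dimensions stay equal; random init breaks that.)
print(w_input_hidden[word_to_id['he']])
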
You can find the complete example with executable code here. You can create a copy and play with it if you press “Open in playground.”

If you are not familiar with notebooks, it’s pretty simple: it can be read from top to bottom, and you can click and edit the Python code directly.

By pressing SHIFT+Enter, you can execute code snippets. Just make sure to start at the top: click into the first snippet and press SHIFT+Enter, wait a bit, press SHIFT+Enter again, and so on.

Conclusion

In a nutshell, word embeddings allow neural networks to work with words in a more flexible way. They can be built using neural networks that have a specific task, such as the prediction of a target word for a given context word. The weights between the layers are parameters that are adjusted over time. Et voilà, there are your word embeddings.

I hope you enjoyed the article. If you like it and feel the need for a round of applause, follow me on Twitter.

I am a co-founder of our revolutionary journey platform called Explore The World. We are a young startup located in Dresden, Germany and will target the German market first. Reach out to me if you have feedback and questions about any topic.

Happy AI exploring :)

References

  • Wikipedia on natural language processing: https://en.wikipedia.org/wiki/Natural_language_processing

  • Great paper about text classification, created by co-founders of fastai: https://arxiv.org/abs/1801.06146

  • Google's state-of-the-art approach for NLP tasks: https://arxiv.org/abs/1810.04805

Originally published at: https://www.freecodecamp.org/news/demystify-state-of-the-art-text-classification-word-embeddings/

