Learning NLP from the Classics, from Beginner to Expert: 1. Word Tokenization

Contents

- 1 Word Tokenization
  - 1.1 Top-down/rule-based tokenization
  - 1.2 Byte-pair Encoding: A Bottom-up tokenization algorithm

1 Word Tokenization

Source: JM3, Chapter 2.5, pp. 19-23

Tokenization is the task of segmenting running text into words.

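There are two common approaches:

- top-down/rule-based tokenization: text is split according to predefined standards and rules;
- bottom-up tokenization: text is split using statistics over letter sequences.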

1.1 Top-down/rule-based tokenization

For example, a tokenizer first has to deal with clitic contractions. A clitic is defined as follows: “A clitic is a part of a word that can’t stand on its own, and can only occur when it is attached to another word.”

Under the Penn Treebank tokenization standard:
doesn’t is split into does + n’t, where n’t is the clitic;
hyphenated words are kept together, and all punctuation is separated out into its own tokens. A concrete example:
![[Pasted image 20240229223529.png]]

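A tokenizer can thus be used to expand clitic contractions. Depending on the application, it may also need to treat a multiword expression such as New York as a single token, which requires a dictionary of such expressions. This is why tokenization is closely tied to another NLP task, named entity recognition.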

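Because tokenization runs before any other text processing, it needs to be fast. It is therefore usually based on regular expressions, which are typically compiled into very efficient finite state automata.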
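As a rough illustration of the regular-expression approach, here is a minimal sketch using only Python's standard `re` module. The pattern, the name `TOKEN_RE`, and the helper `tokenize` are assumptions made for this example; it does not implement the Penn Treebank standard and ignores many real-world cases.

```python
import re

# Illustrative pattern (an assumption, not a standard): alternatives are
# tried left to right, so the more specific ones come first.
TOKEN_RE = re.compile(r"""
    (?:[A-Z]\.)+              # abbreviations such as U.S.A.
  | \$?\d+(?:\.\d+)?%?        # currency, numbers, percentages: $12.40, 82%
  | \w+(?:-\w+)*              # words, keeping internal hyphens together
  | \.\.\.                    # ellipsis
  | [^\w\s]                   # any other single punctuation character
""", re.VERBOSE)


def tokenize(text):
    """Return the list of tokens matched by TOKEN_RE, in order."""
    return TOKEN_RE.findall(text)


print(tokenize("That U.S.A. poster-print costs $12.40..."))
# ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
```

Hyphenated words stay together and punctuation is split off, which matches the behaviour described above. Clitics are not handled here.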

A well-designed top-down tokenizer follows a deterministic algorithm that can resolve various ambiguities, such as the different uses of the apostrophe:

- genitive marker: the book’s cover
- quotative: ‘The other class’, she said
- clitics: they’re, doesn’t
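The three apostrophe uses above can be told apart with a few ordered regular-expression alternatives. The following is a sketch under simplifying assumptions: it only recognises the straight ASCII apostrophe `'`, the clitic list is deliberately short, and the names `APOSTROPHE_RE` and `tokenize` are hypothetical.

```python
import re

# Illustrative only: splits n't and 's / 're style suffixes from their host
# word, and treats any other apostrophe as a quotation mark.
APOSTROPHE_RE = re.compile(r"""
    \w+(?=n't\b)               # stem before the clitic n't: does | n't
  | n't\b                      # the clitic n't itself
  | '(?:s|re|ve|ll|d|m)\b      # genitive 's and clitics such as 're
  | \w+(?:-\w+)*               # ordinary (possibly hyphenated) words
  | [^\w\s]                    # everything else, including quotation marks
""", re.VERBOSE)


def tokenize(text):
    """Return tokens with clitics and genitive markers split off."""
    return APOSTROPHE_RE.findall(text)


print(tokenize("the book's cover"))
# ['the', 'book', "'s", 'cover']
print(tokenize("'The other class', she said"))
# ["'", 'The', 'other', 'class', "'", ',', 'she', 'said']
print(tokenize("they're, doesn't"))
# ['they', "'re", ',', 'does', "n't"]
```

The order of the alternatives carries the disambiguation: an apostrophe is only treated as a quotation mark when none of the clitic or genitive patterns apply.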

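Different languages pose different difficulties for tokenization. English text is naturally delimited into words by spaces, whereas Chinese does not use spaces to separate words; instead, each character (hanzi) can be taken as a token. Because Chinese characters are themselves rich in meaning, studies suggest that for most Chinese NLP tasks, taking characters as input works better than taking words.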

In languages such as Japanese and Thai, however, a single character is too small a unit to carry meaning, so word segmentation algorithms are still required.
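Character-level tokenization for Chinese, as described above, needs almost no machinery. A minimal sketch follows; the helper name `char_tokenize` and the sample sentence are chosen purely for illustration.

```python
def char_tokenize(text):
    """Treat every non-whitespace character as its own token."""
    return [ch for ch in text if not ch.isspace()]


print(char_tokenize("姚明进入总决赛"))
# ['姚', '明', '进', '入', '总', '决', '赛']
```

This only covers the character-as-token case. Languages that need real word segmentation cannot be handled this simply.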

1.2 Byte-pair Encoding: A Bottom-up tokenization algorithm

An important requirement for tokenization is the ability to handle unknown words, that is, words that do not appear in the training corpus.

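We first introduce the notion of subwords: sets of tokens that include tokens smaller than words.

Subword tokenization is a way of handling vocabulary in NLP that breaks words into smaller units (subwords), so that rare words and morphological variants can be handled better.

Subwords can be arbitrary substrings, or they can be meaning-bearing units such as morphemes, for example -est and -er.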

A morpheme is the smallest meaning-bearing unit of a language; for example the word unlikeliest has the morphemes un-, likely, and -est.

In modern tokenization schemes most tokens are words, but some tokens are frequently occurring morphemes or other subwords such as -er.

With subwords, any unknown word can be represented as a sequence of known subword units. For example, lower can be composed of the two subwords low and -er or, if necessary, of the individual letters l, o, w, e, r.

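A tokenization scheme generally has two parts: a token learner and a token segmenter. The token learner takes a training corpus and induces a vocabulary, i.e. a set of tokens. The token segmenter takes raw text and segments it into a sequence of tokens according to that vocabulary.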

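Widely used algorithms include:

- byte-pair encoding (BPE)
- unigram language modeling

The SentencePiece library implements both of the above, although the name is often used to refer to the latter.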

The BPE token learner starts with a vocabulary that consists only of the individual characters. It then examines the words in the training corpus and finds the pair of adjacent symbols that occurs most frequently (a symbol may consist of several characters).

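The most frequent pair is merged into a new symbol, the new symbol is added to the vocabulary, and every occurrence of the pair in the corpus is replaced with it. This greedy step of merging the most frequent adjacent symbols is repeated until $k$ new symbols have been added to the vocabulary, where $k$ is a parameter of the BPE algorithm. The resulting vocabulary consists of the original individual characters plus the $k$ merged symbols.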

When several pairs have the same frequency, there is no prescribed order for merging them; the choice is arbitrary.
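The learner described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation. The toy corpus and its counts are assumptions modelled on the textbook example, the function names are hypothetical, and ties are broken by whichever pair was counted first (the algorithm itself leaves this arbitrary).

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs


def merge_pair(symbols, pair):
    """Replace each occurrence of `pair` in a symbol tuple with one symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(corpus, k):
    """Return (vocabulary, ordered merge list) after at most k merges."""
    # Split every word into characters plus the end-of-word symbol "_".
    words = {tuple(word) + ("_",): freq for word, freq in corpus.items()}
    vocab = sorted({s for symbols in words for s in symbols})
    merges = []
    for _ in range(k):
        pairs = count_pairs(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # on ties, the first pair seen wins
        merges.append(best)
        vocab.append(best[0] + best[1])
        words = {merge_pair(s, best): f for s, f in words.items()}
    return vocab, merges


# Toy corpus: word -> count (assumed values for illustration).
corpus = {"low": 5, "lowest": 2, "newer": 6, "wider": 3, "new": 2}
vocab, merges = learn_bpe(corpus, k=8)
print(merges)
# [('e', 'r'), ('er', '_'), ('n', 'e'), ('ne', 'w'),
#  ('l', 'o'), ('lo', 'w'), ('new', 'er_'), ('low', '_')]
```

With `k=8` the vocabulary grows from 11 single characters to 19 symbols. A different tie-breaking rule could produce a different but equally valid merge order.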

The core idea of BPE:
iteratively merge the most frequent pairs of adjacent symbols.

Some advantages:

- data-informed tokenization
- language independent: the vocabulary is derived from the corpus alone
  - works for different languages
  - no need to design rules for each language
- deals better with unknown words
  - worst case: an unknown word is split into individual characters

In the final vocabulary, most entries are full words and only a minority are subwords.
Even in the worst case, an unknown word is still segmented, just into individual characters.

For a worked example, see JM3, p. 21. A brief illustration:

![[Pasted image 20240301174028.png]]
Note the special end-of-word symbol _.
In this example, the first merges happen at the ends of words.

Given the final vocabulary:
n e w e r _ is segmented into a single token, newer_;
the unknown word l o w e r _ is segmented into two tokens, low and er_.
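The segmentation above can be reproduced with a small token segmenter that replays the learned merges in order. This is a sketch: the merge list is hard-coded on the assumption that it matches the textbook example, and the function name `segment` is hypothetical.

```python
# Merge list in learned order (assumed to match the worked example).
MERGES = [
    ("e", "r"), ("er", "_"), ("n", "e"), ("ne", "w"),
    ("l", "o"), ("lo", "w"), ("new", "er_"), ("low", "_"),
]


def segment(word, merges):
    """Split a word into characters, then apply each merge in learned order."""
    symbols = list(word) + ["_"]  # append the end-of-word symbol
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]  # merge the pair in place
            i += 1
    return symbols


print(segment("newer", MERGES))  # ['newer_']
print(segment("lower", MERGES))  # ['low', 'er_']
print(segment("tide", MERGES))   # ['t', 'i', 'd', 'e', '_']
```

The last call shows the worst case: no merge applies, so the word falls back to individual characters.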
