A Detailed Look at CLIP's Tokenizer

I. bytes_to_unicode

def bytes_to_unicode():
    """
    Returns list of utf-8 byte and a corresponding list of unicode strings.
    The reversible bpe codes work on unicode strings.
    This means you need a large # of unicode characters in your vocab if you want to avoid UNKs.
    When you're at something like a 10B token dataset you end up needing around 5K for decent coverage.
    This is a significant percentage of your normal, say, 32K bpe vocab.
    To avoid that, we want lookup tables between utf-8 bytes and unicode strings.
    And avoids mapping to whitespace/control characters the bpe code barfs on.
    """
    bs = list(range(ord("!"), ord("~")+1))+list(range(ord("¡"), ord("¬")+1))+list(range(ord("®"), ord("ÿ")+1))
    cs = bs[:]
    n = 0
    for b in range(2**8):
        if b not in bs:
            bs.append(b)
            cs.append(2**8+n)
            n += 1
    cs = [chr(n) for n in cs]
    return dict(zip(bs, cs))

Possible questions:

1. Unicode != ASCII; ASCII is a subset of Unicode

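Encoding	Range	Description
ASCII	0 ~ 127	The earliest character encoding; covers only English letters, digits, punctuation, and control characters
Unicode	0 ~ 1,114,111 (about 1.1 million code points)	A universal character set that contains all of ASCII and also covers Chinese, Japanese, emoji, and more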

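2. bs holds the byte values (0 to 255) used as dictionary keys; cs holds the code points of the characters they map to

3. Why does bs get b appended while cs gets 2**8+n?

bs starts out with only the bytes that display as printable, non-whitespace characters, so control characters and whitespace are skipped. The mapping still has to cover every byte from 0 to 255 with no gaps, so each skipped byte b is appended to bs and paired in cs with a safe stand-in character at code point 2**8+n, which lies outside the single-byte range 0 to 255.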
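The effect of this remapping can be checked with a short, self-contained sketch. It repeats the function body so that it runs on its own; the names `byte_encoder`, `byte_decoder`, and `remapped` are chosen here for illustration, and the hexadecimal range bounds are simply the code points of "¡", "¬", "®", and "ÿ".

```python
def bytes_to_unicode():
    # Bytes that already display as printable, non-whitespace characters.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))  # "¡" .. "¬"
        + list(range(0xAE, 0xFF + 1))  # "®" .. "ÿ"
    )
    cs = bs[:]
    n = 0
    for b in range(2**8):
        if b not in bs:
            bs.append(b)          # the skipped byte itself
            cs.append(2**8 + n)   # its stand-in code point, 256 and up
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))


byte_encoder = bytes_to_unicode()
byte_decoder = {ch: b for b, ch in byte_encoder.items()}

# Bytes whose character differs from chr(byte): the ones that were remapped.
remapped = [b for b, ch in byte_encoder.items() if ord(ch) != b]

print(len(byte_encoder))                  # 256
print(len(remapped))                      # 68
print(byte_encoder[ord("A")])             # A (printable bytes map to themselves)
print(hex(ord(byte_encoder[ord(" ")])))   # 0x120 (space is remapped)

# Round trip: text -> UTF-8 bytes -> stand-in characters -> bytes -> text
encoded = "".join(byte_encoder[b] for b in "a b".encode("utf-8"))
decoded = bytes(byte_decoder[ch] for ch in encoded).decode("utf-8")
print(decoded)                            # a b
```

Because the mapping is one-to-one over all 256 byte values, any UTF-8 text can be turned into stand-in characters and back without loss.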

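4. cs = bs[:]

Syntax	Explanation
`cs = bs`	Both variables point to the same list, so modifying one affects the other (reference assignment)
`cs = bs[:]`	Creates a new list with the same contents; the two variables no longer affect each other (shallow copy)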
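A minimal sketch of the difference, using made-up list contents; the variable names `alias` and `shallow` are illustrative only.

```python
bs = [33, 34, 35]

alias = bs        # same list object as bs
shallow = bs[:]   # new list with the same elements

bs.append(0)

print(alias)      # [33, 34, 35, 0]
print(shallow)    # [33, 34, 35]
print(alias is bs, shallow is bs)  # True False
```

This matters in bytes_to_unicode because the loop appends b to bs and a different value, 2**8+n, to cs; with a plain `cs = bs` both values would land in the same list.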

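5. Some functions

Function	Explanation
`ord()`	Character → Unicode integer, e.g. `ord('A') == 65`
`chr()`	Unicode integer → character, e.g. `chr(65) == 'A'`
`dict(zip(bs, cs))`	Pairs up the two lists to build a `byte → character` mapping dictionary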
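These three built-ins can be tried in isolation. The two-element lists below are toy values picked for illustration, not the real bs and cs.

```python
code_point = ord("A")   # character -> integer
char = chr(65)          # integer -> character

toy_bs = [65, 32]                     # byte values (keys)
toy_cs = [chr(65), chr(2**8 + 32)]    # their characters (values)
mapping = dict(zip(toy_bs, toy_cs))   # pairs items by position

print(code_point)           # 65
print(char)                 # A
print(mapping[65])          # A
print(ord(mapping[32]))     # 288
```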

II. get_pairs

def get_pairs(word):
    """
    Return set of symbol pairs in a word.
    Word is represented as tuple of symbols (symbols being variable-length strings).
    """
    pairs = set()
    prev_char = word[0]
    for char in word[1:]:
        pairs.add((prev_char, char))
        prev_char = char
    return pairs

This one is straightforward: given a word made up of several symbols, it returns the set of all adjacent symbol pairs. For example:

word = ('ab', 'cd', 'e')
get_pairs(word)
→ {('ab', 'cd'), ('cd', 'e')}

III. basic_clean + whitespace_clean

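Function or method	Purpose	Example input	Example output	Notes
`ftfy.fix_text(text)`	Repairs Unicode encoding errors and replaces mojibake	`"schÃ¶n"`	`"schön"`	Most commonly used to repair mojibake from web pages or database exports
`html.unescape(text)`	Converts HTML entities back into normal characters	`"&lt;div&gt;"`	`"<div>"`	Common when parsing web page text
`text.strip()`	Removes leading and trailing whitespace (including `\n`, `\t`, and spaces)	`" hello world \n"`	`"hello world"`	Does not touch whitespace in the middle
`re.sub(r'\s+', ' ', text)`	Replaces each run of whitespace with a single space	`"This\nis\t\ta test"`	`"This is a test"`	Every whitespace run in the middle becomes one ordinary space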
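The steps in the table can be chained as in the sketch below. It is a standard-library-only approximation: the `ftfy.fix_text` step is left out so the example runs without third-party packages, and the function names `basic_clean_sketch` and `whitespace_clean_sketch` are hypothetical stand-ins rather than the tokenizer's own functions.

```python
import html
import re


def basic_clean_sketch(text):
    # ftfy.fix_text(text) would normally run first; omitted in this sketch.
    text = html.unescape(text)   # "&lt;div&gt;" -> "<div>"
    return text.strip()          # drop leading/trailing whitespace


def whitespace_clean_sketch(text):
    text = re.sub(r"\s+", " ", text)  # collapse each whitespace run
    return text.strip()


raw = "  &lt;div&gt;This\nis\t\ta test&lt;/div&gt; \n"
cleaned = whitespace_clean_sketch(basic_clean_sketch(raw))
print(cleaned)  # <div>This is a test</div>
```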

IV. bpe

word = tuple(token[:-1]) + (token[-1] + '</w>',)

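Example: token = 'low' → ('l', 'o', 'w</w>')

Why not word = tuple(token) + ('</w>',)? That would split the last character from '</w>' and leave '</w>' as a separate symbol, so the last character would no longer mark where the word ends.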
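The two constructions can be compared directly. In this sketch `naive` is an illustrative name for the rejected alternative.

```python
token = "low"

# Construction used in bpe: the end-of-word marker is fused to the last character.
word = tuple(token[:-1]) + (token[-1] + "</w>",)

# Rejected alternative: the marker becomes a separate symbol.
naive = tuple(token) + ("</w>",)

print(word)   # ('l', 'o', 'w</w>')
print(naive)  # ('l', 'o', 'w', '</w>')
```

With the fused form, a word-final 'w</w>' and a word-internal 'w' are different symbols, so merges can treat them differently.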

pairs = get_pairs(word)

Example: ('l', 'o', 'w</w>') → {('l','o'), ('o','w</w>')}

if not pairs:
    return token+'</w>'

Example:

Input `token`	`word` after processing	`pairs`	Output
`"a"`	`('a</w>',)`	`set()` (empty set)	`"a</w>"`
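The single-character case can be confirmed with a few lines. get_pairs is repeated here so the sketch runs on its own.

```python
def get_pairs(word):
    # Collect every pair of adjacent symbols.
    pairs = set()
    prev_char = word[0]
    for char in word[1:]:
        pairs.add((prev_char, char))
        prev_char = char
    return pairs


token = "a"
word = tuple(token[:-1]) + (token[-1] + "</w>",)
pairs = get_pairs(word)

print(word)    # ('a</w>',)
print(pairs)   # set()
print(token + "</w>" if not pairs else None)  # a</w>
```

A one-symbol word has no neighbours to pair, so there is nothing to merge and the early return is taken.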

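The following uses "lowered" to walk through the full bpe process, assuming the merge table ranks ('l','o'), ('lo','w'), and ('e','r') in that order and contains no other pair that applies:

Round	Current `word`	Merged bigram	Result after merging
1	`('l','o','w','e','r','e','d</w>')`	`('l','o')`	`('lo','w','e','r','e','d</w>')`
2	`('lo','w','e','r','e','d</w>')`	`('lo','w')`	`('low','e','r','e','d</w>')`
3	`('low','e','r','e','d</w>')`	`('e','r')`	`('low','er','e','d</w>')`
4	`('low','er','e','d</w>')`	No match	Loop ends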

    def bpe(self, token):
        if token in self.cache:
            return self.cache[token]
        word = tuple(token[:-1]) + (token[-1] + '</w>',)
        pairs = get_pairs(word)

        if not pairs:
            return token+'</w>'

        while True:
            # Pick the pair with the lowest rank, i.e. the earliest learned merge.
            bigram = min(pairs, key = lambda pair: self.bpe_ranks.get(pair, float('inf')))
            if bigram not in self.bpe_ranks:
                break
            first, second = bigram
            new_word = []
            i = 0
            while i < len(word):
                try:
                    # Copy everything up to the next occurrence of `first`.
                    j = word.index(first, i)
                    new_word.extend(word[i:j])
                    i = j
                except ValueError:
                    # `first` does not occur again: copy the rest and stop.
                    new_word.extend(word[i:])
                    break

                if word[i] == first and i < len(word)-1 and word[i+1] == second:
                    new_word.append(first+second)
                    i += 2
                else:
                    new_word.append(word[i])
                    i += 1
            new_word = tuple(new_word)
            word = new_word
            if len(word) == 1:
                break
            else:
                pairs = get_pairs(word)
        word = ' '.join(word)
        self.cache[token] = word
        return word
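The "lowered" walk-through can be reproduced with a standalone sketch. Everything here is an assumption made for illustration: `bpe_sketch` is a hypothetical free function without the cache, its inner loop is a simplified left-to-right scan instead of the `index`-based one, and `toy_ranks` is a three-entry merge table invented to match the example, not the tokenizer's real merges.

```python
def get_pairs(word):
    # Collect every pair of adjacent symbols.
    pairs = set()
    prev_char = word[0]
    for char in word[1:]:
        pairs.add((prev_char, char))
        prev_char = char
    return pairs


def bpe_sketch(token, bpe_ranks):
    word = tuple(token[:-1]) + (token[-1] + "</w>",)
    pairs = get_pairs(word)
    if not pairs:
        return token + "</w>"
    while True:
        # Lowest rank wins; unknown pairs rank as infinity.
        bigram = min(pairs, key=lambda p: bpe_ranks.get(p, float("inf")))
        if bigram not in bpe_ranks:
            break
        first, second = bigram
        new_word = []
        i = 0
        while i < len(word):
            if i < len(word) - 1 and word[i] == first and word[i + 1] == second:
                new_word.append(first + second)  # merge this occurrence
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        word = tuple(new_word)
        if len(word) == 1:
            break
        pairs = get_pairs(word)
    return " ".join(word)


# Toy merge table (assumption): lower number = higher priority.
toy_ranks = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2}

result = bpe_sketch("lowered", toy_ranks)
print(result)                        # low er e d</w>
print(bpe_sketch("a", toy_ranks))    # a</w>
```

The output is a space-separated string of symbols, which the caller can split to look up each symbol's id.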
