I. bytes_to_unicode

    def bytes_to_unicode():
        """
        Returns list of utf-8 byte and a corresponding list of unicode strings.
        The reversible bpe codes work on unicode strings.
        This means you need a large # of unicode characters in your vocab if you want to avoid UNKs.
        When you're at something like a 10B token dataset you end up needing around 5K for decent coverage.
        This is a significant percentage of your normal, say, 32K bpe vocab.
        To avoid that, we want lookup tables between utf-8 bytes and unicode strings.
        And avoids mapping to whitespace/control characters the bpe code barfs on.
        """
        bs = list(range(ord("!"), ord("~")+1))+list(range(ord("¡"), ord("¬")+1))+list(range(ord("®"), ord("ÿ")+1))
        cs = bs[:]
        n = 0
        for b in range(2**8):
            if b not in bs:
                bs.append(b)
                cs.append(2**8+n)
                n += 1
        cs = [chr(n) for n in cs]
        return dict(zip(bs, cs))
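Questions that may come up:
1. Unicode != ASCII; ASCII is a subset of Unicode.
Encoding | Range | Description |
---|---|---|
ASCII | 0 ~ 127 | An early character encoding; covers only English letters, digits, punctuation, and control characters |
Unicode | 0 ~ 1,114,111 (about 1.1 million code points) | A universal character set that contains all of ASCII and also covers Chinese, Japanese, emoji, and more |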
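To make the ASCII/Unicode relationship concrete, here is a small sketch that uses only Python built-ins. The sample characters are arbitrary choices for illustration.

```python
# ASCII characters are exactly the Unicode code points 0..127.
a_is_ascii = ord("A") == 65 and "A".isascii()

# A CJK character is still a single code point, but far outside ASCII.
zhong = "\u4e2d"                      # the character 中
code_point = ord(zhong)               # 20013
utf8_bytes = zhong.encode("utf-8")    # three bytes: e4 b8 ad

# chr() accepts the whole Unicode range and rejects anything beyond it.
last_char = chr(0x10FFFF)             # 0x10FFFF == 1,114,111
try:
    chr(0x110000)
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True

print(a_is_ascii, code_point, list(utf8_bytes), out_of_range_rejected)
```

However large the code point, its UTF-8 encoding is a sequence of values in 0..255, which is why a 256-entry byte table is enough to cover any input text.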
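2. `bs` holds the byte values (0 to 255, used as the dictionary keys) and `cs` holds the Unicode character assigned to each byte.
3. Why does `bs` append `b` while `cs` appends `2**8+n`?
`bs` starts out with only the printable byte values and skips the whitespace and control characters that the BPE code cannot handle. The loop then appends each skipped byte `b`, so `bs` ends up covering every value from 0 to 255 with no gaps. A skipped byte cannot be represented by its own character, so `cs` assigns it the next unused code point from 256 upward (`2**8+n`), which lies outside the byte range and is a visible character.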
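The mapping can be checked directly. The sketch below repeats the function (with the non-ASCII range bounds written as hex values) and inspects the result; `byte_encoder`, `byte_decoder`, `remapped`, and the sample string are names chosen here for illustration.

```python
def bytes_to_unicode():
    # Printable byte values that keep their own character.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(0xA1, 0xAC + 1))    # "¡" .. "¬"
          + list(range(0xAE, 0xFF + 1)))   # "®" .. "ÿ"
    cs = bs[:]
    n = 0
    for b in range(2**8):
        if b not in bs:          # whitespace/control byte that was skipped
            bs.append(b)
            cs.append(2**8 + n)  # remap to a code point from 256 upward
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))

byte_encoder = bytes_to_unicode()
byte_decoder = {ch: b for b, ch in byte_encoder.items()}  # inverse table

# Bytes that did not keep their own character.
remapped = {b: ch for b, ch in byte_encoder.items() if ord(ch) >= 256}

text = "hi there\n"
visible = "".join(byte_encoder[b] for b in text.encode("utf-8"))
restored = bytes(byte_decoder[ch] for ch in visible).decode("utf-8")

print(len(byte_encoder), len(remapped))   # 256 68
print(visible)                            # hiĠthereĊ
```

The space (byte 32) becomes `Ġ` and the newline (byte 10) becomes `Ċ`, so no whitespace survives in the mapped string, yet the inverse table recovers the original bytes exactly.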
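4. `cs = bs[:]`
Syntax | Explanation |
---|---|
cs = bs | Both variables point to the same list, so changing one changes the other (reference assignment) |
cs = bs[:] | Creates a new list with the same contents, so the two variables no longer affect each other (shallow copy) |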
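A minimal sketch of the difference, using a short made-up list in place of the real `bs`:

```python
bs = [33, 34, 35]

alias = bs       # same list object as bs
copy = bs[:]     # new list holding the same elements

bs.append(0)     # mutate the original

print(alias)                    # [33, 34, 35, 0]
print(copy)                     # [33, 34, 35]
print(alias is bs, copy is bs)  # True False
```

In `bytes_to_unicode` the copy matters because the loop appends `b` to `bs` and a different value, `2**8+n`, to `cs`; with a plain assignment both values would land in the same list.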
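5. Some functions
Function | Explanation |
---|---|
ord() | Character → Unicode integer, e.g. ord('A') = 65 |
chr() | Unicode integer → character, e.g. chr(65) = 'A' |
dict(zip(bs, cs)) | Pairs up the two lists to build a byte → character mapping dictionary |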
II. get_pairs

    def get_pairs(word):
        """
        Return set of symbol pairs in a word.
        Word is represented as tuple of symbols (symbols being variable-length strings).
        """
        pairs = set()
        prev_char = word[0]
        for char in word[1:]:
            pairs.add((prev_char, char))
            prev_char = char
        return pairs
This one is straightforward: given a word made up of several symbols, it returns the set of all adjacent symbol pairs. For example:

    word = ('ab', 'cd', 'e')
    get_pairs(word)
    → {('ab', 'cd'), ('cd', 'e')}
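Two edge cases are worth checking: a word with a single symbol, and a word in which the same pair occurs more than once. The sketch below repeats the function so it runs on its own; the sample words are made up for illustration.

```python
def get_pairs(word):
    """Return the set of adjacent symbol pairs in a tuple of symbols."""
    pairs = set()
    prev_char = word[0]
    for char in word[1:]:
        pairs.add((prev_char, char))
        prev_char = char
    return pairs

basic = get_pairs(("ab", "cd", "e"))
single = get_pairs(("a</w>",))                 # one symbol: nothing to pair
repeated = get_pairs(("a", "b", "a", "b", "c"))  # ('a', 'b') occurs twice

print(basic)
print(single)      # set()
print(repeated)    # three distinct pairs, because a set drops duplicates
```

Because the result is a set, it records which pairs are present, not how often or where they occur. For a non-empty word the same result can be written as `set(zip(word, word[1:]))`.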
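III. basic_clean + whitespace_clean
Function or method | Purpose | Example input | Example output | Notes |
---|---|---|---|---|
`ftfy.fix_text(text)` | Fixes Unicode encoding errors and replaces mojibake | `"â€œhelloâ€\x9d"` | `"“hello”"` | The usual tool for repairing garbled text from web pages or database exports |
`html.unescape(text)` | Converts HTML entities back into normal characters | `"&lt;div&gt;"` | `"<div>"` | Commonly used when parsing web page text |
`text.strip()` | Removes leading and trailing whitespace (including `\n`, `\t`, and spaces) | `" hello world \n"` | `"hello world"` | Does not touch whitespace in the middle |
`re.sub(r'\s+', ' ', text)` | Replaces every run of whitespace with a single space | `"This\nis\t\ta test"` | `"This is a test"` | All whitespace between words becomes one ordinary space |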
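The calls in the table can be chained as in the sketch below. It is a standard-library-only approximation: the `ftfy.fix_text` step is left out so the example runs without third-party packages, and the way the calls are grouped into the two functions is an assumption made here for illustration.

```python
import html
import re

def basic_clean(text):
    # Assumption: ftfy.fix_text(text) would be applied first; omitted here
    # so that the sketch depends only on the standard library.
    text = html.unescape(text)   # "&lt;div&gt;" -> "<div>"
    return text.strip()          # drop leading/trailing whitespace

def whitespace_clean(text):
    text = re.sub(r"\s+", " ", text)  # collapse runs of whitespace
    return text.strip()

raw = "  &lt;div&gt;  This\nis\t\ta test \n"
cleaned = whitespace_clean(basic_clean(raw))
print(cleaned)   # <div> This is a test
```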
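IV. bpe

    word = tuple(token[:-1]) + (token[-1] + '</w>',)

Example: token = 'low'
→ ('l', 'o', 'w</w>')
Why not `word = tuple(token) + ('</w>',)`? That would split the last character and '</w>' into two separate symbols, so the final character would no longer carry the marker that shows where the word ends.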
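The two ways of building the initial word can be compared side by side. `to_symbols` and `to_symbols_detached` are hypothetical helper names introduced for this sketch only.

```python
def to_symbols(token):
    """Split a token into characters; the last one carries '</w>'."""
    return tuple(token[:-1]) + (token[-1] + "</w>",)

def to_symbols_detached(token):
    """The rejected alternative: '</w>' becomes a symbol of its own."""
    return tuple(token) + ("</w>",)

attached = to_symbols("low")            # ('l', 'o', 'w</w>')
detached = to_symbols_detached("low")   # ('l', 'o', 'w', '</w>')
one_char = to_symbols("a")              # ('a</w>',)

print(attached, detached, one_char)
```

With the attached form, a word-final `'w</w>'` and a word-internal `'w'` are different symbols, so merges can treat them differently. Note that `token[-1]` raises `IndexError` for an empty string, so this construction expects a non-empty token.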
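    pairs = get_pairs(word)

Example: ('l', 'o', 'w</w>')
→ {('l','o'), ('o','w</w>')}

    if not pairs:
        return token+'</w>'

Example: a single-character token has no adjacent pairs, so it is returned immediately.
Input token | Resulting word | pairs | Output |
---|---|---|---|
"a" | ('a</w>',) | set() (empty set) | "a</w>" |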
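The table below walks through the full BPE process for "lowered", assuming that ('l','o'), ('lo','w'), and ('e','r') are the only applicable entries in bpe_ranks and that they are ranked in that order:
Round | Current word | Merged bigram | Result after merge |
---|---|---|---|
1 | ('l','o','w','e','r','e','d</w>') | ('l','o') | ('lo','w','e','r','e','d</w>') |
2 | ('lo','w','e','r','e','d</w>') | ('lo','w') | ('low','e','r','e','d</w>') |
3 | ('low','e','r','e','d</w>') | ('e','r') | ('low','er','e','d</w>') |
4 | ('low','er','e','d</w>') | No pair found in bpe_ranks | Loop ends; the result is 'low er e d</w>' |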
The complete method:

    def bpe(self, token):
        # Reuse the result if this token was already processed.
        if token in self.cache:
            return self.cache[token]
        word = tuple(token[:-1]) + (token[-1] + '</w>',)
        pairs = get_pairs(word)

        if not pairs:
            return token+'</w>'

        while True:
            # Pick the pair with the lowest rank (highest merge priority).
            bigram = min(pairs, key = lambda pair: self.bpe_ranks.get(pair, float('inf')))
            if bigram not in self.bpe_ranks:
                break
            first, second = bigram
            new_word = []
            i = 0
            while i < len(word):
                try:
                    # Jump to the next occurrence of `first`, copying what lies before it.
                    j = word.index(first, i)
                    new_word.extend(word[i:j])
                    i = j
                except ValueError:
                    # `first` does not occur again: copy the rest and stop.
                    new_word.extend(word[i:])
                    break

                if word[i] == first and i < len(word)-1 and word[i+1] == second:
                    # Merge the pair into one symbol.
                    new_word.append(first+second)
                    i += 2
                else:
                    new_word.append(word[i])
                    i += 1
            new_word = tuple(new_word)
            word = new_word
            if len(word) == 1:
                break
            else:
                pairs = get_pairs(word)
        word = ' '.join(word)
        self.cache[token] = word
        return word
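To reproduce the "lowered" walkthrough, the sketch below rewrites the method as a standalone function. It is a simplified variant, not the original code: the cache is dropped, `bpe_ranks` is passed as an argument, and the merge pass scans symbol by symbol instead of jumping with `word.index`. `toy_ranks` is a made-up three-entry merge table chosen to match the walkthrough.

```python
def get_pairs(word):
    """Set of adjacent symbol pairs in a tuple of symbols."""
    return set(zip(word, word[1:]))

def bpe(token, bpe_ranks):
    word = tuple(token[:-1]) + (token[-1] + "</w>",)
    pairs = get_pairs(word)
    if not pairs:
        return token + "</w>"
    while True:
        # Lowest rank wins; unknown pairs count as infinitely low priority.
        bigram = min(pairs, key=lambda pair: bpe_ranks.get(pair, float("inf")))
        if bigram not in bpe_ranks:
            break
        first, second = bigram
        new_word = []
        i = 0
        while i < len(word):
            # Merge every non-overlapping occurrence of (first, second).
            if i < len(word) - 1 and word[i] == first and word[i + 1] == second:
                new_word.append(first + second)
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        word = tuple(new_word)
        if len(word) == 1:
            break
        pairs = get_pairs(word)
    return " ".join(word)

# Hypothetical merge table: a lower number means the merge is applied earlier.
toy_ranks = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2}

result = bpe("lowered", toy_ranks)
print(result)   # low er e d</w>
```

With the same table, `bpe("lo", toy_ranks)` returns `'l o</w>'`: the pair in that word is `('l', 'o</w>')`, which is not the ranked pair `('l', 'o')`, so the end-of-word marker prevents the merge.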