🌟 Example 1: Load a ready-made BERT tokenizer
from tokenizers import Tokenizer

# Load a pretrained BERT tokenizer (the file must be downloaded in advance, e.g. for bert-base-uncased)
tokenizer = Tokenizer.from_file("bert-base-uncased-tokenizer.json")

# Encode a piece of text
output = tokenizer.encode("Hello, I love studying AI with BERT!")
print("Tokens:", output.tokens)  # the resulting tokens
print("IDs:", output.ids)        # the corresponding token ids
🌟 Example 2: Train a small tokenizer of your own
from tokenizers import Tokenizer, models, trainers, pre_tokenizers

# Use WordPiece as the tokenization model (the same algorithm BERT uses);
# setting unk_token keeps encoding from failing on unseen characters
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))

# Set a pre-tokenizer (splits on whitespace and punctuation)
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# Trainer
trainer = trainers.WordPieceTrainer(
    vocab_size=1000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)

# Train on a few sample sentences
texts = [
    "I love natural language processing.",
    "BERT is a transformer model.",
    "Deep learning is fun!",
]
tokenizer.train_from_iterator(texts, trainer)

# Save the tokenizer
tokenizer.save("my-tokenizer.json")

# Use the trained tokenizer
output = tokenizer.encode("I love BERT!")
print("Tokens:", output.tokens)
print("IDs:", output.ids)
🌟 Example 3: Decode (recover text from IDs)
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("my-tokenizer.json")

output = tokenizer.encode("BERT makes NLP easier.")
print("IDs:", output.ids)

# Decode back into text
decoded = tokenizer.decode(output.ids)
print("Decoded:", decoded)
🌟 Example 4: Batch processing
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("my-tokenizer.json")

batch = tokenizer.encode_batch([
    "I like AI.",
    "Transformers are powerful models.",
])
for out in batch:
    print(out.tokens, out.ids)
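When a batch is going straight into a model, the sequences usually need to be the same length. The library can pad and truncate for you; a minimal sketch, assuming "[PAD]" was registered as a special token when training in Example 2:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("my-tokenizer.json")

# Pad every sequence to the longest one in the batch and cap the length at 16 tokens
pad_id = tokenizer.token_to_id("[PAD]")
tokenizer.enable_padding(pad_id=pad_id, pad_token="[PAD]")
tokenizer.enable_truncation(max_length=16)

batch = tokenizer.encode_batch(["I like AI.", "Transformers are powerful models."])
for out in batch:
    print(out.tokens, out.attention_mask)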