A Beginner's Road to Advancement, Part 16: AI from First Steps to Mastery, a Comprehensive PyTorch Walkthrough (Part 9)

NLP From Scratch

In this three-part series, you will build and train a basic character-level recurrent neural network (RNN) to classify words.

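You will learn:

How to build a recurrent neural network from scratch
Basic data-handling techniques for NLP
How to train an RNN to identify the language a word comes from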

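NLP From Scratch: Classifying Names with a Character-Level RNN

We will build and train a basic character-level recurrent neural network (RNN) to classify words. In particular, this tutorial shows how to preprocess data for NLP modeling at a low level, without relying on higher-level convenience libraries.

A character-level RNN reads a word as a series of characters. At each step it outputs a prediction and a "hidden state", and feeds that hidden state into the next step. We take the final prediction as the output, that is, the class the word belongs to.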
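The recurrence described above can be sketched in plain Python, without PyTorch. This is only an illustration under stated assumptions: the weights are random and untrained, and the names `matvec`, `run_char_rnn`, `W_xh`, `W_hh`, and `W_hy` are hypothetical helpers that do not appear in the tutorial. It shows how the hidden state is carried from one character to the next and how only the final step's scores are used as the prediction.

```python
import math
import random


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def run_char_rnn(word, alphabet, hidden_size=4, n_classes=3, seed=0):
    """Feed a word through an untrained toy RNN one character at a time."""
    rng = random.Random(seed)

    def rand_matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    # Hypothetical weight matrices; a real model would learn these.
    W_xh = rand_matrix(hidden_size, len(alphabet))   # input  -> hidden
    W_hh = rand_matrix(hidden_size, hidden_size)     # hidden -> hidden
    W_hy = rand_matrix(n_classes, hidden_size)       # hidden -> class scores

    hidden = [0.0] * hidden_size
    scores = [0.0] * n_classes
    for ch in word:
        # One-hot encode the current character.
        x = [1.0 if a == ch else 0.0 for a in alphabet]
        # The new hidden state depends on the input AND the previous hidden state.
        pre = [a + b for a, b in zip(matvec(W_xh, x), matvec(W_hh, hidden))]
        hidden = [math.tanh(p) for p in pre]
        # Every step produces scores; only the last step's scores are kept.
        scores = matvec(W_hy, hidden)
    return scores, hidden


scores, hidden = run_char_rnn("abc", "abcdefghijklmnopqrstuvwxyz")
predicted_class = scores.index(max(scores))
print(predicted_class, scores)
```

Because the weights are random, the predicted class here is meaningless; the point is only the data flow from step to step.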

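Specifically, we will train on a dataset of a few thousand surnames from 18 languages, and predict which language a name comes from based on its spelling.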

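Preparing Torch

Set up torch so that it defaults to the correct device, using GPU acceleration when your hardware supports it (CUDA) and falling back to the CPU otherwise.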

import torch

# Check if CUDA is available
device = torch.device('cpu')
if torch.cuda.is_available():
    device = torch.device('cuda')

torch.set_default_device(device)
print(f"Using device = {torch.get_default_device()}")

Output:

Using device = cuda:0

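Preparing the Data

Download the data and extract it to the current directory.

The data/names directory contains 18 text files named [Language].txt. Each file contains many names, one per line, mostly romanized (but we still need to convert from Unicode to ASCII).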
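As a rough sketch of how such a directory could be read, the function below builds a dictionary from language name to list of names. The function name `load_names` is hypothetical and is not part of the tutorial; it assumes the files are UTF-8 encoded and that the language is simply the file name without its extension.

```python
from pathlib import Path


def load_names(data_dir):
    """Map each language (taken from the file name) to the names in that file."""
    names_by_language = {}
    for path in sorted(Path(data_dir).glob("*.txt")):
        language = path.stem  # "Polish.txt" -> "Polish"
        with open(path, encoding="utf-8") as f:
            # One name per line; skip blank lines.
            names_by_language[language] = [line.strip() for line in f if line.strip()]
    return names_by_language
```

With the layout described above, `load_names("data/names")` would be expected to return one entry per language file.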

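The first step is to define and clean our data. To begin with, we need to convert Unicode to plain ASCII in order to limit the size of the RNN input layer. We do this by converting each Unicode string to ASCII and keeping only a small set of allowed characters.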

import string
import unicodedata

# We can use "_" to represent an out-of-vocabulary character, that is, any character we are not handling in our model
allowed_characters = string.ascii_letters + " .,;'" + "_"
n_letters = len(allowed_characters)

# Turn a Unicode string to plain ASCII, thanks to https://stackoverflow.com/a/518232/2809427
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
        and c in allowed_characters
    )

Here is an example of converting a name written with Unicode letters to plain ASCII. This simplifies the input layer.

print(f"converting 'Ślusàrski' to {unicodeToAscii('Ślusàrski')}")

Output:

converting 'Ślusàrski' to Slusarski
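To see why this conversion works, it can help to look at what NFD normalization does to a single accented letter. The helper `show_decomposition` below is a hypothetical name used only for illustration. It relies on the standard-library `unicodedata` module: NFD splits an accented letter into a base letter plus a combining mark, and combining marks have the category 'Mn', which is exactly what the filter discards.

```python
import unicodedata


def show_decomposition(s):
    """Return (character, Unicode category) pairs for the NFD form of s."""
    return [(c, unicodedata.category(c)) for c in unicodedata.normalize("NFD", s)]


# "à" decomposes into the base letter "a" plus U+0300 (combining grave accent).
pairs = show_decomposition("\u00e0")
print(pairs)

# Dropping every 'Mn' character leaves only the base letter.
stripped = "".join(c for c, category in pairs if category != "Mn")
print(stripped)  # a
```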

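Turning Names into Tensors

Now that we have all the names organized, we need to turn them into tensors to make any use of them.

To represent a single letter, we use a "one-hot vector" of size <1 x n_letters>. A one-hot vector is filled with 0s except for a 1 at the index of the current letter, e.g. "b" = <0 1 0 0 0 ...>.

To make a word, we join a bunch of those into a 2D matrix <line_length x 1 x n_letters>.

The extra dimension of size 1 is there because PyTorch assumes everything is in batches. We are just using a batch size of 1 here.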
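Before the PyTorch version, the same encoding can be sketched with nested Python lists to make the <line_length x 1 x n_letters> layout concrete. The names `one_hot` and `line_to_nested_lists` are hypothetical and are used here only for illustration; the character set is the same `allowed_characters` string defined earlier, repeated so that the block runs on its own.

```python
import string

# Same character set as defined earlier: 52 letters + 5 punctuation marks + "_".
allowed_characters = string.ascii_letters + " .,;'" + "_"


def one_hot(letter):
    """Return a list of zeros with a single 1 at the letter's index."""
    vec = [0] * len(allowed_characters)
    idx = allowed_characters.find(letter)
    if idx == -1:
        # Unknown characters map to the out-of-vocabulary slot "_".
        idx = allowed_characters.find("_")
    vec[idx] = 1
    return vec


def line_to_nested_lists(line):
    """Encode a string as a len(line) x 1 x n_letters nested list."""
    return [[one_hot(ch)] for ch in line]


encoded = line_to_nested_lists("Ahn")
print(len(encoded), len(encoded[0]), len(encoded[0][0]))  # 3 1 58
print(encoded[0][0].index(1))  # 26, because "A" follows the 26 lowercase letters
```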

# Find letter index from allowed_characters, e.g. "a" = 0
def letterToIndex(letter):
    # return our out-of-vocabulary character if we encounter a letter unknown to our model
    if letter not in allowed_characters:
        return allowed_characters.find("_")
    else:
        return allowed_characters.find(letter)

# Turn a line into a <line_length x 1 x n_letters>,
# or an array of one-hot letter vectors
def lineToTensor(line):
    tensor = torch.zeros(len(line), 1, n_letters)
    for li, letter in enumerate(line):
        tensor[li][0][letterToIndex(letter)] = 1
    return tensor

Here are some examples of how to use lineToTensor() on a single character and on a multi-character string.

print (f"The letter 'a' becomes {lineToTensor('a')}") #notice that the first position in the tensor = 1
print (f"The name 'Ahn' becomes {lineToTensor('Ahn')}") #notice '
