NLPAUG
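This Python library helps you augment NLP data for your machine learning projects. See the project's introduction, Data Augmentation in NLP, for background. An Augmenter is the basic element of augmentation, while a Flow is a pipeline that combines multiple augmenters.
Starter Guide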
Augmenter

| Target | Augmenter | Action | Description |
|---|---|---|---|
| Character | RandomAug | insert | Insert character randomly |
| | | substitute | Substitute character randomly |
| | | swap | Swap character randomly |
| | | delete | Delete character randomly |
| | OcrAug | substitute | Simulate OCR engine error |
| | KeyboardAug | substitute | Simulate keyboard distance error |
| Word | RandomWordAug | swap | Swap word randomly |
| | | delete | Delete word randomly |
| | SpellingAug | substitute | Substitute word according to a spelling mistake dictionary |
| | WordNetAug | substitute | Substitute word according to WordNet synonyms |
| | WordEmbsAug | insert | Insert word randomly from a word2vec, GloVe or fasttext dictionary |
| | | substitute | Substitute word based on word2vec, GloVe or fasttext embeddings |
| | TfIdfAug | insert | Insert word randomly from a trained TF-IDF model |
| | | substitute | Substitute word based on TF-IDF score |
| | BertAug | insert | Insert word by feeding the surrounding words to the BERT language model |
| | | substitute | Substitute word by feeding the surrounding words to the BERT language model |
| Spectrogram | FrequencyMaskingAug | substitute | Set a block of values to zero along the frequency dimension |
| | TimeMaskingAug | substitute | Set a block of values to zero along the time dimension |
| Audio | NoiseAug | substitute | Inject noise |
| | PitchAug | substitute | Adjust the audio's pitch |
| | ShiftAug | substitute | Shift the time dimension forward/backward |
| | SpeedAug | substitute | Adjust the audio's speed |
| | CropAug | delete | Delete a segment of the audio |
| | LoudnessAug | substitute | Adjust the audio's volume |
| | MaskAug | substitute | Mask a segment of the audio |
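
To make the four character-level actions in the table concrete, here is a minimal sketch in plain Python. It does not use nlpaug: `random_char_augment` is a hypothetical helper written only for illustration, and the library's own augmenters may choose positions and replacement characters differently.

```python
import random
import string


def random_char_augment(text, action, rng=None):
    """Apply one random character-level edit to `text`.

    Hypothetical helper for illustration only; this is not nlpaug's API.
    """
    rng = rng or random.Random()
    if not text:
        return text
    chars = list(text)
    if action == "insert":
        # Any gap is a valid position, including both ends of the string.
        pos = rng.randint(0, len(chars))
        chars.insert(pos, rng.choice(string.ascii_lowercase))
    elif action == "substitute":
        pos = rng.randrange(len(chars))
        chars[pos] = rng.choice(string.ascii_lowercase)
    elif action == "swap":
        if len(chars) > 1:
            # Swap a character with its right-hand neighbour.
            pos = rng.randrange(len(chars) - 1)
            chars[pos], chars[pos + 1] = chars[pos + 1], chars[pos]
    elif action == "delete":
        del chars[rng.randrange(len(chars))]
    else:
        raise ValueError("unknown action: {!r}".format(action))
    return "".join(chars)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs give the same edits
    for action in ("insert", "substitute", "swap", "delete"):
        print(action, "->", random_char_augment("augmentation", action, rng))
```

Passing a seeded `random.Random` keeps the edits reproducible, which helps when comparing augmented datasets across runs.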
Flow

| Pipeline | Description |
|---|---|
| Sequential | Apply a list of augmentation functions sequentially |
| Sometimes | Apply some augmentation functions randomly |
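
The difference between the two pipelines can be sketched in a few lines of plain Python. The classes below borrow the names from the table but are hypothetical stand-ins, not nlpaug's implementation; the `p` parameter, the `augment` method, and the `drop_last_word` helper are assumptions made for this example.

```python
import random


class Sequential:
    """Apply every augmentation function in order (illustrative sketch)."""

    def __init__(self, flow):
        self.flow = list(flow)

    def augment(self, data):
        for fn in self.flow:
            data = fn(data)
        return data


class Sometimes:
    """Apply each augmentation function with probability `p` (illustrative sketch)."""

    def __init__(self, flow, p=0.5, rng=None):
        self.flow = list(flow)
        self.p = p
        self.rng = rng or random.Random()

    def augment(self, data):
        for fn in self.flow:
            # rng.random() is in [0, 1), so p=1.0 always applies and p=0.0 never does.
            if self.rng.random() < self.p:
                data = fn(data)
        return data


def drop_last_word(text):
    """Toy augmentation function: remove the final word if there is more than one."""
    words = text.split()
    return " ".join(words[:-1]) if len(words) > 1 else text


if __name__ == "__main__":
    sentence = "The Quick Brown Fox"
    print(Sequential([str.lower, drop_last_word]).augment(sentence))
    print(Sometimes([str.upper, drop_last_word], p=0.5).augment(sentence))
```

Both classes expose the same `augment` method, so in this sketch a pipeline can itself be used as a step inside another pipeline by passing its bound `augment` method.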
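Installation
The library supports Python 3.5+ on Linux and Windows.
To install the library:

    pip install nlpaug

Or install the latest version (including beta features) directly from GitHub:

    pip install git+https://github.com/makcedward/nlpaug.git

If you use BertAug, also install the following dependencies:

    pip install pytorch_pretrained_bert torch
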
If you use WordEmbsAug (word2vec, GloVe or fasttext), download a pre-trained model first:

    from nlpaug.util.file.download import DownloadUtil

    DownloadUtil.download_word2vec(dest_dir='.')  # Download word2vec model
    DownloadUtil.download_glove(model_name='glove.6B', dest_dir='.')  # Download GloVe model
    DownloadUtil.download_fasttext(model_name='wiki-news-300d-1M', dest_dir='.')  # Download fasttext model

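Recent Changes
BETA (Aug 16, 2019)
- Added new augmenters (CropAug, LoudnessAug, MaskAug)
- QwertyAug is deprecated. It will be replaced by KeyboardAug
- Removed StopWordsAug. It will be replaced by RandomWordAug
- Code refactoring
- Added model download functions for word2vec, GloVe and fasttext
0.0.6 (Jul 29, 2019)
See the changelog for details.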
Test
Word2vec, GloVe and fasttext models are used for word insertion and substitution. These model files are required to run the test cases. Add a ".env" file in the root directory with the following content:

    MODEL_DIR={MODEL FILE PATH}

The model folder should be structured as follows:
-- root directory
- glove.6B.50d.txt
- GoogleNews-vectors-negative300.bin
- wiki-news-300d-1M.vec
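
Before running the tests, it can help to confirm that the ".env" file and the model folder match the layout above. The following is a minimal sketch using only the standard library; `read_model_dir` and `missing_model_files` are hypothetical helpers, not part of nlpaug, and the parser assumes a simple KEY=VALUE file.

```python
import os

# File names taken from the folder structure listed above.
EXPECTED_MODEL_FILES = (
    "glove.6B.50d.txt",
    "GoogleNews-vectors-negative300.bin",
    "wiki-news-300d-1M.vec",
)


def read_model_dir(env_path=".env"):
    """Return the MODEL_DIR value from a simple KEY=VALUE file, or None if absent."""
    with open(env_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            # Skip blank lines, comments, and anything that is not KEY=VALUE.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            if key.strip() == "MODEL_DIR":
                return value.strip().strip("'\"")
    return None


def missing_model_files(model_dir):
    """List the expected model files that are not present in `model_dir`."""
    return [name for name in EXPECTED_MODEL_FILES
            if not os.path.isfile(os.path.join(model_dir, name))]


if __name__ == "__main__":
    if not os.path.isfile(".env"):
        print("No .env file found in the current directory")
    else:
        model_dir = read_model_dir()
        if model_dir is None:
            print("MODEL_DIR is not set in .env")
        else:
            print("Missing model files:", missing_model_files(model_dir) or "none")
```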
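Research Reference
Some of the augmenters above are inspired by research papers. However, for various reasons, they do not always follow the original implementation. If you need the original implementation, please refer to the original source code.
Data Source
Data captured from the internet is used to build the augmenters and test cases.
See data source for details.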