Estimates state that 70%–85% of the world’s data is text (unstructured data). Most English and EU business data is formatted as byte text, MS Word, or Adobe PDF. [1]
Organizations display Adobe Portable Document Format (PDF) documents on the web. [2]
In this blog, I detail the following:
- Create a file path from a web file name or a local file name;
- Change a byte-encoded Gutenberg Project file into a text corpus;
- Change a PDF document into a text corpus;
- Segment continuous text into a corpus of word text.
Converting Popular Document Formats into Text
1. Create a local filepath from the web filename or local filename
The following function takes either a local file name or a remote file URL and returns a file-like object.
#in file_to_text.py
--------------------------------------------
from io import StringIO, BytesIO
from typing import Any
import urllib.request

def file_or_url(pathfilename: str) -> Any:
    """
    Return a file-like object given a local file or URL.
    Args:
        pathfilename: local file path or remote URL
    Returns:
        file-like object instance
    """
    try:
        # try to open as a local binary file first
        fp = open(pathfilename, mode="rb")
    except OSError:
        # fall back to reading the remote URL into an in-memory buffer
        url_text = urllib.request.urlopen(pathfilename).read()
        fp = BytesIO(url_text)
    return fp
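A minimal usage sketch, assuming file_to_text.py is importable; the arXiv URL and the local path tmp/inf_finite_NN.pdf are the same examples used later in this post and must be reachable:

--------------------------------------------
from file_to_text import file_or_url

# remote URL: fetched into an in-memory BytesIO buffer
fp_remote = file_or_url('https://arxiv.org/pdf/2008.05828v1.pdf')
print(fp_remote.read(8))   # PDF files begin with the b'%PDF-' magic bytes

# local path: returned as an ordinary binary file handle
fp_local = file_or_url('tmp/inf_finite_NN.pdf')
print(fp_local.read(8))

fp_remote.close()
fp_local.close()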
2. Change a Unicode byte-encoded file into a Python Unicode string
You will often encounter text blob downloads in 8-bit Unicode (UTF-8) format (for the Romance languages). You need to convert the 8-bit Unicode bytes into Python Unicode strings.
#in file_to_text.py
--------------------------------------------
def unicode_8_to_text(text: bytes) -> str:
    return text.decode("utf-8", "replace")
--------------------------------------------
import urllib.request
from file_to_text import unicode_8_to_text

text_l = 250
text_url = r'http://www.gutenberg.org/files/74/74-0.txt'
gutenberg_text = urllib.request.urlopen(text_url).read()
%time gutenberg_text = unicode_8_to_text(gutenberg_text)
print('{}: size: {:g} \n {} \n'.format(0, len(gutenberg_text), gutenberg_text[:text_l]))
output =>
CPU times: user 502 μs, sys: 0 ns, total: 502 μs
Wall time: 510 μs
0: size: 421927
The Project Gutenberg EBook of The Adventures of Tom Sawyer, Complete by
Mark Twain (Samuel Clemens)
This eBook is for the use of anyone anywhere at no cost and with almost
no restrictions whatsoever. You may copy it, give it away or re-use
it under the terms of the Project Gutenberg License included with this
eBook or online at www.guten
The result is that text.decode('utf-8') can convert roughly a million characters into a Python string in about a thousandth of a second, a rate that far exceeds our production requirements.
3. Change a PDF document into a text corpus
"Changing a PDF document into a text corpus" is one of the most troublesome and most common tasks I do for NLP text pre-processing.
#in file_to_text.py
--------------------------------------------
# pdfminer.six imports
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

def PDF_to_text(pathfilename: str) -> str:
    """
    Change PDF format to text.
    Args:
        pathfilename: local file path or remote URL of a PDF
    Returns:
        extracted text as a Python string
    """
    fp = file_or_url(pathfilename)
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(
        fp,
        pagenos,
        maxpages=maxpages,
        password=password,
        caching=caching,
        check_extractable=True,
    ):
        interpreter.process_page(page)
    text = retstr.getvalue()
    fp.close()
    device.close()
    retstr.close()
    return text
--------------------------------------------
arxiv_list = ['https://arxiv.org/pdf/2008.05828v1.pdf',
              'https://arxiv.org/pdf/2008.05981v1.pdf',
              'https://arxiv.org/pdf/2008.06043v1.pdf',
              'tmp/inf_finite_NN.pdf']
for n, f in enumerate(arxiv_list):
    %time pdf_text = PDF_to_text(f).replace('\n', ' ')
    print('{}: size: {:g} \n {} \n'.format(n, len(pdf_text), pdf_text[:text_l]))
output =>
CPU times: user 1.89 s, sys: 8.88 ms, total: 1.9 s
Wall time: 2.53 s
0: size: 42522
On the Importance of Local Information in Transformer Based Models Madhura Pande, Aakriti Budhraja, Preksha Nema Pratyush Kumar, Mitesh M. Khapra Department of Computer Science and Engineering Robert Bosch Centre for Data Science and AI (RBC-DSAI) Indian Institute of Technology Madras, Chennai, India {mpande,abudhra,preksha,pratyush,miteshk}@
CPU times: user 1.65 s, sys: 8.04 ms, total: 1.66 s
Wall time: 2.33 s
1: size: 30586
ANAND,WANG,LOOG,VANGEMERT:BLACKMAGICINDEEPLEARNING1BlackMagicinDeepLearning:HowHumanSkillImpactsNetworkTrainingKanavAnand1anandkanav92@gmail.comZiqiWang1z.wang-8@tudelft.nlMarcoLoog12M.Loog@tudelft.nlJanvanGemert1j.c.vangemert@tudelft.nl1DelftUniversityofTechnology,Delft,TheNetherlands2UniversityofCopenhagenCopenhagen,DenmarkAbstractHowdoesauser’sp
CPU times: user 4.82 s, sys: 46.3 ms, total: 4.87 s
Wall time: 6.53 s
2: size: 57204
0 2 0 2 g u A 3 1 ] G L . s c [ 1 v 3 4 0 6 0 . 8 0 0 2 : v i X r a Of?ine Meta-Reinforcement Learning with Advantage Weighting Eric Mitchell1, Rafael Rafailov1, Xue Bin Peng2, Sergey Levine2, Chelsea Finn1 1 Stanford University, 2 UC Berkeley em7@stanford.edu Abstract Massive datasets have proven critical to successfully
CPU times: user 12.2 s, sys: 36.1 ms, total: 12.3 s
Wall time: 12.3 s
3: size: 89633
0 2 0 2 l u J 1 3 ] G L . s c [ 1 v 1 0 8 5 1 . 7 0 0 2 : v i X r a Finite Versus In?nite Neural Networks: an Empirical Study Jaehoon Lee Samuel S. Schoenholz? Jeffrey Pennington? Ben Adlam?? Lechao Xiao? Roman Novak? Jascha Sohl-Dickstein {jaehlee, schsam, jpennin, adlam, xlc, romann, jaschasd}@google.com Google Brain
On this hardware configuration, "converting a PDF file into a Python string" requires roughly 150 seconds per million characters. That is not fast enough for an interactive Web production application.
You may want to stage the formatting in the background.
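One simple way to stage the formatting in the background is a worker pool; the snippet below is a minimal sketch using concurrent.futures, assuming PDF_to_text is importable from file_to_text.py (the pool size is an illustrative choice):

--------------------------------------------
from concurrent.futures import ProcessPoolExecutor, as_completed
from file_to_text import PDF_to_text

def stage_pdf_conversions(pathfilenames, max_workers=4):
    """Convert PDF files to text in background worker processes."""
    texts = {}
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(PDF_to_text, f): f for f in pathfilenames}
        for fut in as_completed(futures):
            # store each result keyed by its source path/URL as it finishes
            texts[futures[fut]] = fut.result().replace('\n', ' ')
    return texts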
4. Segment continuous text into a corpus of word text
When we read https://arxiv.org/pdf/2008.05981v1.pdf, it came back as continuous text with no separation characters. Using the wordsegment package, we separate the continuous string into words.
from wordsegment import load, clean, segment

load()   # load the wordsegment unigram/bigram data before segmenting
%time words = segment(pdf_text)
print('size: {:g} \n'.format(len(words)))
' '.join(words)[:text_l*4]
output =>
CPU times: user 1min 43s, sys: 1.31 s, total: 1min 44s
Wall time: 1min 44s
size: 5005
'an and wang loog van gemert blackmagic in deep learning 1 blackmagic in deep learning how human skill impacts network training kanavanand1anandkanav92g mailcom ziqiwang1zwang8tudelftnl marco loog12mloogtudelftnl jan van gemert 1jcvangemerttudelftnl1 delft university of technology delft the netherlands 2 university of copenhagen copenhagen denmark abstract how does a users prior experience with deep learning impact accuracy we present an initial study based on 31 participants with different levels of experience their task is to perform hyper parameter optimization for a given deep learning architecture the results show a strong positive correlation between the participants experience and then al performance they additionally indicate that an experienced participant nds better solutions using fewer resources on average the data suggests furthermore that participants with no prior experience follow random strategies in their pursuit of optimal hyperparameters our study investigates the subjective human factor in comparisons of state of the art results and scientic reproducibility in deep learning 1 introduction the popularity of deep learning in various elds such as image recognition 919speech1130 bioinformatics 2124questionanswering3 etc stems from the seemingly favorable tradeoff between the recognition accuracy and their optimization burden lecunetal20 attribute their success t'
You will notice that wordsegment accomplishes a fairly accurate separation into words. There are some errors, or words that we don’t want, which NLP text pre-processing clears away.
The Apache-licensed wordsegment package is slow. It is barely adequate in production for small documents of fewer than a thousand words. Can we find a faster way to segment?
4b. Segment continuous text into a corpus of word text
There seems to be a faster method to "segment continuous text into a corpus of word text."
As discussed in the SymSpell blog post:
SymSpell is 100x–1000x faster. Wow!
Note: ed. 8/24/2020: Wolf Garbe deserves credit for pointing out:
The benchmark results (100x–1000x faster) given in the SymSpell blog post refer solely to spelling correction, not to word segmentation. In that post, SymSpell was compared to other spelling correction algorithms, not to word segmentation algorithms. — Wolf Garbe 8/23/2020
and
Also, there is an easier way to call a C# library from Python: https://stackoverflow.com/questions/7367976/calling-a-c-sharp-library-from-python — Wolf Garbe 8/23/2020
Note: ed. 8/24/2020: I am going to try Garbe's C# implementation. If I do not get the same results (and probably even if I do), I will try a Cython port and see if I can fit it into spaCy as a pipeline element. I will let you know my results.
However, SymSpell is implemented in C#, and I am not going down the infinite ratholes of:
- Converting all my NLP into C#. Not a viable option.
- Calling C# from Python (see the pythonnet sketch after the note below). I talked to two engineering managers of Python groups. They have Python-C# capability, but it involves:
Note:
- Translating to VB-vanilla;
- Manual intervention and translation must pass tests for reproducibility;
- Translating from VB-vanilla to C;
- Manual intervention and translation must pass tests for reproducibility.
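For reference, the easier route Garbe points to is pythonnet, which loads a .NET assembly directly into CPython. The lines below are a sketch only; the DLL path is illustrative, and the SymSpell assembly's namespace and class names would need to be checked against the C# source:

--------------------------------------------
import sys
import clr                                   # provided by pythonnet ("pip install pythonnet")
sys.path.append("/path/to/symspell/bin")     # illustrative directory containing SymSpell.dll
clr.AddReference("SymSpell")                 # load the compiled C# assembly
# then import its classes and call the C# word-segmentation API, e.g.:
# from <namespace> import SymSpell           # namespace/class names are hypothetical here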
Instead, we work with a port to Python. Here is a version:
import pkg_resources
from symspellpy import SymSpell

def segment_into_words(input_term):
    # maximum edit distance per dictionary precalculation
    max_edit_distance_dictionary = 0
    prefix_length = 7
    # create object
    sym_spell = SymSpell(max_edit_distance_dictionary, prefix_length)
    # load dictionary
    dictionary_path = pkg_resources.resource_filename(
        "symspellpy", "frequency_dictionary_en_82_765.txt")
    bigram_path = pkg_resources.resource_filename(
        "symspellpy", "frequency_bigramdictionary_en_243_342.txt")
    # term_index is the column of the term and count_index is the
    # column of the term frequency
    if not sym_spell.load_dictionary(dictionary_path, term_index=0,
                                     count_index=1):
        print("Dictionary file not found")
        return
    if not sym_spell.load_bigram_dictionary(bigram_path, term_index=0,
                                            count_index=2):
        print("Bigram dictionary file not found")
        return
    result = sym_spell.word_segmentation(input_term)
    return result.corrected_string
--------------------------------------------
%time long_s = segment_into_words(pdf_text)
print('size: {:g} {}'.format(len(long_s), long_s[:text_l*4]))
output =>
CPU times: user 20.4 s, sys: 59.9 ms, total: 20.4 s
Wall time: 20.4 s
size: 36585 ANAND,WANG,LOOG,VANGEMER T:BLACKMAGICINDEEPLEARNING1B lack MagicinDeepL earning :HowHu man S kill Imp acts Net work T raining Ka nav An and 1 an and kana v92@g mail . com ZiqiWang1z. wang -8@tu delft .nlM arc oLoog12M.Loog@tu delft .nlJ an van Gemert1j.c. vang emert@tu delft .nl1D elf tUniversityofTechn ology ,D elf t,TheN ether lands 2UniversityofC open hagen C open hagen ,Den mark Abs tract How does a user ’s prior experience with deep learning impact accuracy ?We present an initial study based on 31 participants with different levels of experience .T heir task is to perform hyper parameter optimization for a given deep learning architecture .T here -s ult s show a strong positive correlation between the participant ’s experience and the ?n al performance .T hey additionally indicate that an experienced participant ?nds better sol u-t ions using fewer resources on average .T he data suggests furthermore that participants with no prior experience follow random strategies in their pursuit of optimal hyper pa-ra meters .Our study investigates the subjective human factor in comparisons of state of the art results and sci enti?c reproducibility in deep learning .1Intro duct ion T he popularity of deep learning in various ? eld s such as image recognition [9,19], speech [11,30], bio informatics [21,24], question answering [3] etc . stems from the seemingly fav or able trade - off b
SymSpellpy, implemented in Python, is about 5x faster than wordsegment. We are not seeing 100x–1000x faster.
I guess that SymSpell-C# is being compared to different segmentation algorithms implemented in Python.
Perhaps the speedup is due to C#, a compiled, statically typed language. Since C# and C are about the same computing speed, we should expect a C# implementation to be 100x–1000x faster than a Python implementation.
Note: There is a spaCy pipeline implementation, spacy_symspell, which directly calls SymSpellpy. I recommend you do not use spacy_symspell. spaCy generates tokens as the first step of its pipeline, and those tokens are immutable. spacy_symspell generates new text by segmenting continuous text; it cannot generate new tokens because spaCy has already generated them. A spaCy pipeline works on a token sequence, not on a stream of text. One would have to spin off a changed version of spaCy. Why bother? Instead, segment continuous text into a corpus of word text, then correct embedded whitespace within words and hyphenated words in the text, do any other raw cleaning you want, and then feed the raw text to spaCy.
I show spacy_symspell below. Again, my advice is not to use it.
import spacy
from spacy_symspell import SpellingCorrector

def segment_into_words(input_term):
    nlp = spacy.load("en_core_web_lg", disable=["tagger", "parser"])
    corrector = SpellingCorrector()
    nlp.add_pipe(corrector)
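In contrast, here is a minimal sketch of the order I recommend — segment and clean first, then hand the raw text to spaCy — assuming the SymSpellpy-based segment_into_words from section 4b and an installed en_core_web_lg model:

--------------------------------------------
import spacy

segmented = segment_into_words(pdf_text)   # 1. segment the continuous text
cleaned = segmented.replace(' - ', '')     # 2. illustrative raw cleanup (e.g., re-join hyphenated words)
nlp = spacy.load("en_core_web_lg")
doc = nlp(cleaned)                         # 3. only now tokenize with spaCy
print(len(doc), [token.text for token in doc[:20]])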
Conclusion
In future blogs, I will detail many common and uncommon fast text pre-processing methods. Also, I will show the expected speedup from moving SymSpellpy to Cython.
There will be many more formats and APIs you need to support in the world of "changing X format into a text corpus."
I detailed two of the more common document formats, PDF and the Gutenberg Project format. I also gave two NLP utility functions, segment_into_words and file_or_url.
I hope you learned something and can use some of the code in this blog.
If you have some format conversions, or better yet a package of them, let me know.
Translated from: https://towardsdatascience.com/natural-language-processing-in-production-converting-pdf-and-gutenberg-document-formats-into-text-9e7cd3046b33