How to Quickly Preprocess and Visualize Text Data with TextHero

Natural Language Processing

When we work on any NLP project or competition, we spend most of our time preprocessing text (removing digits, punctuation, stopwords, whitespace, and so on) and sometimes on visualization too. After experimenting with TextHero on a couple of NLP datasets, I found the library extremely useful for both preprocessing and visualization. It saves the time we would otherwise spend writing custom functions. Excited? Let's dive in.

We will apply techniques that we are going to learn in this article to Kaggle’s Spooky Author Identification dataset. You can find the dataset here. The complete code is given at the end of the article.


Note: TextHero is still in beta and may undergo major changes, so some of the code snippets or functionality below might change.

Installation

pip install texthero

Preprocessing

As the name suggests, the clean method cleans text. By default, it runs the text through a pipeline of seven preprocessing functions.

from texthero import preprocessing

# Apply TextHero's default cleaning pipeline to the raw text column
df['clean_text'] = preprocessing.clean(df['text'])

The seven default steps, in order:

  1. fillna(s)
  2. lowercase(s)
  3. remove_digits()
  4. remove_punctuation()
  5. remove_diacritics()
  6. remove_stopwords()
  7. remove_whitespace()
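To make the seven steps concrete, here is a minimal plain-Python sketch of what each one does to a single string. It is not TextHero's implementation: the function name clean_text and the tiny STOPWORDS set are made up for illustration, and the library's own behavior may differ in the details.

```python
import re
import string
import unicodedata

# Tiny illustrative stopword list (an assumption; real lists are far larger)
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def clean_text(text):
    """Apply the seven cleaning steps, in order, to one string."""
    if text is None:                                   # 1. fill missing values
        text = ""
    text = text.lower()                                # 2. lowercase
    text = re.sub(r"\d+", " ", text)                   # 3. remove digits
    text = text.translate(                             # 4. remove punctuation
        str.maketrans(string.punctuation, " " * len(string.punctuation))
    )
    text = "".join(                                    # 5. remove diacritics
        ch for ch in unicodedata.normalize("NFKD", text)
        if not unicodedata.combining(ch)
    )
    text = " ".join(                                   # 6. remove stopwords
        word for word in text.split(" ") if word not in STOPWORDS
    )
    return " ".join(text.split())                      # 7. collapse whitespace


print(clean_text("  The Café opened in 2019... WOW!  "))
# cafe opened wow
```

Each step leaves spaces behind where it removes something, which is why whitespace cleanup comes last.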

We can confirm which functions the default pipeline contains by calling preprocessing.get_default_pipeline(), which returns them as a list.

Apart from the seven default steps above, TextHero provides many more preprocessing functions that we can use. See the complete list here with descriptions. These are very useful, as they cover the tasks we routinely deal with during text preprocessing.

Based on our requirements, we can also build a custom pipeline as shown below. This example uses two functions, but a pipeline can contain as many as we want.

from texthero import preprocessing

# A custom pipeline is an ordered list of preprocessing functions
custom_pipeline = [preprocessing.fillna, preprocessing.lowercase]
df['clean_text'] = preprocessing.clean(df['text'], custom_pipeline)
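Conceptually, a pipeline like this is just a list of functions applied one after another. The sketch below shows that idea in plain Python with stand-in functions; fillna, lowercase, and clean here are local illustrations operating on a list of strings, not TextHero's own code.

```python
from functools import reduce


def fillna(texts):
    """Replace missing entries (None) with empty strings."""
    return ["" if t is None else t for t in texts]


def lowercase(texts):
    """Lowercase every entry."""
    return [t.lower() for t in texts]


def clean(texts, pipeline):
    """Feed the output of each step into the next, in list order."""
    return reduce(lambda current, step: step(current), pipeline, texts)


custom_pipeline = [fillna, lowercase]
print(clean(["Hello World", None, "TextHero"], custom_pipeline))
# ['hello world', '', 'texthero']
```

Order matters in this sketch: running lowercase before fillna would fail on the None entry, which is why filling missing values comes first.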

NLP

As of now, the nlp module provides only two methods, named_entities and noun_chunks. See the sample code below. Since TextHero is still in beta, I expect more functionality to be added later.

named entity


import pandas as pd
from texthero import nlp

s = pd.Series("Narendra Damodardas Modi is an Indian politician serving as the 14th and current Prime Minister of India since 2014")
print(nlp.named_entities(s)[0])

Output:
[('Narendra Damodardas Modi', 'PERSON', 0, 24),
 ('Indian', 'NORP', 31, 37),
 ('14th', 'ORDINAL', 64, 68),
 ('India', 'GPE', 99, 104),
 ('2014', 'DATE', 111, 115)]
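Each tuple in this output holds the entity text, its label, and two numbers. Judging from the values, those numbers are the start and end character offsets of the entity in the input string. The short check below, which needs only plain Python, slices the sentence with the offsets copied from the output above and confirms they recover each entity.

```python
sentence = (
    "Narendra Damodardas Modi is an Indian politician serving as the 14th "
    "and current Prime Minister of India since 2014"
)

# Tuples copied from the output above: (text, label, start, end)
entities = [
    ("Narendra Damodardas Modi", "PERSON", 0, 24),
    ("Indian", "NORP", 31, 37),
    ("14th", "ORDINAL", 64, 68),
    ("India", "GPE", 99, 104),
    ("2014", "DATE", 111, 115),
]

for text, label, start, end in entities:
    # Slicing with the offsets should give back the entity text
    matches = sentence[start:end] == text
    print(label, repr(sentence[start:end]), matches)
```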

noun phrases


s = pd.Series("Narendra Damodardas Modi is an Indian politician serving as the 14th and current Prime Minister of India since 2014")
print(nlp.noun_chunks(s)[0])

Output:
[('Narendra Damodardas Modi', 'NP', 0, 24),
 ('an Indian politician', 'NP', 28, 48),
 ('the 14th and current Prime Minister', 'NP', 60, 95),
 ('India', 'NP', 99, 104)]

Representation

This module maps text data to vectors (term frequency, TF-IDF), clusters it (k-means, DBSCAN, mean shift), and reduces its dimensionality (PCA, t-SNE, NMF).
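For readers new to TF-IDF, here is a minimal sketch of the idea in plain Python. It uses the textbook weighting tf * log(N / df); this is an illustration only, and library implementations (scikit-learn, for instance) add smoothing and normalization, so the numbers will not match theirs. The function name tfidf and the sample documents are invented for the example.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per whitespace-tokenized document."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Term frequency times inverse document frequency
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


docs = ["raven night raven", "night sea", "sea storm"]
for vector in tfidf(docs):
    print(vector)
```

Words that are frequent in one document but rare across the collection ("raven" here) get the highest weights, while a word that appears in every document gets a weight of zero.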

Let's look at an example with TF-IDF and PCA on the Spooky Author Identification training dataset.

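from texthero import preprocessing, representation, visualization

# Clean the text, vectorize it with TF-IDF, then reduce dimensionality with PCA
train['pca'] = (
    train['text']
    .pipe(preprocessing.clean)
    .pipe(representation.tfidf, max_features=1000)
    .pipe(representation.pca)
)

# Plot the PCA coordinates, colored by author
visualization.scatterplot(train, 'pca', color='author', title="Spooky Author identification")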

Visualization

This module draws scatter plots and word clouds, and can also return the top n words in a text. Refer to the examples below.

Scatter-plot example


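# Clean the text and vectorize it with TF-IDF
train['tfidf'] = (
    train['text']
    .pipe(preprocessing.clean)
    .pipe(representation.tfidf, max_features=1000)
)

# Cluster the TF-IDF vectors into three groups; labels are cast to strings for the color column
train['kmeans_labels'] = (
    train['tfidf']
    .pipe(representation.kmeans, n_clusters=3)
    .astype(str)
)

# Reduce dimensionality with PCA and plot, colored by cluster label
train['pca'] = train['tfidf'].pipe(representation.pca)
visualization.scatterplot(train, 'pca', color='kmeans_labels', title="K-means Spooky author")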

Wordcloud example

from texthero import visualization

# Draw a word cloud from the cleaned text column
visualization.wordcloud(train['clean_text'])

Top words example

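Getting the top n words boils down to counting word frequencies across all the cleaned texts. The sketch below shows that idea with collections.Counter from the standard library. It is a plain-Python illustration rather than TextHero's own call, and the function name, signature, and sample data are assumptions made for the example.

```python
from collections import Counter


def top_words(texts, n=5):
    """Return the n most frequent whitespace-separated words with their counts."""
    counts = Counter(word for text in texts for word in text.split())
    return counts.most_common(n)


cleaned = ["raven night raven", "night sea", "raven storm"]
print(top_words(cleaned, n=2))
# [('raven', 3), ('night', 2)]
```

Running this on text that has already been cleaned matters: without stopword removal, the top of the list would be dominated by words like "the" and "of".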

Complete Code

Conclusion

We have gone through most of the functionality TextHero provides. Apart from the NLP module, I found the rest of the features really useful, and they are worth trying in your next NLP project.

Thank you so much for taking the time to read this article. You can reach me at https://www.linkedin.com/in/chetanambi/

Translated from: https://medium.com/towards-artificial-intelligence/how-to-quickly-preprocess-and-visualize-text-data-with-texthero-c86957452824
