When working with text data and NLP projects, word frequency is often a useful feature to identify and explore. However, creating good visuals is often difficult because you don't have many options beyond bar charts. Let's face it: bar charts get old and boring quickly! This is where word clouds come into play. In this blog, learn how to spice up your visualizations using word clouds on your next project.

Up until my most recent project, I didn't know a word cloud library existed in Python, but I assure you it does, and it has some amazing features!

The full WordCloud library and documentation can be found here for those interested.

TLDR

Part 1 of this blog will walk you through obtaining the appropriate libraries and the basic parameters and functions of the wordcloud library as well as how to create a generic word cloud. Part 2 will build upon this and walk you through creating custom masks for word clouds and other unique visual options.

Getting Started With WordCloud

Before we can start making visuals, we’ll need to make sure we have the libraries we need to create our word clouds. You’ll need the following libraries:

numpy
matplotlib
PIL (installed through pip as Pillow)
wordcloud
nltk (only needed for this blog, as a source of sample text to create word clouds from)

All of these libraries can be installed with pip if you're unable to import them. For my specific project, I used Google Colab, which required a slightly different approach to installing wordcloud. For Google Colab users, you can use the following command to install wordcloud:

!pip install git+https://github.com/amueller/word_cloud.git#egg=wordcloud

That last part is important for Colab because it identifies and effectively names the library so that it can be properly imported.

Once we have all of our needed libraries installed, we can use the following set of import statements:


We’re now ready to create some word clouds!

Generic Word Clouds

To start with, let's explore generic word clouds. For those who want to follow along, we'll use some corpora from the nltk library.

First off, we'll need to acquire our text. I'll note here that there are two forms of input that WordCloud can use to generate a visual. The first, and the main one we'll use, is a string. The second is a dictionary of words and their frequencies as key-value pairs.


If you’re following along, or want to attempt this using other sample text from nltk, you can use the following code to acquire our text samples:


This shows a list of the different authors and texts we have to choose from within nltk’s gutenberg files

Feel free to attempt creating word clouds from any of the above options. The one that we’ll continue with in these examples, however, will be Moby Dick.

To gather our sample text as a single string you can use the following command:


Now that we have our text, let's take a look at how to turn it into a word cloud. In the code block below, we instantiate a WordCloud object and then use that object to generate a cloud based on the text we pass in. Once the cloud is generated, we want to show it without the unnecessary x and y axes.


Look at that! We made a word cloud!

Now personally, I’m not a fan of the black background and it seems a little small, so let’s change that with some simple parameters.


Now we're talking! Although, there seem to be some strange things showing up in our generic word cloud, don't there?

Parameters and Language Processing

Looking at the cloud above we notice some things. Some words seem to be paired.

the whale
the ship
the sea
the captain
White Whale

And so on. Our word cloud is still showing word frequencies; however, one of WordCloud's parameters is 'collocations', which defaults to True. With it enabled, the library also looks at pairs of words (bigrams) and their frequencies. In some instances this can definitely be useful, but in this one I think we'll get better results without it.


Notice the difference?

A keen eye may recognize that the word ‘the’ no longer appears in our word cloud. This is because ‘the’ is recognized as a stop-word and excluded from the cloud even though it appears quite frequently in the text.

You may be wondering where stop-words came into play, and that is one of the really cool features of the wordcloud library. The library comes with its own list of stop-words that it uses by default. In fact, it applies quite a few NLP practices by default, which makes creating clouds that much easier while staying adjustable for the more experienced NLP practitioner. Some of the additional NLP parameters are:

regexp: an optional parameter; if left as None, r"\w[\w']+" is used by default. A custom regex string can be passed in here.
normalize_plurals: default = True. For words that appear both with and without a trailing 's', the 's' is removed from the plural and its count is added to the singular form.

In our original import statement we imported STOPWORDS from the wordcloud library. You can print this to see the entire list of words that are excluded by default; it currently contains 192 of the most common stop-words. You can also add to this list if you have additional words you want excluded, or supply your own stop-words if you prefer. Note that the stopwords must be passed in as a set and not a list.


What a difference!

One last thing we’ll talk about before moving on to making fun and unique word clouds is “relative scaling”.

Relative scaling determines how strongly a word's size depends on its frequency. By default, relative_scaling is set to 'auto', which resolves to 0.5 unless repeat is enabled. At 0.5, size is a blend of a word's rank and its frequency, so a word that occurs twice as often as another is drawn larger, but less than twice as large.

Relative scaling can be set to any number between 0 and 1. At 0, only the rank of each word is considered, so the actual frequencies have no effect on size. At 1, a word that occurs twice as often will be twice as large. In some cases this can be useful to better identify the differences in frequency. However, it doesn't always look very good and can affect the fit of a word cloud to a mask, which we will talk about later.


In this case, using a relative scaling of 1 actually doesn’t look too bad! We’ll soon see how this translates to using it with an image mask.

Saving Your Word Cloud

Once you have your word cloud the way you want it, you’ll probably want to save it. To do so, you can run the following code which will save the current state of your WordCloud object.


Keep in mind that this saves the image to your current working directory. If you have a specific location in mind, you will need to include the appropriate path.

Other Parameters Worth Playing With

We looked at the key parameters for making word clouds, but there are many more that are worth looking into and toying with. These parameters are fairly self-explanatory and can be used to further tweak your clouds:

prefer_horizontal: (float) If set to 1, all words will appear horizontal, while lower values will increase the frequency of vertical words. default = 0.9
min_font_size: (int) Smallest font size to be used. default = 4
max_words: (int) Maximum number of words in the cloud. default = 200
min_word_length: (int) Minimum number of letters a word must have to be included in the cloud. default = 0
include_numbers: (bool) Whether to include numbers as phrases. default = False
repeat: (bool) Determines whether words/phrases will be repeated until max_words or min_font_size is reached. (Can be used to create word clouds from a single word.) default = False

Unique and Custom Word Clouds

Because this blog turned out much longer than I had initially planned, I'll cover using image masks to create custom word clouds, creating your own image masks from any image, and applying an image's color to your cloud in Part 2 of this blog, coming soon.

Translated from: https://medium.com/swlh/cloudy-with-a-chance-of-words-part-1-d34a29739dba
