數據可視化工具
Visualizations are a great way to show the story that data wants to tell. However, not all visualizations are built the same. My rule of thumb is stick to simple, easy to understand, and well labeled graphs. Line graphs, bar charts, and histograms always work best. The most recognized libraries for visualizations are matplotlib and seaborn. Seaborn is built on top of matplotlib, so it is worth looking at matplotlib first, but in this article we’ll look at matplotlib only. Let’s get started. First, we will import all the libraries we will be working with.
可視化是顯示數據要講述的故事的好方法。 但是,并非所有可視化文件的構建都是相同的。 我的經驗法則是堅持簡單,易于理解且標簽清晰的圖形。 折線圖,條形圖和直方圖總是最有效。 最受認可的可視化庫是matplotlib和seaborn。 Seaborn是建立在matplotlib之上的,因此值得首先看一下matplotlib,但是在本文中,我們將只看一下matplotlib。 讓我們開始吧。 首先,我們將導入將要使用的所有庫。
import numpy as npimport matplotlib.pyplot as plt%matplotlib inline
We imported numpy, so we can generate random data. From matplotlib, we imported pyplot. If you are working on visualizations in jupyter notebook, you can call the %matplotlib inline command. This will allow jupyter notebook to display your visualizations directly under the code that was ran. If you’d like an interactive chart, you can call the %matplotlib command. This will allow you to manipulate your visualizations such as zoom in, zoom out, and move them around their axis.
我們導入了numpy,因此我們可以生成隨機數據。 從matplotlib中,我們導入了pyplot。 如果要在jupyter Notebook中進行可視化,則可以調用%matplotlib內聯命令。 這將使jupyter Notebook在運行的代碼下直接顯示可視化效果。 如果需要交互式圖表,可以調用%matplotlib命令。 這將允許您操縱可視化效果,例如放大,縮小并圍繞其軸移動它們。
直方圖 (Histograms)
Fist, lets take a look at histograms on matplotlib. We will look at a normal distribution. Let’s build it with numpy and visualize it with matplotlib.
拳頭,讓我們看看matplotlib上的直方圖。 我們將看一個正態分布。 讓我們使用numpy構建它,并使用matplotlib對其進行可視化。
normal_distribution = np.random.normal(0,1,10000)plt.hist(normal_distribution)plt.show()

Great! We had a histogram. But, what did we just do? First, we took 10,000 random sample from a distribution of mean 0 and standard deviation of 1. Then, we called the method hist() from matplotlib. Last, we called the show() method to display our figure. However, our histogram looks kind of…squared. Fear not! You can modify the width of each bin with the bins argument. Matplotlib defaults to 10 if an argument isn’t given. There are multiple ways to calculate bins, but I prefer to set it to ‘auto’.
大! 我們有一個直方圖。 但是,我們只是做什么? 首先,我們從均值為0且標準差為1的分布中抽取了10,000個隨機樣本。然后,從matplotlib中調用了hist()方法。 最后,我們調用了show()方法來顯示我們的圖形。 但是,我們的直方圖看起來有點...平方。 不要怕! 您可以使用bins參數修改每個垃圾箱的寬度。 如果未提供參數,則Matplotlib默認為10。 有多種計算垃圾箱的方法,但我更喜歡將其設置為“自動”。
plt.hist(normal_distribution,bins='auto')plt.show()
!
!

Much better. You know what would be twice as fun? If we visualize another distribution, but if we visualize it on the same histogram, then that would be three times as fun. That’s exactly what we are going to do.
好多了。 您知道會帶來兩倍的樂趣嗎? 如果我們可視化另一個分布,但是如果我們在相同的直方圖中可視化它,那將是三倍的樂趣。 這正是我們要做的。
modified_distribution = np.random.normal(3,1.5,10000)plt.hist(normal_distribution,bins='auto',color='purple',label='Purple')plt.hist(modified_distribution,bins='auto',color='green',alpha=.75,label='Green')plt.legend(['Purple','Green'])plt.show()

Wow! Now that’s a fun histogram. What just happen? We went from one blue histogram to one purple and one green. Let’s go over what we added. First, we created another distribution named modified_distribution. Then, we changed the color of each distribution with the color argument, passed the labelargument to name each distribution, and we passed the alpha argument to make the green distribution see through. Last, we passed the name of each distribution to the legend() method. When you have more than one set of data on a single chart, it is required to label the data to be able to tell the data apart. In this example, the data can be told apart easy, but in the real world each data can represent things that cannot be identified by color. For example, green can represent height of male college students, and purple the height of female college students. Of course, if that was the case the X axis would be on a different scale.
哇! 現在,這是一個有趣的直方圖。 剛剛發生什么事 我們從一個藍色直方圖變為一個紫色和一個綠色。 讓我們來看看添加的內容。 首先,我們創建了另一個名為modified_distribution的發行版。 然后,我們使用color參數更改每個分布的顏色 ,傳遞label參數以命名每個分布,然后傳遞alpha參數使綠色分布透明。 最后,我們將每個發行版的名稱傳遞給legend()方法。 如果單個圖表上有多個數據集,則需要標記數據以便能夠區分數據。 在此示例中,可以輕松區分數據,但在現實世界中,每個數據都可以代表無法用顏色標識的事物。 例如,綠色可以代表男性大學生的身高,而紫色可以代表女性大學生的身高。 當然,如果是這種情況,則X軸將處于不同的比例。
條形圖 (Bar Charts)
Let’s continue the fun. Now we are going to look at bar charts. This kind of charts are really useful when trying to visualize quantities, so let’s look at an example. In this case we will visualize at how people voted when asked about what type of pet they have or would like to have. Let’s randomly generate data.
讓我們繼續樂趣。 現在我們來看看條形圖。 當試圖可視化數量時,這種圖表非常有用,因此讓我們看一個示例。 在這種情況下,我們將可視化當人們問起他們擁有或想要擁有哪種類型的寵物時人們如何投票。 讓我們隨機生成數據。
options = ['Cats','Dogs','Parrots','Hamsters']
votes = [np.random.randint(10,100) for i in range(len(options)]
votes.sort(reverse=True)
Perfect, we have a list of pets and a list of randomly generated numbers. Notice, we sorted the list in descending order. I like to order list this way because it is easier to see which category is the largest and smallest. Of course, in this example we just ordered the votes without ordering the options that match up to it. In reality, we would have to order both. I found that the easiest way to go about this is to make a dictionary and order the dictionary by values. Click herefor a helpful guide on stackoverflow on how to order dictionaries by values. Now, let’s visualize our data.
完美,我們有一個寵物清單和一個隨機生成的數字清單。 注意,我們以降序對列表進行排序。 我喜歡以此方式訂購商品,因為這樣可以更輕松地查看最大和最小的類別。 當然,在此示例中,我們只是對投票進行了排序,而沒有對與之匹配的選項進行排序。 實際上,我們將必須同時訂購兩者。 我發現最簡單的方法是制作字典并按值對字典進行排序。 單擊此處以獲取有關如何按值對字典進行排序的stackoverflow的有用指南。 現在,讓我們可視化我們的數據。
plt.bar(options,votes)plt.title('Favorite Pet Survey')plt.xlabel('Options')plt.ylabel('Votes')plt.show()

Great! We have an amazing looking graph. Notice, we can easily tell cats got the most votes, and hamsters got the least votes. Let’s look at the code. After we defined our X and height, we called the bar() method to build a bar chart. We passed options as X and votes as height. Then, we labeled the title, X axis, and y axis with methods title(), xlabel(), and ylabel() respectively. Easy enough! However, this bar chart looks a bit boring. Let’s make it look fun.
大! 我們有一個驚人的外觀圖。 注意,我們可以很容易地看出貓的得票最多,而倉鼠的得票最少。 讓我們看一下代碼。 定義X和高度后,我們調用bar()方法來構建條形圖。 我們將選項作為X傳遞,將投票作為高度。 然后,我們分別使用方法title(),xlabel()和ylabel()標記標題,X軸和y軸。 很簡單! 但是,此條形圖看起來有些無聊。 讓它看起來有趣。
with plt.style.context('ggplot'):
plt.bar(options,votes)
plt.title('Favorite Pet Survey')
plt.xlabel('Options')
plt.ylabel('Votes')
plt.show()

This graph is so much fun. How did we do this? Notice, all our code looks mostly the same, but there is important code we added, and we changed the format. We added the with keyword and the context() method from plt.style to change our chart style. Really cool thing is that it only changes it for everything that’s directly under it and indented. It is important to indent the code after the first line. We used the ggplot style to make our graph more fun. Click here to view all the styles available in matplotlib. If we want to compare two datasets with the same options, it is a little harder than in histograms, but it is equally as fun. Let’s say we want to visualize male vs female vote on each category.
該圖非常有趣。 我們是如何做到的? 注意,我們所有的代碼看起來幾乎相同,但是添加了重要的代碼,并且更改了格式。 我們從plt.style添加了with關鍵字和context()方法來更改圖表樣式。 真正很酷的事情是,它僅針對直接在其下方并縮進的所有內容進行更改。 在第一行之后縮進代碼很重要。 我們使用了ggplot樣式使我們的圖表更加有趣。 單擊此處查看matplotlib中可用的所有樣式。 如果我們想比較兩個具有相同選項的數據集,這比直方圖要難一些,但同樣有趣。 假設我們要形象化每個類別的男性和女性投票。
votes_male = votes
votes_female = [np.random.randint(10,100) for i in range(len(options))]import pandas as pdwith plt.style.context('ggplot'):
pd.DataFrame({'male':votes_male,'female':votes_female,index=options).plot(kind='bar')
plt.title('Favorite Pet Survey (Male vs Female)')
plt.xticks(rotation=0)
plt.ylabel('votes')
plt.show()


Lots going on here, but you have seen most of it already. Let’s start from the top. First, we renamed the votes data to votes_male, and we generated new data for votes_female. Then, we imported pandas which is a library to work with data frames. We created a data frame for our data with male and female as our columns and pet options as our index. After, we called the plot() method from the data frame and passed bar for the kind arguement, so we can plot a bar chart. With the data frame plot method, it adds X labels for you, but they are at a 90-degree angle. To fix this, you can call the xticks() method from pyplot and pass the argument rotation 0. This will make the text like the graph above.
這里有很多事情,但是您已經看到了大部分。 讓我們從頭開始。 首先,我們將選票數據重命名為voices_male,并為votes_female生成了新數據。 然后,我們導入了pandas,這是一個用于處理數據框的庫。 我們為數據創建了一個數據框,其中男性和女性為列,寵物選項為索引。 之后,我們從數據框中調用plot()方法,并通過柱形圖進行類型爭論,因此可以繪制條形圖。 使用數據框繪圖方法時,它會為您添加X標簽,但它們成90度角。 要解決此問題,您可以從pyplot調用xticks()方法并傳遞參數rotation0。這將使文本像上面的圖形一樣。
線形圖 (Line Graph)
Now, let’s look at line graphs. These graphs are great to visualize how Y changes as X changes. Most commonly, they are used to visualize time series data. In this example, we will visualize how much water a new town uses as their population grows.
現在,讓我們看一下折線圖。 這些圖表非常適合可視化Y隨著X的變化。 最常見的是,它們用于可視化時間序列數據。 在此示例中,我們將可視化一個新城鎮隨著人口增長而消耗的水量。
town_population = np.linspace(0,10,10)
town_water_usage = [i*5 for i in town_population]with plt.style.context('seaborn'):
plt.plot(town_population,town_water_usage)
plt.title('Water Usage of Cool Town by Population')
plt.xlabel('Population (in thousands)')
plt.ylabel('Water usage (in thousand gallons)')
plt.show()

What a nice graph! As you can see, we used everything we learned so far to create this graph. The only difference is the method we called is not as intuitive as the other ones. In this case we called the plot() method. We passed our X and Y, labeled our chart, and we visualized it with our show() method. Let’s add more data. This time, we are going to add the water usage of a nearby town.
多么漂亮的圖! 如您所見,我們使用到目前為止所學的所有知識來創建該圖。 唯一的區別是我們調用的方法不像其他方法那樣直觀。 在這種情況下,我們稱為plot()方法。 我們傳遞了X和Y,標記了圖表,然后使用show()方法將其可視化。 讓我們添加更多數據。 這次,我們將增加附近城鎮的用水量。
nearby_town_water_usage = [i*.85 for i in town_water_usage]with plt.style.context('seaborn'):
plt.plot(town_population,town_water_usage,label='Cool Town')
plt.plot(town_population,nearby_town_water_usage,label='Lame Town')
plt.title('Water Usage of Cool Town and Lame Town')
plt.xlabel('Population (in thousands)')
plt.ylabel('Water usage (in thousand gallons)')
plt.legend(['Cool Town','Lame Town'])
plt.show()

As you can see we just added another plot(), labeled, each line, updated the title, and we showed a legend of the graph. For the most part is the same process as other graphs. From the graph we can see that Lame Town is actually using less water than Cool town. I guess Lame Town isn’t so lame after all.
如您所見,我們只是添加了另一個plot(),標記為每行,更新了標題,并顯示了圖例。 在大多數情況下,該過程與其他圖形相同。 從圖中可以看出,me腳鎮實際上比涼爽鎮使用的水少。 我猜La子鎮畢竟不是那么la子。
結論 (Conclusion)
We covered some of the basics of visualizing data. We even went into how to generate random data! As you can see these are very versatile and efficient ways of showing data. Nothing too crazy, just old school ways of showing the story that the data tells.
我們介紹了可視化數據的一些基礎知識。 我們甚至研究了如何生成隨機數據! 如您所見,這些是顯示數據的非常通用和有效的方法。 沒什么太瘋狂的了,只是老式的方式來顯示數據所講述的故事。
翻譯自: https://medium.com/@a.colocho/data-visualization-9e151698a921
數據可視化工具
本文來自互聯網用戶投稿,該文觀點僅代表作者本人,不代表本站立場。本站僅提供信息存儲空間服務,不擁有所有權,不承擔相關法律責任。 如若轉載,請注明出處:http://www.pswp.cn/news/389274.shtml 繁體地址,請注明出處:http://hk.pswp.cn/news/389274.shtml 英文地址,請注明出處:http://en.pswp.cn/news/389274.shtml
如若內容造成侵權/違法違規/事實不符,請聯系多彩編程網進行投訴反饋email:809451989@qq.com,一經查實,立即刪除!