Exploratory Data Analysis of the IRIS Dataset

Let's explore one of the simplest datasets, the IRIS dataset, which describes three species of iris flower by their sepal length, sepal width, petal length, and petal width. The data set consists of 50 samples from each of the three species (Iris setosa, Iris virginica, and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. Our objective is to classify a new flower as belonging to one of the 3 classes given the 4 features.

Download IRIS data from here.

Here I'm importing the libraries that are useful for our exploratory data analysis (pandas, matplotlib, numpy, and seaborn) in an IPython notebook installed through Anaconda Navigator (download: https://www.anaconda.com/products/individual).

Figure: Exploring the data

Here, IRIS is a balanced dataset because every class (Setosa, Virginica, and Versicolor) has the same number of data points, 50. If the classes have different numbers of data points, the dataset is imbalanced.
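The loading and exploration steps described above could look roughly like the sketch below. It is not the article's original code: to stay self-contained it assumes the copy of IRIS bundled with scikit-learn instead of the downloaded file, and the helper `load_iris_frame` and the `species` column name are illustrative choices.

```python
import pandas as pd
from sklearn.datasets import load_iris


def load_iris_frame() -> pd.DataFrame:
    """Return the IRIS data as a DataFrame with snake_case column names."""
    bunch = load_iris(as_frame=True)
    df = bunch.frame.copy()
    df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "target"]
    # Map the integer target (0, 1, 2) to the species name.
    df["species"] = df["target"].map(dict(enumerate(bunch.target_names)))
    return df.drop(columns="target")


iris = load_iris_frame()

print(iris.shape)                      # (number of rows, number of columns)
print(iris.columns.tolist())           # the four features plus the class label
print(iris["species"].value_counts())  # data points per class (checks the balance)
```

If the per-class counts printed by `value_counts()` are all equal, the dataset is balanced in the sense used above.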

2D Scatter Plot

Using the pandas DataFrame we created earlier, we can draw a simple 2D scatter plot by passing the features of interest as the x and y parameters of the pandas plot() method. The Matplotlib show() method then displays the plot.

Figure: 2D Scatter Plot
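A minimal sketch of such a plot, assuming the scikit-learn copy of the data and the hypothetical `load_iris_frame` helper (neither comes from the article), with sepal_length and sepal_width as the two features:

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris


def load_iris_frame() -> pd.DataFrame:
    """Return the IRIS data as a DataFrame with snake_case column names."""
    bunch = load_iris(as_frame=True)
    df = bunch.frame.copy()
    df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "target"]
    df["species"] = df["target"].map(dict(enumerate(bunch.target_names)))
    return df.drop(columns="target")


iris = load_iris_frame()

# Draw on an explicit Axes so the plot can be customised afterwards.
fig, ax = plt.subplots(figsize=(6, 4))
iris.plot(kind="scatter", x="sepal_length", y="sepal_width", ax=ax)
ax.set_title("Sepal length vs. sepal width")
# plt.show()  # uncomment when running as a script; notebooks render the figure automatically
```

All points share one color here, so the species cannot be told apart yet.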

With Seaborn, however, we can draw a more informative plot by color-coding the points by flower type.

Figure: 2D Scatter Plot using Seaborn
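One way to get the color-coded version is Seaborn's `scatterplot` with the `hue` argument. The article does not show which Seaborn call it used, so treat this as one possible sketch; the data loader is again an assumption based on scikit-learn.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_iris


def load_iris_frame() -> pd.DataFrame:
    """Return the IRIS data as a DataFrame with snake_case column names."""
    bunch = load_iris(as_frame=True)
    df = bunch.frame.copy()
    df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "target"]
    df["species"] = df["target"].map(dict(enumerate(bunch.target_names)))
    return df.drop(columns="target")


iris = load_iris_frame()

fig, ax = plt.subplots(figsize=(6, 4))
# hue="species" assigns one color per flower type and adds a legend.
sns.scatterplot(data=iris, x="sepal_length", y="sepal_width", hue="species", ax=ax)
# plt.show()  # uncomment when running as a script
```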

In the graph above, notice that the blue Setosa points can be separated from the orange Versicolor and green Virginica points simply by drawing a line, but the orange and green points are still hard to separate because they overlap. This is as much information as the sepal_length and sepal_width features give us.

2D Scatter Plot: Pair Plot

Seaborn's pair plot draws a 2D scatter plot for every possible pair of features in one go.

Figure: Pair Plot by Seaborn

Observing the pair plots, we can say that petal_length and petal_width are the most useful features for identifying the flower types. While Setosa is easily linearly separable, Virginica and Versicolor have some overlap. So we can build a simple model from a separating line and a few "if-else" conditions.

1D Scatter Plot, Histogram, PDF & CDF

Figure: 1D Scatter Plot of Petal-Length
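A 1D scatter plot can be produced by plotting each petal_length value against a constant y of zero. This is a sketch under the same assumptions as before (scikit-learn data, hypothetical `load_iris_frame` helper); the article's own plotting code is not shown.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris


def load_iris_frame() -> pd.DataFrame:
    """Return the IRIS data as a DataFrame with snake_case column names."""
    bunch = load_iris(as_frame=True)
    df = bunch.frame.copy()
    df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "target"]
    df["species"] = df["target"].map(dict(enumerate(bunch.target_names)))
    return df.drop(columns="target")


iris = load_iris_frame()

fig, ax = plt.subplots(figsize=(6, 2))
for name, group in iris.groupby("species"):
    # Every point sits on y = 0, so only the x position carries information.
    ax.plot(group["petal_length"], np.zeros(len(group)), "o", label=name)
ax.set_xlabel("petal_length")
ax.set_yticks([])
ax.legend()
# plt.show()  # uncomment when running as a script
```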

As the graph shows, the points overlap so much that it is hard to make sense of them. There are better ways to visualize a single feature. With Seaborn, we can plot a histogram together with a probability density function (PDF).

Histogram: A histogram shows the frequency count of the feature's values within each bin (the bars in the graph).

PDF: The probability density function is basically a smoothed histogram (the bell-shaped curve in the graph). It is estimated using kernel density estimation (KDE). For each value on the x-axis, the y-axis value is the estimated density at that value: the higher the curve, the more data points lie near that value. Strictly speaking, the height is a density rather than a probability; probabilities correspond to areas under the curve.

Figure: PDF & Histogram of petal_length
Figure: PDF & Histogram of petal_width
Figure: PDF & Histogram of sepal_length
Figure: PDF & Histogram of sepal_width

From these graphs, we can observe that a simple model can be built from just one feature with an if-else condition: if petal_length < 2.5, then the flower type is Setosa.
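The one-feature rule can be written down and checked directly. In this sketch the 2.5 threshold comes from the text; the function name `is_setosa` and the scikit-learn loader are illustrative assumptions.

```python
import pandas as pd
from sklearn.datasets import load_iris


def load_iris_frame() -> pd.DataFrame:
    """Return the IRIS data as a DataFrame with snake_case column names."""
    bunch = load_iris(as_frame=True)
    df = bunch.frame.copy()
    df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "target"]
    df["species"] = df["target"].map(dict(enumerate(bunch.target_names)))
    return df.drop(columns="target")


def is_setosa(petal_length: float, threshold: float = 2.5) -> bool:
    """Single if-else rule: short petals mean Setosa."""
    return petal_length < threshold


iris = load_iris_frame()

predicted = iris["petal_length"].apply(is_setosa)
actual = iris["species"] == "setosa"
# Fraction of flowers for which the rule agrees with the true label.
accuracy = (predicted == actual).mean()
print(f"Setosa-vs-rest accuracy of the rule: {accuracy:.2f}")
```

The rule only separates Setosa from the other two species; telling Versicolor from Virginica needs further conditions.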

Now, what if we need the percentage of Versicolor points with a petal_length of less than 5? Here the CDF comes to our rescue!

CDF: The cumulative distribution function is the cumulative sum of the PDF. Every point on the CDF curve is the integral of the PDF up to that point. Below is the histogram of the Yield. Every point on the CDF tells us what percentage of all the data points lie at or below that value.

To construct a histogram, the first step is to "bin" the range of values (that is, divide the entire range of values into a series of intervals) and then count how many values fall into each interval. The bins are usually specified as consecutive, non-overlapping intervals of a variable. The bins (intervals) must be adjacent and are often (but are not required to be) of equal size (for more information: https://www.datacamp.com/community/tutorials/histograms-matplotlib).
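The binning step and the cumulative sum can be sketched with `np.histogram` and `np.cumsum`. The choice of Setosa petal_length and of 10 bins is an assumption made for illustration, as is the scikit-learn loader.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris


def load_iris_frame() -> pd.DataFrame:
    """Return the IRIS data as a DataFrame with snake_case column names."""
    bunch = load_iris(as_frame=True)
    df = bunch.frame.copy()
    df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "target"]
    df["species"] = df["target"].map(dict(enumerate(bunch.target_names)))
    return df.drop(columns="target")


iris = load_iris_frame()
setosa = iris.loc[iris["species"] == "setosa", "petal_length"].to_numpy()

# Step 1: bin the range of values and count the points in each bin.
counts, bin_edges = np.histogram(setosa, bins=10)
# Step 2: turn counts into the fraction of points per bin (a binned PDF).
pdf = counts / counts.sum()
# Step 3: the CDF is the running total of those fractions.
cdf = np.cumsum(pdf)

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(bin_edges[1:], pdf, label="PDF (fraction per bin)")
ax.plot(bin_edges[1:], cdf, label="CDF")
ax.set_xlabel("petal_length")
ax.legend()
# plt.show()  # uncomment when running as a script
```

Each CDF value is plotted at the right edge of its bin, so it reads as "fraction of points up to this value".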

Now, by plotting the CDFs of petal_length for all the flower types together, we get an overall picture of the data.
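The combined picture, and the earlier question about Versicolor flowers with petal_length below 5, can be sketched as follows. This version uses the empirical CDF (sorted values against their rank) instead of binned counts, which is a substitution for illustration; the loader is an assumption as before.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris


def load_iris_frame() -> pd.DataFrame:
    """Return the IRIS data as a DataFrame with snake_case column names."""
    bunch = load_iris(as_frame=True)
    df = bunch.frame.copy()
    df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "target"]
    df["species"] = df["target"].map(dict(enumerate(bunch.target_names)))
    return df.drop(columns="target")


iris = load_iris_frame()

fig, ax = plt.subplots(figsize=(6, 4))
for name, group in iris.groupby("species"):
    values = np.sort(group["petal_length"].to_numpy())
    # Empirical CDF: fraction of points less than or equal to each value.
    ecdf = np.arange(1, len(values) + 1) / len(values)
    ax.step(values, ecdf, where="post", label=name)
ax.set_xlabel("petal_length")
ax.set_ylabel("fraction of flowers")
ax.legend()
# plt.show()  # uncomment when running as a script

# Reading a single point off the Versicolor curve, computed directly.
versicolor = iris.loc[iris["species"] == "versicolor", "petal_length"]
share_below_5 = (versicolor < 5).mean()
print(f"Versicolor flowers with petal_length < 5: {share_below_5:.0%}")
```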

Mean, Variance and Standard Deviation

Mean: https://en.wikipedia.org/wiki/Mean

Variance: https://en.wikipedia.org/wiki/Variance

Standard Deviation: https://en.wikipedia.org/wiki/Standard_deviation
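These three statistics can be computed per species with NumPy. The sketch below uses petal_length as the example feature; the loader and the `summary` dictionary are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris


def load_iris_frame() -> pd.DataFrame:
    """Return the IRIS data as a DataFrame with snake_case column names."""
    bunch = load_iris(as_frame=True)
    df = bunch.frame.copy()
    df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "target"]
    df["species"] = df["target"].map(dict(enumerate(bunch.target_names)))
    return df.drop(columns="target")


iris = load_iris_frame()

summary = {}
for name, group in iris.groupby("species"):
    x = group["petal_length"].to_numpy()
    summary[name] = {
        "mean": np.mean(x),
        "variance": np.var(x),  # population variance (ddof=0)
        "std": np.std(x),       # square root of the variance above
    }
    print(f"{name:<10} mean={summary[name]['mean']:.3f} "
          f"variance={summary[name]['variance']:.3f} std={summary[name]['std']:.3f}")
```

NumPy divides by n by default, while pandas' `var()` and `std()` divide by n - 1, so the two give slightly different numbers on the same data.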

Median, Percentile, Quantile, MAD, IQR

Median: https://en.wikipedia.org/wiki/Median

Percentile: https://en.wikipedia.org/wiki/Percentile

Quantile: https://en.wikipedia.org/wiki/Quantile

MAD (Median Absolute Deviation): https://en.wikipedia.org/wiki/Median_absolute_deviation

IQR (Interquartile Range): https://en.wikipedia.org/wiki/Interquartile_range
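The robust statistics listed above can be computed with plain NumPy. This sketch uses Versicolor petal_length as an arbitrary example and computes the unscaled MAD by hand; the loader and variable names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris


def load_iris_frame() -> pd.DataFrame:
    """Return the IRIS data as a DataFrame with snake_case column names."""
    bunch = load_iris(as_frame=True)
    df = bunch.frame.copy()
    df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "target"]
    df["species"] = df["target"].map(dict(enumerate(bunch.target_names)))
    return df.drop(columns="target")


iris = load_iris_frame()
x = iris.loc[iris["species"] == "versicolor", "petal_length"].to_numpy()

median = np.median(x)
# Quartiles are the 25th, 50th, and 75th percentiles.
q25, q50, q75 = np.percentile(x, [25, 50, 75])
p90 = np.percentile(x, 90)
# Interquartile range: spread of the middle 50% of the data.
iqr = q75 - q25
# Median absolute deviation: median distance from the median (unscaled).
mad = np.median(np.abs(x - median))

print(f"median={median:.2f} quartiles=({q25:.2f}, {q50:.2f}, {q75:.2f})")
print(f"90th percentile={p90:.2f} IQR={iqr:.2f} MAD={mad:.2f}")
```

Unlike the mean and standard deviation, the median, IQR, and MAD change little when a few extreme values are added to the data.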

Box Plots

A box plot with whiskers is another way to visualize a single feature more intuitively than a 1D scatter plot. The box in the graph represents the interquartile range: the bottom edge of the box marks the 25th percentile, the middle line the 50th percentile (median), and the top edge the 75th percentile. The black lines outside the boxes are called whiskers. There is no single convention for what the whiskers represent; in some cases the lower whisker ends at the minimum value of the feature and the upper whisker at the maximum.
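A box plot per species can be sketched with `sns.boxplot`. The choice of petal_length as the plotted feature is an assumption, as is the scikit-learn loader.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_iris


def load_iris_frame() -> pd.DataFrame:
    """Return the IRIS data as a DataFrame with snake_case column names."""
    bunch = load_iris(as_frame=True)
    df = bunch.frame.copy()
    df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "target"]
    df["species"] = df["target"].map(dict(enumerate(bunch.target_names)))
    return df.drop(columns="target")


iris = load_iris_frame()

fig, ax = plt.subplots(figsize=(6, 4))
# One box per species: box edges are the quartiles, the inner line is the median.
sns.boxplot(data=iris, x="species", y="petal_length", ax=ax)
# plt.show()  # uncomment when running as a script
```

By default Seaborn extends each whisker to the most extreme point within 1.5 times the IQR of the box and draws anything beyond that as an individual point; the `whis` parameter changes this.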

Violin Plots

Seaborn's violin plot combines a PDF and a box plot. In a violin plot of petal_length, the PDF forms the sides of each of the three colored shapes, and the black bar in the center of each shape is the box plot.
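A violin plot of petal_length per species can be sketched with `sns.violinplot`, under the same assumptions about the data loader as in the earlier examples.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_iris


def load_iris_frame() -> pd.DataFrame:
    """Return the IRIS data as a DataFrame with snake_case column names."""
    bunch = load_iris(as_frame=True)
    df = bunch.frame.copy()
    df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "target"]
    df["species"] = df["target"].map(dict(enumerate(bunch.target_names)))
    return df.drop(columns="target")


iris = load_iris_frame()

fig, ax = plt.subplots(figsize=(6, 4))
# Each violin mirrors the species' KDE on both sides of a miniature box plot.
sns.violinplot(data=iris, x="species", y="petal_length", ax=ax)
# plt.show()  # uncomment when running as a script
```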

Multivariate Probability Density: Contour Plot

Seaborn provides the jointplot() method for contour plots. It is called a "joint plot" because it shows the contours of the joint distribution together with the PDFs of the individual features along the edges. The darker the region, the higher the probability of the corresponding feature values occurring.

Translated from: https://medium.com/swlh/exploratory-data-analysis-of-iris-dataset-2ab58e1a5dc6
