Data Visualization in Data Science

Introduction to Data Visualization

Data visualization is the process of creating interactive visuals to understand trends and variations and to derive meaningful insights from data. It is used mainly for data checking and cleaning, exploration and discovery, and communicating results to business stakeholders. Many data scientists pay little attention to graphs and focus only on numerical calculations, which at times can be misleading. To understand the importance of visualization, let's take a look at Anscombe's quartet in Figures 1 and 2 below.

Figure 1. Anscombe's quartet, showing how pairs of X and Y can have different values yet share nearly identical central tendency and correlation values. Data credits: Anscombe, Francis J. (1973).

The same data points, when plotted in Figure 2 below, depict altogether different trends.

Figure 2. Illustrates how four datasets that look nearly identical when examined using simple summary statistics vary considerably when graphed. Image credits: Anscombe, Francis J. (1973).

It is important to visualize the data before any calculations are carried out. A visual representation can convey much more information than descriptive statistics alone.
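As a minimal sketch of this point, the snippet below hard-codes the widely published values of Anscombe's quartet and computes the mean, variance, and Pearson correlation of each dataset using only the Python standard library. The helper names (pearson, summarize, plot_quartet) are illustrative and not part of the original article, and plot_quartet assumes matplotlib is installed.

```python
from statistics import mean, variance

# Anscombe's quartet (Anscombe, 1973). Datasets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def summarize(xs, ys):
    """Summary statistics that look almost the same for all four datasets."""
    return {
        "mean_x": mean(xs),
        "mean_y": mean(ys),
        "var_x": variance(xs),
        "var_y": variance(ys),
        "r": pearson(xs, ys),
    }


SUMMARY = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}


def plot_quartet():
    """Scatter-plot the four datasets side by side (assumes matplotlib)."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    fig, axes = plt.subplots(2, 2, sharex=True, sharey=True)
    for ax, (name, (xs, ys)) in zip(axes.flat, QUARTET.items()):
        ax.scatter(xs, ys)
        ax.set_title(f"Dataset {name}")
    plt.show()
```

Printing SUMMARY shows four nearly indistinguishable rows of numbers, while calling plot_quartet() reveals a linear trend, a curve, a line with one outlier, and a vertical cluster with one extreme point.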

Role of Data Visualization

Multiple business intelligence (BI) tools currently dominate the market, each with its pros and cons. The concept of self-service dashboards was devised to allow stakeholders with little or no knowledge of data science to work independently on data and derive findings that might assist their day-to-day business decisions. The examples below look at some applications of data visualization using Tableau or Python.

Data Checking and Cleaning

Data visualization can be used to look for obvious errors in a dataset, including nulls, random values, distinct records, date formats, the plausibility of spatial data, and string and character encoding.

Figure 3. Illustrates the distribution of pedestrian volume in Melbourne captured by different sensors situated in and around the CBD. The idea is to check whether the latitude and longitude information in the dataset is valid. The image was developed by the author using Tableau.
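The check behind Figure 3 can also be approximated in code before anything is plotted. The sketch below flags records whose coordinates are missing, fall outside the valid latitude/longitude ranges, or fall outside a rough bounding box around Melbourne. The sensor records, field names, and bounding box are assumptions made up for illustration; they are not taken from the author's dataset.

```python
# Hypothetical sensor records; field names and values are illustrative only.
SENSORS = [
    {"sensor": "A", "lat": -37.8136, "lon": 144.9631},
    {"sensor": "B", "lat": -37.8183, "lon": 144.9671},
    {"sensor": "C", "lat": 144.9600, "lon": -37.8100},  # lat/lon swapped
    {"sensor": "D", "lat": None, "lon": 144.9700},      # missing latitude
]

# Rough bounding box around greater Melbourne (an assumption for this sketch).
LAT_RANGE = (-38.5, -37.4)
LON_RANGE = (144.3, 145.6)


def coordinate_problem(record):
    """Return a short description of the problem, or None if the record looks fine."""
    lat, lon = record["lat"], record["lon"]
    if lat is None or lon is None:
        return "missing coordinate"
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return "outside valid latitude/longitude range"
    in_box = (LAT_RANGE[0] <= lat <= LAT_RANGE[1]
              and LON_RANGE[0] <= lon <= LON_RANGE[1])
    if not in_box:
        return "outside expected area"
    return None


PROBLEMS = {r["sensor"]: coordinate_problem(r) for r in SENSORS}
```

A map makes the same errors obvious at a glance (a sensor plotted in the ocean or on another continent), which is why the two approaches complement each other.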

Data Distribution

Data visualization can be used to understand the distribution of the data, look for central tendencies (mean, median, and mode), detect outliers using a box plot, check for skewness, and even understand the impact of winsorization on the data distribution. Figure 4 below illustrates how box plots can be used to reveal outliers.

Figure 4. Displays outliers in pedestrian volume across sensors installed in various parts of Melbourne. The dataset used for this analysis can be found here. The image was developed by the author using Jupyter Notebook.
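A box plot conventionally marks points lying more than 1.5 times the interquartile range (IQR) beyond the quartiles as outliers. The following sketch applies that rule with the standard library; the function names and the sample counts are hypothetical, and plot_box assumes matplotlib is installed.

```python
from statistics import quantiles


def boxplot_outliers(values, whisker=1.5):
    """Return the values lying beyond whisker * IQR from the quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical hourly pedestrian counts for one sensor.
HOURLY_COUNTS = [10, 12, 11, 13, 12, 14, 11, 95]
OUTLIERS = boxplot_outliers(HOURLY_COUNTS)


def plot_box(values, whisker=1.5):
    """Draw the corresponding box plot (assumes matplotlib)."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    fig, ax = plt.subplots()
    ax.boxplot(values, whis=whisker)
    ax.set_ylabel("Pedestrian count")
    plt.show()
```

Here the single count of 95 is flagged, and the same point appears as an isolated marker above the upper whisker when plot_box(HOURLY_COUNTS) is called.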

Model Assumptions

Linear regression and many other statistical models rest on underlying assumptions, such as approximately normally distributed residuals, no strong correlation between the independent variables, homoscedasticity of the error terms, and more. Visualizations are therefore key to validating some of these assumptions as well.

Figure 5. Illustrates the correlation plot of numerical variables as a heat map. The correlation plot was used to drop highly correlated variables while building a classification model to predict customer satisfaction from flight and facilities data. The image was developed by the author using Jupyter Notebook.
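The decision that a correlation heat map supports can be sketched numerically: compute the pairwise Pearson correlations and list the pairs whose absolute value exceeds a threshold. The column names, values, and the 0.9 threshold below are assumptions for illustration, not the author's data or cutoff.

```python
from itertools import combinations
from statistics import mean

# Hypothetical flight data; column names and values are illustrative only.
FLIGHTS = {
    "departure_delay": [0, 5, 10, 20, 40, 60],
    "arrival_delay": [2, 4, 12, 18, 43, 58],
    "flight_distance": [500, 1200, 300, 2500, 800, 1500],
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def highly_correlated_pairs(columns, threshold=0.9):
    """List column pairs whose absolute correlation exceeds the threshold."""
    return [
        (a, b)
        for a, b in combinations(columns, 2)
        if abs(pearson(columns[a], columns[b])) > threshold
    ]


HIGH_PAIRS = highly_correlated_pairs(FLIGHTS)
```

In this toy data only the two delay columns are flagged, so one of them would be a candidate to drop. With pandas and Seaborn installed, the same matrix can be drawn by passing the result of DataFrame.corr() to seaborn.heatmap.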

Human-in-the-Loop Analytics

Data scientists often use human-in-the-loop analytics to get a look and feel for the data, form a hypothesis, run appropriate analytics to validate it, and repeat the process until conclusive evidence is found. For example, Seaborn, a very popular Python package, has a function called pairplot. Pair plots are very useful for examining the relationships between dependent and independent variables. The idea of the visualization is to get a directional sense of whether some of the independent variables affect the model results.

Figure 6. Illustrates the pair plot of a dependent variable (say, customer satisfaction of airline passengers) against independent variables such as flight distance, arrival delay, and departure delay. The image was developed by the author using Jupyter Notebook.
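A pair plot like the one in Figure 6 might be produced along the following lines. This is a sketch rather than the author's code: the passenger records and column names are invented, and draw_pair_plot assumes pandas, Seaborn, and matplotlib are installed.

```python
# Hypothetical passenger records; column names and values are illustrative only.
PASSENGERS = [
    {"satisfaction": "satisfied", "flight_distance": 2100, "departure_delay": 0, "arrival_delay": 3},
    {"satisfaction": "satisfied", "flight_distance": 1500, "departure_delay": 5, "arrival_delay": 0},
    {"satisfaction": "satisfied", "flight_distance": 3200, "departure_delay": 12, "arrival_delay": 9},
    {"satisfaction": "dissatisfied", "flight_distance": 400, "departure_delay": 45, "arrival_delay": 52},
    {"satisfaction": "dissatisfied", "flight_distance": 900, "departure_delay": 80, "arrival_delay": 75},
    {"satisfaction": "dissatisfied", "flight_distance": 650, "departure_delay": 30, "arrival_delay": 41},
]


def draw_pair_plot(records, target="satisfaction"):
    """Plot every numeric column against every other, colored by the target."""
    # Imported lazily so the data above can be inspected without these packages.
    import matplotlib.pyplot as plt
    import pandas as pd
    import seaborn as sns

    frame = pd.DataFrame(records)
    grid = sns.pairplot(frame, hue=target)  # hue colors points by the dependent variable
    plt.show()
    return grid
```

Calling draw_pair_plot(PASSENGERS) draws a grid of scatter plots for the numeric columns. Coloring by the dependent variable is what gives the directional sense described above, since clusters that separate by color suggest a variable worth testing further.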

Dimension Reduction

When working with many variables, it is difficult to visualize the data in an n-dimensional space. For example, in a dataset with many (say, numerical) customer attributes, it is difficult to plot the customers with all attributes considered at once. In scenarios like this, dimension reduction techniques such as Principal Component Analysis (PCA) or factor analysis can be useful for bringing the attributes down to fewer dimensions. PCA finds linear combinations of the variables that capture as much of the variance in the observations as possible, whereas factor analysis looks for underlying factors that best explain the relationships between the variables. The reduced dimensions can then be plotted to analyze the customers in a 2D space.
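To make the PCA step concrete, here is a minimal sketch that projects a small table of customer attributes onto its first two principal components using NumPy's singular value decomposition. The customer values and the function name pca_2d are hypothetical, and standardizing each attribute first is a choice made for this sketch (it assumes no attribute is constant).

```python
import numpy as np

# Hypothetical customers: age, annual income, visits per month, average basket.
CUSTOMERS = np.array([
    [25, 40000, 4, 35.0],
    [32, 52000, 6, 48.5],
    [47, 88000, 2, 120.0],
    [51, 93000, 1, 150.0],
    [23, 31000, 8, 22.0],
    [38, 67000, 3, 80.0],
])


def pca_2d(data):
    """Project the rows of data onto the first two principal components."""
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)
    # Standardize so attributes on large scales (income) do not dominate.
    scaled = centered / centered.std(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, singular_values, vt = np.linalg.svd(scaled, full_matrices=False)
    projected = scaled @ vt[:2].T
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    return projected, explained[:2]


PROJECTED, EXPLAINED = pca_2d(CUSTOMERS)
```

PROJECTED holds one (x, y) pair per customer, ready for a scatter plot, and EXPLAINED gives the share of total variance captured by each of the two axes.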

More information on how to recreate these charts in Python can be found here.

Types of Datasets in Analytical Problems

It is important to understand the type of dataset in order to determine the type of visualization that can be applied. For example, a combination of bar graphs and line charts might be useful for tabular data, whereas for spatial data a map with a density plot might communicate the result more effectively. Before taking a deeper look at the types of visualization, let's review some of the key dataset types that are commonly used.

Tabular data

Data organized in tables, with a row for each data item and a column for each of its attributes. Examples include datasets stored in Excel, CSV files, and pandas DataFrames.

Network data

Nodes in the network are data items, and links between the nodes are the relations between them. A social network is one example.

Spatial data

Data that is naturally organized and understood in terms of its spatial location or extent, for example the latitude and longitude of locations, geographic information, suburbs, and streets.

Textual data

This kind of dataset consists of sequences of words and punctuation, for example a Twitter feed or customer complaints.

Visual Vocabulary

The figures below show how different visualizations can be used to depict different scenarios in the data.

Figure 7. Illustrates some of the graphs useful for visualizing deviations from reference points. Image credits: Github.io
Figure 8. Illustrates some of the graphs useful for visualizing the correlation between multiple data points. Image credits: Github.io
Figure 9. Illustrates how visualizations can be used to understand how attributes vary over time. Image credits: Github.io
Figure 10. Illustrates how different visualizations can be used to understand the ranking or order of different components. Image credits: Github.io

You can find examples of other visualizations here.

Effectiveness of Visualization across Data Types

The table below displays the effectiveness of different visuals across data types. To read it, we first need to understand how variables (attributes in the data) can be categorized into different data types. Categorical variables are the ones that don't have any ordering, e.g. Gender, Grades, Marital Status, Job Position, etc. Variables that do have an ordering are divided into ordinal and quantitative variables. Ordinal variables are categories that can be ranked, e.g. Satisfaction (Good, Average, and Bad), Potential (High, Medium, and Low), etc. Quantitative variables are the ones that can take any numeric value between -infinity and +infinity, e.g. Age, Salary, Revenue, Sales, etc.

Figure 11. Illustrates how different graphs can be used to visualize patterns in the data, taking into consideration the data type of the variable. Image credits: developed by the author using PowerPoint.
Figure 12. Illustrates the types of visualization that can be used for different data types. Image credits: developed by the author using Excel.

Conclusion

Data visualization forms the backbone of all analytical projects. It not only helps in gaining insights into the data but can also be used as a tool for data pre-processing. Having the right set of visualizations for different data types and business scenarios is the key to communicating results effectively.

About the Author: Advanced analytics professional and management consultant helping companies find solutions to diverse problems through a mix of business, technology, and math applied to organizational data. A data science enthusiast, here to share, learn, and contribute. You can connect with me on LinkedIn and Twitter.

Translated from: https://towardsdatascience.com/data-visualization-in-data-science-5681cbdde5bf
