Visualising Data to Help with Data Cleaning

The role of a data scientist involves uncovering hidden relationships within massive amounts of structured or unstructured data in order to meet or adjust certain business criteria. In recent times the importance of this role has grown considerably as businesses look to expand their insight into the market and their customers using easily obtainable data.

It is the data scientist's job to take that data and return a deeper understanding of the business problem or opportunity. This often involves scientific methods, including machine learning (ML) and neural networks (NN). While these models can find meaning in thousands of data points much faster than a human can, they become unreliable if the data fed into them is messy.

Messy data can have very negative consequences for your models. It takes many forms, including:

Missing data:

Missing values are represented as ‘NaN’ (short for Not a Number) or as ‘None’, a Python singleton object.

Sometimes the best way to deal with problems is the simplest.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('train.csv')
df.info()

A quick inspection of the output shows that the non-null count is not 891 for every column: age has 714 non-null values, cabin 204 and embarked 889, a clear sign of missing information. We also notice that some fields are of type “object”; we'll look at that next.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
survived 891 non-null int64
pclass 891 non-null int64
name 891 non-null object
sex 891 non-null object
age 714 non-null float64
sibsp 891 non-null int64
parch 891 non-null int64
ticket 891 non-null object
fare 891 non-null float64
cabin 204 non-null object
embarked 889 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 76.7+ KB

Alternatively, you can plot the missing values on a heatmap using seaborn, but this can be very time-consuming with large dataframes.

sns.heatmap(df.isnull(), cbar=False)
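When a heatmap is too slow for a large dataframe, a per-column count gives similar information in tabular form. The sketch below is one possible way to do that; the helper name missing_summary is an illustrative assumption, and the sample values are invented rather than taken from train.csv.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and share of missing values per column, worst first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "share": counts / len(frame),
    })
    return summary.sort_values("missing", ascending=False)

# Invented sample data; column names only mirror the dataset above
sample = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan],
    "cabin": [None, "C85", None, None],
    "fare": [7.25, 71.28, 7.92, 8.05],
})

print(missing_summary(sample))
```

Sorting by the missing count puts the columns that need the most attention at the top, which is handy when there are many columns.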

Inconsistent data:

  • Inconsistent column types: Columns in a dataframe can differ in type, as we saw above: objects, integers or floats. That is normal, but a mismatch between a column's type and the type of the values it holds can be problematic. One of the most important types to get right is datetime, used for time and date values.
  • Inconsistent value formatting: This type of problem mainly arises in categorical values that contain misspellings or typos. It can be checked with the following:
df['age'].value_counts()

This returns the number of times each distinct value occurs in the column.
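Both kinds of inconsistency can often be repaired with a few pandas calls. The following is a minimal sketch on invented data, assuming the dates are meant to be in YYYY-MM-DD form and that category labels differ only in whitespace and case; the column names are illustrative.

```python
import pandas as pd

# Invented sample data for illustration
raw = pd.DataFrame({
    "joined": ["2020-01-15", "2020-02-30", "not a date"],
    "fare": ["7.25", "71.28", "unknown"],
    "embarked": ["S", " s", "C "],
})

clean = raw.copy()
# Unparseable entries become NaT / NaN instead of raising an error
clean["joined"] = pd.to_datetime(clean["joined"], format="%Y-%m-%d", errors="coerce")
clean["fare"] = pd.to_numeric(clean["fare"], errors="coerce")
# Trim whitespace and unify case so 'S' and ' s' count as one category
clean["embarked"] = clean["embarked"].str.strip().str.upper()

print(clean.dtypes)
print(clean["embarked"].value_counts())
```

Coercing errors rather than raising turns bad entries into missing values, so they show up in the same missing-data checks used earlier.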

Outlier data:

A dataframe column holds information about a specific feature of the data, so we usually have a basic idea of the range its values should fall in. For age, for example, we expect values between 0 and 100. This does not mean that outliers cannot occur within that range.

A simple illustration of this can be seen by plotting a boxplot:

sns.boxplot(x=df['age'])
plt.show()

The values shown as dots on the right-hand side could be considered outliers in this dataframe, as they fall outside the range of commonly observed values.
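The dots in a boxplot are conventionally the points lying more than 1.5 times the interquartile range (IQR) beyond the quartiles, so the same values can be listed numerically. Here is a small sketch of that rule; the function name iqr_outliers and the sample ages are assumptions made for illustration.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying more than k * IQR outside the quartiles."""
    values = values.dropna()
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Invented ages; only the last one sits far from the rest
ages = pd.Series([22, 25, 28, 30, 31, 35, 38, 40, 80], name="age")
print(iqr_outliers(ages))
```

Whether a flagged value is a genuine error or simply a rare but valid observation still has to be judged case by case.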

Multicollinearity:

Multicollinearity is not messy data as such; it simply means that columns or features in the dataframe are correlated with one another. For example, if you had a column for “price”, a column for “weight” and a third for “price per weight”, you would expect high multicollinearity between these fields. This can be solved by dropping some of the highly correlated columns.

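f, ax = plt.subplots(figsize=(10, 8))
# Restrict to numeric columns; newer pandas versions raise on object columns
corr = df.corr(numeric_only=True)
# np.bool was removed from NumPy; the built-in bool is equivalent here
sns.heatmap(corr, mask=np.zeros_like(corr, dtype=bool), cmap=sns.diverging_palette(220, 10, as_cmap=True), square=True, ax=ax)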

In this case we can see that no correlation exceeds 0.7 in either the positive or the negative direction, so it can be considered safe to continue.
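Reading the 0.7 threshold off a colour scale can be error-prone, so it may help to list the offending pairs directly. The sketch below does that on invented data; correlated_pairs is a hypothetical helper, and the numeric_only argument assumes pandas 1.5 or later.

```python
import pandas as pd

def correlated_pairs(frame: pd.DataFrame, threshold: float = 0.7):
    """List (col_a, col_b, r) for numeric column pairs with |r| above threshold."""
    corr = frame.corr(numeric_only=True)
    cols = corr.columns
    pairs = []
    # Walk the upper triangle only, so each pair is reported once
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs

# Invented data: weight tracks price closely, noise does not
demo = pd.DataFrame({
    "price": [10.0, 20.0, 30.0, 40.0, 50.0],
    "weight": [1.0, 2.1, 2.9, 4.2, 5.0],
    "noise": [3.0, -1.0, 4.0, -1.0, 2.0],
})
print(correlated_pairs(demo))
```

An empty list corresponds to the situation described above, where no pair crosses the threshold.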

Making this process easier:

Data scientists go through these initial tasks repeatedly, so the process can be made easier by creating structured functions that allow this information to be visualised easily. Let's try:

# ----------------------------------------------------------------
from quickdata import data_viz # File found in repository
# ----------------------------------------------------------------
import pandas as pd
from sklearn.datasets import fetch_california_housing

data = fetch_california_housing()
print(data['DESCR'][:830])
X = pd.DataFrame(data['data'], columns=data['feature_names'])
y = data['target']

1-Checking Multicollinearity

The function below returns a heatmap of collinearity between independent variables as well as with the target variable.

data = independent variable df X

target = dependent variable list y

remove = list of variables not to be included (default is an empty list)

add_target = boolean controlling whether the heatmap includes the target (default is False)

inplace = modify your df to keep the changes made with remove/add_target (default is False)

*If remove is passed a column name, a regplot of that column against the target is also shown to help you review the change before proceeding.*

data_viz.multicollinearity_check(data=X, target=y, remove=['Latitude'], add_target=False, inplace=False)
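The repository code is not reproduced here, so the following is only a rough sketch of how the remove and add_target options could be combined before a heatmap is drawn. The helper correlation_frame is hypothetical and is not the quickdata implementation; the sample values are invented.

```python
import pandas as pd

def correlation_frame(data, target, remove=(), add_target=False):
    """Correlation matrix of `data` without the `remove` columns,
    optionally with the target appended as an extra column.
    Hypothetical helper, not the quickdata implementation."""
    frame = data.drop(columns=list(remove))  # returns a copy; `data` is untouched
    if add_target:
        frame = frame.assign(target=list(target))
    return frame.corr()

# Invented values; column names follow the California housing features
X_demo = pd.DataFrame({
    "MedInc": [8.3, 7.2, 5.6, 3.8, 4.0],
    "HouseAge": [41.0, 21.0, 52.0, 52.0, 30.0],
    "Latitude": [37.88, 37.86, 37.85, 37.85, 37.84],
})
y_demo = [4.5, 3.6, 3.5, 3.4, 3.0]

print(correlation_frame(X_demo, y_demo, remove=["Latitude"], add_target=True))
```

Working on a copy mirrors the inplace=False default described above: the original dataframe is only changed when the caller explicitly asks for it.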

2-Viewing Outliers

This function returns a side-by-side view of outliers through a regplot and a boxplot of the input data and target values over a specified split size.

data = independent variable df X

target = dependent variable list y

split_size = the number of rows to plot, given either as a decimal between 0 and 1 or as an integer

data_viz.view_outliers(data=X, target=y, split_size=0.3)
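A split size that accepts both decimals and integers has to be turned into a row count somewhere. One plausible reading is sketched below: a float is treated as a fraction of the rows and an integer as an absolute count. The function rows_to_plot and its exact rules are assumptions for illustration, not the behaviour of the repository code.

```python
def rows_to_plot(n_rows: int, split_size) -> int:
    """Translate a split size into a row count.
    A float in (0, 1] is read as a fraction of the rows; an int as an
    absolute count, capped at n_rows. Hypothetical interpretation."""
    if isinstance(split_size, bool):
        raise TypeError("split_size must be a number, not a bool")
    if isinstance(split_size, float):
        if not 0 < split_size <= 1:
            raise ValueError("fractional split_size must lie in (0, 1]")
        # round() avoids losing a row to floating-point truncation
        return max(1, round(n_rows * split_size))
    if isinstance(split_size, int):
        if split_size < 1:
            raise ValueError("integer split_size must be at least 1")
        return min(split_size, n_rows)
    raise TypeError("split_size must be a float or an int")

print(rows_to_plot(1000, 0.3))
print(rows_to_plot(1000, 500))
```

Plotting only a subset keeps the regplot and boxplot responsive on large dataframes, at the cost of possibly missing outliers in the rows left out.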

It is important that these charts are read by the data scientist and not handed off to the machine. Since not all datasets follow the same rules, a human needs to interpret the visualisations and act accordingly.

I hope this short run-through of data visualisation helps you produce clearer visualisations of your data and make better-informed decisions when cleaning it.

The functions used in the example above are available here:

Feel free to customise these as you see fit!

Translated from: https://medium.com/@rani_64949/visualisations-of-data-for-help-in-data-cleaning-dce15a94b383
