Data Visualization, Information Visualization
The role of a data scientist involves uncovering hidden relationships within massive amounts of structured or unstructured data, with the aim of meeting or adjusting certain business criteria. In recent times the importance of this role has grown considerably as businesses look to expand their insight into the market and their customers using easily obtainable data.
It is the data scientist's job to take that data and return a deeper understanding of the business problem or opportunity. This often involves scientific methods such as machine learning (ML) or neural networks (NN). While these models can find meaning in thousands of data points much faster than a human can, they can be unreliable if the data fed into them is messy.
Messy data can have very negative consequences for your models. It comes in many forms, including:
Missing data:
Missing values are represented as 'NaN' (short for "Not a Number") or as 'None', a Python singleton object.
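As a small illustration of these two representations, the sketch below uses made-up values (not the dataset from this article) to show that pandas treats both `NaN` and `None` as missing:

```python
import numpy as np
import pandas as pd

# Toy values for illustration only
numbers = pd.Series([1.0, np.nan, None])
labels = pd.Series(['a', None, np.nan], dtype=object)

# isnull() flags both NaN and None as missing
print(numbers.isnull().tolist())
print(labels.isnull().tolist())

# In a numeric column, None is stored as NaN, so the dtype stays float
print(numbers.dtype)
```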
Sometimes the best way to deal with problems is the simplest.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('train.csv')
df.info()
A quick inspection of the output shows that the non-null count is not 891 for every column (age has 714, cabin 204 and embarked 889), a clear sign of missing information. We also notice that some fields are of type "object"; we'll look at that next.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
survived 891 non-null int64
pclass 891 non-null int64
name 891 non-null object
sex 891 non-null object
age 714 non-null float64
sibsp 891 non-null int64
parch 891 non-null int64
ticket 891 non-null object
fare 891 non-null float64
cabin 204 non-null object
embarked 889 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 76.7+ KB
Alternatively, you can plot the missing values on a heatmap using seaborn, but this can be very time consuming when handling big dataframes.
sns.heatmap(df.isnull(), cbar=False)
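Where the heatmap is too slow, a per-column count of missing values gives much the same information without plotting. The sketch below is a minimal example on a made-up frame; the column names and values are stand-ins, not the real dataset:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    'age': [22.0, np.nan, 35.0, np.nan],
    'cabin': [None, 'C85', None, None],
    'fare': [7.25, 71.28, 8.05, 53.10],
})

# Number of missing values per column
missing_count = toy.isnull().sum()
# Share of missing values per column, as a percentage
missing_pct = toy.isnull().mean() * 100

print(pd.DataFrame({'missing': missing_count, 'percent': missing_pct}))
```

The same two lines should work on `df` directly, replacing `toy`.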
Inconsistent data:
- Inconsistent column types: Columns in a dataframe can have different types, such as objects, integers or floats, as we saw above. This is normal, but a mismatch between a column's type and the type of the values it holds can be problematic. One of the most important types is datetime, used for time and date values.
- Inconsistent value formatting: This type of problem mainly arises in categorical values when misspellings or typos are present. It can be checked with the following:
df['age'].value_counts()

This returns the number of times each value occurs in the dataset.
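To make both kinds of inconsistency concrete, here is a minimal sketch on made-up values. The `embarked` and `boarded` columns and their contents are assumptions for illustration; the clean-up steps are one possible approach, not a prescribed one:

```python
import pandas as pd

# Toy values with inconsistent casing and whitespace (made up)
toy = pd.DataFrame({
    'embarked': ['S', 's', ' S', 'C', 'Q', 'c'],
    'boarded': ['1912-04-10', '1912-04-10', '1912-04-10',
                '1912-04-10', '1912-04-11', 'unknown'],
})

# Raw counts expose the variants of the same category
print(toy['embarked'].value_counts())

# Normalising whitespace and case merges them
cleaned = toy['embarked'].str.strip().str.upper()
print(cleaned.value_counts())

# Dates stored as text have dtype object; errors='coerce' turns
# unparseable entries into NaT instead of raising an error
dates = pd.to_datetime(toy['boarded'], errors='coerce')
print(dates.dtype, dates.isnull().sum())
```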
Outlier data:
A dataframe column holds information about a specific feature within the data, so we usually have a basic idea of the range of its values. Take age, for example: we expect values between 0 and 100. This does not mean that outliers cannot be present within that range.
A simple way to illustrate this is to draw a boxplot:
sns.boxplot(x=df['age'])
plt.show()

The values shown as dots on the right-hand side can be considered outliers in this dataframe, as they fall outside the range of commonly observed values.
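With seaborn's default whisker setting, the dots on a boxplot are the points lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that same rule numerically to made-up ages. It is one common convention for flagging outliers, not the only one:

```python
import pandas as pd

# Made-up ages; 80 and 95 sit far from the bulk of the values
ages = pd.Series([22, 25, 27, 28, 30, 31, 33, 35, 38, 80, 95])

# Quartiles and interquartile range
q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers
outliers = ages[(ages < lower) | (ages > upper)]
print(outliers.tolist())
```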
Multicollinearity:
Multicollinearity is not messy data as such; it simply means that columns or features in the dataframe are correlated. For example, if you had a column for "price", a column for "weight" and a third for "price per weight", we would expect high multicollinearity between these fields. This can be addressed by dropping some of the highly correlated columns.
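f, ax = plt.subplots(figsize=(10, 8))
# numeric_only=True skips the object columns; the built-in bool replaces np.bool, which newer NumPy versions no longer provide
corr = df.corr(numeric_only=True)
sns.heatmap(corr, mask=np.zeros_like(corr, dtype=bool), cmap=sns.diverging_palette(220, 10, as_cmap=True), square=True, ax=ax)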

In this case we can see that the values do not exceed 0.7 in either direction, so it can be considered safe to continue.
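For wide dataframes it can be easier to list the offending pairs than to read them off a heatmap. The sketch below does this on made-up columns, reusing the 0.7 cut-off mentioned above; the threshold is a judgment call rather than a fixed rule:

```python
import numpy as np
import pandas as pd

# Made-up features: price rises almost linearly with weight
toy = pd.DataFrame({
    'weight': [1, 2, 3, 4, 5, 6],
    'price': [2.1, 3.9, 6.2, 8.1, 9.8, 12.2],
    'rating': [3, 1, 4, 1, 5, 2],
})

corr = toy.corr()
threshold = 0.7

# Keep only the upper triangle so each pair appears once
# and the diagonal (always 1.0) is skipped
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()

# Pairs whose absolute correlation exceeds the threshold
high = pairs[pairs.abs() > threshold]
print(high)
```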
Making this process easier:
Data scientists often go through these initial tasks repeatedly. The work can be made easier by creating structured functions that allow easy visualisation of this information. Let's try:
----------------------------------------------------------------
from quickdata import data_viz # File found in repository
----------------------------------------------------------------
from sklearn.datasets import fetch_california_housing

data = fetch_california_housing()
print(data['DESCR'][:830])
X = pd.DataFrame(data['data'], columns=data['feature_names'])
y = data['target']
1- Checking Multicollinearity:
The function below returns a heatmap of collinearity between independent variables as well as with the target variable.
data = independent variable df X
target = dependent variable list y
remove = list of variables not to be included (default: empty list)
add_target = boolean controlling whether the heatmap includes the target (default: False)
inplace = modify your df to save the changes made with remove/add_target (default: False)
*If remove is passed a column name, a regplot of that column against the target is also shown, to help you view the changes before proceeding.*
data_viz.multicollinearity_check(data=X, target=y, remove=['Latitude'], add_target=False, inplace=False)
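The repository code is not reproduced here, so the following is only a sketch of the table such a heatmap could be drawn from, assuming `remove` and `add_target` behave as described above. `correlation_table` is a hypothetical helper, not part of `quickdata`, and the data is made up:

```python
import pandas as pd

def correlation_table(data, target, remove=None, add_target=False):
    """Hypothetical helper: correlation matrix after dropping `remove`,
    optionally with the target appended as an extra column."""
    # drop() returns a copy, so the caller's frame is left untouched
    frame = data.drop(columns=remove or [])
    if add_target:
        frame = frame.assign(target=list(target))
    return frame.corr()

# Made-up stand-ins for X and y
X_toy = pd.DataFrame({
    'rooms': [3, 4, 5, 6, 7],
    'age': [40, 35, 20, 15, 5],
    'latitude': [34.1, 36.5, 33.9, 37.2, 35.0],
})
y_toy = [1.0, 1.4, 2.1, 2.4, 3.0]

without_target = correlation_table(X_toy, y_toy, remove=['latitude'])
with_target = correlation_table(X_toy, y_toy, remove=['latitude'], add_target=True)
print(with_target.round(2))
```

Passing the resulting table to `sns.heatmap` would give a plot comparable to the one shown earlier in this article.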


2- Viewing Outliers:
This function returns a side-by-side view of outliers through a regplot and a boxplot of the input data and target values over a specified split size.
data = independent variable df X
target = dependent variable list y
split_size = the number of rows to plot, given as a fraction between 0 and 1 or as an integer
data_viz.view_outliers(data=X, target=y, split_size=0.3)
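How a split size given either as a fraction or as an integer might be turned into a number of rows can be sketched as follows. `take_rows` is a hypothetical helper, and taking the first rows (rather than sampling) is an assumption; the repository function may do this differently:

```python
import pandas as pd

def take_rows(data, split_size):
    """Hypothetical helper: a float in (0, 1) is read as a fraction of
    the rows, an int as an absolute number of rows."""
    if isinstance(split_size, float):
        if not 0 < split_size < 1:
            raise ValueError('a float split_size must lie between 0 and 1')
        n_rows = int(len(data) * split_size)
    else:
        # Never ask for more rows than the frame holds
        n_rows = min(int(split_size), len(data))
    # Assumption: take the first rows; the original may sample differently
    return data.iloc[:n_rows]

toy = pd.DataFrame({'value': range(10)})
print(len(take_rows(toy, 0.3)), len(take_rows(toy, 4)))
```

Plotting fewer rows keeps the regplot and boxplot responsive on large dataframes, at the cost of possibly missing outliers in the rows left out.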

It is important that these charts are read by the data scientist and not automated away to the machine. Since not all datasets follow the same rules it is important that a human interprets the visualisations and acts accordingly.
I hope this short run-through of data visualisation helps you produce clearer visualisations of your data and better informs your decisions when cleaning it.
The functions used in the example above are available here:
Feel free to customise these as you see fit!
Translated from: https://medium.com/@rani_64949/visualisations-of-data-for-help-in-data-cleaning-dce15a94b383