Machine Learning Models: Nonlinear Models

A Case Study of SHAP and PDP Using the Diabetes Dataset

Explaining machine learning models has always been difficult, because for many models the way results and performance come about stays a black box (hidden). In this post, I seek to explain the results of an ML model I built, with the help of two major tools: SHAP and partial dependence plots (PDP). The dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

The dataset consists of several medical predictor variables, including the number of pregnancies the patient has had, their BMI, insulin level, age, and so on. The major goal of the data is to answer the question: “Can you build a machine learning model to accurately predict whether or not the patients in the dataset have diabetes?”

The steps taken to build the machine learning model are as follows:

1. Exploratory Data Analysis (univariate and bivariate analysis)

2. Feature Engineering and Feature Selection

3. Baseline and comparison of several ML models on a metric

4. Performing hyperparameter tuning on the selected algorithm

5. Interpretation of the model result

Exploratory Data Analysis

Exploring the data is a paramount step in any data science project. It enables you (the data scientist) to understand the kind of features you are working with and how they relate to the question you are trying to answer. EDA can be broadly split into univariate analysis and bivariate analysis.

Univariate analysis broadly entails checking the distribution of each feature on its own. This helps answer questions such as whether the distribution is skewed or whether the feature has missing values.

Bivariate analysis looks at two features together to explore how they relate to each other. Generally, it is advised to pair one (1) feature with the target variable, as this saves time and effort, especially if a model is to be built. The complete analysis can be found here.

Let us understand the describe function of pandas. Quick question: what does it mean to have a standard deviation (std) close to the mean? First, what is std? Standard deviation is a number that tells how the measurements in a group are spread out from the average (mean), or expected value. A low std means that most of the numbers are close to the mean, while a higher std means that the numbers are more spread out. So, to answer the question: if a group has mean = 100 and std = 50, the std is large relative to the mean and the data are widely spread out. If a group has mean = 100 and std = 10, the data are only slightly spread out; plotted, the distribution shows a tall, narrow peak, and most values fall within a small range.
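To make the mean = 100 comparison concrete, here is a minimal sketch using synthetic, normally distributed numbers (not the diabetes data). The helper name summarize is made up for illustration; it uses ddof=1 because that is the sample standard deviation pandas describe() reports.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic groups with the same mean but different spread (illustrative only)
wide = rng.normal(loc=100, scale=50, size=10_000)
narrow = rng.normal(loc=100, scale=10, size=10_000)


def summarize(values):
    """Return mean, sample std (ddof=1) and the std/mean ratio."""
    mean = values.mean()
    std = values.std(ddof=1)
    return mean, std, std / mean


for name, group in [("wide", wide), ("narrow", narrow)]:
    mean, std, ratio = summarize(group)
    print(f"{name}: mean={mean:.1f} std={std:.1f} std/mean={ratio:.2f}")
```

The std/mean ratio is a quick way to compare spread across features measured in different units.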

Another quick question: what do the 25% and 75% rows mean for a group of data? To answer that, we need to know what a percentile really is. Percentile is defined in various ways; here, a percentile is a number such that a certain percentage of scores fall below or equal to it. So if you score 67 out of 90 in a test and fall into the 90th percentile, it means you scored as well as or better than 90% of the people who took the test. The 25th percentile is called the first quartile, the 50th percentile is called the median, and the 75th percentile is called the third quartile. The difference between the third and first quartiles is the interquartile range. So, to answer the question, 25% is the value that 25% of the data fall at or below, and 75% is the value that 75% of the data fall at or below.
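The quartiles can be checked by hand on a small example. The sketch below uses ten made-up test scores (an assumption for illustration, not data from the post) and NumPy's default linear interpolation between neighboring values.

```python
import numpy as np

# Ten made-up test scores, already sorted
scores = np.array([35, 42, 48, 51, 55, 58, 60, 63, 67, 90])

# First quartile, median and third quartile
q1, median, q3 = np.percentile(scores, [25, 50, 75])
iqr = q3 - q1  # interquartile range

# Percentile rank of a score: share of scores less than or equal to it
rank_of_67 = (scores <= 67).mean() * 100

print(q1, median, q3, iqr, rank_of_67)
```

Here nine of the ten scores are at or below 67, so a score of 67 sits at the 90th percentile of this group.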

According to the site the data was obtained from, 0 is used to represent missing values. So when we check for missing values with isnull() we find nothing; we need to check for zeros instead. We also need to be careful when doing so, because we don't want to change the target values.
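One way to handle zeros-as-missing is sketched below on a tiny made-up frame that borrows the dataset's column names. Which columns treat zero as missing is an assumption here (the list zero_means_missing is hypothetical), and median imputation is just one simple option, not necessarily what was done in the notebook.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame using the dataset's column names; not real patient data
df = pd.DataFrame({
    "Pregnancies": [0, 2, 5],
    "Glucose": [148, 0, 120],
    "BloodPressure": [72, 66, 0],
    "Insulin": [0, 94, 168],
    "BMI": [33.6, 0.0, 28.1],
    "Outcome": [1, 0, 0],
})

# Assumption: zero is impossible only in these columns.
# Pregnancies and Outcome are left out because 0 is a valid value there.
zero_means_missing = ["Glucose", "BloodPressure", "Insulin", "BMI"]

print(df.isnull().sum().sum())  # zeros are not NaN, so isnull() sees nothing

cleaned = df.copy()
cleaned[zero_means_missing] = cleaned[zero_means_missing].replace(0, np.nan)
print(cleaned.isnull().sum())

# Fill each gap with that column's median
imputed = cleaned.fillna(cleaned[zero_means_missing].median())
```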

Figure: Distribution of pregnancy count

Feature Engineering

This is the process of generating or adding new features from already existing ones, either by aggregating or transforming features, in order to improve the performance of the model. A log transform was applied to correct the skewness of the data.

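import numpy as np

# Note: np.log(0) is -inf, so this assumes the zero (missing) values
# have already been replaced
df['log_SkinThickness'] = np.log(df['SkinThickness'])
df['log_Insulin'] = np.log(df['Insulin'])
df['log_BMI'] = np.log(df['BMI'])
df['log_Age'] = np.log(df['Age'])
df.head()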
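As a quick check of what the log transform does to skewness, here is a sketch on synthetic right-skewed values (a log-normal sample standing in for a feature such as Insulin, which is an assumption). The helper skewness is defined here for illustration. The last lines show why zeros matter: the plain log of zero is minus infinity, while np.log1p, which computes log(1 + x), is one common workaround.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic right-skewed, strictly positive values (illustrative only)
values = rng.lognormal(mean=4.0, sigma=0.8, size=5_000)


def skewness(x):
    """Sample skewness: mean of cubed z-scores (about 0 for symmetric data)."""
    z = (x - x.mean()) / x.std()
    return float((z ** 3).mean())


print("skew before log:", skewness(values))
print("skew after log: ", skewness(np.log(values)))

# np.log(0) is -inf; np.log1p(0) is 0
with np.errstate(divide="ignore"):
    log_of_zero = np.log(0.0)
print(log_of_zero, np.log1p(0.0))
```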

import matplotlib.pyplot as plt
import seaborn as sns

df['BloodSugar'] = np.abs(df['Insulin'] - df['Glucose'])
df['high_BMI'] = df['BMI'].apply(lambda x: 1 if x > 30 else 0)

# Assumption: cols holds all column names of df as a pandas Index
cols = df.columns
cm = np.corrcoef(df[cols].values.T)
f, ax = plt.subplots(figsize=(14, 12))
sns.heatmap(cm, vmax=.8, linewidths=0.01, square=True, annot=True,
            cmap='viridis', linecolor='white', xticklabels=cols.values,
            annot_kws={'size': 12}, yticklabels=cols.values)

Select Features for model

cols = cols.drop(['Outcome'])
features = df[cols]
target = df['Outcome']
features.head()

Performance Metric = Accuracy

The baseline for this model, based on the chosen metric, is 70%. This means that any algorithm giving an accuracy of less than 70% is discarded, and in the end the algorithm with the highest accuracy score is selected. The data was scaled using StandardScaler so that features measured on larger scales do not dominate the model.
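A minimal sketch of scaling combined with the 70% accuracy bar follows. It uses synthetic data from make_classification rather than the diabetes features, and it puts StandardScaler inside a scikit-learn pipeline so that the scaler is refit on the training folds only during cross-validation. That pipeline arrangement is my assumption, not necessarily how the notebook did it.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the features and target
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)
X[:, 0] *= 1000  # inflate one feature's scale to mimic mixed units

# Scaling happens inside each cross-validation fold
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scores = cross_val_score(pipeline, X, y, cv=10, scoring="accuracy")

baseline = 0.70
print(f"mean accuracy: {scores.mean():.3f}, "
      f"beats baseline: {scores.mean() >= baseline}")
```

Distance-based models such as K-Nearest Neighbors are the ones most affected by unscaled features, since a feature with a large range dominates the distance.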

Establishing a Baseline Model

A baseline model is a simple model that is to be improved upon. Basically, it means building a model with the default parameters of an algorithm to understand its performance and detect some important features. The models built include Logistic Regression, Random Forest, Gradient Boosting and K-Nearest Neighbors.

Logistic Regression

from sklearn import linear_model
from sklearn.model_selection import cross_val_score

log_reg = linear_model.LogisticRegression()
list_scores = []
log_reg.fit(features, target)
# mean accuracy over 10-fold cross-validation
log_reg_score = cross_val_score(log_reg, features, target, cv=10, scoring='accuracy').mean()
print(log_reg_score)
list_scores.append(log_reg_score)

Output:
0.7695488721804511

KNeighbors Classifier

from sklearn import neighbors

cv_scores = []
# number of folds
folds = 10
# create a list of odd values of K for KNN
ks = list(range(1, int(len(features) * ((folds - 1) / folds)), 2))
# perform k-fold cross-validation for each K
for k in ks:
    knn = neighbors.KNeighborsClassifier(n_neighbors=k)
    score = cross_val_score(knn, features, target, cv=folds, scoring='accuracy').mean()
    cv_scores.append(score)
# get the maximum score
knn_score = max(cv_scores)
# find the optimal K that gives the highest score
optimal_k = ks[cv_scores.index(knn_score)]
print(f"The optimal number of neighbors is {optimal_k}")
print(knn_score)
list_scores.append(knn_score)

Output:
The optimal number of neighbors is 17
0.7747436773752565

Other models that were built include: Gradient Boosting Classifier, Support Vector Machine and Random Forest.
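The code for these other models is not shown in the post. The sketch below is one way they could be scored with the same 10-fold accuracy measure, using default parameters and synthetic data from make_classification (both are assumptions, so the numbers will not match the post).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the features and target
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)

models = {
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
    "svm": SVC(),
    "random_forest": RandomForestClassifier(random_state=0),
}

# Mean 10-fold accuracy per model, as for the baselines above
results = {
    name: cross_val_score(model, X, y, cv=10, scoring="accuracy").mean()
    for name, model in models.items()
}
best = max(results, key=results.get)

for name, score in results.items():
    print(f"{name}: {score:.3f}")
print("best:", best)
```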

Selecting Best Model

K-Nearest Neighbors gave the best accuracy of the models compared, so I moved on to hyperparameter tuning with it.

Performing Hyperparameter Tuning on the Selected Algorithm

cv_scores = []
# number of folds
folds = 30
# create a list of odd values of K for KNN
ks = list(range(1, int(len(features) * ((folds - 1) / folds)), 2))
# perform k-fold cross-validation for each K
for k in ks:
    knn = neighbors.KNeighborsClassifier(n_neighbors=k)
    score = cross_val_score(knn, features, target, cv=folds,
                            scoring='accuracy').mean()
    cv_scores.append(score)
# get the maximum score
knn_score = max(cv_scores)
# find the optimal K that gives the highest score
optimal_k = ks[cv_scores.index(knn_score)]
print(f"The optimal number of neighbors is {optimal_k}")
print(knn_score)
# result.append(knn_score)

Output:
The optimal number of neighbors is 19
New accuracy after hyperparameter tuning is 0.80
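The manual loop over K can also be written with scikit-learn's GridSearchCV. This is an alternative sketch, not the code used for the results above; the synthetic data, the K range of 1 to 39 and the pipeline step names "scale" and "knn" are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the features and target
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

# Odd values of K only, as in the loop above
param_grid = {"knn__n_neighbors": list(range(1, 40, 2))}

search = GridSearchCV(pipe, param_grid, cv=10, scoring="accuracy")
search.fit(X, y)

best_k = search.best_params_["knn__n_neighbors"]
print(f"The optimal number of neighbors is {best_k}")
print(f"Cross-validated accuracy: {search.best_score_:.3f}")
```

Either way, the score used to pick K is somewhat optimistic as an estimate of real-world accuracy, so a held-out test set is worth keeping aside for the final check.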

Interpreting the Model Result

Permutation Importance using the ELI5 library

What features does a model think are important? What features might have a greater impact on the model predictions than the others? This is called feature importance, and permutation importance is a widely used technique for calculating it. It helps us see when our model produces counter-intuitive results, and it helps show others when our model is working as we’d hope.

The idea is simple: randomly permute (shuffle) a single column in the validation dataset, leaving all the other columns intact. A feature is considered important if the model’s accuracy drops a lot, that is, the error increases, when its values are shuffled. On the other hand, a feature is considered ‘unimportant’ if shuffling its values does not affect the model’s accuracy.
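The shuffling idea can be written out by hand in a few lines. The sketch below is not ELI5's implementation; it uses a made-up dataset in which only column 0 drives the label, and a hypothetical helper permutation_importance_manual that reports the mean and spread of the accuracy drop over several reshuffles.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 3))
# Only column 0 drives the label; columns 1 and 2 are pure noise
y = (X[:, 0] + 0.1 * rng.normal(size=n) > 0).astype(int)

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)


def permutation_importance_manual(model, X_val, y_val, n_repeats=5, seed=0):
    """Mean and std of the accuracy drop after shuffling each column."""
    shuffle_rng = np.random.default_rng(seed)
    base = model.score(X_val, y_val)
    drops = np.zeros((X_val.shape[1], n_repeats))
    for col in range(X_val.shape[1]):
        for r in range(n_repeats):
            X_perm = X_val.copy()
            # Shuffle one column, keep the others intact
            X_perm[:, col] = shuffle_rng.permutation(X_perm[:, col])
            drops[col, r] = base - model.score(X_perm, y_val)
    return drops.mean(axis=1), drops.std(axis=1)


means, stds = permutation_importance_manual(model, X_val, y_val)
for col, (m, s) in enumerate(zip(means, stds)):
    print(f"feature {col}: {m:.3f} +/- {s:.3f}")
```

The model is fitted once and never retrained; only the validation data is shuffled.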

Permutation importance is useful for debugging, understanding your model, and communicating a high-level overview of your model. It is calculated after a model has been fitted. ELI5 is a Python library that lets you visualize and debug various ML models through a unified API. It has built-in support for several ML frameworks and provides a way to explain black-box models.

# Calculate and display importance using the eli5 library
# (model is the fitted classifier; X_test and y_test are the held-out data)
import eli5
from eli5.sklearn import PermutationImportance

plt.figure(figsize=[10, 10])
perm = PermutationImportance(model, random_state=1).fit(X_test, y_test)
eli5.show_weights(perm, feature_names=X_test.columns.tolist())

Interpretation

The features at the top are the most important and those at the bottom the least. For example, Glucose level is the most important feature for predicting whether a person has diabetes, which also makes sense. The number after +/- measures how performance varied from one reshuffling to the next. Some weights are negative, which means that predictions on the shuffled data happened to be slightly more accurate than on the real data; such features have little or no impact in deciding whether or not a patient has diabetes.

Partial Dependence Plots

The partial dependence plot (PDP) shows the marginal effect that one or two features have on the predicted outcome of an ML model, that is, how a feature affects predictions. A PDP can show the relationship between the target and the selected features via 1D or 2D plots.
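Under the hood, a one-dimensional partial dependence curve is just an average of predictions. The sketch below computes it by hand for a toy model on synthetic data (an assumption for illustration; it is not how pdpbox is implemented internally). The helper partial_dependence_1d is hypothetical: for each grid value it overwrites one feature for every row and averages the predicted probability.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
# Feature 0 strongly drives the label; feature 1 is noise
y = (2 * X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)
model = LogisticRegression().fit(X, y)


def partial_dependence_1d(model, X, feature, grid):
    """Average predicted probability when `feature` is forced to each grid value."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value  # same value for every row
        averages.append(model.predict_proba(X_mod)[:, 1].mean())
    return np.array(averages)


grid = np.linspace(-2, 2, 9)
pd_strong = partial_dependence_1d(model, X, 0, grid)
pd_weak = partial_dependence_1d(model, X, 1, grid)

print(np.round(pd_strong, 2))
print(np.round(pd_weak, 2))
```

A steep curve, as for feature 0 here, means the prediction depends heavily on that feature; a nearly flat curve means it barely matters.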

from pdpbox import pdp, get_dataset, info_plots

features_name = [i for i in features.columns]
pdp_goals = pdp.pdp_isolate(model=model, dataset=X_test, model_features=features_name, feature='Glucose')
pdp.pdp_plot(pdp_goals, 'Glucose')
plt.show()

From the chart above, at a Glucose level below 100 the probability of a patient having diabetes is low. As the Glucose level increases beyond 110, the chance of a patient having diabetes increases rapidly, from 0.2 to 0.4 at Glucose level 140, and then sharply to 0.8 at Glucose level 200.

A Glucose level beyond 110 puts patients at high risk of having diabetes, while below 100 the risk is low.

pdp_goals = pdp.pdp_isolate(model=model, dataset=X_test, model_features=features_name, feature='log_Insulin')
pdp.pdp_plot(pdp_goals, 'log_Insulin')

The Y-axis represents the change in prediction from what would be predicted at the baseline (leftmost) value. The blue area denotes the confidence interval. For the ‘Insulin’ graph, we observe that the probability of a person having diabetes increases slightly as the Insulin level goes up and then remains constant.

The Pregnancies chart shows that if the number of pregnancies is greater than 7, the probability of having diabetes increases by 0.1. In other words, once the number of pregnancies goes beyond 7, the probability of having diabetes rises.

From the Age plot above, it can be observed that younger people have a lower chance of having diabetes. The 27–29 age range shows a slight downward trend in the probability of having diabetes. According to Healthline, middle-aged and older adults are still at the highest risk of developing type 2 diabetes. In 2015, adults aged 45 to 64 were the most diagnosed age group for diabetes. This agrees with what the model shows: the probability of having diabetes increases continuously as the patient's age rises beyond 49 years.

Using SHAP values to explain model results

SHAP stands for SHapley Additive exPlanations. It helps to break down a prediction to show the impact of each feature. It is based on Shapley values, a technique used in game theory to determine how much each player in a collaborative game has contributed to its success. Normally, there is a trade-off between a model's accuracy and how easily its results can be interpreted, and striking the right balance can be a difficult task; with SHAP values, it is easier to achieve both.

SHAP is a game theoretic approach to explain the output of any machine learning model. It connects optimal credit allocation with local explanations using the classic Shapley values from game theory and their related extensions.
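To see what "credit allocation" means, here is a brute-force sketch of exact Shapley values for a toy three-feature scoring function. Everything in it is an assumption made for illustration: the function model, its coefficients, and the simplification of replacing "absent" features with a single baseline value. The SHAP library uses far more efficient algorithms and a background dataset rather than this enumeration.

```python
from itertools import combinations
from math import factorial


def model(x):
    """Hypothetical risk score with a glucose-age interaction."""
    glucose, bmi, age = x
    return 0.5 * glucose + 0.3 * bmi + 0.2 * glucose * age


def shapley_values(f, x, baseline):
    """Exact Shapley values by enumerating every coalition of features."""
    n = len(x)

    def value(subset):
        # Features in the coalition take their real value, the rest the baseline
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return f(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_i = value(set(subset) | {i})
                without_i = value(set(subset))
                total += weight * (with_i - without_i)
        phi.append(total)
    return phi


x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(model, x, baseline)

print(phi)
print(sum(phi), model(x) - model(baseline))
```

The contributions add up to the difference between the prediction and the baseline prediction, which is the "additive" part of the name. The interaction term is split equally between the two features involved.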

At a low Glucose level and a small number of pregnancies, the probability of having diabetes is low. As the Glucose level increases, the probability also increases, which further suggests that a higher Glucose level means a higher probability of having diabetes.

Glucose and number of pregnancies are good predictors of whether a patient has diabetes or not, and it can be concluded that the higher their values, the higher the chance of having diabetes.

With the chart above, you can interact with some of the features used to build the model and see how each one varies and affects the model's decision.

Recommendation and Conclusion

I would like to start with what we know about the data, then summarize the findings, and end with recommendations:

Know about the data

Explain Findings

Recommendation

Know about the data: This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset.

Can you build a machine learning model to accurately predict whether or not the patients in the dataset have diabetes? I believe I have answered this question.

Explain Findings: Features like Glucose and Blood Pressure are approximately normally distributed, while features like Age, BMI, Insulin and Skin Thickness are skewed. The skewed features were handled by taking the logarithm of their values, and the log-transformed values were used to build the model. While interpreting the result of the model, the assumption that more than 7 pregnancies increases the chance of having diabetes was verified to be true. The assumption that a high glucose level (glucose > 140) means a high or increasing probability of having diabetes was also verified to be true. It was discovered that a high Glucose level contributes to people having diabetes, and that a high BMI also increases the chance of having diabetes. If Insulin level is high, the chance of having diabetes will reduce. Also, a low level of Glucose and a small number of pregnancies reduce the chance of having diabetes.

Recommendation

a. Do cardio for at least 20 minutes every day to reduce your Glucose level

b. Eat bitter leaf and reduce your intake of sweet things like snacks

c. Practice birth control and use condoms if you do not wish to get pregnant

d. Tell your Husband he don do nah you wan kill me :)

This was put together as part of an effort to explain black-box machine learning models. Feel free to reach me on Twitter.

Special thanks to Yabebal, the 10Academy team and Babatunde. The complete notebook can be found here.

Translated from: https://medium.com/swlh/machine-learning-model-explanation-9f238d8bfe25
