How to Handle Missing Data with Python

The complete notebook and required datasets can be found in the git repo here.

Real-world data often has missing values.

Data can have missing values for a number of reasons, such as observations that were not recorded or measured, or data that may have been corrupted.

Handling missing data is important as many machine learning algorithms do not support data with missing values.

In this notebook, you will discover how to handle missing data for machine learning with Python.

Specifically, after completing this tutorial you will know:

  • How to mark invalid or corrupt values as missing in your dataset.

  • How to remove rows with missing data from your dataset.

  • How to impute missing values with mean values in your dataset.

Let's get started.

"How to Handle Missing Data with Python"

Overview

This tutorial is divided into 6 parts:

  1. Diabetes Dataset: where we look at a dataset that has known missing values.

  2. Mark Missing Values: where we learn how to mark missing values in a dataset.

  3. Missing Values Cause Problems: where we see how a machine learning algorithm can fail when the data contains missing values.

  4. Remove Rows With Missing Values: where we see how to remove rows that contain missing values.

  5. Impute Missing Values: where we replace missing values with sensible values.

  6. Algorithms that Support Missing Values: where we learn about algorithms that support missing values.

First, let’s take a look at our sample dataset with missing values.

1. Diabetes Dataset

The Diabetes Dataset involves predicting the onset of diabetes within 5 years given medical details.

  • Dataset File.

  • Dataset Details. Both files are available in the same folder as this notebook.

It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 768 observations with 8 input variables and 1 output variable. The variable names are as follows:

  1. Number of times pregnant.

  2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test.

  3. Diastolic blood pressure (mm Hg).

  4. Triceps skinfold thickness (mm).

  5. 2-Hour serum insulin (mu U/ml).

  6. Body mass index (weight in kg / (height in m)^2).

  7. Diabetes pedigree function.

  8. Age (years).

  9. Class variable (0 or 1).

The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 65%. Top results achieve a classification accuracy of approximately 77%.

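As a quick check of that baseline figure, the proportion of the majority class can be computed directly. This is a minimal sketch, not part of the original tutorial, and assumes the dataset file sits in the working directory as described above:

# rough check of the majority-class baseline
import pandas as pd
df = pd.read_csv('pima-indians-diabetes.csv', header=None)
# column 8 is the class variable; the frequency of the most common class is the baseline accuracy
print('Baseline accuracy: %.3f' % df[8].value_counts(normalize=True).max())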

A sample of the first 5 rows is listed below.

# load and summarize the dataset
import numpy as np
import pandas as pd
# load the dataset
dataset = pd.read_csv('pima-indians-diabetes.csv', header=None)
# look at the first few rows of the dataset
dataset.head()

This dataset is known to have missing values.

Specifically, there are missing observations for some columns that are marked as a zero value. (This is a very poor way of representing missing values.)

We can corroborate this by the definition of those columns and the domain knowledge that a zero value is invalid for those measures, e.g. a zero for body mass index or blood pressure is invalid.

Note: here zero values (0) indicate missing data only for a few of the predictors/features, namely columns 1, 2, 3, 4 and 5, and not for the target/response variable.

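As a quick check of that note, we can count the zeros in every column; only in columns 1 to 5 do those zeros stand for missing measurements. A small sketch, reusing the dataset DataFrame loaded above:

# count zeros per column
# (columns 0 and 8 legitimately contain zeros: number of pregnancies and the class label)
print((dataset == 0).sum())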

2. Mark Missing Values

Most data has missing values, and the likelihood of having missing values increases with the size of the dataset.

Missing data are not rare in real data sets. In fact, the chance that at least one data point is missing increases as the data set size increases.

— Page 187, Feature Engineering and Selection, 2019.

A note on this book: I received it just two days ago and am enjoying reading it.

In this section, we will look at how we can identify and mark values as missing.

We can use summary statistics to help identify missing or corrupt data.

We can load the dataset as a Pandas DataFrame and print summary statistics on each attribute.

dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
0    768 non-null int64
1    768 non-null int64
2    768 non-null int64
3    768 non-null int64
4    768 non-null int64
5    768 non-null float64
6    768 non-null float64
7    768 non-null int64
8    768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

dataset.describe()

# example of summarizing the number of missing values for each variable
# count the number of missing values for each column
num_missing = (dataset[[1,2,3,4,5]] == 0).sum()
# report the results
num_missing

1      5
2     35
3    227
4    374
5     11
dtype: int64

# replace '0' values with 'nan'
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, np.nan)
# count the number of nan values in each column
dataset.isnull().sum()

0      0
1      5
2     35
3    227
4    374
5     11
6      0
7      0
8      0
dtype: int64

dataset.head()

3. Missing Values Cause Problems

Having missing values in a dataset can cause errors with some machine learning algorithms.

Missing values are common occurrences in data. Unfortunately, most predictive modeling techniques cannot handle any missing values. Therefore, this problem must be addressed prior to modeling.

— Page 203, Feature Engineering and Selection, 2019.

In this section, we will try to evaluate the Linear Discriminant Analysis (LDA) algorithm on the dataset with missing values.

This is an algorithm that does not work when there are missing values in the dataset.

The example below marks the missing values in the dataset, as we did in the previous section, then attempts to evaluate LDA using 3-fold cross-validation and print the mean accuracy.

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
# split dataset into inputs and outputs
values = dataset.values
X = values[:,0:8]
y = values[:,8]
# define the model
model = LinearDiscriminantAnalysis()
# define the model evaluation procedure
cv = KFold(n_splits=3, shuffle=True, random_state=1)
# evaluate the model
result = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
# report the mean performance
print('Accuracy: %.3f' % result.mean())

Accuracy: nan
(FitFailedWarning)

Running the example results in an error, as shown in the output above.

This is as we expect.

We are prevented from evaluating an LDA algorithm (and other algorithms) on the dataset with missing values.

Many popular predictive models such as support vector machines, the glmnet, and neural networks, cannot tolerate any amount of missing values.

— Page 195, Feature Engineering and Selection, 2019.

Now, we can look at methods to handle the missing values.

4. Remove Rows With Missing Values

The simplest strategy for handling missing data is to remove records that contain a missing value.

The simplest approach for dealing with missing values is to remove entire predictor(s) and/or sample(s) that contain missing values.

— Page 196, Feature Engineering and Selection, 2019.

We can do this by creating a new Pandas DataFrame with the rows containing missing values removed.

Pandas provides the dropna() function that can be used to drop either columns or rows with missing data. We can use dropna() to remove all rows with missing data, as follows:

The current shape of the dataset:

dataset.shape

(768, 9)

X.shape

(768, 8)

# drop rows with missing values
dataset.dropna(inplace=True)
# summarize the shape of the data with missing rows removed
print(dataset.shape)

(392, 9)

In this example, we can see that the number of rows has been aggressively cut from 768 in the original dataset to 392 with all rows containing a NaN removed.

# split dataset into inputs and outputs
values = dataset.values
X = values[:,0:8]
y = values[:,8]
# define the model
model = LinearDiscriminantAnalysis()
# define the model evaluation procedure
cv = KFold(n_splits=3, shuffle=True, random_state=1)
# evaluate the model
result = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
# report the mean performance
print('Accuracy: %.3f' % result.mean())

Accuracy: 0.781

Removing rows with missing values can be too limiting for some predictive modeling problems; an alternative is to impute missing values.

5. Impute Missing Values

Imputing refers to using a model to replace missing values.

…missing data can be imputed. In this case, we can use information in the training set predictors to, in essence, estimate the values of other predictors.

— Page 42, Applied Predictive Modeling, 2013.

There are many options we could consider when replacing a missing value, for example:

  • A constant value that has meaning within the domain, such as 0, distinct from all other values.

  • A value from another randomly selected record.

  • A mean, median or mode value for the column.

  • A value estimated by another predictive model.

Any imputing performed on the training dataset will have to be performed on new data in the future when predictions are needed from the finalized model. This needs to be taken into consideration when choosing how to impute the missing values.

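Here is a minimal sketch of how a few of the options above could be applied with pandas. The variable names are illustrative only, and the rest of this tutorial sticks with the column mean:

# sketch of alternative fill values (constant, median, mode) on a freshly loaded copy of the data
from pandas import read_csv
from numpy import nan
df = read_csv('pima-indians-diabetes.csv', header=None)
df[[1,2,3,4,5]] = df[[1,2,3,4,5]].replace(0, nan)
df_constant = df.fillna(0)                  # a constant value with meaning in the domain
df_median = df.fillna(df.median())          # per-column median
df_mode = df.fillna(df.mode().iloc[0])      # per-column mode (most frequent value)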

For example, if you choose to impute with mean column values, these mean column values will need to be stored to file for later use on new data that has missing values.

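A minimal sketch of that idea, in which the column means are computed on a training portion only, saved to disk, and later reapplied to new data; the 80/20 split and the file name are illustrative choices, not part of the original tutorial:

# compute imputation values on training data only, persist them, and reuse them on new data
import pandas as pd
from numpy import nan
df = pd.read_csv('pima-indians-diabetes.csv', header=None)
df[[1,2,3,4,5]] = df[[1,2,3,4,5]].replace(0, nan)
train_df = df.sample(frac=0.8, random_state=1)
new_df = df.drop(train_df.index)
train_means = train_df.mean()               # means computed on the training rows only
train_means.to_pickle('column_means.pkl')   # store for later use
stored_means = pd.read_pickle('column_means.pkl')
new_df = new_df.fillna(stored_means)        # apply the stored training means to the new data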

Pandas provides the fillna() function for replacing missing values with a specific value.

For example, we can use fillna() to replace missing values with the mean value for each column, as follows:

# manually impute missing values with numpy
from pandas import read_csv
from numpy import nan
# load the dataset
dataset = read_csv('pima-indians-diabetes.csv', header=None)
# mark zero values as missing or NaN
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, nan)
# fill missing values with mean column values
dataset.fillna(dataset.mean(), inplace=True)
# count the number of NaN values in each column
print(dataset.isnull().sum())

0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    0
8    0
dtype: int64

The scikit-learn library provides the SimpleImputer pre-processing class that can be used to replace missing values.

It is a flexible class that allows you to specify the value to replace (it can be something other than NaN) and the technique used to replace it (such as mean, median, or mode). The SimpleImputer class operates directly on the NumPy array instead of the DataFrame.

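To illustrate that flexibility, here is a small sketch that treats the raw 0 values (rather than NaN) as the missing marker and fills with the column median instead of the mean; the parameter choices are illustrative only:

# SimpleImputer with a non-NaN missing marker and a median strategy
from pandas import read_csv
from sklearn.impute import SimpleImputer
raw = read_csv('pima-indians-diabetes.csv', header=None)
median_imputer = SimpleImputer(missing_values=0, strategy='median')
# impute only the predictor columns where 0 means "missing"
filled = median_imputer.fit_transform(raw[[1,2,3,4,5]].values)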

The example below uses the SimpleImputer class to replace missing values with the mean of each column then prints the number of NaN values in the transformed matrix.

from sklearn.impute import SimpleImputer
# load the dataset
dataset = read_csv('pima-indians-diabetes.csv', header=None)
# mark zero values as missing or NaN
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, np.nan)
# retrieve the numpy array
values = dataset.values
# define the imputer
imputer = SimpleImputer(missing_values=nan, strategy='mean')
# transform the dataset
transformed_values = imputer.fit_transform(values)
# count the number of NaN values in each column
print('Missing: %d' % np.isnan(transformed_values).sum())

Missing: 0

In either case, we can train algorithms sensitive to NaN values in the transformed dataset, such as LDA.

The example below shows the LDA algorithm trained in the SimpleImputer transformed dataset.

We use a Pipeline to define the modeling pipeline, where data is first passed through the imputer transform, then provided to the model. This ensures that the imputer and model are both fit only on the training dataset and evaluated on the test dataset within each cross-validation fold. This is important to avoid data leakage.

The complete final code is here:

# example of evaluating a model after an imputer transform
from numpy import nan
from pandas import read_csv
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
dataset = read_csv('pima-indians-diabetes.csv', header=None)
# mark zero values as missing or NaN
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, nan)
# split dataset into inputs and outputs
values = dataset.values
X = values[:,0:8]
y = values[:,8]
# define the imputer
imputer = SimpleImputer(missing_values=nan, strategy='mean')
# define the model
lda = LinearDiscriminantAnalysis()
# define the modeling pipeline
pipeline = Pipeline(steps=[('imputer', imputer),('model', lda)])
# define the cross validation procedure
kfold = KFold(n_splits=3, shuffle=True, random_state=1)
# evaluate the model
result = cross_val_score(pipeline, X, y, cv=kfold, scoring='accuracy')
# report the mean performance
print('Accuracy: %.3f' % result.mean())

Accuracy: 0.762

Try replacing the missing values with other values and see if you can lift the performance of the model.

Maybe missing values have meaning in the data.

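If the fact that a value is missing might itself carry signal, one common option is to add a binary indicator column for each affected predictor before imputing, so a model can still see that the value was absent. A minimal sketch with pandas, reusing the NaN-marked dataset from the last code block; the new column names are illustrative:

# add a 0/1 "was missing" indicator for each predictor where zeros were marked as NaN
for col in [1, 2, 3, 4, 5]:
    dataset[str(col) + '_missing'] = dataset[col].isnull().astype(int)
# scikit-learn's SimpleImputer offers the same idea via its add_indicator=True argument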

6. Algorithms that Support Missing Values

Not all algorithms fail when there is missing data.

There are algorithms that can be made robust to missing data, such as k-Nearest Neighbors that can ignore a column from a distance measure when a value is missing. Naive Bayes can also support missing values when making a prediction.

One of the really nice things about Naive Bayes is that missing values are no problem at all.

— Page 100, Data Mining: Practical Machine Learning Tools and Techniques, 2016.

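To make the k-Nearest Neighbors idea above concrete, here is a small conceptual sketch of a distance that simply skips any dimension where either value is missing; it is an illustration of the principle, not code from the original article:

# Euclidean distance computed only over dimensions that are present in both rows
import numpy as np
def nan_euclidean(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    mask = ~(np.isnan(a) | np.isnan(b))   # keep only dimensions with no missing value
    if not mask.any():
        return np.nan                     # the two rows share no observed dimensions
    return np.sqrt(np.sum((a[mask] - b[mask]) ** 2))
print(nan_euclidean([1.0, np.nan, 3.0], [2.0, 5.0, np.nan]))   # uses only the first dimension

scikit-learn ships a similar helper, nan_euclidean_distances, in sklearn.metrics.pairwise.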

There are also algorithms that can use the missing value as a unique and different value when building the predictive model, such as classification and regression trees.

| a few predictive models, especially tree-based techniques, can specifically account for missing data.

— Page 42, Applied Predictive Modeling, 2013.

Sadly, the scikit-learn implementations of naive Bayes, decision trees, and k-Nearest Neighbors are not robust to missing values, although this is being considered for future versions of scikit-learn.

Nevertheless, this remains as an option if you consider using another algorithm implementation (such as xgboost).

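As a sketch of that option, and assuming the xgboost package is installed, its scikit-learn wrapper can be cross-validated directly on the data with the NaN values left in place (this block is an illustration, not part of the original article):

# XGBoost handles NaN inputs natively, so no imputation step is needed
from xgboost import XGBClassifier
from sklearn.model_selection import KFold, cross_val_score
model = XGBClassifier(n_estimators=100)
kfold = KFold(n_splits=3, shuffle=True, random_state=1)
result = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')
print('Accuracy: %.3f' % result.mean())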

Translated from: https://medium.com/@kvssetty/how-to-handel-missing-data-71a3eb89ef91
