Python has become the go-to programming language for many data scientists and machine learning researchers. One essential data processing tool behind this choice is the pandas library. Indeed, the pandas library is versatile enough to handle almost all of the initial data manipulations needed to get data ready for statistical analyses or machine learning models.
However, that same versatility can make it overwhelming to start using pandas comfortably. If you're struggling to get started, this article is for you. Rather than covering so much that the key points get lost, the goal of this article is to provide an overview of the key operations you'll want in your daily data processing tasks. Each key operation is accompanied by highlights of the essential parameters to consider.
Certainly, the first step is to get pandas installed, which you can do with pip or conda by following the instructions here. In terms of coding environment, I recommend Visual Studio Code, JupyterLab, or Google Colab, all of which require little effort to set up. When the installation is done, in your preferred coding environment, you should be able to import pandas into your project.
import pandas as pd
If you run the above code without encountering any errors, you’re good to go.
1. Read External Data
In most cases, we read data from external sources. If our data are in a spreadsheet-like format, the following functions should serve the purpose.
# Read a comma-separated file
df = pd.read_csv("the_data.csv")

# Read an Excel spreadsheet
df = pd.read_excel("the_data.xlsx")
- The header should be handled correctly. By default, reading assumes the first line of data to be the column names. If there are no headers, you have to say so (e.g., header=None).
- If you're reading a tab-delimited file, you can still use read_csv by specifying the tab as the delimiter (e.g., sep="\t").
- When you read a large file, it's a good idea to start by reading a small portion of the data. In this case, you can set the number of rows to be read (e.g., nrows=1000).
- If your data involve dates, you can consider setting arguments to get the dates right, such as parse_dates and infer_datetime_format. A combined sketch of these options is shown below.
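The snippet below is a minimal sketch of these reading options put together; the file name and column names are placeholders for illustration, not from the original article.

# Hypothetical tab-delimited file with no header row and a date column
df_small = pd.read_csv(
    "the_data.tsv",           # placeholder file name
    sep="\t",                 # tab as the delimiter
    header=None,              # the file has no header row
    names=["date", "value"],  # assumed column names
    nrows=1000,               # only read the first 1,000 rows
    parse_dates=["date"],     # parse the date column as datetimes
)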
2. Create Series
During the process of cleaning up your data, you may need to create Series yourself. In most cases, you’ll simply pass an iterable to create a Series object.
# Create a Series from an iterable
integers_s = pd.Series(range(10))

# Create a Series from a dictionary object
squares = {x: x*x for x in range(1, 5)}
squares_s = pd.Series(squares)
- You can assign a name to the Series object by setting the name argument. This name becomes the column name if the Series becomes part of a DataFrame object.
- You can also assign an index to the Series (e.g., by setting the index argument) if you find it more useful than the default 0-based index. Note that the index's length should match your data's length.
- If you create a Series object from a dict, the keys will become the index. A short sketch of these options follows.
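Here is a minimal sketch of the name and index arguments; the values and labels are made up for illustration.

# A named Series with a custom string index (same length as the data)
scores_s = pd.Series(
    [90, 85, 77],
    index=["alice", "bob", "carol"],  # assumed labels, not from the article
    name="score",                     # becomes the column name inside a DataFrame
)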
3. Construct DataFrame
Oftentimes, you need to create DataFrame objects using Python built-in objects, such as lists and dictionaries. The following code snippet highlights two common use scenarios.
# Create a DataFrame from a dictionary of lists as values
data_dict = {'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]}
data_df0 = pd.DataFrame(data_dict)

# Create a DataFrame from a list of lists
data_list = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
data_df1 = pd.DataFrame(data_list, columns=tuple('abc'))
The first one uses a dict object. Its keys will become column names while its values will become values for the corresponding columns.
The second one uses a list object. Unlike the previous method, the constructed DataFrame will use the data row-wise, which means that each inner list will become a row of the created DataFrame object.
4. Overview of DataFrame
When we have a DataFrame to work with, we may want to take a look at the dataset from the 30,000-foot level. There are several common methods you can use, as shown below.
# Find out how many rows and columns the DataFrame has
df.shape

# Take a quick peek at the beginning and the end of the data
df.head()
df.tail()

# Get a random sample
df.sample(5)

# Get the information of the dataset
df.info()

# Get the descriptive stats of the numeric values
df.describe()
- It's important to check both the head and the tail, especially if you work with a large amount of data, because you want to make sure that all of your data have been read completely.
- The info() function will give you an overview of the columns in terms of data types and item counts.
- It's also useful to get a random sample of your data to check your data's integrity (i.e., the sample() function).
5. Rename Columns
You may notice that some columns of your data don't make much sense, or that a name is too long to work with, and you want to rename these columns.
# Rename columns using a mapping
df.rename({'old_col0': 'col0', 'old_col1': 'col1'}, axis=1)

# Rename columns by specifying the columns argument directly
df.rename(columns={'old_col0': 'col0', 'old_col1': 'col1'})
- You need to specify axis=1 to rename columns if you simply provide a mapping object (e.g., a dict).
- Alternatively, you can pass the mapping object to the columns argument explicitly.
- The rename function will create a new DataFrame by default. If you want to rename the DataFrame in place, you need to specify inplace=True, as in the sketch below.
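A minimal sketch of renaming in place, reusing the placeholder column names from above:

# Modify df itself instead of returning a new DataFrame
df.rename(columns={'old_col0': 'col0', 'old_col1': 'col1'}, inplace=True)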
6. Sort Data
To make your data more structured, you need to sort the DataFrame object.
# Sort data
df.sort_values(by=['col0', 'col1'])
- By default, the sort_values function sorts your rows (axis=0). In most cases, we use columns as the sorting keys.
- The sort_values function will create a new, sorted DataFrame object by default. To change that, use inplace=True.
- By default, the sorting is ascending for all sorting keys. If you want descending order, specify ascending=False. If you want mixed orders (e.g., some keys ascending and some descending), you can create a list of boolean values to match the number of keys, like by=['col0', 'col1', 'col2'], ascending=[True, False, True].
- The original index values will travel with their old data rows. In many cases, you need to re-index. Instead of calling the reset_index function directly, you can set ignore_index to True, which will reset the index for you after the sorting is completed. These options are combined in the sketch below.
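A minimal sketch combining mixed sort orders with an index reset; the column names are placeholders:

# col0 ascending, col1 descending, then renumber the rows from 0
df.sort_values(
    by=['col0', 'col1'],
    ascending=[True, False],
    ignore_index=True,
)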
7. Deal With Duplicates
Real-life datasets commonly contain duplicate records, whether from human error or a database glitch. We want to remove these duplicates because they can cause unexpected problems later on.
# To examine whether there are duplicates using all columns
df.duplicated().any()

# To examine whether there are duplicates using particular columns
df.duplicated(['col0', 'col1']).any()
The above calls return a boolean value telling you whether any duplicate records exist in your dataset. To find out the exact number of duplicate records, you can use the sum() function on the Series object of boolean values returned by the duplicated() function (Python treats True as a value of 1), as shown below. One additional thing to note is the keep argument: when it is set to False, every duplicate is marked as True. Suppose there are three identical records; with keep=False, all three will be marked as True (i.e., flagged as duplicated). With keep="first" or keep="last", the first or the last occurrence is left unmarked (False) and the remaining occurrences are marked as True.
# To find out the number of duplicates
df.duplicated().sum()
df.duplicated(keep=False).sum()
To actually view the duplicated records, you need to select data from the original dataset using the generated duplicated Series object, as shown below.
# Get the duplicate records
duplicated_indices = df.duplicated(['col0', 'col1'], keep=False)
duplicates = df.loc[duplicated_indices, :].sort_values(by=['col0', 'col1'], ignore_index=True)
- We get all duplicate records by setting the keep argument to False.
- To better view the duplicate records, you may want to sort the generated DataFrame using the same set of keys.
Once you have a good idea of the duplicate records in your dataset, you can drop them as shown below.
# Drop the duplicate records
df.drop_duplicates(['col0', 'col1'], keep="first", inplace=True, ignore_index=True)
- By default, the kept record will be the first of the duplicates.
- You'll need to specify inplace=True if you want to update the DataFrame in place. By the way, many other functions have this option, and its discussion will mostly be skipped from here on.
- As with the sort_values() function, you may want to reset the index afterwards by specifying the ignore_index argument (a new feature in pandas 1.0).
8. Handle Missing Data
Missing data are common in real-life datasets, whether because measures were not available or simply because human data entry mistakes produced meaningless values that are deemed missing. To get an overall idea of how many missing values your dataset has, you've seen that the info() function tells us how many non-null values each column has. We can get information about data missingness in a more structured fashion, as shown below.
# Find out how many missing values each column has
df.isnull().sum()

# Find out how many missing values the entire dataset has
df.isnull().sum().sum()
- The isnull() function creates a DataFrame of the same shape as your original DataFrame, with each value indicating whether the original value is missing (True) or not (False). As a related note, you can use the notnull() function if you want to generate a DataFrame indicating the non-null values.
- As mentioned previously, a True value in Python is arithmetically equal to 1. The sum() function computes the sum of these boolean values for each column (by default, it calculates the sum column-wise), which reflects the number of missing values.
Once we have some idea about the missingness of our dataset, we usually want to deal with it. Possible solutions include dropping the records with any missing values or filling them with applicable values.
# Drop the rows with any missing values
df.dropna(axis=0, how="any")

# Drop the rows that have fewer than 2 non-null values
df.dropna(thresh=2)

# Drop the columns whose values are all missing
df.dropna(axis=1, how="all")
- By default, the dropna() function operates on rows (i.e., axis=0). If you specify how="any", rows with any missing values will be dropped.
- When you set the thresh argument, it requires that the row (or the column when axis=1) have at least that number of non-missing values.
- As with many other functions, when you set axis=1, you're performing the operation column-wise. In this case, the above call removes the columns whose values are all missing.
Besides the operation of dropping data rows or columns with missing values, it’s also possible to fill the missing values with some values, as shown below.
# Fill missing values with 0 or any other applicable value
df.fillna(value=0)

# Fill the missing values with a customized mapping for columns
df.fillna(value={"col0": 0, "col1": 999})

# Fill missing values with the next valid observation
df.fillna(method="bfill")

# Fill missing values with the last valid observation
df.fillna(method="ffill")
To fill the missing values with specified values, you set the value argument either to a fixed value used everywhere or to a dict object that instructs the filling for each column.
- Alternatively, you can fill the missing values by using existing observations surrounding the missing holes, either backward fill or forward fill.
9. Descriptive Statistics by Group
When you conduct machine learning research or data analysis, it's often necessary to perform particular operations with some grouping variables. In this case, we need to use the groupby() function. The following code snippet shows some common scenarios.
# Get the count by group, a 2 by 2 example
df.groupby(['col0', 'col1']).size()

# Get the mean of all applicable columns by group
df.groupby(['col0']).mean()

# Get the mean of a particular column by group
df.groupby(['col0'])['col1'].mean()

# Request multiple descriptive stats
df.groupby(['col0', 'col1']).agg({
    'col2': ['min', 'max', 'mean'],
    'col3': ['nunique', 'mean']
})
- By default, the groupby() function returns a GroupBy object. If you want to convert it to a DataFrame, you can call reset_index() on the result. Alternatively, you can specify as_index=False in the groupby() call to create a DataFrame directly, as in the sketch below.
- The size() method is useful if you want to know the frequency of each group.
- The agg() function allows you to generate multiple descriptive statistics. You can simply pass a set of function names, which will apply to all columns. Alternatively, you can pass a dict object mapping specific columns to the functions to apply.
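A minimal sketch of getting a DataFrame back from a grouped mean, reusing the placeholder column names above:

# Option 1: keep the grouping key as a regular column
means_df = df.groupby(['col0'], as_index=False)['col1'].mean()

# Option 2: compute first, then move the group key out of the index
means_df = df.groupby(['col0'])['col1'].mean().reset_index()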
10. Wide to Long Format Transformation
Depending on how the data are collected, the original dataset may be in the "wide" format, where each row represents a data record with multiple measures (e.g., different time points for a subject in a research study). If we want to convert the "wide" format to the "long" format (e.g., each time point becomes a data row, so a subject has multiple rows), we can use the melt() function, as shown below.
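The original embedded example isn't included in this copy, so here is a minimal sketch that reuses the df_wide columns (subject, before_meds, after_meds) referenced later in the article; the author's exact data and naming may differ.

# Hypothetical wide-format data: one row per subject, one column per measurement
df_wide = pd.DataFrame({
    'subject': [100, 101, 102],
    'before_meds': [70, 80, 75],
    'after_meds': [65, 72, 71],
})

# Unpivot: each measurement becomes its own row
df_long = df_wide.melt(
    id_vars=['subject'],                       # identifier column(s)
    value_vars=['before_meds', 'after_meds'],  # columns holding the values
    var_name='treatment',                      # assumed name for the variable column
    value_name='measurement',                  # assumed name for the value column
)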
- The melt() function is essentially "unpivoting" a data table (we'll talk about pivoting next). You specify id_vars as the columns that are used as identifiers in the original dataset.
- The value_vars argument is set using the columns that contain the values. By default, those column names become the values of the var_name column in the melted dataset.
11. Long to Wide Format Transformation
The opposite operation to the melt() function is called pivoting, which we can perform with the pivot() function. Suppose that the "long" format DataFrame created above is called df_long. The following code shows how we can convert the long format back to the wide format, essentially reversing the process from the previous section.
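The original embedded example isn't included here either; this sketch simply reverses the melt() call above, using the same assumed column names.

# Pivot back to wide format: one row per subject, one column per treatment
df_wide_again = df_long.pivot(
    index='subject',        # rows
    columns='treatment',    # new column labels
    values='measurement',   # cell values
).reset_index()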
Besides the pivot() function, a closely related function is pivot_table(), which is more general than pivot() in that it allows duplicate index or columns (see here for a more detailed discussion).
12. Select Data
When we work with a complex dataset, we need to select a subset of the dataset for particular operations based on some criteria. If you want to select some columns, the following code shows you how; the selected data will include all the rows.
# Select a column
df_wide['subject']

# Select multiple columns
df_wide[['subject', 'before_meds']]
If you want to select certain rows with all columns, do the following.
# Select rows with a specific condition
df_wide[df_wide['subject'] == 100]
If you want to select certain rows and columns, you should consider using the iloc or loc methods. The major difference between these methods is that the iloc method uses the 0-based positional index, while the loc method uses labels.
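The author's embedded example isn't included in this copy; the sketch below shows equivalent iloc/loc call pairs on the df_wide data assumed earlier, so the notes that follow still apply.

# By position vs. by label: each pair selects the same data here
df_wide.iloc[0, 0]
df_wide.loc[0, 'subject']

# Slicing rows and columns: iloc excludes the stop position, loc includes the stop label
df_wide.iloc[0:2, 0:2]
df_wide.loc[0:1, 'subject':'before_meds']

# Boolean selection: iloc needs the underlying numpy array of the mask
mask = df_wide['subject'] == 100
df_wide.loc[mask, ['before_meds']]
df_wide.iloc[mask.values, [1]]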
- Each pair of calls above creates the same output.
- When you use slice objects with iloc, the stop index isn't included, just as with regular Python slice objects. However, slice objects do include the stop label in the loc method.
- When you use a boolean array with iloc, you need to use the actual values (via the values attribute, which returns the underlying numpy array). If you don't do that, you'll probably encounter the following error: NotImplementedError: iLocation based boolean indexing on an integer type is not available.
- The use of labels in the loc method happens to look the same as positional indexing when selecting rows here, because the default index labels coincide with the positions. In other words, iloc will always use the 0-based position, regardless of the numeric values of the index labels.
13. New Columns Using Existing Data (map and apply)
Existing columns don't always present the data in the format we want, so we often need to generate new columns from existing data. Two functions are particularly useful in this case: map() and apply(). There are many possible ways to use them to create new columns; for instance, the apply() function can take a more complex mapping function and can create multiple columns. I'll just show you the two most common cases, with the following rules of thumb. Let's keep the goal simple: create one new column in either case.
- If your data conversion involves just one column, simply use the map() function on that column (in essence, it's a Series object).
- If your data conversion involves multiple columns, use the apply() function. A sketch of both cases follows.
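The author's embedded example isn't included in this copy; below is a minimal sketch of the two cases using lambda functions and the df_wide columns assumed earlier (the derived column names are placeholders).

# One column involved: map() on the Series
df_wide['before_meds_scaled'] = df_wide['before_meds'].map(lambda x: x / 100)

# Multiple columns involved: apply() row-wise (axis=1)
df_wide['change'] = df_wide.apply(
    lambda row: row['before_meds'] - row['after_meds'],
    axis=1,
)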
- In both cases, I used lambda functions. However, you can use regular functions. It's also possible to provide a dict object to the map() function, which maps old values to new values based on the key-value pairs, with the keys being the old values and the values being the new values.
- For the apply() function, when we create a new column this way, we need to specify axis=1, because we're accessing the data row-wise.
- For the apply() function, the example shown is intended for demonstration purposes, because I could've used the original columns to do a simpler arithmetic subtraction like this: df_wide['change'] = df_wide['before_meds'] - df_wide['after_meds'].
14. Concatenation and Merging
When we have multiple datasets, it's necessary to put them together from time to time. There are two common scenarios. The first is when you have datasets of a similar shape, sharing either the same index or the same columns; in that case, you can consider concatenating them directly. The following code shows some possible concatenations.
# When the data have the same columns, concatenate them vertically
dfs_a = [df0a, df1a, df2a]
pd.concat(dfs_a, axis=0)

# When the data have the same index, concatenate them horizontally
dfs_b = [df0b, df1b, df2b]
pd.concat(dfs_b, axis=1)
- By default, the concatenation performs an "outer" join, which means that if there are any non-overlapping index values or columns, all of them will be kept. In other words, it's like creating a union of two sets.
- Another thing to remember is that if you need to concatenate multiple DataFrame objects, it's recommended that you create a list to store these objects and perform the concatenation just once, avoiding the intermediate DataFrame objects generated by concatenating sequentially.
- If you want to reset the index for the concatenated DataFrame, you can set the ignore_index=True argument.
The other scenario is to merge datasets that have one or two overlapping identifiers. For instance, one DataFrame has id number, name and gender, and the other has id number and transaction records. You can merge them using the id number column. The following code shows you how to merge them.
# Merge DataFrames that have the same merging keys
df_a0 = pd.DataFrame(dict(), columns=['id', 'name', 'gender'])
df_b0 = pd.DataFrame(dict(), columns=['id', 'name', 'transaction'])
merged0 = df_a0.merge(df_b0, how="inner", on=["id", "name"])

# Merge DataFrames that have different merging keys
df_a1 = pd.DataFrame(dict(), columns=['id_a', 'name', 'gender'])
df_b1 = pd.DataFrame(dict(), columns=['id_b', 'transaction'])
merged1 = df_a1.merge(df_b1, how="outer", left_on="id_a", right_on="id_b")
- When both DataFrame objects share the same key or keys, you can simply specify them (either one or multiple is fine) using the on argument.
- When they have different names, you can specify which key belongs to the left DataFrame (left_on) and which to the right DataFrame (right_on).
- By default, the merging uses the inner join method. When you want other join methods (e.g., left, right, outer), you set the proper value for the how argument.
15. Drop Columns
Although you can keep all the columns in the DataFrame by renaming them without any conflict, sometimes you'd like to drop some columns to keep the dataset clean. In this case, you should use the drop() function.
# Drop the unneeded columns
df.drop(['col0', 'col1'], axis=1)
- By default, the drop() function uses labels to refer to columns or index entries, so you may want to make sure that the labels actually exist in the DataFrame object.
- To drop index entries (rows), you use axis=0. To drop columns, which I find to be more common, you use axis=1.
- Again, this operation creates a new DataFrame object; if you prefer changing the original DataFrame, specify inplace=True.
16. Write to External Files
When you want to communicate data with your collaborators or teammates, you need to write your DataFrame objects to external files. In most cases, comma-delimited files should serve the purpose.
# Write to a csv file, which will keep the index
df.to_csv("filename.csv")

# Write to a csv file without the index
df.to_csv("filename.csv", index=False)

# Write to a csv file without the header
df.to_csv("filename.csv", header=False)
- By default, the generated file will keep the index. You need to specify index=False to remove the index from the output.
- By default, the generated file will keep the header (i.e., the column names). You need to specify header=False to remove the headers.
Conclusion
In this article, we reviewed the basic operations that you'll find useful for getting started with the pandas library. As indicated by the article's title, these techniques aren't intended to handle the data in a fancy way. Instead, they're all basic techniques that let you process the data the way you want. Later on, you can probably find fancier ways to get some of these operations done.
Translated from: https://towardsdatascience.com/nothing-fancy-but-16-essential-operations-to-get-you-started-with-pandas-5b0c2f649068