6 Pandas Operations You Should Not Miss
Data Science
Pandas is used mainly for reading, cleaning, and extracting insights from data. We will look at some advanced uses of Pandas that are very important to a data scientist. These operations are used to analyze data and manipulate it if required, and they form part of the steps performed before building any machine learning model.
- Summarizing Data
- Concatenation
- Merge and Join
- Grouping
- Pivot Table
- Reshaping multi-index DataFrame
We will be using the very famous Titanic dataset to explore the functionalities of Pandas. Let's quickly import NumPy and Pandas, and load the Titanic dataset from Seaborn.
import numpy as np
import pandas as pd
import seaborn as sns

df = sns.load_dataset('titanic')
df.head()

Summarizing Data
The very first thing any data scientist would like to know is the statistics of the entire data. With the help of the Pandas .describe() method, we can see the summary statistics of each feature. Notice that the statistics are given only for numerical columns, which is the default behavior; we can also ask the describe function to include categorical columns by passing the parameter include='all'.
df.describe()
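As a small sketch of the include='all' variant mentioned above, the same call simply takes the extra parameter; the exact output columns depend on the loaded dataset.

# Summary statistics for all columns, including categorical ones such as 'sex' and 'class'
df.describe(include='all')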

Another method is .info(). It gives the metadata of a dataset. We can see the size of the dataset, the dtype of each column, and the count of non-null values in each column.
df.info()

Concatenation
Concatenation of two DataFrames is very straightforward, thanks to the Pandas method concat(). Let us take a small section of our Titanic data with the help of vector indexing. Vector indexing is a way to specify, as lists, the row and column names/integers we would like to index, in any order.
smallData = df.loc[[1,7,21,10], ['sex','age','fare','who','class']]
smallData

Also, I have created a dataset with matching columns to explain concatenation.
newData = pd.DataFrame({'sex': ['female', 'male', 'male'],
                        'age': [25, 49, 35],
                        'fare': [89.22, 70.653, 30.666],
                        'who': ['child', 'women', 'man'],
                        'class': ['First', 'First', 'First']})

By default, the concatenation happens row-wise. Let's see how the new dataset looks when we concatenate the two DataFrames.
pd.concat([smallData, newData])

What if we want to concatenate while ignoring the index? Just set the ignore_index parameter to True.
pd.concat([ newData,smallData], ignore_index=True)

If we wish to concatenate along the columns, we just have to change the axis parameter to 1.
pd.concat([ newData,smallData], axis=1)


Notice the changes? As soon as we concatenated column-wise, Pandas arranged the data in the order of the row indices. Rows 0 and 2 are missing from smallData but present in newData, so they are inserted in sequential order. But we have row 1 in both datasets, and Pandas kept the data of the first dataset first because that was the first dataset we passed as a parameter to concat. Also, missing data is represented as NaN.
We can also perform concatenation in SQL-join fashion. Let's create a new DataFrame 'newData' having a few columns in common with smallData, but not all.
newData = pd.DataFrame({'fare':[89.22,70.653,30.666,100],
'who':['child', 'women', 'man', 'women'],
'class':['First','First','First','Second'],
'adult_male': [True, False, True, False]})
newData

If you are familiar with the SQL join operation, you will notice that .concat() performs an outer join by default. Missing values for unmatched columns are filled with NaN.
pd.concat([smallData, newData])

We can control the type of join operation with the 'join' parameter. Let's perform an inner join, which keeps only the columns common to both DataFrames.
pd.concat([smallData, newData], join='inner')

Merge and Join
Pandas provides us an exclusive and more efficient method, .merge(), to perform in-memory join operations. Merge implements a subset of relational algebra, the same foundation that underlies SQL joins.
I will be moving away from our Titanic dataset only for this section, to make the join operations easier to understand with less complex data.
There are different types of join operations:
- One-to-one
- Many-to-one
- Many-to-many
The classic data used to explain joins in SQL is the employee dataset. Let's create the DataFrames.
df1 = pd.DataFrame({'employee_name': ['Tasha', 'Linda', 'Olliver', 'Jack'],
                    'department': ['Engineering', 'Accounting', 'HR', 'HR']})
df2 = pd.DataFrame({'employee_name': ['Linda', 'Tasha', 'Jack', 'Olliver'],
                    'salary': [35000, 20500, 90000, 68000]})

One-to-one
One-to-one merge is very similar to column-wise concatenation. To combine 'df1' and 'df2' we use the .merge() method. Merge is capable of recognizing common columns in the datasets and using them as the key, in our case the column 'employee_name'. Also, the names are not in the same order. Let's see how merge does the work for us by ignoring the indices.
df3 = pd.merge(df1,df2)
df3

Many-to-one
Many-to-one is a type of join in which one of the two key columns has duplicate values. Suppose we have a supervisor for each department and there are many employees in each department; hence, many employees map to one supervisor.
df4 = pd.DataFrame({'department': ['Engineering', 'Accounting', 'HR'],
                    'supervisor': ['Jonas', 'Martha', 'Martin']})
print('--------df4---------\n', df4)
print('-------merged--------')
pd.merge(df3, df4)

Many-to-many
This is the case where the key column in both datasets has duplicate values. Suppose many skills are mapped to each department; then the resulting DataFrame will have duplicate entries.
df5 = pd.DataFrame({'department': ['Engineering', 'Engineering', 'Accounting',
                                   'Accounting', 'HR', 'HR'],
                    'skills': ['Coding', 'Soft skills', 'Math', 'Excel',
                               'Organizing', 'Decision making']})
print('--------df5---------\n', df5)
print('\n-------merged--------')
pd.merge(df3, df5)

Merge on Uncommon Column Names and Values
Uncommon column names
Many times merging is not that simple, since the data we receive will not be so clean. We saw how merge does all the work provided we have one common column. What if we have no common column at all, or there is more than one common column? Pandas provides us the flexibility to explicitly specify the columns to act as the key in both DataFrames.
Suppose we change our 'employee_name' column to 'name' in 'df2'. Let's see how the datasets look and how to tell merge explicitly which columns are the keys.
df2 = pd.DataFrame({'name': ['Linda', 'Tasha', 'Jack', 'Olliver'],
                    'salary': [35000, 20500, 90000, 68000]})
print('--------df1---------\n', df1)
print('--------df2---------\n', df2)
print('\n-------merged--------')
pd.merge(df1, df2, left_on='employee_name', right_on='name')

The parameter 'left_on' specifies the key of the first DataFrame and 'right_on' the key of the second. Remember, the value of 'left_on' should match a column of the first DataFrame you passed, and 'right_on' a column of the second. Notice that we get a redundant column 'name'; we can drop it if it is not needed, as shown below.
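A minimal sketch of dropping the duplicated key column after the merge; the column names follow the example above.

merged = pd.merge(df1, df2, left_on='employee_name', right_on='name')
# 'name' duplicates 'employee_name', so drop it from the merged result
merged = merged.drop(columns='name')
merged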
Uncommon values
Previously we saw that all the employee names present in one dataset were also present in the other. What if some names are missing?
df1 = pd.DataFrame({'employee_name': ['Tasha', 'Linda', 'Olliver', 'Jack'],
                    'department': ['Engineering', 'Accounting', 'HR', 'HR']})
df2 = pd.DataFrame({'employee_name': ['Linda', 'Mary'],
                    'salary': [35000, 20500]})
print('--------df1---------\n', df1)
print('--------df2---------\n', df2)
print('\n-------merged--------\n')
pd.merge(df1, df2)

By default, merge applies an inner join, meaning the join is performed only on common values. This is not always the preferred way, since there will be data loss. The method of joining can be controlled with the parameter 'how'. We can perform a left join or a right join to overcome data loss. The missing values will be represented as NaN by Pandas.
print('-------left join--------\n',pd.merge(df1, df2, how='left'))
print('\n-------right join--------\n',pd.merge(df1,df2,how='right'))

GroupBy
GroupBy is a very flexible abstraction; we can think of it as a collection of DataFrames. It allows us to do many different powerful operations. In simple words, it groups the entire dataset by the values of the column we specify and allows us to perform operations to extract insights.
Let's come back to our Titanic dataset.
Suppose we would like to see how many male and female passengers survived.
print(df.groupby('sex'))
df.groupby('sex').sum()

Notice that printing the groupby without performing any operation gives a GroupBy object. Since there are only two unique values in the column 'sex', we see a summation of every other column grouped by male and female. More insightful would be to get the percentages. We will take only the 'survived' column of the groupby result above, sum it, and calculate percentages.
data = df.groupby('sex')['survived'].sum()
print('% of male survivors', (data['male'] / (data['male'] + data['female'])) * 100)
print('% of female survivors', (data['female'] / (data['male'] + data['female'])) * 100)

Output:
% of male survivors 31.87134502923976
% of female survivors 68.12865497076024
Under the hood, the GroupBy function performs three operations: split-apply-combine (a rough manual equivalent is sketched after the list below).
- Split - breaking the DataFrame up in order to group it by the specified key.
- Apply - computing the function we wish, like an aggregation, transformation, or filter.
- Combine - merging the output into a single DataFrame.
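For intuition, here is a small sketch of what split-apply-combine does by hand for the sum we computed earlier; groupby performs the same steps far more efficiently.

# Split: one sub-DataFrame per unique value of 'sex'
groups = {key: df[df['sex'] == key] for key in df['sex'].unique()}
# Apply: compute the aggregation on each group
sums = {key: group['survived'].sum() for key, group in groups.items()}
# Combine: put the per-group results back into a single Series
pd.Series(sums)  # matches df.groupby('sex')['survived'].sum()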

Some of the more powerful operations that can be performed on a groupby are:
- Aggregate
- Filter
- Transform
- Apply
Let's see each one with an example.
Aggregate
The aggregate function allows us to perform more than one aggregation at a time. We need to pass the list of required aggregations as a parameter to the .aggregate() function.
df.groupby('sex')['survived'].aggregate(['sum', np.mean,'median'])

Filter
The filter function allows us to drop data based on a group property. Suppose we want to see data only where the standard deviation of 'fare' within a group is greater than a threshold value, say 50, when grouped by 'survived'.
df.groupby('survived').filter(lambda x: x['fare'].std() > 50)

Since the standard deviation of 'fare' is greater than 50 only for the group where 'survived' equals 1, we see data only where 'survived' is 1. We can verify this with a quick check, as sketched below.
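A small check of the claim above, looking at the per-group standard deviation directly:

# Standard deviation of 'fare' within each 'survived' group;
# only the group exceeding 50 passes the filter above
df.groupby('survived')['fare'].std()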
Transform
Transform returns a transformed version of the entire data. The best example to explain it is centering the dataset. Centering the data is nothing but subtracting from each value the mean of its respective column.
df.groupby('survived').transform(lambda x: x - x.mean())

Apply
Apply is very flexible, unlike filter and transform; the only criterion is that it takes a DataFrame and returns a Pandas object or a scalar. We have the flexibility to do anything we wish inside the function.
def func(x):
    x['fare'] = x['fare'] / x['fare'].sum()
    return x

df.groupby('survived').apply(func)

Pivot Tables
Previously, with GroupBy, we saw how 'sex' affected survival: the survival rate of females is much larger than that of males. Suppose we would also like to see how 'pclass' affected survival, with both 'sex' and 'pclass' side by side. Using GroupBy, we would do something like this.
df.groupby(['sex', 'pclass'])['survived'].aggregate('mean').unstack()

This is more insightful: we can easily make out that passengers in the third-class section of the Titanic were less likely to survive.
This type of operation is very common in analysis. Hence, Pandas provides the function .pivot_table(), which performs the same task with more flexibility and less complexity.
df.pivot_table('survived', index='sex', columns='pclass')

The result of the pivot table function is a DataFrame, unlike groupby, which returned a groupby object. We can perform all the usual DataFrame operations on it, as sketched below.
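As a small sketch of treating the pivot result like any other DataFrame (the column labels follow the pclass values 1-3; the new column name is purely illustrative):

pivot = df.pivot_table('survived', index='sex', columns='pclass')
# Ordinary DataFrame operations work on the result,
# e.g. the gap in survival rate between first and third class
pivot['first_vs_third'] = pivot[1] - pivot[3]
pivot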
We can also add a third dimension to our result. Suppose we want to see how 'age' affected the survival rate along with 'sex' and 'pclass'. Let's divide 'age' into groups: 0-18 child/teenager, 18-40 adult, and 41-80 older.
age = pd.cut(df['age'], [0, 18, 40, 80])
pivotTable = df.pivot_table('survived', ['sex', age], 'class')
pivotTable

Interestingly, female children and teenagers in the second class have a 100% survival rate. This is the kind of power the Pandas pivot table has.
Reshaping Multi-index DataFrame
To see a multi-index DataFrame from a different view, we reshape it. Stack and unstack are the two methods to accomplish this.
unstack()
Unstacking is the process of converting a row index to a column index. The pivot table we created previously is multi-indexed row-wise. We can move the innermost row index (the age groups) into the innermost column index.
pivotTable = pivotTable.unstack()
pivotTable

We can also convert the outermost row index (sex) into the innermost column index by using the parameter 'level'.
pivotTable = pivotTable.unstack(level=0)
pivotTable

stack()
Stacking is exactly the inverse of unstacking. We can convert a column index of a multi-index DataFrame into a row index. Here the innermost column index 'sex' is converted into the innermost row index. The result is slightly different from the original DataFrame because we previously unstacked with level 0.
pivotTable.stack()

These functions and methods are very helpful for understanding data, for further manipulation, and for building a predictive model. We can also plot graphs to get visual insights, as in the small sketch below.
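A minimal plotting sketch built on the pivot table from earlier; it assumes matplotlib is available, which Seaborn already requires.

import matplotlib.pyplot as plt

# Survival rate by sex and passenger class as a grouped bar chart
df.pivot_table('survived', index='sex', columns='pclass').plot(kind='bar')
plt.ylabel('survival rate')
plt.show()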
Translated from: https://medium.com/towards-artificial-intelligence/6-pandas-operations-you-should-not-miss-d531736c6574