Nothing Fancy, but 16 Essential Operations to Get You Started With Pandas

Python has become the go-to programming language for many data scientists and machine learning researchers. One essential data processing tool behind this choice is the pandas library. Indeed, the pandas library is so versatile that it can be used for almost all of the initial data manipulations needed to get data ready for statistical analyses or for building machine learning models.

However, that same versatility can make it overwhelming to start using the library comfortably. If you're struggling with how to get started, this article is for you. Rather than covering so much that the main points get lost, the goal of this article is to provide an overview of the key operations you'll want in your daily data processing tasks. Each key operation is accompanied by some highlights of the essential parameters to consider.

Certainly, the first step is to get pandas installed, which you can do with pip or conda by following the instructions here. In terms of coding environment, I recommend Visual Studio Code, JupyterLab, or Google Colab, all of which require little effort to set up. When the installation is done, in your preferred coding environment, you should be able to import pandas into your project.

import pandas as pd

If you run the above code without encountering any errors, you’re good to go.

1. Read External Data

In most cases, we read data from external sources. If our data are in a spreadsheet-like format, the following functions should serve the purpose.

# Read a comma-separated file
df = pd.read_csv("the_data.csv")

# Read an Excel spreadsheet
df = pd.read_excel("the_data.xlsx")
  • The header should be handled correctly. By default, the reading assumes the first line of data to be the column names. If there are no headers, you have to say so (e.g., header=None).

  • If you're reading a tab-delimited file, you can use read_csv by specifying the tab as the delimiter (e.g., sep="\t").

  • When you read a large file, it's a good idea to read a small portion of the data first. In this case, you can set the number of rows to be read (e.g., nrows=1000).

  • If your data involve dates, consider setting the arguments that make the dates right, such as parse_dates and infer_datetime_format (see the sketch after this list).

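A minimal sketch combining the parameters above; the file name and the choice of date column are hypothetical:

# Read only the first 1,000 rows of a tab-delimited file that has no header row,
# parsing the first column as dates
df = pd.read_csv(
    "the_data.tsv",
    sep="\t",
    header=None,
    nrows=1000,
    parse_dates=[0],
    infer_datetime_format=True,
)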

2. Create Series

During the process of cleaning up your data, you may need to create Series yourself. In most cases, you’ll simply pass an iterable to create a Series object.

# Create a Series from an iterable
integers_s = pd.Series(range(10))

# Create a Series from a dictionary object
squares = {x: x*x for x in range(1, 5)}
squares_s = pd.Series(squares)
  • You can assign a name to the Series object by setting the name argument. This name will become the column name if the Series becomes part of a DataFrame object.

  • You can also assign an index to the Series (e.g., by setting the index argument) if you find it more useful than the default 0-based index. Note that the index's length should match your data's length (see the sketch after this list).

  • If you create a Series object from a dict, the keys will become the index.

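A minimal sketch of the name and index arguments; the values are hypothetical:

# A named Series with a custom index whose length matches the data
prices_s = pd.Series([9.99, 19.99, 29.99], index=['basic', 'plus', 'pro'], name='price')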

3. Construct DataFrame

Oftentimes, you need to create DataFrame objects using Python built-in objects, such as lists and dictionaries. The following code snippet highlights two common use scenarios.

# Create a DataFrame from a dictionary of lists as values
data_dict = {'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]}
data_df0 = pd.DataFrame(data_dict)

# Create a DataFrame from a list
data_list = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
data_df1 = pd.DataFrame(data_list, columns=tuple('abc'))
  • The first one uses a dict object. Its keys will become column names while its values will become values for the corresponding columns.

  • The second one uses a list object. Unlike the previous method, the constructed DataFrame will use the data row-wise, which means that each inner list will become a row of the created DataFrame object.

4. Overview of DataFrame

When we have the DataFrame to work with, we may want to take a look at the dataset from the 30,000-foot level. There are several common ways to do so, as shown below.

# Find out how many rows and columns the DataFrame has
df.shape

# Take a quick peek at the beginning and the end of the data
df.head()
df.tail()

# Get a random sample
df.sample(5)

# Get the information of the dataset
df.info()

# Get the descriptive stats of the numeric values
df.describe()
  • It’s important to check both the head and the tail especially if you work with a large amount of data, because you want to make sure that all of your data have been read completely.

  • The info() function will give you an overview of the columns in terms of data types and item counts.

  • It's also useful to take a random sample of your data to check your data's integrity (i.e., the sample() function).

5. Rename Columns

You may notice that some columns of your data don't make much sense, or that their names are too long to work with, and you want to rename these columns.

# Rename columns using mapping
df.rename({'old_col0': 'col0', 'old_col1': 'col1'}, axis=1)

# Rename columns by specifying columns directly
df.rename(columns={'old_col0': 'col0', 'old_col1': 'col1'})
  • You need to specify axis=1 to rename columns if you simply provide a mapping object (e.g., dict).

  • Alternatively, you can specify the mapping object to the columns argument explicitly.

  • The rename function will create a new DataFrame by default. If you want to rename the DataFrame in place, you need to specify inplace=True.

6. Sort Data

To make your data more structured, you need to sort the DataFrame object.

# Sort data
df.sort_values(by=['col0', 'col1'])
  • By default, the sort_values function will sort your rows (axis=0). In most cases, we use columns as the sorting keys.

  • The sort_values function will create a sorted DataFrame object by default. To change it, use inplace=True.

  • By default, the sorting is ascending for all sorting keys. If you want descending order, specify ascending=False. If you want mixed orders (e.g., some keys ascending and some descending), you can create a list of boolean values matching the number of keys, like by=['col0', 'col1', 'col2'], ascending=[True, False, True].

  • The original index values will go with their old data rows. In many cases, you need to re-index. Instead of calling the reset_index function directly, you can specify ignore_index=True, which will reset the index for you after the sorting is completed (see the sketch after this list).

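A minimal sketch combining these options, using the same hypothetical columns as above:

# Sort col0 ascending and col1 descending, and reset the index afterwards
df.sort_values(by=['col0', 'col1'], ascending=[True, False], ignore_index=True)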

7. Deal With Duplicates

It's a common scenario for real-life datasets to contain duplicate records, whether by human mistake or database glitch. We want to remove these duplicates because they can cause unexpected problems later on.

# To examine whether there are duplicates using all columns
df.duplicated().any()

# To examine whether there are duplicates using particular columns
df.duplicated(['col0', 'col1']).any()

The above functions return a boolean value telling you whether any duplicate records exist in your dataset. To find out the exact number of duplicate records, you can use the sum() function on the Series of boolean values returned by the duplicated() function (Python treats True as a value of 1), as shown below. One additional thing to note is that when the keep argument is set to False, all duplicates are marked as True. Suppose there are three duplicate records: with keep=False, all three are marked as True (i.e., duplicated), whereas with keep="first" or keep="last", the first or the last occurrence is kept (marked False) and the remaining ones are marked True.

# To find out the number of duplicates
df.duplicated().sum()
df.duplicated(keep=False).sum()

To actually view the duplicated records, you need to select data from the original dataset using the generated duplicated Series object, as shown below.

# Get the duplicate records
duplicated_indices = df.duplicated(['col0', 'col1'], keep=False)
duplicates = df.loc[duplicated_indices, :].sort_values(by=['col0', 'col1'], ignore_index=True)
  • We get all duplicate records by setting the keep argument to be False.

  • To better view the duplicate records, you may want to sort the generated DataFrame using the same set of keys.

Once you have a good idea about the duplicate records in your dataset, you can drop them as shown below.

# Drop the duplicate records
df.drop_duplicates(['col0', 'col1'], keep="first", inplace=True, ignore_index=True)
  • By default, the kept record will be the first of the duplicates.

  • You'll need to specify inplace=True if you want to update the DataFrame in place. By the way, many other functions have this option; its discussion will be skipped most of the time from here on.

  • As with the sort_values() function, you may want to reset the index afterwards by specifying the ignore_index argument (a new feature in pandas 1.0).

8. Handle Missing Data

Missing data are common in real-life datasets, whether because a measure wasn't available or simply because human data entry mistakes produced meaningless values that are deemed missing. To get an overall idea of how many missing values your dataset has, you've seen that the info() function tells us how many non-null values each column has. We can get information about data missingness in a more structured fashion, as shown below.

# Find out how many missing values for each column
df.isnull().sum()

# Find out how many missing values for the entire dataset
df.isnull().sum().sum()
  • The isnull() function creates a DataFrame of the same shape as your original DataFrame, with each value indicating whether the original value is missing (True) or not (False). As a related note, you can use the notnull() function if you want to generate a DataFrame indicating the non-null values.

  • As mentioned previously, a True value in Python is arithmetically equal to 1. The sum() function computes the sum of these boolean values for each column (by default, it calculates the sum column-wise), which reflects the number of missing values.

Once we have some idea about the missingness of the dataset, we usually want to deal with it. Possible solutions include dropping the records with any missing values or filling them with applicable values.

# Drop the rows with any missing values
df.dropna(axis=0, how="any")

# Drop the rows without 2 or more non-null values
df.dropna(thresh=2)

# Drop the columns with all values missing
df.dropna(axis=1, how="all")
  • By default, the dropna() function works row-wise (i.e., axis=0). If you specify how="any", rows with any missing values will be dropped.

  • When you set the thresh argument, it requires that the row (or the column when axis=1) have at least that number of non-missing values.

  • As with many other functions, when you set axis=1, you perform the operation column-wise. In this case, the above function call will remove the columns that have all of their values missing.

Besides the operation of dropping data rows or columns with missing values, it’s also possible to fill the missing values with some values, as shown below.

# Fill missing values with 0 or any other applicable value
df.fillna(value=0)

# Fill the missing values with customized mapping for columns
df.fillna(value={"col0": 0, "col1": 999})

# Fill missing values with the next valid observation
df.fillna(method="bfill")

# Fill missing values with the last valid observation
df.fillna(method="ffill")
  • To fill the missing values with specified values, you can set the value argument either to a fixed value used everywhere or to a dict object that specifies the fill value for each column.

  • Alternatively, you can fill the missing values using the existing observations surrounding the missing ones, with either a backward fill or a forward fill.

9. Descriptive Statistics by Group

When you conduct machine learning research or data analysis, it’s often necessary to perform particular operations with some grouping variables. In this case, we need to use the groupby() function. The following code snippet shows you some common scenarios that apply.

# Get the count by group, a 2 by 2 example
df.groupby(['col0', 'col1']).size()

# Get the mean of all applicable columns by group
df.groupby(['col0']).mean()

# Get the mean for a particular column
df.groupby(['col0'])['col1'].mean()

# Request multiple descriptive stats
df.groupby(['col0', 'col1']).agg({
    'col2': ['min', 'max', 'mean'],
    'col3': ['nunique', 'mean']
})
  • By default, the groupby() function returns a GroupBy object. If you want a DataFrame instead, you can call reset_index() on the aggregated result. Alternatively, you can specify as_index=False in the groupby() call to create a DataFrame directly (see the sketch after this list).

  • The size() function is useful if you want to know the frequency of each group.

  • The agg() function allows you to generate multiple descriptive statistics. You can simply pass a set of function names, which will apply to all columns. Alternatively, you can pass a dict object with functions to apply to specific columns.

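A minimal sketch of the two ways to get a flat DataFrame back, using the same hypothetical columns as above:

# Call reset_index() on the aggregated result, or pass as_index=False up front
df.groupby(['col0']).mean().reset_index()
df.groupby(['col0'], as_index=False).mean()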

10. Wide to Long Format Transformation

Depending on how the data are collected, the original dataset may be in the “wide” format — each row represents a data record with multiple measures (e.g., different time points for a subject in a research study). If we want to convert the “wide” format to the “long” format (e.g., each time point becomes a data row and thus a subject has multiple rows), we can use the melt() function, as shown below.

Wide to Long Transformation
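
The transformation might look like the following minimal sketch, which assumes a hypothetical df_wide with the subject, before_meds, and after_meds columns used later in this article:

# A wide-format table: one row per subject, one column per measurement
df_wide = pd.DataFrame({
    'subject': [100, 101, 102],
    'before_meds': [120, 135, 128],
    'after_meds': [115, 130, 126],
})

# Unpivot the measurement columns into rows
df_long = pd.melt(
    df_wide,
    id_vars=['subject'],                        # identifier column(s)
    value_vars=['before_meds', 'after_meds'],   # columns holding the values
    var_name='treatment',                       # the old column names go here
    value_name='measurement',                   # the old cell values go here
)
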
  • The melt() function is essentially “unpivoting” a data table (we’ll talk about pivoting next). You specify the id_vars to be the columns that are used as identifiers in the original dataset.

  • The value_vars argument is set using the columns that contain the values. By default, the columns will become the values for the var_name column in the melted dataset.

11. Long to Wide Format Transformation

The opposite operation to the melt() function is called pivoting, which we can perform with the pivot() function. Suppose that the "long" format DataFrame created above is called df_long. The following code shows how we can convert the long format back to the wide format, basically reversing the process in the previous section.

melt()函數相反的操作稱為pivoting,我們可以通過pivot()函數來實現。 假設創建的“寬”格式DataFrame稱為df_long 。 以下功能向您展示了如何將寬格式轉換為長格式-基本上逆轉了上一節中的過程。

Long to Wide Transformation
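
Again a minimal sketch under the same assumptions, pivoting the hypothetical df_long from the previous section back into one row per subject:

# Pivot the long-format table back to wide: one row per subject,
# one column per treatment value
df_wide_again = df_long.pivot(
    index='subject',         # becomes the new index
    columns='treatment',     # unique values become the new columns
    values='measurement',    # fills the cells
).reset_index()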

Besides the pivot() function, a closely related function is the pivot_table() function, which is more general than pivot() because it allows duplicate entries in the index or columns (see here for a more detailed discussion).

pivot_table() pivot()函數外,一個密切相關的函數是pivot_table()函數,它比pivot()函數更通用,它允許重復的索引或列(請參見此處以獲取更詳細的討論)。

12. Select Data

When we work with a complex dataset, we often need to select a subset of it for particular operations based on some criteria. If you want to select certain columns, the following code shows you how to do it. The selected data will include all the rows.

# Select a column
df_wide['subject']

# Select multiple columns
df_wide[['subject', 'before_meds']]

If you want to select certain rows with all columns, do the following.

# Select rows with a specific condition
df_wide[df_wide['subject'] == 100]

If you want to select certain rows and columns, you should consider using the iloc or loc methods. The major difference between these methods is that iloc uses the 0-based positional index, while loc uses labels.

Data Selection
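
A minimal sketch of the kinds of paired calls discussed below, reusing the hypothetical df_wide:

# Position-based selection: the first two rows and the first two columns
df_wide.iloc[0:2, 0:2]                          # the stop index (2) is excluded

# Label-based selection of the same rows and columns
df_wide.loc[0:1, ['subject', 'before_meds']]    # the stop label (1) is included

# Boolean selection: iloc needs the underlying numpy array, loc accepts the Series
mask = df_wide['before_meds'] > 125
df_wide.iloc[mask.values, :]
df_wide.loc[mask, :]
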
  • Each pair of iloc and loc calls above produces the same output.

  • When you use slice objects with iloc, the stop index isn't included, just as with regular Python slice objects. However, slice objects do include the stop label in the loc method.

  • When you use a boolean array with iloc, you need to use the actual values (via the values attribute, which returns the underlying numpy array). If you don't do that, you'll probably encounter the following error: NotImplementedError: iLocation based boolean indexing on an integer type is not available.

  • Here the labels used by loc happen to select the same rows as the positions used by iloc, because the default index labels are the same 0-based integers. In other words, iloc always uses 0-based positions, regardless of the numeric values of the index.

13. New Columns Using Existing Data (map and apply)

Existing columns don't always present the data in the format we want, so we often need to generate new columns using existing data. Two functions are particularly useful in this case: map() and apply(). There are many possible ways to use them to create new columns. For instance, the apply() function can take a more complex mapping function and can create multiple columns. I'll just show you the two most common use cases with the following rules of thumb. Let's keep the goal simple: just create one column in either case.

  • If your data conversion involves just one column, simply use the map() function on the column (in essence, it’s a Series object).

  • If your data conversion involves multiple columns, use the apply() function.

Map and Apply
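
A minimal sketch of both cases, again using the hypothetical df_wide; the lambda functions and the before_band column are illustrative only:

# map(): the conversion involves a single column (a Series)
df_wide['before_band'] = df_wide['before_meds'].map(lambda x: 'high' if x > 125 else 'normal')

# apply(): the conversion involves multiple columns, so go row-wise with axis=1
df_wide['change'] = df_wide.apply(lambda row: row['before_meds'] - row['after_meds'], axis=1)
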
  • In both cases, I used lambda functions, but you can use regular functions as well. It's also possible to provide a dict object to the map() function, which will map old values to new values based on its key-value pairs (keys are the old values, values the new values).

  • For the apply() function, when we create a new column, we need to specify axis=1, because we’re accessing data row-wise.

  • For the apply() function, the example shown is intended for demonstration purposes, because I could've used the original columns to do a simpler arithmetic subtraction like this: df_wide['change'] = df_wide['before_meds'] - df_wide['after_meds'].

14. Concatenation and Merging

When we have multiple datasets, it's necessary to put them together from time to time. There are two common scenarios. The first is when you have datasets of similar shape, sharing either the same index or the same columns; in that case you can consider concatenating them directly. The following code shows some possible concatenations.

# When the data have the same columns, concatenate them vertically
dfs_a = [df0a, df1a, df2a]
pd.concat(dfs_a, axis=0)

# When the data have the same index, concatenate them horizontally
dfs_b = [df0b, df1b, df2b]
pd.concat(dfs_b, axis=1)
  • By default, the concatenation performs an “outer” join, which means that if there are any non-overlapping index or columns, all of them will be kept. In other words, it’s like creating a union of two sets.

  • Another thing to remember is that if you need to concatenate multiple DataFrame objects, it's recommended that you collect them in a list and perform the concatenation just once; concatenating them sequentially would generate unnecessary intermediate DataFrame objects.

  • If you want to reset the index of the concatenated DataFrame, you can set the ignore_index=True argument.

The other scenario is to merge datasets that have one or two overlapping identifiers. For instance, one DataFrame has id number, name and gender, and the other has id number and transaction records. You can merge them using the id number column. The following code shows you how to merge them.

# Merge DataFrames that have the same merging keys
df_a0 = pd.DataFrame(dict(), columns=['id', 'name', 'gender'])
df_b0 = pd.DataFrame(dict(), columns=['id', 'name', 'transaction'])
merged0 = df_a0.merge(df_b0, how="inner", on=["id", "name"])

# Merge DataFrames that have different merging keys
df_a1 = pd.DataFrame(dict(), columns=['id_a', 'name', 'gender'])
df_b1 = pd.DataFrame(dict(), columns=['id_b', 'transaction'])
merged1 = df_a1.merge(df_b1, how="outer", left_on="id_a", right_on="id_b")
  • When both DataFrame objects share the same key or keys, you can simply specify them (either one or multiple is fine) using the on argument.

  • When they have different names, you can specify which key to use for the left DataFrame (left_on) and which for the right DataFrame (right_on).

  • By default, the merging will use the inner join method. When you want to have other join methods (e.g., left, right, outer), you set the proper value for the how argument.

15. Drop Columns

Although you can keep all the columns in the DataFrame by renaming them without any conflict, sometimes you’d like to drop some columns to keep the dataset clean. In this case, you should use the drop() function.

# Drop the unneeded columns
df.drop(['col0', 'col1'], axis=1)
  • By default, the drop() function uses labels to refer to columns or index, and thus you may want to make sure that the labels are contained in the DataFrame object.

  • To drop rows by index, you use axis=0. To drop columns, which I find to be more common, you use axis=1.

  • Again, this operation creates a new DataFrame object by default; if you prefer changing the original DataFrame, specify inplace=True.

16. Write to External Files

When you want to communicate data with your collaborators or teammates, you need to write your DataFrame objects to external files. In most cases, the comma-delimited files should serve the purposes.

# Write to a csv file, which will keep the index
df.to_csv("filename.csv")

# Write to a csv file without the index
df.to_csv("filename.csv", index=False)

# Write to a csv file without the header
df.to_csv("filename.csv", header=False)
  • By default, the generated file will keep the index. You need to specify index=False to remove the index from the output.

  • By default, the generated file will keep the header (e.g., column names). You need to specify header=False to remove the headers.

Conclusion

In this article, we reviewed the basic operations that you'll find useful for getting started with the pandas library. As the title suggests, these techniques aren't intended to handle data in a fancy way. Instead, they're all basic techniques that let you process your data the way you want. Later on, you can probably find fancier ways to get some of these operations done.

Translated from: https://towardsdatascience.com/nothing-fancy-but-16-essential-operations-to-get-you-started-with-pandas-5b0c2f649068
