5 essential pandas tricks you didn't know about
I've been using pandas for years, and whenever I feel I'm typing too much, I google it and usually find a new pandas trick! I learned about these functions recently, and I deem them essential because they are so easy to use.
1. between function

I've been using the BETWEEN operator in SQL for years, but I only recently discovered its counterpart in pandas.
Let’s say we have a DataFrame with prices and we would like to filter prices between 2 and 4.
df = pd.DataFrame({'price': [1.99, 3, 5, 0.5, 3.5, 5.5, 3.9]})
With the between function, you can reduce this filter:
df[(df.price >= 2) & (df.price <= 4)]
To this:
df[df.price.between(2, 4)]

It might not seem like much, but those parentheses get annoying when you write many filters. The filter with the between function is also more readable.
The between function checks the inclusive interval left <= series <= right.
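To see the inclusive endpoints in action, here is a minimal sketch (not from the original article; the inclusive argument accepts 'both', 'left', 'right' or 'neither' in recent pandas versions, while older versions take a boolean):

import pandas as pd

s = pd.Series([2, 3, 4, 5])

s.between(2, 4)                       # True, True, True, False -- both endpoints included
s.between(2, 4, inclusive='neither')  # False, True, False, False -- strict inequalities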
2. Fix the order of the rows with reindex function

The reindex function conforms a Series or a DataFrame to a new index. I resort to it when making reports with columns that have a predefined order.
Let's add T-shirt sizes to our DataFrame. The goal of the analysis is to calculate the mean price for each size:
df = pd.DataFrame({'price': [1.99, 3, 5], 'size': ['medium', 'large', 'small']})
df_avg = df.groupby('size').price.mean()
df_avg

The sizes come out in alphabetical order in the result above, but they should be ordered small, medium, large. Since the sizes are strings, we cannot use the sort_values function. Here the reindex function comes to the rescue:
df_avg.reindex(['small', 'medium', 'large'])
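Putting it together, here is a minimal runnable sketch (the output in the comments follows from the inputs above; fill_value is a standard reindex argument for labels that are missing from the data):

import pandas as pd

df = pd.DataFrame({'price': [1.99, 3, 5], 'size': ['medium', 'large', 'small']})
df_avg = df.groupby('size').price.mean()

# The rows now follow the list we pass instead of alphabetical order:
df_avg.reindex(['small', 'medium', 'large'])
# small     5.00
# medium    1.99
# large     3.00

# Labels absent from the data become NaN unless you pass fill_value:
df_avg.reindex(['small', 'medium', 'large', 'x-large'], fill_value=0)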

3. Describe on steroids

The describe function is an essential tool for Exploratory Data Analysis. It shows basic summary statistics for all the columns in a DataFrame.
df.price.describe()

What if we would like to calculate 10 quantiles instead of the default 3?
df.price.describe(percentiles=np.arange(0, 1, 0.1))

The describe function takes a percentiles argument. We can generate the percentiles with NumPy's arange function to avoid typing each percentile by hand.
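One detail worth noting: np.arange excludes its stop value, so the call above yields the 0%-90% percentiles but not 100%. A small sketch of an alternative (assuming your pandas version accepts the endpoints 0 and 1, which describe then reports as extra 0% and 100% rows):

import numpy as np
import pandas as pd

df = pd.DataFrame({'price': [1.99, 3, 5, 0.5, 3.5, 5.5, 3.9]})

np.arange(0, 1, 0.1)   # 0.0, 0.1, ..., 0.9 -- stop value excluded
np.linspace(0, 1, 11)  # 0.0, 0.1, ..., 0.9, 1.0 -- both endpoints included

df.price.describe(percentiles=np.linspace(0, 1, 11))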
This feature becomes really useful when combined with the groupby function:
df.groupby('size').describe(percentiles=np.arange(0, 1, 0.1))

4. Text search with regex

Our T-shirt dataset has 3 sizes. Let's say we would like to filter the small and medium sizes. A cumbersome way of filtering is:
df[(df['size'] == 'small') | (df['size'] == 'medium')]
This is bad because we usually combine it with other filters, which makes the expression unreadable. Is there a better way?
pandas string columns have a str accessor, which implements many functions that simplify string manipulation. One of them is the contains function, which supports search with regular expressions.
df[df['size'].str.contains('small|medium')]
The filter with the contains function is more readable and easier to extend and combine with other filters.
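str.contains also takes a few extra arguments that are handy in practice. A short sketch (the DataFrame values here are hypothetical, chosen to show mixed casing; case, na and regex are standard arguments of str.contains):

import pandas as pd

df = pd.DataFrame({'price': [1.99, 3, 5], 'size': ['Small', 'medium', 'LARGE']})

# case=False makes the match case-insensitive; na=False treats missing
# values as non-matches instead of propagating NaN
df[df['size'].str.contains('small|medium', case=False, na=False)]

# regex=False matches the pattern literally, which matters when it
# contains metacharacters such as '|'
df[df['size'].str.contains('small', regex=False)]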
5. Bigger than memory datasets with pandas

pandas cannot even read a dataset bigger than main memory: it throws a MemoryError, or the Jupyter kernel crashes. But you don't need Dask or Vaex to process a big dataset, just some ingenuity. Sounds too good to be true?
In case you've missed my article about Dask and Vaex with bigger-than-main-memory datasets:
When doing an analysis, you usually don't need all the rows or all the columns in the dataset.
In case you don't need all the rows, you can read the dataset in chunks and filter out the unnecessary ones to reduce memory usage:
# Read the CSV 1000 rows at a time and keep only the rows that pass the filter
iter_csv = pd.read_csv('dataset.csv', iterator=True, chunksize=1000)
df = pd.concat([chunk[chunk['field'] > constant] for chunk in iter_csv])
Reading a dataset in chunks is slower than reading it all at once. I would recommend this approach only for bigger-than-memory datasets.
In case you don't need all the columns, you can specify the required ones with the usecols argument when reading the dataset:
df = pd.read_csv('dataset.csv', usecols=['col1', 'col2'])
The great thing about these two approaches is that you can combine them:
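A minimal sketch of the combination, under the same assumptions as the snippets above ('dataset.csv', 'field', 'col1' and constant are the placeholder names used in this section):

import pandas as pd

constant = 10  # hypothetical threshold standing in for the article's placeholder

# Read only the needed columns, in chunks, and keep only the matching rows
iter_csv = pd.read_csv('dataset.csv', usecols=['field', 'col1'],
                       iterator=True, chunksize=1000)
df = pd.concat([chunk[chunk['field'] > constant] for chunk in iter_csv])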
Before you go

These are a few links that might interest you:
- Your First Machine Learning Model in the Cloud
- AI for Healthcare
- Parallels Desktop 50% off
- School of Autonomous Systems
- Data Science Nanodegree Program
- 5 lesser-known pandas tricks
- How NOT to write pandas code
Translated from: https://towardsdatascience.com/5-essential-pandas-tricks-you-didnt-know-about-2d1a5b6f2e7