Advanced NumPy for Data Science

This article covers some of the advanced concepts of NumPy, specifically the functions and methods required to work on a real dataset. The concepts covered here are enough to start your journey with data.

To follow along, you should know the basic concepts of NumPy. If you don't, I suggest reading my article “NumPy-The very basics!” first. You can find a link to it at the end of this article.

Contents

  1. Universal Functions
  2. Aggregation
  3. Broadcasting
  4. Masking
  5. Fancy Indexing
  6. Array Sorting

What are Universal Functions in NumPy?

Most of the time we have to loop over an array to perform simple computations such as addition, subtraction, or division on each element. Because these operations are repeated for every element, the computation time grows as the data gets larger. Thankfully, NumPy makes this faster through vectorized operations, generally implemented as NumPy’s universal functions (ufuncs). Let’s understand this with an example.

Suppose we have an array of random integers between 1 and 10 and would like to get the square of each element. With plain Python, we would do the following:

Numpy universal functions
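
A minimal sketch of this loop-based approach might look like the following. The seed, the array size, and the variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # seed chosen arbitrarily for repeatability
x = rng.integers(1, 11, size=5)       # random integers from 1 to 10 inclusive

# Plain-Python approach: visit every element and square it one at a time.
squares = np.empty(len(x), dtype=x.dtype)
for i in range(len(x)):
    squares[i] = x[i] ** 2

print(x)
print(squares)
```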

This takes a lot of time to write and to compute, especially for the larger arrays found in a real dataset. Let’s see how ufuncs make it simpler on both counts.

Numpy universal functions
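
The vectorized version can be sketched as below. The sample values are illustrative assumptions, not the ones from the original article.

```python
import numpy as np

x = np.array([3, 7, 1, 9, 4])    # illustrative values

squared = x ** 2                 # the ** operator is applied to every element

print(squared)                   # element values: 9, 49, 1, 81, 16
print(squared.dtype == x.dtype)  # True: the integer dtype is preserved here
```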

Simply performing an operation on the array applies it to each element within the array. As we can see, the result also retains the dtype. Ufunc operations are extremely flexible: we can also perform operations between two arrays.

Numpy universal functions
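
As a small sketch of operations between two arrays of the same shape (the arrays a and b are made up for illustration):

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])

total = a + b      # element values: 11, 22, 33, 44
product = a * b    # element values: 10, 40, 90, 160
ratio = b / a      # true division returns floats: 10.0 for every pair

print(total, product, ratio)
```

Note that true division is one case where the result dtype changes from integer to float.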

All these arithmetic operators are wrappers around NumPy built-in functions. For example, the + operator is a wrapper for the add function.

Numpy universal functions
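
A quick way to convince ourselves of this is to compare the operator with the function directly. The array below is an arbitrary example.

```python
import numpy as np

x = np.arange(5)               # 0, 1, 2, 3, 4

with_operator = x + 2          # uses the + operator
with_ufunc = np.add(x, 2)      # calls the ufunc explicitly

print(np.array_equal(with_operator, with_ufunc))   # True
```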

Below is a summary table of the arithmetic operations in NumPy.

Numpy universal functions
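
The operator-to-ufunc correspondence can also be checked in code. The sketch below is my own reconstruction of the usual pairs, not a copy of the original table; it uses Python's standard operator module to stand in for each symbol.

```python
import operator
import numpy as np

x = np.arange(1, 6)   # 1, 2, 3, 4, 5

# (operator symbol, Python operator function, name of the equivalent ufunc)
pairs = [
    ("+",  operator.add,      "add"),
    ("-",  operator.sub,      "subtract"),
    ("*",  operator.mul,      "multiply"),
    ("/",  operator.truediv,  "divide"),
    ("//", operator.floordiv, "floor_divide"),
    ("**", operator.pow,      "power"),
    ("%",  operator.mod,      "mod"),
]

checks = []
for symbol, op, name in pairs:
    ufunc = getattr(np, name)
    same = np.array_equal(op(x, 2), ufunc(x, 2))
    checks.append(same)
    print(f"{symbol:>2}  ->  np.{name}: {same}")

# Unary minus corresponds to np.negative.
negation_matches = np.array_equal(-x, np.negative(x))
print(negation_matches)
```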

Some of the most useful functions provided by NumPy are the trigonometric, logarithmic, and exponential functions. As data scientists, we should be aware of them; they come in handy when working on real datasets.
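
A few of these functions in action, as a rough sketch. The input values are assumptions picked so the results are easy to verify by hand.

```python
import numpy as np

theta = np.array([0, np.pi / 6, np.pi / 2])   # angles in radians

sines = np.sin(theta)      # approximately 0, 0.5, 1
cosines = np.cos(theta)    # approximately 1, 0.866, 0

x = np.array([1, 2, 4, 8])

exponentials = np.exp(x)   # e raised to each element
natural_logs = np.log(x)   # natural logarithm
base2_logs = np.log2(x)    # 0, 1, 2, 3
base10_logs = np.log10(x)  # base-10 logarithm

print(sines, cosines)
print(exponentials, natural_logs, base2_logs, base10_logs)
```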


Aggregation

As a data analyst or data scientist, the very first step is to explore and understand the data. One way to do that is to compute summary statistics. Although the most common statistics used to summarize data are the mean and standard deviation, other aggregates such as the sum, product, median, maximum, and minimum are also useful.

Let us understand with an example that computes the sum, min, and max.

Numpy aggregation
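
A minimal sketch using the function form of these aggregates (the array values are an arbitrary example):

```python
import numpy as np

x = np.array([4, 8, 15, 16, 23, 42])

total = np.sum(x)      # 108
smallest = np.min(x)   # 4
largest = np.max(x)    # 42

print(total, smallest, largest)
```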

For most NumPy aggregates, a shorthand syntax is to use methods of the array object instead of functions. The above operations can also be performed as shown below, with no difference computationally.

Numpy aggregation
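
The method form of the same aggregates can be sketched like this, again with an arbitrary example array:

```python
import numpy as np

x = np.array([4, 8, 15, 16, 23, 42])

# Method syntax: the aggregate is called on the array object itself.
total = x.sum()
smallest = x.min()
largest = x.max()

print(total, smallest, largest)
```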

IMPORTANT: Difference between Python aggregate functions and NumPy aggregate functions

One question you might raise is why we should use NumPy aggregate functions when such functions are already built into Python (sum(), min(), max(), etc.). One difference is that the NumPy functions are much faster on arrays, but more importantly, they are aware of dimensions. The Python built-ins behave differently on multidimensional arrays.

Suppose we would like to get the sum of all the elements in an array of size 2x5. For better understanding, we will take a simple array of the numbers from 0 to 9.

Numpy aggregation
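
The difference can be reproduced with a short sketch, assuming the array is built with np.arange and reshape:

```python
import numpy as np

x = np.arange(10).reshape(2, 5)   # rows: 0..4 and 5..9

builtin_result = sum(x)      # built-in sum adds the rows: 5, 7, 9, 11, 13
numpy_result = np.sum(x)     # NumPy sums every element: 45

print(builtin_result)
print(numpy_result)
```

The built-in sum() iterates over the first axis, so it adds the two rows together element by element instead of reducing the whole array to one number.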

We were expecting the output to be 45 (0+1+2+3+4+5+6+7+8+9), but the result is quite different. Results like these can be costly when summarizing data. Hence, always make sure you use the NumPy version of an aggregate function when working on arrays.

Multidimensional aggregates

One common type of operation is aggregation along rows or columns. Since NumPy functions are aware of dimensions, this is easy to do, for example finding the minimum value in each row or column. The functions take an additional argument that specifies the axis along which we wish to aggregate.

Suppose we have a table of marks obtained by students, where each row represents a student and each column represents a different subject. We wish to find the minimum and maximum marks in each subject and the total marks scored by each student. We pass ‘axis=0’ for a column-wise operation and ‘axis=1’ for a row-wise one. The result will be a 1-d array.

Numpy aggregation
Numpy aggregation
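
Here is a small sketch of the marks example. The marks themselves are invented for illustration, with four students (rows) and three subjects (columns).

```python
import numpy as np

# Rows are students, columns are subjects (hypothetical data).
marks = np.array([[78, 85, 62],
                  [91, 67, 88],
                  [55, 73, 94],
                  [82, 90, 70]])

subject_min = marks.min(axis=0)     # per subject: 55, 67, 62
subject_max = marks.max(axis=0)     # per subject: 91, 90, 94
student_total = marks.sum(axis=1)   # per student: 225, 246, 222, 242

print(subject_min, subject_max, student_total)
```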

Other aggregation functions in NumPy

np.prod, np.mean, np.std, np.var, np.argmin (find the index of the minimum value), np.argmax (find the index of the maximum value), np.median, np.percentile (compute rank-based statistics of the elements).
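
A brief sketch of these functions on one small array. The values are an assumption, picked because the mean and standard deviation come out as whole numbers.

```python
import numpy as np

x = np.array([2, 4, 4, 4, 5, 5, 7, 9])

product = np.prod(x)              # 201600
mean = np.mean(x)                 # 5.0
std = np.std(x)                   # 2.0 (population standard deviation)
variance = np.var(x)              # 4.0
index_of_min = np.argmin(x)       # 0
index_of_max = np.argmax(x)       # 7
median = np.median(x)             # 4.5
half_way = np.percentile(x, 50)   # 4.5, same as the median

print(product, mean, std, variance)
print(index_of_min, index_of_max, median, half_way)
```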

Broadcasting

We have already seen NumPy universal functions at the very beginning. Broadcasting is another means of applying ufuncs, but on arrays of different sizes. Broadcasting is simply a set of rules that NumPy applies to perform ufuncs on arrays of different sizes.

Consider adding two arrays of size 3x3 and 1x3. We can think of this operation as the smaller array being stretched, or broadcast, to match the size of the larger array. This stretching does not actually take place; it is just a mental model for better understanding.

Numpy broadcasting
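
A minimal sketch of adding a 3x3 array and a 1x3 array, using made-up values:

```python
import numpy as np

a = np.arange(9).reshape(3, 3)   # rows: 0..2, 3..5, 6..8
b = np.array([[10, 20, 30]])     # shape (1, 3)

result = a + b                   # b behaves as if repeated for every row of a

print(result.shape)              # (3, 3)
print(result)
```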

Things get more confusing and complicated when both arrays need to be broadcast.

Numpy broadcasting
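
As a sketch of the case where both arrays are stretched, consider a column of shape (3, 1) and a one-dimensional row of shape (3,). The values are illustrative assumptions.

```python
import numpy as np

column = np.array([[0], [10], [20]])   # shape (3, 1)
row = np.array([1, 2, 3])              # shape (3,)

# column is stretched across columns, row is stretched across rows.
result = column + row

print(result.shape)   # (3, 3)
print(result)
```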

Jake VanderPlas, author of the book Python Data Science Handbook, has provided an excellent visualization to explain this process. The light-colored boxes represent the stretched values.

Numpy broadcasting
Source: Python Data Science Handbook

3 Rules for Broadcasting

The above is a mental picture to aid understanding. We will now explore the formal rules with examples.

Example 1:

m = np.arange(3).reshape((3,1))
n = np.arange(3)
m.shape = (3, 1)
n.shape = (3,)

By rule 1, if the two arrays differ in their number of dimensions, the shape of the one with fewer dimensions is padded with ‘1’ on its left side.
m.shape => (3, 1)
n.shape => (1, 3)

By rule 2, if the shapes of the two arrays still do not match in some dimension, the array whose size in that dimension is equal to 1 is stretched to match the other array.
m.shape => (3, 3)
n.shape => (3, 3)

To stress rule 2: an array can be stretched along a dimension only if its size in that dimension is 1. We cannot do this for any size other than 1. Let’s see an example where a dimension that would need stretching during the application of rule 2 has a size different from 1.

Example 2:

m = np.arange(6).reshape((3,2))
n = np.arange(3)
m.shape = (3, 2)
n.shape = (3,)

By rule 1,
m.shape => (3, 2)
n.shape => (1, 3)

By rule 2,
m.shape => (3, 2)
n.shape => (3, 3)

By rule 3, if the sizes disagree in any dimension and neither of them is equal to 1, an error is raised. Here the last dimensions are 2 and 3, so the operation fails.
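
Both examples can be checked in code. This is only a sketch: it assumes m in Example 2 is meant to hold six elements so that it can take the shape (3, 2), and the names m1, m2, and failed are mine.

```python
import numpy as np

n = np.arange(3)                    # shape (3,)

# Example 1: (3, 1) and (3,) broadcast to (3, 3).
m1 = np.arange(3).reshape((3, 1))
result_1 = m1 + n
print(result_1.shape)               # (3, 3)

# Example 2: (3, 2) and (3,) cannot be broadcast together.
m2 = np.arange(6).reshape((3, 2))   # six elements are needed for shape (3, 2)
failed = False
try:
    m2 + n
except ValueError as err:
    failed = True
    print("Broadcasting failed:", err)
```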

Masking

Masking is a method used extensively in data processing. It allows us to extract, count, modify, or otherwise manipulate values in an array based on certain criteria. These criteria are specified using comparison operators and boolean operators.

Suppose we have a two-dimensional array of size (3, 4), and we would like to get the subset of its values that are less than 5.

Numpy masking
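
A minimal sketch of this selection, using a hand-written (3, 4) array as an assumed stand-in for the original data:

```python
import numpy as np

x = np.array([[5, 0, 3, 3],
              [7, 9, 3, 5],
              [2, 4, 7, 6]])

subset = x[x < 5]   # 1-d array of matching values: 0, 3, 3, 3, 2, 4

print(subset)
```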

Let’s break it down

We used the comparison operator ‘<’ on array x. As we already know, this applies an element-wise ufunc (np.less()) to the array. As a result, we get an array of boolean values: True if the element at the corresponding position is less than 5, else False.

Numpy masking
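
The intermediate boolean array can be inspected on its own. The sketch below uses an assumed sample array and also shows that the mask can be used for counting.

```python
import numpy as np

x = np.array([[5, 0, 3, 3],
              [7, 9, 3, 5],
              [2, 4, 7, 6]])

mask = x < 5                                    # boolean array, same shape as x
same_as_ufunc = np.array_equal(mask, np.less(x, 5))
how_many = np.count_nonzero(mask)               # number of True entries: 6

print(mask)
print(same_as_ufunc, how_many)
```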

When we write x[x<5], the boolean array returned above is applied to the original array x, returning the elements at the positions where the mask is True, that is, the values less than 5. In the same way, we can use any of the comparison operators available in Python. We can even combine two conditions, say x[(x>3) & (x<6)], to get the values between 3 and 6; the only requirement is that the result of the operations is boolean. Notice that here we use the bitwise operator ‘&’ rather than the keyword ‘and’.

REMEMBER

The keywords ‘and’ and ‘or’ perform a single boolean operation on the entire array, whereas the bitwise operators ‘&’ and ‘|’ perform multiple boolean operations, one on each element of the array. Always use the bitwise operators when masking.
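
The difference shows up as soon as we try both forms. In this sketch (array and variable names are assumptions), the bitwise operators work element by element, while the keyword raises a ValueError because the truth value of a multi-element array is ambiguous.

```python
import numpy as np

x = np.arange(10)

between = x[(x > 3) & (x < 6)]   # element-wise AND: 4, 5
outside = x[(x < 2) | (x > 7)]   # element-wise OR: 0, 1, 8, 9

try:
    x[(x > 3) and (x < 6)]       # keyword 'and' needs one truth value per operand
    keyword_worked = True
except ValueError:
    keyword_worked = False

print(between, outside, keyword_worked)
```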

Fancy indexing

Fancy indexing is similar to the normal indexing we already know. The only difference is that we pass an array of indices. This advanced form of indexing allows quick access to, and/or modification of, complicated subsets of an array.

Suppose we want to access the elements at indices 2, 5, and 9 of an array. The old-school method would be [x[2], x[5], x[9]]. This can be simplified using fancy indexing.

Numpy indexing
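
A minimal sketch of both approaches on an assumed one-dimensional array:

```python
import numpy as np

x = np.arange(0, 100, 10)        # 0, 10, 20, ..., 90

old_school = [x[2], x[5], x[9]]  # one lookup per index
indices = [2, 5, 9]
fancy = x[indices]               # a single lookup with a list of indices

print(old_school)
print(fancy)                     # element values: 20, 50, 90
```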

Likewise, we can fancy index a two-dimensional array. Let’s see the fancy-indexing equivalent of x[0, 2], x[1, 3], and x[2, 1].

Numpy indexing
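
For the two-dimensional case, the row indices and column indices are passed as two separate lists that are paired up position by position. The array below is an assumed example.

```python
import numpy as np

x = np.arange(12).reshape(3, 4)   # rows: 0..3, 4..7, 8..11

rows = [0, 1, 2]
cols = [2, 3, 1]
picked = x[rows, cols]            # pairs (0, 2), (1, 3), (2, 1): values 2, 7, 9

print(picked)
```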

This can be further simplified if either the row or the column value is constant. Let’s say we would like to get the values at x[2, 1], x[2, 3], and x[2, 4]. In the image below, the yellow highlight marks the row value and the blue highlight marks the column values. Similarly, we can also modify values through fancy indexing by using the assignment operator ‘=’.

Numpy indexing

Array sorting

np.sort is more efficient on arrays than Python’s built-in sorting. Additionally, np.sort is aware of dimensions. Let’s see a few flavors of the NumPy sorting function.

Numpy sorting
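
A few of these flavors, sketched on assumed sample arrays: the np.sort function (returns a sorted copy), sorting along an axis, and the sort() method (sorts in place).

```python
import numpy as np

x = np.array([3, 1, 4, 1, 5, 9, 2, 6])
original = x.copy()                # kept only to show what x looked like

sorted_copy = np.sort(x)           # returns a new array; x is untouched

m = np.array([[7, 2, 9],
              [4, 8, 1]])
by_column = np.sort(m, axis=0)     # sorts each column: [[4, 2, 1], [7, 8, 9]]
by_row = np.sort(m, axis=1)        # sorts each row: [[2, 7, 9], [1, 4, 8]]

x.sort()                           # method form: sorts x in place

print(sorted_copy)
print(by_column)
print(by_row)
print(x)
```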

Notice that when we use the method sort(), it alters array x itself, meaning the original order of array x is lost. This is called in-place sorting.

Advanced NumPy for Data Science: Thank you for reading
Photo by Kelly Sikkema on Unsplash

Although these are not the only concepts in NumPy, I have tried to cover the critical, must-know ones. This is more than enough for getting started with data science. Since Python is open source, functions are regularly added and deprecated, so always keep an eye on NumPy’s official documentation. I will also make sure to keep updating this content as and when required.

If you have difficulty understanding these concepts, try reading the article below first.

Let’s connect

Translated from: https://medium.com/analytics-vidhya/advanced-numpy-218584c60c63
