binary masks
All men are sculptors, constantly chipping away the unwanted parts of their lives, trying to create their idea of a masterpiece … Eddie Murphy
If you ever wonder how to filter or handle unwanted, missing, or invalid data in your data science projects or, in general, Python programming, then you must learn the helpful concept of Masking. In this post, I will first guide you through an example for 1-d arrays, followed by 2-d arrays (matrices), and then provide an application of Masking in a Data Science Problem.
1-d Arrays
Suppose we have the following NumPy array:
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])
Now, we want to compute the sum of the elements that are smaller than 4 or larger than 6. One tedious way is to use a for loop, check whether each number fulfills the condition, and then add it to a variable. This would look something like:
total = 0  # Variable to store the sum
for num in arr:
    if (num < 4) or (num > 6):
        total += num
print(total)
>>> 21
You can reduce this code to a one-liner using a list comprehension:
total = sum([num for num in arr if (num<4) or (num>6)])
>>> 21
The same task can be achieved using the concept of Masking. It essentially works with a list of Booleans (True/False) which, when applied to the original array, returns the elements of interest. Here, True refers to the elements that satisfy the condition (smaller than 4 or larger than 6 in our case), and False refers to the elements that do not.
Let us first create this mask manually.
mask = [True, True, True, False, False, False, True, True]
Next, we pass this mask (list of Booleans) to our array using indexing. This will return only the elements that satisfy this condition. You can then sum up this sub-array. The following snippet explains it. You will notice that you do not get back 4, 5, and 6 because the corresponding value was False.
arr[mask]
>>> array([1, 2, 3, 7, 8])
arr[mask].sum()
>>> 21
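Typing the mask by hand does not scale beyond a toy array. As a minimal sketch (the names `keep` and `total` are illustrative and not from the original post), a comparison applied to the whole NumPy array returns a Boolean array, and two such arrays can be combined element-wise with `|` (or) and `&` (and):

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])

# Element-wise comparisons yield Boolean arrays; | combines them element-wise.
# The parentheses are required because | binds tighter than < and >.
keep = (arr < 4) | (arr > 6)

print(keep.tolist())   # [True, True, True, False, False, False, True, True]
print(arr[keep])       # [1 2 3 7 8]

total = arr[keep].sum()
print(total)           # 21
```

The resulting `keep` array is the same mask that was written out manually above.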
NumPy’s MaskedArray Module
NumPy offers a built-in MaskedArray module called ma. The masked_array() function of this module lets you directly create a "masked array" in which the elements that do not fulfill the condition are labeled "invalid". This is done with the mask argument, which takes a sequence of True/False (or, equivalently, 1/0) values.
Caution: Here, mask=False or mask=0 literally means do not label this value as invalid. Simply put, include it in the computation. Similarly, mask=True or mask=1 means label this value as invalid. By contrast, with the Boolean indexing used earlier, the elements whose mask value was False were the ones excluded.
Therefore, you now have to swap the True and False values when using the ma module. So, the new mask becomes
mask = [False, False, False, True, True, True, False, False]
import numpy.ma as ma

# First create a normal NumPy array
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])
ma_arr = ma.masked_array(arr, mask=[False, False, False, True,
                                    True, True, False, False])
>>> masked_array(data=[1, 2, 3, --, --, --, 7, 8],
                 mask=[False, False, False, True, True, True, False,
                       False], fill_value=999999)
ma_arr.sum()
>>> 21
The masked (invalid) values are now represented by --. The shape/size of the resulting masked_array is the same as that of the original array. Previously, when we used arr[mask], the resulting array was shorter than the original array because the invalid elements were not in the output. Keeping the original shape makes arithmetic between arrays of equal length but with different masks straightforward.
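To make the point about arithmetic on equally long arrays with different masks concrete, here is a small sketch. The arrays `a` and `b` and their masks are invented for illustration; the behavior shown is that a position ends up masked in the result if it is masked in either operand.

```python
import numpy.ma as ma

# Two arrays of equal length, each with a different element marked invalid.
a = ma.masked_array([1, 2, 3, 4], mask=[True, False, False, False])
b = ma.masked_array([10, 20, 30, 40], mask=[False, False, False, True])

# Element-wise addition keeps the shape; a position is masked in the result
# if it is masked in either operand.
c = a + b

print(c)                 # [-- 22 33 --]
print(c.mask.tolist())   # [True, False, False, True]
print(c.sum())           # 55
```

With plain Boolean indexing, `a` and `b` would each shrink to three different elements and could no longer be added position by position.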
As before, you can also create the mask using a list comprehension. However, because you want to swap the True and False values, you can use the tilde operator ~ to reverse the Booleans. This works here because comparisons on NumPy array elements return NumPy Booleans; on plain Python bool values, ~ is an integer operation (~True is -2), so use not instead.
# Using the tilde operator to reverse the Boolean
ma_arr = ma.masked_array(arr, mask=[~((a<4) or (a>6)) for a in arr])
ma_arr.sum()
>>> 21
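The same mask can be built without a Python-level loop. The sketch below is one possible vectorized variant, not the original author's code: `keep` is an illustrative name, and `ma.masked_where(condition, a)` is used as an alternative to passing `mask=` explicitly.

```python
import numpy as np
import numpy.ma as ma

arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])

# Boolean array of the elements we want to KEEP (same condition as before).
keep = (arr < 4) | (arr > 6)

# numpy.ma expects True for INVALID entries, so invert the array with ~.
ma_arr = ma.masked_array(arr, mask=~keep)

# masked_where(condition, a) masks a wherever condition is True.
ma_arr2 = ma.masked_where(~keep, arr)

print(ma_arr.sum())          # 21
print(ma_arr.compressed())   # [1 2 3 7 8], same elements as arr[keep]
```

`compressed()` returns only the valid entries as an ordinary 1-d array, which ties the masked array back to the Boolean indexing result from the first section.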
You can also use a mask consisting of 0 and 1.
ma_arr = ma.masked_array(arr, mask=[0, 0, 0, 1, 1, 1, 0, 0])
Depending on the type of masking condition, NumPy offers several built-in functions that save you from specifying the Boolean mask by hand. A few such conditions are:
- less than (or less than or equal to) a number
- greater than (or greater than or equal to) a number
- within a given range
- outside a given range
Less than (or less than or equal to) a number
The function masked_less() will mask/filter the values less than a given number.
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])
ma_arr = ma.masked_less(arr, 4)
>>> masked_array(data=[--, --, --, 4, 5, 6, 7, 8],
                 mask=[True, True, True, False, False, False,
                       False, False], fill_value=999999)
ma_arr.sum()
>>> 30
To filter the values less than or equal to a number, use masked_less_equal().
Greater than (or greater than or equal to) a number
We use the function masked_greater() to filter the values greater than 4.
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])
ma_arr = ma.masked_greater(arr, 4)
>>> masked_array(data=[1, 2, 3, 4, --, --, --, --],
                 mask=[False, False, False, False, True, True,
                       True, True], fill_value=999999)
ma_arr.sum()
>>> 10
Likewise, masked_greater_equal() filters the values greater than or equal to 4.
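The only difference between the strict functions and their _equal variants is whether the threshold itself gets masked. A short side-by-side sketch on the same array, with the threshold 4 chosen to match the examples above:

```python
import numpy as np
import numpy.ma as ma

arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])

# Strict version: 4 itself stays valid.
print(ma.masked_less(arr, 4).sum())           # 30  (4+5+6+7+8)
# Inclusive version: 4 is masked as well.
print(ma.masked_less_equal(arr, 4).sum())     # 26  (5+6+7+8)

# Strict version: 4 itself stays valid.
print(ma.masked_greater(arr, 4).sum())        # 10  (1+2+3+4)
# Inclusive version: 4 is masked as well.
print(ma.masked_greater_equal(arr, 4).sum())  # 6   (1+2+3)
```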
Within a given range
The function masked_inside() will mask/filter the values lying between two given numbers (both inclusive). The following masks the values from 4 to 6.
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])
ma_arr = ma.masked_inside(arr, 4, 6)
>>> masked_array(data=[1, 2, 3, --, --, --, 7, 8],
                 mask=[False, False, False, True, True, True,
                       False, False], fill_value=999999)
ma_arr.sum()
>>> 21
Outside a given range
The function masked_outside() will mask/filter the values lying outside the range given by two numbers; the two endpoints themselves are kept. The following masks the values outside 4-6.
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])
ma_arr = ma.masked_outside(arr, 4, 6)
>>> masked_array(data=[--, --, --, 4, 5, 6, --, --],
                 mask=[True, True, True, False, False, False,
                       True, True], fill_value=999999)
ma_arr.sum()
>>> 15
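For the same pair of bounds, masked_inside() and masked_outside() split the array into two complementary parts. The following sketch checks this on the example array (the names `inside` and `outside` are illustrative):

```python
import numpy as np
import numpy.ma as ma

arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])

inside = ma.masked_inside(arr, 4, 6)    # masks 4, 5, 6
outside = ma.masked_outside(arr, 4, 6)  # masks everything except 4, 5, 6

# The endpoints 4 and 6 are masked by masked_inside and kept by masked_outside,
# so the two masks are exact complements.
print(np.array_equal(inside.mask, ~outside.mask))   # True

# Every element is counted in exactly one of the two sums.
print(inside.sum() + outside.sum() == arr.sum())    # True  (21 + 15 == 36)
```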
Neglecting NaN and/or infinite values during arithmetic operations
This is a cool feature! Often a realistic dataset has lots of missing values (NaNs) or some weird, infinity values. Such values create problems in computations and, therefore, are either neglected or imputed.
For example, the sum or the mean of this 1-d NumPy array will be nan.
arr = np.array([1, 2, 3, np.nan, 5, 6, np.inf, 8])
arr.sum()
>>> nan
You can easily exclude the NaN and infinite values from the computations using masked_invalid(). These invalid values will now be represented as --. This feature is extremely useful for dealing with missing data in large datasets in data science problems.
ma_arr = ma.masked_invalid(arr)
>>> masked_array(data=[1.0, 2.0, 3.0, --, 5.0, 6.0, --, 8.0],
                 mask=[False, False, False, True, False, False,
                       True, False], fill_value=1e+20)
ma_arr.mean()
>>> 4.166666666666667
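As a cross-check, the same mean can be obtained with the plain Boolean indexing from the first section. This sketch assumes `np.isfinite` as the validity test, which is False for NaN and for positive or negative infinity; the variable names are illustrative.

```python
import numpy as np
import numpy.ma as ma

arr = np.array([1, 2, 3, np.nan, 5, 6, np.inf, 8])

# np.isfinite is False for NaN and for +/- infinity.
finite = np.isfinite(arr)

mean_indexing = arr[finite].mean()            # plain Boolean indexing
mean_masked = ma.masked_invalid(arr).mean()   # masked-array route

print(np.isclose(mean_indexing, mean_masked))  # True
```

The masked-array route has the advantage that the array keeps its original length, so the positions of the invalid entries are not lost.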
Let’s say you want to impute or fill these NaNs or inf values with the mean of the remaining, valid values. You can do this easily using filled():
ma_arr.filled(ma_arr.mean())
>>> [1., 2., 3., 4.16666667, 5., 6., 4.16666667, 8.]
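filled() accepts any fill value, not only the mean. As a hedged variation on the example above, the sketch below imputes with the median of the valid entries instead; the median is computed with `np.median` on `compressed()`, which is one possible way to do it, and the names `fill` and `imputed` are illustrative.

```python
import numpy as np
import numpy.ma as ma

arr = np.array([1, 2, 3, np.nan, 5, 6, np.inf, 8])
ma_arr = ma.masked_invalid(arr)

# Median of the valid values only: 1, 2, 3, 5, 6, 8.
fill = np.median(ma_arr.compressed())
imputed = ma_arr.filled(fill)

print(fill)                     # 4.0
print(type(imputed).__name__)   # ndarray
print(imputed.tolist())         # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 4.0, 8.0]
```

Note that filled() returns an ordinary NumPy array, so the mask is gone after imputation.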
Masking 2-d arrays (matrices)
Often your big data is in the form of a large 2-d matrix. Let’s see how you can use masking for matrices. Consider the following 3 x 3 matrix.
arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])
Suppose we want to compute the column-wise sum excluding the numbers that are larger than 4. Now, we have to use a 2-d mask. As mentioned earlier, you can use True/False instead of the 1/0 values in the 2-d mask below.
ma_arr = ma.masked_array(arr, mask=[[0, 0, 0],
                                    [0, 1, 1],
                                    [1, 1, 1]])
>>> masked_array(data=[[1, 2, 3],
                       [4, --, --],
                       [--, --, --]],
                 mask=[[False, False, False],
                       [False, True, True],
                       [True, True, True]], fill_value=999999)

# Column-wise sum is computed using the argument axis=0
ma_arr.sum(axis=0)
>>> masked_array(data=[5, 2, 3], mask=[False, False, False],
                 fill_value=999999)
In the above code, we created a 2-d mask manually using 0 and 1. You can make your life easier by using the same functions as for the 1-d case. Here, you can use masked_greater() to exclude the values greater than 4.
ma_arr = ma.masked_greater(arr, 4)
ma_arr.sum(axis=0)
>>> masked_array(data=[5, 2, 3], mask=[False, False, False],
                 fill_value=999999)
NOTE: You can use all the functions, earlier shown for 1-d, also for 2-d arrays.
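Other reductions respect the 2-d mask in the same way. The sketch below, using the same 3 x 3 matrix, shows how many valid entries each column has, what happens to a row in which every entry is masked, and the column-wise mean over the valid entries. The names `row_sums` and `col_means` are illustrative.

```python
import numpy as np
import numpy.ma as ma

arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])
ma_arr = ma.masked_greater(arr, 4)

# Number of valid (unmasked) entries in each column.
print(ma_arr.count(axis=0))   # [2 1 1]

# Row-wise sum: the last row has no valid entries, so its sum is itself masked.
row_sums = ma_arr.sum(axis=1)
print(row_sums)               # [6 4 --]

# Column-wise mean over the valid entries only.
col_means = ma_arr.mean(axis=0)
print(col_means.tolist())     # [2.5, 2.0, 3.0]
```

A fully masked row or column therefore does not silently turn into 0; it stays flagged as invalid in the result.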
Use of masking in a data science problem
A routine task of any data science project is an exploratory data analysis (EDA). One key step in this direction is to visualize the statistical relationship (correlation) between the input features. For example, Pearson’s correlation coefficient provides a measure of linear correlation between two variables.
Let’s consider the Boston Housing Dataset and compute the correlation matrix which results in coefficients ranging between -1 and 1.
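import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Note: load_boston was removed in scikit-learn 1.2, so this import
# only works with older versions of the library.
from sklearn.datasets import load_boston

boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)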
Let’s now plot the correlation matrix using the Seaborn library.
correlation = df.corr()
ax = sns.heatmap(data=correlation, cmap='coolwarm',
                 linewidths=2, cbar=True)

Now, suppose you want to highlight or easily distinguish the values whose absolute correlation exceeds 70%, i.e. 0.7. The masking concepts introduced above come into play here. You can use the masked_outside() function, explained earlier, to mask the required values and highlight them with a special color in your Seaborn plot.
import copy

correlation = df.corr()

# Create a mask for abs(corr) > 0.7
corr_masked = ma.masked_outside(np.array(correlation), -0.7, 0.7)

# Set gold color for the masked/bad values
# (work on a copy so the registered 'coolwarm' colormap is left untouched)
cmap = copy.copy(plt.get_cmap('coolwarm'))
cmap.set_bad('gold')
ax = sns.heatmap(data=correlation, cmap=cmap,
                 mask=corr_masked.mask,
                 linewidths=2, cbar=True)
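To see what the mask passed to the heatmap actually contains, here is a small NumPy-only sketch. The 3 x 3 correlation matrix is hypothetical, with values invented for illustration rather than taken from the Boston data.

```python
import numpy as np
import numpy.ma as ma

# Hypothetical 3 x 3 correlation matrix (values invented for illustration).
corr = np.array([[ 1.00,  0.82, -0.35],
                 [ 0.82,  1.00, -0.74],
                 [-0.35, -0.74,  1.00]])

# Mask everything outside [-0.7, 0.7], i.e. every entry with abs(corr) > 0.7.
corr_masked = ma.masked_outside(corr, -0.7, 0.7)

# True marks the cells that the heatmap treats as masked.
print(corr_masked.mask.tolist())
# [[True, True, False], [True, True, True], [False, True, True]]

# The mask matches the direct comparison on absolute values.
print(np.array_equal(corr_masked.mask, np.abs(corr) > 0.7))   # True
```

Note that the diagonal is always masked, because every feature has a correlation of 1 with itself.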

Translated from: https://towardsdatascience.com/the-concept-of-masks-in-python-50fd65e64707