An introduction to aggregates in R: a powerful tool for playing with data

by Satyam Singh Chauhan

Data visualization is not just about colors and graphs. It's about exploring the data and visualizing the right thing.

When playing with data, one of the most powerful tools that comes in handy is the aggregate. An aggregate is a type of transformation applied to a data set: it groups the values and reduces each group to a single value.

We have 11 aggregate functions available to us:

  • avg

    The average of all numeric values in each group is calculated and returned.

  • count

    The function count returns the total number of items in each group.

  • first

    The first value of each group is returned by the function first.

  • last

    The last value of each group is returned by the function last.

  • max

    The maximum value of each group is returned by the function max.

    It is also very helpful for identifying outliers.

  • median

    The median of all numeric values in each group is returned by the function median.

  • min

    The minimum value of each group is returned by the function min.

    It is also very helpful for identifying outliers.

  • mode

    The mode (the most frequent value) of all numeric values in each group is returned by the function mode.

  • rms

    The root mean square (rms) of all numeric values in each group is returned by the function rms.

  • stddev

    The standard deviation of all numeric values in each group is returned by the function stddev.

  • sum

    The sum of all numeric values in each group is returned by the function sum.
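
To make these definitions concrete, the sketch below reproduces several of them per species in base R. It is an illustration only and not how plotly implements its aggregate transform: the helper name `summarise_groups` is made up for this example, and `sd()` is the sample standard deviation, which is assumed here to be what stddev means.

```r
# Hypothetical helper (not part of plotly): apply several summary
# functions to a numeric vector `x`, split by the grouping vector `g`.
summarise_groups <- function(x, g) {
  funs <- list(
    avg    = mean,
    count  = length,
    first  = function(v) v[1],
    last   = function(v) v[length(v)],
    max    = max,
    median = median,
    min    = min,
    stddev = sd,   # sample standard deviation (an assumption)
    sum    = sum
  )
  # Result: one row per group, one column per aggregate function.
  sapply(funs, function(f) tapply(x, g, f))
}

petal_summary <- summarise_groups(iris$Petal.Length, iris$Species)
print(round(petal_summary, 2))
```

Each row of the printed matrix is one species and each column is one aggregate, so a whole group collapses to a single number per function.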

Basic Examples

Basic visual scatter plot using an aggregate function: sum

# Include the library
library(plotly)

# Store the graph in a variable to make it easier to manipulate.
p <- plot_ly(
  type = 'scatter',
  y = iris$Petal.Length,  # summed per species by the aggregate transform
  x = iris$Species,
  mode = 'markers',
  marker = list(
    size = 15,
    color = 'green',
    opacity = 0.8
  ),
  transforms = list(
    list(
      type = 'aggregate',
      groups = iris$Species,
      aggregations = list(
        list(target = 'y', func = 'sum', enabled = TRUE)
      )
    )
  )
)

# Display the graph
p

What does this mean?

The function sum, as mentioned above, calculates the sum of each group. Here the groups are the species. This code uses the iris data set, which consists of three species: setosa, versicolor, and virginica. For each species there are 50 observations in the data set. The data set is built into R and can be used directly.

R ships two versions of this data: “iris”, a data frame, and “iris3”, a three-dimensional array. The code in this article uses “iris” and relies on columns such as iris$Species, so it would need changes to run against “iris3”.

What does this code do exactly?

This code uses the function sum to calculate the sum of all Petal.Length values within each group. The calculated sums are then plotted, with Species on the x-axis and the sum on the y-axis.

From this graph, we get the idea that setosa has the smallest petals, because its sum is the smallest, but that is not conclusive evidence. For firmer evidence we can use the function avg.
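
The sums on the plot can be cross-checked without plotly. The following is a minimal base R sketch, assuming the built-in iris data frame; the variable names are chosen for this example.

```r
# Sum of Petal.Length per species, computed directly in base R.
petal_sums <- tapply(iris$Petal.Length, iris$Species, sum)
print(petal_sums)

# Number of observations per species, to show the groups are equal in size.
petal_counts <- tapply(iris$Petal.Length, iris$Species, length)
print(petal_counts)
```

Because every species has the same number of observations here, the ordering of the sums matches the ordering of the averages; with unequal group sizes a sum comparison could mislead, which is why avg is the safer choice.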

The function sum suits almost any data set with numeric values. A good example is a population data set: in world population data, we can group countries by continent and find the total population of the countries in each one.
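
As a sketch of the continent example, the snippet below groups a tiny table by continent and sums the population. The data frame `world`, its column names, and all of its values are made up purely for illustration; they are not real population figures.

```r
# Toy data with made-up names and numbers, for illustration only.
world <- data.frame(
  country    = c("A", "B", "C", "D"),
  continent  = c("X", "X", "Y", "Y"),
  population = c(10, 20, 30, 40)
)

# Total population per continent.
continent_totals <- aggregate(population ~ continent, data = world, FUN = sum)
print(continent_totals)
```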

Most used function: avg

# Include the library
library(plotly)

# Store the graph in a variable to make it easier to manipulate.
q <- plot_ly(
  type = 'bar',
  y = iris$Petal.Length / iris$Petal.Width,
  x = iris$Species,
  color = iris$Species,
  transforms = list(
    list(
      type = 'aggregate',
      groups = iris$Species,
      aggregations = list(
        list(target = 'y', func = 'avg', enabled = TRUE)
      )
    )
  )
)

# Display the graph
q

What does this mean?

The iris data set contains two columns for petals, Petal.Width and Petal.Length. These can be used to calculate the average ratio of Petal.Length to Petal.Width for each species.

What does this code do exactly?

For each observation, the ratio of Petal.Length to Petal.Width is calculated, and then the average of these ratios is plotted for each species. As we can see from this bar plot, setosa has the largest ratio, close to 7, which means a setosa petal is on average about 7 times as long as it is wide. Virginica, on the other hand, has the smallest ratio, with a length of nearly 3 times the width.
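
The same per-species averages can be computed directly, which is a convenient way to read exact values rather than estimating them from the bars. This is a small base R sketch with variable names chosen for the example.

```r
# Ratio of petal length to petal width for every observation.
petal_ratio <- iris$Petal.Length / iris$Petal.Width

# Average ratio per species (the quantity the bar plot shows).
avg_ratio <- tapply(petal_ratio, iris$Species, mean)
print(round(avg_ratio, 2))
```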

This function is very flexible and gives the best results when the quantity being averaged is chosen with care. For example, with a population data set we can calculate the average birth-to-death ratio for each country.

Let's use all the functions in one graph. We will draw a scatter plot with one point per species and add a dropdown button to select the desired function, which makes the work easier and gets results quicker.

Aggregation of all functions: all functions in one graph

# Include the library
library(plotly)

# Read the list of available aggregate functions from the plotly schema.
s <- schema()
agg <- s$transforms$aggregate$attributes$aggregations$items$aggregation$func$values

# Build one dropdown button per aggregate function.
l <- list()
for (i in seq_along(agg)) {
  ll <- list(
    method = "restyle",
    args = list('transforms[0].aggregations[0].func', agg[i]),
    label = agg[i]
  )
  l[[i]] <- ll
}

# Store the graph in a variable to make it easier to manipulate.
p <- plot_ly(
  type = 'scatter',
  x = iris$Species,
  y = iris$Sepal.Length / iris$Sepal.Width,
  mode = 'markers',
  marker = list(
    size = 20,
    color = 'orange',
    opacity = 0.8
  ),
  transforms = list(
    list(
      type = 'aggregate',
      groups = iris$Species,
      aggregations = list(
        list(target = 'y', func = 'avg', enabled = TRUE)
      )
    )
  )
) %>%
  layout(
    title = '<b>Plotly Aggregations by Satyam Chauhan</b><br>use dropdown to change aggregation<br><b>Sepal ratio of Length to Width</b>',
    xaxis = list(title = 'Species'),
    yaxis = list(title = 'Sepal ratio: Length/Width'),
    updatemenus = list(
      list(
        x = 0.2,
        y = 1.2,
        xref = 'paper',
        yref = 'paper',
        yanchor = 'top',
        buttons = l
      )
    )
  )

# Display the graph (the plot is stored in p; s only holds the schema)
p

What does this mean?

We build a list that stores every function value the aggregation accepts. We use this list to experiment with all the aggregate functions in R.
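
The button list can also be built without a loop. The sketch below is one possible alternative using `lapply`; it hard-codes the function names from the list at the top of this article instead of reading them from `schema()`, which is an assumption made so the snippet runs without plotly (the schema may report additional values).

```r
# Function names copied from this article's list (an assumption;
# schema() may report more values than these).
agg_funcs <- c("avg", "count", "first", "last", "max", "median",
               "min", "mode", "rms", "stddev", "sum")

# One button per function: selecting it restyles the first aggregation
# of the first transform to use that function.
buttons <- lapply(agg_funcs, function(f) {
  list(
    method = "restyle",
    args   = list("transforms[0].aggregations[0].func", f),
    label  = f
  )
})

# Inspect the structure of the first button.
str(buttons[[1]])
```

The resulting `buttons` list has the same shape as the list built in the loop and could be passed to `updatemenus` in its place.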

A few of the graphs with different examples are shown below.

What does this code do exactly?

First, as mentioned earlier, a list is created in which all the functions are stored. After the list is made, the y-axis is set to the ratio of Sepal.Length to Sepal.Width and the x-axis is set to Species.

After the ratio is calculated, the aggregate transform is applied with func = ‘avg’ as the initial setting only; the dropdown changes it afterwards. When we run this code and select the function ‘mode’, we get Fig. 3 (above), which shows that the mode for setosa is the lowest of the three, at around 1.4. The mode tells us that the ratio 1.4 occurs most often, so it is the value most likely to be sampled. The highest mode belongs to versicolor, at close to 2.2.
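
For readers who want to compute a mode by hand: base R's `mode()` reports a storage type, not the statistical mode, so a small helper is needed. The sketch below is one way to write it. The helper name `stat_mode`, the tie-breaking rule (first value to reach the highest count), and the rounding to one decimal place are all assumptions of this example, so its results need not match plotly's exactly.

```r
# Hypothetical helper: most frequent value of a vector.
# Ties are resolved in favour of the value that appears first.
stat_mode <- function(v) {
  u <- unique(v)
  u[which.max(tabulate(match(v, u)))]
}

# Sepal length-to-width ratio, rounded so that repeated values can occur
# (rounding is an assumption made for this illustration).
sepal_ratio <- round(iris$Sepal.Length / iris$Sepal.Width, 1)

# Mode of the rounded ratio per species.
mode_by_species <- tapply(sepal_ratio, iris$Species, stat_mode)
print(mode_by_species)
```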

In Fig. 4 above, the change in the ratio of Sepal.Length to Sepal.Width is plotted, and the result is very different from the rest of the graphs. The change for setosa and virginica is almost the same and positive, while the change for versicolor is negative, with a magnitude of nearly three times that of setosa and virginica.
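
The change values can be reproduced in base R under the assumption that change means the last value of a group minus its first value; that reading is consistent with the pattern described above, but it is stated here as an assumption rather than taken from the source.

```r
# Sepal length-to-width ratio for every observation.
sepal_ratio <- iris$Sepal.Length / iris$Sepal.Width

# Assumed definition of "change": last value in the group minus the first.
change_by_species <- tapply(
  sepal_ratio, iris$Species,
  function(v) v[length(v)] - v[1]
)
print(round(change_by_species, 3))
```

Note that this aggregate depends on the row order within each group, so sorting the data differently would give different results.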

The figure on the right, on the other hand, shows the rms value for each species. We can easily see that versicolor and virginica have almost the same value, which is significantly greater than the rms value of setosa.
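
The root mean square is the square root of the mean of the squared values. A minimal base R sketch of the per-species calculation follows; the helper name `rms` is defined here for the example and is not a base R function.

```r
# Root mean square: square the values, average them, take the square root.
rms <- function(v) sqrt(mean(v^2))

# Sepal length-to-width ratio for every observation.
sepal_ratio <- iris$Sepal.Length / iris$Sepal.Width

# rms of the ratio per species.
sepal_rms <- tapply(sepal_ratio, iris$Species, rms)
print(round(sepal_rms, 2))
```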

Conclusion

Aggregation functions are among the most powerful tools a developer can ask for. They can reveal patterns and results that you wouldn't expect. To analyse data visually, you have to play with it, and to do that you need to manipulate and transform it. Aggregation functions do that for you, and they are among the most widely used transforms. This article is just a start. You can certainly explore more and apply more. That's what explorers do.

Translated from: https://www.freecodecamp.org/news/aggregates-in-r-one-of-the-most-powerful-tool-you-can-ask-for-4dd14eafff1f/
