DataOps: Fully Automated, Low-Cost Data Pipelines Using AWS Lambda and Amazon EMR

Progression is continuous. Looking back over my 25-year career in information technology, I have experienced several phases of progression and adaptation.

I started as a newly hired recruit who carefully watched every single SQL command run to completion, and grew into a confident DBA who scripted hundreds of SQL statements and ran them together as batch jobs using the cron scheduler. In the modern era I adapted to DAG tools like Oozie and Airflow, which not only provide job scheduling but can run a series of jobs as automated data pipelines.

Lately, the adoption of the cloud has changed the whole meaning of automation.

STORAGE is cheap, COMPUTE is expensive

In the cloud era, we can design automation methods that were previously unheard of. I admit that cloud storage resources are getting cheaper by the day, but compute resources (high CPU and memory) are still relatively expensive. Keeping that in mind, wouldn't it be super cool if DataOps could help us save on compute costs? Let's find out how this can be done.

Typically, we run data pipelines as follows:

Data is collected at regular time intervals (daily, hourly, or by the minute) and saved to storage such as S3. This is usually followed by data processing jobs on permanently running distributed computing clusters such as EMR.

Pros: Processing jobs run on a schedule. The permanent cluster can be used for other purposes such as Hive queries, streaming workloads, etc.

Cons: There can be a delay between when data arrives and when it gets processed. Compute resources may not be optimally utilized; there may be underutilization at times, wasting expensive $$$.

(Image by Author)

Here is an alternative that can help achieve the right balance between operations and costs. This method may not apply to all use cases, but where it does, rest assured it will save you a lot of $$$.

(Image by Author)

In this method the storage layer stays pretty much the same, except that an event notification is added to the storage bucket, which invokes a Lambda function when new data arrives. The Lambda function in turn creates a transient EMR cluster for data processing. A transient EMR cluster is a special type of cluster that deploys itself, runs the data processing job, and then self-destructs.

Pros: Processing jobs can start as soon as the data is available. No waits. Compute resources are optimally utilized: you only pay for what you use and save $$$.

Cons: The cluster cannot be used for other purposes such as Hive queries, streaming workloads, etc.

Here is how the entire process is handled technically:

Assume the data files are delivered to s3://<BUCKET>/raw/files/renewable/hydropower-consumption/<DATE>_<HOUR>

Clone my git repo:

$ git clone https://github.com/mkukreja1/blogs.git

Create a new S3 bucket for running the demo. Remember to change the bucket name, since S3 bucket names are globally unique.

$ S3_BUCKET=lambda-emr-pipeline  #Edit as per your bucket name
$ REGION='us-east-1' #Edit as per your AWS region
$ JOB_DATE='2020-08-07_2PM' #Do not Edit this
$ aws s3 mb s3://$S3_BUCKET
$ aws s3 cp blogs/lambda-emr/emr.sh s3://$S3_BUCKET/bootstrap/
$ aws s3 cp blogs/lambda-emr/hydropower-processing.py s3://$S3_BUCKET/spark/

Create a role for the Lambda Function

$ aws iam create-role --role-name trigger-pipeline-role --assume-role-policy-document file://blogs/lambda-emr/lambda-policy.json
$ aws iam attach-role-policy --role-name trigger-pipeline-role --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
$ aws iam attach-role-policy --role-name trigger-pipeline-role --policy-arn arn:aws:iam::aws:policy/AmazonElasticMapReduceFullAccess
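For context, the trust policy in lambda-policy.json is most likely the standard document that allows the Lambda service to assume the role. A minimal sketch (the repo's actual file may differ), built here as a Python dict:

```python
import json

def build_lambda_trust_policy() -> dict:
    """Standard trust policy allowing the Lambda service to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "lambda.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }

if __name__ == "__main__":
    # This JSON is what `aws iam create-role` expects via --assume-role-policy-document
    print(json.dumps(build_lambda_trust_policy(), indent=2))
```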

Create the Lambda Function in AWS

$ ROLE_ARN=`aws iam get-role --role-name trigger-pipeline-role | grep Arn | sed 's/"Arn"://' | sed 's/,//' | sed 's/"//g'`; echo $ROLE_ARN
$ cat blogs/lambda-emr/trigger-pipeline.py | sed "s/YOUR_BUCKET/$S3_BUCKET/g" | sed "s/YOUR_REGION/'$REGION'/g" > lambda_function.py
$ zip trigger-pipeline.zip lambda_function.py
$ aws lambda delete-function --function-name trigger-pipeline
$ LAMBDA_ARN=`aws lambda create-function --function-name trigger-pipeline --runtime python3.6 --role $ROLE_ARN --handler lambda_function.lambda_handler --timeout 60 --zip-file fileb://trigger-pipeline.zip | grep FunctionArn | sed -e 's/"//g' -e 's/,//g' -e 's/FunctionArn//g' -e 's/: //g'`; echo $LAMBDA_ARN
$ aws lambda add-permission --function-name trigger-pipeline --statement-id 1 --action lambda:InvokeFunction --principal s3.amazonaws.com
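For orientation, here is a minimal sketch of what a handler like trigger-pipeline.py might contain: an S3-triggered function that launches a transient EMR cluster via `run_job_flow` with `KeepJobFlowAliveWhenNoSteps=False` so the cluster terminates itself after the Spark step finishes. The instance types, EMR release label, and step arguments below are illustrative assumptions, not the repo's actual code:

```python
import json

S3_BUCKET = "lambda-emr-pipeline"   # assumption: matches $S3_BUCKET above
REGION = "us-east-1"

def build_job_flow(bucket: str, job_date: str) -> dict:
    """Build the RunJobFlow request for a transient (auto-terminating) EMR cluster."""
    return {
        "Name": "transient-hydropower-cluster",
        "ReleaseLabel": "emr-5.30.1",   # illustrative EMR release
        "Instances": {
            "InstanceGroups": [
                {"Name": "Master", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            # The key to a transient cluster: terminate once all steps complete
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "BootstrapActions": [{
            "Name": "install-deps",
            "ScriptBootstrapAction": {"Path": f"s3://{bucket}/bootstrap/emr.sh"},
        }],
        "Steps": [{
            "Name": "spark-data-processing",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit",
                         f"s3://{bucket}/spark/hydropower-processing.py",
                         job_date],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

def lambda_handler(event, context):
    # Triggered by the S3 event notification; spins up the transient EMR cluster.
    import boto3  # imported here so the pure builder above stays testable offline
    emr = boto3.client("emr", region_name=REGION)
    response = emr.run_job_flow(**build_job_flow(S3_BUCKET, "2020-08-07_2PM"))
    return {"statusCode": 200, "body": json.dumps(response["JobFlowId"])}
```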

Finally, let's create the S3 event notification. This notification will invoke the Lambda function above.

$ cat blogs/lambda-emr/notification.json | sed "s/YOUR_LAMBDA_ARN/$LAMBDA_ARN/g" | sed "s/\    arn/arn/" > notification.json
$ aws s3api put-bucket-notification-configuration --bucket $S3_BUCKET --notification-configuration file://notification.json
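The generated notification.json tells S3 to invoke the Lambda function whenever new objects are created. A sketch of the shape it likely takes, expressed as a Python dict; the `data/` prefix filter is an assumption based on the upload path used below, and the repo's actual file may differ:

```python
def build_notification_config(lambda_arn: str) -> dict:
    """S3 bucket notification: invoke the Lambda on every new object under data/."""
    return {
        "LambdaFunctionConfigurations": [{
            "LambdaFunctionArn": lambda_arn,
            # Fire on any object-created event (Put, Post, Copy, multipart)
            "Events": ["s3:ObjectCreated:*"],
            "Filter": {
                "Key": {"FilterRules": [{"Name": "prefix", "Value": "data/"}]}
            },
        }]
    }
```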
(Image by Author)

Let's kick off the process by copying data to S3:

$ aws s3 rm s3://$S3_BUCKET/curated/ --recursive
$ aws s3 rm s3://$S3_BUCKET/data/ --recursive
$ aws s3 sync blogs/lambda-emr/data/ s3://$S3_BUCKET/data/

If everything ran OK, you should be able to see a running cluster in EMR with Status=Starting.

(Image by Author)

After some time the EMR cluster should change to Status=Terminated.

(Image by Author)

To check whether the Spark program was successful, check the S3 folder as shown below:

$ aws s3 ls s3://$S3_BUCKET/curated/2020-08-07_2PM/
2020-08-10 17:10:36 0 _SUCCESS
2020-08-10 17:10:35 18206 part-00000-12921d5b-ea28-4e7f-afad-477aca948beb-c000.snappy.parquet

As we move into the next phase of Data Engineering and Data Science, automation of data pipelines is becoming a critical operation. Done correctly, it has the potential to streamline not only operations but resource costs as well.

Hope you gained some valuable insights from the article above. Feel free to contact me if you need further clarifications and advice.

Translated from: https://towardsdatascience.com/dataops-fully-automated-low-cost-data-pipelines-using-aws-lambda-and-amazon-emr-c4d94fdbea97
