Table of Contents
- aws (Study Notes, Lesson 33) A deeper dive into cdk
- What this lesson covers:
- 1. Using `aws athena`
- 1.1 What is `aws athena`
- 1.2 What is `aws glue`
- 1.3 Why `aws athena` and `aws glue` are used together
- 2. Hands-on practice with `aws athena`
- 2.1 Code link
- 2.2 Overall architecture
- 2.3 Code walkthrough
- 2.3.1 Create the `S3 bucket` for the test data
- 2.3.2 Create the `S3 bucket` that stores the query results
- 2.3.3 Sync the sample `json` data files to the `S3 bucket`
- 2.3.4 Create the `aws glue` `CfnDatabase`
- 2.3.5 Create the `Role` required by the `aws glue crawler`
- 2.3.6 Create the `aws glue crawler`
- 2.3.7 Create the `aws athena work group`
- 2.3.8 Create the `aws athena query`
- 2.3.9 Adjust the creation order
- 2.4 Run `aws cdk for athena`
- 2.4.1 Run the deployment
- 2.4.2 Run the `crawler`
- 2.4.3 View the `aws athena` `queries`
- 2.4.4 Execute the `aws athena` `queries`
- 2.4.5 View the execution results of the `aws athena` `queries`
aws (Study Notes, Lesson 33) A deeper dive into cdk
- Use `cdk` to generate `athena` and an `aws glue crawler`

What this lesson covers:
- Using `aws athena` together with an `aws glue crawler`
1. Using aws athena
1.1 What is aws athena
`aws athena` is a data analysis service provided by aws. It lets you analyze data stored on `S3` using the `SQL` language.
- It is a managed service, so there is nothing to maintain.
- It is built on an open-source framework (Presto/Trino).
- Billing is based on the amount of data each query scans.
- It provides encryption for the data.

Note: `aws athena` cannot perform `JOIN` operations against an `RDB` directly; its queries run over data files stored on `S3`, such as `csv` and `json`.
1.2 What is aws glue
`aws glue` is a managed ETL service provided by aws. It makes it easy to prepare and load data for analysis. The metadata that ties a `table` to its `schema` can be stored in the `aws glue data catalog`.
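To make "metadata stored in the catalog" concrete, here is a minimal `boto3` sketch (not part of this lesson's CDK project) that lists the tables of a Glue database and the columns the crawler inferred; the database name `log-database` anticipates the one created later in the exercise:

```python
import boto3

# minimal sketch: inspect the metadata a crawler stored in the Data Catalog
glue_client = boto3.client('glue')
response = glue_client.get_tables(DatabaseName='log-database')
for table in response['TableList']:
    # each table entry records the schema inferred from the underlying files
    columns = [col['Name'] for col in table['StorageDescriptor']['Columns']]
    print(table['Name'], columns)
```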
1.3 Why aws athena and aws glue are used together
By combining `aws athena` with `aws glue`, the `database` and `schema` created by `aws glue` can be queried with `aws athena`.
2. Hands-on practice with aws athena
2.1 Code link
Code link: aws-cdk-examples
2.2 Overall architecture
2.3 Code walkthrough
2.3.1 Create the S3 bucket for the test data
```python
# creating the buckets where the logs will be placed
logs_bucket = s3.Bucket(self, 'logs-bucket',
                        bucket_name=f"auditing-logs-{self.account}",
                        removal_policy=RemovalPolicy.DESTROY,
                        auto_delete_objects=True)
```
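The snippets in this walkthrough are fragments of a single CDK stack. They assume `aws-cdk-lib` (v2) imports along these lines at the top of the stack file (the exact import list is my assumption, not shown in the excerpt):

```python
from aws_cdk import (
    Stack,           # base class of the stack these snippets live in
    RemovalPolicy,   # used by the bucket definitions
    aws_s3 as s3,
    aws_s3_deployment as s3_deployment,
    aws_glue as glue,
    aws_iam as iam,
    aws_athena as athena,
)
```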
2.3.2 Create the S3 bucket that stores the query results
```python
# creating the bucket where the queries output will be placed
query_output_bucket = s3.Bucket(self, 'query-output-bucket',
                                bucket_name=f"auditing-analysis-output-{self.account}",
                                removal_policy=RemovalPolicy.DESTROY,
                                auto_delete_objects=True)
```
2.3.3 Sync the sample json data files to the S3 bucket
```python
# uploading the log files to the bucket as examples
s3_deployment.BucketDeployment(self, 'sample-files',
                               destination_bucket=logs_bucket,
                               sources=[s3_deployment.Source.asset('./log-samples')],
                               content_type='application/json',
                               retain_on_delete=False)
```
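Judging from the crawler targets defined in 2.3.6, the local `./log-samples` folder presumably contains `products/` and `users/` subfolders of `json` files; `BucketDeployment` copies that directory tree into the logs bucket at deploy time.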
2.3.4 Create the aws glue CfnDatabase
```python
# creating the Glue Database to serve as our Data Catalog
glue_database = glue.CfnDatabase(self, 'log-database',
                                 catalog_id=self.account,
                                 database_input=glue.CfnDatabase.DatabaseInputProperty(
                                     name="log-database"))
```
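`CfnDatabase` is the low-level (L1) CloudFormation construct for a Glue database; `catalog_id` identifies the Data Catalog by AWS account ID, which is why `self.account` is passed in.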
2.3.5 Create the Role required by the aws glue crawler
```python
# creating the permissions for the crawler to enrich our Data Catalog
glue_crawler_role = iam.Role(self, 'glue-crawler-role',
                             role_name='glue-crawler-role',
                             assumed_by=iam.ServicePrincipal(service='glue.amazonaws.com'),
                             managed_policies=[
                                 # Remember to apply the Least Privilege Principle and provide only the permissions needed to the crawler
                                 iam.ManagedPolicy.from_managed_policy_arn(self, 'AmazonS3FullAccess',
                                                                           'arn:aws:iam::aws:policy/AmazonS3FullAccess'),
                                 iam.ManagedPolicy.from_managed_policy_arn(self, 'AWSGlueServiceRole',
                                                                           'arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole')
                             ])
```
Two managed policies are needed here: `AmazonS3FullAccess` and `AWSGlueServiceRole`.
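The inline comment about least privilege is worth acting on: instead of `AmazonS3FullAccess`, the role could be granted read access to just the logs bucket. A possible variation (my sketch, not the example project's code):

```python
# least-privilege variation (sketch): keep AWSGlueServiceRole,
# but drop AmazonS3FullAccess in favor of read access to one bucket
glue_crawler_role = iam.Role(self, 'glue-crawler-role',
                             role_name='glue-crawler-role',
                             assumed_by=iam.ServicePrincipal(service='glue.amazonaws.com'),
                             managed_policies=[
                                 iam.ManagedPolicy.from_managed_policy_arn(self, 'AWSGlueServiceRole',
                                                                           'arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole')
                             ])
# grants read (GetObject / List) permissions scoped to this bucket only
logs_bucket.grant_read(glue_crawler_role)
```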
2.3.6 Create the aws glue crawler
```python
# creating the Glue Crawler that will automatically populate our Data Catalog. Don't forget to run the crawler
# as soon as the deployment finishes, otherwise our Data Catalog will be empty. Check out the README for more instructions
glue.CfnCrawler(self, 'logs-crawler',
                name='logs-crawler',
                database_name=glue_database.database_input.name,
                role=glue_crawler_role.role_name,
                targets={"s3Targets": [
                    {"path": f's3://{logs_bucket.bucket_name}/products'},
                    {"path": f's3://{logs_bucket.bucket_name}/users'}
                ]})
```
Here the `aws glue crawler` scans the `products` and `users` data files in the `S3 bucket`, infers the schema of the `json` files, and registers the corresponding tables in the `glue database`. Note that the crawler populates the Data Catalog with metadata only; the data itself stays on `S3`, where `aws athena` later queries it in place.
2.3.7 Create the aws athena work group
```python
# creating the Athena Workgroup to store our queries
work_group = athena.CfnWorkGroup(self, 'log-auditing-work-group',
                                 name='log-auditing',
                                 work_group_configuration=athena.CfnWorkGroup.WorkGroupConfigurationProperty(
                                     result_configuration=athena.CfnWorkGroup.ResultConfigurationProperty(
                                         output_location=f"s3://{query_output_bucket.bucket_name}",
                                         encryption_configuration=athena.CfnWorkGroup.EncryptionConfigurationProperty(
                                             encryption_option="SSE_S3"))))
```
`aws athena` is managed through `work groups`: after creating the `workgroup`, the queries are created inside it. This workgroup writes query results to the output bucket from 2.3.2, encrypted with `SSE_S3`.
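Once the workgroup exists, queries can also be started against it programmatically. A minimal `boto3` sketch (not part of the CDK stack; the SQL mirrors the named queries defined in the next step):

```python
import boto3

# minimal sketch: run a query inside the 'log-auditing' workgroup;
# the workgroup already defines the S3 output location and encryption
athena_client = boto3.client('athena')
execution = athena_client.start_query_execution(
    QueryString='SELECT * FROM "log-database"."products" LIMIT 10',
    WorkGroup='log-auditing',
)
print(execution['QueryExecutionId'])  # poll get_query_execution() with this ID
```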
2.3.8 Create the aws athena query
```python
# creating an example query to fetch all product events by date
product_events_by_date_query = athena.CfnNamedQuery(self, 'product-events-by-date-query',
                                                    database=glue_database.database_input.name,
                                                    work_group=work_group.name,
                                                    name="product-events-by-date",
                                                    query_string="SELECT * FROM \"log-database\".\"products\" WHERE \"date\" = '2024-01-19'")

# creating an example query to fetch all user events by date
user_events_by_date_query = athena.CfnNamedQuery(self, 'user-events-by-date-query',
                                                 database=glue_database.database_input.name,
                                                 work_group=work_group.name,
                                                 name="user-events-by-date",
                                                 query_string="SELECT * FROM \"log-database\".\"users\" WHERE \"date\" = '2024-01-22'")

# creating an example query to fetch all events by the user ID
all_events_by_userid_query = athena.CfnNamedQuery(self, 'all-events-by-userId-query',
                                                  database=glue_database.database_input.name,
                                                  work_group=work_group.name,
                                                  name="all-events-by-userId",
                                                  query_string="SELECT * FROM (\n"
                                                               "  SELECT transactionid, userid, username, domain, datetime, action FROM \"log-database\".\"products\" \n"
                                                               "UNION \n"
                                                               "  SELECT transactionid, userid, username, domain, datetime, action FROM \"log-database\".\"users\" \n"
                                                               ") WHERE \"userid\" = '123'")
```
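A side note on the last query: `UNION` deduplicates rows across the two tables; `UNION ALL` would scan the same data but skip the deduplication step, which is the cheaper choice when duplicates are impossible or acceptable.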
2.3.9 Adjust the creation order
```python
# adjusting the resource creation order
product_events_by_date_query.add_dependency(work_group)
user_events_by_date_query.add_dependency(work_group)
all_events_by_userid_query.add_dependency(work_group)
```
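Without these explicit dependencies, CloudFormation would not know that the named queries must be created after the workgroup: the queries reference the workgroup only by its name string (`work_group.name`), which does not create an implicit dependency the way a `Ref` or attribute reference would.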
2.4 Run aws cdk for athena
2.4.1 Run the deployment
```bash
python -m venv .venv
source .venv/Scripts/activate   # Windows (Git Bash); on Linux/macOS: source .venv/bin/activate
pip install -r requirements.txt
cdk synth
cdk --require-approval never deploy
```
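If the target account/region has never been used with CDK before, a one-time `cdk bootstrap` is needed before `cdk deploy`.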
2.4.2 Run the crawler
By default the `crawler` does not run automatically after deployment; it has to be started by hand, for example from the AWS console (or from code, as sketched below).
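Besides the console, the crawler can be started programmatically; a minimal `boto3` sketch (assuming default credentials and region):

```python
import boto3

# minimal sketch: start the deployed crawler and check its state
glue_client = boto3.client('glue')
glue_client.start_crawler(Name='logs-crawler')

state = glue_client.get_crawler(Name='logs-crawler')['Crawler']['State']
print(state)  # RUNNING while crawling, READY once it has finished
```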
Once the crawler finishes successfully, the `json` files in the `S3 bucket` have been crawled and the resulting tables are registered in the `aws glue database`.
2.4.3 View the aws athena queries
AWS Athena > Query editor > Saved queries > Workgroup > log auditing