Delta Lake and Data Lakes: Getting Started

Data lakes are being adopted by more and more companies looking for efficient storage of their data assets. The theory behind them is quite simple compared with the industry-standard data warehouse. This post explains the logical foundation of data lakes and presents a practical use case with a tool called Delta Lake. Enjoy!

What is a data lake?

A centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics — from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions.

Amazon Web Services

The rationale behind data lakes is similar to that of the widely used data warehouse. Although both fall into the same category, the logic behind them is quite different. Information stored in a data warehouse is already pre-processed: the reason for storing it has to be known and the data model well defined before anything is written. A data lake takes a different approach: data is stored as-is, so neither the purpose of storing it nor the data model has to be defined up front. The two variants compare as follows:

+-----------+----------------------+-----------------------------+
|           | Data Warehouse       | Data Lake                   |
+-----------+----------------------+-----------------------------+
| Data      | Structured           | Structured and unstructured |
| Schema    | Schema on write      | Schema on read              |
| Storage   | High-cost storage    | Low-cost storage            |
| Users     | Business analysts    | Data scientists             |
| Analytics | BI and visualization | Data science                |
+-----------+----------------------+-----------------------------+
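
The "schema on write" versus "schema on read" row is the core of the difference, and it can be sketched in plain Scala without any Spark dependency. This is only an illustration under stated assumptions: the `Order` record, the `SchemaDemo` object, and the comma-separated input format are hypothetical and do not come from Delta Lake or any other library.

```scala
import scala.util.Try

// Hypothetical record type, used only for illustration.
final case class Order(id: Long, amount: Double)

object SchemaDemo {
  // Parse one raw "id,amount" line into an Order, or None if it does not fit the schema.
  def parse(line: String): Option[Order] =
    line.split(",").map(_.trim) match {
      case Array(id, amount) =>
        for {
          i <- Try(id.toLong).toOption
          a <- Try(amount.toDouble).toOption
        } yield Order(i, a)
      case _ => None
    }

  // Schema on write (warehouse style): validate first, keep only rows that fit the model.
  def writeToWarehouse(lines: Seq[String]): Seq[Order] =
    lines.flatMap(parse)

  // Schema on read (lake style): store every line as-is, without validation.
  def writeToLake(lines: Seq[String]): Seq[String] = lines

  // The schema is applied only when a consumer reads the stored data.
  def readFromLake(stored: Seq[String]): Seq[Order] =
    stored.flatMap(parse)
}
```

In the warehouse-style path a malformed line is rejected at write time and is gone for good. In the lake-style path it is still stored, so a later consumer with a different schema could still make use of it.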

Using Delta Lake OSS to create a data lake

Now let's apply that theoretical knowledge using Delta Lake OSS. Delta Lake is an open source framework based on Apache Spark, used to retrieve, manage, and transform data in a data lake. Getting started is quite simple: you will need an Apache Spark project. First, add Delta Lake as an SBT dependency:

libraryDependencies += "io.delta" %% "delta-core" % "0.5.0"
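
The snippets in the following sections assume a `spark` value is in scope, as it is in `spark-shell`. In a standalone project you would create it yourself. A minimal sketch, assuming a local run; the object name, application name, and `local[*]` master are illustrative choices, not requirements of Delta Lake:

```scala
import org.apache.spark.sql.SparkSession

object DeltaQuickstart {
  // Build a local SparkSession; the app name and "local[*]" master are illustrative.
  def localSession(): SparkSession =
    SparkSession.builder()
      .appName("delta-quickstart")
      .master("local[*]")
      .getOrCreate()

  def main(args: Array[String]): Unit = {
    val spark = localSession()
    try {
      // Sanity check: build a small DataFrame and count its rows
      println(spark.range(0, 50).count())
    } finally {
      // Release local resources when done
      spark.stop()
    }
  }
}
```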

Saving data to Delta

Next, let's create a first table. For this, you will need a Spark DataFrame, which can hold an arbitrary data set or data read from another format, such as JSON or Parquet.

val data = spark.range(0, 50)
data.write.format("delta").save("/data/delta-table")
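
For the "data read from another format" case, the same write call can follow a JSON read. The sketch below assumes both paths are supplied by the caller; `JsonToDelta` and `ingest` are hypothetical names. It uses Spark's append save mode, which, as far as I know, creates the Delta table on the first run and adds rows on later runs.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object JsonToDelta {
  // Read JSON files from `jsonPath` and append them to the Delta table at `deltaPath`.
  // Both paths are placeholders supplied by the caller.
  def ingest(spark: SparkSession, jsonPath: String, deltaPath: String): DataFrame = {
    // The schema is inferred from the JSON files at read time
    val events = spark.read.json(jsonPath)
    events.write.format("delta").mode(SaveMode.Append).save(deltaPath)
    // Return the table contents after the append
    spark.read.format("delta").load(deltaPath)
  }
}
```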

Reading data from Delta

Reading the data is as simple as writing it. Just specify the path and the correct format, the same as you would with CSV or JSON data.

val df = spark.read.format("delta").load("/data/delta-table")
df.show()

Updating the data in Delta

Delta Lake OSS supports a range of update options, thanks to its ACID transaction model. Let's use that to run a batch update that overwrites the existing data, using the following code:

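// Overwrite the table with a new range of values
val newData = spark.range(0, 100)
newData.write.format("delta").mode("overwrite").save("/data/delta-table")
// Re-read the table to see the updated contents
spark.read.format("delta").load("/data/delta-table").show()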
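
Overwriting is only one of the update options. As a hedged sketch of two others, the code below uses the `DeltaTable` Scala API (`io.delta.tables.DeltaTable`) for a conditional update and delete, and the `versionAsOf` reader option to read an earlier version of the table. The object name, method names, and the conditions are illustrative assumptions, and the behaviour should be checked against the Delta Lake version you actually depend on.

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.expr

object DeltaUpdates {
  // Add 1000 to every even id, then delete every row whose id is below 10.
  // The conditions are arbitrary examples.
  def updateAndDelete(spark: SparkSession, path: String): Unit = {
    val table = DeltaTable.forPath(spark, path)
    table.update(
      condition = expr("id % 2 = 0"),
      set = Map("id" -> expr("id + 1000")))
    table.delete(condition = expr("id < 10"))
  }

  // Time travel: read the table as it was at an earlier version number.
  def readVersion(spark: SparkSession, path: String, version: Long): DataFrame =
    spark.read.format("delta").option("versionAsOf", version).load(path)
}
```

Assuming the table was created and then overwritten as in the steps above, `DeltaUpdates.readVersion(spark, "/data/delta-table", 0L)` would return the contents of the initial write, since each committed write produces a new table version starting from 0.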

Summary

I hope you found this post useful. If so, don't hesitate to like or share it. You can also follow me on social media if you fancy :)

Sources: https://docs.delta.io/latest/quick-start.html https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/

Original post: https://medium.com/swlh/delta-lake-and-data-lakes-getting-started-41ce957ed0da
