Delta Lake and Data Lakes: Getting Started
Data lakes are being adopted by more and more companies looking for efficient ways to store their data assets. The theory behind them is quite simple, especially compared with the industry-standard data warehouse. This post explains the logical foundation of data lakes and walks through a practical use case with a tool called Delta Lake. Enjoy!
What is a data lake?
A centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics — from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions.
Amazon Web Services
The rationale behind a data lake is similar to that of the widely used data warehouse. Although both fall into the same category, the logic behind them is quite different. Data in a warehouse is pre-processed before it is stored, so the reason for storing it has to be known and the data model well defined up front. A data lake takes a different approach: neither the purpose of the data nor its data model has to be defined when the data is written. The two can be compared as follows:
+-----------+----------------------+-----------------------------+
|           | Data Warehouse       | Data Lake                   |
+-----------+----------------------+-----------------------------+
| Data      | Structured           | Structured and unstructured |
| Schema    | Schema on write      | Schema on read              |
| Storage   | High-cost storage    | Low-cost storage            |
| Users     | Business analysts    | Data scientists             |
| Analytics | BI and visualization | Data science                |
+-----------+----------------------+-----------------------------+
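The "schema on write" versus "schema on read" row can be illustrated without any Spark dependency. The following is only a minimal sketch in plain Scala: the `Sale` model, the `parseSale` helper, and the sample lines are hypothetical names invented for illustration, not part of Delta Lake or of the original post.

```scala
import scala.util.Try

object SchemaOnReadSketch {
  // Hypothetical data model, used only for this illustration.
  final case class Sale(id: Long, amount: Double)

  // Hypothetical parser for "id,amount" lines; returns None when a line does not fit the model.
  def parseSale(line: String): Option[Sale] =
    line.split(",").map(_.trim) match {
      case Array(id, amount) => Try(Sale(id.toLong, amount.toDouble)).toOption
      case _                 => None
    }

  val incoming: Seq[String] = Seq("1,9.99", "2,not-a-number", "3,4.50,extra-field")

  // Schema on write (warehouse style): validate on the way in, so malformed lines are never stored.
  val warehouse: Seq[Sale] = incoming.flatMap(parseSale)

  // Schema on read (lake style): store every raw line as-is and apply the model only when querying.
  val lake: Seq[String] = incoming
  val salesAtReadTime: Seq[Sale] = lake.flatMap(parseSale)

  def main(args: Array[String]): Unit = {
    println(s"warehouse rows: ${warehouse.size}, lake rows: ${lake.size}")
    println(s"rows matching the model at read time: ${salesAtReadTime.size}")
  }
}
```

In this sketch the lake keeps all three raw lines, so a different model could be applied to them later, while the warehouse has already discarded whatever did not fit `Sale`.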
Creating a data lake with Delta Lake OSS
Now let's apply that theory using Delta Lake OSS. Delta Lake is an open-source framework built on Apache Spark for retrieving, managing, and transforming data in a data lake. Getting started is quite simple: you will need an Apache Spark project. First, add Delta Lake as an SBT dependency:
libraryDependencies += "io.delta" %% "delta-core" % "0.5.0"
Saving data to Delta
Next, let's create our first table. For this you will need a Spark DataFrame, which can be an arbitrary data set or data read from another format such as JSON or Parquet.
val data = spark.range(0, 50)
data.write.format("delta").save("/data/delta-table")
Reading data from Delta
Reading the data is as simple as writing it. Just specify the path and the correct format, as you would with CSV or JSON data.
val df = spark.read.format("delta").load("/data/delta-table")
df.show()
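The snippets above assume a `spark` session already exists, as it does in `spark-shell`. For a standalone project, the write and read steps might be put together roughly as follows. This is a sketch under assumptions: the object name `DeltaRoundTrip`, the `roundTrip` helper, the local master setting, and the temporary directory are all illustrative choices, and it targets the `delta-core` 0.5.0 dependency shown earlier. Newer Delta Lake releases may need extra session configuration, so check the documentation for your version.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object DeltaRoundTrip {
  // Writes the ids 0 until n to a new Delta table at `path`,
  // reads the table back, and returns the number of rows found.
  def roundTrip(spark: SparkSession, path: String, n: Long): Long = {
    spark.range(0, n).write.format("delta").save(path)
    spark.read.format("delta").load(path).count()
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumption: no cluster is available).
    val spark = SparkSession.builder()
      .appName("delta-round-trip")
      .master("local[*]")
      .getOrCreate()

    // Fresh location under a temporary directory, so no existing table is in the way.
    val path = Files.createTempDirectory("delta-demo").resolve("delta-table").toString

    try println(s"rows read back: ${roundTrip(spark, path, 50)}")
    finally spark.stop()
  }
}
```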
Updating the data in Delta
Thanks to its ACID transaction model, Delta Lake OSS supports a range of update options. Let's use that to run a batch update that overwrites the existing data:
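// Replace the table contents with a new range of values
val newData = spark.range(0, 100)
newData.write.format("delta").mode("overwrite").save("/data/delta-table")
// df is the DataFrame loaded from this path in the previous step
df.show()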
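Overwriting is only one of the update options. As a hedged sketch of two others described in the Delta Lake quick start, the helpers below run a conditional in-place update and read an earlier version of the table. The object and function names (`DeltaUpdateSketch`, `bumpEvenIds`, `readVersion`) are made up for illustration, and the code assumes the `delta-core` 0.5.0 dependency from above with a table that has an `id` column, as produced by `spark.range`.

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.expr

object DeltaUpdateSketch {
  // Conditional update: add 100 to every even id, leaving odd ids untouched.
  def bumpEvenIds(spark: SparkSession, path: String): Unit =
    DeltaTable.forPath(spark, path).update(
      condition = expr("id % 2 == 0"),
      set = Map("id" -> expr("id + 100")))

  // Time travel: read the table as it was at the given version number.
  def readVersion(spark: SparkSession, path: String, version: Long): DataFrame =
    spark.read.format("delta").option("versionAsOf", version).load(path)
}
```

After the overwrite above, `DeltaUpdateSketch.readVersion(spark, "/data/delta-table", 0)` should return the table as it was first written, since the initial save is version 0 and the overwrite creates version 1.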
Summary
I hope you found this post useful. If so, don't hesitate to like or share it. You can also follow me on social media if you fancy :)
Sources: https://docs.delta.io/latest/quick-start.html https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/
Translated from: https://medium.com/swlh/delta-lake-and-data-lakes-getting-started-41ce957ed0da