Delta Lake with Spark: What and Why?

Let me start by introducing two problems that I have dealt with time and again in my experience with Apache Spark:

  1. Data “overwrite” on the same path causing data loss in case of job failure.
  2. Updates in the data.

Sometimes I solved these with design changes, sometimes by introducing another layer such as Aerospike, and sometimes by maintaining historical incremental data.

Maintaining historical data is usually the quickest fix, but I don’t like dealing with historical incremental data unless it is really required because (at least for me) it introduces the pain of backfilling after failures, which may be rare but are inevitable.

The two problems above are “problems” because Apache Spark does not really support ACID. I know transactions were never Spark’s use case (hello, you can’t have everything), but sometimes there is a scenario (like my two problems above) where ACID compliance would have come in handy.

When I read about Delta Lake and its ACID compliance, I saw it as one possible solution to my two problems. Read on to find out how the two problems relate to ACID compliance failures and how Delta Lake can be seen as a savior.

What is Delta Lake?

The Delta Lake documentation introduces Delta Lake as:

Delta Lake is an open source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs.

Delta Lake key points:

  • Supports ACID
  • Enables time travel
  • Enables UPSERT

How does Spark fail ACID?

Consider the following piece of code that removes duplicates from a dataset:

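from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read from HDFS
df = spark.read.parquet("/path/on/hdfs") # Line 1
# Remove duplicates
df = df.distinct() # Line 2
# Mark the DataFrame for caching (lazy; nothing is computed until the write runs)
df.cache() # Line 3
# Overwrite the data at the same path
df.write.parquet("/path/on/hdfs", mode="overwrite") # Line 4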

Suppose my Spark application running the code above fails on Line 4, that is, while writing the data. This may or may not lead to data loss [Problem #1, as mentioned above]. You can replicate the scenario by creating a test dataset and killing the job while it is in the write stage.

Let us try to understand ACID failure in Spark with the above scenario.

A in ACID stands for Atomicity

  • What is Atomicity: Either all changes take place or none do; the system is never in a halfway state.

  • How Spark fails: While writing data (at Line 4 above), if a failure occurs after the old data has been removed but before the new data has been written, data loss occurs. We have lost the old data and could not write the new data because the job failed, so atomicity fails. [This can vary with the file output committer used; do read about the file output committer to see how data writing takes place. The scenario I explained is for v2.]
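
The failure window described above can be reproduced in miniature without Spark. The following is only a toy model built on plain files: the function `overwrite_non_atomic`, its `fail_before_write` flag, and the file names are all hypothetical, and the sketch does not reproduce Spark's actual committer logic. It only shows why "delete old, then write new" cannot be atomic.

```python
import os
import shutil
import tempfile


def overwrite_non_atomic(path, rows, fail_before_write=False):
    """Toy 'overwrite': remove the old output first, then write the new one."""
    shutil.rmtree(path)  # step 1: the old data is removed
    if fail_before_write:
        # Simulates the job being killed between the two steps.
        raise RuntimeError("job killed during the write stage")
    os.makedirs(path)  # step 2: the new data is written
    with open(os.path.join(path, "part-00000.txt"), "w") as f:
        f.write("\n".join(rows))


# Set up a directory holding the "old" dataset (with a duplicate row).
base = tempfile.mkdtemp()
data_path = os.path.join(base, "dataset")
os.makedirs(data_path)
with open(os.path.join(data_path, "part-00000.txt"), "w") as f:
    f.write("a\na\nb")

# Run the overwrite and let it fail halfway.
try:
    overwrite_non_atomic(data_path, ["a", "b"], fail_before_write=True)
except RuntimeError:
    pass

# The old data is gone and the new data never arrived.
old_data_survived = os.path.exists(data_path)
print("old data survived:", old_data_survived)

shutil.rmtree(base, ignore_errors=True)  # clean up the temp directory
```

Neither the old nor the new dataset exists after the failure, which is exactly the halfway state that atomicity is supposed to rule out.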

C in ACID stands for Consistency

  • What is Consistency: Data must be consistent and valid in the system at all times.

  • How Spark fails: As seen above, after a failure with data loss we are left with invalid data in the system, so consistency fails.

I in ACID stands for Isolation

  • What is Isolation: Multiple transactions occur in isolation, without interfering with one another.

  • How Spark fails: Consider two jobs running in parallel, one as described above and another that uses the same dataset. If one job overwrites the dataset while the other is still using it, the second job might fail, so isolation fails.
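
The two-job scenario can also be mimicked with plain files. This is a simplified sketch under my own assumptions: "job B" is modeled as a reader that first lists the files it plans to scan, and "job A" as an overwrite that replaces them in between. The file names are made up, and real Spark jobs interleave in more complicated ways.

```python
import os
import shutil
import tempfile

base = tempfile.mkdtemp()
path = os.path.join(base, "dataset")
os.makedirs(path)
with open(os.path.join(path, "part-00000.txt"), "w") as f:
    f.write("a\nb")

# Job B plans its read: it lists the files it intends to scan.
planned_files = [os.path.join(path, name) for name in os.listdir(path)]

# Meanwhile, job A overwrites the dataset with differently named files.
shutil.rmtree(path)
os.makedirs(path)
with open(os.path.join(path, "part-00001.txt"), "w") as f:
    f.write("a\nb")

# Job B now tries to read the files it planned on, which no longer exist.
try:
    for file_path in planned_files:
        with open(file_path) as f:
            f.read()
    reader_failed = False
except FileNotFoundError:
    reader_failed = True

print("reader failed:", reader_failed)

shutil.rmtree(base, ignore_errors=True)  # clean up the temp directory
```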

D in ACID stands for Durability

  • What is Durability: Changes once made are never lost, even in the case of system failure.

  • How Spark might fail: Spark doesn’t really affect durability, which is mainly governed by the storage layer. But since we lose data when jobs fail, in my opinion it is a durability failure.

How does Delta Lake support ACID?

Delta Lake maintains a delta log in the path where data is written. The delta log maintains details like:

  • Metadata like

    - Paths added in the write operation
    - Paths removed in the write operation
    - Data size
    - Changes in data

  • Data schema
  • Commit information like

    - Number of output rows
    - Output bytes
    - Timestamp

[Image: sample log file in the _delta_log directory created after some operations]
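
To make the log contents more concrete, here is a small sketch that builds an illustrative commit and reads it back. The action names (`commitInfo`, `metaData`, `add`, `remove`) and their fields follow the Delta transaction log format as I understand it, but every value is made up, and `summarize_commit` is a hypothetical helper, not part of any Delta Lake API.

```python
import json

# Illustrative actions for one commit; every value here is made up.
actions = [
    {"commitInfo": {
        "timestamp": 1600000000000,
        "operation": "WRITE",
        "operationParameters": {"mode": "Overwrite"},
        "operationMetrics": {"numFiles": "1",
                             "numOutputBytes": "1024",
                             "numOutputRows": "3"},
    }},
    {"metaData": {
        "id": "example-table-id",
        "format": {"provider": "parquet", "options": {}},
        # The schema is itself stored as a JSON-encoded string.
        "schemaString": json.dumps({
            "type": "struct",
            "fields": [{"name": "id", "type": "long",
                        "nullable": True, "metadata": {}}],
        }),
        "partitionColumns": [],
        "configuration": {},
    }},
    {"add": {"path": "part-00000-new.snappy.parquet", "partitionValues": {},
             "size": 1024, "modificationTime": 1600000000000,
             "dataChange": True}},
    {"remove": {"path": "part-00000-old.snappy.parquet",
                "deletionTimestamp": 1600000000000, "dataChange": True}},
]
# A commit file stores one JSON action per line.
sample_commit = "\n".join(json.dumps(a) for a in actions)


def summarize_commit(text):
    """Collect the details discussed above from one commit file's text."""
    summary = {"added": [], "removed": [], "columns": None,
               "output_rows": None, "output_bytes": None, "timestamp": None}
    for line in text.splitlines():
        if not line.strip():
            continue
        action = json.loads(line)
        if "add" in action:
            summary["added"].append(action["add"]["path"])
        elif "remove" in action:
            summary["removed"].append(action["remove"]["path"])
        elif "metaData" in action:
            schema = json.loads(action["metaData"]["schemaString"])
            summary["columns"] = [field["name"] for field in schema["fields"]]
        elif "commitInfo" in action:
            info = action["commitInfo"]
            metrics = info.get("operationMetrics", {})
            summary["output_rows"] = metrics.get("numOutputRows")
            summary["output_bytes"] = metrics.get("numOutputBytes")
            summary["timestamp"] = info.get("timestamp")
    return summary


summary = summarize_commit(sample_commit)
print(summary)
```

On a real table, the same function could be pointed at the text of one of the JSON files in the log directory, although actual commits may carry more action types and fields than this sample.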

After successful execution, a log file is created in the _delta_log directory. The important thing to note is that when you save your data as delta, files, once written, are not removed. The concept is similar to versioning.

By keeping track of the paths removed and added and other metadata in _delta_log, Delta Lake is ACID-compliant.
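
One way to picture why a log gives atomicity: data files can be written at any time, but they only become part of the table once a log file for the new version exists. The sketch below is a toy model under that assumption, not Delta Lake's implementation. The helpers `commit` and `committed_versions` are hypothetical, and exclusive file creation stands in for whatever atomic primitive the real storage layer provides.

```python
import json
import os
import tempfile


def commit(table_path, version, actions):
    """Publish a version by creating its log file; fail if it already exists."""
    log_dir = os.path.join(table_path, "_delta_log")
    os.makedirs(log_dir, exist_ok=True)
    log_file = os.path.join(log_dir, f"{version:020d}.json")
    # Mode "x" is exclusive creation: a second writer of the same version
    # gets FileExistsError instead of silently overwriting the first.
    with open(log_file, "x") as f:
        f.write("\n".join(json.dumps(a) for a in actions))


def committed_versions(table_path):
    """Return the versions that have a log file, in ascending order."""
    log_dir = os.path.join(table_path, "_delta_log")
    if not os.path.isdir(log_dir):
        return []
    return sorted(int(name.split(".")[0])
                  for name in os.listdir(log_dir) if name.endswith(".json"))


table = tempfile.mkdtemp()

# Version 0: write a data file, then publish it through the log.
open(os.path.join(table, "part-0.parquet"), "w").close()
commit(table, 0, [{"add": {"path": "part-0.parquet"}}])

# A later job writes a new data file but dies before committing.
open(os.path.join(table, "part-1.parquet"), "w").close()
versions_after_failure = committed_versions(table)

# A concurrent writer that tries to publish version 0 again is rejected.
try:
    commit(table, 0, [{"add": {"path": "part-x.parquet"}}])
    conflict_detected = False
except FileExistsError:
    conflict_detected = True

print(versions_after_failure, conflict_detected)
```

The orphaned `part-1.parquet` is never referenced by any log file, so readers that go through the log still see version 0 in full.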

Versioning enables Delta Lake’s time travel property: I can go back to any earlier state of the data because all this information is maintained in _delta_log.
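
Time travel can be pictured as replaying the log up to a chosen version. The following is a minimal sketch under that assumption: the in-memory `log` dictionary and the function `files_at_version` are hypothetical stand-ins for the real log files and reader.

```python
def files_at_version(log, version):
    """Replay add/remove actions up to `version` and return the live files."""
    live = set()
    for v in sorted(log):
        if v > version:
            break
        for action in log[v]:
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live


# Version 0 writes one file; version 1 overwrites it with another.
log = {
    0: [{"add": {"path": "part-a.parquet"}}],
    1: [{"remove": {"path": "part-a.parquet"}},
        {"add": {"path": "part-b.parquet"}}],
}

files_v0 = files_at_version(log, 0)
files_v1 = files_at_version(log, 1)
print(files_v0, files_v1)
```

Because the old file is only marked as removed in the log and not deleted from storage, version 0 can still be reconstructed. With the actual Delta Lake reader in Spark, an earlier version is requested with the `versionAsOf` read option, for example `spark.read.format("delta").option("versionAsOf", 0).load(path)`.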

How does Delta Lake solve my two problems mentioned above?

  • With ACID support, if my job fails during the “overwrite” operation, data is not lost, because the changes won’t be committed to a log file in the _delta_log directory. Also, since Delta Lake does not remove old files in the “overwrite” operation, the old state of my data is maintained and there is no data loss. (Yes, I have tested it.)
  • Delta Lake supports the update operation, as mentioned above, so it makes dealing with updates in data easier.
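
In code, the two fixes might look roughly like this. It is a hedged sketch, not a tested pipeline: it assumes a SparkSession configured with the Delta Lake package, that `df` and `updates_df` are Spark DataFrames, and that `delta_table` was obtained with `DeltaTable.forPath(spark, path)` from `delta.tables`. The helper names and the `id` key column are hypothetical.

```python
def overwrite_as_delta(df, path):
    """Overwrite `path` with `df`, written in the Delta format."""
    df.write.format("delta").mode("overwrite").save(path)


def upsert(delta_table, updates_df, key="id"):
    """Merge `updates_df` into the table: update matching rows, insert new ones."""
    (delta_table.alias("target")
        .merge(updates_df.alias("source"), f"target.{key} = source.{key}")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())


# Usage, assuming a Delta-enabled SparkSession named `spark`:
#   from delta.tables import DeltaTable
#   overwrite_as_delta(df.distinct(), "/path/on/hdfs")
#   upsert(DeltaTable.forPath(spark, "/path/on/hdfs"), updates_df)
```

The first helper is the Delta counterpart of Line 4 in the deduplication job, and the second covers the "updates in the data" problem with a merge instead of a full rewrite.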

Until next time, Ciao.

Original article: https://towardsdatascience.com/delta-lake-with-spark-what-and-why-6d08bef7b963
