A Paved Road for Data Pipelines
Data is a key bet for Intuit as we invest heavily in new customer experiences: a platform to connect experts anywhere in the world with customers and small business owners, a platform that connects to thousands of institutions and aggregates financial information to simplify user workflows, customer care interactions made effective through the use of data and AI, and more. Data pipelines, which capture data from the source systems, transform it, and make it available to the machine learning (ML) and analytics platforms, are critical for enabling these experiences.
With the move to cloud data lakes, data engineers now have a multitude of processing runtimes and tools available to build these data pipelines. The wealth of choices has led to silos of computation, inconsistent implementation of the pipelines and an overall reduction in how effectively data insights can be extracted. In this blog article we will describe a “paved road” for creating, managing and monitoring data pipelines, to eliminate the silos and increase the effectiveness of processing in the data lake.
Processing in the Data Lake
Data is ingested into the lake from a variety of internal and external sources, cleansed, augmented, transformed and made available to ML and analytics platforms for insight. We have different types of pipelines: to ingest data into the data lake, curate the data, transform it, and load it into data marts.
Ingestion Pipelines
A key tenet of data transformation is to ensure that all data is ingested into the data lake and made available in a format that is easily discoverable. We standardized on Parquet as the file format for all ingestion into the data lake, with support for materialization (mutable data sets). The bulk of our datasets are materialized through Intuit’s own materialization engine, though Delta Lake is rapidly gaining momentum as a materialization format of choice. A data catalog built using Apache Atlas is used for searching and discovering the datasets, while Apache Superset is used for exploring the data sets.
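For illustration, the sketch below shows what landing data in this layout can look like with off-the-shelf PySpark and Delta Lake: an immutable Parquet append followed by a Delta merge for the mutable (materialized) copy. The bucket paths, table layout and the "id" merge key are hypothetical placeholders, and Intuit's own materialization engine is not shown.

```python
# Minimal PySpark sketch: land a raw feed as Parquet, then materialize a
# mutable copy of the same entity with a Delta Lake merge (upsert).
# Paths, table names, and the "id" merge key are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F
from delta.tables import DeltaTable

spark = (
    SparkSession.builder
    .appName("ingest-sketch")
    # Delta Lake needs its SQL extension and catalog registered on the session.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# 1) Append the immutable raw feed as Parquet, partitioned by ingest date.
raw = (spark.read.json("s3://example-bucket/raw/events/2024-06-01/")
       .withColumn("ingest_date", F.lit("2024-06-01")))
raw.write.mode("append").partitionBy("ingest_date").parquet(
    "s3://example-bucket/lake/events_raw/")

# 2) Materialize a mutable view of the entity: merge the latest records into
#    an existing Delta table keyed by "id" (assumes the table already exists).
updates = raw.drop("ingest_date").dropDuplicates(["id"])
target = DeltaTable.forPath(spark, "s3://example-bucket/lake/events_materialized/")
(target.alias("t")
    .merge(updates.alias("u"), "t.id = u.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```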
ETL Pipelines & Data Streams
Before data in the lake is consumed by the ML and analytics platform, it needs to be transformed (cleansed, augmented from additional sources, aggregated, etc.). The bulk of these transformations are done periodically on a schedule: once a day, once every few hours, etc., although as we begin to embrace the concepts of real-time processing, there has been an uptick in converting the batch-oriented pipelines to streaming.
Batch processing in the lake is done primarily through Hive and Spark SQL jobs. More complex transformations that cannot be represented in SQL are done using Spark Core. The main engine today for batch processing is AWS EMR. Scheduling of the batch jobs is done through an enterprise scheduler with more than 20K jobs scheduled daily.
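To make the batch layer concrete, the following is a minimal sketch of the kind of Spark SQL step such a pipeline runs; the table names and the aggregation are hypothetical examples, and the enterprise scheduler that would trigger a step like this daily is not shown.

```python
# Minimal sketch of a scheduled Spark SQL batch step: read curated tables,
# apply a SQL transformation, and write the result to a mart table.
# Table names and the aggregation are hypothetical, not actual Intuit jobs.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("daily-orders-rollup")
         .enableHiveSupport()          # access the lake's Hive-registered tables
         .getOrCreate())

daily_rollup = spark.sql("""
    SELECT o.order_date,
           c.segment,
           COUNT(*)      AS orders,
           SUM(o.amount) AS revenue
    FROM   curated.orders o
    JOIN   curated.customers c ON o.customer_id = c.customer_id
    WHERE  o.order_date = date_sub(current_date(), 1)   -- yesterday's data
    GROUP  BY o.order_date, c.segment
""")

# A real job would typically overwrite only the affected partition; this
# sketch simply writes the result as a partitioned mart table.
(daily_rollup.write
    .mode("overwrite")
    .partitionBy("order_date")
    .saveAsTable("marts.daily_orders_by_segment"))
```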
Stream pipelines process messages read from a Kafka-based event bus, using Apache Beam to analyze the data streams and Apache Flink as the engine for stateful computation on them.
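A minimal Apache Beam sketch of such a stream pipeline is shown below; the broker address and topic names are hypothetical, and running it on Flink would use Beam's FlinkRunner (the Kafka transforms are cross-language and need a Java expansion service at runtime).

```python
# Minimal Apache Beam sketch of a stream pipeline: read events from a Kafka
# topic, transform them, and write results back to the event bus.
# Broker address and topic names are hypothetical placeholders.
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io.kafka import ReadFromKafka, WriteToKafka

options = PipelineOptions(streaming=True, runner="FlinkRunner")

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadEvents" >> ReadFromKafka(
            consumer_config={"bootstrap.servers": "broker:9092"},
            topics=["payments.raw"])
        # Kafka records arrive as (key, value) byte pairs; parse the value.
        | "Parse" >> beam.Map(lambda kv: json.loads(kv[1].decode("utf-8")))
        | "KeepCompleted" >> beam.Filter(lambda e: e.get("status") == "COMPLETED")
        # Re-serialize as (key, value) bytes before writing back to Kafka.
        | "Reserialize" >> beam.Map(lambda e: (b"", json.dumps(e).encode("utf-8")))
        | "WriteCompleted" >> WriteToKafka(
            producer_config={"bootstrap.servers": "broker:9092"},
            topic="payments.completed")
    )
```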
Why do we need a “Paved Road”?
With the advent of cloud data lakes and open source, data engineers have a wealth of choices for implementing the pipelines. For batch computation, users can pick from AWS EMR, AWS Glue, AWS Batch and AWS Lambda on AWS, or Apache Airflow data pipelines, Apache Spark, etc. from open-source/enterprise vendors. Data streams can be implemented on AWS Kinesis streams, Apache Beam, Spark Streaming, Apache Flink, etc.
Though choice is a good thing to aspire to, if not applied carefully it can lead to fragmentation and islands of computing. As users adopt different infrastructure and tools for their pipelines, it can inadvertently lead to silos and inconsistencies in the capabilities across pipelines.
Lineage: Different infrastructure and tools provide different levels of lineage (in some cases none at all) and do not integrate with each other. For example, pipelines built using EMR do not share lineage with pipelines built using other frameworks.
Pipeline Management: Creation and management of pipelines can be inconsistent across different pipeline infrastructures.
Monitoring & Alerting: Monitoring and Alerting is not standardized across different pipeline infrastructures.
A Paved Road for Data Pipelines
A paved road for data pipelines provides a consistent set of infrastructure components and tools for implementing them:
- A standard way to create and manage pipelines.
- A standard way to promote the pipelines from development/QA to production environments.
- A standard way to monitor, debug, analyze failures and remediate errors in the pipelines.
- Pipeline tools such as lineage, data anomaly detection and data parity checks that work consistently across all the pipelines.
- A small set of execution environments that host the pipelines and provide a consistent experience to the users of the pipelines.
The Paved Road begins with Intuit's Development Portal, where data engineers manage their pipelines.
Intuit Development Portal
Our development portal is an entry point for all developers at Intuit for managing their web applications, microservices, AWS Accounts and other types of assets.
We extended the development portal to allow data engineers to manage their data pipelines. It is a central location for data engineers to create, manage, monitor and remediate their pipelines.
Processors & Pipelines
Processors are reusable code artifacts that represent a task within a data pipeline. In batch pipelines, they correspond to Hive SQL or Spark SQL code that performs a transformation by reading data from input tables in the data lake and writing the transformed data back to another table. In stream pipelines, messages are read from the event bus, transformed and written back to the event bus.
Pipelines are a series of processors chained together to perform an activity/job. Batch pipelines are typically scheduled or triggered on the completion of other batch pipelines. Stream pipelines execute when messages arrive on the event bus.
Defining the Pipelines
Intuit data engineers create pipelines using the data pipeline widget in the development portal. During the pipeline creation, data engineers implement the pipeline’s processors, define its schedule, and specify upstream dependencies or additional triggers required for initiating the pipelines.
Processors within a pipeline specify the datasets they work on and the datasets they output to define the lineage. Pipelines are defined in development/QA environments, tested and promoted to production.
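As a rough, hypothetical sketch of these concepts (the class and field names below are illustrative only, not Intuit's actual pipeline API): a processor declares its transformation and the datasets it reads and writes, and a pipeline chains processors together with a schedule and upstream dependencies.

```python
# Hypothetical sketch of the processor/pipeline abstraction described above.
# Class names, fields, and dataset names are illustrative only.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Processor:
    """A reusable task: a SQL transform with declared input/output datasets."""
    name: str
    sql: str
    input_datasets: List[str]   # tables read from the lake (drives lineage)
    output_dataset: str         # table written back to the lake


@dataclass
class Pipeline:
    """A chain of processors plus the metadata needed to schedule/trigger it."""
    name: str
    processors: List[Processor]
    schedule: str = "daily"                        # or a cron expression
    upstream_pipelines: List[str] = field(default_factory=list)


clean_orders = Processor(
    name="clean_orders",
    sql="INSERT OVERWRITE TABLE curated.orders SELECT ... FROM raw.orders",
    input_datasets=["raw.orders"],
    output_dataset="curated.orders",
)

orders_rollup = Processor(
    name="orders_rollup",
    sql="INSERT OVERWRITE TABLE marts.daily_orders SELECT ... FROM curated.orders",
    input_datasets=["curated.orders"],
    output_dataset="marts.daily_orders",
)

orders_pipeline = Pipeline(
    name="orders_daily",
    processors=[clean_orders, orders_rollup],
    schedule="0 3 * * *",                  # run once a day at 03:00
    upstream_pipelines=["ingest_orders"],  # trigger after ingestion completes
)
```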
Managing the Pipelines
From the development portal users are able to navigate to their pipelines and manage them. Each pipeline has a custom monitoring dashboard that displays the current active instances of the pipeline and historical instances. The dashboard also has widgets for metrics such as execution time, CPU and memory usage, etc. A pipeline-specific logging dashboard allows users to look at the pipeline logs and debug in case of errors.
Users can edit the pipelines to add or delete processors, change the schedules and upstream dependencies, etc. as a part of the day-to-day operations for managing the pipelines.

Pipeline Execution Environments
The primary execution environment for our batch pipelines is AWS EMR. These pipelines are scheduled using an enterprise scheduler. This environment has been the workhorse and will continue to remain so, but it has started to show its age. The scheduler was built for an enterprise world and has struggled to make the transition to the cloud environment. Hadoop/YARN, which forms the basis of AWS EMR, has not kept pace with advances in container runtimes. In the target state for batch pipelines, we are working towards execution environments that are optimized for container runtimes and cloud-native schedulers.
We’re also investing in reducing the friction of switching pipelines from one execution environment to another. To change the execution environment, for example from Hadoop/YARN to Kubernetes, all a data engineer needs to do is redeploy the pipeline to the new environment.
Pipeline Tools
A key aspect of a paved road is a comprehensive set of tools for capabilities such as lineage, data parity, anomaly detection, etc. Consistency of tools across all pipelines and their execution environments is crucial for increasing the value we extract from the data and the confidence/trust we instill in consumers of this data.
Lineage
A lineage tool is critical for the productivity of the data engineers and their ability to operate the data pipelines because it tracks the lineage of all the pipelines from the source systems to the ingestion frameworks to the data lake and the analytics/reporting systems.
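For intuition, the toy sketch below shows how end-to-end lineage can be derived from the datasets each processor declares as inputs and outputs; the dataset names are hypothetical, and a real lineage tool covers far more metadata than this.

```python
# Toy sketch of end-to-end lineage derived from processors' declared
# inputs/outputs: build a graph of dataset dependencies and walk it upstream.
# Dataset names are hypothetical.
from collections import defaultdict

# (input datasets, output dataset) declared by processors across pipelines.
edges = [
    (["raw.orders"], "curated.orders"),
    (["curated.orders", "curated.customers"], "marts.daily_orders_by_segment"),
    (["marts.daily_orders_by_segment"], "reporting.exec_dashboard"),
]

upstream = defaultdict(set)
for inputs, output in edges:
    upstream[output].update(inputs)

def sources_of(dataset: str) -> set:
    """All upstream datasets that feed the given dataset, transitively."""
    seen, stack = set(), [dataset]
    while stack:
        for parent in upstream.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

print(sources_of("reporting.exec_dashboard"))
# -> {'marts.daily_orders_by_segment', 'curated.orders',
#     'curated.customers', 'raw.orders'}
```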
Data Anomaly Detection
Another important tool in the data pipelines arsenal is detection of data anomalies. There is a multitude of data anomalies to consider, including data freshness, lack of new data coming in, missing/duplicated data, etc.

Data anomaly detection tools increase the confidence/trust in the correctness of the data. The anomaly detection algorithms model seasonality, dynamically adjust thresholds, and alert consumers when anomalies are detected.
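As a simplified illustration of the idea (a toy sketch, not Intuit's actual anomaly detection service): model each hour of the day separately so the expected volume follows the daily seasonality, and flag points that fall outside a dynamically computed band.

```python
# Toy sketch of seasonality-aware anomaly detection on a row-count metric.
# Real systems are far more sophisticated; this only illustrates per-season
# baselines and dynamically adjusted thresholds.
import pandas as pd

def flag_anomalies(counts: pd.Series, z_threshold: float = 3.0) -> pd.Series:
    """counts: hourly row counts indexed by timestamp. Returns a boolean mask."""
    df = counts.to_frame("count")
    df["hour"] = df.index.hour                       # seasonality key: hour of day
    stats = df.groupby("hour")["count"].agg(["mean", "std"])
    df = df.join(stats, on="hour")
    # Dynamic band: mean +/- z * std for the matching hour of day.
    z = (df["count"] - df["mean"]) / df["std"].replace(0, 1)
    return z.abs() > z_threshold

# Example: two weeks of hourly ingest counts, then a sudden drop in volume.
idx = pd.date_range("2024-06-01", periods=14 * 24, freq="h")
counts = pd.Series(1000 + 200 * idx.hour.isin(range(9, 18)), index=idx)
counts.iloc[-1] = 50                                 # simulate missing data
anomalies = flag_anomalies(counts)
print(anomalies[anomalies].index)                    # -> the dropped hour is flagged
```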
Data Parity
Data parity checks are performed at multiple stages of the data pipelines to ensure the correctness of the data as it flows through the pipeline. Parity checks are another key capability for addressing compliance requirements such as SOX.
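A simple form of parity check compares row counts and aggregate checksums for the same slice of data between two stages of a pipeline. The sketch below is illustrative (table and column names are placeholders), not a description of Intuit's parity tooling.

```python
# Minimal sketch of a parity check between two stages of a pipeline:
# compare row counts and a key checksum for the same business date.
# Table and column names are placeholders; real checks also cover schema
# and key-level diffs.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("parity-check").enableHiveSupport().getOrCreate()

def stage_summary(table: str, date: str):
    df = spark.table(table).where(F.col("order_date") == date)
    row = df.agg(
        F.count("*").alias("rows"),
        F.sum(F.hash("order_id")).alias("key_checksum"),
    ).collect()[0]
    return row["rows"], row["key_checksum"]

source = stage_summary("curated.orders", "2024-06-01")
target = stage_summary("marts.daily_orders_detail", "2024-06-01")

if source != target:
    # In a real pipeline this would raise an alert and fail the run.
    raise ValueError(f"Parity mismatch: source={source} target={target}")
```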
Conclusion & Future Work
Intuit has thousands of data pipelines that span all business units and various functions such as marketing, customer success, risk, etc. These pipelines are critical to enabling data-driven experiences. The paved road described here provides a consistent environment for managing the pipelines. But it's only the beginning of our data pipeline journey.
Pipelines & Entity Graphs
Data lakes are a collection of thousands of tables that are hard to discover and explore, poorly documented, and difficult to use because they don't capture the entities that describe a business and the relationships between them. In the future, we envision entity graphs that represent how businesses use and extract insights from data. The data pipelines that acquire, transform and serve data will evolve to understand these entity graphs.
Data Mesh
In her paper on “Distributed Data Mesh,” Zhamak Dehghani, principal consultant, member of technical advisory board, and portfolio director at ThoughtWorks, lays the foundation for domain-oriented decomposition and ownership of data pipelines. To realize the vision of a data mesh and successfully enable domain owners to define and manage their own data pipelines, the “paved road” for data pipelines described here is a foundational stepping stone.
Translated from: https://medium.com/intuit-engineering/a-paved-road-for-data-pipelines-779004143e41