Open-source Data Warehousing

by Simon Späti

Use these open-source tools for Data Warehousing

These days, everyone talks about open-source software. However, this is still not common in the Data Warehousing (DWH) field. Why is this?

For this post, I chose some open-source technologies and used them together to build a full data architecture for a Data Warehouse system.

I went with Apache Druid for data storage, Apache Superset for querying, and Apache Airflow as a task orchestrator.

Druid — the data store

Druid is an open-source, column-oriented, distributed data store written in Java. It’s designed to quickly ingest massive quantities of event data, and provide low-latency queries on top of the data.

Why use Druid?

Druid has many key features, including sub-second OLAP queries, real-time streaming ingestion, scalability, and cost effectiveness.

With the comparison of modern OLAP Technologies in mind, I chose Druid over ClickHouse, Pinot and Apache Kylin. Recently, Microsoft announced they will add Druid to their Azure HDInsight 4.0.

Why not Druid?

Carter Shanklin wrote a detailed post about Druid’s limitations at Hortonworks.com. The main issues are its limited support for SQL joins and other advanced SQL capabilities.

The Architecture of Druid

Druid is scalable due to its cluster architecture. You have three different node types — the Middle Manager node, the Historical node, and the Broker node.

The great thing is that you can add as many nodes as you want in the specific area that fits best for you. If you have many queries to run, you can add more Brokers. Or, if a lot of data needs to be batch-ingested, you would add middle managers and so on.

A simple architecture is shown below. You can read more about Druid’s design here.

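As a quick, hands-on illustration, every Druid process exposes a simple HTTP /status endpoint, so you can check which pieces of the cluster are up with a few lines of Python. This is only a sketch: the hosts and ports below are the single-machine defaults and are assumptions, so substitute your own nodes.

```python
import requests

# Default ports of a single-machine Druid setup; these are assumptions,
# substitute the hosts/ports of your own cluster.
nodes = {
    "broker": "http://localhost:8082",
    "historical": "http://localhost:8083",
    "middle-manager": "http://localhost:8091",
}

for name, base_url in nodes.items():
    # Every Druid process answers GET /status with version and health info.
    info = requests.get(f"{base_url}/status", timeout=5).json()
    print(f"{name}: Druid {info['version']}")
```
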
Apache Superset — the UI

The easiest way to query against Druid is through a lightweight, open-source tool called Apache Superset.

It is easy to use and has all common chart types like Bubble Chart, Word Count, Heatmaps, Boxplot and many more.

Druid provides a REST API, and in the newest version also a SQL query API. This makes it easy to use with any tool, whether it is standard SQL, an existing BI tool, or a custom application.

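For example, a custom application can send plain SQL to the Broker over HTTP. Here is a minimal sketch with Python's requests library, assuming a Broker on its default port and a hypothetical `events` datasource:

```python
import requests

# Default Druid SQL endpoint on the Broker; the host and the "events"
# datasource are assumptions for this sketch.
DRUID_SQL_URL = "http://localhost:8082/druid/v2/sql"

sql = """
SELECT channel, COUNT(*) AS cnt
FROM events
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY channel
ORDER BY cnt DESC
LIMIT 10
"""

response = requests.post(DRUID_SQL_URL, json={"query": sql})
response.raise_for_status()

# Druid returns one JSON object per result row by default.
for row in response.json():
    print(row["channel"], row["cnt"])
```
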
Apache Airflow — the Orchestrator

As mentioned in Orchestrators — Scheduling and monitoring workflows, choosing the orchestrator is one of the most critical decisions.

In the past, ETL tools like Microsoft SQL Server Integration Services (SSIS) and others were widely used. They were where your data transformation, cleaning and normalisation took place.

In more modern architectures, these tools aren’t enough anymore.

Moreover, code and data transformation logic are much more valuable to other data-savvy people in the company.

I highly recommend you read Maxime Beauchemin’s blog post about Functional Data Engineering — a modern paradigm for batch data processing. It goes much deeper into how modern data pipelines should be built.

Also, consider reading The Downfall of the Data Engineer, where Max explains the breaking of the “data silo” and much more.

Why use Airflow?

Apache Airflow is a very popular tool for this kind of task orchestration. Airflow is written in Python, and tasks are written as Directed Acyclic Graphs (DAGs), also in Python.

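To make that concrete, here is a minimal sketch of a DAG in the Airflow 1.x style of this article's era. The DAG id, schedule, and the two callables are hypothetical placeholders:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

default_args = {
    "owner": "airflow",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

def extract(**context):
    # In a real pipeline this might pull from an API, FTP, or database.
    print("extracting data for", context["ds"])

def load(**context):
    # ...and this might push the result into Druid or a warehouse table.
    print("loading data for", context["ds"])

dag = DAG(
    dag_id="example_etl",            # hypothetical name
    default_args=default_args,
    start_date=datetime(2018, 11, 1),
    schedule_interval="@daily",
)

extract_task = PythonOperator(task_id="extract", python_callable=extract,
                              provide_context=True, dag=dag)
load_task = PythonOperator(task_id="load", python_callable=load,
                           provide_context=True, dag=dag)

extract_task >> load_task  # the directed edge of the graph
```
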
Instead of encapsulating your critical transformation logic somewhere in a tool, you place it where it belongs: inside the orchestrator.

Another advantage is using plain Python. There is no need to encapsulate other dependencies or requirements, like fetching data from an FTP server, copying data from A to B, or writing a batch file. You do that, and everything else, in the same place.

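A sketch of what such glue code can look like inside a task, using only the standard library; the FTP host and file paths are made up for illustration:

```python
import shutil
from ftplib import FTP

def fetch_and_copy():
    # Fetch a file from an FTP server (hypothetical host and file name)...
    with FTP("ftp.example.com") as ftp:
        ftp.login()
        with open("/tmp/export.csv", "wb") as local_file:
            ftp.retrbinary("RETR export.csv", local_file.write)
    # ...then copy the data from A to B, plain and simple.
    shutil.copy("/tmp/export.csv", "/data/landing/export.csv")
```
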
Features of Airflow

Moreover, you get a fully functional overview of all current tasks in one place.

Other relevant features of Airflow are that you write workflows as if you were writing programs. External jobs like Databricks, Spark, etc. are no problem.

Job testing goes through Airflow itself. That includes passing parameters to other jobs downstream, or verifying what is running on Airflow and seeing the actual code. The log files and other metadata are accessible through the web GUI.

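For instance, a single task can be run in isolation with `airflow test example_etl extract 2018-11-01` (Airflow 1.x CLI), and passing a parameter downstream is done through XComs. A sketch, reusing the two hypothetical callables from the DAG example above:

```python
def extract(**context):
    row_count = 42  # pretend we counted the rows we ingested
    # Push a value for downstream tasks to pick up.
    context["ti"].xcom_push(key="row_count", value=row_count)

def load(**context):
    # Pull the value the upstream "extract" task published.
    row_count = context["ti"].xcom_pull(task_ids="extract", key="row_count")
    print("upstream produced", row_count, "rows")
```
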
(Re)running only parts of the workflow, together with the dependent tasks, is a crucial feature that comes out of the box when you create your workflows with Airflow. The jobs/tasks run in a context, the scheduler passes in the necessary details, and the work gets distributed across your cluster at the task level, not at the DAG level.

For many more features, visit the full list.

ETL with Apache Airflow

If you want to start with Apache Airflow as your new ETL tool, please start with the ETL best practices with Airflow shared here. It has simple ETL examples with plain SQL, with HIVE, with Data Vault, Data Vault 2, and Data Vault with Big Data processes. It gives you an excellent overview of what’s possible and how you would approach it.

At the same time, there is a Docker container that you can use, meaning you don’t even have to set up any infrastructure. You can pull the container from here.

For the GitHub repo, follow the link on etl-with-airflow.

Conclusion

If you’re searching for an open-source data architecture, you cannot ignore Druid for speedy OLAP responses, Apache Airflow as an orchestrator that keeps your data lineage and schedules in line, plus an easy-to-use dashboard tool like Apache Superset.

My experience so far is that Druid is bloody fast and a perfect fit for traditional OLAP cube replacements, but it still needs a more effortless setup for installing clusters, ingesting data, viewing logs, etc. If you need that, have a look at Imply, which was created by the founders of Druid. It provides all the services around Druid that you need. Unfortunately, though, it’s not open-source.

Apache Airflow and its features as an orchestrator are something that has not yet taken hold in traditional Business Intelligence environments. I believe this change comes very naturally when you start using open-source and newer technologies.

And Apache Superset is an easy and fast way to get up and running and to show data from Druid. There are better tools, like Tableau, etc., but not for free. That’s why Superset fits well into the ecosystem if you’re already using the above open-source technologies. But as an enterprise company, you might want to spend some money in that category, because that is what the users get to see at the end of the day.

Related Links:

  • Understanding Apache Airflow’s key concepts

  • How Druid enables analytics at Airbnb

  • Google launches Cloud Composer, a new workflow automation tool for developers

  • A fully managed workflow orchestration service built on Apache Airflow

  • Integrating Apache Airflow and Databricks: Building ETL pipelines with Apache Spark

  • ETL with Apache Airflow

  • What is Data Engineering and the future of Data Warehousing

  • Imply — Managed Druid platform (closed-source)

  • Ultra-fast OLAP Analytics with Apache Hive and Druid

Originally published at www.sspaeti.com on November 29, 2018.
