Open-Source Data Warehousing
by Simon Späti
Use these open-source tools for Data Warehousing
These days, everyone talks about open-source software. However, this is still not common in the Data Warehousing (DWH) field. Why is this?
For this post, I chose some open-source technologies and used them together to build a full data architecture for a Data Warehouse system.
I went with Apache Druid for data storage, Apache Superset for querying, and Apache Airflow as a task orchestrator.
Druid — the data store
Druid is an open-source, column-oriented, distributed data store written in Java. It’s designed to quickly ingest massive quantities of event data, and provide low-latency queries on top of the data.
Why use Druid?
Druid has many key features, including sub-second OLAP queries, real-time streaming ingestion, scalability, and cost effectiveness.
With the comparison of modern OLAP Technologies in mind, I chose Druid over ClickHouse, Pinot and Apache Kylin. Recently, Microsoft announced they will add Druid to their Azure HDInsight 4.0.
Why not Druid?
Carter Shanklin wrote a detailed post about Druid's limitations at Hortonworks.com. The main issue is with its support for SQL joins and advanced SQL capabilities.
The Architecture of Druid
Druid is scalable due to its cluster architecture. You have three different node types — the Middle-Manager-Node, the Historical Node and the Broker.
The great thing is that you can add as many nodes as you want in the specific area that fits best for you. If you have many queries to run, you can add more Brokers. Or, if a lot of data needs to be batch-ingested, you would add middle managers and so on.
A simple architecture is shown below. You can read more about Druid’s design here.
Apache Superset — the UI
The easiest way to query against Druid is through a lightweight, open-source tool called Apache Superset.
It is easy to use and has all the common chart types like Bubble Chart, Word Cloud, Heatmap, Box Plot and many more.
Druid provides a REST API and, in the newest version, also a SQL query API. This makes it easy to use with any tool, whether it is standard SQL, an existing BI tool or a custom application.
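To illustrate, here is a minimal sketch of calling the SQL query API from Python. It assumes a Druid Broker reachable at localhost:8082 and a hypothetical datasource called events; the /druid/v2/sql/ endpoint and the JSON payload shape follow Druid's documented SQL-over-HTTP interface.

```python
import json
import requests

# Druid's SQL API accepts a JSON payload containing the query string.
# Adjust host/port to wherever your Broker (or Router) is running.
DRUID_SQL_ENDPOINT = "http://localhost:8082/druid/v2/sql/"

# "events" is a hypothetical datasource name used for illustration.
query = """
    SELECT channel, COUNT(*) AS edits
    FROM events
    WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
    GROUP BY channel
    ORDER BY edits DESC
    LIMIT 10
"""

response = requests.post(
    DRUID_SQL_ENDPOINT,
    headers={"Content-Type": "application/json"},
    data=json.dumps({"query": query}),
)
response.raise_for_status()

# By default Druid returns one JSON object per result row.
for row in response.json():
    print(row)
```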
Apache Airflow — the Orchestrator
As mentioned in Orchestrators — Scheduling and monitoring workflows, this is one of the most critical decisions.
In the past, ETL tools like Microsoft SQL Server Integration Services (SSIS) and others were widely used. They were where your data transformation, cleaning and normalisation took place.
In more modern architectures, these tools aren’t enough anymore.
Moreover, code and data transformation logic are much more valuable to other data-savvy people in the company.
I highly recommend you read a blog post from Maxime Beauchemin about Functional Data Engineering — a modern paradigm for batch data processing. This goes much deeper into how modern data pipelines should be.
Also, consider reading The Downfall of the Data Engineer, where Max explains the breaking up of "data silos" and much more.
Why use Airflow?
Apache Airflow is a very popular tool for this task orchestration. Airflow is written in Python, and workflows are defined as Directed Acyclic Graphs (DAGs), which are also written in Python.
Instead of encapsulating your critical transformation logic somewhere in a tool, you place it where it belongs: inside the orchestrator.
Another advantage is using plain Python. There is no need to encapsulate other dependencies or requirements like fetching from an FTP server, copying data from A to B, or writing a batch file. You do that, and everything else, in the same place.
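As a rough sketch of what such a workflow looks like in plain Python (written against the Airflow 1.10-era API that was current when this post was written; the DAG name, schedule and task bodies are made up for illustration):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def fetch_from_ftp():
    # Placeholder: pull the raw files you need, e.g. from an FTP server.
    print("fetching raw files ...")


def load_into_druid():
    # Placeholder: trigger a Druid batch-ingestion job for the new files.
    print("submitting ingestion spec ...")


with DAG(
    dag_id="example_dwh_pipeline",   # hypothetical DAG name
    start_date=datetime(2018, 11, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    fetch = PythonOperator(task_id="fetch_from_ftp", python_callable=fetch_from_ftp)
    load = PythonOperator(task_id="load_into_druid", python_callable=load_into_druid)

    fetch >> load  # plain Python defines the dependency graph
```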
Features of Airflow
Moreover, you get a fully functional overview of all current tasks in one place.
Other relevant features of Airflow are that you write workflows as if you were writing programs, and that external jobs like Databricks, Spark, etc. are no problem.
Job testing goes through Airflow itself. That includes passing parameters to other jobs downstream, or verifying what is running on Airflow and seeing the actual code. The log files and other metadata are accessible through the web GUI.
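For example, passing a parameter to a downstream job can be done with Airflow's XCom mechanism. The sketch below is illustrative only; the DAG, task and key names are invented, and it targets the Airflow 1.10-style API:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def compute_batch_id(**context):
    # Push a value so that downstream tasks can pick it up via XCom.
    batch_id = context["ds_nodash"]  # execution date, e.g. "20181129"
    context["ti"].xcom_push(key="batch_id", value=batch_id)


def load_batch(**context):
    # Pull the value pushed by the upstream task.
    batch_id = context["ti"].xcom_pull(task_ids="compute_batch_id", key="batch_id")
    print(f"loading batch {batch_id} ...")


with DAG(
    dag_id="example_xcom_pipeline",  # hypothetical DAG name
    start_date=datetime(2018, 11, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    compute = PythonOperator(
        task_id="compute_batch_id",
        python_callable=compute_batch_id,
        provide_context=True,  # needed on Airflow 1.10 to receive the context kwargs
    )
    load = PythonOperator(
        task_id="load_batch",
        python_callable=load_batch,
        provide_context=True,
    )

    compute >> load
```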
(Re)running only parts of the workflow and its dependent tasks is a crucial feature that comes out of the box when you create your workflows with Airflow. The jobs/tasks are run in a context, the scheduler passes in the necessary details, and the work gets distributed across your cluster at the task level, not at the DAG level.
For many more features, visit the full list.
ETL with Apache Airflow
If you want to start with Apache Airflow as your new ETL tool, please start with the ETL best practices with Airflow shared here. It has simple ETL examples with plain SQL, with HIVE, with Data Vault, Data Vault 2, and Data Vault with Big Data processes. It gives you an excellent overview of what's possible and also how you would approach it.
At the same time, there is a Docker container that you can use, meaning you don't even have to set up any infrastructure. You can pull the container from here.
For the GitHub repo, follow the link on etl-with-airflow.
Conclusion
If you're searching for an open-source data architecture, you cannot ignore Druid for speedy OLAP responses, Apache Airflow as an orchestrator that keeps your data lineage and schedules in line, plus an easy-to-use dashboard tool like Apache Superset.
My experience so far is that Druid is bloody fast and a perfect fit for replacing OLAP cubes in a traditional way, but getting started with installing clusters, ingesting data, viewing logs and so on is still not that relaxed. If you need that, have a look at Imply, which was created by the founders of Druid. It creates all the services around Druid that you need. Unfortunately, though, it's not open-source.
Apache Airflow and its features as an orchestrator are something that has not caught on much yet in traditional Business Intelligence environments. I believe this change comes very naturally when you start using open-source and newer technologies.
And Apache Superset is an easy and fast way to be up and running and showing data from Druid. There are better tools, like Tableau, etc., but not for free. That's why Superset fits well in the ecosystem if you're already using the above open-source technologies. But as an enterprise company, you might want to spend some money in that category, because that is what the users get to see at the end of the day.
Related Links:
Understanding Apache Airflow’s key concepts
How Druid enables analytics at Airbnb
Google launches Cloud Composer, a new workflow automation tool for developers
A fully managed workflow orchestration service built on Apache Airflow
Integrating Apache Airflow and Databricks: Building ETL pipelines with Apache Spark
ETL with Apache Airflow
What is Data Engineering and the future of Data Warehousing
Imply — Managed Druid platform (closed-source)
Ultra-fast OLAP Analytics with Apache Hive and Druid
Originally published at www.sspaeti.com on November 29, 2018.
Translated from: https://www.freecodecamp.org/news/open-source-data-warehousing-druid-apache-airflow-superset-f26d149c9b7/