The term “Data Warehouse” is widely used in the data analytics world, however, it’s quite common for people who are new with data analytics to ask the above question.
術語“數據倉庫”在數據分析領域中被廣泛使用,但是,對于數據分析新手來說,問上述問題非常普遍。
This post attempts to help explain the definition of a data warehouse, when, and why to consider setting up one.
這篇文章試圖幫助解釋數據倉庫的定義,何時以及為什么考慮建立一個數據倉庫。
Ps: This is a section of a guidebook our team is writing, The Analytics Set-up Guidebook. If you are interested to learn more about the high-level or best practices of modern BI stacks, feel free to check out the link to see our progress.
附:這是我們團隊正在編寫的指南的一部分, 即Google Analytics(分析)設置指南 。 如果您想了解有關現代BI堆棧的高級或最佳實踐的更多信息,請隨時查看鏈接以了解我們的進展。
什么是數據倉庫? (What is a data warehouse?)
A data warehouse is a type of analytics database that stores and processes your data for the purpose of analytics. Your data warehouse will handle two main functions of your analytics: store your analytical data & process your analytical data.
數據倉庫是一種分析數據庫,用于存儲和處理數據以進行分析。 數據倉庫將處理分析的兩個主要功能: 存儲分析數據和處理分析數據 。
Why do you need one? You will need a data warehouse for two main purposes:
為什么需要一個? 您將需要一個數據倉庫來實現兩個主要目的:
- First, you can’t combine data from multiple business functions easily if they sit in different sources. 首先,如果來自多個業務功能的數據位于不同的來源中,則無法輕松合并它們。
- Second, your source systems are not designed to run heavy analytics, and doing so might jeopardize your business operations as it increases the load on those systems. 其次,您的源系統并非旨在運行繁重的分析,這樣做可能會危害您的業務運營,因為它會增加這些系統的負載。
Your data warehouse is the centerpiece of every step of your analytics pipeline process, and it serves three main purposes:
數據倉庫是分析管道流程中每個步驟的核心,它具有三個主要目的:
Storage: In the consolidate (Extract & Load) step, your data warehouse will receive and store data coming from multiple sources.
存儲:在合并(提取和加載)步驟中,您的數據倉庫將接收和存儲來自多個來源的數據 。
Process: In the process (Transform & Model) step, your data warehouse will handle most (if not all) of the intensive processing generated from the transform step.
流程:在流程(“轉換和模型”)步驟中,您的數據倉庫將處理從轉換步驟生成的大部分(如果不是全部) 密集處理 。
Access: In the reporting (Visualize & Delivery) step, reports are being gathered within the data-warehouse first, then visualized and delivered to end-users.
訪問:在報告(可視化和交付)步驟中,首先在數據倉庫中收集報告,然后將其可視化并交付給最終用戶。
At the moment, most data warehouses use SQL as their primary querying language.
目前,大多數數據倉庫都使用SQL作為其主要查詢語言。
什么時候是合適的時間來獲取數據倉庫? (When is the right time to get a data warehouse?)
The TL;DR answer is that it depends. It depends on the stage of your company, the amount of data you have, your budget, and so on.
TL; DR的答案取決于它。 這取決于公司的階段,擁有的數據量,預算等。
At an early stage, you can probably get by without a data warehouse, and connect a business intelligence (BI) tool directly to your production database (A simple BI setup for people just starting out).
在早期階段,您可能不需要數據倉庫就可以解決問題,并將商業智能(BI)工具直接連接到生產數據庫( 針對剛開始的人們的簡單BI設置 )。
However, if you are still not sure if a data warehouse is the right thing for your company, consider the below pointers:
但是,如果仍然不確定數據倉庫是否適合您的公司,請考慮以下幾點:
First, do you need to analyze data from different sources?
首先,您需要分析來自不同來源的數據嗎?
At some point in your company’s life, you would need to combine data from different internal tools in order to make better, more informed business decisions.
在公司生命中的某個時刻,您需要合并來自不同內部工具的數據,以便做出更好,更明智的業務決策。
For instance, if you’re a restaurant and want to analyze orders/waitress ratio efficiency (which hour of the week the staff is most busy vs most free), you need to combine your sales data (from POS system) with your staff duty data (from HR system).
例如,如果您是一家餐館,并且想分析訂單/女服務員的工作效率(工作人員在一周中的哪一小時最忙與最空閑),則需要將銷售數據(來自POS系統)與工作人員的職責結合起來數據(來自HR系統)。
For those analyses, it is a lot easier to do if your data is located in one central location.
對于這些分析,如果您的數據位于一個中央位置,則操作會容易得多。
Second, do you need to separate your analytical data from your transactional data?
其次,您是否需要將分析數據與交易數據分開?
As mentioned, your transactional systems are not designed for analytical purposes. So if you collect activity logs or other potentially useful pieces of information in your app, it’s probably not a good idea to store this data in your app’s database and have your analysts work on the production database directly.
如前所述,您的交易系統并非為分析目的而設計。 因此,如果您在應用程序中收集活動日志或其他可能有用的信息,將這些數據存儲在應用程序的數據庫中并讓分析師直接在生產數據庫上工作可能不是一個好主意。
Instead, it’s a much better idea to purchase a data warehouse — one that’s designed for complex querying — and transfer the analytical data there instead. That way, the performance of your app isn’t affected by your analytics work.
取而代之的是,購買一個數據倉庫(一個專門用于復雜查詢的數據倉庫)并在那里傳輸分析數據是一個更好的主意。 這樣,您的應用程序的性能就不會受到分析工作的影響。
Third, is your original data source not suitable for querying?
第三,您的原始數據源是否不適合查詢?
For example, the vast majority of BI tools do not work well with NoSQL data stores like MongoDB. This means that applications that use MongoDB on the backend need their analytical data to be transferred to a data warehouse, in order for data analysts to work effectively with it.
例如,絕大多數BI工具不能與MongoDB等NoSQL數據存儲一起很好地工作。 這意味著在后端使用MongoDB的應用程序需要將其分析數據傳輸到數據倉庫,以便數據分析人員有效地使用它。
Fourth, do you want to increase the performance of your analytical queries?
第四,您是否想提高分析查詢的性能?
If your transactional data consists of hundreds of thousands of rows, it’s probably a good idea to create summary tables that aggregate that data into a more queryable form. Not doing so will cause queries to be incredibly slow — not to mention having them being an unnecessary burden on your database.
如果您的事務數據包含成千上萬的行,那么創建匯總表以將該數據聚合為更可查詢的形式可能是一個好主意。 如果不這樣做,將導致查詢異常緩慢-更不用說使查詢成為數據庫的不必要負擔。

If you answered yes to any of the above questions, then chances are good that you should just get a data warehouse.
如果您對以上任何一個問題的回答為“是”,那么很可能就應該獲得一個數據倉庫。
That said, in our opinion, it’s usually a good idea to just go get a data warehouse, as data warehouses are not expensive in the cloud era.
也就是說,在我們看來,最好是先獲得一個數據倉庫,因為在云時代,數據倉庫并不昂貴。
我應該選擇哪個數據倉庫? (Which Data Warehouse Should I Pick?)
Here are some common data warehouses that you may pick from:
以下是一些常見的數據倉庫,您可以從中選擇:
- Amazon Redshift 亞馬遜Redshift
- Google BigQuery Google BigQuery
- Snowflake 雪花
- ClickHouse (self-hosted) ClickHouse(自托管)
- Presto (self-hosted) Presto(自托管)
If you’re just getting started and don’t have a strong preference, we suggest that you go with Google BigQuery for the following reasons:
如果您只是入門而又沒有強烈的偏好,建議您使用Google BigQuery,原因如下:
BigQuery is free for the first 10GB storage and first 1TB of queries. After that it’s pay-per-usage.
BigQuery對前10GB的存儲空間和前1TB的查詢免費 。 之后是按使用量付費。
BigQuery is fully managed (serverless): There is no physical (or virtual) server to spin up or manage.
BigQuery是完全托管(無服務器)的 :沒有物理(或虛擬)服務器可以啟動或管理。
As a result of its architecture, BigQuery auto-scales: BigQuery will automatically determine the right amount of computing resources to allocate to each query, depending on the query’s complexity and the amount of data you scan, without you having to manually fine-tune it.
作為其架構的結果, BigQuery會自動擴展: BigQuery將自動確定要分配給每個查詢的正確計算資源量,具體取決于查詢的復雜性和您掃描的數據量,而無需手動進行微調。
(Note: we don’t have any affiliation with Google, and we don’t get paid to promote BigQuery).
(請注意:我們與Google沒有任何隸屬關系,并且沒有獲得宣傳BigQuery的報酬)。
However, if you have a rapidly increasing volume of data, or if you have complex/special use cases, you will need to carefully evaluate your options.
但是,如果您的數據量Swift增加,或者您有復雜/特殊的用例,則需要仔細評估您的選擇。
Below, we present a table of the most popular data warehouses. Our intention here is to give you a high-level understanding of the most common choices in the data warehouse space. This is by no means comprehensive, nor is it sufficient to help you make an informed decision.
下面,我們列出了最受歡迎的數據倉庫。 我們的目的是讓您對數據倉庫空間中最常見的選擇有一個高層次的了解。 這絕不是全面的,也不足以幫助您做出明智的決定。
But it is, we think, a good start:
但是,我們認為這是一個不錯的開始:

是什么使數據倉庫與普通SQL數據庫不同? (What makes a data warehouse different from normal SQL database?)
At this point some of you might be asking:
此時,有些人可能會問:
“Hey isn’t a data warehouse just like a relational database that stores data for analytics? Can’t I just use something like MySQL, PostgreSQL, MSSQL or Oracle as my data warehouse?”
“嘿,數據倉庫不像存儲數據以供分析的關系數據庫嗎? 我不能只使用MySQL,PostgreSQL,MSSQL或Oracle之類的數據倉庫嗎?”
The short answer is: yes you can.
簡短的答案是:是的,您可以。
The long answer is: it depends. First, we need to understand a few concepts.
長的答案是:這取決于。 首先,我們需要了解一些概念。
事務性工作量與分析性工作量 (Transactional Workloads vs Analytical Workloads)
It is important to understand the difference between two kinds of database workloads: transactional workloads and analytical workloads.
重要的是要了解兩種數據庫工作負載之間的區別:事務性工作負載和分析性工作負載。
A transactional workload is the querying workload that serves normal business applications. When a visitor loads a product page in a web app, a query is sent to the database to fetch this product, and return the result to the application for processing.
事務性工作負載是為正常業務應用程序提供服務的查詢工作負載。 當訪問者在Web應用程序中加載產品頁面時,查詢將發送到數據庫以獲取此產品,并將結果返回給應用程序進行處理。
SELECT * FROM products WHERE id = 123
(the query above retrieves information for a single product with ID 123)
(上面的查詢檢索ID為123的單個產品的信息)
Here are several common attributes of transactional workloads:
以下是事務性工作負載的幾個常見屬性:
Each query usually retrieves a single record or a small number of records (e.g. get the first 10 blog posts in a category)
每個查詢通常檢索單個記錄或少量記錄(例如,獲取類別中的前10個博客帖子)
Transactional workloads typically involve simple queries that take a very short time to run (less than 1 second)
事務性工作負載通常涉及需要很短時間 (不到1秒) 運行的簡單查詢
Lots of concurrent queries at any point in time, limited by the number of concurrent visitors of the application. For big websites, this can go to the thousands or hundreds of thousands.
在任何時間點都有很多并發查詢 ,受應用程序并發訪問者數量的限制。 對于大型網站,這可以達到成千上萬。
Usually interested in the whole data record (e.g. every column in the product table).
通常對整個數據記錄感興趣(例如,產品表中的每一列)。
Analytical workloads, on the other hand, refer to workload for analytical purposes, the kind of workload that this book talks about. When a data report is run, a query will be sent to DB to calculate the results, and then displayed to end-users.
分析工作負載 ,而另一方面,是指工作量分析的目的,那種工作量是這本書的會談。 運行數據報告時,查詢將發送到數據庫以計算結果,然后顯示給最終用戶。
SELECT category_name, count(*) as num_products FROM products GROUP BY 1
(The above query scans the entire products table to count how many products are there in each category)
(上面的查詢會掃描整個產品表,以計算每個類別中有多少個產品)
Analytical workloads, on the other hand, have the following attributes:
另一方面,分析工作負載具有以下屬性:
Each query typically scans a large number of rows in the table.
每個查詢通常會掃描表中的大量行 。
Each query is heavy and takes a long time (minutes, or even hours) to finish
每個查詢都很繁瑣,需要很長時間 (幾分鐘甚至幾小時)才能完成
Not a lot of concurrent queries happen, limited by the number of reports or internal staff members using the analytics system.
并發查詢不會很多 ,受使用分析系統的報告或內部員工數量的限制。
Usually interested in just a few columns of data.
通常只對幾列數據感興趣。
Below is a comparison table between transactional vs analytical workload/databases.
下表是事務性與分析性工作負載/數據庫之間的比較表。

Transactional workloads have many simple queries, whereas analytical workloads have few heavy queries.
事務性工作負載具有許多簡單查詢,而分析工作負載則具有很少的繁重查詢。
Analytics(分析)資料庫的后端不同 (The Backend for Analytics Databases is Different)
Because of the drastic difference between the two workloads above, the underlying backend design of the database for the two workloads are very different. Transactional databases are optimized for fast, short queries with high concurrent volume, while analytical databases are optimized for long-running, resource-intensive queries.
由于上述兩個工作負載之間的巨大差異,因此兩個工作負載的數據庫基礎后端設計非常不同。 事務數據庫針對并發量大的快速,短查詢進行了優化,而分析數據庫針對長時間運行的資源密集型查詢進行了優化。
What are the differences in architecture you ask? This will take a dedicated section to explain, but the gist of it is that analytical databases use the following techniques to guarantee superior performance:
您要求的架構有什么區別? 這將用專門的部分進行解釋,但要點是分析數據庫使用以下技術來保證卓越的性能:
Columnar storage engine: Instead of storing data row by row on disk, analytical databases group columns of data together and store them.
列式存儲引擎:分析數據庫不是將數據逐行存儲在磁盤上,而是將數據列分組在一起并進行存儲。
Compression of columnar data: Data within each column is compressed for smaller storage and faster retrieval.
壓縮列數據:每列中的數據都經過壓縮,以減少存儲量并加快檢索速度。
Parallelization of query executions: Modern analytical databases are typically run on top of thousands of machines. Each analytical query can thus be split into multiple smaller queries to be executed in parallel amongst those machines (divide and conquer strategy)
查詢執行的并行化:現代分析數據庫通常在數千臺計算機上運行。 因此,每個分析查詢都可以拆分為多個較小的查詢,以在這些機器之間并行執行(分而治之策略)
As you can probably guess by now, MySQL, PostgreSQL, MSSQL, and Oracle databases are designed to handle transactional workloads, whereas data warehouses are designed to handle analytical workloads.
您可能現在已經猜到了,MySQL,PostgreSQL,MSSQL和Oracle數據庫旨在處理事務性工作負載,而數據倉庫旨在處理分析性工作負載。
那么,我可以使用普通SQL數據庫作為數據倉庫嗎? (So, can I use a normal SQL database as my data warehouse?)
Like we’ve said earlier, yes you can, but it depends.
就像我們之前說過的,可以,但是要視情況而定。
If you’re just starting out with small set of data and few analytical use cases, it’s perfectly fine to pick a normal SQL database as your data warehouse (most popular ones are MySQL, PostgreSQL, MSSQL or Oracle). If you’re relatively big with lots of data, you still can, but it will require proper tuning and configuring.
如果您只是從少量數據和很少的分析用例開始,那么最好選擇一個普通SQL數據庫作為您的數據倉庫(最受歡迎的是MySQL,PostgreSQL,MSSQL或Oracle)。 如果您的數據量相對較大,您仍然可以,但是需要進行適當的調整和配置。
That said, with the advent of low-cost data warehouse like BigQuery, Redshift above, we would recommend you go ahead with a data warehouse.
也就是說,隨著BigQuery,Redshift等低成本數據倉庫的出現,我們建議您繼續使用數據倉庫。
However, if you must choose a normal SQL-based database (for example your business only allows you to host it on-premise, within your own network) we recommend going with PostgreSQL as it has the most features supported for analytics. We’ve also written a detailed blog post discussing this topic here: Why you should use PostgreSQL over MySQL for analytics purpose.
但是,如果必須選擇一個基于SQL的常規數據庫(例如,您的企業只允許您在自己的網絡中內部托管該數據庫),我們建議使用PostgreSQL,因為它具有分析支持的最多功能。 我們還在此處寫了一篇詳細的博客文章,討論了該主題: 為什么要在分析上使用PostgreSQL over MySQL 。
摘要 (Summary)
In this post, we zoomed in into data warehouse and spoke about:
在這篇文章中,我們放大了數據倉庫并談到了:
- Data warehouse is the central analytics database that stores & processes your data for analytics 數據倉庫是中央分析數據庫,用于存儲和處理數據以進行分析
- The 4 trigger points when you should get a data warehouse 獲取數據倉庫的4個觸發點
- A simple list of data warehouse technologies you can choose from 您可以選擇的簡單數據倉庫技術列表
- How a data warehouse is optimized for analytical workload vs traditional database for transactional workload. 與分析事務處理的傳統數據庫相比,如何優化數據倉庫以進行分析工作負載。
Originally published at The Analytics Setup Guidebook by Holistics: Understanding The Data Warehouse.
最初發布于 Holistics的《 Analytics設置指南》中:了解數據倉庫 。
翻譯自: https://towardsdatascience.com/what-is-a-data-warehouse-when-and-why-to-consider-one-2e826be68e95
本文來自互聯網用戶投稿,該文觀點僅代表作者本人,不代表本站立場。本站僅提供信息存儲空間服務,不擁有所有權,不承擔相關法律責任。 如若轉載,請注明出處:http://www.pswp.cn/news/388917.shtml 繁體地址,請注明出處:http://hk.pswp.cn/news/388917.shtml 英文地址,請注明出處:http://en.pswp.cn/news/388917.shtml
如若內容造成侵權/違法違規/事實不符,請聯系多彩編程網進行投訴反饋email:809451989@qq.com,一經查實,立即刪除!