The startup data stack starter pack
I advise a lot of people on how to build out their data stack, from tiny startups to enterprise companies that are moving to the cloud or away from legacy solutions. There are many choices out there, and navigating them all can be tricky. Here’s a breakdown of your options, their trade-offs and pricing, and some points to think through when making your decision, along with some personal thoughts on each option.
My background
I’m CTO and co-founder at Dataform. I was previously an engineer at Google, where I spent most of my 6 years building big data pipelines with internal tools similar to what is now Apache Beam. Dataform is a data modelling platform for cloud data warehouses. While it is only one small part of the overall data stack, it is often the glue that ties many things together, and as a result we spend a lot of time talking about overall data architecture with customers and prospective clients.
Product recommendation methodology
It’s impossible for me to give a completely fair trial to every product in this space. In general, I’ve chosen to highlight products that:
- Have generally high adoption and awareness amongst startups
- Our customers generally speak highly of
- Fit into the ELT model of a data stack
- Are innovative and new, and may not tick the boxes above, but that I personally believe are worth a mention
Where I have significant experience with a product, I’ll let you know and provide more detail on why. Similarly, in one or two cases I’ve shared my reasons for not recommending a product.
Overview
There is a prevailing model of a data stack that we consistently see the world moving toward, and it’s probably best summed up by this diagram. This is an ELT architecture (extract, load, transform) as opposed to a more traditional ETL architecture, and it can support companies of all sizes (perhaps with the exception of extremely large enterprises).
Event data collection
How do you collect event data from across all of your different applications (web, app, backend services) and send it to other systems or your data warehouse?
Conceptually straightforward, so not much to say here! Event-based analytics is usually the easiest place to start, and most off-the-shelf solutions are built around it.
Tracking everything that you want to use for analytics in events avoids the need to join in other data sources at analysis time, and lends itself well to product analytics, where the ordering of events is important to consider.
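As a rough illustration of the idea, here is a minimal Python sketch of events that carry their own context and are analysed in timestamp order. The `Event` shape, the event names and the `funnel_reached` helper are all hypothetical and are not taken from any particular tracking product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Event:
    """A self-describing analytics event (hypothetical shape, not a vendor schema)."""
    user_id: str
    name: str
    timestamp: datetime
    # Context captured at tracking time, so no join is needed at analysis time.
    properties: dict = field(default_factory=dict)


def funnel_reached(events, user_id, steps):
    """Return how many of `steps` the user completed, in order."""
    ordered = sorted(
        (e for e in events if e.user_id == user_id),
        key=lambda e: e.timestamp,
    )
    reached = 0
    for event in ordered:
        if reached < len(steps) and event.name == steps[reached]:
            reached += 1
    return reached


def at(minute):
    """Helper for building timestamps on a fixed, made-up day."""
    return datetime(2020, 1, 1, 12, minute, tzinfo=timezone.utc)


# Events arrive out of order; the analysis sorts them by timestamp.
events = [
    Event("u1", "project_created", at(5), {"plan": "free"}),
    Event("u1", "signed_up", at(0), {"plan": "free", "referrer": "blog"}),
    Event("u2", "project_created", at(1), {"plan": "team"}),
    Event("u1", "invited_teammate", at(9), {"plan": "free"}),
]

print(funnel_reached(events, "u1", ["signed_up", "project_created"]))
```

Because each event already includes properties such as the plan, the funnel question can be answered from the event stream alone.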
Data integration
How do you move data between databases and services? There is some overlap with collection here. Typically you need to move data between various places, such as:
- SaaS services > Data warehouse
- Production DBs > Data warehouse
- Event collection > Data warehouse / SaaS tools / CRMs
- Data warehouse > SaaS tools / CRMs
For the rest of the article we’ll consider these as two different data integration problems:
- Data integration to the warehouse
- Data integration to other SaaS products
Data warehouses
Where you move all your data to so you can query it together.
A lot about data warehousing has changed over the last 10 years: data warehouses now scale to unprecedented levels. Before Snowflake and BigQuery, organizations with truly massive data would have avoided warehouses due to their limited scale, and instead opted for solutions such as Apache Spark, Dataflow, or Hadoop MapReduce-like systems.
Warehouses and SQL have many benefits, and the scalability limits are (mostly) gone. Additionally, with the rise of engineering-inspired data modelling tools (such as Dataform), it’s possible to manipulate data via SQL in a well-tested, reproducible way.
We’ve written about this change, if you’d like more information on why we think the shift towards SQL-based warehousing is the right one and how it can help you move quickly, especially as a startup!
Data modelling
How do you actually transform data from many different sources into a set of clean, well-tested datasets?
ELT introduces a new problem: you end up with a data warehouse full of messy datasets from your newly set up data integration tools and no idea how to use them. This is where data modelling comes in, and if you are building a stack with a data warehouse at the center, it needs to be addressed.
Data visualization & analytics
Once you sort out all of the above, how do you actually use that data to answer business questions or do advanced analytics?
For any company, and particularly for startups, understanding how your users use your products, how much time they spend in your app, and what your signup and activation rates look like is obviously important.
I’ll mostly cover the business intelligence and product analytics side of this, and avoid more advanced ML and data science applications, as these usually come afterward.
Do you need a data warehouse?
Once you hit a certain level of complexity, or need complete control over how you join and mutate data before sending it to other systems, you probably want to move to a model where the data warehouse becomes your source of truth for business data. This gives you the most power: once all your data is there, you can do pretty much anything with it.
For very early stage startups, I recommend that you avoid this initially. Off-the-shelf product analytics tools will provide you with the insights you need without the extra work during the early stages.
In this model, you typically require just two components:
- Data collection (events)
- Visualization / analytics
A concrete example of this would be something like:
Segment > Mixpanel / Amplitude / Heap / Google Analytics
Using a product like Segment gives you the flexibility to move that data to your warehouse in the future, if not immediately.
As this is probably something you’ll need to do at some point, I’d recommend using something that can do this from day one. Moving from an all-in-one tool such as Google Analytics to something custom built around a warehouse can be difficult or expensive (Google will charge you a lot of money to move your raw GA data into BigQuery).
Data collection
Data collection products allow you to track events from various apps and capture user activity, pageviews, clicks, sessions, etc. This is not about collecting data from e.g. your production DB (see data integration).
Segment
The market leader, with some good out-of-the-box data integration solutions.
Segment is great, and we use it ourselves. Segment is more than just data collection. Their somewhat open-source analytics.js provides unified APIs for tracking events in pretty much any system.
Events from Segment can be sent to your warehouse, but can also be sent straight to other systems, for example Google/Twitter/Facebook/Quora Ads and most major CRMs.
Segment can get really expensive really quickly, particularly for B2C companies with high numbers of monthly tracked users. However, as their core APIs are open-source, it’s possible to migrate to your own infrastructure.
It’s definitely worth mentioning RudderStack here, an open-source, host-it-yourself alternative to Segment that uses the same open-source APIs. When your Segment costs start to exceed an engineering salary, it is probably time to consider such alternatives.
Snowplow
Event data collection done right, and it’s open-source.
Snowplow excels at event data collection, period. It lacks the out-of-the-box data integration solutions of Segment, but arguably has a much richer feature set when it comes to actual event tracking, such as validation of event schemas.
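To show what validating events against a schema means in practice, here is a small, generic Python sketch. It does not use Snowplow’s actual schema format or API; `PAGE_VIEW_SCHEMA` and `validate_event` are hypothetical names made up for illustration.

```python
# Hypothetical schema: field name -> (expected type, required?)
PAGE_VIEW_SCHEMA = {
    "user_id": (str, True),
    "url": (str, True),
    "duration_ms": (int, False),
}


def validate_event(event, schema):
    """Return a list of problems; an empty list means the event is valid."""
    problems = []
    for name, (expected_type, required) in schema.items():
        if name not in event:
            if required:
                problems.append(f"missing required field: {name}")
            continue
        if not isinstance(event[name], expected_type):
            problems.append(f"{name} should be {expected_type.__name__}")
    # Fields the schema does not know about are reported too.
    for name in event:
        if name not in schema:
            problems.append(f"unexpected field: {name}")
    return problems


good = {"user_id": "u1", "url": "/pricing", "duration_ms": 1200}
bad = {"url": 42, "referer": "blog"}

print(validate_event(good, PAGE_VIEW_SCHEMA))
print(validate_event(bad, PAGE_VIEW_SCHEMA))
```

Rejecting or quarantining malformed events at collection time keeps bad rows out of the warehouse, which is cheaper than cleaning them up later.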
It’s open-source, so you can run and manage it yourself if you want to.
Snowplow won’t help you push data to your CRM or other SaaS products, but there are other options for this, which we discuss below.
All-in-one solutions
Segment and Snowplow are primarily designed for data collection. There are other tools that will help you collect data too, but they are part of what I’ll call “all-in-one” data analytics packages. I won’t cover these here, but I mention them below as part of data visualization and analytics.
Data integration to the warehouse
Data integration products allow you to move data from one place to another. This section covers products that help you move data to your warehouse.
Some of these products also act as data transformation tools (traditional ETL). We don’t recommend this approach, preferring a more software engineering (SQL / code based) approach that will scale as you grow your data team (see data modelling). As a result, our recommendations are for data integration products that are designed with an ELT model in mind.
These products move data from other data sources such as CRMs, Stripe, and most popular databases (Mongo, MySQL, Postgres, etc.) into your warehouse, so you have everything in one place and can join it all together and perform advanced analytics queries on the results.
Stitch
The best option for early startups, teams who want to write their own integrations, or those who value open-source.
Stitch was recently acquired by Talend, and I’ve seen little impact from this so far, for better or worse. We use Stitch ourselves, and it serves the majority of our integration needs very well.
Simple self-service onboarding and reasonable usage-based pricing. The core is open-source: you can write your own Singer taps and run them yourself.
Fivetran
The best option for those who are willing to pay a little more, or who have certain data sources.
Historically Fivetran had a more enterprise sales model. It recently changed to variable, volume-based pricing and is moving toward a self-service model, though it will still build adapters for you if you pay for them. My understanding is that Fivetran still nets out as a bit more expensive than Stitch, but they certainly build extremely high-quality integrations.
Fivetran makes considerable efforts to normalize data coming from source systems into a friendlier format, whereas Stitch integrations are arguably a bit less intelligent.
Data integration to SaaS products
There is another aspect of data integration: how do you get data to your SaaS services rather than from them? For example, showing recent activity or orders in your CRM to help your sales or support teams.
As mentioned above, Segment (and RudderStack) support this. Segment also recently added support for transforming data before sending it, but this is only available to some customer tiers and is somewhat limited in what it can do.
Typically I’ve seen the majority of our customers (and ourselves) build custom solutions here, using products like Zapier, AWS Lambda, Google Cloud Functions, or PubSub-like setups where data can be transformed and sent elsewhere, either from sources directly or via the warehouse.
We’ve written about how we do this ourselves in a blog post on sending data from BigQuery to Intercom. This general approach could be applied to most destinations, or built in other tools like Zapier too.
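To make the warehouse-to-SaaS path more concrete, here is a minimal Python sketch of the general shape such a job might take. It is not the setup described in the blog post: `sqlite3` stands in for the warehouse, the `user_activity` table and the payload fields are hypothetical, and the actual HTTP call to the destination is deliberately left out.

```python
import json
import sqlite3

# sqlite3 stands in for the warehouse here; table and column names are hypothetical.
warehouse = sqlite3.connect(":memory:")
warehouse.executescript("""
    CREATE TABLE user_activity (email TEXT, last_seen TEXT, orders_30d INTEGER);
    INSERT INTO user_activity VALUES
        ('ana@example.com', '2020-03-01', 3),
        ('bo@example.com',  '2020-02-11', 0);
""")


def build_payloads(conn):
    """Turn modelled warehouse rows into JSON bodies for a CRM-style API."""
    rows = conn.execute(
        "SELECT email, last_seen, orders_30d FROM user_activity ORDER BY email"
    )
    return [
        json.dumps(
            {
                "email": email,
                "custom_attributes": {"last_seen": last_seen, "orders_30d": orders},
            },
            sort_keys=True,
        )
        for email, last_seen, orders in rows
    ]


for body in build_payloads(warehouse):
    # A real job would POST each body to the destination's API (omitted here).
    print(body)
```

Keeping the query and the payload mapping separate from the delivery step makes it easier to move the same logic into a cloud function or a scheduled job later.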
Census
The only dedicated warehouse-to-SaaS integration tool (that I know of) on the market.
While we haven’t used it ourselves, Census has a dedicated solution for this, which fits in well with the rest of the ELT architecture proposed in this post. It’s definitely worth checking out if you are thinking of transitioning from what e.g. Segment provides out of the box to something more custom.
Data warehouses
There’s a lot to discuss about the options here, but there are really just 2 market leaders right now that I would recommend. I’ve worked directly with all of these options myself, as well as speaking to many customers who use them, and so I offer slightly stronger opinions on which ones I think are best.
When it comes to pricing for warehouses, it’s worth noting that what seems to matter in practice is the cost of compute, not storage. Storage is generally cheap, and is not the first thing to start hurting.
BigQuery
The best option for early stage startups, or for enterprises that are willing to adopt Google Cloud, are OK with a more self-service experience, and don’t have custom requirements around security or e.g. running on premise.
Pay-as-you-go pricing. For startups it will likely be a while before you incur any cost on BigQuery, thanks to a free tier that allows you to process up to 1TB/month at no cost.
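As a back-of-the-envelope sketch of how a free allowance affects an on-demand bill, consider the function below. The 1TB allowance comes from the text above; the price per TB is an assumed placeholder rather than a quoted price, so check the current price list before relying on it.

```python
def on_demand_cost(tb_processed, free_tb=1.0, price_per_tb=5.0):
    """Monthly query cost in dollars after the free allowance.

    free_tb is the free tier mentioned in the text; price_per_tb is an
    assumed placeholder value, not an official price.
    """
    billable_tb = max(0.0, tb_processed - free_tb)
    return billable_tb * price_per_tb


for tb in (0.2, 1.0, 4.0):
    print(f"{tb} TB processed -> ${on_demand_cost(tb):.2f}")
```

Under these assumptions, a team processing a few hundred GB a month pays nothing, and costs only start to appear once usage passes the allowance.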
In my opinion, having written a lot of SQL for all the warehouses here, BigQuery has the best SQL experience. BigQuery’s standard SQL is elegant and powerful, and improvements are rolled out continuously.
Unprecedented, on-demand scale. BigQuery scales extremely well and, with products such as BI Engine, can provide blazing fast query performance.
Snowflake
The best choice for enterprises with custom requirements such as SSO or on-premise, or who are tied in to AWS/Azure. A good option for startups thanks to a partially pay-as-you-go, on-demand pricing model.
Snowflake separates storage and compute, like BigQuery, meaning it has the capacity for unbounded enterprise scale, unlike e.g. Redshift.
The pricing model is hybrid pay-as-you-go. You pay for resources per minute, but clusters can be automatically turned down when inactive.
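To see why turning clusters down when idle matters under per-minute billing, here is a small arithmetic sketch in Python. The rate and the usage pattern are made-up numbers for illustration only and do not reflect real pricing.

```python
def compute_cost_cents(active_minutes, rate_cents_per_minute):
    """Cost of compute that is billed per minute while the cluster is running."""
    return active_minutes * rate_cents_per_minute


RATE_CENTS = 5                   # hypothetical price: 5 cents per minute
MINUTES_IN_MONTH = 30 * 24 * 60  # a cluster that never shuts down
ACTIVE_MINUTES = 22 * 3 * 60     # assumed usage: 3 busy hours on 22 working days

always_on = compute_cost_cents(MINUTES_IN_MONTH, RATE_CENTS)
suspended_when_idle = compute_cost_cents(ACTIVE_MINUTES, RATE_CENTS)

print(f"always on: ${always_on / 100:.2f}")
print(f"suspended when idle: ${suspended_when_idle / 100:.2f}")
```

With these assumed numbers the always-on cluster costs roughly eleven times more, which is why compute, not storage, tends to dominate the bill.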
Supports structured JSON data (like BigQuery), whereas e.g. Redshift does not. Generally a nice SQL experience, with a couple of quirks to get used to.
Redshift
The choice if you are heavily invested in AWS and don’t want to add something new to your stack.
I’ll be frank here: I would not personally recommend Redshift unless you have no other choice, but it’s so popular that it needs to be included. Redshift was one of the first “modern” warehouses to reach general availability, but it’s built on a foundation that is, in my opinion, fundamentally limiting. While Amazon is working to correct some of these issues, some fixes still appear to be a long way off.
- Limited support for working with unstructured data, and a limited SQL dialect based on Postgres.
- Requires much more management than Snowflake or BigQuery, which are more hands-off operationally.
- Has scale limits, although some new features coming out address this.
Other mentions
Azure / Synapse SQL data warehouse: similar issues to Redshift. Great if you already know how to work with SQL Server and its variants, but it can get very expensive due to limited on-demand pricing options.
Presto / Athena: powerful, distributed queries, but not a general purpose warehouse, and as a result it can be hard to operationalize. It’s not easy to create new datasets with Athena.
Data modelling
It’s impossible for me to write an impartial comparison of the options here, as this is the product vertical Dataform is in, so I’ll refrain from doing so. However, as this is a fairly new part of the stack that arises from the shift to ELT, I’ll explain a bit more about what it is and why we think it’s important.
Data modelling is what your data team probably spends 50% of their time doing: turning your raw data into reliable, tested, accurate and up-to-date assets that can power your company’s analytics.
When data lands in your warehouse it’s usually a bit of a mess. You will have hundreds of source tables with different schema structures, different data formats and types, different primary keys, and so on. Writing a query to join this all together is tricky, especially if you have to do it every time you want to answer a question.
When data modelling is done right, you should end up with a clear, well-defined set of tables that can be used for analytics and visualization. These tables encapsulate all of your business logic in a clean and well-tested schema that can be consumed elsewhere, whether by visualization tools or by other applications.
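The following Python sketch illustrates the idea on a toy scale, using `sqlite3` as a stand-in for a warehouse. The raw tables, the `customers` view and the uniqueness check are all hypothetical, and this is plain SQL rather than the syntax of Dataform or any other modelling tool.

```python
import sqlite3

# sqlite3 stands in for the warehouse; all table and column names are hypothetical.
db = sqlite3.connect(":memory:")
db.executescript("""
    -- Raw tables as an integration tool might land them: inconsistent keys and formats.
    CREATE TABLE raw_crm_users (Id TEXT, Email TEXT);
    CREATE TABLE raw_billing (customer_email TEXT, amount_cents INTEGER);
    INSERT INTO raw_crm_users VALUES ('1', 'Ana@Example.com'), ('2', 'bo@example.com');
    INSERT INTO raw_billing VALUES ('ana@example.com', 1200), ('ana@example.com', 800);

    -- The model: one clean table-like view that downstream tools can query.
    CREATE VIEW customers AS
    SELECT
        CAST(u.Id AS INTEGER)                    AS customer_id,
        LOWER(u.Email)                           AS email,
        COALESCE(SUM(b.amount_cents), 0) / 100.0 AS revenue_usd
    FROM raw_crm_users u
    LEFT JOIN raw_billing b ON LOWER(b.customer_email) = LOWER(u.Email)
    GROUP BY u.Id, u.Email;
""")

# A data test expressed as a query that should return zero rows.
duplicate_ids = db.execute(
    "SELECT customer_id FROM customers GROUP BY customer_id HAVING COUNT(*) > 1"
).fetchall()

customers = db.execute(
    "SELECT customer_id, email, revenue_usd FROM customers ORDER BY customer_id"
).fetchall()
print(duplicate_ids)
print(customers)
```

The join logic and the clean-up rules live in one place, and the uniqueness check gives you something to run automatically whenever the raw data changes.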
Data visualization and analytics
There are a lot of options here too, and your mileage may vary. I’ve tried to summarize the position each one occupies in the space where I can, and I am generally less able to provide strong opinions here as there are so many.
For our startup customers building a stack around a data warehouse, we probably mostly see: Looker, Metabase, Google Data Studio, Chartio, Tableau.
Despite there being so many options, it’s fairly easy to play around and experiment between them, particularly if you have a good semantic data modelling layer maintained in your warehouse or use tools like Segment for the out-of-the-box solutions.
Most of these products fit into a few different categories:
Chart builders: select some dimensions, choose a chart type, customize the visualizations, and put them in some dashboards (where they will eventually break, go out of date, and never be updated).
Self-service BI solutions: tell these tools how your data is structured and how to interpret it, and they’ll try to make it easy for anyone to quickly answer questions.
Out-of-the-box product analytics: generally these tools run on event data, have an opinionated schema and aren’t easily customized or generic, but they do what they do well and are generally self-service.
Looker
The (expensive) market leader. Best-in-class self-service, fully customizable BI.
Looker users seem to have nothing but praise for the platform. It stands out because of a few things:
- It’s designed to help you deliver a self-service portal to your entire company, enabling anyone to answer business questions.
- While you can build charts in Looker, that’s not what you’re supposed to do. You teach Looker how to understand your data, and it makes answering questions a breeze.
- It adopts engineering best practices such as version control. Your data team can collaborate using Git-based workflows, allowing you to scale to even hundreds of analysts.
Looker requires an investment of both time and money, but what you get out at the end is something few other solutions provide.
Metabase
The open-source self-service BI leader.
Adopts the Looker concept of modelling data to make answering questions easy. It doesn’t adopt a Git-based workflow, however. It’s open source and you’ll need to run it yourself on your own infrastructure. It seems to be a favourite amongst engineers.
If you want a Looker-like experience but the price tag is too much, then this is probably your next best bet.
Like Looker, it makes it easy for non-SQL users to answer questions without relying on the data team.
Data Studio
A powerful and free chart builder.
Primarily a chart builder, but you can kind of make it work for self-service dashboards up to a point. While it’s worth a mention, it is unlikely to serve as your primary BI portal as your team grows, but it is a great place to start if you are already in the Google stack.
Tableau
The chart builder incumbent: extremely powerful and loved by many.
Tableau has limited data modelling capabilities, but is extremely powerful and able to build a huge range of visualizations, dashboards, and so on. For better or for worse, you can do pretty much anything with it.
If you don’t have existing Tableau experience then this probably isn’t the place to start; consider Metabase, Looker, or Chartio instead.
Chartio
Reasonably priced self-service BI and chart building.
Chartio makes it easy to build SQL queries without actually writing SQL. This makes it great for putting data in the hands of your whole team, and while it has some data modelling capabilities, it’s not quite in the same camp as Looker.
You can build complex SQL pipelines (multiple joins, etc.) through a UI, which can be great for those who aren’t as comfortable writing SQL themselves.
Redash
Open-source chart builder, recently acquired by Databricks.
One of the first tools we used ourselves. Redash is open source and works with modern warehouses, but it is primarily a chart builder. While it may serve you well at first, you’ll probably spend a lot of time fixing queries unless you invest heavily in a data modelling tool too.
Heap / Mixpanel / Amplitude / Indicative
Out-of-the-box, event-based customer and product analytics.
I’ve put all these together, as they all really do a similar thing. They each have pros and cons, but ultimately exist to solve the same problem. Send them events, and you’ll be able to quickly answer questions about your users’ behaviour and what actions they took in what order, and with some of these tools you can do experimentation, optimization or even personalization.
Think Google Analytics on steroids, with a focus on user behaviour and engagement.
If you want to start off warehouse-less, these are probably your best options to begin with, but consider tagging with Segment from day one.
Indicative is a little different to the others, in that it’s primarily designed to be pointed at your data warehouse. If you want to transform event data in your warehouse and then analyse it in a UI, this is where Indicative shines, and we hear great things about it, but it’s probably not the place to start if you want an all-in-one, out-of-the-box solution.
Google Analytics
The out-of-the-box, all-in-one incumbent for web and app analytics, and basically free.
Despite having more advanced, custom solutions in place now, we still send data to GA because there are some questions that are just a lot easier to answer with it. Acquisition, for example: data is enriched with geo locations, and UTM tags are automatically grouped and categorized.
The fundamental challenge with GA is that it doesn’t give you access to raw data. As we’ve said above, consider tagging with Segment instead to make sure you retain the raw event data.
The Microsoft / Azure stack
It’s worth a mention that Microsoft has its own entire set of solutions for most of the steps we’ve described above.
We see it less with startups, and it also doesn’t really follow the ELT model, so I haven’t included it in the above. If you’re just starting out and want to get set up quickly, this probably isn’t the place to start.
You’ll probably only want to go down this route if you’re already heavily invested in Azure and Microsoft products as an organization.
As part of their offering, they have the following tools:
Azure Data Factory: highly customizable, ETL-like workflows for data integration.
Azure / Synapse data warehouse: a more scalable, SQL Server-based warehouse with some support for on-demand pricing.
Power BI: a powerful data analytics product with some support for data modelling but no Git-based workflows, and some support for warehouses such as BigQuery and Snowflake.
Conclusion
Hopefully this overview of the product options and the core parts of the stack helps you and your team decide how to set up a data stack that will be able to scale as your data team, and the complexity of the data problems you face, grow.
I believe that regardless of the product options you choose above, following the ELT architecture outlined in this post should ensure that you are able to cope with new requirements as your stack evolves without hitting any major roadblocks.
Originally published at https://dataform.co.
Translated from: https://medium.com/dataform/the-startup-data-stack-starter-pack-2020-47fcb34aeb09