數據治理 主數據 元數據
Data governance is top of mind for many of my customers, particularly in light of GDPR, CCPA, COVID-19, and any number of other acronyms that speak to the increasing importance of data management when it comes to protecting user data.
數據治理是我許多客戶的首要考慮因素,尤其是考慮到GDPR,CCPA,COVID-19以及任何其他首字母縮寫詞,這些首字母縮寫詞表明了數據管理在保護用戶數據方面的重要性日益提高。
Over the past several years, data catalogs have emerged as a powerful tool for data governance, and I couldn’t be happier. As companies digitize and their data operations democratize, it’s important for all elements of the data stack, from warehouses to business intelligence platforms, and now, catalogs, to participate in compliance best practices.
在過去的幾年中, 數據目錄已成為一種強大的數據治理工具 ,我對此感到高興。 隨著公司數字化及其數據運營的民主化,從倉庫到商業智能平臺,再到現在的目錄,數據堆棧的所有元素都必須參與合規性最佳實踐。
But are data catalogs all we need to build a robust data governance program?
但是,構建強大的數據治理程序所需的所有數據目錄都是嗎?
數據目錄用于數據治理? (Data catalogs for data governance?)
Analogous to a physical library catalog, data catalogs serve as an inventory of metadata and give investors the information necessary to evaluate data accessibility, health, and location. Companies like Alation, Collibra, and Informatica tout solutions that not only keep tabs on your data, but also integrate with machine learning and automation to make data more discoverable, collaborative, and now, in compliance with organizational, industry-wide, or even government regulations.
類似于物理圖書館目錄, 數據目錄用作元數據清單,并向投資者提供評估數據可訪問性,健康狀況和位置所需的信息。 像Alation,Collibra和Informatica這樣的公司都在宣傳解決方案,這些解決方案不僅可以保留數據標簽,還可以與機器學習和自動化集成,從而使數據更易于發現,協作,并且現在符合組織,整個行業甚至政府的要求。規定。
Since data catalogs provide a single source of truth about a company’s data sources, it’s very easy to leverage data catalogs to manage the data in your pipelines. Data catalogs can be used to store metadata that gives stakeholders a better understanding of a specific source’s lineage, thereby instilling greater trust in the data itself. Additionally, data catalogs make it easy to keep track of where personally identifiable information (PII) can both be housed and sprawl downstream, as well as who in the organization has the permission to access it across the pipeline.
由于數據目錄提供有關公司數據源的唯一事實來源,因此利用數據目錄來管理管道中的數據非常容易。 數據目錄可用于存儲元數據,從而使利益相關者更好地了解特定來源的血統,從而在數據本身上建立起更大的信任。 此外,數據目錄使跟蹤個人身份信息(PII)可以存放和向下游蔓延的位置以及組織中的誰有權通過管道訪問變得容易。
什么適合我的組織? (What’s right for my organization?)
So, what type of data catalog makes the most sense for your organization? To make your life a little easier, I spoke with data teams in the field to learn about their data catalog solutions, breaking them down into three distinct categories: in-house, third-party, and open source.
那么,哪種類型的數據目錄最適合您的組織? 為了使您的生活更輕松,我與該領域的數據團隊進行了交談,以了解他們的數據目錄解決方案,并將它們分為三個不同的類別:內部,第三方和開源。
內部的 (In-house)
Some B2C companies — I’m talking the Airbnbs, Netflixs, and Ubers of the world — build their own data catalogs to ensure data compliance with state, country, and even economic union (I’m looking at you GDPR) level regulations. The biggest perk of in-house solutions is the ability to quickly spin up customizable dashboards, pulling out fields your team needs the most.
一些B2C公司(我正在談論全球的Airbnbs , Netflix和Uber)建立自己的數據目錄,以確保數據符合州,國家或經濟聯盟(我在看您的GDPR)級法規。 內部解決方案最大的好處是能夠快速啟動可定制的儀表板,從而拉出團隊最需要的領域。

While in-house tools make for quick customization, over time, such hacks can lead to a lack of visibility and collaboration, particularly when it comes to understanding data lineage. In fact, one data leader I spoke with at a food delivery startup noted that what was clearly missing from her in-house data catalog was a “single pane of glass.” If she had a single source of truth that could provide insight into how her team’s tables were being leveraged by other parts of the business, ensuring compliance would be easy.
盡管內部工具可以快速進行自定義,但隨著時間的流逝,此類黑客行為可能導致缺乏可見性和協作性,尤其是在了解數據沿襲時。 實際上,我在一家食品配送初創公司與之交談的一位數據負責人指出,她內部數據目錄中顯然缺少的是“一塊玻璃”。 如果她有一個真實的來源,可以洞察業務的其他部門如何利用她的團隊的表,那么確保合規將很容易。
On top of these tactical considerations, spending engineering time and resources building a multi-million dollar data catalog just doesn’t make sense for the vast majority of companies.
除了這些戰術上的考慮之外,花費大量的工程時間和資源來建立數百萬美元的數據目錄對于絕大多數公司來說都是沒有意義的。
第三方 (Third-party)
Since their founding in 2012, Alation has largely paved the way for the rise of the automated data catalog. Now, there are a whole host of ML-powered data catalogs on the market, including Collibra, Informatica, and others, many with pay-for-play workflow and repository-oriented compliance management integrations. Some cloud providers, like Google, AWS, and Azure, also offer data governance tooling integration at an additional cost.
自2012年成立以來, Alation在很大程度上為自動化數據目錄的興起鋪平了道路。 現在,市場上有大量基于ML的數據目錄,包括Collibra , Informatica等,其中許多具有按需付費工作流程和面向存儲庫的合規性管理集成。 一些云提供商,例如Google,AWS和Azure,還提供了額外的數據治理工具集成。
In my conversations with data leaders, one downside of these solutions came up time and again: usability. While nearly all of these tools have strong collaboration features, one Data Engineering VP I spoke with specifically called out his third-party catalog’s unintuitive UI.
在與數據負責人的對話中,這些解決方案的一個缺點一次又一次出現:可用性。 盡管幾乎所有這些工具都具有強大的協作功能,但與我交談的一位數據工程副總裁特別提到了他的第三方目錄的直觀用戶界面。
If data tools aren’t easy to use, how can we expect users to understand or even care whether they’re compliant?
如果數據工具不容易使用,我們如何期望用戶理解甚至關心他們是否合規?
開源的 (Open source)
In 2017, Lyft became an industry leader by open sourcing their data discovery and metadata engine, Amundsen, named after the famed Antarctic explorer. Other open source tools, such as Apache Atlas, Magda and CKAN, provide similar functionalities, and all three make it easy for development-savvy teams to fork an instance of the software and get started.
2017年,Lyft通過開源其數據發現和元數據引擎Amundsen成為行業領導者, Amundsen以著名的南極探險家的名字命名。 其他開放源代碼工具(例如Apache Atlas , Magda和CKAN )提供了相似的功能,而這三者使精通開發的團隊可以輕松地派生該軟件的實例并開始使用。

While some of these tools allow teams to tag metadata within to control user access, this is an intensive and often manual process that most teams just don’t have the time to tackle. In fact, a product manager at a leading transportation company shared that his team specifically chose not to use an open source data catalog because they didn’t have off-the-shelf support for all the data sources and data management tooling in their stack, making data governance extra challenging. In short, open source solutions just weren’t comprehensive enough.
盡管其中一些工具允許團隊在其中標記元數據來控制用戶訪問,但這是一個密集且通常是手動的過程,大多數團隊只是沒有時間解決。 實際上,一家領先的運輸公司的產品經理分享說,他的團隊特別選擇不使用開源數據目錄,因為他們沒有對堆棧中所有數據源和數據管理工具的現成支持,使數據治理更具挑戰性。 簡而言之,開源解決方案還不夠全面。
Still, there’s something critical to compliance that even the most advanced catalog can’t account for: data downtime.
盡管如此,即使對于最高級的目錄,也無法解決合規性方面的關鍵問題: 數據停機 。
缺少的鏈接:數據停機 (The missing link: data downtime)
Recently, I developed a simple metric for a customer that helps measure data downtime, in other words, periods of time when your data is partial, erroneous, missing, or otherwise inaccurate. When applied to data governance, data downtime gives you a holistic picture of your organization’s data reliability. Without data reliability to power full discoverability, it’s impossible to know whether or not your data is fully compliant and usable.
最近,我為客戶開發了一個簡單的指標 ,該指標可以幫助您衡量數據停機時間 ,換句話說,就是您的數據不完整,錯誤,丟失或不準確時的時間段。 當應用于數據治理時,數據停機時間可以使您全面了解組織的數據可靠性。 沒有數據可靠性來增強完全可發現性,就無法知道您的數據是否完全合規和可用。
Data catalogs solve some, but not all, of your data governance problems. To start, mitigating governance gaps is a monumental undertaking, and it’s impossible to prioritize these without a full understanding of which data assets are actually being accessed by your company. Data reliability fills this gap and allows you to unlock your data ecosystem’s full potential.
數據目錄解決了部分但不是全部的數據治理問題。 首先,減輕治理差距是一項艱巨的任務,如果無法完全了解貴公司實際上正在訪問哪些數據資產,就不可能對這些差距進行優先排序。 數據可靠性填補了這一空白,并允許您釋放數據生態系統的全部潛力。
Additionally, without real-time lineage, it’s impossible to know how PII or other regulated data sprawls. Think about it for a second: even if you’re using the fanciest data catalog on the market, your governance is only as good as your knowledge about where that data goes. If your pipelines aren’t reliable, neither is your data catalog.
此外,如果沒有實時沿襲,就不可能知道PII或其他受監管的數據是如何蔓延的。 仔細考慮一下:即使您使用的是市場上最高級的數據目錄,您的治理也僅取決于您對數據去向的了解。 如果管道不可靠,那么數據目錄也不可靠。
Owing to their complementary features, data catalogs and data reliability solutions work hand-in-hand to provide an engineering approach to data governance, no matter the acronyms you need to meet.
由于具有互補功能,因此數據目錄和數據可靠性解決方案可以協同工作,從而為數據治理提供一種工程方法,無論您需要使用首字母縮寫詞如何。
Personally, I’m excited for what the next wave of data catalogs have in store. And trust me: it’s more than just data.
就個人而言,我對下一波數據目錄的存儲感到興奮。 相信我:這不僅僅是數據。
If you want to learn more, reach out to Barr Moses.
如果您想了解更多信息,請聯系 Barr Moses 。
翻譯自: https://towardsdatascience.com/what-we-got-wrong-about-data-governance-365555993048
數據治理 主數據 元數據
本文來自互聯網用戶投稿,該文觀點僅代表作者本人,不代表本站立場。本站僅提供信息存儲空間服務,不擁有所有權,不承擔相關法律責任。 如若轉載,請注明出處:http://www.pswp.cn/news/388796.shtml 繁體地址,請注明出處:http://hk.pswp.cn/news/388796.shtml 英文地址,請注明出處:http://en.pswp.cn/news/388796.shtml
如若內容造成侵權/違法違規/事實不符,請聯系多彩編程網進行投訴反饋email:809451989@qq.com,一經查實,立即刪除!