數據治理主數據元數據

Data governance is top of mind for many of my customers, particularly in light of GDPR, CCPA, COVID-19, and any number of other acronyms that speak to the increasing importance of data management when it comes to protecting user data.

數據治理是我許多客戶的首要考慮因素，尤其是考慮到GDPR，CCPA，COVID-19以及任何其他首字母縮寫詞，這些首字母縮寫詞表明了數據管理在保護用戶數據方面的重要性日益提高。

Over the past several years, data catalogs have emerged as a powerful tool for data governance, and I couldn’t be happier. As companies digitize and their data operations democratize, it’s important for all elements of the data stack, from warehouses to business intelligence platforms, and now, catalogs, to participate in compliance best practices.

在過去的幾年中，數據目錄已成為一種強大的數據治理工具，我對此感到高興。隨著公司數字化及其數據運營的民主化，從倉庫到商業智能平臺，再到現在的目錄，數據堆棧的所有元素都必須參與合規性最佳實踐。

But are data catalogs all we need to build a robust data governance program?
但是，構建強大的數據治理程序所需的所有數據目錄都是嗎？

數據目錄用于數據治理？ (Data catalogs for data governance?)

Analogous to a physical library catalog, data catalogs serve as an inventory of metadata and give investors the information necessary to evaluate data accessibility, health, and location. Companies like Alation, Collibra, and Informatica tout solutions that not only keep tabs on your data, but also integrate with machine learning and automation to make data more discoverable, collaborative, and now, in compliance with organizational, industry-wide, or even government regulations.

類似于物理圖書館目錄，數據目錄用作元數據清單，并向投資者提供評估數據可訪問性，健康狀況和位置所需的信息。像Alation，Collibra和Informatica這樣的公司都在宣傳解決方案，這些解決方案不僅可以保留數據標簽，還可以與機器學習和自動化集成，從而使數據更易于發現，協作，并且現在符合組織，整個行業甚至政府的要求。規定。

Since data catalogs provide a single source of truth about a company’s data sources, it’s very easy to leverage data catalogs to manage the data in your pipelines. Data catalogs can be used to store metadata that gives stakeholders a better understanding of a specific source’s lineage, thereby instilling greater trust in the data itself. Additionally, data catalogs make it easy to keep track of where personally identifiable information (PII) can both be housed and sprawl downstream, as well as who in the organization has the permission to access it across the pipeline.

由于數據目錄提供有關公司數據源的唯一事實來源，因此利用數據目錄來管理管道中的數據非常容易。數據目錄可用于存儲元數據，從而使利益相關者更好地了解特定來源的血統，從而在數據本身上建立起更大的信任。此外，數據目錄使跟蹤個人身份信息(PII)可以存放和向下游蔓延的位置以及組織中的誰有權通過管道訪問變得容易。

什么適合我的組織？ (What’s right for my organization?)

So, what type of data catalog makes the most sense for your organization? To make your life a little easier, I spoke with data teams in the field to learn about their data catalog solutions, breaking them down into three distinct categories: in-house, third-party, and open source.

那么，哪種類型的數據目錄最適合您的組織？為了使您的生活更輕松，我與該領域的數據團隊進行了交談，以了解他們的數據目錄解決方案，并將它們分為三個不同的類別：內部，第三方和開源。

內部的 (In-house)

Some B2C companies — I’m talking the Airbnbs, Netflixs, and Ubers of the world — build their own data catalogs to ensure data compliance with state, country, and even economic union (I’m looking at you GDPR) level regulations. The biggest perk of in-house solutions is the ability to quickly spin up customizable dashboards, pulling out fields your team needs the most.

一些B2C公司(我正在談論全球的Airbnbs ， Netflix和Uber)建立自己的數據目錄，以確保數據符合州，國家或經濟聯盟(我在看您的GDPR)級法規。內部解決方案最大的好處是能夠快速啟動可定制的儀表板，從而拉出團隊最需要的領域。

Image for post — **Uber’s Databook lets data scientists easily search for tables.** **Uber的數據手冊可讓數據科學家輕松搜索表格。** *Image courtesy of* *圖片由* *Uber EngineeringUber Engineering提供*.。

While in-house tools make for quick customization, over time, such hacks can lead to a lack of visibility and collaboration, particularly when it comes to understanding data lineage. In fact, one data leader I spoke with at a food delivery startup noted that what was clearly missing from her in-house data catalog was a “single pane of glass.” If she had a single source of truth that could provide insight into how her team’s tables were being leveraged by other parts of the business, ensuring compliance would be easy.

盡管內部工具可以快速進行自定義，但隨著時間的流逝，此類黑客行為可能導致缺乏可見性和協作性，尤其是在了解數據沿襲時。實際上，我在一家食品配送初創公司與之交談的一位數據負責人指出，她內部數據目錄中顯然缺少的是“一塊玻璃”。如果她有一個真實的來源，可以洞察業務的其他部門如何利用她的團隊的表，那么確保合規將很容易。

On top of these tactical considerations, spending engineering time and resources building a multi-million dollar data catalog just doesn’t make sense for the vast majority of companies.

除了這些戰術上的考慮之外，花費大量的工程時間和資源來建立數百萬美元的數據目錄對于絕大多數公司來說都是沒有意義的。

第三方 (Third-party)

Since their founding in 2012, Alation has largely paved the way for the rise of the automated data catalog. Now, there are a whole host of ML-powered data catalogs on the market, including Collibra, Informatica, and others, many with pay-for-play workflow and repository-oriented compliance management integrations. Some cloud providers, like Google, AWS, and Azure, also offer data governance tooling integration at an additional cost.

自2012年成立以來， Alation在很大程度上為自動化數據目錄的興起鋪平了道路。現在，市場上有大量基于ML的數據目錄，包括Collibra ， Informatica等，其中許多具有按需付費工作流程和面向存儲庫的合規性管理集成。一些云提供商，例如Google，AWS和Azure，還提供了額外的數據治理工具集成。

In my conversations with data leaders, one downside of these solutions came up time and again: usability. While nearly all of these tools have strong collaboration features, one Data Engineering VP I spoke with specifically called out his third-party catalog’s unintuitive UI.

在與數據負責人的對話中，這些解決方案的一個缺點一次又一次出現：可用性。盡管幾乎所有這些工具都具有強大的協作功能，但與我交談的一位數據工程副總裁特別提到了他的第三方目錄的直觀用戶界面。

If data tools aren’t easy to use, how can we expect users to understand or even care whether they’re compliant?
如果數據工具不容易使用，我們如何期望用戶理解甚至關心他們是否合規？

開源的 (Open source)

In 2017, Lyft became an industry leader by open sourcing their data discovery and metadata engine, Amundsen, named after the famed Antarctic explorer. Other open source tools, such as Apache Atlas, Magda and CKAN, provide similar functionalities, and all three make it easy for development-savvy teams to fork an instance of the software and get started.

2017年，Lyft通過開源其數據發現和元數據引擎Amundsen成為行業領導者， Amundsen以著名的南極探險家的名字命名。其他開放源代碼工具(例如Apache Atlas ， Magda和CKAN )提供了相似的功能，而這三者使精通開發的團隊可以輕松地派生該軟件的實例并開始使用。

While some of these tools allow teams to tag metadata within to control user access, this is an intensive and often manual process that most teams just don’t have the time to tackle. In fact, a product manager at a leading transportation company shared that his team specifically chose not to use an open source data catalog because they didn’t have off-the-shelf support for all the data sources and data management tooling in their stack, making data governance extra challenging. In short, open source solutions just weren’t comprehensive enough.

盡管其中一些工具允許團隊在其中標記元數據來控制用戶訪問，但這是一個密集且通常是手動的過程，大多數團隊只是沒有時間解決。實際上，一家領先的運輸公司的產品經理分享說，他的團隊特別選擇不使用開源數據目錄，因為他們沒有對堆棧中所有數據源和數據管理工具的現成支持，使數據治理更具挑戰性。簡而言之，開源解決方案還不夠全面。

Still, there’s something critical to compliance that even the most advanced catalog can’t account for: data downtime.
盡管如此，即使對于最高級的目錄，也無法解決合規性方面的關鍵問題：數據停機。

缺少的鏈接：數據停機 (The missing link: data downtime)

Recently, I developed a simple metric for a customer that helps measure data downtime, in other words, periods of time when your data is partial, erroneous, missing, or otherwise inaccurate. When applied to data governance, data downtime gives you a holistic picture of your organization’s data reliability. Without data reliability to power full discoverability, it’s impossible to know whether or not your data is fully compliant and usable.

最近，我為客戶開發了一個簡單的指標 ，該指標可以幫助您衡量數據停機時間 ，換句話說，就是您的數據不完整，錯誤，丟失或不準確時的時間段。當應用于數據治理時，數據停機時間可以使您全面了解組織的數據可靠性。沒有數據可靠性來增強完全可發現性，就無法知道您的數據是否完全合規和可用。

Data catalogs solve some, but not all, of your data governance problems. To start, mitigating governance gaps is a monumental undertaking, and it’s impossible to prioritize these without a full understanding of which data assets are actually being accessed by your company. Data reliability fills this gap and allows you to unlock your data ecosystem’s full potential.

數據目錄解決了部分但不是全部的數據治理問題。首先，減輕治理差距是一項艱巨的任務，如果無法完全了解貴公司實際上正在訪問哪些數據資產，就不可能對這些差距進行優先排序。數據可靠性填補了這一空白，并允許您釋放數據生態系統的全部潛力。

Additionally, without real-time lineage, it’s impossible to know how PII or other regulated data sprawls. Think about it for a second: even if you’re using the fanciest data catalog on the market, your governance is only as good as your knowledge about where that data goes. If your pipelines aren’t reliable, neither is your data catalog.

此外，如果沒有實時沿襲，就不可能知道PII或其他受監管的數據是如何蔓延的。仔細考慮一下：即使您使用的是市場上最高級的數據目錄，您的治理也僅取決于您對數據去向的了解。如果管道不可靠，那么數據目錄也不可靠。

Owing to their complementary features, data catalogs and data reliability solutions work hand-in-hand to provide an engineering approach to data governance, no matter the acronyms you need to meet.
由于具有互補功能，因此數據目錄和數據可靠性解決方案可以協同工作，從而為數據治理提供一種工程方法，無論您需要使用首字母縮寫詞如何。

Personally, I’m excited for what the next wave of data catalogs have in store. And trust me: it’s more than just data.

就個人而言，我對下一波數據目錄的存儲感到興奮。相信我：這不僅僅是數據。

If you want to learn more, reach out to Barr Moses.

如果您想了解更多信息，請聯系 Barr Moses 。

翻譯自: https://towardsdatascience.com/what-we-got-wrong-about-data-governance-365555993048

數據治理主數據元數據

本文來自互聯網用戶投稿，該文觀點僅代表作者本人，不代表本站立場。本站僅提供信息存儲空間服務，不擁有所有權，不承擔相關法律責任。
如若轉載，請注明出處：http://www.pswp.cn/news/388796.shtml
繁體地址，請注明出處：http://hk.pswp.cn/news/388796.shtml
英文地址，請注明出處：http://en.pswp.cn/news/388796.shtml

如若內容造成侵權/違法違規/事實不符，請聯系多彩編程網進行投訴反饋email:809451989@qq.com，一經查實，立即刪除！

mysql 選擇前4個_mysql從4個表中選擇

不要認為GROUP BY是必需的 . 雖然如果一個孩子有2個父記錄，你可能想用它來將2個父母分組到一行 - 但不確定這是否是你的要求 . 因為如果一個孩子有2個父母，那么將為該孩子返回的父母是未定義的 .假設所有孩子都有父母，所有父母都會有姓&#…

提高機器學習質量的想法_如何提高機器學習的數據質量？

提高機器學習質量的想法The ultimate goal of every data scientist or Machine Learning evangelist is to create a better model with higher predictive accuracy. However, in the pursuit of fine-tuning hyperparameters or improving modeling algorithms, data might …

mysql 集群實踐_MySQL Cluster集群探索與實踐

MySQL集群是一種在無共享架構(SNA，Share Nothing Architecture)系統里應用內存數據庫集群的技術。這種無共享的架構可以使得系統使用低廉的硬件獲取高的可擴展性。MySQL集群是一種分布式設計，目標是要達到沒有任何單點故障點。因此，任何組成部…

Python基礎：搭建開發環境（1）

1.Python語言簡介 2.Python環境 Python環境產品存在多個。 2.1 CPython CPython是Python官方提供的。一般情況下提到的Python就是指CPython，CPython是基于C語言編寫的。 CPython實現的解釋器將源代碼編譯為字節碼（ByteCode），再由虛…

python數據結構之隊列（一）

隊列概念隊列（queue）是只允許在一端進行插入操作，而在另一端進行刪除操作的線性表。隊列是一種先進先出的（First In First Out）的線性表，簡稱FIFO。允許插入的一端為隊尾，允許刪除的一端為隊頭。…

Android實現圖片放大縮小

Android實現圖片放大縮小 package com.min.Test_Gallery; import Android.app.Activity; import android.graphics.Bitmap; import android.graphics.BitmapFactory; import android.graphics.Color; import android.graphics.Matrix; import android.os.Bun…

matlab散點圖折線圖_什么是散點圖以及何時使用

matlab散點圖折線圖When you were learning algebra back in high school, you might not have realized that one day you would need to create a scatter plot to demonstrate real-world results.當您在高中學習代數時，您可能沒有意識到有一天需要創建一個散點圖…

java判斷題_【Java判斷題】請大神們進來看下、這些判斷題你都知道多少~

該樓層疑似違規已被系統折疊隱藏此樓查看此樓、判斷改錯題(每題2分，共20分)(正確的打√，錯誤的打并說明原因)1、 Java系統包提供了很多預定義類,我們可以直接引用它們而不必從頭開始編寫程序。 (　)2、程序可以用字符‘*’替代一個TextField中的每個字…

PoPo數據可視化第8期

PoPo數據可視化聚焦于Web數據可視化與可視化交互領域，發現可視化領域有意思的內容。不想錯過可視化領域的精彩內容, 就快快關注我們吧 :) 微信訂閱號：popodv_com谷歌決定關閉云可視化服務 Fusion Tables谷歌宣布即將關閉其 Fusion Tables 云服務&#x…

AC自動機題單

AC自動機題目真的超級感謝xzy 真的幫到我很多題單 [X] [luogu3808]【模板】AC自動機（簡單版） https://www.luogu.org/problemnew/show/P3808[X] [luogu3796]【模板】AC自動機（加強版）https://www.luogu.org/problemnew/show/P37…

java list用法_Java List 用法詳解及實例分析

Java List 用法詳解及實例分析Java中可變數組的原理就是不斷的創建新的數組，將原數組加到新的數組中,下文對Java List用法做了詳解。List:元素是有序的(怎么存的就怎么取出來，順序不會亂)，元素可以重復(角標1上有個3，角標2上也可以…

python字符串和List：索引值以 0 為開始值，-1 為從末尾的開始位置；值和位置的區別哦...

String（字符串）Python中的字符串用單引號或雙引號 " 括起來，同時使用反斜杠 \ 轉義特殊字符。字符串的截取的語法格式如下： 變量[頭下標:尾下標]索引值以 0 為開始值，-1 為從末尾的開始位置。[一個是值&#x…

邏輯回歸 python_深入研究Python的邏輯回歸

邏輯回歸 pythonClassification techniques are an essential part of machine learning and data science applications. Approximately 70% of problems in machine learning are classification problems. There are lots of classification problems that are available, b…