Want to master your data? Here’s why you should care about metadata
Whether you are a data scientist, a data engineer, or simply someone enthusiastic about data, understanding your data is one thing you don’t want to overlook. We usually regard data as numbers, text, or images, but data is more than that.
We should consider data as an independent entity. Data can introduce itself, tell stories, and reveal trends. To reach those outcomes, you must understand your data first: not only how it was formed or where it came from, but also how it will change over time and how usable it is. Some of this information is what we call metadata.
Why is metadata so important? And why must we master metadata before we master data? Today I’ll show you how we can leverage metadata in our data business.
What is metadata, exactly?
According to Wikipedia, metadata is “data that provides information about other data”. It’s “data about data”. That sounds straightforward, doesn’t it? All data contains information about a specific thing. For metadata, that specific thing is other data.
However, the definition of metadata itself varies. It can be the name of the dataset, creation information, or the statistical distribution of data points. It can be anything related to the properties of the data. In that sense, all data should come with its metadata, but in practice that is not always the case.
Data without metadata is always incomplete.

We use data in the hope of extracting useful insights and of understanding what it describes. Metadata helps us assert data integrity, verify the source of truth, and maintain stable data quality.

However, in some cases, data users ignore the effect of metadata. They view it as mere labels that bring limited value to the table. We’ll see next how metadata relates to another critical aspect of data: data quality.
Data quality
Again, Wikipedia says: “Data quality refers to the state of qualitative or quantitative pieces of information.” In general, data is said to have high quality when “it fits the intended use case regardless of data users”.
Data is a valuable source of information, but nobody wants to use a piece of crap. The more you want to extract from data, the more data quality matters. In the world of Big Data, this also becomes a bottleneck.

As data grows bigger, so does metadata. We are not used to handling large amounts of metadata. It needs a special kind of treatment, because it is data and, at the same time, not quite data. Metadata is not an independent piece of information but rather an attachment to our data. We can extend it to become an assessment of data quality.
In a common effort to cultivate high data quality in Big Data pipelines, tech companies are paying a lot of attention to this newish subject. From anomaly detection to automatic alerting systems, we want to limit the impact of erroneous data as much as possible. We can’t do this without understanding the data or, more precisely, without metadata.
Data quality shows up in many aspects, but most often in the correctness of values. Imagine you plot a histogram of university students’ grades within a semester. The histogram is a statistical representation of those values, and it describes your data. It becomes metadata. What you read from it is the distribution of the grades, and from that you can conclude whether the data fits your use case.
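To make the grades example concrete, here is a minimal sketch of how such a histogram could be computed. The function name `grade_histogram`, the bin width of 10, the 0 to 100 scale, and the sample grades are all assumptions made for illustration; they are not taken from any real dataset.

```python
from collections import Counter


def grade_histogram(grades, bin_width=10, max_grade=100):
    """Count grades per bin; the result describes the data, so it is metadata."""
    counts = Counter()
    for g in grades:
        # Clamp the top grade into the last bin so 100 does not get its own bin.
        start = min(int(g // bin_width) * bin_width, max_grade - bin_width)
        counts[start] += 1
    return {f"{s}-{s + bin_width}": counts[s] for s in sorted(counts)}


grades = [55, 62, 68, 71, 74, 78, 81, 85, 90, 100]  # made-up sample
hist = grade_histogram(grades)

# Print a rough text histogram, one row per bin.
for label, n in hist.items():
    print(f"{label:>7} | {'#' * n}")
```

The dictionary returned here is small compared with the raw grades, yet it is enough to judge whether the distribution looks the way you expect.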

There are many questions to ask about data values beforehand. Are those values stable over time? Are there any outliers? If yes, what should we do with them? By answering these questions, we extract insights about the data itself rather than about the information it carries. We can create metadata, useful metadata. That’s just a first step in asserting data quality via metadata. In the next section, we’ll take a good look at how we can leverage the metadata we generate.
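One possible way to answer the outlier question is the interquartile range rule (Tukey’s rule), sketched below. This is only one technique among many, and the helper name `iqr_outliers`, the multiplier `k=1.5`, and the sample row counts are assumptions for illustration.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Made-up daily row counts of a table; the last day looks suspicious.
daily_row_counts = [1000, 1010, 990, 1005, 995, 1002, 40]
outliers = iqr_outliers(daily_row_counts)
print(outliers)
```

What to do with a flagged value (drop it, correct it, or alert someone) is still a decision for the data user; the metadata only makes the question visible.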
How to leverage metadata
Some people might be overwhelmed by the various statistical representations we can extract from a dataset. Others might simply ignore that additional information, thinking it is useless. It’s true that we don’t need to draw a histogram every time we work with data, but it helps. To leverage insightful metadata, data users must first answer three important questions:
What: What do you want to verify about the quality of your data? Some data requires strict stability, while other data needs to be checked for correctness. For each kind of data, we adapt the information extracted as metadata: statistical distribution, trends over time, discrepancies, etc. This is what we call the metadata strategy. We have limited storage and human resources for working with both data and metadata. Therefore, we must think carefully about where to focus.
How: How do we measure data quality? These actions follow the metadata strategy. We could choose to measure the whole database, some tables, or a specific set of columns. We could measure the total number of values, the maximum and minimum length of a string, or the proportion of missing data. What we decide to measure depends on how we use the data to produce outcomes.
When: Data changes over time. When we extract insights via metadata, we are tracking those changes. So when do we track the metadata? Every day? Every hour? Every quarter? It depends on how much granularity is enough to address data quality. We adapt our measurements to how quickly the data can change. For example, stock market data needs to be tracked every minute or even every second. Weather data changes every hour, while aerospace data can take months or years to shift.
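The measurements listed under “How” can be sketched for a single string column as below, with a timestamp attached so that repeated runs answer the “When” question. The function name `profile_column`, the use of `None` for missing values, and the sample city names are assumptions for illustration only.

```python
from datetime import datetime, timezone


def profile_column(values):
    """Summarize one column of strings; None marks a missing value."""
    present = [v for v in values if v is not None]
    lengths = [len(v) for v in present]
    return {
        # When the measurement was taken, so snapshots can be compared later.
        "measured_at": datetime.now(timezone.utc).isoformat(),
        "total": len(values),
        "missing_ratio": (len(values) - len(present)) / len(values) if values else 0.0,
        "min_length": min(lengths) if lengths else None,
        "max_length": max(lengths) if lengths else None,
    }


# Made-up column with two missing entries.
city = ["Paris", "Hanoi", None, "Rome", "Ho Chi Minh City", None, "Oslo", "Lima"]
profile = profile_column(city)
print(profile)
```

How often you run such a profile (every minute, every day, every quarter) is the part that depends on how quickly your data changes.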

Metadata has a long history, but we have only recently recognized its contribution to data management, and especially to data quality. Metadata itself can’t change the outcomes of data, but it adds a security and management layer between our raw data and its usage. You might even be using metadata to discover your data without realizing it.
Data quality might be insignificant when your data is small, but it becomes critical when working with larger amounts. Metadata helps us keep track of that growth and make sure the data evolves as it should. By failing to leverage metadata, we fail to understand our data.
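As a rough sketch of what “keeping track of that growth” could look like, the check below compares the row count of two consecutive metadata snapshots. The function name `growth_within_bounds` and the 20% tolerance are hypothetical choices; a real threshold depends on how your data is expected to evolve.

```python
def growth_within_bounds(previous_total, current_total, max_change=0.2):
    """True when the relative change in row count stays within max_change."""
    if previous_total == 0:
        # No baseline to compare against: only an unchanged empty table passes.
        return current_total == 0
    change = abs(current_total - previous_total) / previous_total
    return change <= max_change


# Made-up snapshots: yesterday's and today's row counts.
print(growth_within_bounds(1000, 1100))  # 10% growth, within tolerance
print(growth_within_bounds(1000, 1500))  # 50% growth, worth a closer look
```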
What should I do with metadata?
If you wish to master your data, you should start treating metadata systematically. Based on the framework we have seen above, you choose a suitable data strategy for yourself. There’s nothing fancy about it yet. It starts with how you wish to use your data and how you control the quality of its usage. Everything starts with a goal.
There’s a phase that often accompanies the ETL process called Exploratory Data Analysis (EDA). I find it quite interesting because it tells you more about the statistical side of your data. It seems close to what we would like to learn via metadata.
I always see my data scientist and data analyst friends start with EDA before doing anything with their raw data. So I figured it must be an important step, and I wondered how it is linked to my metadata framework. The two turn out to have quite a lot in common.
First comes the purpose. The “exploratory” part of EDA coincides with the discovery objective of metadata. Second, they both look at the statistical side of data to evaluate its future usage. With all that said, EDA is a must-have step because of its similarity to a metadata-based assessment of data quality.
You have the data strategy and the data evaluation; now it’s time to decide how to proceed with all that information. How the data will be used decides whether it is correct and trustworthy in the eyes of data quality control.
Key takeaways:
- Build your data strategy based on data usability
- Apply EDA (Exploratory Data Analysis) to evaluate the data
- Decide whether you have solid confidence in your data
Conclusion
I’ve shared some of my points of view on metadata. For me, it has as much value as the data itself. Those who take advantage of that value are the ones who understand their data. It’s easier to misuse something we don’t comprehend. Metadata gives us a clearer view of the data and, beyond that, of its quality, integrity, and usability.
My name’s Nam Nguyen, and I write (mostly) about Big Data. Enjoy your reading? Follow me on Medium and Twitter for more updates.
Translated from: https://towardsdatascience.com/want-to-master-your-data-heres-why-you-should-care-about-metadata-8fcd7754c3b8