5 Essential Data Vs You Have to Take Care Of

I am pretty sure that on your data journey you have come across courses, videos, articles, or maybe use cases where someone takes some data, builds a classification or regression model, and shows you great results. You learn how that model works and why it works that way and not another, and everything seems fine. You think you just learned something new (and you did), you are happy about it (yes, you are! I am not kidding around here, you're doing great!), and you move on to the next piece of content.

But later on (how long that "later on" takes differs for everyone) you start to ask additional questions. Where did that data come from? If I have more data, will the model run as smoothly as it did during the demonstration? Does real-world data exist in such a format? Can I get similar data, and if I can, will it be as easy to process? What did the results of that model mean? Can I present that data in a prettier way? And so on.

When I started to learn about data analytics, data science, and the world of data in general, I was always amazed by the results people would get after processing some piece of data, running a machine learning model, or getting keys from word buckets. But every time I tried to do something on my own, a new obstacle would appear: the data I wanted to analyze was too much or not enough, a model would run with one piece of data but not with another, and so on.

After running into all these difficulties and learning to deal with them the hard way, I would like to share with you the 5 essential Vs of data that you have to take care of before you start your data project or solution.

1st V: Volume

When we talk about "volume" in regard to data, we have to be aware of the amount of data that has to be handled in the project. Should we use several servers to handle that volume and distribute the load between them? Or is our own computer with its own hard disk enough to solve the problem?
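A quick way to answer that question is to measure the data before designing anything. The sketch below is only an illustration: the function names and the `overhead` factor of 3 are assumptions made here, not numbers from any benchmark, and the real in-memory footprint depends on the file format and the library used to load it.

```python
from pathlib import Path


def total_size_bytes(folder: str, pattern: str = "*") -> int:
    """Sum the on-disk size of all regular files under `folder` matching `pattern`."""
    return sum(p.stat().st_size for p in Path(folder).rglob(pattern) if p.is_file())


def fits_on_one_machine(data_bytes: int, ram_bytes: int, overhead: float = 3.0) -> bool:
    """Rough rule of thumb (an assumption, not a measured fact):
    loaded data needs about `overhead` times its on-disk size in RAM."""
    return data_bytes * overhead <= ram_bytes
```

For instance, `fits_on_one_machine(total_size_bytes("data", "*.csv"), 16 * 1024**3)` would check a hypothetical `data` folder against 16 GiB of memory. If the answer is no, that is the moment to think about several servers.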

2nd V: Velocity

Velocity is the speed with which data travels through our model, project, or solution: the speed with which it is ingested, processed, and delivered to the end client. We have to know whether this is real-time data, near real-time data, or just historical data that is not going anywhere soon and that we can work through slowly and efficiently 😉
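One way to make velocity concrete is to time a pipeline step against an agreed budget. This is a minimal sketch under assumptions: `run_within_budget` and the idea of a fixed time budget per step are illustrative names and choices made here, not part of any particular framework.

```python
import time
from datetime import timedelta


def run_within_budget(step, budget: timedelta):
    """Run `step()` and report how long it took and whether it stayed within `budget`.

    `step` is any zero-argument callable standing in for a pipeline stage.
    """
    start = time.perf_counter()
    result = step()
    elapsed = timedelta(seconds=time.perf_counter() - start)
    return result, elapsed, elapsed <= budget
```

Calling `run_within_budget(load_and_transform, timedelta(minutes=5))` with a hypothetical `load_and_transform` function returns the step's result together with the elapsed time and a flag you can log or alert on.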

3rd V: Variety

Data comes from various sources and in various types: structured, semi-structured, and not structured at all (officially "unstructured" XD). And boy, I've been burned by this a lot: my pipeline would expect one data type (because I tested it with a sample and it worked), and then it would give me an error because of an additional type or structure not yet supported by my solution. These things have to be defined at the beginning; you have to know how much variety there is in the data you are working with.
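Writing the expected structure down and checking every incoming record against it catches this kind of surprise before the pipeline fails halfway through. The sketch below assumes records arrive as plain dictionaries; the schema and its field names (`employee_id`, `name`, `salary`) are hypothetical and only serve the illustration.

```python
# Hypothetical schema: field name -> expected Python type.
EXPECTED_SCHEMA = {"employee_id": int, "name": str, "salary": float}


def find_schema_problems(record: dict, schema: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the record matches."""
    problems = []
    for field, expected_type in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            actual = type(record[field]).__name__
            problems.append(f"{field}: expected {expected_type.__name__}, got {actual}")
    # Fields the pipeline has never seen are reported too, in a stable order.
    for field in sorted(record.keys() - schema.keys()):
        problems.append(f"unexpected field: {field}")
    return problems
```

Records with a non-empty problem list can be routed to a quarantine area instead of crashing the run. That is a design choice; failing fast is equally valid when every record is required.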

4th V: Veracity

Is the data I am working with trustworthy? Is it still correct after all the manipulations and cleaning? Was the transformation pipeline correct? These are the questions we ask when we talk about the veracity of the data. Collecting all the data we need may not be that difficult, but making sure it is accurate and consistent and has not been falsely altered is another challenge. We are all aware that in order to get insights from the data we have to perform some preprocessing, and we have to make sure that this process does not skew the data.
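A cheap guard against skew is to compare simple statistics of a numeric column before and after a cleaning step. The following is a sketch under stated assumptions: the 5% `tolerance` is an arbitrary illustrative threshold, and the function assumes a non-empty `before` list whose mean is not zero.

```python
from statistics import mean


def drift_report(before: list, after: list, tolerance: float = 0.05) -> dict:
    """Compare row count and mean of one numeric column before and after cleaning.

    Assumes `before` is non-empty and its mean is non-zero (illustrative simplification).
    """
    dropped_share = 1 - len(after) / len(before)
    mean_shift = abs(mean(after) - mean(before)) / abs(mean(before))
    return {
        "dropped_share": dropped_share,
        "mean_shift": mean_shift,
        # Flag the step for review if too many rows vanished or the mean moved too far.
        "suspicious": dropped_share > tolerance or mean_shift > tolerance,
    }
```

A flagged step is not necessarily wrong, since removing genuine outliers also moves the mean, but it tells you where to look before trusting the results.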

5th V: Value

And the last V stands for value, because at the end of the day the whole point of all this is to get value from data. That includes creating reports and dashboards, finding useful insights that can improve the business, and highlighting critical areas to make more informed decisions.

You may object that those are the 5 Vs of big data, and you would be right. Yes, those are the 5 Vs of big data, but not only of big data. Any data project has to deal with these 5 Vs. In a big data project they will be more complicated to handle; in a small data project all 5 Vs will simply be easier to manage.

For example, I was working on a data solution for the HR department, and at the beginning we had to address the 5 Vs of the data. Even though we didn't have terabytes of data, we had a lot of small Excel files where the data had previously been stored and distributed (volume). There were 3 different sources of data to collect from: Excel files, the corporate DB, and the corporate CRM (variety). The data would be updated on a daily basis, and users wanted the current data as quickly as possible, with a maximum delay of 30 minutes. That is not even close to real-time, but we still had to make sure that the pipeline ran fast enough (velocity). Data coming from the Excel files would always be altered by a human at some point, and there was always a dispute over which update takes precedence, so we had to deal with that too (veracity). And in order to get value from the data, we had to find a way to visualize it and let the end users explore it and draw their own conclusions (value).
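To illustrate the veracity part of this example: when the same record shows up in several Excel files, one possible rule is "the most recently modified version wins". This is only a sketch of that rule, not a description of how the project actually resolved the disputes; the field names `employee_id` and `modified_at` are hypothetical.

```python
from datetime import datetime


def latest_versions(rows: list[dict]) -> list[dict]:
    """Keep, for each employee_id, only the row with the most recent `modified_at`.

    Last-write-wins is an assumed policy chosen for illustration.
    """
    latest = {}
    for row in rows:
        key = row["employee_id"]
        if key not in latest or row["modified_at"] > latest[key]["modified_at"]:
            latest[key] = row
    return list(latest.values())


rows = [
    {"employee_id": 1, "team": "A", "modified_at": datetime(2020, 8, 1, 9, 0)},
    {"employee_id": 1, "team": "B", "modified_at": datetime(2020, 8, 3, 14, 30)},
    {"employee_id": 2, "team": "C", "modified_at": datetime(2020, 8, 2, 11, 0)},
]
deduplicated = latest_versions(rows)  # employee 1 keeps the 3 August version
```

Whatever rule is chosen, agreeing on it with the data owners up front matters more than the code, because the rule decides whose edit is silently discarded.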

We invested time at the beginning to find solutions for every V of our data, and having done that, we were able to finish our project just in time, even with lovely documentation.

So even if you are just going to process the Titanic dataset, think of the 5 Vs. It will take you 2 minutes, but you will be ready for the unpredictable, even though you already know who's gonna die there XD.

Originally published at https://sergilehkyi.com on August 10, 2020.

Translated from: https://medium.com/swlh/5-essential-data-vs-you-have-to-take-care-of-b4e03e8964c1
