5 Essential Data Vs You Have to Take Care Of

I am pretty sure that on your data journey you have come across courses, videos, articles, or use cases where someone takes some data, builds a classification or regression model, and shows you great results. You learn how the model works and why it works that way and not another, and everything seems fine. You think you just learned a new thing (and you did), you are happy about that (yes, you are! I am not kidding around here, you're doing great!) and you move on to the next piece of content.

But later on (and everyone's "later on" has a different length) you start to ask additional questions: Where did that data come from? If I have more data, will the model run as smoothly as it did during the demonstration? Does real-world data exist in such a format? Can I get similar data, and if I can, will it be as easy to process? What do the results of that model mean? Can I present the data in a prettier way? And so on and so on.

When I started to learn about data analytics, data science, and the world of data in general, I was always amazed by the results people would get after processing some piece of data, running a machine learning model, getting keys from word buckets, and so on. But every time I tried to do something on my own, a new obstacle would appear: the data I wanted to analyze was too much or not enough, a model would run with one piece of data but not with another, etc.

After running into all these difficulties and learning to deal with them the hard way, I would like to share with you the 5 essential Vs of data that you have to take care of before you start your data project or solution.

1st V: Volume

When we talk about "volume" in regard to data, we have to be aware of the amount of data that has to be handled in the project. Should we use several servers to handle that volume and distribute the load between them? Or is our own computer with its own hard disk enough to solve the problem?
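
A quick way to get a first answer to the volume question is to measure the data on disk and compare it against the memory you have. The sketch below is only a rough heuristic, not a sizing method: the function names are hypothetical, and the `overhead_factor` of 3 is an assumed guess for how much larger data becomes once loaded into memory.

```python
import os


def total_size_bytes(directory):
    """Sum the sizes of all regular files under `directory`, recursively."""
    total = 0
    for root, _dirs, files in os.walk(directory):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def fits_on_one_machine(data_bytes, memory_budget_bytes, overhead_factor=3.0):
    """Return True if the data, inflated by an assumed in-memory overhead,
    still fits into the given memory budget.

    `overhead_factor` is a guess; measure it on a sample of your own data.
    """
    return data_bytes * overhead_factor <= memory_budget_bytes
```

If `fits_on_one_machine(total_size_bytes("data/"), 8 * 1024**3)` comes back `False` for an 8 GiB budget, that is a hint to look at chunked processing or distributing the load before writing the rest of the pipeline.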

2nd V: Velocity

Velocity is the speed with which data travels through our model, project, or solution: the speed with which it is ingested, processed, and delivered to the end client. We have to know whether this is real-time data, near real-time data, or just historical data that is not going anywhere soon, so we can work through it slowly and efficiently 😉

3rd V: Variety

Data comes from various sources and in various types: structured, semi-structured, and not structured at all (officially "unstructured" XD), and boy, have I been burned by it a lot. My pipeline would expect one data type (because I tested it with a sample and it worked), and then it would give me an error because there was an additional type or structure that my solution did not yet support. This kind of thing has to be defined at the beginning: you have to know the levels of variety in the data you are working with.
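
One way to avoid getting burned like that is to write down what the pipeline expects and check every record against it before processing. This is a minimal sketch under assumptions: the schema, the field names, and the `validate_record` helper are all hypothetical and only illustrate the idea.

```python
# Hypothetical schema: field name -> expected Python type.
EXPECTED_SCHEMA = {"employee_id": int, "name": str, "salary": float}


def validate_record(record, schema=EXPECTED_SCHEMA):
    """Return a list of problems found in `record`; an empty list means it matches."""
    problems = []
    # Every expected field must be present and have the expected type.
    for field, expected_type in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            actual = type(record[field]).__name__
            problems.append(f"{field}: expected {expected_type.__name__}, got {actual}")
    # Fields the schema does not know about are reported as well.
    for field in record:
        if field not in schema:
            problems.append(f"unexpected field: {field}")
    return problems
```

Collecting the problems instead of raising on the first one makes it easier to see how much variety a new source really brings before deciding what the pipeline should support.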

4th V: Veracity

Is the data I am working with trustworthy? Is it still correct after all the manipulation and cleaning? Was the transformation pipeline correct? These are the questions we ask when we talk about the veracity of data. We can collect all the data we need, and that won't be too difficult, but will it be accurate and consistent, and free of false alterations? That's another challenge. We are all aware that in order to get insights from data we have to do a little preprocessing, and we have to make sure that process does not skew the data.
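
A simple sanity check for skew is to compare a few summary statistics of a numeric column before and after preprocessing. The sketch below is one possible approach, not a complete data-quality check: the helper names and the 5% tolerance are assumptions chosen for illustration.

```python
import statistics


def summarize(values):
    """Compute a few summary statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }


def drift_report(before, after, tolerance=0.05):
    """Return the statistics whose relative change exceeds `tolerance`.

    The result maps each flagged statistic to a (before, after) pair.
    """
    stats_before, stats_after = summarize(before), summarize(after)
    flagged = {}
    for key, old in stats_before.items():
        new = stats_after[key]
        if old != 0:
            change = abs(new - old) / abs(old)
        else:
            # Relative change is undefined at zero; flag any move away from it.
            change = 0.0 if new == 0 else float("inf")
        if change > tolerance:
            flagged[key] = (old, new)
    return flagged
```

A flagged statistic is not necessarily a bug. Dropping an outlier will move the mean on purpose, for example. The point is that every shift should be one you can explain.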

5th V: Value

And the last V stands for value. Because at the end of the day, the whole point of all this is to get value from data. That includes creating reports and dashboards, finding useful insights that can improve the business, and highlighting critical areas to support more informed decisions.

You may object that those are the 5 Vs of big data, and you will be right. Yes, they are the 5 Vs of big data, but not only of big data. Any data project has to deal with these 5 Vs. In a big data project they will be more complicated to handle; in a small data project all 5 Vs will simply be easier to manage.

For example, I was working on a data solution for an HR department, and at the beginning we had to address the 5 Vs of the data. Even though we didn't have terabytes of data, we had a lot of small Excel files where the data had previously been stored and distributed (volume). There were 3 different sources of data to collect from: Excel files, the corporate DB, and the corporate CRM (variety). The data would be updated on a daily basis, and users wanted up-to-date data as quickly as possible, with a maximum delay of 30 minutes. That is not even close to real-time, but we still had to make sure that the pipeline executed fast enough (velocity). Data coming from the Excel files would always be altered by a human at some point, and there was always a dispute about which update takes precedence, so we had to deal with that too (veracity). And in order to get value from the data, we had to find a way to visualize it and let the end users explore it and draw their own conclusions (value).
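
The 30-minute requirement from this example can be turned into a small check that a pipeline or dashboard runs against the timestamp of its last successful refresh. This is only a sketch: the `is_fresh` helper and the way the timestamp is obtained are assumptions, not a description of the original solution.

```python
from datetime import datetime, timedelta, timezone

# Maximum acceptable delay, taken from the requirement described above.
MAX_DELAY = timedelta(minutes=30)


def is_fresh(last_refresh, now=None, max_delay=MAX_DELAY):
    """Return True if the data was refreshed within `max_delay` of `now`.

    `last_refresh` and `now` are timezone-aware datetimes; `now` defaults
    to the current UTC time.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_refresh <= max_delay
```

Passing `now` explicitly keeps the check easy to test, and an alert on `not is_fresh(...)` tells you the velocity requirement was missed before the users do.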

We invested time at the beginning to find solutions for every V of our data, and having done that, we were able to finish the project just in time, even with lovely documentation.

So even if you are just going to process the Titanic dataset, think about the 5 Vs. It will take you 2 minutes, but you will be ready for the unpredictable, even though you already know who's gonna die there XD.

Originally published at https://sergilehkyi.com on August 10, 2020.

Translated from: https://medium.com/swlh/5-essential-data-vs-you-have-to-take-care-of-b4e03e8964c1
