4 Reasons Why Data Scientists Should Version Data

In a software project it is standard practice to put the code under version control right away, and the benefits are well known to the software community: every modification to the code in a repository is tracked. If a mistake is made, developers can travel back in time and compare earlier versions of the code to fix the problem while minimizing disruption to the rest of the team. Code is the most precious asset of a software project and, for that reason, must be protected at all costs!

Well, for Data Science projects, data can also be considered the crown jewels, so why don’t we, as Data Scientists, treat it as the most precious thing on earth by putting it under version control?

For those familiar with Git, you might be thinking, “Git cannot handle large files and directories, at least not with the same performance as it handles small code files. So how can I version control my data in the same way we version control code?” Well, this is now possible, and it is almost as easy as typing git clone: after cloning, a single extra command brings the data files and ML model files into the workspace. All this magic can be achieved with DVC.

Quick start with DVC

First things first: we have to get DVC installed on our machines. It’s pretty straightforward, and you can do it by following the installation steps in the DVC documentation.

As I’ve already mentioned, data version control tools such as DVC make it possible to build large projects while keeping their pipelines reproducible. With DVC it’s very simple to start tracking a dataset alongside the code in a Git repository, and by simple I mean as easy as typing the line below:

dvc add path/to/dataset

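Regardless of the size of the dataset, it is now tracked by DVC: the data itself goes into DVC’s cache, and a small path/to/dataset.dvc file describing it is what you commit to Git. Assuming that we also want to push the dataset to the cloud, that is possible, once a DVC remote has been configured, with the command below: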

dvc push path/to/dataset.dvc

Out of the box, DVC supports many cloud storage services, such as Amazon S3, Google Cloud Storage, Azure Blob Storage, and Google Drive. And since the dataset was pushed to the cloud through the version control system, if I clone the project onto another machine I can download the data, or any other artifact, with the following command:

dvc pull
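
To make the add / push / pull cycle above less magical, here is a toy sketch of the general idea behind it: the data is stored under a name derived from its hash, and only a tiny pointer file needs to live next to the code. This is an illustration under stated assumptions, not DVC's actual implementation; the names `toy_add`, `toy_checkout`, and the `.pointer.json` suffix are all hypothetical.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets never have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def toy_add(data_file: Path, cache_dir: Path) -> Path:
    """Copy the data into a cache keyed by its hash and write a small pointer file."""
    checksum = file_sha256(data_file)
    cache_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(data_file, cache_dir / checksum)
    # Hypothetical pointer format: just the hash and the original file name.
    pointer = data_file.with_name(data_file.name + ".pointer.json")
    pointer.write_text(json.dumps({"sha256": checksum, "path": data_file.name}))
    return pointer


def toy_checkout(pointer: Path, cache_dir: Path) -> Path:
    """Restore the exact data version that a pointer file refers to."""
    meta = json.loads(pointer.read_text())
    target = pointer.with_name(meta["path"])
    shutil.copy2(cache_dir / meta["sha256"], target)
    return target


def demo_roundtrip() -> bytes:
    """Add a file, delete it, then restore it from the cache via the pointer."""
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp)
        data_file = workspace / "dataset.csv"
        data_file.write_bytes(b"id,value\n1,10\n")
        pointer = toy_add(data_file, workspace / "cache")
        data_file.unlink()  # simulate a fresh clone that only has the pointer
        return toy_checkout(pointer, workspace / "cache").read_bytes()
```

In this toy, the cache directory plays the role of the remote storage, and the pointer file is the only thing that would need to be committed alongside the code.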

Well, now that you know how to get started with DVC, I suggest you explore the tool, or similar ones, further. Version control should be your best friend as a Data Scientist: it lets you not only version datasets but also build reproducible pipelines, while keeping every development traceable.

If this hasn’t convinced you yet, next I’ll tell you why you should start version controlling your data!

Why should I start using data version control?

1. Save and reproduce all of your data experiments

As Data Scientists, we know that developing a Machine Learning model is not only about code, but also about data and the right parameters. Finding the right combination often requires experimentation, which makes the process highly iterative and makes it extremely important to keep track of the changes made and their impact on the end results. This becomes even more important in a complex environment where multiple data scientists collaborate. In that sense, having a versioned snapshot of the data used to develop a given version of the model makes iteration and model development not only easier but also trackable.
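
As a rough illustration of what tying a result to a data snapshot can look like, the sketch below logs each experiment together with a fingerprint of the exact bytes it was trained on. Everything here is an assumption made for the example: the `record_experiment` helper, the log layout, and the parameter and metric values, which are placeholders rather than real results.

```python
import hashlib
from datetime import datetime, timezone


def dataset_fingerprint(raw_bytes: bytes) -> str:
    """Return a SHA-256 digest that identifies one exact version of the data."""
    return hashlib.sha256(raw_bytes).hexdigest()


def record_experiment(log: list, data: bytes, params: dict, metrics: dict) -> dict:
    """Append one experiment entry linking parameters and metrics to a data version."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "data_sha256": dataset_fingerprint(data),
        "params": params,
        "metrics": metrics,
    }
    log.append(entry)
    return entry


experiment_log = []

# Two runs with identical parameters but different data snapshots.
# The metric values are placeholders, not measured results.
first = record_experiment(
    experiment_log, b"id,label\n1,0\n2,1\n", {"learning_rate": 0.1}, {"accuracy": 0.81}
)
second = record_experiment(
    experiment_log, b"id,label\n1,0\n2,1\n3,1\n", {"learning_rate": 0.1}, {"accuracy": 0.84}
)

# The differing fingerprints show that the change in the metric coincides
# with a change in the data, not in the parameters.
print(first["data_sha256"] != second["data_sha256"])
```

A tool such as DVC takes care of this bookkeeping for you; the point of the sketch is only that a metric becomes traceable once it is stored next to an identifier of the data that produced it.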

2. Debugging and testing

When we play around in Kaggle competitions, we often don’t get to see the real challenges of developing an ML-based solution for production systems. In fact, one of the biggest challenges is dealing with the variety of data sources and the amount of data available. Reproducing the results of an experiment can be daunting if we cannot even retrieve the exact dataset that was used. Data version control eases these issues and makes the development of machine learning solutions much simpler, more organized, and reproducible.
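
One simple debugging habit this enables is checking, before re-running an experiment, that the data in hand really is the data that was originally used. The sketch below assumes a fingerprint was recorded at experiment time; `verify_dataset` and `DatasetMismatchError` are hypothetical names for this example, not part of any library.

```python
import hashlib


class DatasetMismatchError(Exception):
    """Raised when the data in hand is not the version that was recorded."""


def verify_dataset(raw_bytes: bytes, expected_sha256: str) -> None:
    """Compare the data's SHA-256 digest with the one recorded for the experiment."""
    actual = hashlib.sha256(raw_bytes).hexdigest()
    if actual != expected_sha256:
        raise DatasetMismatchError(
            f"expected {expected_sha256[:12]}..., got {actual[:12]}..."
        )


original = b"id,label\n1,0\n2,1\n"
recorded = hashlib.sha256(original).hexdigest()  # stored when the experiment ran

verify_dataset(original, recorded)  # same bytes: passes silently

try:
    # One extra row is enough to make this a different dataset version.
    verify_dataset(original + b"3,1\n", recorded)
except DatasetMismatchError as error:
    print(f"refusing to reproduce: {error}")
```

Failing early like this turns a confusing "my numbers don't match" investigation into an explicit "this is not the same data" message.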

3. Compliance and auditing

Privacy regulations such as GDPR already require companies and organizations to demonstrate compliance and to document the history of their data sources. The ability to track data versions that version control tools provide is a first step toward getting a company’s data sources ready for compliance, and an essential step in maintaining a robust audit trail and risk management processes around data.

4. Align software and data science teams

Getting Data Science and Software teams to speak the same language can be quite challenging, and it depends heavily on the profiles involved in the interactions between the teams. Bringing some of the good practices of software engineering into data science processes can help not only to align the work of the teams involved, but also to accelerate the development and integration of solutions.

Conclusions

Data science is hard to productize, and one of the main reasons is that there are too many mutable elements, such as data. Versioning in data science applications can be interpreted in many ways, from model versioning to data versioning. This article aimed to cover the importance and benefits of versioning data for data science teams, but there are many more aspects that we, as Data Scientists, should pay attention to. In the end, keeping an eye on continuous delivery principles is very important for the success of ML-based solutions!

Fabiana Clemente is CDO at YData.

Improved data for AI

YData provides a data-centric development platform for Data Scientists to work with high-quality and synthetic data.

Translated from: https://medium.com/swlh/4-reasons-why-data-scientists-should-version-data-672aca5bbd0b
