數據科學家數據工程師

While working in a software project it is very common and, in fact, a standard to start right away versioning code, and the benefits are already pretty obvious for the software community: it tracks every modification of the code in a particular code repository. If any mistake is made, developers can always travel through time and compare earlier versions of the code in order to solve the problem while minimizing disruption to all the team members. Code for software projects is the most precious asset and for that reason must be protected at all costs!

在軟件項目中工作時，它是非常普遍的，實際上是立即開始版本控制代碼的標準，對于軟件社區來說，好處已經非常明顯：它跟蹤特定代碼存儲庫中對代碼的每次修改。如果有任何錯誤，開發人員可以隨時瀏覽并比較早期版本的代碼，以解決問題，同時最大程度地減少對所有團隊成員的破壞。軟件項目代碼是最寶貴的資產，因此必須不惜一切代價保護它！

Well, for Data Science projects, data can also be considered the crown jewels, so why us, as Data Scientists, don’t treat as the most precious thing on earth through versioning control?
好吧，對于數據科學項目，數據也可以被視為皇冠上的明珠，那么為什么我們作為數據科學家不通過版本控制將其視為地球上最寶貴的東西呢？

For those familiar with Git, you might be thinking, “Git cannot handle large files and directories.. at least it can’t with the same performance as it deals with small code files. So how can I version control my data in the same old fashion we version control code?”. Well, this is now possible, and it’s easy as just typing git cloneand see the data files and ML model files saved in the workspace, and all this magic can be achieved with DVC.

對于熟悉Git的人來說，您可能會想： “ Git無法處理大文件和目錄。至少，它不能具有與處理小代碼文件相同的性能。 那么如何以與版本控制代碼相同的舊版本來控制數據呢？”。 嗯，這已經成為可能，而且很容易，只需鍵入git clone并查看保存在工作區中的數據文件和ML模型文件，并且所有這些魔力都可以通過DVC來實現。

DVC快速入門 (Quick start with DVC)

First things first, we have to get DVC installed in our machines. It’s pretty straightforward and you can do it by following these steps.

首先，我們必須在計算機中安裝DVC。這非常簡單，您可以按照以下步驟進行操作。

As I’ve already mentioned, tools for data version control such as DVC makes it possible to build large projects while making it possible to reproduce the pipelines. Using DVC it’s very simple to add datasets into a git repository, and when I mean by simple, is as easy as typing the line below:

正如我已經提到的那樣，用于數據版本控制的工具(例如DVC)使構建大型項目成為可能，同時又可以重現管道。使用DVC，將數據集添加到git存儲庫非常簡單，而我的意思很簡單，就像鍵入以下行一樣：

dvc add path/to/dataset

Regardless of the size of the dataset, the data is added to the repository. Assuming that we also want to push the dataset into the cloud, it is also possible with the below command:

無論數據集的大小如何，數據都會添加到存儲庫中。假設我們也想將數據集推送到云中，也可以使用以下命令：

dvc push path/to/dataset.dvc

Out of the box, DVC supports many cloud storage services such as S3, Google Storage, Azure Blobs, Google Drive, etc… And since the dataset was pushed to the cloud through the version control system, if I clone the project into another machine, I’m able to download the data, or any other artifact, using the following command:

DVC開箱即用，支持許多云存儲服務，例如S3，Google Storage，Azure Blob，Google Drive等。由于數據集是通過版本控制系統推送到云的，因此如果我將項目克隆到另一臺計算機上，我可以使用以下命令下載數據或任何其他工件：

dvc pull

Well, now that you know how to start with DVC, I suggest you to go and further explore the tool, or similar ones. Version control should be your best friend as a Data Scientist, as they allow not only to version datasets but also to create reproducible pipelines, while keeping all the developments traceable and reproducible.

好了，既然您知道如何開始使用DVC，我建議您繼續研究該工具或類似工具。作為數據科學家，版本控制應該是您最好的朋友，因為它們不僅允許版本數據集，而且允許創建可復制的管道，同時保持所有開發的可追溯性和可復制性。

If this hasn’t yet convinced, next I’ll tell why you must start versioning control your data!!

如果尚未確定，接下來我將告訴您為什么必須開始版本控制您的數據！

為什么要開始使用數據版本控制？ (Why should I start using data version control?)

1.保存并復制所有數據實驗 (1. Save and reproduce all of your data experiments)

As Data Scientists we know that to develop a Machine Learning model, is not all about code, but also about data and the right parameters. A lot of times, in order to find the perfect match, experimentation is required, which makes the process highly iterative and extremely important to keep track of the changes made as well as their impacts on the end results. This becomes even more important in a complex environment where multiple data scientists are collaborating. In that sense, if we are able to have a snapshot of the data used to develop a certain version of the model and have it versioned, it makes the process of iteration and model development not only easier but also trackable.

作為數據科學家，我們知道開發機器學習模型不僅與代碼有關，而且與數據和正確的參數有關。很多時候，為了找到完美的匹配，需要進行實驗，這使得該過程具有高度的重復性，并且對于跟蹤所做的更改及其對最終結果的影響非常重要。在由多個數據科學家協作的復雜環境中，這一點變得更加重要。從這個意義上講，如果我們能夠擁有用于開發模型的特定版本的數據的快照并對其進行版本化，那么它不僅使迭代和模型開發過程變得更加容易而且可跟蹤。

2.調試和測試 (2. Debugging and testing)

While playing around in Kaggle competitions many times we do not understand the real challenges inherent to the development of an ML-based solution while working with production systems. In fact, one of the biggest challenges is to deal with the variety of data sources and the amount of data that we’ve available. Sometimes can be a bit daunting to reproduce the results of experimentation if we are not even able to retrieve the exact dataset that has been used. Data version control can ease these issues and make the process of machine learning solutions development must simpler, organized, and reproducible.

當多次參加Kaggle比賽時，我們不了解在與生產系統一起工作時開發基于ML的解決方案所固有的真正挑戰。實際上，最大的挑戰之一是處理各種數據源和我們可用的數據量。如果我們甚至無法檢索已使用的確切數據集，有時要重現實驗結果可能會有些艱巨。數據版本控制可以緩解這些問題，并使機器學習解決方案的開發過程必須更簡單，更有條理并且可重現。

3.合規與審計 (3. Compliance and auditing)

Privacy regulations, such as GDPR, already request companies and organizations to demonstrate compliance and history of the available data sources. The ability to track data version provided by version control tools is the first step to have companies data sources ready for compliance, and an essential step in maintaining a strong and robust audit train and risk management processes around data.

隱私法規(例如GDPR)已經要求公司和組織證明合規性和可用數據源的歷史記錄。跟蹤版本控制工具提供的數據版本的能力是使公司數據源準備好合規的第一步，并且是維持圍繞數據的強大而強大的審核培訓和風險管理流程的重要步驟。

4.協調軟件和數據科學團隊 (4. Align software and data science teams)

Sometimes, to have Data Science and Software teams talking the same language can be quite challenging and can highly depend on the profiles involved in the interactions between the teams. To start implementing some of the good practices from the software into the data science processes, can help not only to align the work between the teams involved, but also to accelerate the development and integration of the solutions.

有時，讓數據科學和軟件團隊說相同的語言可能會非常具有挑戰性，并且在很大程度上取決于團隊之間交互所涉及的配置文件。從軟件到數據科學流程開始實施一些良好實踐，不僅可以幫助使相關團隊之間的工作保持一致，還可以加快解決方案的開發和集成。

結論 (Conclusions)

Data science is had to productize, and one of the main reasons for that is because there are too many mutable elements, such as data. The concept of versioning for data science applications can be interpreted in many possible ways, from models to data versioning. This article aimed to cover the importance and benefits of versioning data for the data science teams, but there are many more aspects that we should pay attention to as Data Scientists. In the end, keeping an eye on continuous delivery principles is very important for the success of ML-based solutions!

數據科學必須進行生產，其主要原因之一是因為可變元素(例如數據)太多。從模型到數據版本控制，可以采用許多可能的方式來解釋數據科學應用程序的版本控制概念。本文旨在介紹對數據科學團隊進行數據版本控制的重要性和好處，但是作為數據科學家，我們還有許多方面應注意。最后，密切注意連續交付原則對于基于ML的解決方案的成功非常重要！

Fabiana Clemente is CDO at YData.

Fabiana Clemente 是 YData的 CDO 。

Improved data for AI

改善AI數據

YData provides a data-centric development platform for Data Scientists to work to high-quality and synthetic data.

YData為數據科學家提供了以數據為中心的開發平臺，以處理高質量和合成數據。

翻譯自: https://medium.com/swlh/4-reasons-why-data-scientists-should-version-data-672aca5bbd0b

數據科學家數據工程師

本文來自互聯網用戶投稿，該文觀點僅代表作者本人，不代表本站立場。本站僅提供信息存儲空間服務，不擁有所有權，不承擔相關法律責任。
如若轉載，請注明出處：http://www.pswp.cn/news/389219.shtml
繁體地址，請注明出處：http://hk.pswp.cn/news/389219.shtml
英文地址，請注明出處：http://en.pswp.cn/news/389219.shtml

如若內容造成侵權/違法違規/事實不符，請聯系多彩編程網進行投訴反饋email:809451989@qq.com，一經查實，立即刪除！