Data Cleaning Is Finally Being Automated

APPLE | GOOGLE | SPOTIFY | OTHERS

Editor’s note: The Towards Data Science podcast’s “Climbing the Data Science Ladder” series is hosted by Jeremie Harris. Jeremie helps run a data science mentorship startup called SharpestMinds. You can listen to the podcast below:

It’s cliché to say that data cleaning accounts for 80% of a data scientist’s job, but it’s directionally true.

That’s too bad, because fun things like data exploration, visualization and modelling are the reason most people get into data science. So it’s a good thing that there’s a major push underway in industry to automate data cleaning as much as possible.

One of the leaders of that effort is Ihab Ilyas, a professor at the University of Waterloo and founder of two companies, Tamr and Inductiv, both of which are focused on the early stages of the data science lifecycle: data cleaning and data integration. Ihab knows an awful lot about data cleaning and data engineering, and has some really great insights to share about the future direction of the space — including what work is left for data scientists, once you automate away data cleaning.

Here were some of my biggest takeaways from the conversation:

Data cleaning involves a lot of things, one of which is dealing with missing values. Historically, missing values have often been filled in manually by subject matter experts who can make educated guesses about the data, but automated techniques can work well (and usually do better) at scale.
These automated strategies can range from fairly naive approaches (e.g. replacing a value with the median or average value of other points in the dataset), to more sophisticated techniques (e.g. using a predictive model to guess at missing values).
The distinctions between different parts of the data science lifecycle are often arbitrary, but clearly defining the boundaries between data cleaning, data exploration and modelling is nonetheless essential to ensure that problems can be solved in a contained and modular fashion. This idea is one of the data science best practices that make up DataOps, a topic we’ve discussed on the podcast before.
It’s clear that data cleaning, like modelling, is not immune to automation. As a result, it’s likely that data scientists will find themselves leaning more and more into their subject matter expertise, communication and engineering skills in the future, rather than spending their time on dealing with missing values, hyperparameter optimization or model selection.
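
To make the contrast between naive and model-based filling of missing values concrete, here is a minimal Python sketch. It is not taken from the podcast or from any tool Ihab has built. The toy dataset, the function names, and the choice of a one-variable least-squares line as the "predictive model" are all assumptions made purely for illustration.

```python
from statistics import median


def impute_median(values):
    """Naive approach: replace None entries with the median of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def impute_with_model(xs, ys):
    """Model-based approach: predict missing ys from xs using a line
    fitted on the rows where y is observed."""
    pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    slope, intercept = fit_line([p[0] for p in pairs], [p[1] for p in pairs])
    return [slope * x + intercept if y is None else y for x, y in zip(xs, ys)]


# Hypothetical toy data: floor area and rent, with the last rent missing.
area = [30, 50, 70, 90, 150]
rent = [600, 1000, 1400, 1800, None]

median_filled = impute_median(rent)
model_filled = impute_with_model(area, rent)

print("median fill:", median_filled[-1])
print("model fill: ", model_filled[-1])
```

On this toy data the median fill ignores the fact that the missing row has an unusually large floor area, while the fitted line uses that related column to make its guess. Real datasets would call for validation of whichever strategy is chosen, and production tools use far richer models than a single straight line.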

You can follow Ihab on Twitter here and you can follow me on Twitter here.

Source: https://towardsdatascience.com/data-cleaning-is-finally-being-automated-8cc964ea2e12
