Data Middle Platform Is the Next Generation of Big Data

Data science has been an eye-catching field for many years now for young people with a bachelor's, master's, or Ph.D. in computer science, statistics, business analytics, engineering management, physics, maths, or, obviously, data science. However, people hold a lot of myths about data science. It is no longer just machine learning and statistics. Over the years, I have spoken to a lot of data science aspirants about breaking into this field. Why is there all the hype about data science? Is knowing statistics and machine learning still enough to break into this field? Is it still going to be the future? I was in the same boat as you all, but I am now seeing first-hand how the demand has changed for the next generation of data scientists breaking into this field. I am not going to teach you how to get into data science, as many people on the internet are already doing that.

Image by Shutterstock, from Datanami

Why is there all the hype about Data Science?

Everyone around the corner wants to get into data science. A few years ago, there was a demand-supply problem in the field: after Dr. DJ Patil and Jeff Hammerbacher coined the term Data Science, the supply of data scientists was low and the demand was high. But now, in 2020, the situation has turned around. The inflow of data science enthusiasts educated formally or through MOOCs has increased, and the demand has grown too, but not to the same extent. The term has grown broader and broader to incorporate most of the supporting functions that one needs to do data science. I would like to share one of my favorite quotes from KDnuggets:

“Data Science is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it.”

Jokes apart, these are some of the reasons I feel data science has attracted all the hype:

The mystery behind the title "data scientist"
High job satisfaction
Huge business impact
Many job sites rating it as the hottest job (rated the hottest job in the US by Glassdoor for the last 3 years)
Cutting-edge developments
An increasing influx of generated data
Many great and not-so-great schools and boot camps offering degrees in data science
Data is beautiful! (Not literally :p)

People who call themselves Data Scientists?

Someone is going to say it, so let me spill some truth about the current industry situation. Due to the increased demand for and prestige of the shiny Data Scientist title, many companies have started replacing titles such as product analyst, business intelligence analyst, business analyst, supply chain analyst, data analyst, and statistician with data scientist, because people were leaving their jobs to get the data scientist title at companies that were offering it for the same work. It is all a matter of the respect that many roles gain from this minor change in wording. So companies have started twisting titles in the same way to make them shinier and more desirable: data scientist-analytics, product data scientist, data scientist-growth, data scientist-supply chain, data scientist-visualization, or data scientist-what not.

Most people pursuing formal education or online training have the misconception that all data scientists build fancy machine learning models, but that is not always true. At least that was the case with me: when I started my master's in applied data science, I assumed that most data scientists do machine learning, and only when I entered the internship and job market in the US did I learn the truth. What drives people towards data science is the hype around artificial intelligence and its business impact.

Next Generation of Data Scientists: Machine Learning

For people who want to do applied machine learning as a Data Scientist-ML (that is what I am going to call the title, because it is not data scientist-analytics :p) in 2020 without a Ph.D., there is a lot more to it now than just knowing how to apply machine learning to datasets, which almost anyone can do today. There are a few other crucial things I figured out from my experience that can help you get shortlisted and nail the interview process for a data scientist role:

Distributed Data Processing/Machine Learning: Hands-on experience with technologies such as Apache Spark, Apache Hadoop, or Dask can help you prove that you can create data/ML pipelines at scale. Experience with any one of them should be good enough, but I would recommend Apache Spark (in either Python or Scala) as the go-to.
Production ML/Data Pipelines: Get hands-on experience with Apache Airflow, a standard open-source job orchestration tool for creating data and machine learning pipelines. It is currently used in the industry, so it is worth learning it and building some projects around it.
DevOps/Cloud: DevOps is badly neglected by most data science aspirants. If you don't have infrastructure, how would you build ML pipelines? It is not as easy as in coursework, where you build notebooks or code that runs on your local machine. The code you write should scale across infrastructure that you or other folks on your team might create. Many companies might not have ML infrastructure laid out yet and might be looking for someone to start it. Getting familiar with Docker and Kubernetes, and building ML applications with frameworks like Flask, should be your standard practice even during your coursework. I love Docker because it is scalable: you can build infrastructure images and replicate the same setup on servers or in the cloud on Kubernetes clusters.
Databases: Knowing databases and query languages is a must. SQL is badly neglected, but it is still the industry standard on any cloud platform or database. Start practicing complex SQL on LeetCode; it will help you with part of the coding interviews for DS profiles, as you will be responsible for bringing in data from warehouses with on-the-go preprocessing, which reduces the preprocessing you have to do before running ML models. Most feature engineering can be done on the go in SQL while fetching the data for your models, an aspect many people neglect.
Programming Languages: The recommended programming languages for data science are Python, R, Scala, and Java. Knowing any one of them is fine and can do the trick. For ML roles, there are going to be live coding rounds in the interview process, so you need to practice wherever you are comfortable: LeetCode, HackerRank, or anything you prefer.
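
The point in the Databases item about doing feature engineering in SQL while fetching the data can be sketched roughly as follows. This is only a minimal illustration under stated assumptions: Python's built-in sqlite3 module stands in for a real warehouse, and the orders table, its columns, the sample rows, and the load_features helper are all hypothetical names invented for the example.

```python
import sqlite3

# In-memory SQLite database standing in for a data warehouse (assumption).
conn = sqlite3.connect(":memory:")

# Hypothetical raw table and rows, invented purely for illustration.
conn.executescript("""
    CREATE TABLE orders (
        user_id    INTEGER,
        amount     REAL,
        order_date TEXT
    );
    INSERT INTO orders VALUES
        (1, 20.0, '2020-01-05'),
        (1, 35.0, '2020-02-10'),
        (2, 12.5, '2020-01-20'),
        (3, 80.0, '2020-03-01'),
        (3, 40.0, '2020-03-15'),
        (3, 60.0, '2020-04-02');
""")

# The aggregation happens inside the query, so the model code receives
# one ready-made feature row per user instead of raw order rows.
FEATURE_QUERY = """
    SELECT
        user_id,
        COUNT(*)    AS n_orders,
        AVG(amount) AS avg_amount,
        MAX(amount) AS max_amount,
        CAST(julianday(MAX(order_date)) - julianday(MIN(order_date)) AS INTEGER)
                    AS active_days
    FROM orders
    GROUP BY user_id
    ORDER BY user_id
"""


def load_features(connection):
    """Run the feature query and return a list of dicts, one per user."""
    cursor = connection.execute(FEATURE_QUERY)
    columns = [description[0] for description in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


features = load_features(conn)
for row in features:
    print(row)
```

Pushing the counts, averages, and date spans into the query keeps the Python side small. The same SELECT could in principle run against a warehouse engine, although date functions such as julianday are SQLite-specific and would need the target engine's equivalent.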

So, this is a time when knowing only machine learning or statistics is not going to get you into data science to do ML unless you are lucky, have some great connections in the industry (you should obviously network, which is very important!), or already have an exceptional research record to your name. Business applications and domain knowledge tend to come with experience and can't be learned beforehand, other than by doing internships in relevant industries.

What's up with me?

Two months back, I joined the media powerhouse ViacomCBS as a Data Scientist straight out of grad school, without any prior full-time industry experience except research assistantships and internships. My responsibilities here include building ML products from ideation through development to production, where I use most of the things listed above. I hope this will be helpful for all the aspiring Data Scientists and Machine Learning Engineers who are trying to break into this field.

Send your questions to [myLastName][myFirstName] at gmail dot com, or let's connect on LinkedIn.

Translated from: https://towardsdatascience.com/full-stack-data-science-the-next-gen-of-data-scientists-cohort-82842399646e
