How Hard Is It to Be a Real Data Scientist?

Data Science and Machine Learning are hard sports to play. It’s difficult enough to motivate yourself to sit down and learn some maths, let alone to become an expert on the matter.

I began my journey into machine learning with a prediction problem. I was tasked with predicting a variable, and had around 100 other variables I could use. As a fresh graduate, I understandably treated this as a regression problem and, despite my colleagues being seemingly impressed, in all honesty my result was pretty bad. I knew I could do better.

From there, I read, I experimented, I read some more, then experimented some more. This led to a bit of a journey where I quit that job, went back into education, then back into industry, and along the way I’ve been lucky enough to work with people who’ve shaped the field of Artificial Intelligence.

In what follows, I present 5 difficulties that Machine Learning practitioners and Data Scientists deal with on a daily basis. I offer sympathy to those who need it!

Difficulty 1: Adapting to the Problem Domain

How many mathematicians study Linguistics? How many mathematicians study Healthcare? So why are we any good at solving problems in these fields?

The art of being a Mathematician comes from the ability to abstract a problem in a manner that makes it solvable. In Linguistics, we can treat each “phone” (a distinct speech sound) as a discrete variable and create a model that estimates the joint distribution over phones. In Healthcare, we can build a model that picks up latent features in X-rays that indicate a disease.
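
To make the Linguistics abstraction a little more concrete, here is a minimal sketch of estimating a joint distribution over adjacent phones by counting pairs. The phone sequence and the function name `joint_phone_distribution` are made up for illustration; a real model would be far richer than raw pair counts.

```python
from collections import Counter


def joint_phone_distribution(phones):
    """Estimate P(phone, next phone) from adjacent pairs in one sequence."""
    pairs = list(zip(phones, phones[1:]))
    counts = Counter(pairs)
    total = len(pairs)
    # Relative frequency of each adjacent pair; empty input gives an empty dict.
    return {pair: n / total for pair, n in counts.items()}


# Hypothetical phone sequence, invented purely for this example.
phones = ["k", "ae", "t", "k", "ae", "b"]
dist = joint_phone_distribution(phones)

for pair, probability in sorted(dist.items()):
    print(pair, round(probability, 2))
```

Each phone is treated as a discrete variable, and the dictionary is the empirical joint distribution of consecutive phones, which is the kind of abstraction that turns a linguistic question into a counting problem.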

But dude, it’s tough.

To be a successful machine learning researcher you have to be willing to put in the time and effort to fully immerse yourself in the domain knowledge. Many of the game-changers in the field broke ground in areas they already had experience in. DeepMind’s founder Demis Hassabis ran a games company before returning to academia to study Neuroscience at UCL, ultimately leading to his developments in Reinforcement Learning and his advances in games like Atari and Go.

Not all of us are as fortunate as Demis in having a background in the field we’re trying to revolutionise. Often we’ll be at work and a project comes up that we have to try to figure out, and next week we may have another task. Project switching has its pros and cons, but ultimately you suffer in the level of depth you can reach.

It definitely helps if you know a little bit about your niche before you apply some ML but, for what it’s worth, I sympathise with your struggles.

Photo by Joshua Gresham on Unsplash

Difficulty 2: Identifying and Ignoring the Noise

Nothing is more pervasive in statistics, machine learning and data science than noise. Honestly, it’s everywhere: dirty data, rogue data points, literature built on weak foundations, models capturing latent bias.

Machine Learning models are generally trained by minimising the sum of squared errors (or some form of misclassification measure), but when you’re researching a new topic or getting feedback from a colleague, noise can be pretty hard to define. The last thing you want to do is chase it down a rabbit hole.
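
As a rough illustration of what minimising the sum of squared errors means, and of how a single rogue data point can drag a model around, here is a small sketch that fits a straight line by ordinary least squares. The data values and helper names are invented for this example.

```python
def sum_squared_errors(xs, ys, slope, intercept):
    """Sum of squared differences between observed ys and the line's predictions."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


def fit_line(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical, roughly linear data.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
slope, intercept = fit_line(xs, ys)
best = sum_squared_errors(xs, ys, slope, intercept)
nudged = sum_squared_errors(xs, ys, slope + 0.1, intercept)
print("fitted SSE:", best, "SSE after nudging the slope:", nudged)

# The same data with one rogue point: the fitted slope shifts noticeably.
ys_rogue = [2.1, 3.9, 6.2, 20.0]
rogue_slope, rogue_intercept = fit_line(xs, ys_rogue)
print("clean slope:", slope, "slope with rogue point:", rogue_slope)
```

Because errors are squared, one outlying value carries a lot of weight, which is one reason dirty data matters so much in practice.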

There are a few ways to get around it:

  • Speak to reliable people often, and keep them close
  • Learn how to spot nonsense, and keep it at a distance
  • Fail often, fail quickly

Experiment more, speak to people more, try more things, and eventually you’ll begin to recognise and ‘smell’ noise. You’ll avoid it, and progress quicker.

As an example: many algorithms have a high accuracy rating only because the event behind the dependent variable happens so infrequently. E.g. a model which predicts whether people in London get struck by lightning on a given day will almost certainly be 99.9999% correct without any training, simply by always predicting “no”. Dealing with the “noise” here means recognising that people don’t get struck by lightning that often, and adjusting your model for it.
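
The lightning example can be sketched in a few lines. The numbers below are hypothetical: one positive case in a million, chosen only so the accuracy matches the figure quoted above. The “model” does no learning at all and always predicts “not struck”.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred):
    """Fraction of true positive cases the model actually caught."""
    positives = sum(y_true)
    if positives == 0:
        return float("nan")  # undefined when there are no positive cases
    caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return caught / positives


# Hypothetical data: 1 person struck (label 1) out of 1,000,000.
y_true = [1] + [0] * 999_999
# A "model" with no training that always predicts "not struck".
y_pred = [0] * len(y_true)

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)
print(f"accuracy: {acc:.4%}, recall: {rec:.0%}")
```

The accuracy looks superb while recall is zero, so for rare events a metric that focuses on the positive class (or a rebalanced dataset) is usually a more honest measure than raw accuracy.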

Difficulty 3: Getting a Good Education

Photo by Markus Leo on Unsplash

Education is so important in this field because the breadth of knowledge required is so large. From computer science, to maths, to algorithms, to statistics: there’s a lot to cover in a relatively short amount of time.

Formal education (like University) is one thing, but education in machine learning goes well beyond that. Practitioners have to develop the ability to learn things quickly by themselves and to implement them well.

The reason why this is so important (and so difficult) is that it’s tempting at times to find a GitHub repository where someone else has spent time solving the same problem you have, pull their code and apply it to your problem. The solution may look OK, but plenty of things can get missed along the way, and there’s no substitute for having the fundamental understanding.

Difficulty 4: Publishing Negative Results

Negative results happen all the time. They’re hard, but they happen. You have to recognise that negative results are also results and that they should be welcomed.

Machine Learning has two sides to it: the theoretical and the applied. Theorists will publish less frequently in the hope of making a bigger splash, and applied academics will tend to publish more often but solve bigger problems.

However, in the pursuit of experimentation or of publishing, a lot of negative results are put to one side and rarely discussed. This leads to other practitioners repeating the same experiments and, in aggregate, a lot of time is wasted. This inefficiency also breeds a form of ego where people are respected only for the ‘positive’ results they’ve discovered, rather than for the results they can confirm to be simply incomplete.

Everyone benefits if we can classify problems better.

Photo by Francisco Moreno on Unsplash

Difficulty 5: Keeping on Top of the Research

Did I mention that there’s a lot of it?

Google has published over 340 publications THIS YEAR as of writing this article.

Google don’t mess around either: their research is always very good. Add to that all the other publications and universities in the world: how am I meant to keep on top of all this research?

You kind of…just…have to find a way.

I read a lot and spend most of my day looking out for new approaches and methodologies to solve the problems I’m facing, but at times you can get lost in a swathe of research, or not even find the right articles, because there’s so much research that it’s hard to identify what’s useful.

Using citations is a great way to filter research, and staying on top of the most cited papers every year definitely helps, but to find an ‘edge’ or discover ‘novel’ applications of models, you just have to do the leg work and read as much as you can.

Ultimately, in my opinion, to be a successful Machine Learning Researcher or Data Scientist you need to be able to teach yourself. You just have to find a reason to want to know how a neural network works, or why a Random Forest sucks in some cases, and use this to drive your understanding.

The reason is that it’s such a multi-disciplinary subject, and it moves in leaps and bounds every year. I graduated from my master’s program in 2016 and since then the whole AI sphere has been reinvented 3 times over.

Thanks for reading! If you have any messages, please let me know!

Keep up to date with my latest articles here!

Translated from: https://medium.com/swlh/how-hard-is-it-to-be-a-real-data-scientist-85ab88f451f
