Data Science and Machine Learning are hard sports to play. It’s difficult enough to motivate yourself to sit down and learn some maths, let alone to become an expert on the matter.
I began my journey into machine learning with a prediction problem. I was tasked with predicting one variable and had around 100 other variables I could use as inputs. As a fresh graduate, I understandably treated this as a regression problem and, although my colleagues seemed impressed, in all honesty my result was pretty bad. I knew I could do better.
From there, I read, I experimented, I read some more, then experimented some more. This led to a bit of a journey where I quit that job, went back into education, then back into industry, and along the way I’ve been lucky enough to work with people who’ve shaped the field of Artificial Intelligence.
In what follows, I present 5 difficulties that Machine Learning practitioners and Data Scientists deal with on a daily basis. I offer sympathy to those who need it!
Difficulty 1: Adapting to the Problem Domain
How many mathematicians study Linguistics? How many mathematicians study Healthcare? So why are we any good at solving problems in these fields?
The art of being a Mathematician comes from the ability to abstract a problem in a manner that makes it solvable. In Linguistics, we can treat each “phone” (a single speech sound) as a discrete variable and build a model of the joint distribution over phones. In Healthcare, we can build a model that picks up latent features in X-rays that indicate a disease.
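To make the Linguistics abstraction a little more concrete, here is a minimal sketch of estimating a joint distribution over adjacent phones by counting pairs. The toy words, the phone symbols and the function name `joint_phone_distribution` are all made up for illustration; a real model would use a proper phonetic corpus and some smoothing.

```python
from collections import Counter

# Hypothetical toy data: each word is a sequence of phone symbols (illustrative only).
words = [
    ["k", "ae", "t"],
    ["b", "ae", "t"],
    ["k", "ae", "b"],
]

def joint_phone_distribution(sequences):
    """Estimate P(previous phone, next phone) from counts of adjacent pairs."""
    pair_counts = Counter()
    for seq in sequences:
        # zip(seq, seq[1:]) yields each phone together with the one that follows it
        pair_counts.update(zip(seq, seq[1:]))
    total = sum(pair_counts.values())
    return {pair: count / total for pair, count in pair_counts.items()}

joint = joint_phone_distribution(words)
for pair, prob in sorted(joint.items()):
    print(pair, round(prob, 3))
```

Counting adjacent pairs is the simplest possible choice here. It ignores longer-range context, which is exactly the kind of thing domain knowledge would tell you whether you can get away with.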
But dude, it’s tough.
To be a successful machine learning researcher you really have to be willing to put the time and effort into fully immersing yourself in the domain knowledge. Many of the game-changers in the field have broken ground in areas that they already had experience in. DeepMind’s founder Demis Hassabis ran a games company before returning to study Neuroscience at UCL, which ultimately led to his work in Reinforcement Learning and his advances in games like Atari and Go.
Not all of us are as fortunate as Demis in having a background in the field that we’re trying to revolutionise. Often we’ll be at work and a project comes up that we have to try to figure out, and next week we may have another task. Project switching has its pros and cons, but ultimately the depth you can reach suffers.
It definitely helps if you know a little bit about your niche before you apply some ML, but for what it’s worth, I sympathise with your struggles.

Difficulty 2: Identifying and Ignoring the Noise
Noise is second to none in statistics, machine learning and data science. Honestly, it’s everywhere: from dirty data, to rogue data points, to literature built on weak foundations, to models capturing latent bias.
Machine Learning models are generally fit by minimising the sum of squared errors (or some form of misclassification measure), but when you’re researching a new topic or getting feedback from a colleague, noise can be pretty hard to define. The last thing you want to do is go chasing down a rabbit hole.
There are a few ways to get around it:
- Speak to reliable people often, keep them close
- Learn how to spot nonsense, keep it at a distance
- Fail often, fail quickly
Experiment more, speak to people more, try more things and eventually you’ll begin to recognise and ‘smell’ noise. You’ll avoid it, and progress more quickly.
As an example: many algorithms have a high accuracy rating simply because the event in the dependent variable happens so infrequently. E.g. a model which predicts whether people in London get struck by lightning on a given day will almost certainly be 99.9999% correct without any training, just by always predicting “no”. The “noise” here is the headline accuracy; the skill is in recognising that people don’t get struck by lightning that often, and adjusting your model for it.
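The lightning example can be sketched in a few lines. The numbers below are hypothetical and chosen purely for illustration (one strike in a million person-days), and the “model” is just a constant prediction rather than anything trained. Recall is my own choice of a second metric to put next to accuracy; the point is only that the two tell very different stories.

```python
# Hypothetical numbers for illustration only: 1,000,000 person-days, 1 lightning strike.
n_days = 1_000_000
labels = [0] * n_days
labels[123_456] = 1  # the single positive case

# A "model" that has learned nothing: it always predicts "no strike".
predictions = [0] * n_days

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def recall(y_true, y_pred):
    """Fraction of actual positives that the model caught."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    actual_pos = sum(y_true)
    return true_pos / actual_pos if actual_pos else 0.0

acc = accuracy(labels, predictions)
rec = recall(labels, predictions)
print(f"accuracy = {acc:.6f}, recall = {rec:.1f}")
```

The constant model scores almost perfectly on accuracy while never catching the one event you actually care about, which is why a metric that looks at the rare class is worth checking before celebrating.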
Difficulty 3: Getting a Good Education

Education is so important in this field because the domain of knowledge required is so broad. From computer science, to maths, to algorithms, to statistics: there’s a lot to cover in a relatively short amount of time.
Formal education (like University) is one thing, but education in machine learning goes well beyond that. Practitioners have to develop the ability to quickly learn things themselves and to implement them well.
The reason why this is so important (and so difficult) is that it’s tempting at times to find a GitHub repository where someone else has spent some time solving the same problem you have, pull their code and apply it to your problem. The solution may look OK, but plenty of things can get missed along the way, and there’s no substitute for having the fundamental understanding.
Difficulty 4: Publishing Negative Results
Negative results happen all the time. They’re hard to take, but they happen. You have to recognise that negative results are also results and that they should be welcomed.
Machine Learning has two sides to it: the theoretical and the applied. Theorists will publish less frequently in the hope of making a bigger splash, while applied academics will tend to publish more often but solve bigger problems.
However, in the pursuit of experimentation or of publishing, a lot of negative results are put to one side and barely discussed. This leads to other practitioners repeating the same experiments and, in aggregate, a lot of time is wasted. This inefficiency also breeds a form of ego where people are respected only for the ‘positive’ results they’ve discovered, rather than for the results they can confirm to be simply incomplete.
Everyone benefits if we can classify problems better.

Difficulty 5: Keeping on Top of the Research
Did I mention that there’s a lot of it?
Google has published over 340 publications THIS YEAR as of writing this article.
Google don’t mess around either: their research is always very good. And that’s before counting all the other publications and Universities in the world. How am I meant to keep on top of all this research?
You kind of…just…have to find a way.
I read a lot and spend most of my day looking out for new approaches and methodologies to solve the problems I’m facing. But at times you can get lost in a swathe of research, or not even find the right articles, because there’s so much of it that it’s hard to identify what’s useful.
Using citations is a great way to filter research, and staying on top of the most cited papers every year definitely helps. But to find an ‘edge’ or to discover ‘novel’ applications of models, you just have to do the leg work and read as much as you can.
Ultimately, and in my opinion, to be a successful Machine Learning Researcher or Data Scientist you need to be able to teach yourself. You just have to find a reason to know how a neural network works or why a Random Forest sucks in some cases, and use this to drive your understanding.
The reason is that it’s such a multi-disciplinary subject, and it moves in leaps and bounds every year. I graduated from my master’s program in 2016 and since then the whole AI sphere has been reinvented 3 times over.
Thanks for reading! If you have any messages, please let me know!
Translated from: https://medium.com/swlh/how-hard-is-it-to-be-a-real-data-scientist-85ab88f451f