Critical thinking
As Alexander Pope said, to err is human. By that metric, who is more human than us data scientists? We devise wrong hypotheses constantly and then spend time working on them just to find out how wrong we were.
When looking at mistakes from an experiment, a data scientist needs to be critical, always on the lookout for something that others may have missed. But sometimes, in our day-to-day routine, we can easily get lost in little details. When this happens, we often fail to look at the overall picture, ultimately failing to deliver what the business wants.
Our business partners have hired us to generate value. We won’t be able to generate that value unless we develop business-oriented critical thinking, including having a more holistic perspective of the business at hand. So here is some practical advice for your day-to-day work as a data scientist.
1. Beware of clean data syndrome
Tell me how many times this has happened to you: You get a data set and start working on it straight away. You create neat visualizations and start building models. Maybe you even present automatically generated descriptive analytics to your business counterparts!
But do you ever ask, “Does this data actually make sense?”
Incorrectly assuming that the data is clean could lead you toward very wrong hypotheses. Not only that, but you’re also missing an important analytical opportunity with this assumption.
You can actually discern a lot of important patterns by looking at discrepancies in the data. For example, if you notice that a particular column has more than 50 percent of its values missing, you might think about dropping the column. But what if the values are missing because the data collection instrument is faulty? By calling attention to this, you could help the business improve its processes.
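A minimal sketch of that check, using only the Python standard library. The rows, the column names, and the helper names `missing_share` and `columns_to_question` are hypothetical and chosen purely for illustration; the 50 percent threshold comes from the example above.

```python
# Hypothetical rows, e.g. as loaded from a CSV file; None marks a missing value.
rows = [
    {"customer_id": 1, "age": 34, "sensor_reading": None},
    {"customer_id": 2, "age": None, "sensor_reading": None},
    {"customer_id": 3, "age": 51, "sensor_reading": 0.7},
    {"customer_id": 4, "age": 29, "sensor_reading": None},
]


def missing_share(rows):
    """Return the fraction of missing (None) values for each column."""
    columns = rows[0].keys()
    return {
        col: sum(row[col] is None for row in rows) / len(rows)
        for col in columns
    }


def columns_to_question(rows, threshold=0.5):
    """List columns whose share of missing values exceeds the threshold."""
    return [col for col, share in missing_share(rows).items() if share > threshold]


shares = missing_share(rows)
suspicious = columns_to_question(rows)
print(shares)
print("Ask the data owner about:", suspicious)
```

The point of returning a list of columns to question, rather than dropping them automatically, is that the next step is a conversation with whoever owns the data collection, not a silent cleanup.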
Or what if you’re given a distribution of customers that shows a ratio of 90 percent men versus 10 percent women, but the business is a cosmetics company that predominantly markets its products to women? You could assume you have clean data and show the results as is, or you can use common sense and ask the business partner if the labels are switched.
Such errors are widespread. Catching them not only improves future data collection but also keeps other teams from using bad data, which in turn keeps the company from making wrong decisions.
2. Be on the lookout for something out of the ordinary

You probably know fab.com. If you don’t, it’s a website that sells selected health and fitness items. But the site’s origins weren’t in e-commerce. Fab.com started as Fabulis.com, a social networking site for gay men. One of the site’s most popular features was called the “Gay Deal of the Day.”
One day, the deal was for hamburgers. Half of the deal’s buyers were women, despite the fact that they weren’t the site’s target users. This fact caused the data team to realize that they had an untapped market for selling goods to women. So Fabulis.com changed its business model to serve this newfound market.
Be on the lookout for something out of the ordinary. Be ready to ask questions. If you see something unexpected in the data, you may have struck gold. Data can help a business optimize revenue, but sometimes it has the power to change the direction of the company as well.

Another famous example of this is Flickr, which started out as a multiplayer game. Only when the founders noticed that people were using it as a photo upload service did the company pivot to the photo-sharing app we know it as today.
Try to see patterns that others would miss. Do you see a discrepancy in some buying patterns or maybe something you can’t seem to explain? That might be an opportunity in disguise when you look through a wider lens.
3. Focus on the right metrics
What do we want to optimize for?
Most businesses fail to answer this simple question.
Every business problem is a little different and should, therefore, be optimized differently. For example, a website owner might ask you to optimize for daily active users. Daily active users is a metric defined as the number of people who open a product on a given day.
But is that the right metric? Maybe not. In reality, it’s just a vanity metric, meaning one that makes you look good but doesn’t serve any purpose when it comes to actionability. This metric will always increase if you are spending marketing dollars across various channels to bring more and more customers to your site.
Instead, I would recommend optimizing the percentage of users who are active, which gives a better idea of how the product is performing. A big marketing campaign might bring a lot of users to the site, but if only a few of them become active users, the campaign was a failure and the site’s stickiness is very low. The percentage of active users measures that stickiness; the raw count of daily active users does not. If the percentage of active users is increasing, that suggests that users like the website.
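To make the difference between the two metrics concrete, here is a small sketch with made-up numbers. The snapshot figures and the `active_share` helper are assumptions for illustration only, not data from any real site.

```python
# Hypothetical snapshots: (registered users, daily active users).
snapshots = {
    "before_campaign": (10_000, 2_000),
    "after_campaign": (50_000, 4_000),
}


def active_share(registered, active):
    """Fraction of registered users who were active on the day."""
    return active / registered


report = {
    name: {"dau": active, "active_share": active_share(registered, active)}
    for name, (registered, active) in snapshots.items()
}

for name, metrics in report.items():
    print(f"{name}: DAU={metrics['dau']}, active share={metrics['active_share']:.0%}")
```

In this invented scenario, daily active users double after the campaign while the share of active users falls from 20 percent to 8 percent, so the two metrics tell opposite stories about the same campaign.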
Another example of looking at the wrong metric happens when we create classification models. We often try to increase accuracy for such models. But do we really want accuracy as a metric of our model performance?

Imagine that we’re predicting the number of asteroids that will hit the Earth. If we want to optimize for accuracy, we can just say zero all the time, and we will be 99.99 percent accurate. That 0.01 percent error could be hugely impactful, though. What if that 0.01 percent is a planet-killing-sized asteroid? A model can be reasonably accurate but not at all valuable. A better metric would be the F score, which would be zero in this case, because the recall of such a model is zero as it never predicts an asteroid hitting the Earth.
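The asteroid argument can be checked with a few lines of code. This is a sketch under stated assumptions: the labels are synthetic (one impact in 10,000 observations, matching the 99.99 percent figure), the helper names are hypothetical, and precision is set to zero by convention when the model never predicts the positive class.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = impact)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    return tp, fp, fn, tn


def scores(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 for binary predictions."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    # Convention (an assumption here): undefined ratios are reported as 0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Synthetic labels: a single impact among 10,000 observations.
y_true = [0] * 9_999 + [1]
always_zero = [0] * 10_000  # the "model" that never predicts an impact

result = scores(y_true, always_zero)
print(result)
```

The always-zero model scores 0.9999 on accuracy and 0 on both recall and F1, which is exactly the gap between "reasonably accurate" and "valuable" described above.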
When it comes to data science, designing the project and choosing the metrics we use for evaluation are much more important than the modeling itself. The metrics need to reflect the business goal, and aiming for the wrong goal effectively destroys the whole purpose of modeling. For example, F1 or PR AUC (the area under the precision-recall curve) is a better metric for asteroid prediction because both take into account the precision and the recall of the model. If we optimize for accuracy, our whole modeling effort could be in vain.
4. Remember: Statistics mislead sometimes
Be skeptical of any statistics that get quoted to you. Statistics have been used to lie in advertisements, in workplaces, and in a lot of other areas in the past. People will do anything to get sales or promotions.

For example, do you remember Colgate’s claim that 80 percent of dentists recommended their brand? This statistic seems pretty good at first. If so many dentists use Colgate, I should too, right?
It turns out that during the survey, the dentists could choose multiple brands rather than just one. So other brands could be just as popular as Colgate.
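A short sketch shows why a multi-select survey can make several brands look dominant at once. The responses and the brand names other than Colgate are invented for illustration, and `recommendation_share` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical multi-select answers: each dentist may recommend several brands.
responses = [
    {"Colgate", "BrandB"},
    {"Colgate", "BrandB", "BrandC"},
    {"BrandB"},
    {"Colgate", "BrandC"},
    {"Colgate", "BrandB"},
]


def recommendation_share(responses):
    """Share of respondents who mentioned each brand (shares can sum to over 1)."""
    counts = Counter(brand for answer in responses for brand in answer)
    return {brand: count / len(responses) for brand, count in counts.items()}


brand_shares = recommendation_share(responses)
print(brand_shares)
print("Sum of shares:", sum(brand_shares.values()))
```

In this toy data, two brands are each recommended by 80 percent of respondents, so the headline number for one brand says nothing about how it compares with the others.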

Marketing departments are just myth creation machines. We often see such examples in our daily lives. Take, for example, this 1992 ad from Chevrolet. Looking at just the graph and not at the axis labels, you would think Nissan/Datsun must be dreadful truck manufacturers.
In fact, the graph indicates that more than 95 percent of the Nissan and Datsun trucks sold in the previous 10 years were still running. And the small difference might just be due to sample sizes and the types of trucks sold by each of the companies. As a general rule, never trust a chart that doesn’t label the Y-axis.
As a part of the ongoing pandemic, we’re seeing even more such examples with a lot of studies promoting cures for COVID-19. This past June in India, a man claimed to have made medicine for coronavirus that cured 100 percent of patients in seven days. This news predictably caused a big stir, but only after he was asked about the sample size did we understand what was actually happening here.
With a sample size of 100, the claim was utterly ridiculous on its face.
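One way to see why 100 patients cannot support a "100 percent" claim is to ask how high the true failure rate could be while still producing zero failures in the sample. The sketch below assumes independent, randomly sampled patients (which, as the next paragraph explains, was not even the case here) and uses the exact one-sided bound for zero observed failures; `max_failure_rate` is a hypothetical helper name.

```python
def max_failure_rate(n, confidence=0.95):
    """Largest true failure rate still consistent, at the given one-sided
    confidence level, with observing zero failures in n independent trials."""
    # Zero failures in n trials has probability (1 - p) ** n.
    # Solve (1 - p) ** n = 1 - confidence for p.
    return 1 - (1 - confidence) ** (1 / n)


for n in (10, 100, 1_000, 10_000):
    print(f"n={n}: failure rate could still be up to {max_failure_rate(n):.2%}")
```

For n = 100 the bound is close to 3 percent, in line with the familiar "rule of three" approximation 3/n, so even a flawless result on 100 well-sampled patients would leave a meaningful failure rate on the table.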
Worse, the way the sample was selected was hugely flawed. His organization selected asymptomatic and mildly symptomatic participants with a mean age between 35 and 45 and no pre-existing conditions. I was dumbfounded: this was not even a random sample. So not only was the study useless, it was actually unethical.
When you see charts and statistics, remember to evaluate them carefully. Make sure the statistics were sampled correctly and are being used in an ethical, honest way.
5. Don’t give in to fallacies
During the summer of 1913 in a casino in Monaco, gamblers watched in amazement as the roulette wheel landed on black an astonishing 26 times in a row. Reasoning that red and black are equally likely, they were confident that red was “due.” But each spin is independent of the ones before it, so red was no more likely on the next spin than on any other. It was a field day for the casino and a perfect example of the gambler’s fallacy, a.k.a. the Monte Carlo fallacy.
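A quick simulation makes the fallacy visible. This sketch assumes a single-zero (European) wheel with 18 black, 18 red, and 1 green pocket and a fixed random seed; the function names are hypothetical. It estimates how often red follows a streak of blacks and compares that with the unconditional chance of red.

```python
import random

P_BLACK = 18 / 37  # assumption: single-zero wheel, 18 of 37 pockets are black


def spin(rng):
    """Return 'black', 'red', or 'green' for one spin of the wheel."""
    pocket = rng.randrange(37)
    if pocket == 0:
        return "green"
    return "black" if pocket <= 18 else "red"


def red_rate_after_black_streak(n_spins, streak, seed=0):
    """Share of reds among spins that directly follow `streak` or more blacks."""
    rng = random.Random(seed)
    run = 0        # current number of consecutive blacks
    followers = 0  # spins that came right after a long enough black streak
    reds = 0
    for _ in range(n_spins):
        outcome = spin(rng)
        if run >= streak:
            followers += 1
            reds += outcome == "red"
        run = run + 1 if outcome == "black" else 0
    return reds / followers


overall_red = 18 / 37
after_streak = red_rate_after_black_streak(500_000, streak=5)
p_26_blacks = P_BLACK ** 26

print(f"P(red) on any spin:          {overall_red:.3f}")
print(f"P(red) after 5 blacks (sim): {after_streak:.3f}")
print(f"P(26 blacks in a row):       {p_26_blacks:.2e}")
```

The simulated rate after a streak should land close to the unconditional rate, because the wheel has no memory. A run of 26 blacks is extremely unlikely in advance, but once 25 have happened, the 26th is just another spin.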
This happens in everyday life outside of casinos too. People tend to avoid long strings of the same answer. Sometimes they do so while sacrificing accuracy of judgment for the sake of getting a pattern of decisions that look fairer or more probable. For example, an admissions office may reject the next application they see if they have approved three applications in a row, even if the application should have been accepted on merit.
The world works on probabilities. There are more than seven billion of us, each experiencing some event every second of our lives. Because of that sheer volume, rare events are bound to happen. But we shouldn’t put our money on them.
Think also of the spurious correlations we end up seeing regularly. This particular graph suggests that organic food sales cause autism. Or is it the opposite? Just because two variables move in tandem doesn’t necessarily mean that one causes the other. Correlation does not imply causation, and as data scientists, it is our job to be on the lookout for such fallacies, biases, and spurious correlations. We can’t allow oversimplified conclusions to cloud our work.
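A common source of spurious correlation is a shared trend: two unrelated quantities that both grow over time will correlate strongly. The sketch below uses purely synthetic series (not real sales or health data) with independent noise and a fixed seed, and compares the correlation of the raw levels with the correlation of their period-to-period changes. The helper names are hypothetical.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def differences(series):
    """Period-to-period changes, which remove a linear trend."""
    return [b - a for a, b in zip(series, series[1:])]


rng = random.Random(42)
# Two synthetic, causally unrelated series that both happen to grow over time.
series_a = [1.0 * t + rng.gauss(0, 20) for t in range(1_000)]
series_b = [2.0 * t + rng.gauss(0, 20) for t in range(1_000)]

level_corr = pearson(series_a, series_b)
change_corr = pearson(differences(series_a), differences(series_b))

print(f"Correlation of levels:  {level_corr:.3f}")
print(f"Correlation of changes: {change_corr:.3f}")
```

The levels correlate almost perfectly even though neither series influences the other, while the changes show little or no relationship. Differencing is only one simple check; it does not prove or disprove causation on its own.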
Data scientists have a big role to play in any organization. A good data scientist must be both technical and business-driven to do the job well. Thus, we need to make a conscious effort to understand the business’s needs while also polishing our technical skills.
Continue learning
If you want to learn more about how to apply data science in a business context, I would recommend the AI for Everyone course by Andrew Ng, which focuses on spotting opportunities to apply AI to problems in your own organization, working with an AI team, and building an AI strategy in your company.
Thanks for the read. I am going to be writing more beginner-friendly posts in the future too. Follow me on Medium or subscribe to my blog to be informed about them. As always, I welcome feedback and constructive criticism and can be reached on Twitter @mlwhiz.
Translated from: https://medium.com/swlh/why-critical-thinking-skills-are-essential-for-data-scientists-e9a16634ac8