Does Your Data Let You Tell the Real Story?
Many of us are passionate about data analytics. Many love matplotlib and Seaborn. Many enjoy designing and working on classifiers. We are quick to grab a data set, launch Jupyter Notebook, import pandas and NumPy, and get to work. But wait a minute!
We may be great narrators, but it's important to check the facts before we get on stage. In other words, you may be an excellent data wrangler and analyst, but poor-quality data can lead you to poor-quality observations. So, what is good-quality data?
Many factors measure and define good-quality data: accuracy, completeness, timeliness, and reliability, to name a few. Some may say that a data set with no null values, missing data, or duplicate records is good-quality data. Today, I would like to draw your attention to two easily overlooked yet very important questions. How well does the data set represent your problem? Is it free of bias?
Let me explain with a quick example. Suppose you want to find out whether men and women are equally prone to diabetes, which is often described as a lifestyle disease. Assume that the person who collected the data reached out to 100 middle-aged women who do no physical exercise and have unhealthy eating habits, and that 75 of them were diabetic. This person also approached 50 men who work 8 hours a day on a construction site and are always on their feet, and 5 of them were diabetic. If we, as analysts, do not inspect the data well before working with it, the result can be catastrophic. It is very easy to state that 75 percent of the women were diabetic against 10 percent of the men, and to conclude, wrongly, that women are more prone to diabetes than men. In this sample gender cannot be separated from lifestyle, so the difference may have nothing to do with gender at all.
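The arithmetic behind that misleading headline can be sketched in a few lines of plain Python. The counts come from the example; the `lifestyle` labels ("sedentary", "active") and all function names are shorthand I am introducing for illustration, not part of any real data set.

```python
from collections import defaultdict

# Counts come from the example above; the "lifestyle" labels are
# hypothetical shorthand introduced for this sketch.
groups = [
    {"gender": "female", "lifestyle": "sedentary", "n": 100, "diabetic": 75},
    {"gender": "male", "lifestyle": "active", "n": 50, "diabetic": 5},
]

def diabetes_rate_by(key):
    """Share of diabetic people for each value of `key`."""
    diabetic = defaultdict(int)
    total = defaultdict(int)
    for g in groups:
        diabetic[g[key]] += g["diabetic"]
        total[g[key]] += g["n"]
    return {value: diabetic[value] / total[value] for value in total}

def lifestyles_seen_per_gender():
    """Which lifestyles were actually sampled for each gender."""
    seen = defaultdict(set)
    for g in groups:
        seen[g["gender"]].add(g["lifestyle"])
    return dict(seen)

rate_by_gender = diabetes_rate_by("gender")
rate_by_lifestyle = diabetes_rate_by("lifestyle")

# Lifestyles observed for every gender. If this set is empty, there is no
# like-for-like group in which the genders can be compared.
shared_lifestyles = set.intersection(*lifestyles_seen_per_gender().values())

print("rate by gender:   ", rate_by_gender)
print("rate by lifestyle:", rate_by_lifestyle)
print("lifestyles shared by both genders:", shared_lifestyles)
```

The rates by gender and the rates by lifestyle come out identical, and no lifestyle is shared by both genders, so this sample gives us no way to tell which of the two factors drives the difference.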
Although I kept the data set very simple, it still offers big takeaways. The data set should have included people from diverse backgrounds for each gender, and an equal number of samples for both genders. Factors such as age, income, geography, level of physical activity, food habits, and other diagnosed diseases could each tell a different story, alone or in combination. Depending on your problem statement, choose a sample of data that represents it well enough to support meaningful and sound conclusions.
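One common way to get an equal number of samples per group is stratified sampling: split the available records by the factors you care about and draw the same number from each group. Below is a minimal sketch using only the standard library. The record fields (`gender`, `activity`, `id`), the pool itself, and the function name are all hypothetical and exist only to illustrate the idea.

```python
import random
from collections import defaultdict

def stratified_sample(records, strata_keys, per_stratum, seed=0):
    """Draw `per_stratum` records at random from every combination of `strata_keys`."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    strata = defaultdict(list)
    for rec in records:
        strata[tuple(rec[k] for k in strata_keys)].append(rec)

    sample = []
    for stratum in sorted(strata):
        members = strata[stratum]
        if len(members) < per_stratum:
            # An under-filled stratum is itself a warning about representation.
            raise ValueError(f"stratum {stratum} has only {len(members)} records")
        sample.extend(rng.sample(members, per_stratum))
    return sample

# Hypothetical pool: 40 records for each gender/activity combination.
pool = [
    {"gender": gender, "activity": activity, "id": i}
    for gender in ("female", "male")
    for activity in ("low", "high")
    for i in range(40)
]

balanced = stratified_sample(pool, ("gender", "activity"), per_stratum=10)
print(len(balanced), "records drawn")
```

Raising an error for an under-filled group is a deliberate choice in this sketch: silently drawing fewer records would bring back the very imbalance we are trying to avoid.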
Let me give another example, this time with the K-Nearest Neighbors (KNN) classification algorithm. For those who are not very familiar with the term, KNN assigns an object of unknown class/type to one of the classes present in the data set. The algorithm is first given data points (objects) with known classes and is then used to classify new objects. To classify a point, KNN computes its distance (commonly the Euclidean distance) to the known points, picks the K closest neighbors, where K is a value you choose, and lets them vote. The new object is assigned the class with the most votes.
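The description above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production classifier: the function name `knn_predict`, the toy points, and the tie-breaking behavior (a tie goes to the label seen first among the nearest neighbors) are my own assumptions.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Return the majority label among the k training points closest to `query`.

    `train` is a list of ((x, y), label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy training data, invented for this sketch: two well-separated clusters.
train = [
    ((1.0, 1.0), "green circle"),
    ((1.5, 2.0), "green circle"),
    ((2.0, 1.5), "green circle"),
    ((6.0, 6.0), "blue square"),
    ((6.5, 7.0), "blue square"),
    ((7.0, 6.0), "blue square"),
]

print(knn_predict(train, (2.0, 2.0), k=3))
print(knn_predict(train, (6.0, 6.5), k=3))
```

Note that "training" here amounts to keeping the labeled points around. All the work happens at prediction time, when distances to the query are computed.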

In the above picture, we can see that X should be classified as a Green Circle. With K=1, we get Class = Green Circle. With K=13, the object inevitably gets classified as a Blue Square. In some data sets that could be the right classification, but in the above example it is not. There are fewer Green Circle samples, which is why they were out-voted and the object was classified incorrectly.
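To see the out-voting effect in numbers, here is a small sketch with made-up coordinates (they are not taken from the picture): 3 Green Circles sit right next to X, while 10 Blue Squares lie farther away. The helper name `neighbor_votes` is hypothetical.

```python
import math
from collections import Counter

def neighbor_votes(train, query, k):
    """Count the labels of the k training points closest to `query` (Euclidean)."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest)

# Invented layout: the minority class is close to X, the majority class is not.
green = [((0.5, 0.0), "green circle"), ((0.0, 0.6), "green circle"),
         ((-0.6, 0.2), "green circle")]
blue = [((x, y), "blue square") for x, y in [
    (2.0, 0.0), (2.0, 1.0), (2.0, -1.0), (0.0, 2.5), (1.0, 2.5),
    (-1.0, 2.5), (-2.5, 0.0), (-2.5, 1.0), (0.0, -2.5), (1.0, -2.5),
]]
train = green + blue
x_point = (0.0, 0.0)

for k in (1, 5, 7, 13):
    votes = neighbor_votes(train, x_point, k)
    winner = votes.most_common(1)[0][0]
    print(f"K={k:2d}  votes={dict(votes)}  ->  {winner}")
```

With only 3 Green Circles in the data, any K of 7 or more forces at least 4 Blue Squares into the vote, so the minority class can never win no matter how close its points are to X.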
In real life, the conclusions you draw, and the solutions or business decisions you propose based on them, can be make-or-break. Some decisions are highly critical, which makes drawing conclusions from well-represented data more crucial than we realize.
Disclaimer: Choosing the right K value is beyond the scope of this article.
Original article: https://medium.com/analytics-vidhya/does-your-data-let-you-tell-the-real-story-7c4c7d656a01