Does Your Data Let You Tell the Real Story?

Many of us are passionate about Data Analytics. Many love matplotlib and Seaborn. Many enjoy designing and working on classifiers. We are quick to grab a data set, launch Jupyter Notebook, import pandas and NumPy, and get to work. But wait a minute!

We may be great narrators, but it's important to check the facts before we get on stage. In other words, you may be an excellent data wrangler and analyst, but poor quality data can lead you to poor quality observations. Now, what is Good Quality Data?

There are many factors that measure and define Good Quality Data: Accuracy, Completeness, Timeliness, and Reliability, to name a few. Some may say that a data set with no null values, missing data, or duplicate information is Good Quality Data. Today, I would like to draw your attention to two easily overlooked yet very important questions. How well does the data set represent your problem? Is it free of bias?

Let me explain with a quick example. You are trying to see whether both genders are equally prone to diabetes. Diabetes is often described as a lifestyle disease. Let us assume that the person who collected the data ended up reaching out to middle-aged women who do not engage in any form of physical exercise and have unhealthy eating habits. Say 75 out of 100 of these women were diabetic. This person also approached 50 men who work 8 hours a day on a construction site and are always on their feet. 5 out of 50 were diabetic. As analysts, if we do not inspect the data well before working with it, the result can be catastrophic. One can very easily state that 75 percent of the women were diabetic while the number was 10 percent for men, and wrongly conclude that women are more prone to diabetes than men.
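
The arithmetic behind that naive comparison can be sketched in a few lines of Python. The counts come from the example above; the group labels, the function name, and the dictionary layout are illustrative assumptions.

```python
# Counts taken from the example above: (diabetic, total) per surveyed group.
# The group labels and the data layout are illustrative assumptions.
survey = {
    "women (sedentary, unhealthy diet)": (75, 100),
    "men (construction workers)": (5, 50),
}


def diabetes_rate(diabetic, total):
    """Return the share of diabetic people in a group as a percentage."""
    return 100.0 * diabetic / total


rates = {group: diabetes_rate(*counts) for group, counts in survey.items()}

for group, rate in rates.items():
    print(f"{group}: {rate:.0f}% diabetic")
```

The percentages themselves are computed correctly. The trouble is that gender and lifestyle are completely entangled in this sample, so the gap between the two numbers cannot be attributed to gender alone.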

While I kept the data set very simple, there are still big takeaways. The data set should have included people from diverse backgrounds for each gender. It should have included an equal number of samples for both genders. Factors like age, income, geography, level of physical activity, food habits, and other diagnosed diseases, among others, could tell a different story, and each of them viewed in isolation can tell a different tale. Depending on your problem statement, the right sample should be chosen to arrive at meaningful and sound conclusions.
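
One way to catch this kind of gap before any analysis is to count how many people fall into each combination of the factors you care about. The following is a minimal sketch using only the standard library. The sample mirrors the diabetes example (100 sedentary women, 50 active men); the labels "sedentary" and "active" and the function `coverage_report` are hypothetical names chosen for illustration.

```python
from collections import Counter
from itertools import product

# The sample from the diabetes example, one (gender, activity level) pair per person.
# The labels "sedentary" and "active" are illustrative assumptions.
sample = [("female", "sedentary")] * 100 + [("male", "active")] * 50


def coverage_report(records):
    """Count people per (gender, activity) cell and list the cells with no data."""
    counts = Counter(records)
    genders = sorted({gender for gender, _ in records})
    activities = sorted({activity for _, activity in records})
    missing = [cell for cell in product(genders, activities) if counts[cell] == 0]
    return counts, missing


counts, missing = coverage_report(sample)
per_gender = Counter(gender for gender, _ in sample)

print("people per gender:", dict(per_gender))
print("cells with no data:", missing)
```

Here the report shows unequal group sizes and two empty cells (active women and sedentary men), which is a signal that the sample cannot separate the effect of gender from the effect of lifestyle.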

Let me give another example, this time using the K-Nearest Neighbor (KNN) classification algorithm. For those of you who are not very familiar with the term, KNN classifies an object of unknown class/type into one of the categories present in the data set. The algorithm is first trained on data points (objects) with known classes/types and is then used to classify new objects. To classify a point, KNN calculates the distance (Euclidean, in this example) from the new point to the training points and picks the K closest neighbors, where K is a given value. The new object is assigned the class/type that receives the most votes among those K neighbors.
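
The voting procedure described above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the function name `knn_classify` and the toy points are hypothetical, and details such as tie handling and feature scaling are left out.

```python
import math
from collections import Counter


def knn_classify(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every labelled training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours and count the votes per class.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 6.5), (6.5, 7.0)]
labels = ["A", "A", "A", "B", "B", "B"]

print(knn_classify(points, labels, query=(1.8, 1.4), k=3))
print(knn_classify(points, labels, query=(6.4, 6.2), k=3))
```

With balanced, well-separated classes like these, the vote is unambiguous. The interesting cases arise when one class has far fewer samples than the other.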

Figure: K-Nearest Neighbor Classifier

In the picture above, we see that X should be classified as a Green Circle. If K=1, we get Class = Green Circle. When we set K=13, the object inevitably gets classified as a Blue Square. While that could be the right classification in some data sets, in this example it is not. The Green Circle samples were fewer in number, which is why they were out-voted and the object was incorrectly classified.
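
The out-voting effect is easy to reproduce on a small imbalanced data set. The sketch below uses hypothetical coordinates that are not read off the figure: 3 green circles sit right next to the new point, while 12 blue squares lie farther away. The helper `knn_classify` is an illustrative majority-vote implementation using Euclidean distance.

```python
import math
from collections import Counter


def knn_classify(train, query, k):
    """Majority vote among the k training points closest to `query`."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Hypothetical coordinates, not read off the figure: 3 green circles sit
# right next to the new point, 12 blue squares lie farther away.
green = [((0.5, 0.0), "green circle"), ((0.0, 0.5), "green circle"),
         ((-0.5, 0.0), "green circle")]
blue = [((3.0 + 0.1 * i, 3.0), "blue square") for i in range(12)]
train = green + blue
x = (0.0, 0.0)

for k in (1, 3, 13):
    print(f"K={k}: {knn_classify(train, x, k)}")
```

With K=1 or K=3 only the nearby green circles vote. With K=13 the neighborhood contains all 3 green circles plus 10 blue squares, so the minority class loses the vote even though every one of its members is closer to the new point.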

In real life, the conclusions you draw, and the solutions or business decisions you propose based on them, can be make-or-break. Some decisions are highly critical, which makes drawing conclusions from well-represented data more crucial than we realize.

Disclaimer: Choosing the right K value is beyond the scope of this article.

Translated from: https://medium.com/analytics-vidhya/does-your-data-let-you-tell-the-real-story-7c4c7d656a01
