pd種知道每個數據的類型_每個數據科學家都應該知道的5個概念

pd種知道每個數據的類型

意見 (Opinion)

目錄 (Table of Contents)

  1. Introduction

    介紹
  2. Multicollinearity

    多重共線性
  3. One-Hot Encoding

    一站式編碼
  4. Sampling

    采樣
  5. Error Metrics

    錯誤指標
  6. Storytelling

    評書
  7. Summary

    摘要

介紹 (Introduction)

I have written about common skills that Data Scientists can expect to use in their professional careers, so now I want to highlight some key concepts of Data Science that can be beneficial to know and later employ. I may be discussing some that you know already, and some that you do not know; my goal is to provide some professional explanation of why these concepts are beneficial regardless of what you do know now. Multicollinearity, one-hot encoding, undersampling and oversampling, error metrics, and lastly, storytelling, are the key concepts I think of first when thinking of a professional Data Scientist in their day-to-day. The last point, perhaps, is a combination of skill and a concept but wanted to highlight, still, its importance on your everyday work life as a Data Scientist. I will expound upon all of these concepts down below.

我已經寫了關于數據科學家可以在其職業生涯中期望使用的常見技能的文章,所以現在我想重點介紹一些數據科學的關鍵概念,這些知識可能有益于知識并在以后使用。 我可能正在討論您已經知道的一些,以及您不知道的一些。 我的目標是提供一些專業的解釋,說明無論您現在知道什么,這些概念為何都是有益的。 多重共線性,單次編碼,欠采樣和過采樣,錯誤度量,以及講故事,是我每天在考慮專業數據科學家時首先想到的關鍵概念。 最后一點也許是技巧和概念的結合,但仍然想強調它對您作為數據科學家的日常工作的重要性。 我將在下面詳細說明所有這些概念。

多重共線性 (Multicollinearity)

Image for post
Photo by The Creative Exchange on Unsplash [2].
圖片由The Creative Exchange在Unsplash上??提供[2]。

Although the word is somewhat long and hard to say, when you break it down, multicollinearity is simple. Multi meaning many, and collinearity meaning linearly related. Multicollinearity can be described as the situation when two or more explanatory variables explain similar information or are highly related in a regression model. There are a few reasons this concept can raise a concern.

盡管這個詞有點長且很難說,但將其分解時,多重共線性很簡單。 多含義很多,共線性含義線性相關。 多重共線性可以描述為當兩個或多個解釋變量解釋相似信息或在回歸模型中高度相關時的情況。 此概念引起關注的原因有幾個。

For some modeling techniques, it can cause overfitting and ultimately a decline in model performance.

對于某些建模技術,它可能導致過度擬合并最終導致模型性能下降。

The data becomes redundant and not each feature or attribute is needed in your model. Therefore, there are some ways to find out which features you should remove that constitute multicollinearity.

數據變得多余,并且模型中不需要每個功能或屬性。 因此,有一些方法可以找出應刪除構成多重共線性的特征。

  • variance inflation factor (VIF)

    方差膨脹因子(VIF)

  • correlation matrices

    相關矩陣

These two techniques are commonly used amongst Data Scientists, especially correlation matrices and plots — usually visualized with a heatmap of some sort, while VIF is lesser-known.

數據科學家通常使用這兩種技術,尤其是相關矩陣和圖-通常以某種形式的熱圖可視化,而VIF則鮮為人知。

The higher the VIF value, the less usable the feature is for your regression model.

VIF值越高,該功能對您的回歸模型的使用就越少。

A great, simple resource for VIF is [3]:

VIF的一個很好的簡單資源是[3]:

一站式編碼 (One-Hot Encoding)

This form of feature transformation in your model is called one-hot encoding. You want to represent your categorical features numerically by encoding them. Whereas the categorical features have text values themselves, one-hot encoding transposes that information so that each value becomes the feature and the observation in the row is either denoted as a 0 or 1. For example, if we have the categorical variable gender, the numerical representation after one-hot encoding would look like (gender before, and male/female after):

模型中這種形式的特征轉換稱為單次編碼。 您想通過編碼來以數字方式表示分類特征。 盡管分類要素本身具有文本值,但是一鍵編碼會轉置該信息,以便每個值都成為要素,并且該行中的觀察值將表示為0或1。例如,如果我們擁有分類變量sex ,則一鍵編碼后的數字表示看起來像( 性別之前和之后的男性/女性 ):

Image for post
Before and after one-hot encoding. Screenshot by Author [4].
一鍵編碼之前和之后。 作者[4]的屏幕截圖。

This transformation is useful when you are not just working with numerical features, and need to create that numerical representation with text/categorical features.

當您不僅要使用數字功能并且需要使用文本/分類功能創建該數字表示形式時,此轉換非常有用。

采樣 (Sampling)

When you do not have enough data, oversampling may be suggested as a form of compensation. Say you are working on a classification problem and you have a minority class like the example down below:

當您沒有足夠的數據時,建議使用過采樣作為補償。 假設您正在處理分類問題,并且有一個少數類,如下例所示:

class_1 = 100 rowsclass_2 = 1000 rowsclass_3 = 1100 rows

As you can see, class_1 has a small amount of data for its class, which means your dataset is imbalanced and will be referred to as the minority class. There are several oversampling techniques. One of them is called SMOTE [5], which stands for Synthetic Minority Over-sampling Technique. One of the ways that SMOTE works is by utilizing a K-neighbor method for finding the nearest neighbor to create synthetic samples. There are similar techniques that use the reverse method for undersampling.

如您所見, class_1的類中包含少量數據,這意味著您的數據集不平衡,將被稱為少數類。 有幾種過采樣技術。 其中之一稱為SMOTE [5],代表合成少數族裔過采樣技術 。 SMOTE工作的方法之一是利用K鄰域方法來找到最接近的鄰域以創建合成樣本。 有類似的技術使用反向方法進行欠采樣

These techniques are beneficial when you have outliers in your class or regression data even, and you want to ensure your sampling is the best representation of the data that your model will run on in the future.

當您的類或回歸數據中甚至有異常值時,并且您要確保采樣是模型將在將來運行的數據的最佳表示形式時,這些技術將非常有用。

錯誤指標 (Error Metrics)

There are plenty of error metrics used for both classification and regression models in Data Science. According to sklearn [6], here are some that you can use specifically for regression models:

在數據科學中,分類和回歸模型都有大量錯誤度量標準。 根據sklearn [6],以下是您可以專門用于回歸模型的一些信息:

metrics.explained_variance_score

metrics.explained_variance_score

metrics.max_error

metrics.max_error

metrics.mean_absolute_error

metrics.mean_absolute_error

metrics.mean_squared_error

metrics.mean_squared_error

metrics.mean_squared_log_error

metrics.mean_squared_log_error

metrics.median_absolute_error

metrics.median_absolute_error

metrics.r2_score

metrics.r2_score

metrics.mean_poisson_deviance

metrics.mean_poisson_deviance

metrics.mean_gamma_deviance

metrics.mean_gamma_deviance

The two most popular error metrics for regression from above are MSE and RMSE:

從上方進行回歸分析的兩個最受歡迎的錯誤度量標準是MSE和RMSE:

MSE: the concept is → mean absolute error regression loss (sklearn)

MSE:概念是→平均絕對誤差回歸損失(sklearn)

RMSE: the concept is → mean squared error regression loss (sklearn)

RMSE:概念是→均方誤差回歸損失(sklearn)

For classification, you can expect to evaluate your model’s performance with accuracy and AUC (Area Under the Curve).

對于分類,您可以期望以準確性和AUC(曲線下面積)評估模型的性能。

評書 (Storytelling)

Image for post
Photo by Nong Vang on Unsplash [7].
Nong Vang在《 Unsplash 》上的照片 [7]。

I wanted to add a unique concept of Data Science that is storytelling. I cannot stress enough how important this concept is. It can be seen as a concept or skill, but the label here is not important, what is, is how well you articulate your problem-solving techniques in a business setting. A lot of Data Scientists will focus solely on model accuracy, but will then fail to understand the entire business process. That process includes:

我想添加一個講故事的數據科學獨特概念。 我不能足夠強調這個概念的重要性。 可以將其視為概念或技能,但此處的標簽并不重要,即您在業務環境中表達解決問題技術的能力如何。 許多數據科學家將只專注于模型的準確性,但隨后將無法理解整個業務流程。 該過程包括:

  • what is the business?

    什么事

  • what is the problem?

    問題是什么?

  • why do we need Data Science?

    為什么我們需要數據科學?

  • what is the goal of Data Science here?

    數據科學的目標是什么?

  • when will we get usable results?

    我們什么時候可以獲得可用的結果?

  • how can we apply our results?

    我們如何應用我們的結果?

  • what is the impact of our results?

    我們的結果有什么影響?

  • how do we share our results and overall process?

    我們如何分享我們的結果和整體流程?

As you can see, none of these points are the model itself/improvement in accuracy. The focus here is how you will use data to solve your company's problems. It is beneficial to become acquainted with stakeholders and your non-technical coworkers who you will ultimately be working with. You will also work with Product Managers who will work alongside you in assessing the problem, and Data Engineers to collect the data before even running a base model. At the end of your model process, you will share your results with key individuals who will usually like to see its impact in most likely some type of visual representation (Tableau, Google Slide deck, etc.), so being able to present and communicate is beneficial as well.

如您所見,這些要點都不是模型本身/準確性的提高。 這里的重點是如何使用數據來解決公司的問題。 結識最終將要與之合作的利益相關者和您的非技術合作伙伴是有益的。 您還將與產品經理一起工作,他們將與您一起評估問題,并與數據工程師一起甚至在運行基本模型之前收集數據。 在建模過程的最后,您將與主要人員分享您的結果,這些人員通常希望看到其對某種視覺表示形式( Tableau,Google Slide卡座等 )的影響,從而能夠進行演示和交流也是有益的。

摘要 (Summary)

There are plenty of key concepts Data Scientists, as well as Machine Learning Engineers, should know. Five of them discussed in this article were:

數據科學家以及機器學習工程師應該知道很多關鍵概念。 本文討論的其中五個是:

MulticollinearityOne-hot encodingSamplingErrorStorytelling

Please feel free to comment down below some concepts of Data Science that you focus on daily, or that you think others should know about. Thank you for reading my article, I hope you enjoyed it!

請隨意在以下您每天關注的或您認為其他人應該知道的數據科學概念下進行評論。 感謝您閱讀我的文章,希望您喜歡!

Below are some references and links that can provide more information on the topics discussed in this article.

下面是一些參考和鏈接,它們可以提供有關本文討論的主題的更多信息。

I also want to highlight two other stories I have written which are related to this article, [8] and [9]:

我還想強調我寫的另外兩個與本文有關的故事,[8]和[9]:

These two articles highlight key skills and projects you will need to either know or become familiar with and expect to eventually employ as a professional Data Scientist.

這兩篇文章重點介紹了您需要了解或熟悉的關鍵技能和項目,并期望他們最終成為專業的數據科學家。

翻譯自: https://towardsdatascience.com/5-concepts-every-data-scientist-should-know-16c74d080a83

pd種知道每個數據的類型

本文來自互聯網用戶投稿,該文觀點僅代表作者本人,不代表本站立場。本站僅提供信息存儲空間服務,不擁有所有權,不承擔相關法律責任。
如若轉載,請注明出處:http://www.pswp.cn/news/389731.shtml
繁體地址,請注明出處:http://hk.pswp.cn/news/389731.shtml
英文地址,請注明出處:http://en.pswp.cn/news/389731.shtml

如若內容造成侵權/違法違規/事實不符,請聯系多彩編程網進行投訴反饋email:809451989@qq.com,一經查實,立即刪除!

相關文章

td

單元格td設置padding,而不能設置margin。轉載于:https://www.cnblogs.com/fpcbk/p/9617629.html

清除浮動的幾大方法

對于剛接觸到html的一些人經常會用到浮動布局,但對于浮動的使用和清除浮動來說是大為頭痛的,在這里介紹幾個關于清除浮動的的方法。如果你說你要的就是浮動為什么要清除浮動的話,我就真的無言以對了,那你就當我沒說。 關于我們在布…

xgboost keras_用catboost lgbm xgboost和keras預測財務交易

xgboost kerasThe goal of this challenge is to predict whether a customer will make a transaction (“target” 1) or not (“target” 0). For that, we get a data set of 200 incognito variables and our submission is judged based on the Area Under Receiver Op…

2017. 網格游戲

2017. 網格游戲 給你一個下標從 0 開始的二維數組 grid ,數組大小為 2 x n ,其中 grid[r][c] 表示矩陣中 (r, c) 位置上的點數。現在有兩個機器人正在矩陣上參與一場游戲。 兩個機器人初始位置都是 (0, 0) ,目標位置是 (1, n-1) 。每個機器…

HUST軟工1506班第2周作業成績公布

說明 本次公布的成績對應的作業為: 第2周個人作業:WordCount編碼和測試 如果同學對作業成績存在異議,在成績公布的72小時內(截止日期4月26日0點)可以進行申訴,方式如下: 畢博平臺的第二周在線答…

幣氪共識指數排行榜0910

幣氪量化數據在今天的報告中給出DASH的近期買賣信號,可以看出從今年4月中旬起到目前為止,DASH_USDT的價格總體呈現出下降的趨勢。 轉載于:https://www.cnblogs.com/tokpick/p/9621821.html

走出囚徒困境的方法_囚徒困境的一種計算方法

走出囚徒困境的方法You and your friend have committed a murder. A few days later, the cops pick the two of you up and put you in two separate interrogation rooms such that you have no communication with each other. You think your life is over, but the polic…

2016. 增量元素之間的最大差值

2016. 增量元素之間的最大差值 給你一個下標從 0 開始的整數數組 nums &#xff0c;該數組的大小為 n &#xff0c;請你計算 nums[j] - nums[i] 能求得的 最大差值 &#xff0c;其中 0 < i < j < n 且 nums[i] < nums[j] 。 返回 最大差值 。如果不存在滿足要求的…

Zookeeper系列四:Zookeeper實現分布式鎖、Zookeeper實現配置中心

一、Zookeeper實現分布式鎖 分布式鎖主要用于在分布式環境中保證數據的一致性。 包括跨進程、跨機器、跨網絡導致共享資源不一致的問題。 1. 分布式鎖的實現思路 說明&#xff1a; 這種實現會有一個缺點&#xff0c;即當有很多進程在等待鎖的時候&#xff0c;在釋放鎖的時候會有…

resize 按鈕不會被偽元素遮蓋

textarea默認有個resize樣式&#xff0c;效果就是下面這樣 讀 《css 揭秘》時發現兩個亮點&#xff1a; 其實這個屬性不僅適用于 textarea 元素&#xff0c;適用于下面所有元素&#xff1a;elements with overflow other than visible, and optionally replaced elements repre…

平臺api對數據收集的影響_收集您的數據不是那么怪異的api

平臺api對數據收集的影響A data analytics cycle starts with gathering and extraction. I hope my previous blog gave an idea about how data from common file formats are gathered using python. In this blog, I’ll focus on extracting the data from files that are…

709. 轉換成小寫字母

709. 轉換成小寫字母 給你一個字符串 s &#xff0c;將該字符串中的大寫字母轉換成相同的小寫字母&#xff0c;返回新的字符串。 示例 1&#xff1a;輸入&#xff1a;s "Hello" 輸出&#xff1a;"hello"示例 2&#xff1a;輸入&#xff1a;s "here…

前端技術周刊 2018-09-10:Redux Mobx

前端快爆 在 Chrome 10 周年之際&#xff0c;正式發布 69 版本&#xff0c;整體 UI 重新設計&#xff0c;同時iOS 版本重新將工具欄放置在了底部。API 層面&#xff0c;支持了 CSS Scroll Snap、前端資源鎖 Web Lock API、WebWorker 里面可以跑的 OffscreenCanvas API、toggleA…

PPT制作

0.【整體風格】整體風格統一 界面排版 0.1 字體大小&#xff1b; 0.2 字體顏色&#xff1b; 0.3 字體的種類統一(不是指只取一種字體)&#xff09; 1.【表達】結構化表達&#xff1b; 2.【取色】取色風格統一&#xff1b; 技巧&#xff1a;主色不超過三種&#xff0c;色彩不宜多…

1984. 學生分數的最小差值

1984. 學生分數的最小差值 給你一個 下標從 0 開始 的整數數組 nums &#xff0c;其中 nums[i] 表示第 i 名學生的分數。另給你一個整數 k 。 從數組中選出任意 k 名學生的分數&#xff0c;使這 k 個分數間 最高分 和 最低分 的 差值 達到 最小化 。 返回可能的 最小差值 。…

WBLoadingIndicatorView(加載等待動畫)

中文說明 基于CALayer封裝加載等待動畫&#xff0c;目前支持6種類型動畫&#xff1a; typedef NS_ENUM(NSInteger, WBLoadingAnimationType) { WBLoadingAnimationcircleStrokeSpinType, WBWBLoadingAnimationBallPulseType, WBWBLoadingAnimationBallClipRotateType, WBWBLoad…

邏輯回歸 概率回歸_概率規劃的多邏輯回歸

邏輯回歸 概率回歸There is an interesting dichotomy in the world of data science between machine learning practitioners (increasingly synonymous with deep learning practitioners), and classical statisticians (both Frequentists and Bayesians). There is gener…

sys.modules[__name__]的一個實例

關于sys.modules[__name__]的用法&#xff0c;百度上閱讀量比較多得一個帖子是&#xff1a;https://www.cnblogs.com/robinunix/p/8523601.html 對于里面提到的基礎性的知識點這里就不再重復了&#xff0c;大家看原貼就好。這里為大家提供一個詳細的例子&#xff0c;幫助大家更…

ajax不利于seo_利于探索移動選項的界面

ajax不利于seoLately, my parents will often bring up in conversation their desire to move away from their California home and find a new place to settle down for retirement. Typically they will cite factors that they perceive as having altered the essence o…

C#調用WebKit內核

原文:C#調用WebKit內核版權聲明&#xff1a;本文為博主原創文章&#xff0c;未經博主允許不得轉載。 https://blog.csdn.net/u013564470/article/details/80255954 系統要求 Windows與.NET框架 由于WebKit庫和.NET框架的要求&#xff0c;WebKit .NET只能在Windows系統上運行。從…