fairmodels: Let's fight with biased Machine Learning models


TL;DR

The R package fairmodels facilitates bias detection through model visualizations. It implements a few mitigation strategies that can reduce bias. It enables easy-to-use checks of fairness metrics and comparisons between different Machine Learning (ML) models.


Long version

Bias mitigation is an important topic in the Machine Learning (ML) fairness field. For Python users, there are algorithms already implemented, well explained, and described (see AIF360). fairmodels provides an implementation of a few popular, effective bias mitigation techniques ready to make your model fairer.


I have a biased model, now what?

Having a biased model is not the end of the world. There are many ways to deal with it, and fairmodels implements various algorithms to help you tackle the problem. First, I must describe the difference between pre-processing and post-processing algorithms.


  • Pre-processing algorithms work on the data before the model is trained. They try to mitigate the bias between the privileged and unprivileged subgroups through inference from the data.


  • Post-processing algorithms change the output of a model explained with DALEX so that its output does not favor the privileged subgroup so much.


How do these algorithms work?

In this section, I will briefly describe how these bias mitigation techniques work. Code for more detailed examples and some visualizations used here may be found in this vignette.


Pre-processing

Disparate impact remover (Feldman et al., 2015)

(image by author) Disparate impact removing: the blue and red distributions are transformed into the "middle" distribution.

This algorithm works on numeric, ordinal features. It changes the column values so that the distributions for the unprivileged (blue) and privileged (red) subgroups are close to each other. In general, we would like our algorithm to judge not on the raw value of the feature but rather on percentiles (e.g., hiring the top 20% of applicants for the job from both subgroups). The algorithm works by finding the distribution that minimizes the earth mover's distance: in simple words, it finds the "middle" distribution and changes the values of this feature for each subgroup accordingly.

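The package itself implements this in R; as a language-neutral illustration, here is a minimal Python sketch of the idea (the function name and the choice of pointwise mean as the "middle" distribution are mine; the paper and package may use a different target, such as the median of the group quantile functions). Each value is mapped to its within-group percentile and then to the average of what every subgroup's distribution holds at that percentile:

```python
import numpy as np

def repair_full(x, groups):
    """Full disparate-impact repair sketch: map each value to its
    within-group percentile, then to the mean of all subgroups' quantile
    functions at that percentile (a "middle" distribution)."""
    x = np.asarray(x, dtype=float)
    groups = np.asarray(groups)
    out = np.empty_like(x)
    uniq = np.unique(groups)
    for g in uniq:
        mask = groups == g
        xg = x[mask]
        # percentile rank of each value within its own subgroup
        ranks = np.argsort(np.argsort(xg)) / max(len(xg) - 1, 1)
        # the value each subgroup holds at that percentile, averaged
        out[mask] = np.mean(
            [np.quantile(x[groups == h], ranks) for h in uniq], axis=0
        )
    return out

# two well-separated subgroups are pulled onto one shared distribution
print(repair_full([0, 1, 2, 10, 11, 12], ["blue"] * 3 + ["red"] * 3))
```

After repair, an applicant in the top 20% of either subgroup gets (roughly) the same transformed value, which is the percentile-based judgment described above.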

Reweighting (Kamiran et al., 2012)

(image by author) In this mock-up example, S=1 is the privileged subgroup. There is a weight for each unique combination of S and y.

Reweighting is a simple but effective tool for minimizing bias. The algorithm looks at the protected attribute and the true label. It then calculates the probability of assigning the favorable label (y = 1) assuming the protected attribute and y are independent. Of course, if there is bias, they will be statistically dependent. The algorithm then divides this theoretical probability by the true, empirical probability of the event, and that ratio is the weight. With these 2 vectors (protected variable and y) we can create a weight for each observation in the data and pass the weights to the model. Simple as that. But some models have no weights parameter and therefore can't benefit from this method.

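The weight computation described above can be sketched in Python (an illustration of the idea, not the package's R code; the function name is mine). For each (group, label) pair the weight is the expected count under independence divided by the observed count:

```python
from collections import Counter

def reweigh(protected, y):
    """Kamiran-Calders reweighing sketch:
    weight = P_expected(s) * P_expected(y) / P_observed(s, y)
           = n_s * n_y / (n * n_sy)."""
    n = len(y)
    cnt_s, cnt_y = Counter(protected), Counter(y)
    cnt_sy = Counter(zip(protected, y))
    return [cnt_s[s] * cnt_y[t] / (n * cnt_sy[(s, t)])
            for s, t in zip(protected, y)]

s = ["A", "A", "A", "B", "B", "B"]   # A: privileged, B: unprivileged
y = [1, 1, 0, 1, 0, 0]               # favorable label over-represented in A
print(reweigh(s, y))                 # → [0.75, 0.75, 1.5, 1.5, 0.75, 0.75]
```

Weights above 1 boost the combinations the biased data under-represents (here, unprivileged observations with the favorable label), and the weights sum back to the number of observations.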

Resampling (Kamiran et al., 2012)

(image by author) Uniform sampling. Circles denote duplication; x's denote omitted observations.

Resampling is closely related to the prior method, as it implicitly uses reweighting to calculate how many observations must be omitted or duplicated in each case. Imagine there are 2 groups, deprived (S = 0) and favored (S = 1). This method duplicates observations from the deprived subgroup when the label is positive and omits observations with a negative label; the opposite is then performed on the favored group. There are 2 types of resampling implemented: uniform and preferential. Uniform randomly picks observations (as in the picture), whereas preferential uses the model's probabilities to pick/omit observations close to the cutoff (default is 0.5).

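The uniform variant can be sketched in Python (an illustration of the idea, not the package's R code; names are mine). Each (group, label) cell is brought to its expected size under independence by randomly dropping or duplicating observations:

```python
import random
from collections import Counter

def uniform_resample(protected, y, seed=0):
    """Uniform resampling sketch: bring every (group, label) cell to its
    expected size n_s * n_y / n by randomly dropping observations from
    over-represented cells and duplicating in under-represented ones."""
    rng = random.Random(seed)
    n = len(y)
    cnt_s, cnt_y = Counter(protected), Counter(y)
    cells = {}
    for i, (s, t) in enumerate(zip(protected, y)):
        cells.setdefault((s, t), []).append(i)
    out = []
    for (s, t), idx in cells.items():
        want = round(cnt_s[s] * cnt_y[t] / n)
        if want <= len(idx):                      # over-represented: drop
            out += rng.sample(idx, want)
        else:                                     # under-represented: duplicate
            out += idx + rng.choices(idx, k=want - len(idx))
    return sorted(out)

s = ["A", "A", "A", "B", "B", "B"]
y = [1, 1, 0, 1, 0, 0]
print(uniform_resample(s, y))   # indices of the resampled dataset
```

The preferential variant would replace the random choices with picks ordered by how close each observation's predicted probability is to the cutoff.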

Post-processing

Post-processing takes place after creating an explainer. To create an explainer we need a model and the DALEX package. A gbm model will be trained on the adult dataset to predict whether a person earns more than 50k annually.


library(gbm)
library(DALEX)
library(fairmodels)

data("adult")
adult$salary <- as.numeric(adult$salary) - 1
protected <- adult$sex
adult <- adult[colnames(adult) != "sex"] # sex not specified

# making model
set.seed(1)
gbm_model <- gbm(salary ~ ., data = adult, distribution = "bernoulli")

# making explainer
gbm_explainer <- explain(gbm_model,
                         data = adult[, -1],
                         y = adult$salary,
                         colorize = FALSE)

Reject Option based Classification (pivot) (Kamiran et al., 2012)

(image by author) Red: privileged, blue: unprivileged. If the probability falls within (cutoff - theta, cutoff + theta) and the case qualifies, it is moved to the opposite side of the cutoff.

ROC pivot is implemented based on Reject Option based Classification. The algorithm switches labels if an observation comes from the unprivileged group and lies on the left of the cutoff; the opposite is performed for the privileged group. But there is an assumption that the observation must be close (in terms of probability) to the cutoff, so the user must provide a value theta telling the algorithm how close an observation must be to the cutoff to be switched. But there is a catch. If just the labels were changed, the DALEX explainer would have a hard time properly calculating the performance of the model. For that reason, in the fairmodels implementation of this algorithm it is the probabilities that are switched (pivoted): they are moved to the other side of the cutoff, at an equal distance from it.

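The pivoting rule can be sketched in Python (an illustration of the idea, not fairmodels' R code; names are mine). A probability within theta of the cutoff is reflected to the mirror position on the other side of it:

```python
def roc_pivot(probs, protected, privileged, cutoff=0.5, theta=0.1):
    """ROC pivot sketch: probabilities within theta of the cutoff are
    reflected across it -- unprivileged cases move up over the cutoff,
    privileged cases move down."""
    out = []
    for p, s in zip(probs, protected):
        close = abs(p - cutoff) < theta
        lift = s != privileged and p < cutoff    # unprivileged, below cutoff
        drop = s == privileged and p >= cutoff   # privileged, above cutoff
        if close and (lift or drop):
            p = 2 * cutoff - p                   # reflect across the cutoff
        out.append(p)
    return out

probs = [0.45, 0.55, 0.20]
sex = ["Female", "Male", "Female"]
print(roc_pivot(probs, sex, privileged="Male"))
```

Because the probabilities (not just the labels) move, the explainer can still compute performance consistently, which is the catch the method addresses.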

Cutoff manipulation

(image by author) plot(ceteris_paribus_cutoff(fobject, cumulated = TRUE))

Cutoff manipulation can be a great way to minimize the bias in a model. We simply choose the metrics and the subgroup for which the cutoff will change. The plot shows where the minimum is: for that cutoff value the parity loss will be lowest. How do we create a fairness_object with a different cutoff for a certain subgroup? It is easy!


fobject <- fairness_check(gbm_explainer,
                          protected = protected,
                          privileged = "Male",
                          label = "gbm_cutoff",
                          cutoff = list(Female = 0.35))

Now the fairness_object (fobject) is a structure with the specified cutoff, and it will affect both fairness metrics and performance.

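The effect of a per-subgroup cutoff on predictions is conceptually simple; a Python sketch of the idea (names are mine, not fairmodels' internals):

```python
def predict_with_cutoffs(probs, protected, cutoffs, default=0.5):
    """Apply a per-subgroup cutoff to predicted probabilities,
    e.g. cutoffs={"Female": 0.35} with the default 0.5 for everyone else."""
    return [int(p >= cutoffs.get(s, default))
            for p, s in zip(probs, protected)]

print(predict_with_cutoffs([0.40, 0.40], ["Female", "Male"],
                           {"Female": 0.35}))   # → [1, 0]
```

The same probability yields different labels depending on the subgroup's cutoff, which is exactly why both the fairness metrics and the performance of fobject change.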

The tradeoff between fairness and accuracy

If we want to mitigate bias, we must be aware of the possible drawbacks of this action. Let's say that Statistical Parity is the most important metric for us. Lowering the parity loss of this metric will (probably) increase the number of False Positives, which will cause accuracy to drop. For this example (which you can find here), a gbm model was trained and then treated with different bias mitigation techniques.


(image by author)

The more we try to mitigate the bias, the less accuracy we get. This is natural for this metric, and the user should be aware of it.


Summary

The debiasing methods implemented in fairmodels are certainly worth trying. They are flexible, most of them are suited to any model, and above all they are easy to use.


What to read next?

  • Blog post about introduction to fairness, problems, and solutions


  • Blog post about fairness visualization


Learn more

  • Check the package’s GitHub website for more details


  • Tutorial on full capabilities of the fairmodels package


  • Tutorial on bias mitigation techniques


Translated from: https://towardsdatascience.com/fairmodels-lets-fight-with-biased-machine-learning-models-f7d66a2287fc
