How to Identify Media Bias: Descriptive Language Understanding to Identify Potential Bias in Text

GumGum can help bring change by using our Natural Language Processing technology to shed light on potential bias in the content of websites. The ideas and techniques shared in this blog are a result of the GumGum Hackathon Project: Verity E-Quality (Aditya Ramesh, Erica Nishimura, Ishan Shrivastava, Lane Schechter and Trung Do).

In this blog, we will look at how we can use and build upon GumGum’s existing product offerings to understand Gender Representation in a website’s content. We aren’t saying that one publisher is more biased than another; we are only providing awareness of the representation as it exists. With Natural Language Processing, we can compare the descriptive language used around Males and Females to provide this awareness.

In order to facilitate meaningful change, we need to be aware and mindful of where that change is needed. — Lane Schechter, Product Manager, GumGum Inc.


GumGum’s Product Offerings

Before we move ahead to understand how we build upon the existing product offerings, let us first take a brief look at them. GumGum’s Verity Product does a complete contextual analysis of a publisher’s webpage. Some of the key offerings of this product are:


  • Contextual Classification & Targeting: This feature identifies and scores a publisher’s content (webpages) for contextual classification based on the standard IAB Content Taxonomy v1.0 and v2.0. Some of those categories are “Sports”, “Food & Drinks”, “Automotive”, “Medical Health” etc. Going forward, we will refer to them as IAB verticals.

  • Brand Safety & Suitability: This feature flags and rates brand safety threats based on GumGum’s proprietary threat classification taxonomy and in compliance with The 4A’s Advertising Assurance Brand Safety Framework.

  • Named Entity Recognition (NER): This feature identifies and extracts any mention of a named entity in the publisher’s content. A named entity could be any mention of a ‘Person’, ‘Location’ or ‘Organization’.

  • Sentiment Analysis: This feature analyzes the attitudes, opinions and emotions expressed online to provide the most nuanced brand safety and contextual insights.

Here is one way we can provide Descriptive Language Understanding Associated with Gender. We can use the Named Entity Recognition (NER) feature to extract the names of entities of type “Person”, and use those names to identify the gender of the person being talked about. We can also use the Sentiment Analysis feature to extract the sentiment of the sentences in which Males and Females are mentioned. We can then combine this information to understand the descriptive language used around Males and Females (more on how to do this in the next section) and compare it across the different IAB verticals extracted with our Contextual Classification feature.

Approach for Descriptive Language Understanding Associated with Gender

Fig 1: Flowchart diagram describing the approach for Descriptive Language Understanding Associated with Gender

We start by running a Domain Specific Query on our NLP Databases to extract URLs for the given publisher. We then use the Named Entity Recognition feature of Verity to filter out pages that do not contain any “Person” Named Entity. From the remaining pages, we extract all “Person Names” and the Sentences in which those “Person Names” occur. As a future step, we can also perform coreference resolution to extract more sentences in which the “Persons” are referred to by pronouns.
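The filtering and sentence-extraction step above can be sketched roughly as follows. This is only an illustration: the page dictionaries, the `entities` field and the function name `sentences_with_person` are hypothetical stand-ins, not Verity's actual output format, and the sentence splitter is a naive regular expression.

```python
import re

def sentences_with_person(pages):
    """Yield (person_name, sentence) pairs from pages that mention a person.

    `pages` is a hypothetical structure: each page is a dict with the raw
    "text" and a list of "entities", each entity a dict with "text" and "type".
    """
    for page in pages:
        # Keep only "Person" entities; pages without any are skipped entirely.
        names = [e["text"] for e in page["entities"] if e["type"] == "Person"]
        if not names:
            continue
        # Naive sentence split: break after ., ! or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", page["text"].strip()):
            for name in names:
                if name in sentence:
                    yield name, sentence

pages = [
    {"text": "Serena Williams won again. The crowd was loud.",
     "entities": [{"text": "Serena Williams", "type": "Person"}]},
    {"text": "Stocks fell on Monday.", "entities": []},
]
pairs = list(sentences_with_person(pages))
```

Sentences that refer to the person only by pronoun are missed here, which is the gap coreference resolution would close.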

We then use the “Person Names” to detect the gender of each person with an open source package called Gender Guesser. We also use the “Sentences” to extract sentiment with our own FastText based Sentiment Classification model. This model is trained on our publisher data and classifies a sentence as Negative, Neutral or Positive.
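As a rough sketch of the gender step: Gender Guesser's detector returns one of the labels `male`, `female`, `mostly_male`, `mostly_female`, `andy` (androgynous) or `unknown` for a first name. The post does not say how these labels were grouped, so folding the `mostly_*` labels into the two main buckets and dropping the rest is an assumption made here for illustration. The lookup is passed in as a callable so the example runs without the package installed; `gender_for` and `toy_guess` are hypothetical names.

```python
# Labels returned by gender-guesser's Detector.get_gender(), mapped to buckets.
# Folding "mostly_*" into male/female is an assumption, not the post's rule.
GENDER_BUCKETS = {
    "male": "male",
    "mostly_male": "male",
    "female": "female",
    "mostly_female": "female",
    "andy": None,      # androgynous name: ambiguous, so it is skipped
    "unknown": None,   # name not found in the package's name list
}

def gender_for(person_name, guess):
    """Map a full person name to 'male', 'female', or None.

    `guess` is any callable that takes a first name and returns one of the
    labels above, e.g. gender_guesser.detector.Detector().get_gender.
    """
    parts = person_name.split()
    if not parts:
        return None
    return GENDER_BUCKETS.get(guess(parts[0]))

# Stand-in lookup so the sketch runs without the package installed.
toy_guess = {"Serena": "female", "Roger": "male", "Alex": "andy"}.get

serena = gender_for("Serena Williams", toy_guess)
alex = gender_for("Alex Morgan", toy_guess)
```

Names that map to `None` would simply be left out of the later per-gender counts.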

We also use the “Person Names” and the Sentences they occur in to extract the Adjectives used in the surrounding context of a given person. To do this, we use spaCy’s part-of-speech tagger and extract the adjectives that appear in close proximity to a mention of a person’s name. Consider the example given below:

Fig 2
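The proximity rule described above might look something like the following sketch. It works on `(text, pos)` pairs such as `(token.text, token.pos_)` taken from a spaCy `Doc`, where adjectives carry the tag `ADJ`. The window size and the function name `adjectives_near` are assumptions; the post does not state how "proximity" was defined.

```python
def adjectives_near(tagged_tokens, name_tokens, window=5):
    """Return adjectives within `window` tokens of any name mention.

    tagged_tokens: list of (text, pos) pairs, e.g. (token.text, token.pos_)
    from a spaCy Doc. name_tokens: set of tokens that make up the name.
    `window` is a hypothetical parameter; the original value is not given.
    """
    name_positions = [i for i, (text, _) in enumerate(tagged_tokens)
                      if text in name_tokens]
    found = []
    for i, (text, pos) in enumerate(tagged_tokens):
        # Keep an adjective only if some name token is close enough.
        if pos == "ADJ" and any(abs(i - p) <= window for p in name_positions):
            found.append(text.lower())
    return found

tagged = [("The", "DET"), ("brilliant", "ADJ"), ("Marie", "PROPN"),
          ("Curie", "PROPN"), ("gave", "VERB"), ("a", "DET"),
          ("long", "ADJ"), ("lecture", "NOUN"), (".", "PUNCT")]
near = adjectives_near(tagged, {"Marie", "Curie"}, window=2)
```

With `window=2` only "brilliant" is kept, because "long" sits three tokens away from the nearest name token; a wider window would also pick up adjectives that describe something else in the sentence.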

We use all this information to create a Word Cloud for the Adjectives used around each Gender and Sentiment Pair across the entire content as well as specific to different IAB verticals.
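One simple way to organize this data, sketched under the assumption that each sentence has already been reduced to a gender, a sentiment and a list of adjectives, is to keep one frequency counter per (gender, sentiment) pair. The record layout and the name `adjective_counts` are illustrative, not taken from the original project.

```python
from collections import Counter, defaultdict

def adjective_counts(records):
    """Count adjectives per (gender, sentiment) pair.

    records: iterable of (gender, sentiment, adjectives) tuples, one per
    sentence that mentions a person (a hypothetical intermediate format).
    """
    counts = defaultdict(Counter)
    for gender, sentiment, adjectives in records:
        counts[(gender, sentiment)].update(adjectives)
    return counts

records = [
    ("male", "positive", ["proud", "perfect"]),
    ("female", "positive", ["beautiful"]),
    ("male", "positive", ["proud"]),
]
counts = adjective_counts(records)
```

Each counter can then be handed to a word cloud renderer as a word-to-frequency mapping, and adding the IAB vertical to the key gives the per-vertical breakdown.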


For example, consider the following four word clouds, built from the Adjectives used around Males and Females in Positive and Negative contexts in one Publisher’s content:

Fig 3: Word cloud of adjectives used around Males, Negative Sentiment
Fig 4: Word cloud of adjectives used around Females, Negative Sentiment

Nothing stereotypical stands out here. Similarly negative adjectives are used around Males and Females alike.

Fig 5: Word cloud of adjectives used around Males, Positive Sentiment
Fig 6: Word cloud of adjectives used around Females, Positive Sentiment

What we see here is that more Intellectual Type Adjectives are used around Males, while more Appearance Type Adjectives are used around Females.

This becomes even clearer if we look at the most frequent Adjectives used around ONLY Males or ONLY Females. We take the top 15 adjectives for each gender, keep only the Uncommon Adjectives (those that do not appear in the other gender’s top 15), and compare them across the Positive and Negative contexts.
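A minimal sketch of this selection, assuming the adjective frequencies for one sentiment are already available as `Counter` objects (the function name `exclusive_top_adjectives` is made up for the example):

```python
from collections import Counter

def exclusive_top_adjectives(male_counts, female_counts, top_n=15):
    """Return (male_only, female_only): top-N adjectives not shared by both."""
    male_top = [word for word, _ in male_counts.most_common(top_n)]
    female_top = [word for word, _ in female_counts.most_common(top_n)]
    male_set, female_set = set(male_top), set(female_top)
    # Keep frequency order while dropping adjectives common to both lists.
    male_only = [word for word in male_top if word not in female_set]
    female_only = [word for word in female_top if word not in male_set]
    return male_only, female_only

male = Counter({"great": 5, "proud": 4, "perfect": 2})
female = Counter({"great": 6, "beautiful": 3, "sweet": 1})
male_only, female_only = exclusive_top_adjectives(male, female)
```

Here "great" is dropped because it is in both top lists, leaving only the adjectives that distinguish one gender from the other.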

Fig 7: Most Frequent Adjectives used for Only Male/Female based on top 15 Adjectives for each Gender corresponding to different Sentiment Context

Here we can clearly see that in the Negative context, the most frequent Adjectives used around Only Males and Only Females can be considered equally negative. In the Positive context, that is clearly not the case. Around Males, we see adjectives like “Proud”, “Sized”, “Perfect”, “Fantastic” etc., while around Females we see adjectives like “Beautiful”, “Healthy”, “Amazing”, “Sweet”, “Supporting”, “Lucky” etc. This suggests that more Intellectual Type Adjectives are used around Males and more Appearance Type Adjectives are used around Females.

This sort of analysis of the descriptive language used around different Genders in different Sentiment Contexts can really help in understanding what sort of Bias, if any, is present in a publisher’s content. But how can we quantify this? For this we introduce a Context Based Similarity Score.

Context Based Similarity Score

The idea here is to find a way to compute a single score that shows the degree of similarity between the most frequent adjectives used around only Males and only Females. To achieve this we make use of the famous Transformer based Deep Learning model: BERT by Google Research.


Besides performing well on a variety of NLP tasks and setting State of the Art results on them, BERT is also great at providing Contextualized Word Vector Representations (Embeddings). This means that BERT doesn’t provide a single, constant representation of a word. Instead, it looks at the context in which the word is used in the sentence and produces a context-sensitive representation of that word. This is particularly useful because it captures more information than static representations such as Word2Vec or GloVe. A well-known example is that BERT will provide different representations for the word “Bank” depending on whether the context is a river bank or a financial bank. Therefore, to extract a word representation from BERT, you need to pass in the sentence in which the word was used. (Besides the original BERT paper, illustrated walkthroughs of Transformers and BERT offer a more visual way of understanding them.)

Therefore, along with the most frequent Male only and Female only Adjectives, we also extract the sentences in which these Adjectives are used. We send these sentences into BERT to extract a Contextualized Vector Representation of length 768 for each Adjective, based on the context in which it was used.
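BERT produces one vector per WordPiece token, and a word that is split into several pieces (continuation pieces start with `##`) therefore has several vectors. The post does not say how a single vector per adjective was obtained; averaging the pieces, as in the sketch below, is one common choice and is an assumption here. The tokens and 2-dimensional vectors are toy values standing in for real BERT output (768 dimensions per token), and `word_vector` is a hypothetical helper.

```python
def strip_piece(token):
    """Remove the WordPiece continuation marker, if present."""
    return token[2:] if token.startswith("##") else token

def word_vector(tokens, vectors, word):
    """Average the vectors of the WordPiece tokens that spell `word`.

    tokens: WordPiece strings for one sentence (lower-cased, as an uncased
    model would produce). vectors: one list of floats per token.
    Returns None if the word does not occur in the sentence.
    """
    # Group token indices into whole words.
    groups = []
    for i, token in enumerate(tokens):
        if token.startswith("##") and groups:
            groups[-1].append(i)
        else:
            groups.append([i])
    for idxs in groups:
        spelled = "".join(strip_piece(tokens[i]) for i in idxs)
        if spelled == word.lower():
            dim = len(vectors[idxs[0]])
            return [sum(vectors[i][d] for i in idxs) / len(idxs)
                    for d in range(dim)]
    return None

# Toy sentence: "amazing" is split into two pieces for illustration.
tokens = ["she", "gave", "an", "amaz", "##ing", "speech"]
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
           [2.0, 4.0], [4.0, 0.0], [1.0, 1.0]]
vec = word_vector(tokens, vectors, "amazing")
```

Because the per-token vectors depend on the whole sentence, the same adjective yields a different vector in each sentence it appears in.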

We use these context-rich representations to compute a Context Based Similarity Score between the Male only Adjectives and the Female only Adjectives used in a Positive or a Negative context. We take the mean of the contextual representations of all Male only Adjectives, and separately of all Female only Adjectives, to get one averaged representation for each group. We then take the cosine similarity between these two vectors to compute the Context Based Similarity Score, as shown in the figure below:

Fig 8: Calculating the Context Based Similarity Score from Contextualized Word Vector Representations of the Adjectives used around only Males and around only Females.
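The computation described above (mean of each group's vectors, then cosine similarity between the two means) can be written in a few lines. This sketch uses plain Python lists and 2-dimensional toy vectors in place of the 768-dimensional BERT representations; the function names are illustrative.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def context_similarity_score(male_vectors, female_vectors):
    """Cosine similarity between the mean male-only and female-only vectors."""
    return cosine_similarity(mean_vector(male_vectors),
                             mean_vector(female_vectors))

male_vectors = [[1.0, 0.0], [1.0, 2.0]]     # mean is [1.0, 1.0]
female_vectors = [[2.0, 2.0], [4.0, 4.0]]   # mean is [3.0, 3.0]
score = context_similarity_score(male_vectors, female_vectors)
```

In this toy case the two mean vectors point in the same direction, so the score is 1 up to floating-point rounding; mean vectors pointing in different directions give a lower score.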

This score is calculated for a given sentiment and a given IAB vertical.


The higher this score, the better the balance between the Adjectives used around each gender for the given sentiment and IAB vertical.

Let us look at the Context Based Similarity score in action:


Fig 9: The Context Based Similarity Score based on the most Frequent Adjectives used around Only Males and Only Females corresponding to different Sentiment Context

Comparing the two scores, we see a higher score for Negative sentiment, where similar kinds of Adjectives (equally negative in this case) were used around Males and Females. On the other hand, we get a lower score for Positive sentiment, where we did see some form of Bias: Intellectual Type Adjectives were used around Males, while Appearance Type Adjectives were used around Females.

Conclusion

In this blog we saw how we can analyze the Descriptive Language used around Males and Females. We looked at the insights from such an analysis and saw how it can point us to where change might be required. We also saw how GumGum can build upon Product Offerings like Content Classification and Named Entity Recognition to quantify the degree of similarity in the descriptive language used around Males and Females. As part of our future work, we can work on identifying Race mentions in a piece of text and extend this approach to understand the Descriptive Language used around different Races.

About Me: I graduated with a Masters in Computer Science from ASU and am an NLP Scientist at GumGum. I am interested in applying Machine Learning/Deep Learning to provide some structure to the unstructured data that surrounds us.


Translated from: https://medium.com/gumgum-tech/descriptive-language-understanding-to-identify-potential-bias-in-text-89936fefbae7
