How to Identify Media Bias

GumGum can bring about change by using our Natural Language Processing technology to shed light on potential bias in websites’ content. The ideas and techniques shared in this blog are a result of the GumGum Hackathon project Verity E-Quality (Aditya Ramesh, Erica Nishimura, Ishan Shrivastava, Lane Schechter and Trung Do).

In this blog, we look at how we can build upon GumGum’s existing product offerings to understand gender representation in a website’s content. We aren’t saying that one publisher is more biased than another; we are merely raising awareness of the representation as it exists. With Natural Language Processing, we can compare the descriptive language used around Males and Females to provide this awareness.

In order to facilitate meaningful change, we need to be aware and mindful of where that change is needed. — Lane Schechter, Product Manager, GumGum Inc.

GumGum’s Product Offerings

Before we move ahead to understand how we build upon the existing product offerings, let us first take a brief look at them. GumGum’s Verity Product does a complete contextual analysis of a publisher’s webpage. Some of the key offerings of this product are:

Contextual Classification & Targeting: This feature identifies and scores a publisher’s content (webpages) for contextual classification based on the standard IAB Content Taxonomy v1.0 and v2.0. Some of those categories are “Sports”, “Food & Drinks”, “Automotive”, “Medical Health” etc. Going forward, we will refer to them as IAB verticals.
Brand Safety & Suitability: This feature flags and rates brand safety threats based on GumGum’s proprietary threat classification taxonomy and in compliance with The 4A’s Advertising Assurance Brand Safety Framework.
Named Entity Recognition (NER): This feature identifies and extracts any mention of a named entity in the publisher’s content. A named entity could be any mention of a ‘Person’, ‘Location’ or ‘Organization’.
Sentiment Analysis: This feature analyzes the attitudes, opinions and emotions expressed online to provide the most nuanced brand safety and contextual insights.

Here is one way to provide Descriptive Language Understanding Associated with Gender. We can use the Named Entity Recognition (NER) feature to extract the names of “Person” entities, which can then be used to identify the gender of the person being talked about. We can also use the Sentiment Analysis feature to extract the sentiment of the sentences in which Males and Females are mentioned. We can then use all of this information to understand the descriptive language used around Males and Females (more on how to do this in the next section) and compare it across the different IAB verticals extracted using our Contextual Classification feature.

Approach for Descriptive Language Understanding Associated with Gender

Fig 1: Flowchart describing the approach for Descriptive Language Understanding Associated with Gender

We start by running a domain-specific query on our NLP databases to extract the URLs for the given publisher. We then use the Named Entity Recognition feature of Verity to filter out pages that do not contain any “Person” named entity. From the remaining pages, we extract all “Person Names” and the sentences in which those “Person Names” occur. As a future step, we could also perform coreference resolution to extract additional sentences in which the persons are referred to by pronouns.
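The filtering and extraction step can be sketched roughly as follows. This is only an illustration: the page records, their field names (`url`, `text`, `entities`) and the regex sentence splitter are assumptions made for the example, not Verity's actual output format or pipeline.

```python
import re

# Hypothetical page records; in practice the text and entities would come
# from the NER output for each URL. The field names are assumptions.
pages = [
    {"url": "https://example.com/a",
     "text": "Maria Lopez won the final. The crowd cheered loudly.",
     "entities": [{"text": "Maria Lopez", "type": "Person"}]},
    {"url": "https://example.com/b",
     "text": "The new stadium opens in May.",
     "entities": [{"text": "May", "type": "Date"}]},
]

def split_sentences(text):
    # Naive splitter on ., ! or ? followed by whitespace; a real pipeline
    # would use a proper sentence segmenter.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def person_sentences(pages):
    """Return (url, person_name, sentence) for every sentence naming a person."""
    rows = []
    for page in pages:
        names = [e["text"] for e in page["entities"] if e["type"] == "Person"]
        if not names:
            continue  # filter out pages with no "Person" entity
        for sentence in split_sentences(page["text"]):
            for name in names:
                if name in sentence:
                    rows.append((page["url"], name, sentence))
    return rows
```

Only the first toy page survives the filter, and only the sentence that actually names the person is kept.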

We then use the “Person Names” to detect the gender of each person using an open source package called Gender Guesser. We also use the “Sentences” to extract the sentiment of each sentence with our own FastText-based Sentiment Classification model. This model is trained on our publisher data and classifies a sentence as having Negative, Neutral or Positive sentiment.
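As a rough sketch of the gender detection step, the snippet below groups sentences by the guessed gender of the person they mention. Guessing from the first token of the name, folding the "mostly_*" labels into the main buckets, and dropping ambiguous names are all assumptions made for illustration; the post does not say how these cases were handled. The guesser is passed in as a callable so the example runs without the package installed.

```python
from collections import defaultdict

# Labels returned by gender-guesser's Detector.get_gender():
# "male", "female", "mostly_male", "mostly_female", "andy", "unknown".
# Folding the "mostly_*" labels into the main buckets is an assumption.
LABEL_TO_GENDER = {
    "male": "Male", "mostly_male": "Male",
    "female": "Female", "mostly_female": "Female",
}

def group_by_gender(rows, guess_gender):
    """Group sentences by the guessed gender of the person they mention.

    rows: iterable of (person_name, sentence) pairs.
    guess_gender: callable mapping a first name to a gender-guesser style label.
    """
    grouped = defaultdict(list)
    for name, sentence in rows:
        parts = name.split()
        if not parts:
            continue  # skip empty names
        gender = LABEL_TO_GENDER.get(guess_gender(parts[0]))
        if gender is not None:  # drop "andy" (androgynous) and "unknown"
            grouped[gender].append(sentence)
    return dict(grouped)

# Stand-in guesser (hypothetical) so the sketch runs without the package.
def toy_guesser(first_name):
    return {"Maria": "female", "John": "male"}.get(first_name, "unknown")
```

With the package installed, `gender_guesser.detector.Detector().get_gender` can be passed in place of `toy_guesser`.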

We also use the “Person Names” and the sentences they occur in to extract the adjectives used in the surrounding context of a given person. To achieve this, we use spaCy’s part-of-speech tagger and extract the adjectives that occur in proximity to a mention of a person’s name. Consider the example given below:
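A minimal sketch of proximity-based adjective extraction is shown here. It works on already tagged tokens so that it runs without a language model; with spaCy, the `(token, pos)` pairs could be built from `token.text` and `token.pos_`. The window size of 3 and the hand-tagged sentence are assumptions for illustration, since the post does not state the proximity it used.

```python
def adjectives_near(tagged_tokens, name_tokens, window=3):
    """Collect adjectives within `window` tokens of a mention of the name.

    tagged_tokens: list of (token, pos) pairs using tags such as "ADJ".
    name_tokens: the person's name split into tokens.
    """
    n = len(name_tokens)
    if n == 0:
        return []
    words = [tok for tok, _ in tagged_tokens]
    hits = set()  # token indices, so overlapping windows are not double counted
    for start in range(len(words) - n + 1):
        if words[start:start + n] != name_tokens:
            continue
        lo = max(0, start - window)
        hi = min(len(words), start + n + window)
        for i in range(lo, hi):
            if tagged_tokens[i][1] == "ADJ":
                hits.add(i)
    return [tagged_tokens[i][0].lower() for i in sorted(hits)]

# Hand-tagged toy sentence (hypothetical data).
tagged = [
    ("The", "DET"), ("talented", "ADJ"), ("Maria", "PROPN"), ("Lopez", "PROPN"),
    ("gave", "VERB"), ("a", "DET"), ("brilliant", "ADJ"), ("speech", "NOUN"),
    ("about", "ADP"), ("the", "DET"), ("old", "ADJ"), ("rules", "NOUN"),
    (".", "PUNCT"),
]
nearby = adjectives_near(tagged, ["Maria", "Lopez"])
```

Here "talented" and "brilliant" fall inside the window around the name, while "old" is too far away to be attributed to the person.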

We use all this information to create a Word Cloud of the adjectives used around each Gender and Sentiment pair, both across the entire content and within specific IAB verticals.
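The aggregation behind such word clouds might look like the sketch below, which counts adjectives per (gender, sentiment) pair and optionally restricts the count to one IAB vertical. The record layout and its keys are hypothetical.

```python
from collections import Counter, defaultdict

def adjective_frequencies(records, vertical=None):
    """Count adjectives per (gender, sentiment) pair.

    records: iterable of dicts with the hypothetical keys "gender",
    "sentiment", "vertical" and "adjectives" (a list of strings).
    vertical: if given, only records from that IAB vertical are counted.
    """
    freqs = defaultdict(Counter)
    for rec in records:
        if vertical is not None and rec["vertical"] != vertical:
            continue
        freqs[(rec["gender"], rec["sentiment"])].update(rec["adjectives"])
    return freqs

# Toy records (hypothetical data).
records = [
    {"gender": "Male", "sentiment": "Positive", "vertical": "Sports",
     "adjectives": ["proud", "perfect"]},
    {"gender": "Male", "sentiment": "Positive", "vertical": "Food & Drinks",
     "adjectives": ["proud"]},
    {"gender": "Female", "sentiment": "Positive", "vertical": "Sports",
     "adjectives": ["beautiful"]},
]
overall = adjective_frequencies(records)
sports = adjective_frequencies(records, vertical="Sports")
```

Each resulting `Counter` is a word-to-frequency mapping, which is the kind of input a word cloud renderer typically accepts (for example, the `wordcloud` package's `generate_from_frequencies`).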

For example, consider the following four word clouds that we got based on the Adjectives used around Males and Females in a Positive and Negative context extracted from a Publisher’s content:

Nothing stereotypical stands out here: similarly or equally negative adjectives are used around Males and Females alike.

What we see here is that more Intellectual Type Adjectives are used around Males, while more Appearance Type Adjectives are used around Females.

This becomes even clearer if we look at the most frequent adjectives used around ONLY Males or ONLY Females. We do this by taking the top 15 adjectives for each gender, keeping only the adjectives that the two lists do not have in common (the Uncommon Adjectives), and comparing them across the Positive and Negative contexts.
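One way to read this step is as a set difference between the two top-15 lists. The sketch below follows that reading, which is an interpretation of the description and not necessarily the exact procedure used. The toy counts use `top_n=3` to keep the example short.

```python
from collections import Counter

def gender_exclusive_adjectives(male_counts, female_counts, top_n=15):
    """Return the top adjectives that appear in only one gender's top-N list."""
    male_top = [adj for adj, _ in male_counts.most_common(top_n)]
    female_top = [adj for adj, _ in female_counts.most_common(top_n)]
    male_set, female_set = set(male_top), set(female_top)
    # Preserve frequency order while dropping adjectives shared by both lists.
    male_only = [adj for adj in male_top if adj not in female_set]
    female_only = [adj for adj in female_top if adj not in male_set]
    return male_only, female_only

# Toy counts (hypothetical data).
male_counts = Counter({"great": 5, "proud": 4, "perfect": 3, "old": 1})
female_counts = Counter({"great": 6, "beautiful": 4, "sweet": 2, "proud": 1})
male_only, female_only = gender_exclusive_adjectives(
    male_counts, female_counts, top_n=3
)
```

Note that under this reading an adjective counts as "Male only" when it is missing from the Female top-N list, even if it occurs occasionally around Females (as "proud" does in the toy data).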

Here we can clearly see that in the Negative context, the most frequent adjectives used around only Males and only Females can be considered equally negative. But in the Positive context, that is clearly not the case. Around Males, we see adjectives like “Proud”, “Sized”, “Perfect”, “Fantastic” etc., while around Females we see adjectives like “Beautiful”, “Healthy”, “Amazing”, “Sweet”, “Supporting”, “Lucky” etc. This suggests that more Intellectual Type Adjectives are used around Males and more Appearance Type Adjectives are used around Females.

This sort of analysis of the descriptive language used around different genders in different sentiment contexts can really help in understanding what sort of bias, if any, is present in a publisher’s content. But how can we quantify this? For this we introduce a Context Based Similarity Score.

Context Based Similarity Score

The idea here is to find a way to compute a single score that shows the degree of similarity between the most frequent adjectives used around only Males and only Females. To achieve this we make use of the famous Transformer based Deep Learning model: BERT by Google Research.

Besides excelling at a variety of NLP tasks and setting new state-of-the-art results on them, BERT is also great at providing Contextualized Word Vector Representations (Embeddings). This means that BERT doesn’t provide a single, constant representation of a word; rather, it looks at the context in which the word is used in the sentence and produces a context-sensitive representation of that word. This is particularly useful because it captures more information than static representations such as Word2Vec or GloVe, which assign each word one fixed vector. A famous example is that BERT will provide different representations for the word “bank” depending on whether the context is a river bank or a financial bank. Therefore, to extract a word representation from BERT, you need to pass in a sentence in which the word is used to get its Contextualized Word Vector Representation. (For more detail, see the original BERT paper and the visual guides to Transformers and BERT.)

Therefore, along with the most frequent Male only and Female Only adjectives, we also extract the sentences in which these Male only and Female Only Adjectives are used. We send these sentences into BERT to extract Contextualized Vector Representations of length 768, for each of these Adjectives based on the context in which these adjectives were used.

We use these representations, which carry rich context information, to compute a Context Based Similarity Score between the Male only Adjectives and the Female only Adjectives used in a Positive or a Negative context. We take the mean of the contextual representations of all Male only Adjectives and of all Female only Adjectives to get one averaged representation for each group. We then take the cosine similarity between the two averaged vectors to compute the Context Based Similarity Score, as shown in the figure below:
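The score itself reduces to a mean followed by a cosine similarity. The sketch below uses only the standard library and toy 2-dimensional vectors in place of the 768-dimensional BERT vectors; the function names are hypothetical.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def context_similarity_score(male_vectors, female_vectors):
    """Cosine similarity between the averaged Male only and Female only vectors."""
    return cosine_similarity(mean_vector(male_vectors), mean_vector(female_vectors))

# Toy stand-ins for contextual adjective embeddings (hypothetical data).
male_vectors = [[1.0, 0.0], [1.0, 2.0]]   # mean is [1.0, 1.0]
female_vectors = [[2.0, 2.0]]             # mean is [2.0, 2.0]
score = context_similarity_score(male_vectors, female_vectors)
```

Because the two averaged toy vectors point in the same direction, the score is approximately 1, the maximum. Cosine similarity depends only on direction and not on vector length, which is why the differing magnitudes do not lower the score.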

This score is calculated for a given sentiment and a given IAB vertical.

The higher this score, the better the balance between the adjectives used around the two genders in the context of a given sentiment and a given IAB vertical.

Let us look at the Context Based Similarity score in action:

Comparing the two scores, we can see that we get a higher score in the case of Negative sentiment, where similar kinds of adjectives (equally negative in this case) were used around Males and Females. On the other hand, we get a lower score in the case of Positive sentiment, where we did see some form of bias, with Intellectual Type Adjectives being used around Males and Appearance Type Adjectives being used around Females.

Conclusion

In this blog we saw how we can analyze the Descriptive Language used around Males and Females. We looked at the insights from such an analysis and saw how they can point us to where change might be required. We also looked at how GumGum can leverage product offerings like Contextual Classification and Named Entity Recognition from its wide range of features and build upon them to quantify the degree of similarity in the descriptive language used around Males and Females. As part of our future work, we can work on identifying mentions of Race in a piece of text and extend this work to understand the Descriptive Language used around different Races.

About Me: I graduated with a Master’s in Computer Science from ASU and am an NLP Scientist at GumGum. I am interested in applying Machine Learning/Deep Learning to provide some structure to the unstructured data that surrounds us.

We’re always looking for new talent! View jobs.

Follow us: Facebook | Twitter | LinkedIn | Instagram

Original article: https://medium.com/gumgum-tech/descriptive-language-understanding-to-identify-potential-bias-in-text-89936fefbae7
