Unsolved Problems in Natural Language Datasets

Garbage in, garbage out. You don’t have to be an ML expert to have heard this phrase. Models uncover patterns in the data, so when the data is broken, they develop broken behavior. This is why researchers allocate significant resources towards curating datasets. However, despite best efforts, it is nearly impossible to collect perfectly clean data, especially at the scale demanded by deep learning.

This article discusses popular natural language datasets that turned out to disobey fundamental principles of machine learning and data science, despite being produced by experts in the field. Some of these flaws were exposed and quantified only years after the datasets were published and heavily used. This goes to show that data collection and validation are arduous processes. Here are some of the main impediments:

  1. Machine learning is data hungry. The sheer volume of data needed for ML (deep learning in particular) calls for automation, i.e., mining the Internet. Datasets end up inheriting undesirable properties from the Internet (e.g., duplication, statistical biases, falsehoods) that are non-trivial to detect and remove (a deduplication sketch follows this list).

  2. Desiderata cannot be captured exhaustively. Even in the presence of an oracle that could produce infinite data according to some predefined rules, it would be practically infeasible to enumerate all requirements. Consider the training data for a conversational bot. We can express general desiderata like diverse topics, respectful communication, or balanced exchange between interlocutors. But we don’t have enough imagination to specify all the relevant parameters.

  3. Humans take the path of least resistance. Some data collection efforts are still manageable at human scale. But we ourselves are not flawless and, despite our best efforts, are subconsciously inclined to take shortcuts. If you were tasked to write a statement that contradicts the premise “The dog is sleeping”, what would your answer be? Continue reading to find out whether you’d be part of the problem.

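As promised in point 1, exact and near-duplicate removal illustrates both the need for automation and its limits. Below is a minimal sketch of hash-based deduplication over a hypothetical document list; web-scale pipelines rely on fuzzier techniques such as MinHash, since paraphrased duplicates evade exact matching.

```python
import hashlib
import re

def normalize(text: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace, so that trivially
    different copies of the same document hash identically."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())

def deduplicate(documents):
    """Keep the first occurrence of each normalized document."""
    seen, unique = set(), []
    for doc in documents:
        digest = hashlib.md5(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = [
    "The dog is sleeping.",
    "the dog is sleeping",   # differs only in case and punctuation
    "A cat sits on the mat.",
]
print(deduplicate(docs))  # ['The dog is sleeping.', 'A cat sits on the mat.']
```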

Overlapping training and evaluation sets

ML practitioners split their data three ways: there’s a training set for actual learning, a validation set for hyperparameter tuning, and an evaluation set for measuring the final quality of the model. It is common knowledge that these sets should be mostly disjoint. When evaluating on training data, you are measuring the model’s capacity to memorize rather than its ability to recognize patterns and apply them in new contexts.

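For reference, here is what the standard three-way split looks like with scikit-learn, using a placeholder corpus and arbitrary ratios. A random split guarantees the sets share no instances, but, as the findings below show, not that they share no content.

```python
from sklearn.model_selection import train_test_split

# Hypothetical corpus of (question, answer) pairs.
data = [(f"question {i}", f"answer {i}") for i in range(1000)]

# Carve out 20% for evaluation, then 10% of the remainder for validation.
train_val, eval_set = train_test_split(data, test_size=0.2, random_state=42)
train_set, val_set = train_test_split(train_val, test_size=0.1, random_state=42)

print(len(train_set), len(val_set), len(eval_set))  # 720 80 200
```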

This guideline sounds straightforward to apply, yet Lewis et al. [1] show in a 2020 paper that the most popular open-domain question answering datasets (open-QA) have a significant overlap between their training and evaluation sets. Their analysis includes WebQuestions, TriviaQA and Open Natural Questions — datasets created by reputable institutions and heavily used as QA benchmarks.

We find that 60–70% of test-time answers are also present somewhere in the training sets. We also find that 30% of test-set questions have a near-duplicate paraphrase in their corresponding training sets.

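A first-order version of this leakage check is cheap to run on any QA dataset. The sketch below measures exact answer overlap after light normalization; this is a deliberate simplification, since Lewis et al. also used paraphrase detection and manual annotation to find question overlap, which exact matching misses.

```python
def normalize(text: str) -> str:
    return " ".join(text.lower().split())

def answer_overlap(train_pairs, test_pairs):
    """Fraction of test answers that also appear as a training answer."""
    train_answers = {normalize(a) for _, a in train_pairs}
    hits = sum(normalize(a) in train_answers for _, a in test_pairs)
    return hits / len(test_pairs)

train = [("who wrote hamlet", "shakespeare"), ("capital of france", "paris")]
test = [("who is the author of hamlet", "shakespeare"),
        ("largest planet", "jupiter")]
print(answer_overlap(train, test))  # 0.5
```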

Of course, a 0% overlap between training and testing would not be ideal either. We do want some degree of memorization — models should be able to answer questions seen during training and know when to surface previously-seen answers. The real problem is benchmarking a model on a dataset with high training/evaluation overlap and making rushed conclusions about its generalization ability.

Lewis et al. [1] re-evaluate state-of-the-art QA models after partitioning the evaluation sets into three subsets: (a) question overlap — for which identical or paraphrased question-answer pairs occur in the training set, (b) answer overlap only—for which the same answers occur in the training set, but paired with a different question, and (c) no overlap. QA models score vastly differently across these three subsets. For instance, when tested on Open Natural Questions, the state-of-the-art Fusion-in-Decoder model scores ~70% on question overlap, ~50% on answer overlap only, ~35% on no overlap.

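The same bookkeeping extends to a behaviour-driven report: bucket each test question by overlap type and score each bucket separately. A sketch, again substituting exact matching for the paper’s manual paraphrase annotation:

```python
def norm(text: str) -> str:
    return " ".join(text.lower().split())

def partition_eval(test_pairs, train_pairs):
    """Split an evaluation set into the three buckets of Lewis et al.,
    using exact matching as a crude stand-in for manual annotation."""
    train_q = {norm(q) for q, _ in train_pairs}
    train_a = {norm(a) for _, a in train_pairs}
    buckets = {"question_overlap": [], "answer_overlap_only": [], "no_overlap": []}
    for q, a in test_pairs:
        if norm(q) in train_q:
            buckets["question_overlap"].append((q, a))
        elif norm(a) in train_a:
            buckets["answer_overlap_only"].append((q, a))
        else:
            buckets["no_overlap"].append((q, a))
    return buckets

# Report accuracy per bucket instead of one overall number
# ('model' and 'evaluate' are hypothetical):
# for name, subset in partition_eval(test, train).items():
#     print(name, evaluate(model, subset))
```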

It is clear that performance on these datasets cannot be properly understood by overall QA accuracy, and we suggest that, in future, a greater emphasis should be placed on more behaviour-driven evaluation rather than pursuing single-number overall accuracy figures.

Spurious correlations

Just like humans, models take shortcuts and discover the simplest patterns that explain the data. For instance, consider a dog-vs-cat image classifier and a naïve training set in which all dog images are grayscale and all cat images are in full color. The model will most likely latch onto the spurious correlation between presence/absence of color and labels. When tested on a dog in full color, it will probably label it as a cat.

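The shortcut can be made embarrassingly literal. The toy “classifier” below, fed hypothetical random image arrays, is perfectly accurate on the flawed training set described above without learning anything about dogs or cats:

```python
import numpy as np

def is_grayscale(img: np.ndarray) -> bool:
    """True if all three channels are identical, i.e. the image has no color."""
    return (np.array_equal(img[..., 0], img[..., 1])
            and np.array_equal(img[..., 1], img[..., 2]))

def shortcut_classifier(img: np.ndarray) -> str:
    # Never looks at shapes, fur, or ears, only at the presence of color.
    return "dog" if is_grayscale(img) else "cat"

gray_dog = np.repeat(np.random.randint(0, 256, (64, 64, 1)), 3, axis=-1)
color_dog = np.random.randint(0, 256, (64, 64, 3))

print(shortcut_classifier(gray_dog))   # 'dog' (right, for the wrong reason)
print(shortcut_classifier(color_dog))  # 'cat' (the shortcut breaks on color)
```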

Gururangan et al. [2] showed that similar spurious correlations occur in two of the most popular natural language inference (NLI) datasets, SNLI (Stanford NLI) and MNLI (Multi-genre NLI). Given two statements, a premise and a hypothesis, the natural language inference task is to decide the relationship between them: entailment, contradiction or neutrality. Here is an example from the MNLI dataset:

[Image: an example premise and hypothesis pair from the MNLI dataset]

Solving NLI requires understanding the subtle connection between the premise and the hypothesis. However, Gururangan et al. [2] revealed that, when models are shown the hypothesis alone, they can achieve accuracy as high as 67% on SNLI and 53% on MNLI. This is significantly higher than the most-frequent-class baseline (~35%), surfacing undeniable flaws in the datasets.

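This diagnostic is straightforward to reproduce in spirit: train a classifier that never sees the premise and compare it against the most-frequent-class baseline. A minimal sketch with made-up toy data (the actual study used the full SNLI/MNLI training sets and a stronger text classifier):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy hypotheses with the artifacts described below baked in.
hypotheses = ["The dog is not sleeping", "The animal is resting",
              "A tall man is sleeping", "The cat is not outside",
              "Some people are resting", "The first dog is sleeping"]
labels = ["contradiction", "entailment", "neutral",
          "contradiction", "entailment", "neutral"]

# Premises are deliberately withheld from the model.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(hypotheses, labels)

# Held-out accuracy well above the most-frequent-class baseline
# signals annotation artifacts in the dataset.
print(clf.predict(["The bird is not flying"]))  # likely ['contradiction']
```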

How did this happen? SNLI and MNLI were both crowd-sourced; humans were given a premise and asked to produce three hypotheses, one for each label. Which brings us back to the premise “The dog is sleeping”. How would you contradict it? “The dog is not sleeping” is a perfectly reasonable candidate. However, if negation is consistently applied as a heuristic, models learn to detect contradiction by simply checking for the occurrence of “not” in the hypothesis, achieving high accuracy without even reading the premise.

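The shortcut itself fits in a few lines. If annotators mostly contradict a premise by negating it, a rule like the one below (with a hypothetical negation list) separates contradictions from the other labels far better than chance, without ever reading the premise:

```python
NEGATIONS = {"not", "no", "never", "nobody", "nothing"}

def negation_heuristic(hypothesis: str) -> str:
    """Predict 'contradiction' whenever a negation word appears, which is
    exactly the shortcut a model can learn from this annotation artifact."""
    return "contradiction" if set(hypothesis.lower().split()) & NEGATIONS else "other"

print(negation_heuristic("The dog is not sleeping"))  # contradiction
print(negation_heuristic("The animal is resting"))    # other
```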

Gururangan et al. [2] reveal several other such annotation artefacts (a sketch for surfacing them statistically follows the list):

  • Entailment hypotheses were produced by generalizing words found in the premise (dog → animal, 3 → some, woman → person), making entailment recognizable from the hypothesis alone.

  • Neutral hypotheses were produced by injecting modifiers (tall, first, most) as an easy way to introduce information not entailed by the premise but also not contradictory to it.

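One way to surface such artefacts automatically is to rank hypothesis tokens by pointwise mutual information (PMI) with each label, roughly the statistic reported in the paper. A loose, unsmoothed sketch (real use needs count smoothing so that rare tokens don’t dominate, as the paper applies):

```python
import math
from collections import Counter

def artifact_tokens(pairs, top_k=5):
    """Rank (token, label) pairs by PMI over (hypothesis, label) data.
    Tokens with high PMI for one label, such as 'not' with contradiction,
    are candidate annotation artifacts."""
    joint, tokens, labels, n = Counter(), Counter(), Counter(), 0
    for hypothesis, label in pairs:
        for tok in set(hypothesis.lower().split()):
            joint[(tok, label)] += 1
            tokens[tok] += 1
            labels[label] += 1
            n += 1
    pmi = {
        (tok, lab): math.log((c / n) / ((tokens[tok] / n) * (labels[lab] / n)))
        for (tok, lab), c in joint.items()
    }
    return sorted(pmi.items(), key=lambda kv: -kv[1])[:top_k]

pairs = [("the dog is not sleeping", "contradiction"),
         ("the animal is resting", "entailment"),
         ("a tall first man is sleeping", "neutral")]
print(artifact_tokens(pairs, top_k=3))
```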

Despite these discoveries, MNLI remains on the GLUE leaderboard, one of the most popular benchmarks for natural language processing. Due to its considerable size compared to the other GLUE corpora (~400k data instances), MNLI is prominently featured in abstracts and used in ablation studies. While its shortcomings are starting to be recognized more widely, it is unlikely to lose its popularity until we find a better alternative.

Bias and under-representation

In the past few years, bias in machine learning has been exposed across multiple dimensions including gender and race. In response to biased word embeddings and model behavior, the research community has been directing increasingly more efforts towards bias mitigation, as illustrated by Sun et al. [3] in their comprehensive literature review.

Yann LeCun, co-recipient of the 2018 Turing Award, pointed out that biased data leads to biased model behavior:

[Embedded tweet]

His Tweet drew a lot of engagement from the research community, with mixed reactions. On the one hand, people acknowledged almost unanimously that bias does exist in many datasets. On the other hand, some disagreed with the perceived implication that bias stems solely from data, additionally blaming modeling and evaluation choices, and the unconscious bias of those who design and build the models. Yann LeCun later clarified that he does not consider data bias to be the only cause for societal bias in models:

[Embedded tweet]

Even though the dataset being discussed was an image corpus used for computer vision, natural language processing suffers no less from biased datasets. A prominent task that has exposed gender bias is coreference resolution, where a referring expression (like a pronoun) must be linked to an entity mentioned in the text. Here is an example from Webster et al. [4]:

In May, Fujisawa joined Mari Motohashi’s rink as the team’s skip, moving back from Karuizawa to Kitami where she had spent her junior days.

The authors point out that less than 15% of biographies on Wikipedia are about women, and that these pages tend to discuss marriage and divorce more prominently than pages about men do. Given that many NLP datasets are extracted from Wikipedia, this impacts many downstream tasks. For coreference resolution in particular, the lack of female pronouns or their association with certain stereotypes is problematic. For instance, how would you interpret the sentence “Mary saw her doctor as she entered the room”?

Eliminating bias from the training data is an unsolved problem. First, because we cannot exhaustively enumerate the axes in which bias manifests; in addition to gender and race, there are many other subtle dimensions that can invite bias (age, proper names, profession, etc.). Second, even if we selected a single axis like gender, removing bias would either mean dropping a large portion of the data or applying error-prone heuristics to turn male pronouns into under-represented gender pronouns. Instead, the research community is currently focusing on producing unbiased evaluation datasets, since their smaller scale is more conducive to manual intervention. This at least gives us the ability to measure the performance of our models more truthfully, across a representative sample of the population.

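To see why such pronoun-swapping heuristics are error-prone, consider a literal implementation. The mapping below is deliberately naive: English “her” collapses the possessive and object cases, so surface-level swapping produces ungrammatical output:

```python
import re

SWAP = {"he": "she", "him": "her", "his": "her", "she": "he", "her": "him"}

def naive_pronoun_swap(text: str) -> str:
    """Swap gendered pronouns word by word, preserving capitalization."""
    def swap(match: re.Match) -> str:
        word = match.group(0)
        repl = SWAP.get(word.lower(), word)
        return repl.capitalize() if word[0].isupper() else repl
    return re.sub(r"\b\w+\b", swap, text)

print(naive_pronoun_swap("Mary saw her doctor as she entered the room."))
# 'Mary saw him doctor as he entered the room.'
# 'him doctor' is wrong: possessive 'her' should become 'his', but the
# surface form alone cannot tell the two cases apart.
```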

Building natural language datasets is a never-ending process: we continuously collect data, validate it, acknowledge its shortcomings and work around them. Then we rinse and repeat whenever a new source becomes available. And in the meantime we make progress. All the datasets mentioned above, despite their flaws, have undeniably helped push natural language understanding forward.

Translated from: https://towardsdatascience.com/unsolved-problems-in-natural-language-datasets-2b09ab37e94c

