Why More Data Is Not Always Better
Over the past few years, there has been a growing consensus that the more data one has, the better the eventual analysis will be.
However, just as humans can become overwhelmed by too much information, so can machine learning models.
Hotel Cancellations as an Example
I was thinking about this issue recently when reflecting on a side project I have been working on for the past year — Predicting Hotel Cancellations with Machine Learning.
Having written numerous articles about the topic on Medium, it is clear to me that the landscape for the hospitality industry has changed fundamentally in the past year.
With a growing emphasis on “staycations”, or local holidays, the assumptions that any machine learning model should make when predicting hotel cancellations have fundamentally changed.
The original data from Antonio, Almeida and Nunes (2016) used datasets from Portuguese hotels with a response variable indicating whether the customer had cancelled their booking or not, along with other information on that customer such as country of origin, market segment, etc.
In the two datasets in question, approximately 55-60% of all customers were international customers.
However, let’s assume the following scenario for a moment: this time next year, hotel occupancy is back to normal levels, but the vast majority of customers are domestic, in this case from Portugal. For the purposes of this example, let’s take the extreme scenario in which 100% of customers are domestic.
Such an assumption will radically affect the ability of any previously trained model to accurately forecast cancellations. Let’s take an example.
Classification Using an SVM Model
An SVM model was originally used to predict hotel cancellations: the model was trained on one dataset (H1), and its predictions on the feature data of a separate test set (H2) were then compared against that set’s actual outcomes. The response variable is categorical (1 = booking cancelled by the customer, 0 = booking not cancelled by the customer).
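As a rough sketch of this setup (not the author’s original code), here is a minimal scikit-learn pipeline with synthetic stand-in data playing the role of H1 and H2; the feature choices and cancellation rule are illustrative assumptions only:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)

def make_hotel_like(n):
    """Synthetic stand-in for the H1/H2 booking data (illustrative only)."""
    lead_time = rng.integers(0, 400, n)   # days between booking and arrival
    adr = rng.uniform(20, 300, n)         # average daily rate
    X = np.column_stack([lead_time, adr])
    # Stylised assumption: longer lead times make cancellation more likely
    p_cancel = 1 / (1 + np.exp(-(lead_time - 150) / 60))
    y = (rng.uniform(size=n) < p_cancel).astype(int)  # 1 = cancelled
    return X, y

X_train, y_train = make_hotel_like(2000)  # plays the role of H1
X_test, y_test = make_hotel_like(1000)    # plays the role of H2

# Feature scaling matters for SVMs, so wrap the classifier in a pipeline
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
```

The key point is only the shape of the workflow: fit on one dataset, then score predictions against a second, held-out dataset.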
Here are the results as displayed by a confusion matrix across three different scenarios.
Scenario 1: Trained on H1 (full dataset), tested on H2 (full dataset)
[[25217 21011]
[ 8436 24666]]
precision recall f1-score support
0 0.75 0.55 0.63 46228
1 0.54 0.75 0.63 33102
accuracy 0.63 79330
macro avg 0.64 0.65 0.63 79330
weighted avg 0.66 0.63 0.63 79330
Overall accuracy comes in at 63%, while recall for the positive class (cancellations) comes in at 75%. To clarify, recall in this instance means that of all the cancellation incidences, the model correctly identifies 75% of them.
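These numbers can be recomputed directly from the confusion matrix above, which makes the definitions of accuracy and recall concrete:

```python
import numpy as np

# Scenario 1 confusion matrix: rows = actual class, columns = predicted class
cm = np.array([[25217, 21011],
               [ 8436, 24666]])

tn, fp = cm[0]   # actual non-cancellations
fn, tp = cm[1]   # actual cancellations

accuracy = (tn + tp) / cm.sum()
recall_cancel = tp / (tp + fn)       # share of actual cancellations caught
precision_cancel = tp / (tp + fp)    # share of predicted cancellations correct

print(round(accuracy, 2))          # 0.63
print(round(recall_cancel, 2))     # 0.75
print(round(precision_cancel, 2))  # 0.54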
Now let’s see what happens when we train the SVM model on the full training set, but only include domestic customers from Portugal in our test set.
Scenario 2: Trained on H1 (full dataset), tested on H2 (domestic only)
[[10879 0]
[20081 0]]
precision recall f1-score support
0 0.35 1.00 0.52 10879
1 0.00 0.00 0.00 20081
accuracy 0.35 30960
macro avg 0.18 0.50 0.26 30960
weighted avg 0.12 0.35 0.18 30960
Accuracy has dropped dramatically to 35%, while recall for the cancellation class has dropped to 0% (meaning the model has not predicted any of the cancellation incidences in the test set). The performance in this instance is clearly very poor.
Scenario 3: Trained on H1 (domestic only), tested on H2 (domestic only)
However, what if the training set was modified to only include customers from Portugal and the model trained once again?
[[ 8274 2605]
[ 6240 13841]]
precision recall f1-score support
0 0.57 0.76 0.65 10879
1 0.84 0.69 0.76 20081
accuracy 0.71 30960
macro avg 0.71 0.72 0.70 30960
weighted avg 0.75 0.71 0.72 30960
Accuracy is back up to 71%, while recall is at 69%. Using less, but more relevant, data in the training set allowed the SVM model to predict cancellations across the test set much more accurately.
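The Scenario 3 idea can be sketched by filtering both sets to domestic bookings before fitting. The data below is synthetic (the Country column and the "PRT" ISO code follow the published dataset’s schema, but the cancellation rule is invented to mimic domestic and international guests behaving differently), so its numbers will not match those above:

```python
import numpy as np
import pandas as pd
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)

def make_bookings(n):
    country = rng.choice(["PRT", "GBR", "FRA"], size=n, p=[0.45, 0.30, 0.25])
    lead_time = rng.integers(0, 400, n)
    # Invented rule: domestic and international guests cancel past different
    # lead-time thresholds -- this is what misleads a mixed-data model
    threshold = np.where(country == "PRT", 120, 250)
    cancelled = (lead_time > threshold).astype(int)
    return pd.DataFrame({"Country": country, "LeadTime": lead_time,
                         "IsCanceled": cancelled})

h1, h2 = make_bookings(3000), make_bookings(1500)  # stand-ins for H1, H2

# Scenario 3: restrict BOTH the training and test sets to domestic bookings
h1_dom = h1[h1["Country"] == "PRT"]
h2_dom = h2[h2["Country"] == "PRT"]

model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
model.fit(h1_dom[["LeadTime"]], h1_dom["IsCanceled"])
acc = model.score(h2_dom[["LeadTime"]], h2_dom["IsCanceled"])
print(round(acc, 2))
```

Because the filtered training set now reflects a single, consistent cancellation pattern, the model recovers it cleanly, mirroring the accuracy jump from Scenario 2 to Scenario 3.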
If the Data Is Wrong, Model Results Will Also Be Wrong
More data is not better if much of that data is irrelevant to what you are trying to predict. Even machine learning models can be misled if the training set is not representative of reality.
This was cited by a Columbia Business School study as an issue in the 2016 U.S. presidential election, where the polls had put Clinton firmly in the lead against Trump. However, it turned out that there were many “secret Trump voters” who had not been accounted for in the polls, and this skewed the results towards a predicted Clinton win.
I’m non-U.S. and neutral on the subject by the way — I simply use this as an example to illustrate that even data we often think of as “big” can still contain inherent biases and may not be representative of what is actually going on.
Instead, the choice of data needs to be scrutinised as much as model selection, if not more so. Is the data we include actually relevant to the problem we are trying to solve?
Going back to the hotel example, inclusion of international customer data in the training set did not enhance our model when our goal is to predict cancellations across the domestic customer base.
Conclusion
There is an increasing push to gather more data across all domains. While more data in and of itself is not a bad thing, it should not be assumed that blindly introducing more data into a model will improve its accuracy.
Rather, data scientists still need the ability to determine the relevance of such data to the problem at hand. From this point of view, model selection becomes somewhat of an afterthought. If the data is representative of the problem you are trying to solve in the first instance, then even simpler machine learning models will generate strong predictive results.
Many thanks for reading, and feel free to leave any questions or feedback in the comments below.
If you are interested in taking a deeper look at the hotel cancellation example, you can find my GitHub repository here.
Disclaimer: This article is written on an “as is” basis and without warranty. It was written with the intention of providing an overview of data science concepts, and should not be interpreted as professional advice in any way.
Original article: https://towardsdatascience.com/why-more-data-is-not-always-better-de96723d1499