Why More Data Is Not Always Better


Over the past few years, a consensus has grown that the more data one has, the better the eventual analysis will be.

However, just as humans can be overwhelmed by too much information, so can machine learning models.

Hotel Cancellations as an Example

I was thinking about this issue recently while reflecting on a side project I have been working on for the past year: Predicting Hotel Cancellations with Machine Learning.

Having written numerous articles about the topic on Medium, I can see that the landscape for the hospitality industry has changed fundamentally in the past year.

The growing emphasis on “staycations”, or local holidays, fundamentally changes the assumptions that any machine learning model should make when predicting hotel cancellations.

The original data from Antonio, Almeida and Nunes (2016) comprises datasets from Portuguese hotels, with a response variable indicating whether or not the customer cancelled their booking, along with other information on that customer such as country of origin and market segment.

In the two datasets in question, approximately 55-60% of all customers were international customers.


However, consider the following scenario. This time next year, hotel occupancy is back to normal levels, but the vast majority of customers are domestic, in this case from Portugal. For the purposes of this example, let’s assume the extreme case in which 100% of customers are domestic.

Such a shift would radically affect the ability of any previously trained model to forecast cancellations accurately. Let’s take an example.

Classification Using an SVM Model

An SVM model was originally used to predict hotel cancellations. The model was trained on one dataset (H1), and its predictions on the feature data of a second dataset (H2) were then compared against the actual outcomes in H2, which served as the test set. The response variable is categorical (1 = booking cancelled by customer, 0 = booking not cancelled by customer).
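As a rough illustration of the train-on-one-set, test-on-another workflow described above, here is a minimal sketch using scikit-learn. The data is synthetic (generated with `make_classification`), and the scaling step, the default kernel, and the split sizes are assumptions chosen for illustration; none of them are taken from the original project.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data: the first 400 rows play the role of H1 (training),
# the remaining 200 rows play the role of H2 (test). Not the real hotel data.
X, y = make_classification(n_samples=600, n_features=6, random_state=0)
X_h1, y_h1 = X[:400], y[:400]
X_h2, y_h2 = X[400:], y[400:]

# Fit the scaler on the training set only, then reuse it for the test set.
scaler = StandardScaler().fit(X_h1)

# Train the SVM on H1 (assumed settings; the article does not list them).
model = SVC(gamma="scale")
model.fit(scaler.transform(X_h1), y_h1)

# Predict on the H2 features and compare with the actual H2 labels.
predictions = model.predict(scaler.transform(X_h2))
matrix = confusion_matrix(y_h2, predictions)

print(matrix)
print(classification_report(y_h2, predictions))
```

The printed confusion matrix and report have the same layout as the scenario results in this article, though the numbers will of course differ because the data here is synthetic.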

Here are the results, shown as a confusion matrix and classification report for each of three scenarios.

Scenario 1: Trained on H1 (full dataset), tested on H2 (full dataset)

[[25217 21011]
 [ 8436 24666]]

              precision    recall  f1-score   support

           0       0.75      0.55      0.63     46228
           1       0.54      0.75      0.63     33102

    accuracy                           0.63     79330
   macro avg       0.64      0.65      0.63     79330
weighted avg       0.66      0.63      0.63     79330

Overall accuracy comes in at 63%, while recall for the positive class (cancellations) comes in at 75%. To clarify, recall in this instance means that of all the actual cancellations, the model correctly identifies 75% of them.
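To make the recall figure concrete, the Scenario 1 metrics can be recomputed by hand from the confusion matrix. This is a small sketch in plain Python; it assumes the matrix follows the scikit-learn layout (rows are actual classes, columns are predicted classes), which is consistent with the support counts in the report.

```python
# Scenario 1 confusion matrix, assumed layout: [[TN, FP], [FN, TP]]
# (rows = actual class, columns = predicted class).
matrix = [[25217, 21011],
          [8436, 24666]]

(tn, fp), (fn, tp) = matrix

# Recall: of all actual cancellations, what share did the model catch?
recall_cancel = tp / (tp + fn)

# Precision: of all predicted cancellations, what share were real?
precision_cancel = tp / (tp + fp)

# Accuracy: share of all bookings classified correctly.
accuracy = (tn + tp) / (tn + fp + fn + tp)

print(f"recall (class 1):    {recall_cancel:.2f}")     # 0.75
print(f"precision (class 1): {precision_cancel:.2f}")  # 0.54
print(f"accuracy:            {accuracy:.2f}")          # 0.63
```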

Now let’s see what happens when we train the SVM model on the full training set, but only include domestic customers from Portugal in our test set.


Scenario 2: Trained on H1 (full dataset), tested on H2 (domestic only)

[[10879     0]
 [20081     0]]

              precision    recall  f1-score   support

           0       0.35      1.00      0.52     10879
           1       0.00      0.00      0.00     20081

    accuracy                           0.35     30960
   macro avg       0.18      0.50      0.26     30960
weighted avg       0.12      0.35      0.18     30960

Accuracy has dropped dramatically to 35%, while recall for the cancellation class has dropped to 0%: the model predicted “not cancelled” for every booking, so it identified none of the cancellations in the test set. Performance in this instance is clearly very poor.
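The Scenario 2 numbers can be unpacked a little further using only the support counts from the two reports above. The sketch below (plain Python, with helper names made up for illustration) shows that the domestic subset has a much higher cancellation rate than the full test set, and that 35% is exactly the accuracy of a model that never predicts a cancellation.

```python
# Support counts copied from the classification reports above.
full_h2 = {"not_cancelled": 46228, "cancelled": 33102}
domestic_h2 = {"not_cancelled": 10879, "cancelled": 20081}


def cancellation_rate(counts):
    """Share of bookings in the set that were cancelled."""
    return counts["cancelled"] / (counts["cancelled"] + counts["not_cancelled"])


def always_negative_accuracy(counts):
    """Accuracy of a model that predicts 'not cancelled' for every booking."""
    return counts["not_cancelled"] / (counts["cancelled"] + counts["not_cancelled"])


full_rate = cancellation_rate(full_h2)
domestic_rate = cancellation_rate(domestic_h2)
baseline_accuracy = always_negative_accuracy(domestic_h2)

print(f"cancellation rate, full H2:     {full_rate:.2f}")          # 0.42
print(f"cancellation rate, domestic H2: {domestic_rate:.2f}")      # 0.65
print(f"always-'not cancelled' accuracy: {baseline_accuracy:.2f}")  # 0.35
```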

Scenario 3: Trained on H1 (domestic only), tested on H2 (domestic only)

However, what if the training set were modified to include only customers from Portugal, and the model trained once again?

[[ 8274  2605]
 [ 6240 13841]]

              precision    recall  f1-score   support

           0       0.57      0.76      0.65     10879
           1       0.84      0.69      0.76     20081

    accuracy                           0.71     30960
   macro avg       0.71      0.72      0.70     30960
weighted avg       0.75      0.71      0.72     30960

Accuracy is back up to 71%, while recall for the cancellation class is at 69%. Using less, but more relevant, data in the training set has allowed the SVM model to predict cancellations across the domestic test set much more accurately.
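Restricting a dataset to domestic customers, as done for the domestic-only scenarios, amounts to a simple filter on country of origin. Here is a minimal sketch in plain Python; the records, the field names (`country`, `lead_time`, `is_canceled`), and the country code `"PRT"` are hypothetical stand-ins rather than the exact schema used in the project.

```python
# Toy booking records; field names and values are hypothetical stand-ins.
bookings = [
    {"country": "PRT", "lead_time": 12, "is_canceled": 1},
    {"country": "GBR", "lead_time": 45, "is_canceled": 0},
    {"country": "PRT", "lead_time": 3, "is_canceled": 0},
    {"country": "ESP", "lead_time": 60, "is_canceled": 1},
]


def domestic_only(records, home_country="PRT"):
    """Keep only bookings whose country of origin matches the hotel's country."""
    return [r for r in records if r["country"] == home_country]


train_domestic = domestic_only(bookings)

print(len(bookings), "->", len(train_domestic))  # 4 -> 2
```

The same filter would be applied to both the training and the test set, so that the model is trained on the same kind of customer it is later asked to predict.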

If the Data Is Wrong, Model Results Will Also Be Wrong

More data is not better if much of that data is irrelevant to what you are trying to predict. Even machine learning models can be misled if the training set is not representative of reality.

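One simple way to check whether a training set resembles the population you want to predict on is to compare how a key category is distributed in each. The sketch below is plain Python with made-up helper names; the 58% international share is an illustrative value within the 55-60% range mentioned earlier, and the all-domestic target mirrors the extreme scenario assumed in this article.

```python
from collections import Counter


def category_shares(values):
    """Map each category to its share of the total."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def largest_share_gap(train_values, target_values):
    """Largest absolute difference in category share between two datasets."""
    train = category_shares(train_values)
    target = category_shares(target_values)
    categories = set(train) | set(target)
    return max(abs(train.get(c, 0.0) - target.get(c, 0.0)) for c in categories)


# Illustrative data: 58% international in training, 100% domestic in the target.
train_origin = ["international"] * 58 + ["domestic"] * 42
target_origin = ["domestic"] * 100

gap = largest_share_gap(train_origin, target_origin)
print(f"largest share gap: {gap:.2f}")  # 0.58
```

A large gap does not prove that a model will fail, but it is a cheap warning sign that the training data may not reflect the cases the model will actually face.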

A Columbia Business School study cited this as an issue in the 2016 U.S. presidential election, where the polls had put Clinton in a firm lead over Trump. However, it turned out that there were many “secret Trump voters” who had not been accounted for in the polls, and this had skewed the results towards a predicted Clinton win.

I’m not from the U.S. and am neutral on the subject, by the way. I simply use this as an example to illustrate that even data we often think of as “big” can still contain inherent biases and may not be representative of what is actually going on.

Instead, the choice of data needs to be scrutinised as much as model selection, if not more so. Is the inclusion of certain data relevant to the problem that we are trying to solve?

Going back to the hotel example, including international customer data in the training set did not improve our model when the goal was to predict cancellations across the domestic customer base.

Conclusion

There is increasingly a push to gather more data across all domains. While more data in and of itself is not a bad thing, it should not be assumed that blindly introducing more data into a model will improve its accuracy.

Rather, data scientists still need the ability to determine the relevance of such data to the problem at hand. From this point of view, model selection becomes something of an afterthought. If the data is representative of the problem that you are trying to solve in the first place, then even simpler machine learning models can generate strong predictive results.

Many thanks for reading, and feel free to leave any questions or feedback in the comments below.


If you are interested in taking a deeper look at the hotel cancellation example, you can find my GitHub repository here.


Disclaimer: This article is written on an “as is” basis and without warranty. It was written with the intention of providing an overview of data science concepts, and should not be interpreted as professional advice in any way.


Translated from: https://towardsdatascience.com/why-more-data-is-not-always-better-de96723d1499
