Knowing when to stop

In predictive analytics, it can be a tricky thing to know when to stop.

Unlike many of life’s activities, there’s no definitive finishing line, after which you can say “tick, I’m done”. The possibility always remains that a little more work can yield an improvement to your model. With so many variables to tweak, it’s easy to end up obsessing over tenths of a percentage point, pouring huge amounts of effort into the details before looking up and wondering “Where did the time go?”.

Iterating on your model through feature engineering, model selection and hyper-parameter tuning is a key skill for any data scientist. But knowing when to stop is rarely addressed, and it can vastly alter the cost of model development and the ROI of a data science project.

I’m not talking about over- vs under-fitting here. Over-fitting is where your model fits your training data too closely, and it can be detected by comparing the training set error with the validation set error. There are many great tutorials on Medium and elsewhere which explain all this in much more detail.

I’m referring to the time you spend working on the entire modelling pipeline, and how you quantify the rewards and justify the cost.

Strategies

Some strategies that can help you decide when to wrap things up might be:

  • Set a deadline: Parkinson’s law states that “work expands so as to fill the time available for its completion”. An open-ended time-frame invites you to procrastinate by spending time on things that ultimately don’t add much value to the end result. Setting yourself a deadline keeps costs low and predictable by forcing you to prioritise effectively. The downside, of course, is that if you set your deadline too aggressively, you may deliver a model of poor quality.

  • Acceptable error rate: You could decide beforehand on an acceptable error rate and stop once you reach it. For example, a self-driving car might try to identify cyclists with a 99.99% level of accuracy. The difficulty of this approach is that before you start experimenting, it’s very hard to set expectations as to how accurate your model could be. Your desired accuracy rate might be impossible, given the level of irreducible error. On the other hand, you might stop prematurely while there is still room to improve your model easily.

  • Value gradient method: By plotting the real-world cost of error in your model against the effort required to improve the model, you gain an understanding of the return on investment for each incremental improvement. This allows you to keep developing your model, stopping only when the predicted value of additional tuning falls below the value of your time.

The law of diminishing returns

As you invest time into tweaking your model, you may find that your progress is fast in the beginning, but quickly plateaus. You’ll likely perform the most obvious improvements first, but as time goes by you’ll end up working harder and harder for smaller gains. Within the data itself, the balance between reducible and irreducible error puts an upper limit on the level of accuracy that your model can achieve.

In a learning exercise or a Kaggle competition, you can iterate to your heart’s content, chasing those incremental improvements further and further down the decimal places.

However, for a commercial project, the cost of tuning the model climbs linearly with the amount of time you invest. This means there comes a point where squeezing out an extra 0.1% is no longer worth the investment.

This varies from project to project. If you’re working with supermarket data, given the huge number of purchases made every day, an additional hundredth of a percentage point of accuracy might be worth a lot of money. That puts a strong ROI on continued efforts to improve your model. But for projects of more modest scale, you might have to draw the line a bit sooner.
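
To make the effect of scale concrete, here is a minimal sketch. The function name and every figure in it (prediction volumes, value per correct prediction) are hypothetical and chosen only for illustration; they are not taken from any real project.

```python
def annual_value_of_gain(accuracy_gain_pct, predictions_per_year,
                         value_per_correct_prediction):
    """Rough yearly value of an accuracy gain given in percentage points.

    Assumes (for illustration only) that every extra correct prediction
    is worth the same fixed amount of money.
    """
    extra_correct = predictions_per_year * accuracy_gain_pct / 100.0
    return extra_correct * value_per_correct_prediction


# The same 0.01 percentage point gain at two very different scales.
supermarket = annual_value_of_gain(0.01, 500_000_000, 0.50)
small_shop = annual_value_of_gain(0.01, 200_000, 0.50)

print(f"Supermarket chain: ${supermarket:,.2f} per year")
print(f"Small shop:        ${small_shop:,.2f} per year")
```

Under these made-up numbers the identical improvement is worth thousands of times more to the larger business, which is why the stopping point differs from project to project.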

The real-world cost of model error

When tuning a model, the values you’re likely to be paying attention to are statistical in nature. MSE, % accuracy, R² and AIC are defined by their mathematical formulae, and are indifferent to the real-world problem you’re attempting to solve.

Rather than solely considering statistical measures of accuracy and error, these should be converted into something that can be weighed against the time investment you’re making, i.e. money.

Let’s say we run an ice-cream kiosk, and we’re trying to predict how many ice-creams we’ll sell on a daily basis, using variables like the weather, day of week, time of year etc.

No model we create will be perfect, and on any given day it will usually either:

  • overestimate: meaning we buy more ingredients than we need for the number of ice-creams sold.

  • underestimate: meaning we run out of stock and lose out on potential business.

Both of these types of error introduce a monetary cost to the business. If we run out of stock at midday, we’ve lost the margin on half a day’s sales. And if we overestimate, we may end up spending money on ingredients that end up being thrown away.

We can introduce business rules on top of our model to help reduce some of this loss. The cost of losing one ice-cream’s worth of sales is likely higher than the cost of throwing away one ice-cream’s worth of out-of-date milk (given we’re hopefully making a profit). Therefore, we’ll want to be biased in favour of over-stocking, for example by holding 20% more ingredients than the model’s prediction suggests. This will greatly reduce the frequency and cost of stock outages, at the expense of having to throw out a few bottles of out-of-date milk.
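
The asymmetry between the two kinds of error can be sketched as a small cost function. This is only an illustration: the function name, the $2.00 margin per ice-cream and the $0.50 waste cost per ice-cream are assumptions, not figures from the kiosk example.

```python
def daily_cost(predicted_sales, actual_sales, buffer=0.20,
               margin_per_unit=2.00, waste_cost_per_unit=0.50):
    """Estimated money lost on one day because the forecast was imperfect.

    buffer is the over-stocking rule: 0.20 means we stock 20% more than
    the model predicts. The margin and waste costs are hypothetical.
    """
    stocked = predicted_sales * (1 + buffer)
    if stocked >= actual_sales:
        # Over-stocked: the surplus ingredients are thrown away.
        return (stocked - actual_sales) * waste_cost_per_unit
    # Under-stocked: we lose the margin on every sale we could not make.
    return (actual_sales - stocked) * margin_per_unit


# A busy day where demand beats even the buffered stock level.
print(daily_cost(predicted_sales=100, actual_sales=130))
# A day where the forecast was exactly right, so the buffer is wasted.
print(daily_cost(predicted_sales=100, actual_sales=100))
```

Because a lost sale is assumed to cost more than a wasted unit, the same size of error is more expensive when it is a shortfall, which is what motivates the bias towards over-stocking.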

Optimising this 20% rule falls under the umbrella of Prescriptive Analytics. Using the training set, we can tweak the rule until the average estimated real-world cost of the model’s error is at its lowest.
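
One simple way to tune such a rule is a grid search over candidate buffers. The sketch below assumes a hypothetical cost model ($2.00 lost margin per missed sale, $0.50 per wasted unit) and a tiny made-up training set; a real project would use its own cost figures and far more data.

```python
def daily_cost(stocked, actual_sales,
               margin_per_unit=2.00, waste_cost_per_unit=0.50):
    """Cost of one day's stocking decision (hypothetical cost figures)."""
    if stocked >= actual_sales:
        return (stocked - actual_sales) * waste_cost_per_unit
    return (actual_sales - stocked) * margin_per_unit


def average_cost(buffer, predictions, actuals):
    """Mean daily cost when stocking (1 + buffer) times each prediction."""
    costs = [daily_cost(p * (1 + buffer), a)
             for p, a in zip(predictions, actuals)]
    return sum(costs) / len(costs)


def best_buffer(predictions, actuals, candidates):
    """Return the candidate buffer with the lowest average cost."""
    return min(candidates, key=lambda b: average_cost(b, predictions, actuals))


# Made-up training data: model predictions and the sales that followed.
train_predictions = [100, 120, 80, 150, 90]
train_actuals = [110, 115, 95, 170, 85]

# Try buffers from 0% to 50% in steps of 5 percentage points.
candidates = [i / 100 for i in range(0, 51, 5)]
chosen = best_buffer(train_predictions, train_actuals, candidates)

print(f"Best buffer: {chosen:.0%}")
print(f"Average daily cost: ${average_cost(chosen, train_predictions, train_actuals):.2f}")
```

A buffer chosen this way is fitted to the training data, so it is worth checking its average cost on the validation set as well before relying on it.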

The value gradient method

Now that we have an estimated real-world cost for the model’s error, we gain an idea of what the time we’re investing in the model is worth. With each iteration, we subtract the new version’s real-world cost from that of the previous version to work out the value added by our extra effort. From there, we can extrapolate over a time window, such as a year, to estimate the ROI.

For example, your validation set may contain 1,000 rows, and your latest model saved $40 compared with the previous iteration. If you expect to collect 100,000 data-points per year, you can multiply the added value by 100 to get an annual rate. The work you put in to produce the latest version of the model therefore gives a return of $4,000 per year.
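
The scaling step in this example is simple arithmetic, sketched here with a hypothetical helper function. It assumes the validation set is representative of the data collected over the year.

```python
def annualised_saving(saving_on_validation, validation_rows, rows_per_year):
    """Scale a saving measured on the validation set up to a full year.

    Assumes the validation rows are representative of a year's data.
    """
    return saving_on_validation * rows_per_year / validation_rows


# Figures from the example: $40 saved on 1,000 rows, 100,000 rows per year.
print(annualised_saving(40, 1_000, 100_000))  # 4000.0
```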

Comparing this to the cost of our time gives us an expected return on investment. For example, if the above enhancement required a day’s work from someone earning $400 per day, it pays for itself very quickly.
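
"Very quickly" can be put into numbers. The helper names below are hypothetical, and the sketch assumes a 365-day year with the return accruing evenly across it.

```python
def payback_days(cost_of_work, annual_return, days_per_year=365):
    """Days until the accumulated return covers the cost of the work."""
    return cost_of_work * days_per_year / annual_return


def first_year_roi(cost_of_work, annual_return):
    """Net first-year return as a multiple of the cost of the work."""
    return (annual_return - cost_of_work) / cost_of_work


# One day's work at $400, returning $4,000 per year.
print(payback_days(400, 4_000))    # 36.5
print(first_year_roi(400, 4_000))  # 9.0
```

With these figures the day of work is recovered in a little over a month and returns nine times its cost within the first year.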

However, as the law of diminishing returns eats away at our rate of improvement, our margin will begin to fall. When it approaches zero, it’s time to take what we have and move on to the next stage of our project.
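
A rough sketch of this stopping rule follows. The validation costs, the scaling factor of 100 and the $400 of effort per iteration are hypothetical, and stopping at the first margin of zero or less is a simplification of "approaches zero".

```python
def iteration_margins(validation_costs, scale_to_year, effort_cost_per_iteration):
    """Annualised value added by each iteration, minus the cost of the effort.

    validation_costs[i] is the real-world cost of model version i measured
    on the validation set; scale_to_year converts it to an annual figure.
    """
    margins = []
    for previous, current in zip(validation_costs, validation_costs[1:]):
        value_added = (previous - current) * scale_to_year
        margins.append(value_added - effort_cost_per_iteration)
    return margins


def first_unprofitable_iteration(margins):
    """1-based index of the first iteration that did not pay for itself."""
    for index, margin in enumerate(margins, start=1):
        if margin <= 0:
            return index
    return None


# Hypothetical validation-set costs for five successive model versions.
validation_costs = [500, 460, 440, 432, 429]
margins = iteration_margins(validation_costs, scale_to_year=100,
                            effort_cost_per_iteration=400)

print(margins)                                # [3600, 1600, 400, -100]
print(first_unprofitable_iteration(margins))  # 4
```

In this made-up run the fourth round of tuning costs more than it returns, which would be the signal to stop.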

Of course, this is an inexact science. It assumes that improvements to our model will occur in a smooth and predictable way and that future gains will be smaller than previous improvements. Whenever you call it a day, there will be the possibility that a significant breakthrough lies just around the corner.

But it’s always a good idea to keep a commercial eye on the time you’re investing in a model, allowing you to do more valuable work by keeping costs down and freeing up time to spend on the most important things.

Coming soon: A Python module which takes the predictions, actual values and a cost-function and outputs the expected ROI for any model — allowing you to integrate the above decision making into your model tuning process.

Original article: https://towardsdatascience.com/knowing-when-to-stop-b73ceeec7d9f
