Two hours later and still running? How to keep your sklearn.fit under control

Written by Gabriel Lerner and Nathan Toubiana

All you wanted to do was test your code, yet two hours later your Scikit-learn fit shows no sign of ever finishing. Scitime is a package that predicts the runtime of machine learning algorithms so that you will not be caught off guard by an endless fit.

Whether you are in the process of building a machine learning model or deploying your code to production, knowing how long your algorithm will take to fit is key to streamlining your workflow. With Scitime you will be able to estimate, in a matter of seconds, how long the fit should take for the most commonly used Scikit Learn algorithms.

There have been a couple of research articles (such as this one) published on that subject. However, as far as we know, there’s no practical implementation of it. The goal here is not to predict the exact runtime of the algorithm but more to give a rough approximation.

What is Scitime?

Scitime is a Python package requiring at least Python 3.6, with pandas, scikit-learn, psutil and joblib as dependencies. You will find the Scitime repo here.

The main function in this package is called “time”. Given an input matrix X and an output vector y, along with the Scikit Learn model of your choice, time will output both the estimated fit time and its confidence interval. The package currently supports the following Scikit Learn algorithms, with plans to add more in the near future:

  • KMeans

  • RandomForestRegressor

  • SVC

  • RandomForestClassifier

Quick Start

Let’s install the package and run the basics.

First, create a new virtualenv (this is optional, to avoid any version conflicts!):

virtualenv env
source env/bin/activate

and then run:

(env) pip install scitime

or with conda:

(env) conda install -c conda-forge scitime

Once the installation has succeeded, you are ready to estimate the time of your first algorithm.

Let’s say you wanted to train a kmeans clustering, for example. You would first need to import the scikit-learn package, set the kmeans parameters, and also choose the inputs (a.k.a. X), here generated randomly for simplicity.

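A minimal sketch of that setup (the matrix size and kmeans parameters below are illustrative, not the exact values from the original post):

import numpy as np
from sklearn.cluster import KMeans

# inputs (a.k.a. X), generated randomly for simplicity
X = np.random.rand(100000, 10)

# set the kmeans parameters
kmeans = KMeans(n_clusters=10)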

Running this before doing the actual fit would give an approximation of the runtime:

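A sketch of that step, assuming the Estimator API described in the usage guide below (the printed values are illustrative):

from scitime import Estimator

# meta_algo, verbose and confidence are described in the usage guide below
estimator = Estimator(meta_algo='RF', verbose=0)

# estimate the fit time before actually running kmeans.fit(X);
# kmeans is unsupervised, so no y is passed
estimation, lower_bound, upper_bound = estimator.time(kmeans, X)
print(estimation, lower_bound, upper_bound)  # e.g. 15, 10, 30 (seconds)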

As you can see, you can get this info in only one extra line of code! The inputs of the time function are exactly what’s needed to run the fit (that is, the algo itself and X), which makes it even easier to use.

Looking more closely at the last line of the above code, the first output (estimation: 15 seconds in this case) is the predicted runtime you’re looking for. Scitime will also output it with a confidence interval (lower_bound and upper_bound: 10 and 30 seconds in this case). You can always compare it to the actual training time by running:

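For example, with the standard library’s time module:

import time

start = time.time()
kmeans.fit(X)  # the actual fit
print(time.time() - start)  # actual training time, in seconds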

In this case, on our local machine, the estimation is 15 seconds, whereas the actual training time is 20 seconds (but you might not get the same results, as we’ll explain later).

As a quick usage guide:

Estimator(meta_algo, verbose, confidence) class:

  • meta_algo: The estimator used to predict the time, either ‘RF’ or ‘NN’ (see details in next paragraph) — defaults to ‘RF’

  • verbose: Control of the amount of log output (either 0, 1, 2 or 3) — defaults to 0

  • confidence: Confidence for intervals — defaults to 95%

estimator.time(algo, X, y) function:

  • algo: algo whose runtime the user wants to predict

  • X: numpy array of inputs to be trained

  • y: numpy array of outputs to be trained (set to None if the algo is unsupervised)

Quick note: to avoid any confusion, it’s worth highlighting that algo and meta_algo are two different things here: algo is the algorithm whose runtime we want to estimate, while meta_algo is the algorithm used by Scitime to predict that runtime.

How Scitime works

We are able to predict the runtime of a fit by using our own estimator, which we call the meta algorithm (meta_algo), whose weights are stored in a dedicated pickle file in the package metadata. For each Scikit Learn model, you will find a corresponding meta algo pickle file in Scitime’s code base.

You might be thinking:

Why not manually estimate the time complexity with big O notations?

That’s a fair point. It’s a valid way of approaching the problem, and something we thought about at the beginning of the project. One thing, however, is that we would need to formulate the complexity explicitly for each algo and set of parameters, which is rather challenging in some cases, given the number of factors playing a role in the runtime. The meta_algo basically does all the work for you, and we’ll explain how.

Two types of meta algos have been trained to estimate the time to fit (both from Scikit Learn):

  • The RF meta algo, a RandomForestRegressor estimator.

  • The NN meta algo, a basic MLPRegressor estimator.

These meta algos estimate the time to fit using an array of ‘meta’ features. Here’s a summary of how we build these features:

First, we fetch the shape of your input matrix X and output vector y. Second, the parameters you feed to the Scikit Learn model are taken into consideration, as they will impact the training time as well. Lastly, your specific hardware, unique to your machine, such as available memory and CPU count, is also considered.

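A hedged sketch of how such meta-features could be assembled (the exact features Scitime uses internally may differ; the names here are illustrative):

import os
import psutil

def build_meta_features(X, y, algo):
    # shape of the input matrix X and output vector y
    features = {'num_rows': X.shape[0], 'num_columns': X.shape[1]}
    if y is not None:
        features['num_outputs'] = 1 if y.ndim == 1 else y.shape[1]
    # the Scikit Learn model's parameters impact the training time as well
    features.update(algo.get_params())
    # hardware specifics, unique to your machine
    features['available_memory'] = psutil.virtual_memory().available
    features['cpu_count'] = os.cpu_count()
    return features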

As shown earlier, we also provide confidence intervals on the time prediction. The way these are computed depends on the meta algo chosen:

  • For RF, since any random forest regressor is a combination of multiple trees (also called estimators), the confidence interval will be based on the distribution of the set of predictions computed by each estimator (see the sketch after this list).

  • For NN, the process is a little less straightforward: we first compute a set of MSEs along with the number of observations on a test set, grouped by predicted duration bins (that is from 0 to 1 second, 1 to 5 seconds, and so on), and we then compute a t-stat to get the lower and upper bounds of the estimation. As we don’t have a lot of data for very long models, the confidence interval for such data might get very broad.

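For the RF case, a minimal sketch of deriving such an interval from the per-tree predictions (assuming a fitted RandomForestRegressor meta algo and a 2D meta-feature array meta_X; the percentile choice is illustrative):

import numpy as np

def rf_confidence_interval(meta_rf, meta_X, confidence=0.95):
    # one runtime prediction per tree (estimator) in the forest
    per_tree = np.array([tree.predict(meta_X) for tree in meta_rf.estimators_])
    lower = np.percentile(per_tree, 100 * (1 - confidence) / 2, axis=0)
    upper = np.percentile(per_tree, 100 * (1 + confidence) / 2, axis=0)
    return lower, upper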

How we built it

You might be thinking:

How did you get enough data on the training time of all these scikit-learn fits over various parameters and hardware configurations?

The (unglamorous) answer is that we generated the data ourselves, using a combination of computers and VM hardware to simulate what the training time would be on different systems. We then fitted our meta algos on these randomly generated data points to build an estimator meant to be reliable regardless of your system.

While the estimate.py file handles the runtime prediction, the _model.py file helped us generate data to train our meta algos, using our dedicated Model class. Here’s a corresponding code sample, for kmeans:

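A hedged reconstruction of the idea (the Model class lives in _model.py, but the method names and signatures below are assumptions, not the package’s confirmed API):

from scitime._model import Model

# hypothetical sketch: generate (meta-features, runtime) data points for
# KMeans, then fit a meta algo on them
trainer = Model(algo='KMeans', meta_algo='RF', drop_rate=0.9, verbose=3)
inputs, outputs = trainer._generate_data()                     # assumed method name
meta_algo = trainer.model_fit(inputs=inputs, outputs=outputs)  # assumed method name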

Note that you can also use the file _data.py directly with the command line to generate data or train a new model. Related instructions can be found in the repo Readme file.

When generating data points, you can edit the parameters of the Scikit Learn models you want to train on. You can head to scitime/_config.json and edit the parameters of the models as well as the number of rows and columns you would want to train with.

We use an itertool function to loop through every possible combination, along with a drop rate set between 0 and 1 to control how quickly the loop will jump through the different possible iterations.

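A minimal sketch of that pattern (the parameter grid and drop rate are illustrative):

import itertools
import random

param_grid = {
    'n_clusters': [2, 10, 50],
    'max_iter': [100, 300],
    'num_rows': [1000, 10000, 100000],
}
drop_rate = 0.9  # skip ~90% of combinations to move through the loop faster

for combo in itertools.product(*param_grid.values()):
    if random.random() < drop_rate:
        continue  # jump over this iteration
    params = dict(zip(param_grid.keys(), combo))
    # ...generate random data of the requested size, fit, and record the runtime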

How accurate is Scitime?

Below, we highlight how our predictions perform for the specific case of kmeans. Our generated dataset contains ~100k data points, which we split into train and test sets (75% / 25%).

We grouped predicted training times by different time buckets and computed the MAPE and RMSE over each of those buckets for all our estimators, using the RF meta-algo and the NN meta-algo.

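A sketch of that evaluation, assuming arrays of actual and predicted fit times (the sample values and bin edges are illustrative):

import numpy as np
import pandas as pd

actual = np.array([0.5, 2.0, 12.0, 80.0])     # measured fit times (seconds)
predicted = np.array([0.6, 1.5, 15.0, 60.0])  # meta algo predictions

df = pd.DataFrame({'actual': actual, 'predicted': predicted})
# bucket by predicted duration
df['bucket'] = pd.cut(df['predicted'], bins=[0, 1, 5, 30, np.inf])

summary = df.groupby('bucket').apply(lambda g: pd.Series({
    'MAPE': np.mean(np.abs((g['actual'] - g['predicted']) / g['actual'])) * 100,
    'RMSE': np.sqrt(np.mean((g['actual'] - g['predicted']) ** 2)),
}))
print(summary)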

Please note that these results were performed on a restricted data set, so they might be different on unexplored data points (such as other systems / extreme values of certain model parameters). For this specific training set, the R-squared is around 80% for NN and 90% for RF.

As we can see, and not surprisingly, the accuracy is consistently higher on the train set than on the test set, for both NN and RF. We also see that RF seems to perform much better than NN overall. The MAPE for RF is around 20% on the train set and 40% on the test set. The NN MAPE is surprisingly very high.

Let’s slice the MAPE (on test set) by the number of predicted seconds:

One important thing to keep in mind is that for some cases the time prediction is sensitive to the meta algo chosen (RF or NN). In our experience RF has performed very well within the data set input ranges, as shown above. However, for out of range points, NN might perform better, as suggested by the end of the above chart. This would explain why NN MAPE is quite high while the RMSE is decent: it performs poorly on small values.

As an example, if you try to predict the runtime of a kmeans with default parameters and with an input matrix of a few thousand lines, the RF meta algo will be precise because our training dataset contains similar data points. However, for predicting very specific parameters (for instance, a very high number of clusters), NN might perform better because it extrapolates from the training set, whereas RF doesn’t. NN performs worse on the above charts because these plots are only based on data close to the set of inputs of the training data.

However, as shown in this graph, the out of range values (thin lines) are extrapolated by the NN estimator, whereas the RF estimator predicts the output stepwise.

Now let’s look at the most important ‘meta’ features for the example of kmeans:

As we can see, only 6 features account for more than 80% of the model variance. Among them, the most important is a parameter of the scikit-learn kmeans class itself (number of clusters), but a lot of external factors have great influence on the runtime such as number of rows/columns and available memory.

Limitations

As mentioned earlier, the first limitation is related to the confidence intervals: they may be very wide, especially for NN, and for heavy models (that would take at least an hour).

Additionally, the NN might perform poorly on small to medium predictions. Sometimes, for small durations, the NN might even predict a negative duration, in which case we automatically switch back to RF.

Another limitation of the estimator arises when ‘special’ algo parameter values are used. For example, in a RandomForest scenario, when max_depth is set to None, the depth could take any value. This might result in a much longer fit time, which is more difficult for the meta algo to pick up, although we did our best to account for such cases.

When running estimator.time(algo, X, y), we do require the user to pass the actual X and y vectors, which may seem unnecessary, as we could simply request the shape of the data to estimate the training time. The reason for this is that we actually try to fit the model before predicting the runtime, in order to raise any instant errors. We run algo.fit(X, y) in a subprocess for one second to check for any fit errors, after which we move on to the prediction part. However, there are times when the algo (and/or the input matrix) is so big that running algo.fit(X, y) will eventually throw a memory error, which we can’t account for.

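A hedged sketch of that quick check (Scitime’s actual implementation may differ; this uses the standard library’s multiprocessing):

from multiprocessing import Process

def _try_fit(algo, X, y):
    # scikit-learn estimators accept y=None for unsupervised algos
    algo.fit(X, y)

def check_fit(algo, X, y, timeout=1.0):
    # run the fit in a subprocess for one second to surface instant errors
    p = Process(target=_try_fit, args=(algo, X, y))
    p.start()
    p.join(timeout)
    if p.is_alive():
        p.terminate()  # no instant error raised: move on to the prediction part
    elif p.exitcode != 0:
        raise ValueError('algo.fit raised an error during the quick check')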

Future improvements

The most effective and obvious way to improve the performance of our current predictions would be to generate more data points on different systems to better support a wide range of hardware/parameters.

We will be looking at adding more supported Scikit Learn algos in the near future. We could also implement other algos such as lightGBM or xgboost. Feel free to contact us if there’s an algorithm you would like us to implement in the next iterations of Scitime!

Other interesting avenues for improving the performance of the estimator would be to include more granular information about the input matrix, such as variance or correlation with the output. We currently generate data completely randomly, for which the fit time might be higher than for real-world datasets, so in some cases the estimator might overestimate the training time.

In addition, we could track finer hardware-specific information, such as CPU frequency or current CPU usage.

Ideally, as the algorithm might change from one scikit-learn version to another, and thus have an impact on the runtime, we would also account for it, for example by using the version as a ‘meta’ feature.

As we acquire more data to fit our meta algos, we might think of using more complex meta algos, such as sophisticated neural networks (using regularization techniques like dropout or batch normalization). We could even consider using tensorflow to fit the meta algo (and add it as optional): it would not only help us get better accuracy, but also build more robust confidence intervals using dropout.

Contributing to Scitime and sending us your feedback

First, any kind of feedback, especially on the performance of the predictions and on ideas to improve this process of generating data, is very much appreciated!

As discussed before, you can use our repo to generate your own data points in order to train your own meta algorithm. When doing so, you can help make Scitime better by sharing your data points found in the result csv (~/scitime/scitime/[algo]_results.csv) so that we can integrate it to our model.

To generate your own data you can run a command similar to this one (from the package repo source):

python _data.py --verbose 3 --algo KMeans --drop_rate 0.99

Note: if run directly using the code source (with the Model class), do not forget to set write_csv to True, otherwise the generated data points will not be saved.

We use GitHub issues to track all bugs and feature requests. Feel free to open an issue if you have found a bug or wish to see a new feature implemented. More info about how to contribute can be found in the Scitime repo.

For issues with training time predictions, when submitting feedback, including the full dictionary of parameters you are fitting to your model might help, so that we can diagnose why the performance is subpar for your specific use case. To do so, simply set the verbose parameter to 3 and copy-paste the log of the parameter dict into the issue description.

Find the code source

Find the documentation

Credits

  • Gabriel Lerner & Nathan Toubiana are the main contributors of this package and co-authors of this article

  • Special thanks to Philippe Mizrahi for helping along the way

  • Thanks for all the help we got from early reviews / beta testing

Source: https://www.freecodecamp.org/news/two-hours-later-and-still-running-how-to-keep-your-sklearn-fit-under-control-cc603dc1283b/
