Heart disease classifier tuning using Optuna and mlflow

Background

Data science should be an enjoyable process focused on delivering insights and real benefits. However, that enjoyment can sometimes get lost in tools and processes. Nowadays an applied data scientist needs to be comfortable with tools and software for data/code versioning, reproducible model development, experiment tracking, and model inspection, to name a few!

The purpose of this article is to briefly touch upon reproducible model development and experiment tracking through a worked example for tuning a Random Forest classifier using Optuna and mlflow.

The heart disease data and preprocessing

This example uses the heart disease data available in the UCI ML archive. In previous work (developing a model for heart disease prediction using pycaret) I describe some of the processing done to create a semi-cleaned dataset. In the .csv file provided for this short worked example, I have additionally converted any categorical feature with two values into a single binary feature, where for feature_name_x the _x suffix indicates the level represented by a value of 1. The target to predict is heart disease (hd_yes), and we want to tune a Random Forest classifier to do so.

After loading the data and creating the feature matrix and target vector, I created a preprocessing pipeline. For numeric features, missing values were imputed using the median and the features were then scaled. For binary features, missing values were imputed using the mode. For the three-level categorical feature chest pain, missing values were treated as a separate category and leave-one-out target encoding was applied. Finally, these steps were wrapped up into a single preprocessing pipeline.

Using Optuna and mlflow

With our feature matrix, target vector and preprocessing pipeline ready to go, we can now tune a Random Forest classifier to predict heart disease. Note that for the purpose of this demonstration I am going to forgo the use of a hold-out test set. To do the hyper-parameter optimization (model development) we will use Optuna, and for experiment tracking we will use one of its newer features: mlflow integration (see: mlflow callback in Optuna and mlflow tracking).

First, we define the objective function to optimize with Optuna (Figure 1), which has three main components:

Defining the space of hyper-parameters and values to optimize: in our case we focus on six Random Forest hyper-parameters (n_estimators, max_depth, max_features, bootstrap, min_samples_leaf, min_samples_split)
The Machine Learning pipeline: we have a preprocessing pipeline plus a Random Forest classifier
The calculation of the metric to optimize: we use the 5-fold CV average ROC AUC

Figure 1. The Optuna objective function

Next, we define a callback to mlflow, one of the new (and experimental!) features in Optuna 2.0 in order to track our classifier hyper-parameter tuning (Figure 2).

Figure 2. Setting the mlflow callback to track an Optuna study

Then we instantiate the Optuna study object to maximize our chosen metric (ROC AUC) (Figure 3). We also apply Hyperband pruning, which helps find optima in less time by stopping unpromising trials early. Note that this is typically of more benefit for more intensive and difficult optimization problems, such as in deep learning.

Figure 3. Creating an Optuna study

Finally, we run the hyper-parameter tuning using 200 trials (Figure 4). It may take a couple of minutes to run.

Figure 4. Running the hyper-parameter optimization using Optuna

The mlflow logged experiment, including the assessed hyper-parameter configurations for the Random Forest classifier (Optuna study/trials), is stored in a folder called mlruns. To view these results using the mlflow user interface, do the following:

Open a shell (for example, Windows PowerShell) in the directory containing the mlruns folder
Activate the virtual environment used for this notebook, i.e., conda activate optuna_env
Run mlflow ui
Navigate to the localhost address provided, something like: http://kubernetes.docker.internal:5000/

You should see Figure 5, and you can sift through the experiment trials. I deselected the user, source and version columns, as well as the tags section to simplify the view. I also sorted the results by ROC AUC. The top hyper-parameter configurations provided a ROC AUC of 0.917, where max_depth was in the range 14 to 16, max_features was 4, min_samples_leaf was 0.1, min_samples_split was 0.2, and n_estimators was 340 (note these results may differ slightly for you if you re-run the notebook).

Figure 5. mlflow summary for Optuna HPO of our heart disease Random Forest classifier

Optuna also has great visualizations for summarizing experiments. Here we look at hyper-parameter importance and the optimization history (Figure 6). We can see that the min_samples_leaf is the most important hyper-parameter to tune in this example. The hyper-parameter importance is calculated using a functional ANOVA approach. For more information see Hutter, Hoos and Leyton-Brown 2014.


Summary

Optuna has come a long way since its inception, and version 2.0, released in July this year, has some fantastic additions, one of which we touched upon in this short article: the mlflow callback. Perhaps as expected, the tuning worked well, and the summaries available via the built-in visualizations in Optuna and the mlflow UI made this a lot of fun.

The Jupyter notebook, virtual environment and data used for this article are available at my GitHub. As always, comments, thoughts, feedback and discussion are very welcome.

Translated from: https://medium.com/@jasonpben/heart-disease-classifier-tuning-using-optuna-and-mlflow-fc1366eefdec
