How to NOT deploy Keras/TensorFlow models

While most articles about deep learning focus on the modeling part, there are only a few about how to deploy such models to production. Some of them say “production”, but they often simply use the un-optimized model and embed it into a Flask web server. In this post, I will explain why this approach does not scale well and wastes resources.

The “production” approach

If you search for how to deploy TensorFlow, Keras or PyTorch models into production, there are a lot of good tutorials, but sometimes you come across very simple examples claiming to be production ready. These examples often use the raw Keras model, a Flask web server, and containerize it into a docker container. They use Python to serve predictions. The code for these “production” Flask web servers looks like this:

from flask import Flask, jsonify, request
from tensorflow import keras

app = Flask(__name__)
model = keras.models.load_model("model.h5")  # model loaded once at startup

@app.route("/", methods=["POST"])
def index():
    data = request.json
    # preprocess() is assumed to be defined elsewhere in these tutorials
    prediction = model.predict(preprocess(data))
    return jsonify({"prediction": str(prediction)})

Furthermore, they often show how to containerize the Flask server and bundle it together with your model into a docker image. These approaches also claim that they can easily scale by increasing the number of docker instances.

Now let us recap what happens here and why it is not “production” grade.

Not optimizing models

First, the model is usually used as it is, which means the Keras model from the example was simply exported with model.save(). Such a model includes all the parameters and gradients that are necessary to train the model but not required for inference. Also, the model is neither pruned nor quantized. As a result, un-optimized models have higher latency, need more compute and are larger in file size.

Example with EfficientNet-B5:

  • h5 Keras model: 454 MByte
  • Optimized TensorFlow model (no quantization): 222 MByte

Using Flask and the Python API

The next problem is that plain Python and Flask are used to load the model and serve predictions. There are a lot of problems here.

First, let’s look at the worst thing you can do: loading the model for each request. In the code example above, the model is loaded when the script starts, but in other tutorials this part is moved into the predict function. That means the model is loaded every single time you make a prediction. Please do not do that.

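To make this concrete, here is a minimal sketch of the anti-pattern (the route name and the preprocess() helper are placeholders, not code from any specific tutorial):

from flask import Flask, jsonify, request
from tensorflow import keras

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    # Anti-pattern: the model is re-loaded from disk on every request,
    # adding seconds of latency before the prediction even starts.
    model = keras.models.load_model("model.h5")
    prediction = model.predict(preprocess(request.json))
    return jsonify({"prediction": str(prediction)})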

That being said, let’s look at Flask. Flask includes a powerful and easy-to-use web server for development. On the official website, you can read the following:

While lightweight and easy to use, Flask’s built-in server is not suitable for production as it doesn’t scale well.

That said, you can use Flask as a WSGI app in, e.g., Google App Engine. However, many tutorials are not using Google App Engine or NGINX; they just use Flask as it is and put it into a docker container. But even when they use NGINX or any other web server, they usually turn off multi-threading completely.

Let’s look a bit deeper into the problem here. If you use TensorFlow, it handles compute resources (CPU, GPU) for you. If you load a model and call predict, TensorFlow uses the compute resources to make these predictions. While this happens, the resource is in use, aka locked. When your web server only serves one single request at a time, you are fine, as the model was loaded in this thread and predict is called from this thread. But once you allow more than one request at a time, your web server stops working, because you simply cannot access a TensorFlow model from different threads. That being said, in this setup you cannot process more than one request at once. Doesn’t really sound scalable, right?

Example:

  • Flask development web server: 1 simultaneous request
  • TensorFlowX model server: parallelism configurable

Scaling “low-load” instances with docker

Ok, the web server does not scale, but what about scaling the number of web servers? In a lot of examples this approach is presented as the solution to the scaling problem of a single instance. There is not much to say about it; it certainly works. But scaling this way wastes money, resources and energy. It’s like having a truck and putting in one single parcel, and once there are more parcels, you get another truck instead of using the existing truck smarter.

Example latency:

  • Flask serving as shown above: ~2s per image
  • TensorFlow model server (no batching, no GPU): ~250ms per image
  • TensorFlow model server (no batching, GPU): ~100ms per image

Not using GPUs/TPUs

GPUs made deep learning possible, as they can do operations massively in parallel. When docker containers are used to deploy deep learning models to production, most examples do NOT utilize GPUs; they don’t even use GPU instances. The prediction time for each request is magnitudes slower on CPU machines, so latency is a big problem. Even with powerful CPU instances you will not achieve results comparable to the small GPU instances.

Just a side note: In general it is possible to use GPUs in docker, if the host has the correct driver installed. Docker is completely fine for scaling up instances, but scale up the correct instances.

Example costs:

  • 2 CPU instances (16 cores, 32 GByte, a1.4xlarge): 0.816 $/h
  • 1 GPU instance (32 GByte RAM, 4 cores, Tesla M60, g3s.xlarge): 0.75 $/h

It’s already solved

As you can see, loading a trained model and putting it into a Flask docker container is not an elegant solution. If you want deep learning in production, start with the model, then think about servers and finally about scaling instances.

Optimize the model

Unfortunately, optimizing a model for inference is not as straightforward as it should be. However, it can easily reduce inference time by multiples, so it’s worth it without a doubt. The first step is freezing the weights and removing all the training overhead. This can be achieved with TensorFlow directly, but it requires you to convert your model into either an estimator or into a TensorFlow graph (SavedModel format) if you come from a Keras model. TensorFlow itself has a tutorial for this. To optimize further, the next step is to apply model pruning and quantization, where insignificant weights are removed and the model size is reduced.

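A minimal sketch of that first step (an illustration only, assuming TensorFlow 2.x; the file paths are placeholders): exporting the Keras model as a SavedModel with include_optimizer=False drops the optimizer state, and the TFLite converter shows one possible route to post-training quantization.

import tensorflow as tf

# Re-export the Keras training checkpoint as a SavedModel for inference/serving,
# dropping the optimizer state that is only needed for training.
model = tf.keras.models.load_model("model.h5")
model.save("exported_model/1", include_optimizer=False)  # numeric version sub-directory for serving

# Optional: post-training quantization with the TFLite converter.
converter = tf.lite.TFLiteConverter.from_saved_model("exported_model/1")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
with open("model_quantized.tflite", "wb") as f:
    f.write(converter.convert())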

Use model servers

When you have an optimized model, you can look at different model servers, which are meant for serving deep learning models in production. For TensorFlow and Keras, TFX (TensorFlow Extended) offers the TensorFlow model server. There are also others like TensorRT, Clipper, MLFlow and DeepDetect.

The TensorFlow model server offers several features: it serves multiple models at the same time while keeping the overhead to a minimum. It allows you to version your models and deploy a new version without downtime, while still being able to use the old version. In addition to the gRPC API, it also has an optional REST API endpoint. The throughput is magnitudes higher than with a Flask API, as the server is written in C++ and uses multi-threading. Additionally, you can even enable batching, where the server combines multiple single predictions into one batch for very-high-load settings. And finally, you can put it into a docker container and scale even more.

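For illustration, here is roughly what a client-side call against the REST endpoint looks like (a sketch, not from the original article: the model name my_model, port 8501 and the dummy input are assumptions, and a tensorflow_model_server, e.g. the tensorflow/serving docker image, is assumed to be running with the exported SavedModel):

import json

import requests  # third-party HTTP client, assumed to be installed

# Placeholder input: replace with data shaped like your model's input tensor.
payload = {"instances": [[1.0, 2.0, 3.0]]}

response = requests.post(
    "http://localhost:8501/v1/models/my_model:predict",
    data=json.dumps(payload),
    timeout=10,
)
print(response.json()["predictions"])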

Hint: tensorflow_model_server is available on every AWS EC2 Deep Learning AMI image; with TensorFlow 2 it’s called tensorflow2_model_server.

Use GPU instances

And lastly, I would recommend using GPUs or TPUs for inference environments. Latency is much lower and throughput much higher with such accelerators, while saving energy and money. Note that a GPU is only utilized if your software stack can use its power (optimized model + model server). In AWS you can look into Elastic Inference or just use a GPU instance with a Tesla M60 (g3s.xlarge).

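As a quick sanity check (a sketch, assuming TensorFlow 2.x) that the software stack actually sees the accelerator before you pay for a GPU instance:

import tensorflow as tf

# An empty list here means TensorFlow cannot see a GPU and predictions fall back to the CPU.
print(tf.config.list_physical_devices("GPU"))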

Originally posted on digital-thnking.de

Translated from: https://towardsdatascience.com/how-to-not-deploy-keras-tensorflow-models-4fa60b487682
