While most articles about deep learning focus on the modeling part, few cover how to deploy such models to production. Some of them say “production”, but they often simply take the un-optimized model and embed it into a Flask web server. In this post, I will explain why this approach does not scale well and wastes resources.
The “production” approach
If you search for how to deploy TensorFlow, Keras or PyTorch models into production, there are a lot of good tutorials, but sometimes you come across very simple examples that claim to be production ready. These examples often take the raw Keras model, wrap it in a Flask web server and containerize it into a Docker container, serving predictions from Python. The code for these “production” Flask web servers looks like this:
from flask import Flask, jsonify, request
from tensorflow import keras

app = Flask(__name__)
model = keras.models.load_model("model.h5")

@app.route("/", methods=["POST"])
def index():
    data = request.json
    # preprocess() is assumed to be defined elsewhere in these tutorials
    prediction = model.predict(preprocess(data))
    return jsonify({"prediction": str(prediction)})
Furthermore, they often show how to containerize the Flask server and bundle it with the model into a Docker image. These approaches also claim that they can easily scale by increasing the number of Docker instances.
Now let us recap what happens here and why it is not “production” grade.
Not optimizing models
First, the model is usually used as it is, meaning the Keras model from the example was simply exported with model.save(). The model still includes all the parameters and gradients that are necessary to train it but are not required for inference. Also, the model is neither pruned nor quantized. As a result, un-optimized models have higher latency, need more compute and are larger in file size.
Example with B5 EfficientNet:
- h5 Keras model: 454 MByte
- Optimized TensorFlow model (no quantization): 222 MByte
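Much of that difference is training-only state. As a minimal sketch (model.h5 being the hypothetical trained model from the Flask example above), you can drop the optimizer variables at export time and save a SavedModel for serving:

from tensorflow import keras

model = keras.models.load_model("model.h5")  # hypothetical trained Keras model

# Re-export without the optimizer state, which is only needed to continue
# training; for Adam-trained models this alone can roughly halve the file size.
model.save("model_inference.h5", include_optimizer=False)

# Export in the SavedModel format, which TensorFlow Serving expects
# (the "1" directory is the model version).
model.save("exported_model/1", save_format="tf")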
Using Flask and the Python API
The next problem is that plain Python and Flask are used to load the model and serve predictions. There are a lot of problems with that.
First, let’s look at the worst thing you can do: loading the model for each request. In the code example above, the model is loaded once when the script starts, but other tutorials move this part into the predict function, which means the model is loaded every single time you make a prediction. Please do not do that.
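To make the difference concrete, here is a minimal sketch of the anti-pattern next to the load-once variant (preprocess is a hypothetical helper, as in the example above):

from flask import Flask, jsonify, request
from tensorflow import keras

app = Flask(__name__)

# Anti-pattern: the model is (re)loaded on every request, adding seconds of
# disk and graph-building overhead each time.
@app.route("/slow", methods=["POST"])
def slow_predict():
    model = keras.models.load_model("model.h5")
    return jsonify({"prediction": str(model.predict(preprocess(request.json)))})

# Better: load the model once at startup and reuse it for every request.
model = keras.models.load_model("model.h5")

@app.route("/fast", methods=["POST"])
def fast_predict():
    return jsonify({"prediction": str(model.predict(preprocess(request.json)))})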
That being said, let’s look at Flask. Flask ships with an easy-to-use built-in web server meant for development. On the official website, you can read the following:
While lightweight and easy to use, Flask’s built-in server is not suitable for production as it doesn’t scale well.
That said, you can run Flask as a WSGI app, e.g. on Google App Engine. However, many tutorials do not use Google App Engine or NGINX; they just use Flask as it is and put it into a Docker container. And even when they do use NGINX or another web server, they usually turn off multi-threading completely.
Let’s look a bit deeper into the problem here. If you use TensorFlow, it handles the compute resources (CPU, GPU) for you. When you load a model and call predict, TensorFlow uses those compute resources to make the prediction. While this happens, the resource is in use, i.e. locked. When your web server serves only a single request at a time, you are fine: the model was loaded in this thread and predict is called from the same thread. But once you allow more than one request at a time, the web server stops working, because you simply cannot access a TensorFlow model from different threads like that. So in this setup you cannot process more than one request at once. Doesn’t really sound scalable, right?
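One workaround sometimes seen is to guard the shared model with a lock. That makes concurrent access safe, but not parallel: requests are still handled one prediction at a time (a sketch, again reusing the hypothetical preprocess helper):

import threading

from flask import Flask, jsonify, request
from tensorflow import keras

app = Flask(__name__)
model = keras.models.load_model("model.h5")
predict_lock = threading.Lock()  # serialize access to the shared model

@app.route("/", methods=["POST"])
def index():
    with predict_lock:  # only one thread may run predict at a time
        prediction = model.predict(preprocess(request.json))
    return jsonify({"prediction": str(prediction)})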
Example:
- Flask development web server: 1 simultaneous request
- TensorFlow Model Server: parallelism configurable
Scaling “low-load” instances with Docker
OK, the web server does not scale, but what about scaling the number of web servers? In a lot of examples this approach is the answer to the scaling problem of single instances. There is not much to say about it; it certainly works. But scaling this way wastes money, resources and energy. It’s like having a truck and putting a single parcel in it, and once there are more parcels, you get another truck, instead of using the existing truck more cleverly.
Example latency:
- Flask serving as shown above: ~2 s per image
- TensorFlow model server (no batching, no GPU): ~250 ms per image
- TensorFlow model server (no batching, GPU): ~100 ms per image
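Such per-image numbers are easy to reproduce with a small client-side loop using the requests library (a sketch, assuming the Flask endpoint from above is listening on localhost:5000 and expects a hypothetical JSON payload):

import time

import requests

payload = {"image": "..."}  # hypothetical payload, whatever the endpoint expects

latencies = []
for _ in range(20):
    start = time.perf_counter()
    requests.post("http://localhost:5000/", json=payload)
    latencies.append(time.perf_counter() - start)

print(f"mean latency: {1000 * sum(latencies) / len(latencies):.0f} ms")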
Not using GPUs/TPUs
GPUs made deep learning possible, as they can do operations massively in parallel. When Docker containers are used to deploy deep learning models to production, most examples do NOT utilize GPUs; they don’t even use GPU instances. The prediction time for each request is magnitudes slower on CPU machines, so latency becomes a big problem. Even with powerful CPU instances you will not achieve results comparable to a small GPU instance.
Just a side note: in general it is possible to use GPUs in Docker, as long as the host has the correct drivers installed. Docker is completely fine for scaling up instances, but scale up the correct instances.
Example costs:
- 2 CPU instances (16 cores, 32 GByte RAM, a1.4xlarge): 0.816 $/h
- 1 GPU instance (32 GByte RAM, 4 cores, Tesla M60, g3s.xlarge): 0.75 $/h
It’s already solved
As you can see, loading a trained model and putting it into a Flask Docker container is not an elegant solution. If you want deep learning in production, start with the model, then think about servers, and finally about scaling instances.
Optimize the model
Unfortunately, optimizing a model for inference is not as straightforward as it should be. However, it can easily reduce inference time by multiples, so it is worth it without a doubt. The first step is freezing the weights and removing all the training overhead. This can be done with TensorFlow directly, but requires you to convert your model into either an estimator or a TensorFlow graph (SavedModel format) if you come from a Keras model. TensorFlow itself has a tutorial for this. To optimize further, the next step is to apply model pruning and quantization, where insignificant weights are removed and the model size is reduced.
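As a rough sketch of the quantization step, here is one option from the TensorFlow Model Optimization tooling: post-training dynamic-range quantization via the TFLite converter (exported_model/1 is the hypothetical SavedModel from the export sketch above):

import tensorflow as tf

# Load the SavedModel that was exported for inference.
converter = tf.lite.TFLiteConverter.from_saved_model("exported_model/1")

# Post-training dynamic-range quantization: weights are stored as 8-bit
# integers, which shrinks the model and can speed up CPU inference.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("model_quantized.tflite", "wb") as f:
    f.write(tflite_model)

Note that a .tflite file targets the TFLite runtime rather than TensorFlow Serving; for the serving path described next, pruning plus the SavedModel export shown earlier is the more direct route.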
Use model servers
When you have an optimized model, you can look at the different model servers that are meant for serving deep learning models in production. For TensorFlow and Keras, TFX offers the TensorFlow model server (TensorFlow Serving). There are also others like TensorRT, Clipper, MLflow and DeepDetect.
The TensorFlow model server offers several features. It serves multiple models at the same time while keeping the overhead to a minimum. It allows you to version your models and deploy a new version without downtime, while still being able to serve the old version. In addition to the gRPC API, it also has an optional REST API endpoint. The throughput is magnitudes higher than with a Flask API, as it is written in C++ and uses multi-threading. Additionally, you can even enable batching, where the server groups multiple single predictions into one batch for very high load settings. And finally, you can put it into a Docker container and scale even further.
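For example, a prediction request against the REST endpoint could look like this (a sketch, assuming the server was started with --rest_api_port=8501 and serves a hypothetical model named my_model):

import requests

# TensorFlow Serving REST API: POST /v1/models/<name>:predict
# "instances" holds a batch of inputs; shapes must match the model signature.
payload = {"instances": [[0.1, 0.2, 0.3, 0.4]]}  # hypothetical input shape

response = requests.post(
    "http://localhost:8501/v1/models/my_model:predict",
    json=payload,
)
print(response.json()["predictions"])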
Hint: tensorflow_model_server is available on every AWS EC2 Deep Learning AMI image; with TensorFlow 2 it is called tensorflow2_model_server.
Use GPU instances
And lastly, I would recommend using GPUs or TPUs for inference environments. Latency and throughput are much better with such accelerators, while saving energy and money. Note that the accelerator is only utilized if your software stack can make use of it (optimized model + model server). In AWS you can look into Elastic Inference or just use a GPU instance with a Tesla M60 (g3s.xlarge).
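A quick sanity check that TensorFlow actually sees the accelerator (a sketch for TensorFlow 2):

import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
if gpus:
    print(f"{len(gpus)} GPU(s) visible to TensorFlow:", gpus)
else:
    print("No GPU visible - predictions will fall back to the CPU.")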
Originally posted on digital-thnking.de