After training a model, we need to use the saved model for online prediction, which is what the model Serving service does. The TensorFlow team provides TensorFlow Serving specifically for model prediction. It is designed for production environments and offers high performance and great flexibility. In this article, we discuss TensorFlow Serving in detail, from service construction to service configuration.

Serving Service Setup

Docker is officially recommended as the easiest and most straightforward way to run TensorFlow Serving, so this article also adopts this approach.

To set up the Serving service, first make sure Docker is installed on the system, then run docker pull tensorflow/serving:version to pull the specified version of the TensorFlow Serving image. You can then use this image to start the Serving service.

It is important to note that the version of TensorFlow Serving should ideally match the version of TensorFlow used during training, to reduce unnecessary compatibility issues. You can query all available image tags on the Docker Hub website, such as latest, a specific version (for example 2.2.0), and nightly.

If you want to use a GPU for the model prediction service, you need to pull the GPU version of the TensorFlow Serving image (docker pull tensorflow/serving:version-gpu) and configure the local GPU environment.

Single model Serving service

Once we’ve trained and saved a model, it’s easy to launch a Serving service for testing and verification.

First, place the model files saved in SavedModel format in the local /path/to/tfs/models/model/version directory. The directory structure is as follows:

model
└── 0
    ├── assets
    ├── saved_model.pb
    └── variables
        ├── variables.data-00000-of-00001
        └── variables.index

Here model is the name of the model to be loaded and 0 is an available version of the model; there may be a single version or multiple versions. The files in the version directory are in the SavedModel format; their contents were described in the article on saving models and are not repeated here.

You can then start a Serving service using the following command:

docker run -d -p 8501:8501 \
    -v /path/to/tfs/models/model:/models/model \
    tensorflow/serving:2.2.0

The Serving service loads only one model, named model, with version 0. The container running the Serving service exposes port 8501, which receives HTTP requests.

We can specify the model storage path at container startup by setting the environment variable MODEL_BASE_PATH inside the container; it defaults to /models, meaning the model is loaded from the /models path within the container. Similarly, the name of the model to load can be specified through the MODEL_NAME environment variable, which defaults to model. The Serving service startup command based on these environment variables looks like this:

docker run -d -p 8501:8501 \
    -v /path/to/tfs/models/model:/models/model \
    -e MODEL_BASE_PATH=/models \
    -e MODEL_NAME=model \
    tensorflow/serving:2.2.0

In addition to setting environment variables in the container, we can also specify the model storage path and model name using the Serving service’s startup parameters --model_base_path and --model_name, respectively. The startup command is as follows:

docker run -d -p 8501:8501 \
    -v /path/to/tfs/models/model:/models/model \
    tensorflow/serving:2.2.0 \
    --model_base_path=/models \
    --model_name=model

After the Serving service starts successfully, you can check whether the model is loaded by accessing a specific HTTP interface. An example request and response is shown below:

$ curl http://localhost:8501/v1/models/model
{
    "model_version_status": [
        {
            "version": "0",
            "state": "AVAILABLE",
            "status": {
                "error_code": "OK",
                "error_message": ""
            }
        }
    ]
}

A model state of AVAILABLE indicates that the corresponding version of the model has been loaded successfully and can provide Serving. If the interface returns an error or returns nothing, there is a problem with model loading; you can locate the problem from the returned error information or by viewing the container logs with docker logs ContainerID.

Note that if there are multiple versions of the model under the model path, the Serving service loads the one with the largest version number by default, meaning that only one version of a model provides Serving at any one time.
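Besides curl, the status check can also be done programmatically. Below is a minimal sketch using the Python requests library (an illustrative choice, not something the Serving service requires), assuming the single-model service above is listening on localhost:8501:

import requests

def model_is_available(name='model', host='http://localhost:8501'):
    # Query the model status interface described above and check whether
    # any version of the model is in the AVAILABLE state.
    resp = requests.get(f'{host}/v1/models/{name}')
    resp.raise_for_status()
    statuses = resp.json().get('model_version_status', [])
    return any(s.get('state') == 'AVAILABLE' for s in statuses)

if __name__ == '__main__':
    print(model_is_available())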

Multi-model Serving service

Generally speaking, multiple models or multiple versions of a model are deployed simultaneously in an online Serving service for A/B testing, to compare the actual performance of the different models. The single-model Serving service described in the previous section cannot meet this requirement, because both the model name and the loaded version are fixed; to achieve it, you can start the Serving service with a configuration file (models.config).

The contents of the model configuration file follow the ModelServerConfig structure, a data structure (message) defined in the model_server_config.proto file; refer to the source code for its full definition. A basic example of this configuration file is as follows:

model_config_list {
    config {
        name: 'first_model'
        base_path: '/models/first_model'
        model_platform: 'tensorflow'
    }
    config {
        name: 'second_model'
        base_path: '/models/second_model'
        model_platform: 'tensorflow'
    }
}

Each config entry in the configuration file corresponds to a model to be loaded, where base_path is the loading path of the model and name is the name of the model. If there are multiple versions of the model under this path, TensorFlow Serving serves the one with the largest version number.

To start a multi-model TensorFlow Serving service, place the above configuration file in the /path/to/tfs/config directory and use the following command:

docker run -d -p 8501:8501 \
    -v /path/to/tfs/models:/models \
    -v /path/to/tfs/config:/config \
    tensorflow/serving:2.2.0 \
    --model_config_file=/config/models.config

Similarly, you can use the HTTP interface to view the loading information of the various models after the Serving service is started:

$ curl http://localhost:8501/v1/models/first_model
$ curl http://localhost:8501/v1/models/second_model
{
    "model_version_status": [
        {
            "version": "0",
            "state": "AVAILABLE",
            "status": {
                "error_code": "OK",
                "error_message": ""
            }
        }
    ]
}

At this point we can use multiple models to perform online Serving simultaneously.

Serving Service configuration

The model configuration

The models.config configuration file can specify the names and loading paths of multiple models, as well as each model’s version policy, version labels, and logging mode, which is very convenient.

For a given model path, the Serving service loads the model with the largest version number by default; we can specify which version to load by modifying the model_version_policy configuration item in the configuration file. If we want to serve version N of the model, we first set the model_version_policy policy to specific and then provide the version number N inside specific. The model configuration with a specified version number is as follows:

config {
    name: 'first_model'
    base_path: '/models/first_model'
    model_platform: 'tensorflow'
    model_version_policy {
        specific {
            versions: 0
        }
    }
}

This configuration option is useful when there is a problem with the latest version of the model Serving and you need to roll back to a known better version.

If you need to serve multiple versions of the model at the same time, the setup is the same, except that you provide multiple version numbers inside specific. The model configuration is as follows:

config {
    name: 'first_model'
    base_path: '/models/first_model'
    model_platform: 'tensorflow'
    model_version_policy {
        specific {
            versions: 0
            versions: 1
        }
    }
}

Sometimes it is helpful to assign an alias or label to a model version number so that the client does not have to know the specific version number and can simply access the corresponding label. For example, we can set up a stable label pointing to some version of the model, and the client only needs to access the stable label to get the latest Serving service. When a new stable version is released, we only change the version number that the stable label points to; the client is unaware of the change and does not need to be modified. Note that labels can only be used with the gRPC interface; the HTTP interface does not support them.

The model configuration with the label is as follows:

config {
    name: 'first_model'
    base_path: '/models/first_model'
    model_platform: 'tensorflow'
    model_version_policy {
        specific {
            versions: 0
            versions: 1
        }
    }
    version_labels {
        key: 'stable'
        value: 1
    }
    version_labels {
        key: 'canary'
        value: 0
    }
}

If the model needs to be rolled back, you only need to change the version number corresponding to the corresponding label in the configuration file.

Note that a label can be assigned to a model version only if that version has already been loaded and is in the AVAILABLE state. If you need to assign a label to a model version that has not been loaded yet (for example, when both the model version and the label are provided in the configuration file at Serving startup), set the --allow_version_labels_for_unavailable_models startup parameter to true.

The above startup parameter only applies to assigning a version number to a new label. If you want to reassign a label that is already in use, you must assign it to an already-loaded model version, to avoid an online failure caused by requests sent to that label becoming invalid during the version switch. Therefore, if you want to move the label stable from version N to version N + 1, you must first commit a configuration containing both version N and version N + 1, and after version N + 1 is loaded successfully, commit the configuration that points the label stable to version N + 1 to complete the label update.

Monitoring configuration

When Serving online, we need to monitor the Serving service’s metrics so that we can discover problems and make adjustments in time.

We can use the --monitoring_config_file parameter to enable metric collection for the Serving service. This parameter specifies a configuration file whose contents follow the MonitoringConfig data structure (message) defined in the monitoring_config.proto file; its definition can be found in the source code. The content of the configuration file is as follows:

prometheus_config {
    enable: true,
    path: "/metrics"
}

Here enable indicates whether to enable monitoring, and path indicates the URI of the monitoring metrics. Save the above configuration to a monitor.config file in the /path/to/tfs/config directory, and then start a Serving service with monitoring enabled using the following command:

docker run -d -p 8501:8501 \
    -v /path/to/tfs/models:/models \
    -v /path/to/tfs/config:/config \
    tensorflow/serving:2.2.0 \
    --model_config_file=/config/models.config \
    --monitoring_config_file=/config/monitor.config

Visit http://localhost:8501/metrics to see the real-time monitoring metrics of the TensorFlow Serving service. Some of the metrics look like this:

# TYPE :tensorflow:api:op:using_fake_quantization gauge
# TYPE :tensorflow:cc:saved_model:load_attempt_count counter
:tensorflow:cc:saved_model:load_attempt_count{model_path="/models/first_model/0",status="success"} 1
:tensorflow:cc:saved_model:load_attempt_count{model_path="/models/first_model/1",status="success"} 1
:tensorflow:cc:saved_model:load_attempt_count{model_path="/models/second_model/1",status="success"} 1
# TYPE :tensorflow:cc:saved_model:load_latency counter
:tensorflow:cc:saved_model:load_latency{model_path="/models/first_model/0"} 61678
:tensorflow:cc:saved_model:load_latency{model_path="/models/first_model/1"} 59848
:tensorflow:cc:saved_model:load_latency{model_path="/models/second_model/1"} 95868
# TYPE :tensorflow:cc:saved_model:load_latency_by_stage histogram
:tensorflow:cc:saved_model:load_latency_by_stage_bucket{model_path="/models/first_model/0",stage="init_graph",le="10"} 0

To monitor these metrics more intuitively, they can be collected with Prometheus and visualized with Grafana; the visualization of some of the metrics is shown in the figure below.
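If you just want to spot-check a few metrics without setting up Prometheus, the endpoint can also be scraped directly. A small sketch using the Python requests library (assuming the monitoring-enabled service above is running on localhost:8501):

import requests

# Fetch the Prometheus-format metrics text and print only the model load
# counters shown in the sample output above.
resp = requests.get('http://localhost:8501/metrics')
for line in resp.text.splitlines():
    if line.startswith(':tensorflow:cc:saved_model:load_attempt_count'):
        print(line)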

Batch configuration

The Serving service can batch client requests to achieve better throughput. Batching is scheduled globally across all models and model versions loaded by the Serving service, to ensure maximum utilization of the underlying computing resources. Batching is enabled with the --enable_batching parameter, and batching-related parameters are configured through a batch parameter configuration file (batching.config) specified with the --batching_parameters_file startup parameter.

The contents of the batch parameter configuration file are as follows:

max_batch_size { value: 128 }
batch_timeout_micros { value: 0 }
num_batch_threads { value: 8 }
max_enqueued_batches { value: 8 }

The max_batch_size parameter controls the tradeoff between throughput and latency, and also prevents requests from failing because a batch is too large and exceeds resource limits. batch_timeout_micros is the maximum amount of time to wait before executing a batch (even if max_batch_size has not been reached) and is mainly used to control tail latency. num_batch_threads is the maximum number of batches processed concurrently. max_enqueued_batches is the number of batch tasks that can be placed in the scheduler queue, to limit queuing delay.

Place the batching.config file in the /path/to/tfs/config directory and then start a Serving service with batching enabled using the following command:

docker run -d -p 8501:8501 \
    -v /path/to/tfs/models:/models \
    -v /path/to/tfs/config:/config \
    tensorflow/serving:2.2.0 \
    --model_config_file=/config/models.config \
    --enable_batching=true \
    --batching_parameters_file=/config/batching.config

When adjusting the max_batch_size and max_enqueued_batches parameters, do not set them too large, to avoid memory overflow that could affect online Serving.

Service startup parameters

In addition to some of the Serving service startup parameters described above, there are some other startup parameters you need to know about, which you can use to fine-tune the deployed Serving service to achieve the desired effect. Here is a brief description of some of the more important startup parameters:

  1. --port: listening port of the gRPC service. The default is port 8500.
  2. --rest_api_port: listening port of the HTTP service. The default is port 8501.
  3. --rest_api_timeout_in_ms: timeout for HTTP requests, in milliseconds. The default is 30000 milliseconds.
  4. --model_config_file: path to the model configuration file.
  5. --model_config_file_poll_wait_seconds: interval, in seconds, at which the model configuration file is periodically reloaded. The default is 0, meaning the Serving service does not reload the configuration after startup.
  6. --file_system_poll_wait_seconds: interval, in seconds, at which the file system is polled for new model versions. The default is 1 second.
  7. --enable_batching: bool, whether to enable request batching. The default is false.
  8. --batching_parameters_file: path to the batch parameter configuration file.
  9. --monitoring_config_file: path to the monitoring configuration file.
  10. --allow_version_labels_for_unavailable_models: bool, whether labels can be assigned to models that have not yet been loaded. The default is false.
  11. --model_name: model name, used for single-model Serving.
  12. --model_base_path: model load path, used for single-model Serving.
  13. --enable_model_warmup: bool, whether to warm up the model with data to reduce the response latency of the first request. The default is false. Because the TensorFlow runtime has lazily initialized components, the first request sent to a model after it is loaded can have a very long response time, orders of magnitude higher than that of a single ordinary request. To reduce the impact of lazy initialization on request latency, a set of sample requests can be provided at model load time to trigger the initialization of these components; this process is called model warmup (see the sketch after this list).
  14. For more startup parameters, see the source code.
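For model warmup (parameter 13 above), TensorFlow Serving reads warmup requests from an assets.extra/tf_serving_warmup_requests file inside the model version directory. Below is a minimal sketch of generating such a file for the single-model example in this article; the model name model, the input key input_1, and the shape (1, 31) are assumptions carried over from the earlier examples:

import numpy as np
import tensorflow as tf

from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_log_pb2

def write_warmup_requests(version_dir='/path/to/tfs/models/model/0'):
    # Warmup records live in assets.extra/ inside the model version directory.
    warmup_dir = f'{version_dir}/assets.extra'
    tf.io.gfile.makedirs(warmup_dir)
    with tf.io.TFRecordWriter(f'{warmup_dir}/tf_serving_warmup_requests') as writer:
        request = predict_pb2.PredictRequest()
        request.model_spec.name = 'model'
        request.model_spec.signature_name = 'serving_default'
        request.inputs['input_1'].CopyFrom(
            tf.make_tensor_proto(np.ones((1, 31), dtype=np.int64)))
        # Wrap the request in a PredictionLog record, which is the format
        # the warmup reader expects.
        log = prediction_log_pb2.PredictionLog(
            predict_log=prediction_log_pb2.PredictLog(request=request))
        writer.write(log.SerializeToString())

if __name__ == '__main__':
    write_warmup_requests()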

Complete sample

A fully configured multi-model Serving starts as follows:

docker run -d -p 8500:8500 -p 8501:8501 \
    -v /path/to/tfs/models:/models \
    -v /path/to/tfs/config:/config \
    tensorflow/serving:2.2.0 \
    --model_config_file=/config/models.config \
    --monitoring_config_file=/config/monitor.config \
    --enable_batching=true \
    --batching_parameters_file=/config/batching.config \
    --model_config_file_poll_wait_seconds=10 \
    --file_system_poll_wait_seconds=10 \
    --allow_version_labels_for_unavailable_models=true

HTTP Interface Access

We can query model information, including model loading status, model metadata, and model predictions, by accessing the HTTP interface of the Serving service.

The body of all requests and responses is in JSON format. For invalid requests, the Serving service returns an error message in the following format:

{
  "error": "error message string"
}

Model state

Visit http://host:port/v1/models/${MODEL_NAME}[/versions/${MODEL_VERSION}] to get the status information of the specified model version. The /versions/${MODEL_VERSION} part is optional; if it is not specified, the status information of all versions of the model is returned. An example request and response is shown below:

$ curl http://localhost:8501/v1/models/first_model/versions/0
{
    "model_version_status": [
        {
            "version": "0",
            "state": "AVAILABLE",
            "status": {
                "error_code": "OK",
                "error_message": ""
            }
        }
    ]
}

Model Metadata

Visit http://host:port/v1/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/metadata to get the metadata of the specified model version. The /versions/${MODEL_VERSION} part is optional; if it is not specified, the metadata of the latest version of the model is returned. An example request and response is shown below:

$ curl http://localhost:8501/v1/models/first_model/versions/0/metadata
{
    "model_spec": {
        "name": "first_model",
        "signature_name": "",
        "version": "0"
    },
    "metadata": {
        "signature_def": {
            "signature_def": {
                "serving_default": {
                    "inputs": {
                        "input_1": {
                            "dtype": "DT_INT64",
                            "tensor_shape": {
                                "dim": [
                                    {
                                        "size": "-1",
                                        "name": ""
                                    },
                                    {
                                        "size": "31",
                                        "name": ""
                                    }
                                ],
                                "unknown_rank": false
                            },
                            "name": "serving_default_input_1:0"
                        }
                    },
                    "outputs": {
                        "output_1": {
                            "dtype": "DT_FLOAT",
                            "tensor_shape": {
                                "dim": [
                                    {
                                        "size": "-1",
                                        "name": ""
                                    },
                                    {
                                        "size": "1",
                                        "name": ""
                                    }
                                ],
                                "unknown_rank": false
                            },
                            "name": "StatefulPartitionedCall:0"
                        }
                    },
                    "method_name": "tensorflow/serving/predict"
                },
                "__saved_model_init_op": {
                    "inputs": {},
                    "outputs": {
                        "__saved_model_init_op": {
                            "dtype": "DT_INVALID",
                            "tensor_shape": {
                                "dim": [],
                                "unknown_rank": true
                            },
                            "name": "NoOp"
                        }
                    },
                    "method_name": ""
                }
            }
        }
    }
}

The metadata contains the data types and shapes of the model’s input tensors, as well as the name (key) of the input tensor, input_1. This is very important when later feeding data to the model for prediction: the input_key and the format of the input data must match what is defined in the metadata.
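Since the input names and shapes must match the metadata, it can be handy to read them programmatically before constructing requests. A small sketch using the Python requests library (assuming the first_model service above on localhost:8501); the helper name is illustrative:

import requests

def get_input_shapes(model='first_model', version='0',
                     host='http://localhost:8501'):
    # Read the serving_default signature from the metadata interface and
    # return a mapping of input_key -> tensor shape.
    url = f'{host}/v1/models/{model}/versions/{version}/metadata'
    meta = requests.get(url).json()
    signature = meta['metadata']['signature_def']['signature_def']['serving_default']
    return {key: [dim['size'] for dim in spec['tensor_shape']['dim']]
            for key, spec in signature['inputs'].items()}

if __name__ == '__main__':
    print(get_input_shapes())  # e.g. {'input_1': ['-1', '31']}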

Model prediction

Visit http://host:port/v1/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]:predict to get predictions for the given input data, that is, given the model’s input data, get the model’s predicted output. The /versions/${MODEL_VERSION} part is optional; if it is not specified, the latest version of the model is used by default.

To access this interface, you need to provide a JSON-formatted request body that looks like this:

{
  // If unspecified, the default serving signature is used.
  "signature_name": "string",

  // Input Tensors in row ("instances") or columnar ("inputs") format.
  // A request can have either of them but NOT both.
  "instances": <value>|<(nested)list>|<list-of-objects>
  "inputs": <value>|<(nested)list>|<object>
}

signature_name indicates the signature of the model; if it is not specified, the default value serving_default is used. instances and inputs represent the inputs to the model; exactly one of them must be present in the request body, followed by input data in the corresponding format.

instances represents the input data of the model in row format; the JSON request looks like this:

{
  "instances": [
    {"input_1": [1, 1]}
  ]
}

input_key can also be omitted when the model has only one named input; the JSON request then looks like this:

{
  "instances": [
    [1, 1]
  ]
}

An example of making a prediction request with curl in this format is shown below:

$ curl -d '{"instances": [{"input_1": [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]}]}' -X POST http://localhost:8501/v1/models/first_model:predict
$ curl -d '{"instances": [[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]]}' -X POST http://localhost:8501/v1/models/first_model:predict
{"predictions": [[0.778520346]]}

The JSON key in the response is predictions, and its value is a list of output values for each sample. Because the model has only one output tensor, the output key output_1 is omitted.

inputs provides the input data of the model in columnar format; the JSON request looks like this:

{
  "inputs": {
    "input_1": [[1, 1]]
  }
}

Similarly, input_key can be omitted when the model has only one named input; the JSON request then looks like this:

{
  "inputs": [[1, 1]]
}

An example of using curl to predict a request using the above method is shown below:

$ curl -d '{"inputs": {"input_1": [[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]]}}' -X POST http://localhost:8501/v1/models/first_model:predict
$ curl -d '{"inputs": [[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]]}' -X POST http://localhost:8501/v1/models/first_model:predict
{"outputs": [[0.778520346]]}

Here, the key of the response JSON is outputs, and the value is a list corresponding to the output value of each sample. Since the model has only one output tensor, the output key value output_1 is also omitted here.

For a model with multiple input tensors and multiple output tensors, an example of a request and response in the input mode of instances is shown below:

{
  "instances": [
    {"input_1": [1, 1], "input_2": [1, 1, 1]},
    {"input_1": [1, 1], "input_2": [1, 1, 1]}
  ]
}
{
  "predictions": [
    {"output_1": [0.431975186], "output_2": [0.382744759, 0.32798624, 0.289268941]},
    {"output_1": [0.431975186], "output_2": [0.382744759, 0.32798624, 0.289268941]}
  ]
}

The corresponding request and response in the inputs format are shown below:

{
  "inputs": {
    "input_1": [[1, 1],
                [1, 1]],
    "input_2": [[1, 1, 1],
                [1, 1, 1]]
  }
}
{
  "outputs": {
    "output_1": [[0.431975186], [0.431975186]],
    "output_2": [[0.382744759, 0.32798624, 0.289268941],
                 [0.382744759, 0.32798624, 0.289268941]]
  }
}

To sum up the two input formats: the data provided via instances can be understood as a list of dict elements, where each dict in the list is one input sample, its keys are the names of the model’s input tensors, and its values are the corresponding tensor values; this format is easy to understand. inputs, on the other hand, is a single dict whose keys are the names of the model’s input tensors and whose values are lists, each element of which is the value of that tensor for one sample. For each input format, the response uses the corresponding output format.
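The same requests can of course be sent from Python instead of curl. A minimal sketch using the requests library (an illustrative choice; the 31 ones match the first_model input shape shown earlier):

import requests

URL = 'http://localhost:8501/v1/models/first_model:predict'
sample = [1] * 31  # one sample with 31 features

# Row ("instances") format: a list of per-sample dicts.
row_resp = requests.post(URL, json={'instances': [{'input_1': sample}]})
print(row_resp.json())  # {'predictions': [[...]]}

# Columnar ("inputs") format: a dict keyed by input tensor name.
col_resp = requests.post(URL, json={'inputs': {'input_1': [sample]}})
print(col_resp.json())  # {'outputs': [[...]]}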

gRPC remote call

In addition to the ability to use the HTTP interface for model access, TensorFlow Serving provides a gRPC interface to fulfill model access requests more efficiently.

At present, an official gRPC API is provided only for Python. To use it, first install the tensorflow-serving-api Python package matching your TensorFlow version (pip install tensorflow-serving-api==2.2.0), and then use the API it provides to make gRPC requests.

The sample code for the Python version of the Serving service using gRPC is shown below:

import grpc
import tensorflow as tf
from absl import app, flags
import numpy as np

from tensorflow.core.framework import types_pb2
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc

flags.DEFINE_string(
    'server', '127.0.0.1:8500', 'PredictionService host:port',
)
FLAGS = flags.FLAGS

def main(_):
    channel = grpc.insecure_channel(FLAGS.server)
    stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

    request = predict_pb2.PredictRequest()
    request.model_spec.name = 'first_model'
    request.model_spec.signature_name = 'serving_default'
    request.model_spec.version_label = "stable"
    # request.model_spec.version.value = 0
    data = np.ones((2, 31))
    request.inputs['input_1'].CopyFrom(
        tf.make_tensor_proto(data, dtype=types_pb2.DT_INT64))

    request.output_filter.append('output_1')
    result = stub.Predict(request, 10.0)  # 10 secs timeout

    print(result)
    print(result.outputs["output_1"].float_val)

if __name__ == '__main__':
    app.run(main)

The above code execution process is as follows:

  1. First create a gRPC channel and use it to initialize a client stub, which is used to call the remote Predict function.
  2. Then initialize a request object and set some of its attributes, such as the name of the requested model, the model version, the input tensor data, and so on.
  3. Finally, use the stub to send the request to the Serving service and receive the result.
  4. If the model has multiple output keys, you can pass an output_filter to filter for specific outputs; you can also use the output’s float_val attribute to read the values of the returned result.

Execute the above code and print something like this:

outputs {
  key: "output_1"
  value {
    dtype: DT_FLOAT
    tensor_shape {
      dim { size: 2 }
      dim { size: 1 }
    }
    float_val: 0.7785203456878662
    float_val: 0.7785203456878662
  }
}
model_spec {
  name: "first_model"
  version { value: 1 }
  signature_name: "serving_default"
}

[0.7785203456878662, 0.7785203456878662]
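If you prefer a numpy array to reading float_val directly, the TensorProto in the response can be converted with tf.make_ndarray. A small helper sketch (result refers to the response object returned by stub.Predict in the example above; the helper name is illustrative):

import tensorflow as tf

def output_to_numpy(result, key='output_1'):
    # Convert the TensorProto stored in the PredictResponse into a numpy
    # array, e.g. shape (2, 1) for the request in the example above.
    return tf.make_ndarray(result.outputs[key])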

In a real online environment, prediction is often a standalone service that receives a data request from the client, forwards the data to the Serving service for prediction, and returns the result. Such a service has to handle highly concurrent client requests, so we may not implement it with a Python framework, but instead choose a more mature, high-performance backend framework to forward requests, such as the Go web framework Gin, which performs better under high concurrency.

To use a high-performance framework in another language, we need a TensorFlow Serving gRPC API library in that language to make Serving requests. Since no official gRPC API is provided for languages other than Python, we have to generate the corresponding API code files from the .proto files in the source code.

API code files can be generated with the protoc tool. Note that the .proto files of TensorFlow Serving depend on some .proto files in the TensorFlow source, so when compiling the code files with protoc, both the TensorFlow and TensorFlow Serving sources need to be pulled locally and compiled together.

Once you have generated the API code file for the appropriate language version, you can access the Serving service using the gRPC interface similar to Python.

Model local testing

After saving the model in SavedModel format, in addition to using the TensorFlow Serving service to load the model for validation, you can also use the SavedModel command-line tool (CLI) to directly examine the SavedModel. The CLI allows you to quickly confirm the data types of the input and output tensors in the SavedModel and whether their dimensions match the tensors in the model definition. In addition, you can use the CLI for simple data testing to verify the usability of the model by passing sample data to the model and taking its output.

The CLI tool is installed along with TensorFlow; it is an executable named saved_model_cli in the bin directory of the TensorFlow installation. You can run saved_model_cli -h to see how to use it.

saved_model_cli has two commonly used commands: show and run.

Show Model information

The show command displays basic information about a SavedModel, similar to accessing the HTTP metadata interface to obtain metadata. Its usage is as follows:

$ saved_model_cli show [-h] --dir DIR [--all] [--tag_set TAG_SET] [--signature_def SIGNATURE_DEF_KEY]
optional arguments:
  -h, --help                         show this help message and exit
  --dir DIR                          directory containing the SavedModel to inspect
  --all                              if set, will output all information in given SavedModel
  --tag_set TAG_SET                  tag-set of graph in SavedModel to show, separated by ','
  --signature_def SIGNATURE_DEF_KEY  key of SignatureDef to display input(s) and output(s) for

For example, to view basic information about the first_model model whose version number is 0, run the following command:

$ saved_model_cli show --dir /models/first_model/0 --tag_set serve --signature_def serving_default
The given SavedModel SignatureDef contains the following input(s):
  inputs['input_1'] tensor_info:
      dtype: DT_INT64
      shape: (-1, 31)
      name: serving_default_input_1:0
The given SavedModel SignatureDef contains the following output(s):
  outputs['output_1'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, 1)
      name: StatefulPartitionedCall:0
Method name is: tensorflow/serving/predict

Run the model

The run command performs one model computation: given input data, it returns the output. Its usage is as follows:

saved_model_cli run [-h] --dir DIR --tag_set TAG_SET --signature_def SIGNATURE_DEF_KEY
                [--inputs INPUTS]|[--input_exprs INPUT_EXPRS]|[--input_examples INPUT_EXAMPLES]
                [--outdir OUTDIR]
                [--overwrite]
                [--tf_debug]

The run command provides three ways to supply input data: --inputs, --input_exprs, and --input_examples. Exactly one of them must be provided when running the command.

  1. --inputs: read a numpy array from a file. INPUTS can be either input_key=filename or input_key=filename[variable_name]. The file can be in .npy, .npz, or pickle format, and saved_model_cli loads it with the numpy.load method (a sketch for preparing such a file follows this list).

    When the file format is .npy, the array in the file is used directly as the input data for input_key.

    When the file format is .npz, if variable_name is specified, the .npy file named variable_name inside the .npz archive is loaded as the input data for input_key; if variable_name is not specified, any one .npy file in the .npz archive is loaded as the input data for input_key.

    When the file format is pickle, if variable_name is specified, saved_model_cli assumes that the pickle file stores a dictionary and reads the value corresponding to variable_name as the input for input_key; if variable_name is not specified, the entire content of the pickle file is used as the input for input_key.

  2. --input_exprs: use a Python expression as the input data, which is useful when you only need some simple sample data to test the SavedModel. INPUT_EXPRS can be a simple list such as input_key=[1, 1, 1], or a numpy expression such as input_key=np.ones((1, 3)).

  3. --input_examples: use tf.train.Example as the input data. The format of INPUT_EXAMPLES is input_key=[{"age":[22,24],"education":["BS","MS"]}], where the value of input_key is a list of dicts; each dict’s keys are the names of the model’s input features and its values are the lists of values for each feature. Whether to use this input method depends on the basic information of the SavedModel.
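As mentioned in item 1, the data file for --inputs can be prepared with numpy beforehand. A small sketch (the file name batch.npz and the variable name data are arbitrary choices for illustration):

import numpy as np

# Two samples with 31 features each, matching the first_model input shape.
data = np.ones((2, 31), dtype=np.int64)
np.savez('batch.npz', data=data)

# The file can then be passed to the CLI, for example:
#   saved_model_cli run --dir /models/first_model/0 --tag_set serve \
#       --signature_def serving_default --inputs "input_1=batch.npz[data]"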

In general, --input_exprs is the fastest and most convenient way to verify the availability of a SavedModel, for example:

$ saved_model_cli run --dir /models/first_model/0 --tag_set serve --signature_def serving_default --input_exprs "input_1=np.ones((1, 31))"
INFO:tensorflow:Restoring parameters from /models/first_model/0/variables/variables
Result for output key output_1:
[[0.77852035]]

References

  1. TensorFlow Serving with Docker
  2. TensorFlow Serving Configuration
  3. TensorFlow Serving RESTful API
  4. SavedModel Command Line Interface
  5. Python GRPC Client Example