Triton's Python Backend
GitHub: triton-inference-server/python_backend — Triton backend that enables pre-processing, post-processing and other logic to be implemented in Python.
Triton's Python backend lets you implement pre-processing, post-processing and other custom logic in Python. The backend embeds a Python runtime into the Triton server so that this logic is executed as part of inference; with it, users can write their own pre- and post-processing in Python to match the needs of their serving pipeline.
Model repository layout:
models
└── add_sub
    ├── 1
    │   └── model.py
    └── config.pbtxt
Per the official definition, we need to create a model.py inside the model's version directory, structured as follows:
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    """Your Python model must use the same class name. Every Python model
    that is created must have "TritonPythonModel" as the class name.
    """

    @staticmethod
    def auto_complete_config(auto_complete_model_config):
        """`auto_complete_config` is called only once when loading the model
        assuming the server was not started with
        `--disable-auto-complete-config`. Implementing this function is
        optional. No implementation of `auto_complete_config` will do nothing.
        This function can be used to set `max_batch_size`, `input` and `output`
        properties of the model using `set_max_batch_size`, `add_input`, and
        `add_output`. These properties will allow Triton to load the model with
        minimal model configuration in absence of a configuration file. This
        function returns the `pb_utils.ModelConfig` object with these
        properties. You can use the `as_dict` function to gain read-only access
        to the `pb_utils.ModelConfig` object. The `pb_utils.ModelConfig` object
        being returned from here will be used as the final configuration for
        the model.

        Note: The Python interpreter used to invoke this function will be
        destroyed upon returning from this function and as a result none of the
        objects created here will be available in the `initialize`, `execute`,
        or `finalize` functions.

        Parameters
        ----------
        auto_complete_model_config : pb_utils.ModelConfig
            An object containing the existing model configuration. You can build
            upon the configuration given by this object when setting the
            properties for this model.

        Returns
        -------
        pb_utils.ModelConfig
            An object containing the auto-completed model configuration
        """
        inputs = [{
            'name': 'INPUT0',
            'data_type': 'TYPE_FP32',
            'dims': [4],
            # this parameter will set `INPUT0` as an optional input
            'optional': True
        }, {
            'name': 'INPUT1',
            'data_type': 'TYPE_FP32',
            'dims': [4]
        }]
        outputs = [{
            'name': 'OUTPUT0',
            'data_type': 'TYPE_FP32',
            'dims': [4]
        }, {
            'name': 'OUTPUT1',
            'data_type': 'TYPE_FP32',
            'dims': [4]
        }]

        # Demonstrate the usage of `as_dict`, `add_input`, `add_output`,
        # `set_max_batch_size`, and `set_dynamic_batching` functions.
        # Store the model configuration as a dictionary.
        config = auto_complete_model_config.as_dict()
        input_names = []
        output_names = []
        for input in config['input']:
            input_names.append(input['name'])
        for output in config['output']:
            output_names.append(output['name'])

        for input in inputs:
            # The name checking here is only for demonstrating the usage of
            # `as_dict` function. `add_input` will check for conflicts and
            # raise errors if an input with the same name already exists in
            # the configuration but has different data_type or dims property.
            if input['name'] not in input_names:
                auto_complete_model_config.add_input(input)
        for output in outputs:
            # The name checking here is only for demonstrating the usage of
            # `as_dict` function. `add_output` will check for conflicts and
            # raise errors if an output with the same name already exists in
            # the configuration but has different data_type or dims property.
            if output['name'] not in output_names:
                auto_complete_model_config.add_output(output)

        auto_complete_model_config.set_max_batch_size(0)

        # To enable a dynamic batcher with default settings, you can use
        # auto_complete_model_config set_dynamic_batching() function. It is
        # commented in this example because the max_batch_size is zero.
        #
        # auto_complete_model_config.set_dynamic_batching()

        return auto_complete_model_config

    def initialize(self, args):
        """`initialize` is called only once when the model is being loaded.
        Implementing `initialize` function is optional. This function allows
        the model to initialize any state associated with this model.

        Parameters
        ----------
        args : dict
            Both keys and values are strings. The dictionary keys and values are:
            * model_config: A JSON string containing the model configuration
            * model_instance_kind: A string containing model instance kind
            * model_instance_device_id: A string containing model instance device
              ID
            * model_repository: Model repository path
            * model_version: Model version
            * model_name: Model name
        """
        print('Initialized...')

    def execute(self, requests):
        """`execute` must be implemented in every Python model. `execute`
        function receives a list of pb_utils.InferenceRequest as the only
        argument. This function is called when an inference is requested
        for this model.

        Parameters
        ----------
        requests : list
            A list of pb_utils.InferenceRequest

        Returns
        -------
        list
            A list of pb_utils.InferenceResponse. The length of this list must
            be the same as `requests`
        """
        responses = []

        # Every Python backend must iterate through list of requests and create
        # an instance of pb_utils.InferenceResponse class for each of them.
        # Reusing the same pb_utils.InferenceResponse object for multiple
        # requests may result in segmentation faults. You should avoid storing
        # any of the input Tensors in the class attributes as they will be
        # overridden in subsequent inference requests. You can make a copy of
        # the underlying NumPy array and store it if it is required.
        for request in requests:
            # Perform inference on the request and append it to responses
            # list...
            pass

        # You must return a list of pb_utils.InferenceResponse. Length
        # of this list must match the length of `requests` list.
        return responses

    def finalize(self):
        """`finalize` is called only once when the model is being unloaded.
        Implementing `finalize` function is optional. This function allows
        the model to perform any necessary clean ups before exit.
        """
        print('Cleaning up...')
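For reference, here is a minimal sketch of what a concrete execute() could look like for the add_sub model above (assuming the INPUT0/INPUT1 and OUTPUT0/OUTPUT1 tensors declared in auto_complete_config, and that the model simply returns their element-wise sum and difference; error handling is omitted):

import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            # Read both inputs as NumPy arrays. Note that INPUT0 was declared
            # optional above, so a production model should handle a None return.
            in0 = pb_utils.get_input_tensor_by_name(request, "INPUT0").as_numpy()
            in1 = pb_utils.get_input_tensor_by_name(request, "INPUT1").as_numpy()
            # Build exactly one InferenceResponse per request.
            out0 = pb_utils.Tensor("OUTPUT0", (in0 + in1).astype(np.float32))
            out1 = pb_utils.Tensor("OUTPUT1", (in0 - in1).astype(np.float32))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out0, out1]))
        return responses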
Creating a custom Python environment
The Python interpreter that ships with Triton's Python backend is a stock Python 3.10. If you need a different Python version you have to rebuild the Python backend yourself; if the version is fine and you only need extra packages, packing and exporting a conda environment is enough.
Taking PyTorch as an example:
export PYTHONNOUSERSITE=True
conda create -n triton python=3.10
conda activate triton
pip install torch torchvision torchaudio
conda install conda-pack
conda-pack  # pack the environment; the tarball is written to the current working directory
Then add the EXECUTION_ENV_PATH parameter to the model configuration:
name: "model_a"
backend: "python"
...
parameters: {
  key: "EXECUTION_ENV_PATH",
  value: {string_value: "/home/iman/miniconda3/envs/python-3-6/python3.6.tar.gz"}
}
A path relative to the model directory can also be used to point at the environment:
name: "model_a"
backend: "python"
...
parameters: {
  key: "EXECUTION_ENV_PATH",
  value: {string_value: "$$TRITON_MODEL_DIRECTORY/python3.6.tar.gz"}
}
Business Logic Scripting (BLS)
Triton's ensemble feature supports many use cases where multiple models are composed into a pipeline (or more generally a DAG, directed acyclic graph). However, there are many other use cases that are not supported because as part of the model pipeline they require loops, conditionals (if-then-else), data-dependent control-flow and other custom logic to be intermixed with model execution. We call this combination of custom logic and model executions Business Logic Scripting (BLS).
BLS should only be used inside the execute function; it is not supported in the initialize or finalize methods. The example below shows how to use this feature:
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    ...

    def execute(self, requests):
        ...
        # Create an InferenceRequest object. `model_name`,
        # `requested_output_names`, and `inputs` are the required arguments and
        # must be provided when constructing an InferenceRequest object. Make
        # sure to replace `inputs` argument with a list of `pb_utils.Tensor`
        # objects.
        inference_request = pb_utils.InferenceRequest(
            model_name='model_name',
            requested_output_names=['REQUESTED_OUTPUT_1', 'REQUESTED_OUTPUT_2'],
            inputs=[<pb_utils.Tensor object>])

        # `pb_utils.InferenceRequest` supports request_id, correlation_id,
        # model version, timeout and preferred_memory in addition to the
        # arguments described above.
        # Note: Starting from the 24.03 release, the `correlation_id` parameter
        # supports both string and unsigned integer values.
        # These arguments are optional. An example containing all the arguments:
        # inference_request = pb_utils.InferenceRequest(model_name='model_name',
        #     requested_output_names=['REQUESTED_OUTPUT_1', 'REQUESTED_OUTPUT_2'],
        #     inputs=[<list of pb_utils.Tensor objects>],
        #     request_id="1", correlation_id=4, model_version=1, flags=0, timeout=5,
        #     preferred_memory=pb_utils.PreferredMemory(
        #         pb_utils.TRITONSERVER_MEMORY_GPU,  # or pb_utils.TRITONSERVER_MEMORY_CPU
        #         0))

        # Execute the inference_request and wait for the response
        inference_response = inference_request.exec()

        # Check if the inference response has an error
        if inference_response.has_error():
            raise pb_utils.TritonModelException(
                inference_response.error().message())
        else:
            # Extract the output tensors from the inference response.
            output1 = pb_utils.get_output_tensor_by_name(
                inference_response, 'REQUESTED_OUTPUT_1')
            output2 = pb_utils.get_output_tensor_by_name(
                inference_response, 'REQUESTED_OUTPUT_2')

            # Decide the next steps for model execution based on the received
            # output tensors. It is possible to use the same output tensors
            # for the final inference response too.
As you can see, the core BLS APIs all live in triton_python_backend_utils: invoking other models, fetching outputs and the other key operations are provided by this utility module. Its commonly used functions are listed below.
triton_python_backend_utils
- get_input_tensor_by_name: get an input tensor from a request by name.
- get_output_tensor_by_name: get an output tensor from a response by name.
- get_input_config_by_name: get the configuration of an input by name.
- get_output_config_by_name: get the configuration of an output by name.
- get_input_names: get the names of all input tensors.
- get_output_names: get the names of all output tensors.
- Tensor: the object representing an input or output tensor.
- InferenceResponse: builds a response carrying output tensors.
- InferenceRequest: builds a request for invoking another model.
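As a small illustration of the config helpers (a sketch only; the output name OUTPUT0 is a placeholder, not something defined in this article), the data type declared for an output in config.pbtxt can be looked up in initialize and converted to a NumPy dtype:

import json

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        model_config = json.loads(args['model_config'])
        # Find the config entry of an output declared in config.pbtxt.
        output_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT0")
        # Convert the Triton type string (e.g. "TYPE_FP32") into a NumPy dtype.
        self.output_dtype = pb_utils.triton_string_to_numpy(output_config['data_type'])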
The Tensor object
The Tensor object represents an input or output tensor in the Triton Python backend and provides a number of methods for creating and manipulating tensors.
Commonly used APIs of the Tensor object:
- Tensor: the tensor object representing an input or output.
- name: get the tensor's name.
- dtype: get the tensor's data type.
- shape: get the tensor's shape.
- as_numpy: convert the tensor to a NumPy array.
- from_numpy: create a tensor from a NumPy array.
- get_byte_size: get the tensor's size in bytes.
- to_dlpack: convert the tensor to a DLPack capsule.
- from_dlpack: create a tensor from a DLPack capsule.
A construction example:
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        self.model_config = args['model_config']

    def execute(self, requests):
        responses = []
        for request in requests:
            input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT_TENSOR")
            output_tensor = pb_utils.Tensor("OUTPUT_TENSOR", input_tensor.as_numpy())
            response = pb_utils.InferenceResponse(output_tensors=[output_tensor])
            responses.append(response)
        return responses

    def finalize(self):
        print("Cleaning up...")
InferenceRequest and InferenceResponse
InferenceRequest and InferenceResponse are the key classes for issuing inference requests and building responses in the Triton Python backend; they provide the methods needed to drive the inference flow.
InferenceRequest
- __init__(self, model_name, model_version, requested_output_names, inputs, outputs): create a new inference request.
  - model_name: model name.
  - model_version: model version.
  - requested_output_names: list of requested output names.
  - inputs: list of input tensors.
  - outputs: list of output tensors.
- model_name(self): return the model name of the request.
- model_version(self): return the model version of the request.
- requested_output_names(self): return the requested output names of the request.
- inputs(self): return the input tensors of the request.
- outputs(self): return the output tensors of the request.
InferenceResponse
- __init__(self, output_tensors, error_message=None): create a new inference response.
  - output_tensors: list of output tensors.
  - error_message: error message, if any.
- output_tensors(self): return the output tensors of the response.
- error_message(self): return the error message of the response.
- has_error(self): check whether the response contains an error.
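When request handling fails, the usual pattern is to return a response that carries an error instead of output tensors. A sketch (note that recent pb_utils versions express this through a pb_utils.TritonError object passed as error, rather than a plain error_message string):

import triton_python_backend_utils as pb_utils

error_response = pb_utils.InferenceResponse(
    output_tensors=[],
    error=pb_utils.TritonError("failed to preprocess the input image"))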
Deploying TrOCR-Seal-Recognition with BLS
GitHub repository: Gmgge/TrOCR-Seal-Recognition — transformer-based OCR extended to official seals (seal recognition).
TrOCR-Seal-Recognition is an end-to-end seal recognition project trained on top of TrOCR-Chinese. The pretrained model provided in the repository already has a basic ability to recognize seals; here we use Triton to build a service interface around it for external callers.
The download contains decoder_model.onnx and encoder_model.onnx. Since the pretrained models are already exported to ONNX, we can integrate them directly through Triton's ONNX backend; the steps are below.
Model deployment
Following the official Triton model-repository documentation, the deployment layout can be set up quickly; the models and their configurations are shown below.
.
├── seal_decoder
│   ├── 1
│   │   └── model.onnx
│   └── config.pbtxt
└── seal_encoder
    ├── 1
    │   └── model.onnx
    └── config.pbtxt
Before writing the configuration we need the models' input and output shapes. The ONNX APIs make it easy to read the input/output tensors defined at training time (for dynamic models the returned sizes are 0). This API was described in the previous section, so it is not repeated here; a minimal sketch follows for convenience.
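Since that section is not reproduced here, the following minimal sketch shows how the input/output metadata can be read with onnxruntime (depending on how the model was exported, dynamic axes may also show up as symbolic names or None):

import onnxruntime

sess = onnxruntime.InferenceSession("decoder_model.onnx",
                                    providers=["CPUExecutionProvider"])
for t in sess.get_inputs():
    print("input ", t.name, t.type, t.shape)
for t in sess.get_outputs():
    print("output", t.name, t.type, t.shape)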
- decoder
name: "seal_decoder"
platform: "onnxruntime_onnx"
max_batch_size: 0
input [
{
name: "input_ids"
data_type: TYPE_INT64
dims: [ -1,-1 ]
},
{
name: "attention_mask"
data_type: TYPE_INT64
dims: [ -1,-1 ]
},
{
name: "encoder_hidden_states"
data_type: TYPE_FP32
dims: [ -1,-1,384 ]
}
]
output [
{
name: "logits"
data_type: TYPE_FP32
dims: [ -1,-1,3584 ]
}
]
instance_group: [
{
count: 1 # 数量
kind: KIND_GPU # 类型
gpus: [ 0 ] # 如果参数项为GPU,则该列表将指定对应序号下的可见CUDA设备来运行模型
}
]
- encoder
name: "seal_encoder"
platform: "onnxruntime_onnx"
max_batch_size: 0
input [
{
name: "pixel_values"
data_type: TYPE_FP32
dims: [ 1,3,-1,-1 ]
}
]
output [
{
name: "last_hidden_state"
data_type: TYPE_FP32
dims: [ -1, -1, -1 ]
},
{
name: "1533",
data_type: TYPE_FP32
dims: [ -1, 384 ]
}
]
instance_group: [
{
count: 1 # 数量
kind: KIND_GPU # 类型
gpus: [ 0 ] # 如果参数项为GPU,则该列表将指定对应序号下的可见CUDA设备来运行模型
}
]
Then start the Triton server and the deployment is complete.
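To check that both models actually loaded, the readiness endpoints can be queried, for example with the tritonclient package (the URL assumes a local server on the default HTTP port):

import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
print("server ready:", client.is_server_ready())
print("seal_encoder ready:", client.is_model_ready("seal_encoder"))
print("seal_decoder ready:", client.is_model_ready("seal_decoder"))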
Deploying the seal-recognition model with BLS
GitHub: Gmgge/TrOCR-Seal-Recognition — transformer-based OCR extended to official seals (seal recognition).
TrOCR-Seal-Recognition is an end-to-end seal recognition model based on TrOCR; the repository provides pretrained ONNX models. We now deploy it with Triton's BLS so that a single call returns the final result.
The official download contains two models: encoder_model.onnx and decoder_model.onnx.
Deploying the ONNX models
The ONNX models are deployed as described in the previous section: write the configuration files and build the model-repository structure. The configurations of the two models are below.
- encoder_model.onnx
name: "seal_encoder"
platform: "onnxruntime_onnx"
max_batch_size: 0
input [
{
name: "pixel_values"
data_type: TYPE_FP32
dims: [ -1,-1,-1,-1 ]
}
]
output [
{
name: "last_hidden_state"
data_type: TYPE_FP32
dims: [ -1, -1, -1 ]
},
{
name: "1533",
data_type: TYPE_FP32
dims: [ -1, 384 ]
}
]
instance_group: [
{
count: 1 # 数量
kind: KIND_GPU # 类型
gpus: [ 0 ] # 如果参数项为GPU,则该列表将指定对应序号下的可见CUDA设备来运行模型
}
]
- decoder_model.onnx
name: "seal_decoder"
platform: "onnxruntime_onnx"
max_batch_size: 0
input [
{
name: "input_ids"
data_type: TYPE_INT64
dims: [ -1,-1 ]
},
{
name: "attention_mask"
data_type: TYPE_INT64
dims: [ -1,-1 ]
},
{
name: "encoder_hidden_states"
data_type: TYPE_FP32
dims: [ -1,-1,384 ]
}
]
output [
{
name: "logits"
data_type: TYPE_FP32
dims: [ -1,-1,3584 ]
}
]
instance_group: [
{
count: 1 # 数量
kind: KIND_GPU # 类型
gpus: [ 0 ] # 如果参数项为GPU,则该列表将指定对应序号下的可见CUDA设备来运行模型
}
]
The model repository now has the following structure:
.
├── seal_bls
│   ├── 1
│   │   ├── model.py
│   │   └── vocab.json
│   └── config.pbtxt
├── seal_decoder
│   ├── 1
│   │   └── model.onnx
│   └── config.pbtxt
└── seal_encoder
    ├── 1
    │   └── model.onnx
    └── config.pbtxt
Writing the BLS code
By its definition, BLS is just a way of using the Python backend: we need to write Python code that orchestrates the models and implements the pre- and post-processing.
The TrOCR-Seal-Recognition project documents how to call the seal-recognition models step by step; porting that code into a BLS script is essentially all the BLS development required.
- onnx_test.py
import argparse
import json
import os
import statistics

import cv2
import numpy as np
import onnxruntime
from scipy.special import softmax


def read_vocab(path):
    """
    Load the vocabulary.
    """
    with open(path, encoding="utf-8") as f:
        vocab = json.load(f)
    return vocab


def do_norm(x):
    mean = [0.5, 0.5, 0.5]
    std = [0.5, 0.5, 0.5]
    x = x / 255.0
    x[0, :, :] -= mean[0]
    x[1, :, :] -= mean[1]
    x[2, :, :] -= mean[2]
    x[0, :, :] /= std[0]
    x[1, :, :] /= std[1]
    x[2, :, :] /= std[2]
    return x


def decode_text(tokens, vocab, vocab_inp):
    """
    decode trocr
    """
    s_start = vocab.get('<s>')
    s_end = vocab.get('</s>')
    unk = vocab.get('<unk>')
    pad = vocab.get('<pad>')
    text = ''
    for tk in tokens:
        if tk == s_end:
            break
        if tk not in [s_end, s_start, pad, unk]:
            text += vocab_inp[tk]
    return text


class OnnxEncoder(object):
    def __init__(self, model_path):
        self.model = onnxruntime.InferenceSession(model_path, providers=onnxruntime.get_available_providers())

    def __call__(self, image):
        onnx_inputs = {self.model.get_inputs()[0].name: np.asarray(image, dtype='float32')}
        onnx_output = self.model.run(None, onnx_inputs)[0]
        return onnx_output


class OnnxDecoder(object):
    def __init__(self, model_path):
        self.model = onnxruntime.InferenceSession(model_path, providers=onnxruntime.get_available_providers())
        self.input_names = {input_key.name: idx for idx, input_key in enumerate(self.model.get_inputs())}

    def __call__(self, input_ids, encoder_hidden_states, attention_mask):
        onnx_inputs = {"input_ids": input_ids,
                       "attention_mask": attention_mask,
                       "encoder_hidden_states": encoder_hidden_states}
        onnx_output = self.model.run(['logits'], onnx_inputs)
        return onnx_output


class OnnxEncoderDecoder(object):
    def __init__(self, model_path):
        self.encoder = OnnxEncoder(os.path.join(model_path, "encoder_model.onnx"))
        self.decoder = OnnxDecoder(os.path.join(model_path, "decoder_model.onnx"))
        self.vocab = read_vocab(os.path.join(model_path, "vocab.json"))
        self.vocab_inp = {self.vocab[key]: key for key in self.vocab}
        self.threshold = 0.88  # confidence threshold; set high because no negative-sample training was done
        self.max_len = 50      # maximum text length

    def run(self, image):
        """
        rgb: image
        """
        image = cv2.resize(image, (384, 384))
        pixel_values = cv2.split(np.array(image))
        pixel_values = do_norm(np.array(pixel_values))
        pixel_values = np.array([pixel_values])
        encoder_output = self.encoder(pixel_values)
        ids = [self.vocab["<s>"], ]
        mask = [1, ]
        scores = []
        for i in range(self.max_len):
            input_ids = np.array([ids]).astype('int64')
            attention_mask = np.array([mask]).astype('int64')
            decoder_output = self.decoder(input_ids=input_ids,
                                          encoder_hidden_states=encoder_output,
                                          attention_mask=attention_mask)
            pred = decoder_output[0][0]
            pred = softmax(pred, axis=1)
            max_index = pred.argmax(axis=1)
            if max_index[-1] == self.vocab["</s>"]:
                break
            scores.append(pred[max_index.shape[0] - 1, max_index[-1]])
            ids.append(max_index[-1])
            mask.append(1)
        print("Per-token decoding scores: {}".format(scores))
        print("Average decoding score: {}".format(statistics.mean(scores)))
        # if self.threshold < statistics.mean(scores):
        text = decode_text(ids, self.vocab, self.vocab_inp)
        # else:
        #     text = ""
        return text


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='onnx model test')
    parser.add_argument('--model', type=str, help="path to the directory containing the onnx models")
    parser.add_argument('--test_img', type=str, help="path to a test image")
    args = parser.parse_args()
    model = OnnxEncoderDecoder(args.model)
    img = cv2.imread(args.test_img)
    img = img[..., ::-1]  # BGR to RGB
    res = model.run(img)
    print(res)
Ported into the Python backend, the same logic becomes the BLS script, i.e. the model.py placed under seal_bls/1:
- model.py
import json
import os

import numpy as np
import triton_python_backend_utils as pb_utils
from scipy.special import softmax
from torch.utils.dlpack import from_dlpack, to_dlpack


class TritonPythonModel:
    def initialize(self, args):
        """`initialize` is called only once when the model is being loaded.
        Implementing `initialize` function is optional. This function allows
        the model to initialize any state associated with this model.

        Parameters
        ----------
        args : dict
            Both keys and values are strings. The dictionary keys and values are:
            * model_config: A JSON string containing the model configuration
            * model_instance_kind: A string containing model instance kind
            * model_instance_device_id: A string containing model instance device ID
            * model_repository: Model repository path
            * model_version: Model version
            * model_name: Model name
        """
        print('Toguide Seal Recognition BLS model initializing...')
        self.model_config = json.loads(args['model_config'])
        cur_path = os.path.abspath(__file__)
        dir_path = os.path.dirname(cur_path)
        self.vocab = self.read_vocab(os.path.join(dir_path, "vocab.json"))
        self.vocab_inp = {self.vocab[key]: key for key in self.vocab}
        self.max_len = 50

    def execute(self, requests):
        print('Toguide Seal Recognition BLS model executing...')
        response = []
        for request in requests:
            encoder_model_name = "seal_encoder"
            decoder_model_name = "seal_decoder"
            input = pb_utils.get_input_tensor_by_name(request, 'pixel_values')
            encoder_response = self.request_execute([input], encoder_model_name,
                                                    ["last_hidden_state", "1533"])
            ids = [self.vocab["<s>"], ]
            mask = [1, ]
            scores = []
            for i in range(self.max_len):
                input_ids_tensor = pb_utils.Tensor("input_ids", np.array([ids]).astype('int64'))
                attention_mask_tensor = pb_utils.Tensor("attention_mask", np.array([mask]).astype('int64'))
                hidden_state_tensor = pb_utils.get_output_tensor_by_name(encoder_response,
                                                                         "last_hidden_state")
                # The encoder output may live on the GPU, so move it through DLPack
                # instead of as_numpy (see "Problems encountered" below).
                hidden_state_torch_tensor = self.pb_tensor_transform(hidden_state_tensor)
                encoder_hidden_states_tensor = pb_utils.Tensor.from_dlpack(
                    "encoder_hidden_states", to_dlpack(hidden_state_torch_tensor))
                decoder_response = self.request_execute(
                    [input_ids_tensor, attention_mask_tensor, encoder_hidden_states_tensor],
                    decoder_model_name, ["logits"])
                logits_torch_tensor = self.pb_tensor_transform(
                    pb_utils.get_output_tensor_by_name(decoder_response, "logits"))
                pred = logits_torch_tensor.cpu().numpy()[0]
                pred = softmax(pred, axis=1)
                max_index = pred.argmax(axis=1)
                if max_index[-1] == self.vocab["</s>"]:
                    break
                scores.append(pred[max_index.shape[0] - 1, max_index[-1]])
                ids.append(max_index[-1])
                mask.append(1)
            # print("Per-token decoding scores: {}".format(scores))
            # print("Average decoding score: {}".format(statistics.mean(scores)))
            text = self.decode_text(ids)
            utf8_bytes = self.string_to_utf8_bytes(text)
            inference_response = pb_utils.InferenceResponse(
                output_tensors=[pb_utils.Tensor("OUTPUT_STRING", utf8_bytes)])
            response.append(inference_response)
        return response

    def string_to_utf8_bytes(self, s):
        return np.frombuffer(s.encode('utf-8'), dtype=np.uint8)

    def decode_text(self, tokens):
        """
        decode trocr
        """
        s_start = self.vocab.get('<s>')
        s_end = self.vocab.get('</s>')
        unk = self.vocab.get('<unk>')
        pad = self.vocab.get('<pad>')
        text = ''
        for tk in tokens:
            if tk == s_end:
                break
            if tk not in [s_end, s_start, pad, unk]:
                text += self.vocab_inp[tk]
        return text

    # BLS: call another model that is loaded on the same server
    def request_execute(self, frames_tensor, model_name_string, model_output_name):
        # frames_tensor: a list of pb_utils.Tensor inputs
        inference_request = pb_utils.InferenceRequest(
            model_name=model_name_string,
            requested_output_names=model_output_name,
            inputs=frames_tensor)
        inference_response = inference_request.exec()
        if inference_response.has_error():
            raise pb_utils.TritonModelException(inference_response.error().message())
        return inference_response

    def finalize(self):
        """`finalize` is called only once when the model is being unloaded.
        Implementing `finalize` function is optional. This function allows
        the model to release any resources used for the model.
        """
        print('Toguide Seal Recognition BLS model finalizing...')

    def read_vocab(self, path):
        """
        Load the vocabulary.
        """
        with open(path, encoding="utf-8") as f:
            vocab = json.load(f)
        return vocab

    def pb_tensor_transform(self, pb_tensor):
        if pb_tensor.is_cpu():
            return pb_tensor.as_numpy()
        else:
            pytorch_tensor = from_dlpack(pb_tensor.to_dlpack())
            return pytorch_tensor
            # return pytorch_tensor.cpu().numpy()
Then, just as with any other model, write the configuration file and place model.py in the version directory.
Note: BLS runs on the Python backend, so the backend field in the configuration must be set to python!
config.pbtxt
name: "seal_bls"
backend: "python"
max_batch_size: 0
input [
{
name: "pixel_values"
data_type: TYPE_FP32
dims: [ -1,-1,-1,-1 ]
}
]
output [
{
name: "OUTPUT_STRING"
data_type: TYPE_STRING
dims: [ -1 ]
}
]
instance_group: [
{
count: 1 # 数量
kind: KIND_GPU # 类型
gpus: [ 0 ] # 如果参数项为GPU,则该列表将指定对应序号下的可见CUDA设备来运行模型
}
]
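Once the server is running, the BLS model is called like any other model. Below is a sketch of a client (assumptions: the tritonclient package, a local server on the default HTTP port, a hypothetical seal.jpg test image, and the same preprocessing as onnx_test.py):

import cv2
import numpy as np
import tritonclient.http as httpclient


def preprocess(img_bgr):
    # Same preprocessing as onnx_test.py: BGR->RGB, resize to 384x384,
    # HWC->CHW, scale to [0, 1] and normalize with mean/std 0.5.
    img = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)
    img = cv2.resize(img, (384, 384)).astype(np.float32)
    img = img.transpose(2, 0, 1) / 255.0
    img = (img - 0.5) / 0.5
    return img[np.newaxis, ...].astype(np.float32)  # shape (1, 3, 384, 384)


client = httpclient.InferenceServerClient(url="localhost:8000")
pixel_values = preprocess(cv2.imread("seal.jpg"))

inp = httpclient.InferInput("pixel_values", list(pixel_values.shape), "FP32")
inp.set_data_from_numpy(pixel_values)
out = httpclient.InferRequestedOutput("OUTPUT_STRING")

result = client.infer("seal_bls", inputs=[inp], outputs=[out])
raw = result.as_numpy("OUTPUT_STRING")
# The BLS model emits the text as raw UTF-8 bytes (see string_to_utf8_bytes);
# if Triton returns them as a BYTES/object array instead, use raw[0].decode().
print(raw.tobytes().decode("utf-8"))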
Problems encountered
Tensor is stored in GPU and cannot be converted to NumPy
The tensor lives on the GPU and cannot be converted to NumPy directly, and Triton does not provide another interface for reading the data. There is no good solution at the moment, so I use a somewhat clumsy workaround; it costs a little performance, though not much, and a proper interface will have to come from Triton. The idea is to first convert the tensor to a PyTorch tensor via DLPack and then call its numpy method.
from torch.utils.dlpack import from_dlpack


def pb_tensor_to_numpy(pb_tensor):
    # CPU tensors can be read directly; GPU tensors go through DLPack -> PyTorch -> CPU.
    if pb_tensor.is_cpu():
        return pb_tensor.as_numpy()
    else:
        pytorch_tensor = from_dlpack(pb_tensor.to_dlpack())
        return pytorch_tensor.cpu().numpy()