Deploying a Seal Recognition Model with Triton BLS

Triton's Python Backend

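GitHub repository: triton-inference-server/python_backend (a Triton backend that enables pre-processing, post-processing and other logic to be implemented in Python).
Triton's Python Backend is a Triton backend that lets you implement pre-processing, post-processing, and other logic in Python. It embeds Python modules into the Triton server so that custom logic runs as part of inference, which means you can write your own pre- and post-processing in Python to fit your specific inference needs.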


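According to the official documentation, a model.py file must be created in the model's version directory, as in the following layout:

└── add_sub
    ├── 1
    │   └── model.py
    └── config.pbtxt

The model.py template from the official repository looks like this: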

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    """Your Python model must use the same class name. Every Python model
    that is created must have "TritonPythonModel" as the class name.
    """

    @staticmethod
    def auto_complete_config(auto_complete_model_config):
        """`auto_complete_config` is called only once when loading the model
        assuming the server was not started with
        `--disable-auto-complete-config`. Implementing this function is
        optional. No implementation of `auto_complete_config` will do nothing.
        This function can be used to set `max_batch_size`, `input` and `output`
        properties of the model using `set_max_batch_size`, `add_input`, and
        `add_output`. These properties will allow Triton to load the model with
        minimal model configuration in absence of a configuration file. This
        function returns the `pb_utils.ModelConfig` object with these
        properties. You can use the `as_dict` function to gain read-only access
        to the `pb_utils.ModelConfig` object. The `pb_utils.ModelConfig` object
        being returned from here will be used as the final configuration for
        the model.

        Note: The Python interpreter used to invoke this function will be
        destroyed upon returning from this function and as a result none of the
        objects created here will be available in the `initialize`, `execute`,
        or `finalize` functions.

        Parameters
        ----------
        auto_complete_model_config : pb_utils.ModelConfig
          An object containing the existing model configuration. You can build
          upon the configuration given by this object when setting the
          properties for this model.

        Returns
        -------
        pb_utils.ModelConfig
          An object containing the auto-completed model configuration
        """
        inputs = [{
            'name': 'INPUT0',
            'data_type': 'TYPE_FP32',
            'dims': [4],
            # this parameter will set `INPUT0 as an optional input`
            'optional': True
        }, {
            'name': 'INPUT1',
            'data_type': 'TYPE_FP32',
            'dims': [4]
        }]
        outputs = [{
            'name': 'OUTPUT0',
            'data_type': 'TYPE_FP32',
            'dims': [4]
        }, {
            'name': 'OUTPUT1',
            'data_type': 'TYPE_FP32',
            'dims': [4]
        }]

        # Demonstrate the usage of `as_dict`, `add_input`, `add_output`,
        # `set_max_batch_size`, and `set_dynamic_batching` functions.
        # Store the model configuration as a dictionary.
        config = auto_complete_model_config.as_dict()
        input_names = []
        output_names = []
        for input in config['input']:
            input_names.append(input['name'])
        for output in config['output']:
            output_names.append(output['name'])

        for input in inputs:
            # The name checking here is only for demonstrating the usage of
            # `as_dict` function. `add_input` will check for conflicts and
            # raise errors if an input with the same name already exists in
            # the configuration but has different data_type or dims property.
            if input['name'] not in input_names:
                auto_complete_model_config.add_input(input)
        for output in outputs:
            # The name checking here is only for demonstrating the usage of
            # `as_dict` function. `add_output` will check for conflicts and
            # raise errors if an output with the same name already exists in
            # the configuration but has different data_type or dims property.
            if output['name'] not in output_names:
                auto_complete_model_config.add_output(output)

        auto_complete_model_config.set_max_batch_size(0)

        # To enable a dynamic batcher with default settings, you can use
        # auto_complete_model_config set_dynamic_batching() function. It is
        # commented in this example because the max_batch_size is zero.
        # auto_complete_model_config.set_dynamic_batching()

        return auto_complete_model_config

    def initialize(self, args):
        """`initialize` is called only once when the model is being loaded.
        Implementing `initialize` function is optional. This function allows
        the model to initialize any state associated with this model.

        Parameters
        ----------
        args : dict
          Both keys and values are strings. The dictionary keys and values are:
          * model_config: A JSON string containing the model configuration
          * model_instance_kind: A string containing model instance kind
          * model_instance_device_id: A string containing model instance device
            ID
          * model_repository: Model repository path
          * model_version: Model version
          * model_name: Model name
        """
        print('Initialized...')

    def execute(self, requests):
        """`execute` must be implemented in every Python model. `execute`
        function receives a list of pb_utils.InferenceRequest as the only
        argument. This function is called when an inference is requested
        for this model.

        Parameters
        ----------
        requests : list
          A list of pb_utils.InferenceRequest

        Returns
        -------
        list
          A list of pb_utils.InferenceResponse. The length of this list must
          be the same as `requests`
        """

        responses = []

        # Every Python backend must iterate through list of requests and create
        # an instance of pb_utils.InferenceResponse class for each of them.
        # Reusing the same pb_utils.InferenceResponse object for multiple
        # requests may result in segmentation faults. You should avoid storing
        # any of the input Tensors in the class attributes as they will be
        # overridden in subsequent inference requests. You can make a copy of
        # the underlying NumPy array and store it if it is required.
        for request in requests:
            # Perform inference on the request and append it to responses
            # list...
            # Placeholder so the template is valid Python: replace the empty
            # tensor list with the output tensors of your own model.
            responses.append(pb_utils.InferenceResponse(output_tensors=[]))

        # You must return a list of pb_utils.InferenceResponse. Length
        # of this list must match the length of `requests` list.
        return responses

    def finalize(self):
        """`finalize` is called only once when the model is being unloaded.
        Implementing `finalize` function is optional. This function allows
        the model to perform any necessary clean ups before exit.
        """
        print('Cleaning up...')

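Creating a Custom Python Environment

The Python interpreter shipped with Triton's Python Backend is a clean 3.10 installation. If you need a different Python version, you have to build the Python Backend yourself. If the version is fine, you only need to export your environment.

Taking PyTorch as an example:

conda create -n triton python=3.10
conda activate triton  # install into the new environment rather than base
pip install torch torchvision torchaudio
conda install conda-pack
conda-pack  # run the packer; the archive is written to the current directory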


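Next, point the model at the packed environment through the EXECUTION_ENV_PATH parameter in config.pbtxt (the example below comes from the official documentation, hence the python3.6 path):

name: "model_a"
backend: "python"


parameters: {
  key: "EXECUTION_ENV_PATH",
  value: {string_value: "/home/iman/miniconda3/envs/python-3-6/python3.6.tar.gz"}
}

The archive can also be placed inside the model directory and referenced through $$TRITON_MODEL_DIRECTORY:

name: "model_a"
backend: "python"


parameters: {
  key: "EXECUTION_ENV_PATH",
  value: {string_value: "$$TRITON_MODEL_DIRECTORY/python3.6.tar.gz"}
}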

Business Logic Scripting (BLS)

Triton's ensemble feature supports many use cases where multiple models are composed into a pipeline (or more generally a DAG, directed acyclic graph). However, there are many other use cases that are not supported because as part of the model pipeline they require loops, conditionals (if-then-else), data-dependent control-flow and other custom logic to be intermixed with model execution. We call this combination of custom logic and model executions Business Logic Scripting (BLS).


BLS should only be used inside the execute function; it is not supported in the initialize or finalize methods. The following example shows how to use it:

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        # Create an InferenceRequest object. `model_name`,
        # `requested_output_names`, and `inputs` are the required arguments and
        # must be provided when constructing an InferenceRequest object. Make
        # sure to replace `inputs` argument with a list of `pb_utils.Tensor`
        # objects (the angle-bracket value below is a placeholder).
        inference_request = pb_utils.InferenceRequest(
            model_name='model_name',
            requested_output_names=['REQUESTED_OUTPUT_1', 'REQUESTED_OUTPUT_2'],
            inputs=[<pb_utils.Tensor object>])

        # `pb_utils.InferenceRequest` supports request_id, correlation_id,
        # model version, timeout and preferred_memory in addition to the
        # arguments described above.
        # Note: Starting from the 24.03 release, the `correlation_id` parameter
        # supports both string and unsigned integer values.
        # These arguments are optional. An example containing all the arguments:
        # inference_request = pb_utils.InferenceRequest(model_name='model_name',
        #   requested_output_names=['REQUESTED_OUTPUT_1', 'REQUESTED_OUTPUT_2'],
        #   inputs=[<list of pb_utils.Tensor objects>],
        #   request_id="1", correlation_id=4, model_version=1, flags=0, timeout=5,
        #   preferred_memory=pb_utils.PreferredMemory(
        #     pb_utils.TRITONSERVER_MEMORY_CPU,  # or pb_utils.TRITONSERVER_MEMORY_GPU
        #     0))

        # Execute the inference_request and wait for the response
        inference_response = inference_request.exec()

        # Check if the inference response has an error
        if inference_response.has_error():
            raise pb_utils.TritonModelException(
                inference_response.error().message())
        else:
            # Extract the output tensors from the inference response.
            output1 = pb_utils.get_output_tensor_by_name(
                inference_response, 'REQUESTED_OUTPUT_1')
            output2 = pb_utils.get_output_tensor_by_name(
                inference_response, 'REQUESTED_OUTPUT_2')

            # Decide the next steps for model execution based on the received
            # output tensors. It is possible to use the same output tensors
            # for the final inference response too.
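
To make the contrast with ensembles concrete, here is a minimal plain-Python sketch of the kind of data-dependent control flow that BLS allows. It does not use Triton at all: `call_model`, the `detector` and `recognizer` entries, and the 0.5 threshold are hypothetical stand-ins for `pb_utils.InferenceRequest(...).exec()` calls, chosen only for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a set of deployed models: name -> plain function.
ModelRegistry = Dict[str, Callable[[List[float]], List[float]]]


def call_model(models: ModelRegistry, name: str, data: List[float]) -> List[float]:
    """Stand-in for building an InferenceRequest and calling exec()."""
    if name not in models:
        raise KeyError(f"unknown model: {name}")
    return models[name](data)


def pipeline(models: ModelRegistry, data: List[float], max_steps: int = 3) -> List[float]:
    """Call 'detector' once, then loop over 'recognizer' only if the score is high."""
    score = call_model(models, "detector", data)[0]
    if score < 0.5:  # conditional: skip the second model entirely
        return []
    result = data
    for _ in range(max_steps):  # loop: the number of calls depends on the data
        refined = call_model(models, "recognizer", result)
        if refined == result:
            break
        result = refined
    return result


models: ModelRegistry = {
    "detector": lambda xs: [max(xs)],
    "recognizer": lambda xs: [min(x, 1.0) for x in xs],
}
print(pipeline(models, [0.2, 3.0]))  # [0.2, 1.0]
print(pipeline(models, [0.1, 0.3]))  # []
```

An ensemble fixes the graph of model calls ahead of time, so neither the early return nor the data-dependent loop above could be expressed in it; in a BLS script both are ordinary Python.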

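As you can see, the core BLS APIs all live in triton_python_backend_utils: calling other models, reading outputs, and the other core operations all go through this utility module. Its commonly used functions and classes are listed below.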


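  • get_input_tensor_by_name: get an input tensor by name.
  • get_output_tensor_by_name: get an output tensor by name.
  • get_input_config_by_name: get the configuration of an input tensor.
  • get_output_config_by_name: get the configuration of an output tensor.
  • get_input_names: get the names of all input tensors.
  • get_output_names: get the names of all output tensors.
  • Tensor: the object that represents an input or output tensor.
  • InferenceResponse: builds a response from output tensors.
  • InferenceRequest: builds a request that calls another model.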

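The Tensor Object

The Tensor object is the key object in the Triton Python Backend for representing input and output tensors, and it provides several methods for working with them.

Commonly used parts of the Tensor API:

  1. Tensor: represents an input or output tensor.
    • name: get the tensor's name.
    • dtype: get the tensor's data type.
    • shape: get the tensor's shape.
    • as_numpy: convert the tensor to a NumPy array.
    • from_numpy: create a tensor from a NumPy array.
    • get_byte_size: get the tensor's size in bytes.
    • to_dlpack: convert the tensor to a DLPack object.
    • from_dlpack: create a tensor from a DLPack object.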


import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # model_config is passed in as a JSON string.
        self.model_config = args['model_config']

    def execute(self, requests):
        responses = []
        for request in requests:
            # Read the input by name and echo it back under the output name.
            input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT_TENSOR")
            output_tensor = pb_utils.Tensor("OUTPUT_TENSOR", input_tensor.as_numpy())
            response = pb_utils.InferenceResponse(output_tensors=[output_tensor])
            # One response per request, in the same order.
            responses.append(response)
        return responses

    def finalize(self):
        print("Cleaning up...")

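InferenceRequest and InferenceResponse

InferenceRequest and InferenceResponse are the key classes in the Triton Python Backend for handling inference requests and responses; they provide the methods needed to drive an inference call.

InferenceRequest:

  • __init__(self, model_name, model_version, requested_output_names, inputs, outputs)

    • Creates a new inference request.
    • Parameters:
      • model_name: the model name.
      • model_version: the model version.
      • requested_output_names: the list of requested output names.
      • inputs: the list of input tensors.
      • outputs: the list of output tensors.
  • model_name(self)

    • Returns the model name of the inference request.
  • model_version(self)

    • Returns the model version of the inference request.
  • requested_output_names(self)

    • Returns the list of output names of the inference request.
  • inputs(self)

    • Returns the list of input tensors of the inference request.
  • outputs(self)

    • Returns the list of output tensors of the inference request.

InferenceResponse:

  • __init__(self, output_tensors, error_message=None)

    • Creates a new inference response.
    • Parameters:
      • output_tensors: the list of output tensors.
      • error_message: the error information, if any.
  • output_tensors(self)

    • Returns the list of output tensors of the inference response.
  • error_message(self)

    • Returns the error information of the inference response. (The code in this article reads it through error().message().)
  • has_error(self)

    • Checks whether the inference response contains an error.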

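Deploying TrOCR-Seal-Recognition with BLS

TrOCR-Seal-Recognition is an end-to-end seal recognition project trained on top of the TrOCR-Chinese project. The pre-trained model in the project repository already has basic seal recognition ability; we use Triton to put a service interface in front of it for external callers.

The download contains decoder_model.onnx and encoder_model.onnx. Because the pre-trained model has already been converted to ONNX, we can integrate it directly through Triton's ONNX backend. The integration steps follow.


Following the official Triton model repository documentation, the whole deployment structure can be set up quickly. The deployed models and their configurations are shown below.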

├── seal_decoder
│   ├── 1
│   │   └── model.onnx
│   └── config.pbtxt
└── seal_encoder
    ├── 1
    │   └── model.onnx
    └── config.pbtxt

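Before writing the configuration we need the shapes of the model's inputs and outputs. The ONNX API makes it easy to read the input and output tensors defined when the model was exported (dynamic dimensions are reported as 0). That API was described in the previous section, so it is not repeated here.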
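
For readers who do not have the earlier section at hand, the following is a minimal sketch of that inspection step, assuming the `onnx` package is installed. The helper names `onnx_dims_to_triton` and `describe_onnx_io` are hypothetical; the only assumption about the data is the one stated above, that a dynamic axis shows up with a dimension value of 0, which is written as -1 in a Triton config.

```python
from typing import Dict, List


def onnx_dims_to_triton(dim_values: List[int]) -> List[int]:
    """Map ONNX dim values to Triton dims: a dynamic axis (reported as 0) becomes -1."""
    return [d if d > 0 else -1 for d in dim_values]


def describe_onnx_io(model_path: str) -> Dict[str, Dict[str, List[int]]]:
    """Return {'input': {...}, 'output': {...}}, mapping tensor names to Triton-style dims."""
    import onnx  # imported lazily so the helper above works without onnx installed

    graph = onnx.load(model_path).graph
    result = {}
    for kind, values in (("input", graph.input), ("output", graph.output)):
        result[kind] = {
            v.name: onnx_dims_to_triton(
                [d.dim_value for d in v.type.tensor_type.shape.dim]
            )
            for v in values
        }
    return result


print(onnx_dims_to_triton([0, 0, 384]))  # [-1, -1, 384]
```

Calling `describe_onnx_io("decoder_model.onnx")` on the downloaded file should list the tensor names and dims that go into the config.pbtxt files.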

  • decoder
name: "seal_decoder"
platform: "onnxruntime_onnx"
max_batch_size: 0
input [
  {
    name: "input_ids"
    data_type: TYPE_INT64
    dims: [ -1, -1 ]
  },
  {
    name: "attention_mask"
    data_type: TYPE_INT64
    dims: [ -1, -1 ]
  },
  {
    name: "encoder_hidden_states"
    data_type: TYPE_FP32
    dims: [ -1, -1, 384 ]
  }
]
output [
  {
    name: "logits"
    data_type: TYPE_FP32
    dims: [ -1, -1, 3584 ]
  }
]

instance_group [
  {
    count: 1 # number of instances
    kind: KIND_GPU # instance kind
    gpus: [ 0 ] # when kind is KIND_GPU, the indices of the visible CUDA devices to run the model on
  }
]
  • encoder
name: "seal_encoder"
platform: "onnxruntime_onnx"
max_batch_size: 0
input [
  {
    name: "pixel_values"
    data_type: TYPE_FP32
    dims: [ 1, 3, -1, -1 ]
  }
]
output [
  {
    name: "last_hidden_state"
    data_type: TYPE_FP32
    dims: [ -1, -1, -1 ]
  },
  {
    name: "1533"
    data_type: TYPE_FP32
    dims: [ -1, 384 ]
  }
]

instance_group [
  {
    count: 1 # number of instances
    kind: KIND_GPU # instance kind
    gpus: [ 0 ] # when kind is KIND_GPU, the indices of the visible CUDA devices to run the model on
  }
]

Then simply start the Triton server to complete the model deployment.

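Deploying the Seal Recognition Model with BLS

TrOCR-Seal-Recognition is an end-to-end seal recognition model based on TrOCR, and its repository provides pre-trained ONNX models. We now deploy it with Triton BLS so that a single call returns the final result.


ONNX Model Deployment

The ONNX models are deployed as described in the previous section: write the configuration files and build the model repository structure. The model repository configurations of the two models are as follows.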

  • encoder_model.onnx
name: "seal_encoder"
platform: "onnxruntime_onnx"
max_batch_size: 0
input [
  {
    name: "pixel_values"
    data_type: TYPE_FP32
    dims: [ -1, -1, -1, -1 ]
  }
]
output [
  {
    name: "last_hidden_state"
    data_type: TYPE_FP32
    dims: [ -1, -1, -1 ]
  },
  {
    name: "1533"
    data_type: TYPE_FP32
    dims: [ -1, 384 ]
  }
]

instance_group [
  {
    count: 1 # number of instances
    kind: KIND_GPU # instance kind
    gpus: [ 0 ] # when kind is KIND_GPU, the indices of the visible CUDA devices to run the model on
  }
]
  • decoder_model.onnx
name: "seal_decoder"
platform: "onnxruntime_onnx"
max_batch_size: 0
input [
  {
    name: "input_ids"
    data_type: TYPE_INT64
    dims: [ -1, -1 ]
  },
  {
    name: "attention_mask"
    data_type: TYPE_INT64
    dims: [ -1, -1 ]
  },
  {
    name: "encoder_hidden_states"
    data_type: TYPE_FP32
    dims: [ -1, -1, 384 ]
  }
]
output [
  {
    name: "logits"
    data_type: TYPE_FP32
    dims: [ -1, -1, 3584 ]
  }
]

instance_group [
  {
    count: 1 # number of instances
    kind: KIND_GPU # instance kind
    gpus: [ 0 ] # when kind is KIND_GPU, the indices of the visible CUDA devices to run the model on
  }
]


├── seal_bls
│   ├── 1
│   │   ├── model.py
│   │   └── vocab.json
│   └── config.pbtxt
├── seal_decoder
│   ├── 1
│   │   └── model.onnx
│   └── config.pbtxt
└── seal_encoder
    ├── 1
    │   └── model.onnx
    └── config.pbtxt
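
Before starting the server it can help to check the repository layout mechanically. The sketch below is a hypothetical helper, not part of Triton: `check_model_repository` only encodes the convention shown in the tree above (each model directory has a config.pbtxt and at least one non-empty numeric version directory). It is stricter than Triton itself, which can auto-complete the configuration for some backends.

```python
import tempfile
from pathlib import Path
from typing import List


def check_model_repository(root: Path) -> List[str]:
    """Return a list of problems found; an empty list means the layout looks fine."""
    problems = []
    for model_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        if not (model_dir / "config.pbtxt").is_file():
            problems.append(f"{model_dir.name}: missing config.pbtxt")
        versions = [p for p in model_dir.iterdir() if p.is_dir() and p.name.isdigit()]
        if not versions:
            problems.append(f"{model_dir.name}: no numeric version directory")
        for version in versions:
            if not any(version.iterdir()):
                problems.append(f"{model_dir.name}/{version.name}: empty version directory")
    return problems


# Build a throwaway repository that mirrors the tree above, minus one config file.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    layout = {
        "seal_bls": ["model.py", "vocab.json"],
        "seal_decoder": ["model.onnx"],
        "seal_encoder": ["model.onnx"],
    }
    for name, files in layout.items():
        (repo / name / "1").mkdir(parents=True)
        for file_name in files:
            (repo / name / "1" / file_name).touch()
        if name != "seal_encoder":  # deliberately omit one config.pbtxt
            (repo / name / "config.pbtxt").touch()
    found = check_model_repository(repo)
    print(found)  # ['seal_encoder: missing config.pbtxt']
```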

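Writing the BLS Code

From the definition of BLS we know that it is simply one way of using the Python Backend: we write Python code to orchestrate the models and to implement pre- and post-processing.

The TrOCR-Seal-Recognition project documents in detail how to call the seal recognition model; porting that code into a BLS script is all the BLS development requires.

  • onnx_test.py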
import argparse
import json
import os
import statistics

import cv2
import numpy as np
import onnxruntime
from scipy.special import softmax


def read_vocab(path):
    with open(path, encoding="utf-8") as f:
        vocab = json.load(f)
    return vocab


def do_norm(x):
    # x is a CHW array of 0..255 values; the result lies in [-1, 1].
    mean = [0.5, 0.5, 0.5]
    std = [0.5, 0.5, 0.5]
    x = x / 255.0
    x[0, :, :] -= mean[0]
    x[1, :, :] -= mean[1]
    x[2, :, :] -= mean[2]
    x[0, :, :] /= std[0]
    x[1, :, :] /= std[1]
    x[2, :, :] /= std[2]
    return x


def decode_text(tokens, vocab, vocab_inp):
    """
    decode trocr
    """
    s_start = vocab.get('<s>')
    s_end = vocab.get('</s>')
    unk = vocab.get('<unk>')
    pad = vocab.get('<pad>')
    text = ''
    for tk in tokens:

        if tk == s_end:
            break
        if tk not in [s_end, s_start, pad, unk]:
            text += vocab_inp[tk]

    return text


class OnnxEncoder(object):
    def __init__(self, model_path):
        self.model = onnxruntime.InferenceSession(model_path, providers=onnxruntime.get_available_providers())

    def __call__(self, image):
        onnx_inputs = {self.model.get_inputs()[0].name: np.asarray(image, dtype='float32')}
        onnx_output = self.model.run(None, onnx_inputs)[0]
        return onnx_output


class OnnxDecoder(object):
    def __init__(self, model_path):
        self.model = onnxruntime.InferenceSession(model_path, providers=onnxruntime.get_available_providers())
        self.input_names = {input_key.name: idx for idx, input_key in enumerate(self.model.get_inputs())}

    def __call__(self, input_ids,
                 encoder_hidden_states,
                 attention_mask):
        onnx_inputs = {"input_ids": input_ids,
                       "attention_mask": attention_mask,
                       "encoder_hidden_states": encoder_hidden_states}

        onnx_output = self.model.run(['logits'], onnx_inputs)
        return onnx_output


class OnnxEncoderDecoder(object):
    def __init__(self, model_path):
        # File names follow the downloaded model files mentioned earlier.
        self.encoder = OnnxEncoder(os.path.join(model_path, "encoder_model.onnx"))
        self.decoder = OnnxDecoder(os.path.join(model_path, "decoder_model.onnx"))
        self.vocab = read_vocab(os.path.join(model_path, "vocab.json"))
        self.vocab_inp = {self.vocab[key]: key for key in self.vocab}
        self.threshold = 0.88  # confidence threshold; set high because no negative samples were used in training
        self.max_len = 50  # maximum text length

    def run(self, image):
        # Preprocess: resize, HWC -> CHW, normalize, add a batch dimension.
        image = cv2.resize(image, (384, 384))
        pixel_values = cv2.split(np.array(image))
        pixel_values = do_norm(np.array(pixel_values))
        pixel_values = np.array([pixel_values])
        encoder_output = self.encoder(pixel_values)
        ids = [self.vocab["<s>"], ]
        mask = [1, ]
        scores = []
        # Greedy decoding: feed the tokens so far, take the best next token.
        for i in range(self.max_len):
            input_ids = np.array([ids]).astype('int64')
            attention_mask = np.array([mask]).astype('int64')
            decoder_output = self.decoder(input_ids=input_ids,
                                          encoder_hidden_states=encoder_output,
                                          attention_mask=attention_mask)
            pred = decoder_output[0][0]
            pred = softmax(pred, axis=1)
            max_index = pred.argmax(axis=1)
            if max_index[-1] == self.vocab["</s>"]:
                break
            scores.append(pred[max_index.shape[0] - 1, max_index[-1]])
            ids.append(int(max_index[-1]))
            mask.append(1)
        # if self.threshold < statistics.mean(scores):
        text = decode_text(ids, self.vocab, self.vocab_inp)
        # else:
        #     text = ""
        return text


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='onnx model test')
    parser.add_argument('--model', type=str,
                        help="path to the ONNX model directory")
    parser.add_argument('--test_img', type=str, help="test image")

    args = parser.parse_args()
    model = OnnxEncoderDecoder(args.model)
    img = cv2.imread(args.test_img)
    img = img[..., ::-1]  # BGR to RGB
    res = model.run(img)
    print(res)
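
The heart of the script above is its greedy decoding loop, which is also the part that has to be ported to BLS. The following self-contained sketch isolates that loop in plain Python. It is a simplification under stated assumptions: `toy_step`, `toy_vocab`, and `script` are hypothetical stand-ins for the decoder model and vocab.json, and `step` returns logits for the last position only, whereas the script applies softmax to the logits of every position and then reads the last one.

```python
import math
from typing import Callable, Dict, List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one row of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_decode(step: Callable[[List[int]], List[float]],
                  vocab: Dict[str, int],
                  max_len: int = 50) -> Tuple[List[int], List[float]]:
    """Start from <s>, append the most likely token until </s> or max_len."""
    ids = [vocab["<s>"]]
    scores: List[float] = []
    for _ in range(max_len):
        probs = softmax(step(ids))
        next_id = max(range(len(probs)), key=probs.__getitem__)
        if next_id == vocab["</s>"]:
            break
        scores.append(probs[next_id])
        ids.append(next_id)
    return ids, scores


# Hypothetical toy decoder: always emits "A", then "B", then </s>.
toy_vocab = {"<s>": 0, "</s>": 1, "A": 2, "B": 3}
script = [2, 3, 1]


def toy_step(ids: List[int]) -> List[float]:
    logits = [0.0] * len(toy_vocab)
    logits[script[len(ids) - 1]] = 5.0
    return logits


ids, scores = greedy_decode(toy_step, toy_vocab)
id_to_token = {v: k for k, v in toy_vocab.items()}
text = "".join(id_to_token[i] for i in ids[1:])
print(ids, text)  # [0, 2, 3] AB
```

In the BLS version, `step` corresponds to one `InferenceRequest` against the decoder model per generated token, which is exactly the data-dependent loop an ensemble cannot express.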

  • model.py (the BLS script)
import json
import os

import numpy as np
import triton_python_backend_utils as pb_utils
from scipy.special import softmax
from torch.utils.dlpack import from_dlpack, to_dlpack


class TritonPythonModel:

    def initialize(self, args):
        """`initialize` is called only once when the model is being loaded.
        Implementing `initialize` function is optional. This function allows
        the model to initialize any state associated with this model.

        Parameters
        ----------
        args : dict
          Both keys and values are strings. The dictionary keys and values are:
          * model_config: A JSON string containing the model configuration
          * model_instance_kind: A string containing model instance kind
          * model_instance_device_id: A string containing model instance device ID
          * model_repository: Model repository path
          * model_version: Model version
          * model_name: Model name
        """
        print('Toguide Seal Recognition BLS model initializing...')
        self.model_config = json.loads(args['model_config'])
        # vocab.json sits next to model.py in the version directory.
        cur_path = os.path.abspath(__file__)
        dir_path = os.path.dirname(cur_path)
        self.vocab = self.read_vocab(os.path.join(dir_path, "vocab.json"))
        self.vocab_inp = {self.vocab[key]: key for key in self.vocab}
        self.max_len = 50

    def execute(self, requests):
        print('Toguide Seal Recognition BLS model executing...')

        response = []

        for request in requests:
            encoder_model_name = "seal_encoder"
            decoder_model_name = "seal_decoder"

            # The input is expected to be preprocessed already (see onnx_test.py).
            input = pb_utils.get_input_tensor_by_name(
                request, 'pixel_values')

            encoder_response = self.request_execute([input], encoder_model_name, ["last_hidden_state", "1533"])

            ids = [self.vocab["<s>"], ]
            mask = [1, ]
            scores = []
            # Greedy decoding: one BLS call to the decoder per generated token.
            for i in range(self.max_len):
                input_ids_tensor = pb_utils.Tensor("input_ids", np.array([ids]).astype('int64'))
                attention_mask_tensor = pb_utils.Tensor("attention_mask", np.array([mask]).astype('int64'))
                hidden_state_tensor = pb_utils.get_output_tensor_by_name(encoder_response,
                                                                         "last_hidden_state")
                hidden_state_torch_tensor = self.pb_tensor_transform(hidden_state_tensor)
                # Re-wrap the encoder output under the decoder's input name.
                encoder_hidden_states_tensor = pb_utils.Tensor.from_dlpack("encoder_hidden_states",
                                                                           to_dlpack(hidden_state_torch_tensor))

                decoder_response = self.request_execute(
                    [input_ids_tensor, attention_mask_tensor, encoder_hidden_states_tensor],
                    decoder_model_name, ["logits"])
                logits_torch_tensor = self.pb_tensor_transform(pb_utils.get_output_tensor_by_name(decoder_response,
                                                                                                  "logits"))
                pred = logits_torch_tensor.cpu().numpy()[0]
                pred = softmax(pred, axis=1)
                max_index = pred.argmax(axis=1)
                if max_index[-1] == self.vocab["</s>"]:
                    break
                scores.append(pred[max_index.shape[0] - 1, max_index[-1]])
                ids.append(int(max_index[-1]))
                mask.append(1)
                # print("Decoding single-character scoring:{}".format(scores))
                # print("Average rating decoding:{}".format(statistics.mean(scores)))
            text = self.decode_text(ids)
            utf8_bytes = self.string_to_utf8_bytes(text)
            inference_response = pb_utils.InferenceResponse(
                output_tensors=[pb_utils.Tensor("OUTPUT_STRING", utf8_bytes)])
            response.append(inference_response)

        return response

    def string_to_utf8_bytes(self, s):
        # TYPE_STRING outputs are passed as a NumPy object array of bytes,
        # which matches the TYPE_STRING output declared in config.pbtxt.
        return np.array([s.encode('utf-8')], dtype=np.object_)

    def decode_text(self, tokens):
        """
        decode trocr
        """
        s_start = self.vocab.get('<s>')
        s_end = self.vocab.get('</s>')
        unk = self.vocab.get('<unk>')
        pad = self.vocab.get('<pad>')
        text = ''
        for tk in tokens:

            if tk == s_end:
                break
            if tk not in [s_end, s_start, pad, unk]:
                text += self.vocab_inp[tk]

        return text

    # BLS
    def request_execute(self, frames_tensor, model_name_string, model_output_name):
        # frames_tensor: list of pb_utils.Tensor inputs for the called model

        inference_request = pb_utils.InferenceRequest(
            model_name=model_name_string,
            requested_output_names=model_output_name,
            inputs=frames_tensor)
        inference_response = inference_request.exec()
        if inference_response.has_error():
            raise pb_utils.TritonModelException(inference_response.error().message())

        return inference_response

    def finalize(self):
        """`finalize` is called only once when the model is being unloaded.
        Implementing `finalize` function is optional. This function allows
        the model to release any resources used for the model.
        """
        print('Toguide Seal Recognition BLS model finalizing...')

    def read_vocab(self, path):
        with open(path, encoding="utf-8") as f:
            vocab = json.load(f)
        return vocab

    def pb_tensor_transform(self, pb_tensor):
        # Always go through DLPack so the caller gets a PyTorch tensor whether
        # the Triton tensor lives on the CPU or the GPU; the callers above
        # rely on `.cpu()` and `to_dlpack`, which a NumPy array does not support.
        pytorch_tensor = from_dlpack(pb_tensor.to_dlpack())
        # print(f'bls pb_tensor is from {pytorch_tensor.device}', flush=True)
        return pytorch_tensor

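Then, as with any other model, write the configuration file and place model.py in the version directory.

Note: BLS runs on the Python Backend, so the backend field in the configuration file must be set to python!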


name: "seal_bls"
backend: "python"
max_batch_size: 0
input [
  {
    name: "pixel_values"
    data_type: TYPE_FP32
    dims: [ -1, -1, -1, -1 ]
  }
]
output [
  {
    name: "OUTPUT_STRING"
    data_type: TYPE_STRING
    dims: [ -1 ]
  }
]
instance_group [
  {
    count: 1 # number of instances
    kind: KIND_GPU # instance kind
    gpus: [ 0 ] # when kind is KIND_GPU, the indices of the visible CUDA devices to run the model on
  }
]
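
With this configuration a client can get the recognized text in a single call. The sketch below shows roughly what such a client might look like, assuming the `tritonclient[http]`, `opencv-python`, and `numpy` packages are installed and the server listens on Triton's default HTTP port 8000. `recognize_seal` and `normalize` are hypothetical helper names. Because the BLS script forwards `pixel_values` straight to the encoder, this sketch assumes the client performs the same resize and normalization as onnx_test.py.

```python
def normalize(x, mean=0.5, std=0.5):
    """Same arithmetic as do_norm: scale 0..255 to 0..1, then (x - mean) / std."""
    return (x / 255.0 - mean) / std


def recognize_seal(image_path: str, url: str = "localhost:8000"):
    """Send one image to the seal_bls model and return the raw output array."""
    # Imported lazily: these third-party packages are only needed for a real call.
    import cv2
    import numpy as np
    import tritonclient.http as httpclient

    image = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2RGB)
    image = cv2.resize(image, (384, 384)).astype("float32")
    # HWC -> CHW, normalize to [-1, 1], add a batch axis: shape (1, 3, 384, 384).
    pixel_values = normalize(image.transpose(2, 0, 1))[np.newaxis, ...]
    pixel_values = np.ascontiguousarray(pixel_values, dtype=np.float32)

    client = httpclient.InferenceServerClient(url=url)
    infer_input = httpclient.InferInput("pixel_values", list(pixel_values.shape), "FP32")
    infer_input.set_data_from_numpy(pixel_values)
    result = client.infer(
        "seal_bls",
        inputs=[infer_input],
        outputs=[httpclient.InferRequestedOutput("OUTPUT_STRING")],
    )
    return result.as_numpy("OUTPUT_STRING")


print(normalize(0), normalize(255))  # -1.0 1.0
```

The returned array carries the UTF-8 encoded text; how to turn it back into a `str` depends on how model.py packs the string into the output tensor.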


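Tensor is stored in GPU and cannot be converted to NumPy

The tensor is stored on the GPU and cannot be converted to NumPy directly, and Triton does not offer another interface for reading the data. There is no good solution at the moment, so I use a crude workaround that costs some performance, though not much; a proper fix has to wait until Triton provides such an interface. The workaround is to convert the tensor to a PyTorch tensor through DLPack first and then call its numpy method.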

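from torch.utils.dlpack import from_dlpack


def pb_tensor_to_numpy(pb_tensor):
    if pb_tensor.is_cpu():
        # CPU tensors convert directly.
        return pb_tensor.as_numpy()
    else:
        # GPU tensors: go through DLPack to PyTorch, copy to host, then to NumPy.
        pytorch_tensor = from_dlpack(pb_tensor.to_dlpack())
        return pytorch_tensor.cpu().numpy()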
