Triton's Python Backend
GitHub: triton-inference-server/python_backend — Triton backend that enables pre-processing, post-processing and other logic to be implemented in Python.
Triton's Python backend lets you implement pre-processing, post-processing and other custom logic in Python. The backend embeds a Python runtime into the Triton server so that this logic is executed as part of inference; with it, users can write their own pre- and post-processing in Python to match the needs of their serving pipeline.
Model repository layout:
models
└── add_sub
    ├── 1
    │   └── model.py
    └── config.pbtxt
Per the official definition, we need to create a model.py inside the model's version directory, structured as follows:
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    """Your Python model must use the same class name. Every Python model
    that is created must have "TritonPythonModel" as the class name.
    """

    @staticmethod
    def auto_complete_config(auto_complete_model_config):
        """`auto_complete_config` is called only once when loading the model
        assuming the server was not started with
        `--disable-auto-complete-config`. Implementing this function is
        optional. No implementation of `auto_complete_config` will do nothing.
        This function can be used to set `max_batch_size`, `input` and `output`
        properties of the model using `set_max_batch_size`, `add_input`, and
        `add_output`. These properties will allow Triton to load the model with
        minimal model configuration in absence of a configuration file. This
        function returns the `pb_utils.ModelConfig` object with these
        properties. You can use the `as_dict` function to gain read-only access
        to the `pb_utils.ModelConfig` object. The `pb_utils.ModelConfig` object
        being returned from here will be used as the final configuration for
        the model.

        Note: The Python interpreter used to invoke this function will be
        destroyed upon returning from this function and as a result none of the
        objects created here will be available in the `initialize`, `execute`,
        or `finalize` functions.

        Parameters
        ----------
        auto_complete_model_config : pb_utils.ModelConfig
            An object containing the existing model configuration. You can build
            upon the configuration given by this object when setting the
            properties for this model.

        Returns
        -------
        pb_utils.ModelConfig
            An object containing the auto-completed model configuration
        """
        inputs = [{
            'name': 'INPUT0',
            'data_type': 'TYPE_FP32',
            'dims': [4],
            # this parameter will set `INPUT0` as an optional input
            'optional': True
        }, {
            'name': 'INPUT1',
            'data_type': 'TYPE_FP32',
            'dims': [4]
        }]
        outputs = [{
            'name': 'OUTPUT0',
            'data_type': 'TYPE_FP32',
            'dims': [4]
        }, {
            'name': 'OUTPUT1',
            'data_type': 'TYPE_FP32',
            'dims': [4]
        }]

        # Demonstrate the usage of `as_dict`, `add_input`, `add_output`,
        # `set_max_batch_size`, and `set_dynamic_batching` functions.
        # Store the model configuration as a dictionary.
        config = auto_complete_model_config.as_dict()
        input_names = []
        output_names = []
        for input in config['input']:
            input_names.append(input['name'])
        for output in config['output']:
            output_names.append(output['name'])

        for input in inputs:
            # The name checking here is only for demonstrating the usage of
            # `as_dict` function. `add_input` will check for conflicts and
            # raise errors if an input with the same name already exists in
            # the configuration but has different data_type or dims property.
            if input['name'] not in input_names:
                auto_complete_model_config.add_input(input)
        for output in outputs:
            # The name checking here is only for demonstrating the usage of
            # `as_dict` function. `add_output` will check for conflicts and
            # raise errors if an output with the same name already exists in
            # the configuration but has different data_type or dims property.
            if output['name'] not in output_names:
                auto_complete_model_config.add_output(output)

        auto_complete_model_config.set_max_batch_size(0)

        # To enable a dynamic batcher with default settings, you can use
        # auto_complete_model_config set_dynamic_batching() function. It is
        # commented in this example because the max_batch_size is zero.
        #
        # auto_complete_model_config.set_dynamic_batching()

        return auto_complete_model_config

    def initialize(self, args):
        """`initialize` is called only once when the model is being loaded.
        Implementing `initialize` function is optional. This function allows
        the model to initialize any state associated with this model.

        Parameters
        ----------
        args : dict
            Both keys and values are strings. The dictionary keys and values are:
            * model_config: A JSON string containing the model configuration
            * model_instance_kind: A string containing model instance kind
            * model_instance_device_id: A string containing model instance device
              ID
            * model_repository: Model repository path
            * model_version: Model version
            * model_name: Model name
        """
        print('Initialized...')

    def execute(self, requests):
        """`execute` must be implemented in every Python model. `execute`
        function receives a list of pb_utils.InferenceRequest as the only
        argument. This function is called when an inference is requested
        for this model.

        Parameters
        ----------
        requests : list
            A list of pb_utils.InferenceRequest

        Returns
        -------
        list
            A list of pb_utils.InferenceResponse. The length of this list must
            be the same as `requests`
        """
        responses = []

        # Every Python backend must iterate through list of requests and create
        # an instance of pb_utils.InferenceResponse class for each of them.
        # Reusing the same pb_utils.InferenceResponse object for multiple
        # requests may result in segmentation faults. You should avoid storing
        # any of the input Tensors in the class attributes as they will be
        # overridden in subsequent inference requests. You can make a copy of
        # the underlying NumPy array and store it if it is required.
        for request in requests:
            # Perform inference on the request and append it to responses
            # list...
            pass

        # You must return a list of pb_utils.InferenceResponse. Length
        # of this list must match the length of `requests` list.
        return responses

    def finalize(self):
        """`finalize` is called only once when the model is being unloaded.
        Implementing `finalize` function is optional. This function allows
        the model to perform any necessary clean ups before exit.
        """
        print('Cleaning up...')
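For reference, here is a minimal sketch of what a concrete execute() could look like for the add_sub model above (assuming the INPUT0/INPUT1 and OUTPUT0/OUTPUT1 tensors declared in auto_complete_config, and that the model simply returns their element-wise sum and difference; error handling is omitted):

import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            # Read both inputs as NumPy arrays. Note that INPUT0 was declared
            # optional above, so a production model should handle a None return.
            in0 = pb_utils.get_input_tensor_by_name(request, "INPUT0").as_numpy()
            in1 = pb_utils.get_input_tensor_by_name(request, "INPUT1").as_numpy()
            # Build exactly one InferenceResponse per request.
            out0 = pb_utils.Tensor("OUTPUT0", (in0 + in1).astype(np.float32))
            out1 = pb_utils.Tensor("OUTPUT1", (in0 - in1).astype(np.float32))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out0, out1]))
        return responses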
Creating a custom Python environment
The Python interpreter that ships with Triton's Python backend is a stock Python 3.10. If you need a different Python version you have to rebuild the Python backend yourself; if the version is fine and you only need extra packages, packing and exporting a conda environment is enough.
Taking PyTorch as an example:
export PYTHONNOUSERSITE=True
conda create -n triton python=3.10
conda activate triton
pip install torch torchvision torchaudio
conda install conda-pack
conda-pack  # pack the environment; the tarball is written to the current working directory
Then add the EXECUTION_ENV_PATH parameter to the model configuration:
name: "model_a"
backend: "python"
...
parameters: {
  key: "EXECUTION_ENV_PATH",
  value: {string_value: "/home/iman/miniconda3/envs/python-3-6/python3.6.tar.gz"}
}
A path relative to the model directory can also be used to point at the environment:
name: "model_a"
backend: "python"
...
parameters: {
  key: "EXECUTION_ENV_PATH",
  value: {string_value: "$$TRITON_MODEL_DIRECTORY/python3.6.tar.gz"}
}
Business Logic Scripting (BLS)
Triton's ensemble feature supports many use cases where multiple models are composed into a pipeline (or more generally a DAG, directed acyclic graph). However, there are many other use cases that are not supported because as part of the model pipeline they require loops, conditionals (if-then-else), data-dependent control-flow and other custom logic to be intermixed with model execution. We call this combination of custom logic and model executions Business Logic Scripting (BLS).
BLS should only be used inside the execute function; it is not supported in the initialize or finalize methods. The example below shows how to use this feature:
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    ...

    def execute(self, requests):
        ...
        # Create an InferenceRequest object. `model_name`,
        # `requested_output_names`, and `inputs` are the required arguments and
        # must be provided when constructing an InferenceRequest object. Make
        # sure to replace `inputs` argument with a list of `pb_utils.Tensor`
        # objects.
        inference_request = pb_utils.InferenceRequest(
            model_name='model_name',
            requested_output_names=['REQUESTED_OUTPUT_1', 'REQUESTED_OUTPUT_2'],
            inputs=[<pb_utils.Tensor object>])

        # `pb_utils.InferenceRequest` supports request_id, correlation_id,
        # model version, timeout and preferred_memory in addition to the
        # arguments described above.
        # Note: Starting from the 24.03 release, the `correlation_id` parameter
        # supports both string and unsigned integer values.
        # These arguments are optional. An example containing all the arguments:
        # inference_request = pb_utils.InferenceRequest(model_name='model_name',
        #     requested_output_names=['REQUESTED_OUTPUT_1', 'REQUESTED_OUTPUT_2'],
        #     inputs=[<list of pb_utils.Tensor objects>],
        #     request_id="1", correlation_id=4, model_version=1, flags=0, timeout=5,
        #     preferred_memory=pb_utils.PreferredMemory(
        #         pb_utils.TRITONSERVER_MEMORY_GPU,  # or pb_utils.TRITONSERVER_MEMORY_CPU
        #         0))

        # Execute the inference_request and wait for the response
        inference_response = inference_request.exec()

        # Check if the inference response has an error
        if inference_response.has_error():
            raise pb_utils.TritonModelException(
                inference_response.error().message())
        else:
            # Extract the output tensors from the inference response.
            output1 = pb_utils.get_output_tensor_by_name(
                inference_response, 'REQUESTED_OUTPUT_1')
            output2 = pb_utils.get_output_tensor_by_name(
                inference_response, 'REQUESTED_OUTPUT_2')

            # Decide the next steps for model execution based on the received
            # output tensors. It is possible to use the same output tensors
            # for the final inference response too.
As you can see, the core BLS APIs all live in triton_python_backend_utils: invoking other models, fetching outputs and the other key operations are provided by this utility module. Its commonly used functions are listed below.
triton_python_backend_utils
- get_input_tensor_by_name: get an input tensor from a request by name.
- get_output_tensor_by_name: get an output tensor from a response by name.
- get_input_config_by_name: get the configuration of an input by name.
- get_output_config_by_name: get the configuration of an output by name.
- get_input_names: get the names of all input tensors.
- get_output_names: get the names of all output tensors.
- Tensor: the object representing an input or output tensor.
- InferenceResponse: builds a response carrying output tensors.
- InferenceRequest: builds a request for invoking another model.
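As a small illustration of the config helpers (a sketch only; the output name OUTPUT0 is a placeholder, not something defined in this article), the data type declared for an output in config.pbtxt can be looked up in initialize and converted to a NumPy dtype:

import json

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        model_config = json.loads(args['model_config'])
        # Find the config entry of an output declared in config.pbtxt.
        output_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT0")
        # Convert the Triton type string (e.g. "TYPE_FP32") into a NumPy dtype.
        self.output_dtype = pb_utils.triton_string_to_numpy(output_config['data_type'])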
The Tensor object
The Tensor object represents an input or output tensor in the Triton Python backend and provides a number of methods for creating and manipulating tensors.
Commonly used APIs of the Tensor object:
- Tensor: the tensor object representing an input or output.
- name: get the tensor's name.
- dtype: get the tensor's data type.
- shape: get the tensor's shape.
- as_numpy: convert the tensor to a NumPy array.
- from_numpy: create a tensor from a NumPy array.
- get_byte_size: get the tensor's size in bytes.
- to_dlpack: convert the tensor to a DLPack capsule.
- from_dlpack: create a tensor from a DLPack capsule.
A construction example:
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        self.model_config = args['model_config']

    def execute(self, requests):
        responses = []
        for request in requests:
            input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT_TENSOR")
            output_tensor = pb_utils.Tensor("OUTPUT_TENSOR", input_tensor.as_numpy())
            response = pb_utils.InferenceResponse(output_tensors=[output_tensor])
            responses.append(response)
        return responses

    def finalize(self):
        print("Cleaning up...")
InferenceRequest and InferenceResponse
InferenceRequest and InferenceResponse are the key classes for issuing inference requests and building responses in the Triton Python backend; they provide the methods needed to drive the inference flow.
InferenceRequest
- __init__(self, model_name, model_version, requested_output_names, inputs, outputs): create a new inference request.
  - model_name: model name.
  - model_version: model version.
  - requested_output_names: list of requested output names.
  - inputs: list of input tensors.
  - outputs: list of output tensors.
- model_name(self): return the model name of the request.
- model_version(self): return the model version of the request.
- requested_output_names(self): return the requested output names of the request.
- inputs(self): return the input tensors of the request.
- outputs(self): return the output tensors of the request.
InferenceResponse
- __init__(self, output_tensors, error_message=None): create a new inference response.
  - output_tensors: list of output tensors.
  - error_message: error message, if any.
- output_tensors(self): return the output tensors of the response.
- error_message(self): return the error message of the response.
- has_error(self): check whether the response contains an error.
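When request handling fails, the usual pattern is to return a response that carries an error instead of output tensors. A sketch (note that recent pb_utils versions express this through a pb_utils.TritonError object passed as error, rather than a plain error_message string):

import triton_python_backend_utils as pb_utils

error_response = pb_utils.InferenceResponse(
    output_tensors=[],
    error=pb_utils.TritonError("failed to preprocess the input image"))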
Deploying TrOCR-Seal-Recognition with BLS
GitHub repository: Gmgge/TrOCR-Seal-Recognition — transformer-based OCR extended to official seals (seal recognition).
TrOCR-Seal-Recognition is an end-to-end seal recognition project trained on top of TrOCR-Chinese. The pretrained model provided in the repository already has a basic ability to recognize seals; here we use Triton to build a service interface around it for external callers.
The download contains decoder_model.onnx and encoder_model.onnx. Since the pretrained models are already exported to ONNX, we can integrate them directly through Triton's ONNX backend; the steps are below.
Model deployment
Following the official Triton model-repository documentation, the deployment layout can be set up quickly; the models and their configurations are shown below.
.
├── seal_decoder
│   ├── 1
│   │   └── model.onnx
│   └── config.pbtxt
└── seal_encoder
    ├── 1
    │   └── model.onnx
    └── config.pbtxt
Before writing the configuration we need the models' input and output shapes. The ONNX APIs make it easy to read the input/output tensors defined at training time (for dynamic models the returned sizes are 0). This API was described in the previous section, so it is not repeated here; a minimal sketch follows for convenience.
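Since that section is not reproduced here, the following minimal sketch shows how the input/output metadata can be read with onnxruntime (depending on how the model was exported, dynamic axes may also show up as symbolic names or None):

import onnxruntime

sess = onnxruntime.InferenceSession("decoder_model.onnx",
                                    providers=["CPUExecutionProvider"])
for t in sess.get_inputs():
    print("input ", t.name, t.type, t.shape)
for t in sess.get_outputs():
    print("output", t.name, t.type, t.shape)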
- decoder
name: "seal_decoder"
platform: "onnxruntime_onnx"
max_batch_size: 0
input [
{
name: "input_ids"
data_type: TYPE_INT64
dims: [ -1,-1 ]
},
{
name: "attention_mask"
data_type: TYPE_INT64
dims: [ -1,-1 ]
},
{
name: "encoder_hidden_states"
data_type: TYPE_FP32
dims: [ -1,-1,384 ]
}
]
output [
{
name: "logits"
data_type: TYPE_FP32
dims: [ -1,-1,3584 ]
}
]
instance_group: [
{
count: 1 # 数量
kind: KIND_GPU # 类型
gpus: [ 0 ] # 如果参数项为GPU,则该列表将指定对应序号下的可见CUDA设备来运行模型
}
]
- encoder
name: "seal_encoder"
platform: "onnxruntime_onnx"
max_batch_size: 0
input [
{
name: "pixel_values"
data_type: TYPE_FP32
dims: [ 1,3,-1,-1 ]
}
]
output [
{
name: "last_hidden_state"
data_type: TYPE_FP32
dims: [ -1, -1, -1 ]
},
{
name: "1533",
data_type: TYPE_FP32
dims: [ -1, 384 ]
}
]
instance_group: [
{
count: 1 # 数量
kind: KIND_GPU # 类型
gpus: [ 0 ] # 如果参数项为GPU,则该列表将指定对应序号下的可见CUDA设备来运行模型
}
]
Then start the Triton server and the deployment is complete.
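To check that both models actually loaded, the readiness endpoints can be queried, for example with the tritonclient package (the URL assumes a local server on the default HTTP port):

import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
print("server ready:", client.is_server_ready())
print("seal_encoder ready:", client.is_model_ready("seal_encoder"))
print("seal_decoder ready:", client.is_model_ready("seal_decoder"))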
Deploying the seal-recognition model with BLS
GitHub: Gmgge/TrOCR-Seal-Recognition — transformer-based OCR extended to official seals (seal recognition).
TrOCR-Seal-Recognition is an end-to-end seal recognition model based on TrOCR; the repository provides pretrained ONNX models. We now deploy it with Triton's BLS so that a single call returns the final result.
The official download contains two models: encoder_model.onnx and decoder_model.onnx.
Deploying the ONNX models
The ONNX models are deployed as described in the previous section: write the configuration files and build the model-repository structure. The configurations of the two models are below.
- encoder_model.onnx
name: "seal_encoder"
platform: "onnxruntime_onnx"
max_batch_size: 0
input [
{
name: "pixel_values"
data_type: TYPE_FP32
dims: [ -1,-1,-1,-1 ]
}
]
output [
{
name: "last_hidden_state"
data_type: TYPE_FP32
dims: [ -1, -1, -1 ]
},
{
name: "1533",
data_type: TYPE_FP32
dims: [ -1, 384 ]
}
]
instance_group: [
{
count: 1 # 数量
kind: KIND_GPU # 类型
gpus: [ 0 ] # 如果参数项为GPU,则该列表将指定对应序号下的可见CUDA设备来运行模型
}
]
- decoder_model.onnx
name: "seal_decoder"
platform: "onnxruntime_onnx"
max_batch_size: 0
input [
{
name: "input_ids"
data_type: TYPE_INT64
dims: [ -1,-1 ]
},
{
name: "attention_mask"
data_type: TYPE_INT64
dims: [ -1,-1 ]
},
{
name: "encoder_hidden_states"
data_type: TYPE_FP32
dims: [ -1,-1,384 ]
}
]
output [
{
name: "logits"
data_type: TYPE_FP32
dims: [ -1,-1,3584 ]
}
]
instance_group: [
{
count: 1 # 数量
kind: KIND_GPU # 类型
gpus: [ 0 ] # 如果参数项为GPU,则该列表将指定对应序号下的可见CUDA设备来运行模型
}
]
The model repository now has the following structure:
.
├── seal_bls
│   ├── 1
│   │   ├── model.py
│   │   └── vocab.json
│   └── config.pbtxt
├── seal_decoder
│   ├── 1
│   │   └── model.onnx
│   └── config.pbtxt
└── seal_encoder
    ├── 1
    │   └── model.onnx
    └── config.pbtxt
Writing the BLS code
By its definition, BLS is just a way of using the Python backend: we need to write Python code that orchestrates the models and implements the pre- and post-processing.
The TrOCR-Seal-Recognition project documents how to call the seal-recognition models step by step; porting that code into a BLS script is essentially all the BLS development required.
- onnx_test.py
import argparse
import json
import os
import statistics

import cv2
import numpy as np
import onnxruntime
from scipy.special import softmax


def read_vocab(path):
    """
    Load the vocabulary.
    """
    with open(path, encoding="utf-8") as f:
        vocab = json.load(f)
    return vocab


def do_norm(x):
    mean = [0.5, 0.5, 0.5]
    std = [0.5, 0.5, 0.5]
    x = x / 255.0
    x[0, :, :] -= mean[0]
    x[1, :, :] -= mean[1]
    x[2, :, :] -= mean[2]
    x[0, :, :] /= std[0]
    x[1, :, :] /= std[1]
    x[2, :, :] /= std[2]
    return x


def decode_text(tokens, vocab, vocab_inp):
    """
    decode trocr
    """
    s_start = vocab.get('<s>')
    s_end = vocab.get('</s>')
    unk = vocab.get('<unk>')
    pad = vocab.get('<pad>')
    text = ''
    for tk in tokens:
        if tk == s_end:
            break
        if tk not in [s_end, s_start, pad, unk]:
            text += vocab_inp[tk]
    return text


class OnnxEncoder(object):
    def __init__(self, model_path):
        self.model = onnxruntime.InferenceSession(model_path, providers=onnxruntime.get_available_providers())

    def __call__(self, image):
        onnx_inputs = {self.model.get_inputs()[0].name: np.asarray(image, dtype='float32')}
        onnx_output = self.model.run(None, onnx_inputs)[0]
        return onnx_output


class OnnxDecoder(object):
    def __init__(self, model_path):
        self.model = onnxruntime.InferenceSession(model_path, providers=onnxruntime.get_available_providers())
        self.input_names = {input_key.name: idx for idx, input_key in enumerate(self.model.get_inputs())}

    def __call__(self, input_ids, encoder_hidden_states, attention_mask):
        onnx_inputs = {"input_ids": input_ids,
                       "attention_mask": attention_mask,
                       "encoder_hidden_states": encoder_hidden_states}
        onnx_output = self.model.run(['logits'], onnx_inputs)
        return onnx_output


class OnnxEncoderDecoder(object):
    def __init__(self, model_path):
        self.encoder = OnnxEncoder(os.path.join(model_path, "encoder_model.onnx"))
        self.decoder = OnnxDecoder(os.path.join(model_path, "decoder_model.onnx"))
        self.vocab = read_vocab(os.path.join(model_path, "vocab.json"))
        self.vocab_inp = {self.vocab[key]: key for key in self.vocab}
        self.threshold = 0.88  # confidence threshold; set high because no negative-sample training was done
        self.max_len = 50      # maximum text length

    def run(self, image):
        """
        rgb: image
        """
        image = cv2.resize(image, (384, 384))
        pixel_values = cv2.split(np.array(image))
        pixel_values = do_norm(np.array(pixel_values))
        pixel_values = np.array([pixel_values])
        encoder_output = self.encoder(pixel_values)
        ids = [self.vocab["<s>"], ]
        mask = [1, ]
        scores = []
        for i in range(self.max_len):
            input_ids = np.array([ids]).astype('int64')
            attention_mask = np.array([mask]).astype('int64')
            decoder_output = self.decoder(input_ids=input_ids,
                                          encoder_hidden_states=encoder_output,
                                          attention_mask=attention_mask)
            pred = decoder_output[0][0]
            pred = softmax(pred, axis=1)
            max_index = pred.argmax(axis=1)
            if max_index[-1] == self.vocab["</s>"]:
                break
            scores.append(pred[max_index.shape[0] - 1, max_index[-1]])
            ids.append(max_index[-1])
            mask.append(1)
        print("Per-token decoding scores: {}".format(scores))
        print("Average decoding score: {}".format(statistics.mean(scores)))
        # if self.threshold < statistics.mean(scores):
        text = decode_text(ids, self.vocab, self.vocab_inp)
        # else:
        #     text = ""
        return text


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='onnx model test')
    parser.add_argument('--model', type=str, help="path to the directory containing the onnx models")
    parser.add_argument('--test_img', type=str, help="path to a test image")
    args = parser.parse_args()
    model = OnnxEncoderDecoder(args.model)
    img = cv2.imread(args.test_img)
    img = img[..., ::-1]  # BGR to RGB
    res = model.run(img)
    print(res)
Ported into the Python backend, the same logic becomes the BLS script, i.e. the model.py placed under seal_bls/1:
- model.py
import json
import os

import numpy as np
import triton_python_backend_utils as pb_utils
from scipy.special import softmax
from torch.utils.dlpack import from_dlpack, to_dlpack


class TritonPythonModel:
    def initialize(self, args):
        """`initialize` is called only once when the model is being loaded.
        Implementing `initialize` function is optional. This function allows
        the model to initialize any state associated with this model.

        Parameters
        ----------
        args : dict
            Both keys and values are strings. The dictionary keys and values are:
            * model_config: A JSON string containing the model configuration
            * model_instance_kind: A string containing model instance kind
            * model_instance_device_id: A string containing model instance device ID
            * model_repository: Model repository path
            * model_version: Model version
            * model_name: Model name
        """
        print('Toguide Seal Recognition BLS model initializing...')
        self.model_config = json.loads(args['model_config'])
        cur_path = os.path.abspath(__file__)
        dir_path = os.path.dirname(cur_path)
        self.vocab = self.read_vocab(os.path.join(dir_path, "vocab.json"))
        self.vocab_inp = {self.vocab[key]: key for key in self.vocab}
        self.max_len = 50

    def execute(self, requests):
        print('Toguide Seal Recognition BLS model executing...')
        response = []
        for request in requests:
            encoder_model_name = "seal_encoder"
            decoder_model_name = "seal_decoder"
            input = pb_utils.get_input_tensor_by_name(request, 'pixel_values')
            encoder_response = self.request_execute([input], encoder_model_name,
                                                    ["last_hidden_state", "1533"])
            ids = [self.vocab["<s>"], ]
            mask = [1, ]
            scores = []
            for i in range(self.max_len):
                input_ids_tensor = pb_utils.Tensor("input_ids", np.array([ids]).astype('int64'))
                attention_mask_tensor = pb_utils.Tensor("attention_mask", np.array([mask]).astype('int64'))
                hidden_state_tensor = pb_utils.get_output_tensor_by_name(encoder_response,
                                                                         "last_hidden_state")
                # The encoder output may live on the GPU, so move it through DLPack
                # instead of as_numpy (see "Problems encountered" below).
                hidden_state_torch_tensor = self.pb_tensor_transform(hidden_state_tensor)
                encoder_hidden_states_tensor = pb_utils.Tensor.from_dlpack(
                    "encoder_hidden_states", to_dlpack(hidden_state_torch_tensor))
                decoder_response = self.request_execute(
                    [input_ids_tensor, attention_mask_tensor, encoder_hidden_states_tensor],
                    decoder_model_name, ["logits"])
                logits_torch_tensor = self.pb_tensor_transform(
                    pb_utils.get_output_tensor_by_name(decoder_response, "logits"))
                pred = logits_torch_tensor.cpu().numpy()[0]
                pred = softmax(pred, axis=1)
                max_index = pred.argmax(axis=1)
                if max_index[-1] == self.vocab["</s>"]:
                    break
                scores.append(pred[max_index.shape[0] - 1, max_index[-1]])
                ids.append(max_index[-1])
                mask.append(1)
            # print("Per-token decoding scores: {}".format(scores))
            # print("Average decoding score: {}".format(statistics.mean(scores)))
            text = self.decode_text(ids)
            utf8_bytes = self.string_to_utf8_bytes(text)
            inference_response = pb_utils.InferenceResponse(
                output_tensors=[pb_utils.Tensor("OUTPUT_STRING", utf8_bytes)])
            response.append(inference_response)
        return response

    def string_to_utf8_bytes(self, s):
        return np.frombuffer(s.encode('utf-8'), dtype=np.uint8)

    def decode_text(self, tokens):
        """
        decode trocr
        """
        s_start = self.vocab.get('<s>')
        s_end = self.vocab.get('</s>')
        unk = self.vocab.get('<unk>')
        pad = self.vocab.get('<pad>')
        text = ''
        for tk in tokens:
            if tk == s_end:
                break
            if tk not in [s_end, s_start, pad, unk]:
                text += self.vocab_inp[tk]
        return text

    # BLS: call another model that is loaded on the same server
    def request_execute(self, frames_tensor, model_name_string, model_output_name):
        # frames_tensor: a list of pb_utils.Tensor inputs
        inference_request = pb_utils.InferenceRequest(
            model_name=model_name_string,
            requested_output_names=model_output_name,
            inputs=frames_tensor)
        inference_response = inference_request.exec()
        if inference_response.has_error():
            raise pb_utils.TritonModelException(inference_response.error().message())
        return inference_response

    def finalize(self):
        """`finalize` is called only once when the model is being unloaded.
        Implementing `finalize` function is optional. This function allows
        the model to release any resources used for the model.
        """
        print('Toguide Seal Recognition BLS model finalizing...')

    def read_vocab(self, path):
        """
        Load the vocabulary.
        """
        with open(path, encoding="utf-8") as f:
            vocab = json.load(f)
        return vocab

    def pb_tensor_transform(self, pb_tensor):
        if pb_tensor.is_cpu():
            return pb_tensor.as_numpy()
        else:
            pytorch_tensor = from_dlpack(pb_tensor.to_dlpack())
            return pytorch_tensor
            # return pytorch_tensor.cpu().numpy()
Then, just as with any other model, write the configuration file and place model.py in the version directory.
Note: BLS runs on the Python backend, so the backend field in the configuration must be set to python!
config.pbtxt
name: "seal_bls"
backend: "python"
max_batch_size: 0
input [
{
name: "pixel_values"
data_type: TYPE_FP32
dims: [ -1,-1,-1,-1 ]
}
]
output [
{
name: "OUTPUT_STRING"
data_type: TYPE_STRING
dims: [ -1 ]
}
]
instance_group: [
{
count: 1 # 数量
kind: KIND_GPU # 类型
gpus: [ 0 ] # 如果参数项为GPU,则该列表将指定对应序号下的可见CUDA设备来运行模型
}
]
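Once the server is running, the BLS model is called like any other model. Below is a sketch of a client (assumptions: the tritonclient package, a local server on the default HTTP port, a hypothetical seal.jpg test image, and the same preprocessing as onnx_test.py):

import cv2
import numpy as np
import tritonclient.http as httpclient


def preprocess(img_bgr):
    # Same preprocessing as onnx_test.py: BGR->RGB, resize to 384x384,
    # HWC->CHW, scale to [0, 1] and normalize with mean/std 0.5.
    img = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)
    img = cv2.resize(img, (384, 384)).astype(np.float32)
    img = img.transpose(2, 0, 1) / 255.0
    img = (img - 0.5) / 0.5
    return img[np.newaxis, ...].astype(np.float32)  # shape (1, 3, 384, 384)


client = httpclient.InferenceServerClient(url="localhost:8000")
pixel_values = preprocess(cv2.imread("seal.jpg"))

inp = httpclient.InferInput("pixel_values", list(pixel_values.shape), "FP32")
inp.set_data_from_numpy(pixel_values)
out = httpclient.InferRequestedOutput("OUTPUT_STRING")

result = client.infer("seal_bls", inputs=[inp], outputs=[out])
raw = result.as_numpy("OUTPUT_STRING")
# The BLS model emits the text as raw UTF-8 bytes (see string_to_utf8_bytes);
# if Triton returns them as a BYTES/object array instead, use raw[0].decode().
print(raw.tobytes().decode("utf-8"))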
Problems encountered
Tensor is stored in GPU and cannot be converted to NumPy
The tensor lives on the GPU and cannot be converted to NumPy directly, and Triton does not provide another interface for reading the data. There is no good solution at the moment, so I use a somewhat clumsy workaround; it costs a little performance, though not much, and a proper interface will have to come from Triton. The idea is to first convert the tensor to a PyTorch tensor via DLPack and then call its numpy method.
from torch.utils.dlpack import from_dlpack


def pb_tensor_to_numpy(pb_tensor):
    # CPU tensors can be read directly; GPU tensors go through DLPack -> PyTorch -> CPU.
    if pb_tensor.is_cpu():
        return pb_tensor.as_numpy()
    else:
        pytorch_tensor = from_dlpack(pb_tensor.to_dlpack())
        return pytorch_tensor.cpu().numpy()