The buzz around “edge AI”, a term that means something slightly different to almost
everyone you talk to, has long since reached a fever pitch. Regardless of what
edge AI means to you, the one commonality is typically that the hardware on
which inference is performed is constrained in one or more dimensions, whether
compute, memory, or network bandwidth. Perhaps the most constrained of these
platforms are microcontrollers.
I have found that, while there is much discourse around “running AI” (i.e.
performing inference) on microcontrollers, there is a general lack of
information about what these systems are actually capable of, and how new
hardware advancements impact that equation. It is my hope with this series to
peel back some of the layers of terminology and explore what actually happens
between supplying inputs to a model and receiving outputs. Along the way, we’ll
ground our exploration in performing inference with real models on real
constrained hardware.
While “weights” get the majority of the attention with AI models, they alone are
not sufficient for performing inference. Depending on how a model is distributed
and what runtime is used, additional data or metadata may be supplied alongside
the model, or may be defined explicitly in software that interacts with the
weights. The most popular runtime for microcontrollers is TensorFlow Lite for
Microcontrollers
(tflite-micro),
which is an optimized version of TensorFlow
Lite.
Note: Google recently rebranded TensorFlow Lite to
LiteRT,
and tflite-micro to LiteRT for Microcontrollers.
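To make the role of the runtime concrete, the sketch below shows roughly what “software that interacts with the weights” looks like with tflite-micro. It is a minimal, hypothetical example: g_model stands in for the model’s bytes (typically compiled into the firmware as a C array), the arena size is arbitrary, and error handling is omitted.

#include <cstddef>
#include <cstdint>

#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"
#include "tensorflow/lite/schema/schema_generated.h"

// Hypothetical: the .tflite file contents, compiled into the binary.
extern const unsigned char g_model[];

// Scratch memory for tensors and intermediate buffers; size is arbitrary here.
constexpr size_t kArenaSize = 16 * 1024;
alignas(16) static uint8_t tensor_arena[kArenaSize];

void RunInferenceOnce() {
  // Map the FlatBuffer in place; the weights are not parsed or copied.
  const tflite::Model* model = tflite::GetModel(g_model);

  // Register kernel implementations for the operators the graph uses.
  static tflite::MicroMutableOpResolver<1> op_resolver;
  op_resolver.AddAdd();

  static tflite::MicroInterpreter interpreter(model, op_resolver, tensor_arena,
                                              kArenaSize);
  interpreter.AllocateTensors();

  // Fill the input tensor (assuming an int8-quantized model).
  TfLiteTensor* input = interpreter.input(0);
  input->data.int8[0] = 42;

  interpreter.Invoke();

  // Read the result.
  TfLiteTensor* output = interpreter.output(0);
  (void)output->data.int8[0];
}

The op resolver is the piece that ties the operators referenced by the model to concrete kernel implementations, which is exactly where the hardware-specific variation we’ll look at below comes into play.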
tflite-micro uses the .tflite file format, which encodes data using
FlatBuffers. Unlike some other model file formats,
.tflite files include not only the tensors that encapsulate model weights, but
also the computation graph, which tells the runtime what operations to perform
during inference. For that to work, there needs to be a defined set of
operators. This is somewhat analogous to the instructions defined in an
instruction set architecture
(ISA) for a
processor. With an ISA, a compiler takes a higher-level programming language
and maps its behavior onto the instructions available in the ISA. TensorFlow
supports an extensive set of built-in
operators, while TensorFlow Lite, and
thus tflite-micro, supports only a
subset.
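As a rough, host-side illustration (something you would run on a development machine, not on the microcontroller), the FlatBuffers-generated schema header that ships with TensorFlow can be used to list which operators a given .tflite file references. The file handling here is hypothetical and validation is omitted.

#include <cstdio>
#include <fstream>
#include <iterator>
#include <vector>

#include "tensorflow/lite/schema/schema_generated.h"

int main(int argc, char** argv) {
  if (argc < 2) {
    std::fprintf(stderr, "usage: %s <model.tflite>\n", argv[0]);
    return 1;
  }

  // Read the entire .tflite FlatBuffer into memory.
  std::ifstream file(argv[1], std::ios::binary);
  std::vector<char> buffer((std::istreambuf_iterator<char>(file)),
                           std::istreambuf_iterator<char>());

  // The FlatBuffer is accessed in place; nothing is unpacked or copied.
  const tflite::Model* model = tflite::GetModel(buffer.data());

  // Each operator code entry names a built-in (or custom) operator that the
  // model's computation graph refers to.
  for (const tflite::OperatorCode* op_code : *model->operator_codes()) {
    std::printf("%s\n",
                tflite::EnumNameBuiltinOperator(op_code->builtin_code()));
  }
  return 0;
}

Custom operators, which we’ll encounter later with the ETHOSU operator, show up here as CUSTOM, with their name stored separately in the operator code entry.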
Continuing the analogy, many processors implement specific versions of the Arm
architecture, but that
doesn’t mean that processors implementing the same ISA are equivalent. Every
supported instruction has to be implemented in hardware, and decisions about
how the processor is designed can impact performance along multiple
dimensions. Similarly, while TensorFlow Lite defines a set of operators, the
implementations of those operators, which are referred to as kernels, may vary.
Kernels are implemented in software, but depending on the underlying hardware, a
kernel might take many instructions to execute, or it may be optimized to
leverage dedicated hardware support.
A simple example is the addition operator
(TFL::AddOp). We’ll
cover how operators and kernels are registered and invoked in a future post, but
let’s start by taking a look at the default tflite-micro addition operator
logic.
tensorflow/lite/micro/kernels/add.cc
TfLiteStatus AddEval(TfLiteContext* context, TfLiteNode* node) {
auto* params = reinterpret_cast<TfLiteAddParams*>(node->builtin_data);
TFLITE_DCHECK(node->user_data != nullptr);
const OpDataAdd* data = static_cast<const OpDataAdd*>(node->user_data);
const TfLiteEvalTensor* input1 =
tflite::micro::GetEvalInput(context, node, kAddInputTensor1);
const TfLiteEvalTensor* input2 =
tflite::micro::GetEvalInput(context, node, kAddInputTensor2);
TfLiteEvalTensor* output =
tflite::micro::GetEvalOutput(context, node, kAddOutputTensor);
if (output->type == kTfLiteFloat32 || output->type == kTfLiteInt32) {
TF_LITE_ENSURE_OK(
context, EvalAdd(context, node, params, data, input1, input2, output));
} else if (output->type == kTfLiteInt8 || output->type == kTfLiteInt16) {
TF_LITE_ENSURE_OK(context, EvalAddQuantized(context, node, params, data,
input1, input2, output));
} else {
MicroPrintf("Type %s (%d) not supported.", TfLiteTypeGetName(output->type),
output->type);
return kTfLiteError;
}
return kTfLiteOk;
}
TFLMRegistration Register_ADD() {
return tflite::micro::RegisterOp(AddInit, AddPrepare, AddEval);
}
As can be observed in AddEval(), the expected output type determines which
implementation of the operator is invoked. To illustrate how the underlying
hardware impacts performance, let’s focus on the case in which we expect
kTfLiteInt8 (signed 8-bit integer) or kTfLiteInt16 (signed 16-bit integer)
output, meaning that we’ll call EvalAddQuantized().
tensorflow/lite/micro/kernels/add.cc
TfLiteStatus EvalAddQuantized(TfLiteContext* context, TfLiteNode* node,
TfLiteAddParams* params, const OpDataAdd* data,
const TfLiteEvalTensor* input1,
const TfLiteEvalTensor* input2,
TfLiteEvalTensor* output) {
tflite::ArithmeticParams op_params = {};
op_params.left_shift = data->left_shift;
op_params.input1_offset = data->input1_offset;
op_params.input1_multiplier = data->input1_multiplier;
op_params.input1_shift = data->input1_shift;
op_params.input2_offset = data->input2_offset;
op_params.input2_multiplier = data->input2_multiplier;
op_params.input2_shift = data->input2_shift;
op_params.output_offset = data->output_offset;
op_params.output_multiplier = data->output_multiplier;
op_params.output_shift = data->output_shift;
SetActivationParams(data->output_activation_min, data->output_activation_max,
&op_params);
bool need_broadcast = reference_ops::ProcessBroadcastShapes(
tflite::micro::GetTensorShape(input1),
tflite::micro::GetTensorShape(input2), &op_params);
switch (output->type) {
case kTfLiteInt8: {
if (need_broadcast) {
reference_integer_ops::BroadcastAdd4DSlow(
op_params, tflite::micro::GetTensorShape(input1),
tflite::micro::GetTensorData<int8_t>(input1),
tflite::micro::GetTensorShape(input2),
tflite::micro::GetTensorData<int8_t>(input2),
tflite::micro::GetTensorShape(output),
tflite::micro::GetTensorData<int8_t>(output));
} else {
reference_integer_ops::Add(
op_params, tflite::micro::GetTensorShape(input1),
tflite::micro::GetTensorData<int8_t>(input1),
tflite::micro::GetTensorShape(input2),
tflite::micro::GetTensorData<int8_t>(input2),
tflite::micro::GetTensorShape(output),
tflite::micro::GetTensorData<int8_t>(output));
}
break;
}
case kTfLiteInt16: {
if (need_broadcast) {
reference_ops::BroadcastAdd4DSlow(
op_params, tflite::micro::GetTensorShape(input1),
tflite::micro::GetTensorData<int16_t>(input1),
tflite::micro::GetTensorShape(input2),
tflite::micro::GetTensorData<int16_t>(input2),
tflite::micro::GetTensorShape(output),
tflite::micro::GetTensorData<int16_t>(output));
} else {
reference_ops::Add(op_params, tflite::micro::GetTensorShape(input1),
tflite::micro::GetTensorData<int16_t>(input1),
tflite::micro::GetTensorShape(input2),
tflite::micro::GetTensorData<int16_t>(input2),
tflite::micro::GetTensorShape(output),
tflite::micro::GetTensorData<int16_t>(output),
false);
}
break;
}
default:
MicroPrintf("Type %s (%d) not supported.",
TfLiteTypeGetName(output->type), output->type);
return kTfLiteError;
}
return kTfLiteOk;
}
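All of those offsets, multipliers, and shifts exist because quantized tensors only approximate real values: each int8 or int16 element q represents roughly scale * (q - zero_point), and the two inputs and the output may each use different scales and zero points. Conceptually, and ignoring the fixed-point multiplier/shift machinery the real kernels use to avoid floating point, adding two quantized elements looks something like the following sketch (the names are illustrative, not from tflite-micro).

#include <algorithm>
#include <cmath>
#include <cstdint>

// Conceptual (floating-point) view of quantized addition. The real kernels
// express the scale ratios as integer multiplier/shift pairs instead.
int8_t QuantizedAddSketch(int8_t q1, float scale1, int32_t zero_point1,
                          int8_t q2, float scale2, int32_t zero_point2,
                          float out_scale, int32_t out_zero_point) {
  // Recover (approximate) real values, add them, then re-quantize.
  const float real_sum =
      scale1 * (q1 - zero_point1) + scale2 * (q2 - zero_point2);
  int32_t q_out =
      static_cast<int32_t>(std::lround(real_sum / out_scale)) + out_zero_point;

  // Clamp to the representable int8 range (the activation min/max in the real
  // kernels also fold in any fused activation function).
  q_out = std::max<int32_t>(-128, std::min<int32_t>(127, q_out));
  return static_cast<int8_t>(q_out);
}

The left_shift and the per-tensor multipliers and shifts in op_params are how the reference kernel performs this same rescaling using only integer arithmetic.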
For kTfLiteInt8 output when broadcast is not required, we make a call to
reference_integer_ops::Add().
“Broadcasting” is the process of expanding arrays so that they have compatible
shapes for arithmetic operations. For example, matrix
addition requires that the two
input matrices have the same dimensions.
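As a toy illustration of broadcasting (not the tflite-micro implementation), adding a 1x3 row vector to a 2x3 matrix works by reusing the row for every row of the matrix, as if the vector had been stretched to shape 2x3.

#include <array>
#include <cstdio>

int main() {
  const std::array<std::array<int, 3>, 2> matrix = {{{1, 2, 3}, {4, 5, 6}}};
  const std::array<int, 3> row = {10, 20, 30};

  for (int i = 0; i < 2; ++i) {
    for (int j = 0; j < 3; ++j) {
      // The same row element is reused for every value of i.
      std::printf("%d ", matrix[i][j] + row[j]);
    }
    std::printf("\n");
  }
  return 0;
}

BroadcastAdd4DSlow() generalizes this idea to the four-dimensional shapes the runtime works with. With that aside out of the way, here is the reference kernel that handles the non-broadcast case.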
tensorflow/lite/kernels/internal/reference/add.h
template <typename T>
inline void Add(const ArithmeticParams& params,
const RuntimeShape& input1_shape, const T* input1_data,
const RuntimeShape& input2_shape, const T* input2_data,
const RuntimeShape& output_shape, T* output_data) {
T activation_min, activation_max;
GetActivationParams(params, &activation_min, &activation_max);
const int flat_size =
MatchingElementsSize(input1_shape, input2_shape, output_shape);
for (int i = 0; i < flat_size; ++i) {
output_data[i] = ActivationFunctionWithMinMax(
input1_data[i] + input2_data[i], activation_min, activation_max);
}
}
As you might expect, this implementation effectively boils down to iterating
through the two input tensors and computing input1_data[i] + input2_data[i].
This can be thought of as a lowest common denominator implementation in that it
doesn’t leverage any hardware-specific functionality; any processor can perform
sequential addition. However, as evidenced by the effectively unlimited demand
in the Graphics Processing Unit
(GPU) market, there are
significant performance gains to be had by parallelizing operations in hardware.
Fortunately, many of the operations that are necessary for performing inference
are “embarrassingly
parallel”. For example,
rather than iterating through tensors to perform sequential addition, which may
take many processor cycles, we could point a processor to the two inputs and, if
supported by the hardware, the entire matrix addition operation could be
completed in a “single” cycle.
It is unlikely that the operation would literally take one cycle given the
complexity of modern processors, but the point is that the runtime of the
entire operation could be reduced to the same order of magnitude of cycles as
one step of the sequential addition implementation.
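To make that idea concrete before looking at real hardware-specific code, here is a purely conceptual sketch (standard C++, no special instructions) of the structure that optimized kernels tend to have: a main loop that handles several elements per iteration, followed by a scalar loop that mops up the remainder. In an accelerated kernel, each iteration of the main loop maps onto one or a few wide instructions rather than four separate additions.

#include <cstdint>

// Conceptual only: overflow and quantization are ignored. The point is the
// shape of the loop, not the arithmetic.
void AddInt8Lanes(const int8_t* a, const int8_t* b, int8_t* out,
                  int32_t size) {
  int32_t i = 0;

  // Main loop: pretend each iteration is a single 4-lane operation.
  for (; i + 4 <= size; i += 4) {
    out[i + 0] = a[i + 0] + b[i + 0];
    out[i + 1] = a[i + 1] + b[i + 1];
    out[i + 2] = a[i + 2] + b[i + 2];
    out[i + 3] = a[i + 3] + b[i + 3];
  }

  // Tail loop: handle the remaining 0-3 elements one at a time.
  for (; i < size; ++i) {
    out[i] = a[i] + b[i];
  }
}

A similar structure appears in the DSP path of the CMSIS-NN kernel we’ll look at below (four outputs per iteration, then a scalar loop for the remainder), while the MVE path instead uses predication to handle the tail inside the vector loop.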
Obviously microcontrollers don’t have massive GPUs like those installed in cloud
provider datacenters. However, many do implement architecture extensions that
enable these common operations to be accelerated. Because TensorFlow Lite
allows for different kernel implementations, this hardware acceleration can be
leveraged where it is supported.
Many microcontrollers implement Arm
Cortex-M cores. For example, chips
like the Raspberry Pi RP2350 and
Nordic Semiconductor nRF54H20
implement multiple Arm
Cortex-M33 cores. The
former implements the Armv8-M Digital Signal Processing (DSP) Extension, which
adds support for Single Instruction Multiple Data
(SIMD)
instructions. More capable chips, like the Alif Ensemble
E3, implement
Cortex-M55 cores, which
support the Armv8.1-M M-Profile Vector Extension (MVE), also referred to as Arm
Helium. The E3 also includes dedicated
accelerators in the form of Arm’s Ethos-U Neural Processing Units
(NPUs).
Arm provides software that allows hardware supporting one or more of these
extensions to accelerate TensorFlow Lite kernel implementations. For
example, the CMSIS-NN library offers
kernel implementations that do not leverage any optimization (i.e. pure
C),
that leverage just the DSP
extension,
or that leverage the MVE
extension
(which requires the DSP extension to also be implemented). tflite-micro has
kernel “ports” that integrate the CMSIS-NN functionality. Let’s take a look at
how the add operation differs when using the CMSIS-NN kernels.
The
setup
looks largely the same as the first add operation kernel we observed. However,
when we reach EvalAddQuantizedInt8(), we can start to see where hardware
acceleration is leveraged.
tensorflow/lite/micro/kernels/cmsis_nn/add.cc
TfLiteStatus EvalAddQuantizedInt8(TfLiteContext* context, TfLiteNode* node,
TfLiteAddParams* params, const OpData* data,
const TfLiteEvalTensor* input1,
const TfLiteEvalTensor* input2,
TfLiteEvalTensor* output) {
tflite::ArithmeticParams op_params;
UpdateOpParams(&op_params, data);
bool need_broadcast = reference_ops::ProcessBroadcastShapes(
tflite::micro::GetTensorShape(input1),
tflite::micro::GetTensorShape(input2), &op_params);
if (need_broadcast) {
reference_integer_ops::BroadcastAdd4DSlow(
op_params, tflite::micro::GetTensorShape(input1),
tflite::micro::GetTensorData<int8_t>(input1),
tflite::micro::GetTensorShape(input2),
tflite::micro::GetTensorData<int8_t>(input2),
tflite::micro::GetTensorShape(output),
tflite::micro::GetTensorData<int8_t>(output));
} else {
arm_elementwise_add_s8(
tflite::micro::GetTensorData<int8_t>(input1),
tflite::micro::GetTensorData<int8_t>(input2), op_params.input1_offset,
op_params.input1_multiplier, op_params.input1_shift,
op_params.input2_offset, op_params.input2_multiplier,
op_params.input2_shift, op_params.left_shift,
tflite::micro::GetTensorData<int8_t>(output), op_params.output_offset,
op_params.output_multiplier, op_params.output_shift,
op_params.quantized_activation_min, op_params.quantized_activation_max,
MatchingElementsSize(tflite::micro::GetTensorShape(input1),
tflite::micro::GetTensorShape(input2),
tflite::micro::GetTensorShape(output)));
}
return kTfLiteOk;
}
The arm_elementwise_add_s8() function is provided by CMSIS-NN, and the
implementation leverages different hardware functionality depending on what
extensions are available.
Source/BasicMathFunctions/arm_elementwise_add_s8.c
arm_cmsis_nn_status arm_elementwise_add_s8(const int8_t *input_1_vect,
const int8_t *input_2_vect,
const int32_t input_1_offset,
const int32_t input_1_mult,
const int32_t input_1_shift,
const int32_t input_2_offset,
const int32_t input_2_mult,
const int32_t input_2_shift,
const int32_t left_shift,
int8_t *output,
const int32_t out_offset,
const int32_t out_mult,
const int32_t out_shift,
const int32_t out_activation_min,
const int32_t out_activation_max,
const int32_t block_size)
{
#if defined(ARM_MATH_MVEI)
int32_t count = block_size;
while (count > 0)
{
int32x4_t vect_1;
int32x4_t vect_2;
mve_pred16_t p = vctp32q((uint32_t)count);
vect_1 = vldrbq_z_s32(input_1_vect, p);
vect_2 = vldrbq_z_s32(input_2_vect, p);
vect_1 = vaddq_s32(vect_1, vdupq_n_s32(input_1_offset));
vect_2 = vaddq_s32(vect_2, vdupq_n_s32(input_2_offset));
vect_1 = vshlq_r_s32(vect_1, left_shift);
vect_2 = vshlq_r_s32(vect_2, left_shift);
vect_1 = arm_requantize_mve(vect_1, input_1_mult, input_1_shift);
vect_2 = arm_requantize_mve(vect_2, input_2_mult, input_2_shift);
vect_1 = vaddq_s32(vect_1, vect_2);
vect_1 = arm_requantize_mve(vect_1, out_mult, out_shift);
vect_1 = vaddq_n_s32(vect_1, out_offset);
vect_1 = vmaxq_s32(vect_1, vdupq_n_s32(out_activation_min));
vect_1 = vminq_s32(vect_1, vdupq_n_s32(out_activation_max));
input_1_vect += 4;
input_2_vect += 4;
vstrbq_p_s32(output, vect_1, p);
output += 4;
count -= 4;
}
#else
int32_t loop_count;
int32_t input_1;
int32_t input_2;
int32_t sum;
#if defined(ARM_MATH_DSP)
int32_t a_1, b_1, a_2, b_2;
int32_t offset_1_packed, offset_2_packed;
int8_t r1, r2, r3, r4;
offset_1_packed = (input_1_offset << 16U) | (input_1_offset & 0x0FFFFL);
offset_2_packed = (input_2_offset << 16U) | (input_2_offset & 0x0FFFFL);
loop_count = block_size >> 2;
while (loop_count > 0)
{
/* 4 outputs are calculated in one loop. The order of calculation is follows the order of output sign extension
intrinsic */
input_1_vect = read_and_pad_reordered(input_1_vect, &b_1, &a_1);
input_2_vect = read_and_pad_reordered(input_2_vect, &b_2, &a_2);
a_1 = SADD16(a_1, offset_1_packed);
b_1 = SADD16(b_1, offset_1_packed);
a_2 = SADD16(a_2, offset_2_packed);
b_2 = SADD16(b_2, offset_2_packed);
/* Sum 1 */
input_1 = (b_1 & 0x0FFFF) << left_shift;
input_1 = arm_nn_requantize(input_1, input_1_mult, input_1_shift);
input_2 = (b_2 & 0x0FFFF) << left_shift;
input_2 = arm_nn_requantize(input_2, input_2_mult, input_2_shift);
sum = input_1 + input_2;
sum = arm_nn_requantize(sum, out_mult, out_shift);
sum += out_offset;
sum = MAX(sum, out_activation_min);
sum = MIN(sum, out_activation_max);
r1 = (int8_t)sum;
/* Sum 3 */
input_1 = ((b_1 >> 16) & 0x0FFFF) << left_shift;
input_1 = arm_nn_requantize(input_1, input_1_mult, input_1_shift);
input_2 = ((b_2 >> 16) & 0x0FFFF) << left_shift;
input_2 = arm_nn_requantize(input_2, input_2_mult, input_2_shift);
sum = input_1 + input_2;
sum = arm_nn_requantize(sum, out_mult, out_shift);
sum += out_offset;
sum = MAX(sum, out_activation_min);
sum = MIN(sum, out_activation_max);
r3 = (int8_t)sum;
/* Sum 2 */
input_1 = (a_1 & 0x0FFFF) << left_shift;
input_1 = arm_nn_requantize(input_1, input_1_mult, input_1_shift);
input_2 = (a_2 & 0x0FFFF) << left_shift;
input_2 = arm_nn_requantize(input_2, input_2_mult, input_2_shift);
sum = input_1 + input_2;
sum = arm_nn_requantize(sum, out_mult, out_shift);
sum += out_offset;
sum = MAX(sum, out_activation_min);
sum = MIN(sum, out_activation_max);
r2 = (int8_t)sum;
/* Sum 4 */
input_1 = ((a_1 >> 16) & 0x0FFFF) << left_shift;
input_1 = arm_nn_requantize(input_1, input_1_mult, input_1_shift);
input_2 = ((a_2 >> 16) & 0x0FFFF) << left_shift;
input_2 = arm_nn_requantize(input_2, input_2_mult, input_2_shift);
sum = input_1 + input_2;
sum = arm_nn_requantize(sum, out_mult, out_shift);
sum += out_offset;
sum = MAX(sum, out_activation_min);
sum = MIN(sum, out_activation_max);
r4 = (int8_t)sum;
arm_nn_write_s8x4_ia(&output, PACK_S8x4_32x1(r1, r2, r3, r4));
loop_count--;
}
loop_count = block_size & 0x3;
#else
loop_count = block_size;
#endif
while (loop_count > 0)
{
/* C = A + B */
input_1 = (*input_1_vect++ + input_1_offset) << left_shift;
input_2 = (*input_2_vect++ + input_2_offset) << left_shift;
input_1 = arm_nn_requantize(input_1, input_1_mult, input_1_shift);
input_2 = arm_nn_requantize(input_2, input_2_mult, input_2_shift);
sum = input_1 + input_2;
sum = arm_nn_requantize(sum, out_mult, out_shift);
sum += out_offset;
sum = MAX(sum, out_activation_min);
sum = MIN(sum, out_activation_max);
*output++ = (int8_t)sum;
/* Decrement loop counter */
loop_count--;
}
#endif /* ARM_MATH_MVEI */
return (ARM_CMSIS_NN_SUCCESS);
}
For example, if the DSP extension is present, the parallel signed 16-bit
addition (SADD16) instruction provided by the extension is used to reduce the
number of loop iterations: signed 8-bit integers are packed into 16-bit lanes,
and four outputs are calculated in a single iteration. If the MVE extension is
present, vector addition instructions (VADD) can be used directly, making the
calculation even more efficient.
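To isolate what SADD16 does from the surrounding requantization logic, here is a minimal sketch assuming a Cortex-M target with the DSP extension, where the __SADD16 intrinsic is made available by the CMSIS-Core headers pulled in by the device header; the helper names are illustrative.

#include <cstdint>

// Assumes a CMSIS-Core device header is included elsewhere so that __SADD16
// (parallel signed 16-bit addition) is available.

// Pack two signed 16-bit lanes into a single 32-bit word.
static inline uint32_t Pack16x2(int16_t lo, int16_t hi) {
  return (static_cast<uint32_t>(static_cast<uint16_t>(hi)) << 16) |
         static_cast<uint16_t>(lo);
}

// Add four pairs of 16-bit values using two SADD16 instructions instead of
// four scalar additions.
void AddInt16x4(const int16_t* a, const int16_t* b, int16_t* out) {
  const uint32_t sum01 = __SADD16(Pack16x2(a[0], a[1]), Pack16x2(b[0], b[1]));
  const uint32_t sum23 = __SADD16(Pack16x2(a[2], a[3]), Pack16x2(b[2], b[3]));

  out[0] = static_cast<int16_t>(sum01 & 0xFFFF);
  out[1] = static_cast<int16_t>(sum01 >> 16);
  out[2] = static_cast<int16_t>(sum23 & 0xFFFF);
  out[3] = static_cast<int16_t>(sum23 >> 16);
}

In the actual kernel, the int8 values are first sign-extended into 16-bit lanes via read_and_pad_reordered() before SADD16 is applied, which is why the loop above it can produce four int8 outputs per iteration.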
These optimizations are made available via configuration when compiling
tflite-micro (for example, by building with the CMSIS-NN kernel variants). They
can be applied to any model that uses the affected operators, without the need
to modify the model when moving from one architecture to another. Some
optimizations do require modifying the model. For example, when using
microcontrollers that include Arm’s Ethos-U NPUs, such as the previously
mentioned Alif Ensemble E3, you can run .tflite models through the Vela
compiler.
In converted models, sequences of built-in operators are replaced with a custom
ETHOSU
operator
and a command
stream.
The application processor notifies the NPU of the address of the command stream
and other relevant data, then triggers it to perform inference.
Unlike the addition operator, there is no fallback kernel implementation; models
converted via the Vela compiler cannot run on microcontrollers that do not have
Ethos-U NPUs. For those that do, we can see the previously described logic in
the Eval() implementation for the ETHOSU custom operator.
tensorflow/lite/micro/kernels/ethos_u/ethosu.cc
TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
TFLITE_DCHECK(node->user_data != nullptr);
TFLITE_DCHECK(context != nullptr);
TFLITE_DCHECK(context->GetScratchBuffer != nullptr);
// Get base addresses.
TfLiteEvalTensor* tensor;
int i = 0;
int num_tensors = 0;
void* cms_data;
uint8_t co_type;
int result;
const OpData* data = static_cast<const OpData*>(node->user_data);
uint64_t* base_addrs = static_cast<uint64_t*>(
context->GetScratchBuffer(context, data->base_addr_idx));
size_t* base_addrs_size = static_cast<size_t*>(
context->GetScratchBuffer(context, data->base_addr_size_idx));
const uint8_t* custom_data =
static_cast<uint8_t const*>(node->custom_initial_data);
auto root = flexbuffers::GetRoot(custom_data, node->custom_initial_data_size);
co_type = root.AsInt8();
if (co_type != CO_TYPE_ETHOSU) {
MicroPrintf("CO_TYPE != ETHOSU");
return kTfLiteError;
}
// Get command stream data address.
tensor = context->GetEvalTensor(context, node->inputs->data[0]);
cms_data = reinterpret_cast<void*>(tensor->data.uint8);
// Get addresses to weights/scratch/input data.
for (i = 1; i < node->inputs->size; ++i) {
tensor = context->GetEvalTensor(context, node->inputs->data[i]);
base_addrs[num_tensors] =
static_cast<uint64_t>(reinterpret_cast<uintptr_t>(tensor->data.uint8));
size_t byte_size = 1;
for (int k = 0; k < tensor->dims->size; k++) {
byte_size = byte_size * tensor->dims->data[k];
}
base_addrs_size[num_tensors] = byte_size;
num_tensors++;
}
// Get addresses to output data.
for (i = 0; i < node->outputs->size; ++i) {
tensor = context->GetEvalTensor(context, node->outputs->data[i]);
base_addrs[num_tensors] =
static_cast<uint64_t>(reinterpret_cast<uintptr_t>(tensor->data.uint8));
size_t byte_size = 1;
for (int k = 0; k < tensor->dims->size; k++) {
byte_size = byte_size * tensor->dims->data[k];
}
base_addrs_size[num_tensors] = byte_size;
num_tensors++;
}
// When Vela optimizes a tflite file it will assign the tensors like this:
//
// +-------+------------------------+ +--------+-------------+
// | INPUT | Description | | OUTPUT | Description |
// +-------+------------------------+ +--------+-------------+
// | 0 | Ethos-U command stream | | 0..m | Outputs |
// | 1 | TFLM model | +--------+-------------+
// | 2 | TFLM arena |
// | 3 | Ethos-U fast scratch |
// | 4..n | Inputs |
// +-------+------------------------+
//
// This code will assign the NPU base addresses like this:
//
// +--------------+----------------------+
// | Base address | Description |
// +--------------+----------------------+
// | 0 | TFLM model |
// | 1 | TFLM arena |
// | 2 | Ethos-U fast scratch |
// | 3..n | Input tensors |
// | n..m | Output tensors |
// +--------------+----------------------+
//
// The number of base address will be limited to 8.
//
// NOTE! The command stream produced by Vela will access the IFM and OFM
// buffers using base address 1. This means that it is not possible to point
// the input and output tensors outside of the TFLM arena.
num_tensors = std::min(num_tensors, 8);
struct ethosu_driver* drv = ethosu_reserve_driver();
result = ethosu_invoke_v3(drv, cms_data, data->cms_data_size, base_addrs,
base_addrs_size, num_tensors,
GetMicroContext(context)->external_context());
ethosu_release_driver(drv);
if (-1 == result) {
return kTfLiteError;
} else {
return kTfLiteOk;
}
}
We’ve now seen the full spectrum of operator optimization, from kernels that are
implemented purely in C, to those that leverage hardware instructions provided
in architecture extensions, and finally to those that offload inference to a
wholly separate processor. In future posts, we’ll explore how operators are
encoded in .tflite files, and how the runtime ultimately invokes the
underlying kernels.