THE LATEST EDGE AI TECHNOLOGY TRENDS

Edge AI refers to running AI models on edge devices used in the field rather than in the cloud. In recent years, edge AI has attracted considerable attention and its use cases continue to grow. We are also receiving an increasing number of requests from customers for consulting on and implementation of edge AI.

Drawing on our practical experience with edge AI development and implementation, this two-part series (of which this article is the first) provides an overview of trends in edge AI technology and the points to consider when implementing it.
(1) Trends in edge AI technology (this article)
(2) Five points to consider when implementing edge AI

1. TRENDS IN EDGE AI TECHNOLOGY

Examples of AI running on edge devices include autonomous operation systems, such as self-driving cars, autonomous drone navigation, and automated operation of construction equipment, as well as image analysis systems, such as anomaly detection with surveillance cameras and automated meter reading with cameras.

AI is also starting to appear in consumer edge devices such as smartphone apps, smart speakers, and smart watches, and deep learning models are even being run in web browsers.

AI is increasingly run on edge devices

This article introduces the technologies and hardware used in edge AI implementations, describes hardware trends and performance, and presents some use cases.

2. TECHNOLOGY AND HARDWARE FOR IMPLEMENTING AI MODELS ON EDGE DEVICES

As the need for edge AI grows, various technologies and hardware have been developed to run AI models on edge devices. Here are three of the major ones.

The first is model compression (lightweighting). These are techniques for reducing the size and computational cost of an AI model so that it meets the required specifications and constraints before it is deployed on an edge device.

The second is compilers. A compiler is needed so that the compressed AI model runs optimally on the target edge device.

The third is hardware. A variety of hardware is available, including chips with native support for models compressed through techniques such as quantization and pruning. Hardware is particularly important and is discussed in more detail below.
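As a concrete illustration of one such compression technique, here is a minimal sketch of magnitude-based pruning with PyTorch's torch.nn.utils.prune module; the network and the 30% sparsity level are hypothetical examples, not taken from the case studies in this article:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# A small example network standing in for a model to be deployed on an edge device.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
)

# Zero out the 30% of weights with the smallest magnitude in each conv layer
# (unstructured L1 pruning). The sparsity ratio is an arbitrary example value.
for module in model.modules():
    if isinstance(module, nn.Conv2d):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # make the pruned weights permanent

# Check the resulting sparsity of the first conv layer.
w = model[0].weight
print(f"sparsity: {float((w == 0).sum()) / w.numel():.2f}")
```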

In recent years, GPUs and AI accelerators have been increasingly used as hardware to enable edge AI.

GPU

There are two types of GPUs: those for servers (and workstations) and those for embedded applications. In edge AI we are referring to embedded GPUs, which offer somewhat lower performance than their server counterparts but have the advantage of lower power consumption.

NVIDIA's Jetson series of embedded GPUs is divided into several grades according to performance (*1): the Nano is the lowest-end model, while the TX2, AGX Xavier, and others offer higher performance, approaching that of server-class hardware. In terms of applications, the Nano is intended for embedded uses such as surveillance cameras, while the AGX Xavier targets more complex applications such as automated driving.
1: https://www.nvidia.com/ja-jp/autonomous-machines/embedded-systems/

Embedded GPUs

AI ACCELERATORS

There are various types of accelerators for speeding up AI applications; three are introduced here.
The first is Intel's Neural Compute Stick 2, which contains the company's Myriad chip (*2).
The second is Google's Coral (Edge TPU) accelerator (*3).
The third is the field-programmable gate array (FPGA), often from Xilinx.
The Neural Compute Stick 2 and Coral are extremely small and relatively inexpensive.
2: https://ark.intel.com/content/www/jp/ja/ark/products/140109/intel-neural-compute-stick-2.html
3: https://coral.ai/products/

Edge AI accelerators

OTHER AI CHIPS AND EDGE AI DEVICES

There is also a wide range of hardware dedicated to specific industries.
Examples include Renesas Electronics' R-Car series, which complies with the ISO 26262 ASIL B/D automotive functional safety standard, Hailo's AI processors, Ambarella's edge AI video-processing SoCs and associated platforms, and Blaize's GSP (Graph Streaming Processor) for edge AI.

All of these emphasize performance, system efficiency, and power consumption in edge devices. In addition, mobile devices such as smartphones and tablets are increasingly equipped with processors (SoCs) for neural network processing, where the focus is on unit cost, device size, and power consumption.

3. HARDWARE PERFORMANCE FOR EDGE AI

GPUS PERFORM INFERENCE ABOUT 20 TIMES FASTER THAN CPUS

An NVIDIA developer blog post from 2017 provides comparative data on the number of images processed per second by a GPU and a CPU. Depending on the batch size, the NVIDIA GPU (Tesla P100) processed up to 20 times more images per second than the CPU alone (*4).

The Jetson TX2, whose GPU is based on the same Pascal architecture as the P100, has also been shown to achieve inference throughput (images processed per second) comparable to a server-class CPU at very low power (*5).
4: https://developer.nvidia.com/blog/deploying-deep-learning-nvidia-tensorrt/
5: https://developer.nvidia.com/blog/jetson-tx2-delivers-twice-intelligence-edge/

AI ACCELERATORS ENABLE INFERENCES TO BE MADE MORE THAN 30 TIMES FASTER THAN ON THE CPU

As an example of the use of Coral accelerators, a comparison of model inference speed between an embedded CPU (Arm Cortex) and a Coral accelerator is also available (*6). Although the results vary between AI models, the accelerator is up to 30 times faster than the CPU, significantly reducing inference time.
6: https://solution.murata.com/ja-jp/technology/coral

In our own demonstrations, when a Coral accelerator was attached to a Raspberry Pi to perform human pose estimation, the frame rate with the Raspberry Pi's CPU (Arm Cortex) alone was as low as 0.1 to 0.2 FPS (roughly one image every 5 to 6 seconds). With the Coral accelerator, on the other hand, the frame rate rose to about 10 FPS or more, showing that AI accelerators can significantly increase processing speed in embedded systems.
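For readers curious what such a setup looks like in code, here is a minimal sketch of running a TFLite model compiled for the Edge TPU through the tflite_runtime delegate; the model file name, the dummy input, and the output layout are assumptions that depend on the actual pose estimation model used:

```python
import numpy as np
from tflite_runtime.interpreter import Interpreter, load_delegate

# Load a model compiled for the Edge TPU (the file name is a placeholder).
interpreter = Interpreter(
    model_path="posenet_edgetpu.tflite",
    experimental_delegates=[load_delegate("libedgetpu.so.1")],
)
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Dummy uint8 frame standing in for a camera image resized to the model's input size.
_, height, width, _ = input_details[0]["shape"]
frame = np.zeros((1, height, width, 3), dtype=np.uint8)

interpreter.set_tensor(input_details[0]["index"], frame)
interpreter.invoke()  # inference runs on the Edge TPU via the delegate
keypoints = interpreter.get_tensor(output_details[0]["index"])
print(keypoints.shape)  # output layout depends on the specific model
```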

COMPUTING PERFORMANCE OF EACH HARDWARE

Here is a brief look at the performance of each GPU and AI accelerator. The figure below shows the computing performance of each piece of hardware. FLOPS and TOPS are commonly used measures of compute performance: FLOPS (floating-point operations per second) is used for FP32 or FP16 arithmetic, while TOPS (tera operations per second) usually refers to integer (e.g., int8) operations.

Hardware performance comparison

Source of performance data: https://www.nvidia.com/ja-jp/autonomous-machines/embedded-systems/
https://ark.intel.com/content/www/jp/ja/ark/products/140109/intel-neural-compute-stick-2.html
https://coral.ai/products/
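As a rough illustration of how to read such peak figures, the following back-of-envelope calculation relates a hypothetical model's per-image operation count to an accelerator's peak TOPS; both numbers are illustrative assumptions, not measurements from this article:

```python
# Back-of-envelope throughput estimate from peak compute figures.
# Both numbers below are illustrative assumptions.
model_ops_per_image = 2e9   # ~2 billion ops per image (a MobileNet-class detector, assumed)
peak_ops_per_second = 4e12  # 4 TOPS peak (int8 accelerator)

theoretical_fps = peak_ops_per_second / model_ops_per_image
print(f"theoretical upper bound: {theoretical_fps:.0f} images/s")
# Real throughput (e.g., ~70 FPS for MobileNet V2 SSD on Coral) is far below this
# peak, since memory bandwidth, pre/post-processing, and I/O dominate in practice.
```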

There is also a lot of hardware available that supports int8 operations. In general, neural network inference is often carried out in FP32/FP16, but it is known that, with proper quantization, int8 inference can meet accuracy requirements without problems.
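As an illustration of how a model is typically prepared for such int8 hardware, here is a minimal sketch of post-training int8 quantization with the TensorFlow Lite converter; the saved-model path and the random calibration data are placeholders, and this is one common workflow rather than the specific procedure used in the case studies below:

```python
import numpy as np
import tensorflow as tf

def representative_dataset():
    # Calibration samples used to estimate activation ranges (min/max) per layer.
    # In practice these would be a few hundred real images; random data is a placeholder.
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model")  # placeholder path
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Force full int8 quantization so the model can run on int8-only accelerators
# such as the Edge TPU.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8

tflite_model = converter.convert()
with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)
```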

With 4 TOPS, Coral is able to process images at a rate of 70 images per second (70 FPS) in the object detection task (MobileNet V2 SSD on Coral Dev Board, *7). By using multiple accelerators, it is also possible to run multiple tasks in parallel without increasing the load on the host CPU.
7: https://coral.ai/docs/edgetpu/benchmarks/

4. EDGE AI USE CASES IN A WIDE RANGE OF INDUSTRIES

Here are some examples of edge AI in action. (Case studies 1 to 3 are not our own projects.)

Case Study 1: Kurazushi's Plate Counting System

A system has been developed that uses image recognition with a Raspberry Pi and Coral to understand customer needs from data such as which dishes customers pick up and when. Compared with management using conventional IR-RFID (infrared and wireless IC tag) systems, the new system is reported to improve reliability (*8).
8: https://coral.ai/news/kura-sushi

Case Study 2: Reducing Crop Losses at Farmwave

The system uses a camera-equipped crop harvester fitted with a Raspberry Pi and a Coral device to monitor the harvest in real time. If the yield is low, machine parameters (such as the angle and speed of the harvesting rotor and the sieve opening) can be adjusted in real time to maximise the yield (*9).
9: https://coral.ai/news/farmwave

Case Study 3: Perimeter Monitoring of Komatsu Mobile Cranes

Using an NVIDIA GPU (Jetson TX2), a system has been developed that processes images from cameras mounted around the entire circumference of a 2.9 t mobile crane, detects people in the vicinity in real time, and issues warnings (*10). The system is expected to improve on-site safety.
10: https://car.watch.impress.co.jp/docs/news/1143321.html

Case Study 4: Human Flow Analysis Using Surveillance Cameras (Araya case study)

Using a Raspberry Pi and Coral, we are developing a system that tracks people in real time from surveillance camera images and maps them in 2D. Using the Edge TPU to process high-resolution Full HD images, we achieve inference at around 2 FPS. This is nearly fast enough to follow people at normal walking speed, and can be used for entry/exit counting, traffic counting, congestion and usage monitoring, visitor demographic analysis, and behaviour detection.
Click here for more information: Human flow analysis solution using multiple cameras and image recognition AI

Use Case 5: Training Lightweight Models Using Distillation

We studied model compression by distillation, which is used both to improve the generalization performance of deep learning models and as a knowledge transfer technique, for an anomaly detection AI model at a customer's factory. We conducted this research and development because we believe distillation is an effective approach when a compact, lightweight model with high accuracy is needed to realize edge AI.

Edge AI implementation issues
Model compression and related edge AI techniques do not always yield faster models, and compression can significantly reduce accuracy, so it is necessary to choose the appropriate method. We aimed to develop lighter and more accurate models to promote edge AI.

*Distillation is a method of passing the knowledge (prediction results) of an already trained model on to another, smaller model, with the expectation of obtaining a compact model whose accuracy is comparable to that of the larger one. It refers to transferring the knowledge of a teacher model (the large model) to a student model (the small, lightweight model) in order to raise the accuracy of the target model and speed up training.

The anomaly detection model in this case classifies images of product parts as normal or abnormal.
For this model, we used distillation to train the lightweight student model on the task, bringing its accuracy close to that of the teacher model, while the client acquired know-how across the series of steps from writing the program to performing the distillation. The goals were to:
・Increase the number of models that can be deployed and reduce the number of devices used
・Improve processing speed for anomaly detection

By using not only the teacher model's inference results but also the output distribution of its intermediate layers for training, we obtained a small student model with performance close to that of the teacher model.
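As a minimal sketch of such a distillation loss, the following PyTorch snippet combines soft-target (logit) distillation with an intermediate-feature matching term; the temperature, loss weights, and the assumption that teacher and student features share the same shape are illustrative choices, not the actual settings used in this project:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      student_feat, teacher_feat, T=4.0, alpha=0.5, beta=0.1):
    # Hard-label loss on the ground truth (normal / abnormal).
    ce = F.cross_entropy(student_logits, labels)
    # Soft-target loss: match the teacher's softened output distribution.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Intermediate-feature matching between teacher and student layers
    # (assumes the features have already been projected to the same shape).
    feat = F.mse_loss(student_feat, teacher_feat)
    return (1 - alpha) * ce + alpha * kd + beta * feat

# Usage sketch: the teacher is frozen, only the student is trained.
# teacher.eval()
# with torch.no_grad():
#     t_logits, t_feat = teacher(images)
# s_logits, s_feat = student(images)
# loss = distillation_loss(s_logits, t_logits, labels, s_feat, t_feat)
# loss.backward()
```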

Compared with training the lightweight model without distillation, training it with distillation reduced the false positive rate by about 40%.

With the recent shift of AI to edge devices, the demand for accuracy in small models is increasing.
Distillation fits this demand well, and we can support the development efforts of customers who lack knowledge of or implementation experience with distillation.

Use Case 6: Accelerating Models with Quantization/TensorRT

The product detection model used in a customer's retail application had a processing-time problem because of the large volume of images to be processed.
Quantization was used to shorten the processing time.
As a result, processing speed was improved without degrading accuracy, which is useful for customers who want to increase speed while maintaining high accuracy when moving models to the edge.

In some cases, edge AI cannot be put to practical use unless the AI on the device side can process data at high speed.
By applying quantization (int8)/FP16 conversion and TensorRT conversion in parallel to the customer's model, we improved processing speed with virtually no degradation in accuracy.

Processing Speed
The model was converted to TensorRT (Darknet to TensorRT) and evaluated for speed and accuracy, achieving up to nearly a 2x improvement in inference speed.

Accuracy
Accuracy was almost unchanged while processing speed improved.
As a verification, an out-of-stock check was conducted, and the accuracy obtained with FP16 matched the conventional accuracy.

The following optimizations were performed to speed up the model:
・Quantization (int8)/FP16
・TensorRT conversion
Quantization reduces model size by representing parameters such as weights with fewer bits; the mechanism is explained in the note below.

*Quantization is an approximate representation of the magnitude of a signal using discrete values. For example, when quantizing to 2 bits, a min and max are set and the range between them is divided into four parts, as shown in the figure, and each pre-quantization value is assigned to the nearest quantized value. The min and max do not have to be symmetric around 0, and the range does not have to be divided equally. QAT (quantization-aware training, i.e., retraining during quantization) is a method for compensating for the accuracy lost through quantization.
In quantization with TensorRT, however, min and max must be symmetric around 0 and the range must be divided equally.
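As a small numerical illustration of this mapping, the following sketch quantizes a few example float values to int8 with a symmetric scale (the values are arbitrary):

```python
import numpy as np

# Symmetric int8 quantization, as required by TensorRT-style calibration:
# the representable range [-max_abs, +max_abs] is divided into equal steps.
x = np.array([-1.7, -0.3, 0.0, 0.8, 2.5], dtype=np.float32)  # example values
max_abs = np.abs(x).max()
scale = max_abs / 127.0                      # size of one step on the int8 grid

q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)   # quantize
x_hat = q.astype(np.float32) * scale                          # dequantize

print("quantized:", q)
print("reconstructed:", x_hat)
print("max error:", np.abs(x - x_hat).max())  # bounded by roughly scale / 2
```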

TensorRT is a high-performance deep learning inference optimization and execution library from NVIDIA.

Optimization with TensorRT
・A program parses the network structure file (.cfg) and builds an ONNX model.
・The YOLO layer is not available in ONNX, so the model is converted to ONNX without the YOLO layer.
・The ONNX model is then converted with the TensorRT library, and a YOLO layer implemented in CUDA is added as a custom layer.
・For NMS processing during inference, the NMS implemented in Darknet is used.
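For reference, here is a minimal sketch of building a TensorRT engine from an ONNX model with FP16 enabled, using the TensorRT Python API; the file names are placeholders, exact API calls differ slightly between TensorRT versions, and the YOLO custom layer and Darknet NMS steps described above are omitted:

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(onnx_path="model.onnx", engine_path="model_fp16.engine"):
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    )
    parser = trt.OnnxParser(network, TRT_LOGGER)

    # Parse the ONNX model exported from the Darknet .cfg/.weights.
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError("failed to parse ONNX model")

    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)  # enable FP16 kernels where supported

    # Serialize the optimized engine to disk for deployment on the edge device.
    serialized = builder.build_serialized_network(network, config)
    with open(engine_path, "wb") as f:
        f.write(serialized)

build_engine()
```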

Edge AI implementation requires not only high accuracy on the device side but also short processing times for practical use, and quantization is used to speed up processing. When considering edge AI solutions for improving business efficiency, these techniques can help customers in a wide range of industries who want to maintain high accuracy while processing at high speed.

5. SUMMARY

In recent years, GPUs and AI accelerators have become the mainstream hardware for edge AI. They enable a significant increase in processing speed compared with CPUs and are expected to be deployed particularly in applications that process camera images.

In the following article, we will discuss five aspects to consider when implementing Edge AI.

You can also find out more about our edge AI services at Araya.

Click here to read more

(1) Trends in edge AI technology (this article)
(2) Five points to consider when implementing edge AI