Deep learning with Raspberry Pi and alternatives in 2019

Introduction

This page helps you build a deep learning model on a Raspberry Pi or an alternative like the Google Coral or Jetson Nano. For more general information about deep learning and its limitations, please see deep learning. This page is not a detailed step-by-step recipe for installing certain software. It deals more with the general principles, so you get a good idea of how it all works and on which board your network can run.
Tensor
A widely used software package for deep learning is TensorFlow. Let's start with the name. What is a tensor?
You can have a list of numbers. This is called a vector in mathematics.

Vector

If you add a dimension to this list, you get a matrix.

Matrix

This way you can, for example, represent a black-and-white image. Each value represents a pixel value. The number of rows equals the height, the number of columns matches the width of the image. If you add yet another dimension to the matrix, you get a tensor.

Tensor

A stack of 2D matrices on top of each other. Or, to put it another way, a matrix in which the individual numbers are replaced by a vector, a list of numbers. An example is an RGB picture. Each individual pixel (element in the matrix) consists of three elements: an R, G and B component. This is the most simplified definition of a tensor: an n-dimensional array of numbers.
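
In code, these definitions simply differ in the number of array dimensions. A small NumPy sketch (the sizes are just examples):

    import numpy as np

    vector = np.array([1, 2, 3])          # 1D: a list of numbers
    matrix = np.zeros((480, 640))         # 2D: a grayscale image, height x width
    tensor = np.zeros((480, 640, 3))      # 3D: an RGB image, one (R,G,B) vector per pixel

    print(vector.ndim, matrix.ndim, tensor.ndim)   # prints: 1 2 3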
There is a subtle difference between the definition of tensors in TensorFlow and in mathematics.
In mathematics, a tensor is not just a collection of numbers in a matrix. Here a tensor must obey certain transformation rules. These rules have to do with altering the coordinate system in which the tensor lives without altering its outcome. Most tensors are 3D and have the same number of elements as a Rubik's cube. Each individual cube predicts, by a set of orthogonal vectors, how a physical object will deform under stress (hence tensor).
If the observer takes another position in the real world, the deformations of the object itself don't change; obviously, it is still the same object. However, all your vectors or formulas will change given this new position. They will change in such a way that the resulting deformation still remains the same. Think of it as the distance between the tops of two towers. Wherever you stand, that distance will not change. The vectors drawn from your position to those tops will, however, shift according to your position, your origin.
Tensor matrix
There is even a third meaning of a tensor in this context: the neural tensor network.
The tensor in this special neural network establishes a relationship between two entities: a dog has a tail, a dog is a mammal, a mammal needs oxygen, etc.

Neural tensor layer
  
The last two definitions are only given for completeness. Many people think that TensorFlow has something to do with one of these interpretations. This is not the case.

Weight matrix

The most important building block of TensorFlow and other deep-learning software is the n-dimensional array. This section explains the use of these arrays. Every deep learning application consists of a given topology of neural nodes. Each neural node is usually constructed as shown below.

Neural node

Each input is multiplied by a weight, and the products are summed. Together with a bias, the result goes through an activation function φ. This can be a simple step operation or a more complex function such as a hyperbolic tangent.

TanH
 
The output is the input for the next layer in the network. A network can be made of many layers, each with thousands of individual neurons.
If you look at one layer, the same input array can be applied to different weight arrays, each giving a different result, so that various features can be extracted from a single input.
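
To make the node computation concrete, a minimal sketch in plain Python (the inputs, weights and bias are made-up numbers; the activation is the hyperbolic tangent mentioned above):

    import math

    def neuron(inputs, weights, bias):
        # multiply each input by its weight and sum the products
        total = sum(i * w for i, w in zip(inputs, weights)) + bias
        # pass the result through the activation function phi
        return math.tanh(total)

    print(neuron([0.5, -1.0, 0.25], [0.8, 0.2, -0.5], bias=0.1))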

Neural net
 
In the network above, four inputs (yellow) are fully connected to the four neurons (blue) of the first layer. These are wired to the five neurons of the next layer, followed by another inner layer of six neurons. After two consecutive layers of four and three neurons, the output (orange) with three channels is reached.
Such a scheme results in a vector-matrix multiplication.

GPU matrix mul

Here the input vector of four values (x,y,z,w) is multiplied by the weight matrix. The weights a,b,c,d produce the x' output, the weights e,f,g,h the y' output, and so on. There are other ways to describe this multiplication, such as:

GPU formula

where v is the input vector (x,y,z,w) and v' the output vector (x',y',z'). The vector-matrix multiplication is one of the most frequently performed operations in TensorFlow, hence the name.
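
A minimal NumPy sketch of this operation, with a 3x4 weight matrix mapping four inputs to three outputs (all numbers made up):

    import numpy as np

    W = np.array([[ 0.8,  0.2, -0.5,  0.1],    # weights a, b, c, d -> x'
                  [ 0.3, -0.7,  0.6,  0.4],    # weights e, f, g, h -> y'
                  [-0.2,  0.5,  0.9, -0.3]])   # weights for z'
    v = np.array([1.0, 2.0, 3.0, 4.0])         # input vector (x, y, z, w)

    v_out = W @ v                              # v' = W·v
    print(v_out)                               # output vector (x', y', z')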
GPU
Before all the dots are connected, first a little detour into GPU hardware. GPU stands for Graphical Processing Unit, a device initially designed to relieve the CPU of the dreary screen-rendering task. Over the years, GPUs have become much more powerful. Nowadays they have over 21 billion transistors and are capable of performing massive parallel computations. Especially in games, where every pixel on the screen is calculated, these calculation capabilities are needed. When the position of the viewer moves, for example when the hero starts running, all vertices must be recalculated, and this 25 times per second to get smooth transitions. Each vertex needs a rotation and a translation. The formula is:

GPU full matrix

Here (x,y,z,w) is the initial vertex position in 3D and (x',y',z',w') is the new position after the matrix operation. As you can see, this type of arithmetic is the same as for a neural network. There is another point of interest. When you look at x', it is the summation of four products (ax+by+cz+dw). y', on the other hand, is also a summation (ex+fy+gz+hw). But to calculate y', one does not need to know the values that determine x' (a,b,c and d). They have no bearing on each other. You can calculate x' at the same time as y', and z' and w' for that matter. In theory, every calculation that does not depend on other outcomes can be performed at the same time. Hence the massively parallel architecture of GPUs. The fastest GPUs today (2019) are capable of a whopping 125 TFLOPS.
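
To illustrate this independence, a small NumPy sketch that transforms a whole batch of vertices in a single matrix product; every output element can be computed without knowledge of the others, which is exactly what a GPU exploits (the rotation angle and translation are made-up values):

    import numpy as np

    # 4x4 transformation: a rotation around the z-axis plus a translation
    a = np.radians(30)
    M = np.array([[np.cos(a), -np.sin(a), 0.0, 2.0],
                  [np.sin(a),  np.cos(a), 0.0, 1.0],
                  [0.0,        0.0,       1.0, 0.5],
                  [0.0,        0.0,       0.0, 1.0]])

    vertices = np.ones((10000, 4))    # 10000 vertices (x, y, z, w), with w = 1
    moved = vertices @ M.T            # each (x', y', z', w') is computed independently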
This is the whole idea behind GPU acceleration. Transfer all tensors to the GPU memory and let the device perform all vector-matrix calculations in a fraction of the time it would cost the CPU. Without the impressive calculation power of GPUs, deep learning would hardly be possible.

TPU

Driven by the huge market potential of deep learning, some manufacturers replaced the GPU with a TPU, a Tensor Processing Unit. Besides the vector-matrix multiplication, a GPU has other tasks to perform, such as vertex interpolation and shading, H264 compression, driving HDMI monitors, etc. By using all transistors solely for tensor dot products, the throughput increases while the power consumption decreases. The first generation only works with 8-bit integers; later generations also handle floating points. The TPUs on the embedded boards below are all integer-based, except for the Jetson Nano. Read an in-depth article here.
GPU Pitfalls
There are a few points about GPU arithmetic that must be taken into account.
To begin with, stick to matrices. The GPU architecture is designed for that kind of operation. Writing an extensive if-else structure is disastrous for a GPU and the overall performance.
Another point is that memory swaps cost a lot of efficiency. The transfer of data between the CPU memory (where the images are usually located) and the GPU memory is increasingly becoming a serious bottleneck. You read the same thing over and over again in every NVIDIA document: the larger the vector-matrix dot product, the faster it will be executed.
In this regard, keep in mind that the Raspberry Pi and its alternatives usually have one large RAM for both the CPU and the GPU; they simply share the same DDR3 chip(s). Your neural network must not only fit in the program memory, it must also leave space in RAM for the CPU kernel to run. This can sometimes impose restrictions on the network or the number of objects to be recognized. Choosing another board with more RAM may be the only solution in that case. All this contrasts with the graphics card in a PC, where the GPU has its own memory bank.
Another distinction is that the GPU on a video card works with floats or half floats, sometimes called small floats. The embedded GPU on the Raspberry Pi, or the TPU on the alternative boards, works with 8 or 16-bit integers. Your neural network must be adapted to these formats. If this is not possible, choose a board with floating-point arithmetic like the Jetson Nano.
A last piece of advice: don't overclock the GPU. GPUs normally run at a lower frequency than the CPU; some Mali GPUs in ARM cores run as low as 400 MHz. Overclocking may work in the winter, but the application can falter in mid-summer. Remember, it's your vision application at your client's site that suddenly crashes, not a game you simply restart.
And of course, the comments on the page about computer vision on the Raspberry Pi also apply here.
Showstopper.
You cannot train a deep learning model on a Raspberry Pi or one of its alternatives. At least, not unless you have planned a trip around the world in the meantime. The boards lack the computing capacity to perform the huge number of floating-point multiply-accumulates required during training. Even a Google Coral cannot train a network, because the TPU on this board only works with special pre-compiled TensorFlow networks; only the last layer in a network can be changed slightly. And although the Jetson Nano has floating-point CUDA cores, it is still not really capable of training a network in an acceptable time; do it overnight is NVIDIA's advice here. So, in the end, you can only import and run an already trained model on these boards.
Practice.
The first step is to install an operating system, usually a Linux derivative such as Ubuntu or Debian. That is the easy part.
The hard part is installing your deep learning model. You have to figure out whether any additional libraries (OpenCV) or drivers (GPU support) are needed. Please note that only the Jetson Nano supports CUDA, the package most deep learning software on a PC uses. All other boards need different GPU support if you want to accelerate the neural network. The development of GPU drivers for the Raspberry Pi and its alternatives is an ongoing process; check the communities on the net.
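
Once the operating system, libraries and drivers are in place, running a pre-trained network takes only a few lines. A minimal sketch, assuming a hypothetical quantized model file model.tflite and the tflite_runtime package on a Raspberry Pi:

    import numpy as np
    from tflite_runtime.interpreter import Interpreter

    interpreter = Interpreter(model_path="model.tflite")   # hypothetical file name
    interpreter.allocate_tensors()

    inp = interpreter.get_input_details()[0]
    out = interpreter.get_output_details()[0]

    # a dummy image with the shape and dtype the model expects
    image = np.zeros(inp['shape'], dtype=inp['dtype'])

    interpreter.set_tensor(inp['index'], image)
    interpreter.invoke()
    print(interpreter.get_tensor(out['index']))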

The last step is reducing the neural network to acceptable proportions. The famous AlexNet originally takes 2.3 billion floating-point operations for a single frame. That will never run fast on a simple ARM computer or mobile device. Most frameworks therefore have some sort of reduction strategy: YOLO has Tiny YOLO, Caffe has Caffe2 and TensorFlow has TensorFlow Lite. They all use one or more of the following techniques.
Reduce the input size. Smaller images save a lot of computations in the first layers.
Decrease the number of objects to classify; it trims the sizes of many internal layers.
Port the neural network from floats to bytes where possible. This also lowers the memory load considerably (a sketch follows below).
Another strategy is the reduction of the floats to single bits, an XNOR network. This fascinating idea is discussed here.
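
As an illustration of the float-to-byte porting mentioned in the list above, a sketch of post-training quantization with the TensorFlow Lite converter (TensorFlow 2.x API; the saved-model directory and the calibration generator are hypothetical):

    import tensorflow as tf

    def representative_data():
        # a handful of hypothetical calibration samples to determine the int8 ranges
        for _ in range(100):
            yield [tf.random.uniform((1, 224, 224, 3))]

    converter = tf.lite.TFLiteConverter.from_saved_model("my_model_dir")
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = representative_data
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]

    open("model_int8.tflite", "wb").write(converter.convert())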

Comparison of Raspberry Pi and alternatives.

Jetson Nano vs Google Coral vs Intel Neural Stick: here is the comparison. The three odd ones out in the list are the JeVois, the Intel Neural Stick and the Google Coral USB accelerator. The first has a camera on board and can do a lot, as you can read here.
The Intel Neural Stick and the Google Coral accelerator are USB dongles with a special TPU chip performing all tensor calculations. The Intel Neural Stick comes with a toolset to migrate a TensorFlow, Caffe or MXNet model into a working Intermediate Representation (IR) image for the stick.
The Google Coral works with special pre-compiled TensorFlow Lite networks. If the topology of the neural network and its required operations can be described in TensorFlow, it may work well on the Google Coral. However, with its scant 1 GB of RAM, memory shortage can still be an issue.
The Google USB accelerator has its own special back-end compiler that converts a TensorFlow Lite file into an executable model for the dongle's TPU.

The Jetson Nano is the only single-board computer with floating-point GPU acceleration. It supports most models, because all frameworks such as TensorFlow, Caffe, PyTorch, YOLO, MXNet and others use the CUDA GPU support library at some point. The price is also very competitive. This has everything to do with the booming deep-learning market, in which NVIDIA does not want to lose its prominent role.
Not all models can run on every device, most of the time due to memory shortage or incompatibilities in hardware and/or software. In those scenarios, several workarounds are possible, but they will be time-consuming to develop, and often the results will be disappointing.

Benchmarks are always subject to discussion. Some may find different FPS figures using the same models. It all has to do with the method used: we used Python, NVIDIA used C++, and Google their TensorFlow and TensorFlow Lite. The Raspberry Pi 3 B+ has a USB 2.0 interface on board. Both neural sticks can handle USB 3.0, which means they could perform faster. The new Raspberry Pi 4 B, on the other hand, does have USB 3.0, which results in a higher FPS compared to its predecessor.
Model | Framework | Raspberry Pi | Raspberry Pi (TF-Lite) | Raspberry Pi + Intel Neural Stick 2 | Raspberry Pi + Google Coral USB | JeVois | Jetson Nano | Google Coral
(244x244) | TensorFlow | 2.4 FPS (Pi 3), 4.3 FPS (Pi 4) | 1.7 FPS (Pi 3), 3 FPS (Pi 4) | 16 FPS (Pi 3) | - | - | 36 FPS | -
MobileNet-v2 (300x300) | TensorFlow | 4.4 FPS (Pi 3), 8 FPS (Pi 4) | 8 FPS (Pi 3), 8.9 FPS (Pi 4) | 30 FPS (Pi 3) | 46 FPS (Pi 3) | 30 FPS | 64 FPS | 130 FPS
SSD Mobilenet-V2 (300x300) | TensorFlow | 2.6 FPS (Pi 3), 4.7 FPS (Pi 4) | 3.7 FPS (Pi 3), 5.8 FPS (Pi 4) | 11 FPS (Pi 3), 41 FPS (Pi 4) | 17 FPS (Pi 3), 55 FPS (Pi 4) | - | 39 FPS | 48 FPS
Binary model (300x300) | XNOR | 6.8 FPS (Pi 3), 12.5 FPS (Pi 4) | - | - | - | - | - | -
Inception V4 (299x299) | PyTorch | - | - | - | 3 FPS (Pi 3) | - | 11 FPS | 9 FPS
Tiny YOLO V3 (416x416) | Darknet | 0.5 FPS (Pi 3), 1 FPS (Pi 4) | 1.1 FPS (Pi 3), 1.9 FPS (Pi 4) | - | - | 1.2 FPS | 25 FPS | -
OpenPose (256x256) | Caffe | - | - | 5 FPS (Pi 3) | - | - | 14 FPS | -
Super Resolution (481x321) | PyTorch | - | - | 0.6 FPS (Pi 3) | - | - | 15 FPS | -
VGG-19 (224x224) | MXNet | 0.5 FPS (Pi 3), 1 FPS (Pi 4) | - | 5 FPS | - | - | 10 FPS | -
Unet (1x512x512) | Caffe | - | - | 5 FPS | - | - | 18 FPS | -

Raspberry Pi and recent alternatives.

Below is a selection of the Raspberry Pi and recent alternatives suitable for implementing deep learning models. Most have extensive GPU or TPU hardware on the chip. Prices are indications (2019).
Raspberry Pi 3 B+
4x Cortex-A53 | VideoCore IV | 24 GOPS | 1.2 GHz - 1 GB | € 40
Parent of all boards. Still one of the most sold. Lots of code and support available.

Raspberry Pi 4 B
4x Cortex-A72 | VideoCore VI | 24 GOPS | 1.5 GHz - 1/2/4 GB | € 40/€ 50/€ 60
The successor to the Raspberry Pi 3 with a faster processor, USB 3.0 and Gigabit Ethernet.

JeVois A33
4x Cortex-A7 | 2x Mali-400 | 1.35 GHz - 256 MB | € 70
A complete 32-bit single-board computer with an integrated 1.3 MP camera. Thanks to its GPU, it is capable of deep learning and other machine vision tasks. It is also very small (24x24 mm).

Intel Neural Stick 2
Intel Movidius Myriad X | 16 SHAVE cores | 1 TOPS | € 87
Special Intel neural-network USB 3.0 dongle for PCs and single-board computers like the Raspberry Pi. Accelerates tensor arithmetic enormously.

Google Coral USB TPU
Edge TPU | 4.0 TOPS | € 70
Only the bare Google Coral Edge TPU with a USB 3.0 interface. Capable of exactly the same as the Coral board.

Rockchip RK3399Pro
2x Cortex-A72 + 4x Cortex-A53 | 4x Mali-T860 | 2.4 TOPS AI core | 1.5 GHz - 6 GB | € 320
8 and 16-bit neural network processing unit.

HiKey970
4x Cortex-A73 + 4x Cortex-A53 | 12x Mali-G71 | 1.92 TOPS AI core | 2.3 + 1.4 GHz - 6 GB | € 320
8-bit neural network processing unit.

Jetson Nano
4x Cortex-A57 | 128x CUDA | 0.472 TFLOPS | 1.43 GHz - 4 GB | € 100
Single-board computer like a Raspberry Pi, with special hardware tensor acceleration by the floating-point CUDA cores.

Sophon BM1880
2x Cortex-A53 + RISC-V | 1.0 TOPS AI core | 1.5 + 1.0 GHz - 1 GB | € 130
8-bit neural network processing unit.

Google Coral
4x Cortex-A53 + 1x Cortex-M4 | GC7000 Lite 3D | 4.0 TOPS AI core | 1.5 + 1.0 GHz - 1 GB | € 160
Raspberry-inspired board with the Edge TPU accelerator.

Google SOM
4x Cortex-A53 + 1x Cortex-M4 | GC7000 Lite 3D | 4.0 TOPS AI core | 1.5 + 1.0 GHz - 1 GB | € 105
Single tiny (40x48 mm) pluggable module with full I/O and the Edge TPU accelerator.