# Deep learning with Raspberry Pi and alternatives in 2019

### Introduction

This page helps you build your deep learning model on a Raspberry Pi or an alternative like Google Coral or Jetson Nano. For more general information about deep learning and its limitations, please see the page on deep learning. This page is not a detailed step-by-step recipe for installing certain software. It deals more with the general principles, so you have a good idea of how it works and on which board your network can run.

### Tensor

A widely used software package for deep learning is TensorFlow. Let's start with the name. What is a tensor?

You can have a list of numbers. This is called a vector in mathematics.

If you add a dimension to this list, you get a matrix.

This way you can, for example, represent a black-and-white image. Each value represents a pixel value. The number of rows equals the height, the number of columns matches the width of the image. If you add yet another dimension to the matrix, you get a tensor.

A stack of 2D matrices on top of each other. Or to put it another way, a matrix in which the individual numbers are replaced by a vector, a list of numbers. An example is an RGB picture. Each individual pixel (element in the matrix) consists of three elements; an R, G and B component. This is the most simplified definition of a tensor, an n-dimensional array of numbers.
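
The three steps above (vector, matrix, tensor) can be sketched with plain nested Python lists. This is only an illustration: the pixel values and the helper `shape` are made up for this example, and real frameworks store tensors far more compactly.

```python
# A vector: a one-dimensional list of numbers.
vector = [1.0, 2.0, 3.0]

# A matrix: a 2x3 greyscale image, one brightness value per pixel.
grey_image = [
    [0, 128, 255],
    [64, 192, 32],
]

# A tensor: the same 2x3 image in RGB, so each pixel is itself a list (R, G, B).
rgb_image = [
    [[255, 0, 0], [0, 255, 0], [0, 0, 255]],
    [[0, 0, 0], [128, 128, 128], [255, 255, 255]],
]


def shape(array):
    """Return the size of each dimension of a regular nested list."""
    dims = []
    while isinstance(array, list):
        dims.append(len(array))
        array = array[0]
    return tuple(dims)


print(shape(vector))      # (3,)
print(shape(grey_image))  # (2, 3)
print(shape(rgb_image))   # (2, 3, 3)
```

The RGB image has shape (height, width, 3): a matrix in which every element is replaced by a vector of three numbers.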

There is a subtle difference in the definition between tensors in TensorFlow and mathematics.

In mathematics, a tensor is not just a collection of numbers in a matrix. Here a tensor must obey certain transformation rules. These rules have to do with altering the coordinate system in which the tensor lives without altering its outcome. Most tensors are 3D and have the same number of elements as a Rubik's Cube. Each individual cube predicts how a physical object will deform under stress (tensor) by a set of orthogonal vectors.

If the observer takes another position in the real world, the deformations of the object itself don’t change; obviously, it is still the same object. However, all your vectors or formulas will change given this new position. They will change in such a way that the result of a deformation still remains the same. Think of it as the distance between the tops of two towers. Wherever you stand, that will not change. Drawing vectors from your position to those tops will, however, shift according to your position, your origin.
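
The tower analogy can be checked with a few lines of Python. The coordinates and observer positions are hypothetical, and the sketch only covers a shift of the origin, not a rotation of the axes.

```python
import math

# Positions of the two tower tops in some coordinate system (hypothetical values).
tower_a = (10.0, 0.0, 30.0)
tower_b = (40.0, 40.0, 30.0)


def seen_from(point, observer):
    """Vector from the observer to the point."""
    return tuple(p - o for p, o in zip(point, observer))


def distance_between_tops(observer):
    """Distance between the tops, computed from the observer's own vectors."""
    a = seen_from(tower_a, observer)
    b = seen_from(tower_b, observer)
    return math.dist(a, b)


# The individual vectors differ per observer, the distance does not.
print(seen_from(tower_a, (0.0, 0.0, 0.0)))       # (10.0, 0.0, 30.0)
print(seen_from(tower_a, (5.0, -20.0, 2.0)))     # (5.0, 20.0, 28.0)
print(distance_between_tops((0.0, 0.0, 0.0)))    # 50.0
print(distance_between_tops((5.0, -20.0, 2.0)))  # 50.0
```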

There is even a third meaning in this context of a tensor, a neural tensor network.

The tensor in this special neural network establishes a relationship between two entities. A dog *has* a tail, a dog *is* a mammal, a mammal *needs* oxygen, etc.

The last two definitions are only given for completeness. Many people think that TensorFlow has something to do with one of these interpretations. This is not the case.

### Weight matrix

The most important building block of TensorFlow and other deep-learning software is the n-dimensional array. This section explains the use of these arrays. Every deep learning application consists of a given topology of neural nodes. Each neural node is usually constructed as shown below.

Each input is multiplied by a weight and added together. Together with a bias, the result goes to an activation function φ. This can be a simple step operation or a more complex function such as a hyperbolic tangent.
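
A minimal sketch of such a node, assuming a hyperbolic tangent as the activation function φ. The input, weight and bias values are hypothetical.

```python
import math


def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through tanh."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return math.tanh(weighted_sum)


# Hypothetical values: four inputs, four weights and one bias.
output = neuron([1.0, 0.5, -1.0, 2.0], [0.2, -0.4, 0.1, 0.3], bias=0.1)
print(output)  # a value between -1 and 1
```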

The output is the input for the next layer in the network. A network can be made of many layers, each with thousands of individual neurons.

If you look at one layer, the same input array can be applied to different weight arrays, each with a different result, so that various features can be extracted from a single input.

In the network above, four inputs (yellow) are all fully connected to the four neurons (blue) of the first layer. These are wired to the five neurons of the next layer, followed by another inner layer of six neurons. After two consecutive layers of four and three neurons, the output (orange) with three channels is reached.
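
The topology just described can be sketched as a chain of fully connected layers. The weights are random and untrained, and tanh as activation is an assumption; the point is only how the layer sizes fit together.

```python
import math
import random

random.seed(0)  # reproducible, untrained example weights

# Layer sizes from the text: 4 inputs, inner layers of 4, 5, 6, 4 and 3 neurons, 3 outputs.
layer_sizes = [4, 4, 5, 6, 4, 3, 3]


def random_layer(n_in, n_out):
    """One row of weights (plus one bias) per neuron in the layer."""
    weights = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [random.uniform(-1, 1) for _ in range(n_out)]
    return weights, biases


def forward(layer, inputs):
    """Apply every weight row to the same inputs, add the bias, then activate."""
    weights, biases = layer
    return [
        math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


layers = [random_layer(n_in, n_out) for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

signal = [0.5, -0.2, 0.1, 0.9]  # the four input values
for layer in layers:
    signal = forward(layer, signal)

print(len(signal))  # 3
```

Every call to `forward` applies all weight rows of a layer to the same input, which is why the work per layer boils down to one array operation.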

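Such a scheme results in a vector-matrix multiplication.

$$
\begin{pmatrix} x' \\ y' \\ z' \end{pmatrix} =
\begin{pmatrix} a & b & c & d \\ e & f & g & h \\ i & j & k & l \end{pmatrix}
\begin{pmatrix} x \\ y \\ z \\ w \end{pmatrix}
$$

Here the input layer of four values (*x, y, z, w*) is multiplied with the weight matrix. The weights *a, b, c, d* are applied to the inputs, resulting in *x'* at the output. The weights *e, f, g, h* give the *y'* output, and so on. There are other ways to describe this multiplication, like

$$
\mathbf{v}' = W \mathbf{v}
$$

where **v** is the input vector (*x, y, z, w*), **v'** the output (*x', y', z'*) and *W* the weight matrix. The vector-matrix multiplication is one of the most performed operations in TensorFlow, hence the name.

### GPU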

Before all dots are put together, first a little detour in GPU hardware. GPU stands for Graphics Processing Unit, a device initially designed to relieve the CPU from the dreary screen rendering task. Over the years the GPUs became much more powerful. Nowadays they have over 21 billion transistors and are capable of performing massively parallel computations. Especially in games where every pixel on the screen is calculated, these calculation capabilities are needed. When moving the position of the viewer, for example, when the hero starts running, all vertices must be recalculated. And this 25 times per second to get smooth transitions. Each vertex needs a rotation and a translation. The formula is:

$$
\begin{pmatrix} x' \\ y' \\ z' \\ w' \end{pmatrix} =
\begin{pmatrix} a & b & c & d \\ e & f & g & h \\ i & j & k & l \\ m & n & o & p \end{pmatrix}
\begin{pmatrix} x \\ y \\ z \\ w \end{pmatrix}
$$

Here (*x, y, z, w*) is the initial pixel position in 3D and (*x', y', z', w'*) is the new position after the matrix operation. As you can see, this type of arithmetic is the same as for a neural network.

There is another point of interest. When you look at *x'*, it is the summation of four products (*ax+by+cz+dw*). *y'*, on the other hand, is also a summation (*ex+fy+gz+hw*). But to calculate *y'* one does not need to know the values that determine *x'* (*a, b, c,* and *d*). They have no bearing on each other. You can calculate *x'* at the same time as *y'*, and *z'* and *w'* for that matter. In theory, every calculation that does not depend on other outcomes can be performed at the same time. Hence the very parallel architecture of the GPU. The fastest GPUs today (2019) are capable of a whopping 125 TFLOPS.

This is the whole idea behind GPU acceleration. Transfer all tensors to the GPU memory and have the device perform all vector-matrix calculations in a fraction of the time it would cost the CPU. Without the impressive GPU calculation power, deep learning would hardly be possible.

### TPU

Driven by the huge market potential of deep learning, some manufacturers replaced the GPU with a TPU, a Tensor Processing Unit. In addition to the vector-matrix multiplication, the GPU also has other tasks to do such as vertex interpolation and shading, H264 compression, driving HDMI monitors, etc. By using all transistors solely for tensor dot products, the throughput increases while the power consumption decreases. The first generation only works with 8-bit integers, the later ones also with floating-point numbers. The TPUs on the embedded boards below are all integer based except the Jetson Nano. Read an in-depth article here.

### GPU Pitfalls

There are a few points about GPU arithmetic that must be taken into account.

To begin, stick to the matrices. GPU architecture is designed for that kind of operation. Writing an extensive if-else structure is disastrous for a GPU and the overall performance.

Another point is that memory transfers cost a lot of efficiency. More and more, the transfer of data between the CPU memory (where the images are usually located) and the GPU memory is becoming a serious bottleneck. You read the same over and over again in every NVIDIA document: the larger the vector-matrix dot product, the faster it will be executed.

In this regard, keep in mind that the Raspberry Pi and its alternatives usually have one large RAM for both the CPU and the GPU. They simply share the same DDR3 chip(s). Your neural network must not only fit in the program memory, but it must also leave space in the RAM so that the CPU kernel can run. This can sometimes impose restrictions on the network or the number of objects to be recognized. Choosing another board with more RAM may be the only solution in that case. All this contrasts with the graphics card in a PC, where the GPU has its own memory bank.
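
A rough back-of-the-envelope check can show whether a network fits. The sketch below counts only the weights, and both the 60 million weights and the 1 GB of shared RAM are hypothetical example figures. Intermediate results, the program itself and the operating system all need room on top of this.

```python
def model_memory_mb(num_weights, bytes_per_weight):
    """Memory needed to hold the weights alone, in megabytes."""
    return num_weights * bytes_per_weight / (1024 * 1024)


# Hypothetical network with 60 million weights on a board with 1 GB of shared RAM.
num_weights = 60_000_000
ram_mb = 1024

for name, size in (("float32", 4), ("float16", 2), ("int8", 1)):
    needed = model_memory_mb(num_weights, size)
    print(f"{name}: {needed:.0f} MB, {100 * needed / ram_mb:.0f}% of the RAM")
```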

Another distinction is that the GPU on a video card works with floats or half floats, sometimes also called small floats. The embedded GPU on the Raspberry Pi or the TPU on the alternative boards works with 8 or 16-bit integers. Your neural network must be adapted to these formats. If this is not possible, choose another board with floating-point arithmetic like the Jetson Nano.

A last piece of advice: don't overclock the GPU. GPUs normally work at a lower frequency than the CPU. Some Mali GPUs in ARM cores run as low as 400 MHz. Overclocking can work in the winter, but the application may falter mid-summer. Remember, it's your vision application at your client that suddenly crashes, not a game you simply restart.

And of course, the comments on the page about computer vision on the Raspberry also apply here.

### Showstopper.

You cannot train a deep learning model on a Raspberry Pi or an alternative. Not if you haven't planned a trip around the world. The boards lack the computing capacity to perform the huge amount of floating-point multiply-adds required during training. Even a Google Coral cannot train a network because the TPU on this board works only with special pre-compiled TensorFlow networks. Only the last layer in a network can be changed slightly. And although the Jetson Nano has floating-point CUDA cores, it is still not very well able to train a network in an acceptable time. NVIDIA's advice here is to let it run overnight. So, in the end, you can only import and run an already trained model on these boards.

### Practice.

The first step is to install an operating system, usually a Linux derivative such as Ubuntu or Debian. That is the easy part.

The hard part is installing your deep learning model. You have to figure out if any additional libraries (OpenCV) or drivers (GPU support) are needed. Please note that only the Jetson Nano supports CUDA, a package most deep learning software on a PC uses. All other boards need different GPU support if you want to accelerate the neural network. The development of GPU drivers for Raspberry Pi or the alternatives is an ongoing process. Check the communities on the net.

The last step is reducing the neural network to acceptable proportions. The famous AlexNet originally requires 2.3 billion floating-point operations per single frame. This will never run fast on a simple single ARM computer or mobile device. Most models have some sort of reduction strategy. YOLO has Tiny YOLO, Caffe has Caffe2 and TensorFlow has TensorFlow Lite. They all use one or more of the following techniques.

Reduce the input size. Smaller images save a lot of computations on the first layers.

Decrease the number of objects to classify; it trims the sizes of many internal layers.

Port the neural network from floats to bytes where possible. This also lowers the memory load considerably.

Another strategy is the reduction of the floats to single bits, an XNOR network. This fascinating idea is discussed here.
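
To give an idea of the float-to-byte technique listed above, here is a much simplified sketch that maps a handful of hypothetical float weights onto signed 8-bit integers with one shared scale factor. It assumes at least one weight is non-zero. The converters that ship with the real frameworks are considerably more elaborate, so treat this as an illustration of the principle only.

```python
def quantize(weights):
    """Map floats to signed 8-bit integers with one shared scale factor."""
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale


def dequantize(quantized, scale):
    """Recover approximate float values from the 8-bit integers."""
    return [q * scale for q in quantized]


weights = [0.82, -1.27, 0.05, 0.4]  # hypothetical float weights
quantized, scale = quantize(weights)
restored = dequantize(quantized, scale)

print(quantized)  # [82, -127, 5, 40]
# The rounding error per weight is at most half a quantization step.
print(max(abs(w - r) for w, r in zip(weights, restored)))
```

Each weight now takes one byte instead of four, at the price of a small rounding error.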

### Comparison of Raspberry Pi and alternatives.

Here is the comparison of Jetson Nano vs Google Coral vs Intel Neural Stick. The three odd ones out in the list are the JeVois, the Intel Neural Stick, and the Google Coral USB accelerator. The first has a camera onboard and can do a lot as you can read here.

The Intel Neural Stick and the Google Coral accelerator are USB dongles with a special TPU chip performing all tensor calculations. The Intel Neural Stick comes with a toolset to migrate a TensorFlow, Caffe or MXNet model into a working Intermediate Representation (IR) image for the Neural Stick.

The Google Coral works with special pre-compiled TensorFlow Lite networks. If the topology of the neural network and its required operations can be described in TensorFlow, it may work well on the Google Coral. However, with its scarce 1 GB of RAM, memory shortage can still be an issue.

The Google USB accelerator has its own special back-end compiler, which converts a TensorFlow Lite file to an executable model for the dongle TPU.

The Jetson Nano is the only single-board computer with floating-point GPU acceleration. It supports most models because all frameworks such as TensorFlow, Caffe, PyTorch, YOLO, MXNet, and others use the CUDA GPU support library at some point. The price is also very competitive. This has everything to do with the booming deep learning market where NVIDIA does not want to lose its prominent role.

Not all models can run on every device, most of the time due to memory shortage or incompatibility in hardware and/or software. In these scenarios, several solutions are possible. However, they will be time-consuming to develop and often the results will be disappointing.

Benchmarks are always subject to discussion. Others may measure a different FPS using the same models. It all has to do with the method used. We used Python, NVIDIA used C++, and Google used their TensorFlow and TensorFlow Lite. The Raspberry Pi 3 B+ has a USB 2.0 interface onboard. Both neural sticks can handle USB 3.0, which means that they could perform faster. The new Raspberry Pi 4 B, on the other hand, has USB 3.0, which will result in a higher FPS compared to its predecessor.

### EfficientNet.

EfficientNets are a family of network topologies exclusively tailored for the Coral Edge TPU. As can be read here, the Edge TPU hardware is specially designed to accelerate MAC (multiply-accumulate) operations. Only this specific operation can be done amazingly fast on a TPU. All other operations like loading weights, subtractions, additions or dimension reduction are time-consuming. To get the maximum out of the TPU hardware, these operations must be reduced to the bare minimum.

And that is exactly the strategy behind EfficientNets. Prefer one large 3x3 convolution over a sequence of a small 1x1 and a 3x3 convolution, which needs double the loading time. Don't re-use outputs of previous layers as ResNet does, because that requires separate additions. Use only simple activation functions like ReLU, which are hard-wired in the TPU architecture. All of this results in a fast deep learning network.

Remember, by the way, that the numbers shown are purely the time it takes to execute from input to output. No other processes are taken into account like capturing and scaling images.

| Model | Framework | Raspberry Pi (using TF-Lite) | Raspberry Pi (our NCNN) | Raspberry Pi + Intel Neural Stick 2 | Raspberry Pi + Google Coral USB | JeVois | Jetson Nano | Google Coral |
|---|---|---|---|---|---|---|---|---|
| (224x224) | TensorFlow | 14.6 FPS (Pi 3), 25.8 FPS (Pi 4) | - | 95 FPS (Pi 3), 180 FPS (Pi 4) | 105 FPS (Pi 3), 200 FPS (Pi 4) | - | 216 FPS | 200 FPS |
| (244x244) | TensorFlow | 2.4 FPS (Pi 3), 4.3 FPS (Pi 4) | 1.7 FPS (Pi 3), 3 FPS (Pi 4) | 16 FPS (Pi 3), 60 FPS (Pi 4) | 10 FPS (Pi 3), 18.8 FPS (Pi 4) | - | 36 FPS | 18.8 FPS |
| MobileNet-v2 (300x300) | TensorFlow | 4.4 FPS (Pi 3), 8 FPS (Pi 4) | 8 FPS (Pi 3), 8.9 FPS (Pi 4) | 30 FPS (Pi 3) | 46 FPS (Pi 3) | 30 FPS | 64 FPS | 130 FPS |
| SSD MobileNet-v2 (300x300) | TensorFlow | 2.6 FPS (Pi 3), 4.7 FPS (Pi 4) | 3.7 FPS (Pi 3), 5.8 FPS (Pi 4) | 11 FPS (Pi 3), 41 FPS (Pi 4) | 17 FPS (Pi 3), 55 FPS (Pi 4) | - | 39 FPS | 48 FPS |
| Binary model (300x300) | XNOR | 6.8 FPS (Pi 3), 12.5 FPS (Pi 4) | - | - | - | - | - | - |
| Inception V4 (299x299) | PyTorch | - | - | - | 3 FPS (Pi 3) | - | 11 FPS | 9 FPS |
| Tiny YOLO V3 (416x416) | Darknet | 0.5 FPS (Pi 3), 1 FPS (Pi 4) | 1.1 FPS (Pi 3), 1.9 FPS (Pi 4) | - | - | 2.2 FPS | 25 FPS | - |
| OpenPose (256x256) | Caffe | - | - | 5 FPS (Pi 3) | - | - | 14 FPS | - |
| Super Resolution (481x321) | PyTorch | - | - | 0.6 FPS (Pi 3) | - | - | 15 FPS | - |
| VGG-19 (224x224) | MXNet | 0.5 FPS (Pi 3), 1 FPS (Pi 4) | - | 5 FPS | - | - | 10 FPS | - |
| Unet (1x512x512) | Caffe | - | - | 5 FPS | - | - | 18 FPS | - |
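
For readers who want to take comparable measurements, the sketch below shows one possible way to time only the execution from input to output. It is not the script used for the table above. `dummy_inference` is a hypothetical stand-in for a real network, and the untimed warm-up run is our own assumption.

```python
import time


def measure_fps(run_inference, frame, repeats=50):
    """Average frames per second of run_inference alone, on a prepared frame."""
    run_inference(frame)  # warm-up run, not timed
    start = time.perf_counter()
    for _ in range(repeats):
        run_inference(frame)
    elapsed = time.perf_counter() - start
    return repeats / elapsed


def dummy_inference(frame):
    """Hypothetical stand-in for a real network; just burns some time."""
    return sum(v * v for v in frame)


frame = [0.5] * 10_000  # captured and scaled beforehand, so not part of the timing
print(f"{measure_fps(dummy_inference, frame):.1f} FPS")
```

Capturing and scaling the image happen before the timer starts, so they do not influence the result.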

### Raspberry Pi and our deep learning framework.

We have placed a deep learning library and several deep learning networks on GitHub. Together with the simple C++ example code, you can build your own deep learning application on a bare Raspberry Pi. It is extremely user friendly. More information on this page.

### Raspberry Pi and recent alternatives.

Below is a selection of the Raspberry Pi and recent alternatives suitable for implementing deep learning models. Most have extensive GPU or TPU hardware on the chip. Prices are indications (2019).

Raspberry Pi 3 B+ 4x Cortex-A53 VideoCore IV 24 GOPS 1.2 GHz - 1 GB ★☆☆☆☆ (16) € 40 Parent of all boards. Still one of the most sold. Lots of code and support available.

Raspberry Pi 4 B 4x Cortex-A72 VideoCore IV 24 GOPS 1.5 GHz - 1/2/4 GB ★★☆☆☆ (30) € 40/€ 50/€ 60 The successor to the Raspberry Pi 3 with a slightly faster processor, USB 3.0 and Gigabit Ethernet.

JeVois 4x Cortex-A7 2x Mali-400 1.35 GHz - 256 MB ★★☆☆☆ (22) € 70 A complete 32-bit single board computer with an integrated 1.3 MP camera. Thanks to its GPU, it is capable of deep learning and other machine vision tasks. It's also very small (32x40 mm). Read more here.

Intel Neural Stick 2 Intel Movidius Myriad X 16 SHAVE cores 1 TOPS ☆☆☆☆☆ ★★★☆☆ € 87 Special Intel neural network USB 3 dongle for PC and single boards like Raspberry Pi. Accelerates tensor arithmetic enormously.

Google Coral USB Edge TPU 4.0 TOPS ☆☆☆☆☆ ★★★★☆ € 70 Only the bare Google Coral Edge TPU with a USB 3.0 interface. Capable of exactly the same as the Coral board.

RK1808 NPU Rockchip AI core 3.0 TOPS ☆☆☆☆☆ ★★★☆☆ € 78 The neural processor unit from the Rockchip in a USB 3 dongle. It also has 1 GB RAM and 8 GB eMMC storage on board.

Khadas VIM3 4x Cortex-A73 2x Cortex-A53 ARM G52 MP4 GPU 5 TOPS NPU 2.2 GHz - 2/4 GB ★★★★☆ ★★★☆☆ € 99 Superior Raspberry replacement with an 8 and 16-bit neural network processing unit which can, in theory, be faster than Google Coral. Now waiting for good support.

Rockchip RK3399Pro 6x Cortex-A53 4x Mali-860 2.4 TOPS NPU 1.5 GHz - 6 GB ★★★★☆ ★★★☆☆ € 320 8 and 16-bit neural network processing unit.

HiKey970 4x Cortex-A73 + 4x Cortex-A53 12x Mali-G71 1.92 TOPS NPU 2.3+1.4 GHz - 6 GB ★★★★★ ★★★☆☆ € 320 8-bit neural network processing unit.

Jetson Nano 4x Cortex-A57 128x CUDA 0.472 TOPS 1.43 GHz - 4 GB ★★★☆☆ ★★★★☆ € 100 Single board computer like a Raspberry with special hardware tensor acceleration by the floating-point CUDA cores.

Sophon BM1880 2x Cortex-A53 + RISCV - 1.0 TOPS AI Core 1.5 + 1.0 GHz - 1 GB ★★☆☆☆ ★★☆☆☆ € 130 8-bit neural network processing unit.

Google Coral 4x Cortex-A53 + 1x Cortex M4 GC7000 Lite 3D 4.0 TOPS NPU 1.5 + 1.0 GHz - 1 GB ★★★☆☆ ★★★☆☆ € 160 Raspberry-inspired board with the Edge TPU accelerator.

Google SOM 4x Cortex-A53 + 1x Cortex M4 GC7000 Lite 3D 4.0 TOPS NPU 1.5 + 1.0 GHz - 1 GB ★★★☆☆ ★★★☆☆ € 105
