Deep learning software for Raspberry Pi and alternatives - Q-engineering

Deep learning software for Raspberry Pi and alternatives in 2020.


This page is under construction. A comprehensive study of deep learning software for a bare Raspberry Pi 4.
A technique all deep learning software uses is multi-threading. Every available core is employed for the calculation of the tensor products. A simple and elegant way to do this is with OpenMP, which almost every modern compiler has on board. With one extra line (#pragma omp parallel for), the for-loop that follows is divided over all cores and processed in parallel.
						#pragma omp parallel for
						for(int i=0; i<100000;i++){
						    Ar[i]+=1.0;   //Add a bias to the array of floats
						}
In the above code, the for-loop is executed in parallel. On a quad-core ARM, like the one in a Raspberry Pi, each core now handles only one-quarter of the array. In theory, this makes the loop four times faster.
Caution is required when it comes to shared or global variables. They create so-called critical sections: parts of the program where the same memory location can be changed simultaneously by different threads, which leads to access conflicts. With an additional #pragma omp critical, OpenMP is instructed to treat the following statement as a critical section, allowing only one thread at a time to change the variable.
						float Max=0.0;
						#pragma omp parallel for
						for(int i=0; i<100000;i++){
						    //find the highest value in the array and 
						    //place it in the global variable Max
						    if(Ar[i]>Max){
						        //run the check twice in case another 
						        //thread has updated Max in the mean time.
						        #pragma omp critical
						        if(Ar[i]>Max) Max=Ar[i];
						    }
						}
The number of cores determines the theoretical acceleration. In practice, some overhead code added by the compiler will slow down the execution a little. Critical sections, however, are real showstoppers because they make threads wait for each other. This can ultimately result in a multi-threaded program running slower than the single-threaded variant. Fortunately, the tensor calculations of a deep learning program are very well suited to parallelism.
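The waiting in the maximum search above can often be avoided altogether. The sketch below swaps the critical section for a reduction clause, a different technique than the one shown earlier. It assumes a compiler with OpenMP 3.1 or newer, since older versions do not offer the max reduction in C. The function name array_max is ours and purely illustrative.

```c
#include <stddef.h>

/* Returns the largest value in ar[0..n-1].
   Like the example above, the search starts at 0.0, so the result is
   0.0 when n is 0 or when every element is negative. */
float array_max(const float *ar, size_t n)
{
    float best = 0.0f;
    /* Each thread keeps a private copy of best; OpenMP merges the
       copies with the max operator when the loop ends, so no thread
       has to wait for another inside the loop. */
    #pragma omp parallel for reduction(max:best)
    for (long i = 0; i < (long)n; i++) {
        if (ar[i] > best) best = ar[i];
    }
    return best;
}
```

Calling array_max(Ar, 100000) gives the same result as the loop with the critical section, but the threads only synchronize once, at the end.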

By using #pragma directives, the code can always be compiled, even if OpenMP is not supported; the compiler then simply ignores the directives and generates single-threaded code. That is the beauty of OpenMP, along with its ease of use. One last remark: you don't have to download and install OpenMP, it comes with the default Raspbian g++/gcc compiler.
The only thing needed to make OpenMP work is to set the compiler switch -fopenmp.
$ gcc -o My_program -fopenmp My_program.c
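To check whether the switch was actually picked up, a program can test the standard _OPENMP macro, which the compiler defines only when OpenMP is enabled. The following is a minimal sketch; the function names available_threads and add_bias are ours, chosen for illustration.

```c
#ifdef _OPENMP
#include <omp.h>   /* only present and needed when OpenMP is enabled */
#endif

/* Number of threads a parallel region may use.
   Returns 1 when the file is compiled without -fopenmp. */
int available_threads(void)
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

/* Adds bias to every element of ar[0..n-1].
   Without -fopenmp the pragma is ignored and the loop runs serially. */
void add_bias(float *ar, int n, float bias)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        ar[i] += bias;
    }
}
```

The same source file builds with and without -fopenmp. Only the value returned by available_threads() differs.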


A GPU consists of hundreds or thousands of identical small arithmetic units. They all execute the same instruction at the same time. Only the data on which the operation takes place can vary. The GPU can, therefore, process a large amount of data within a few clock cycles. Very useful when it comes to matrix or tensor calculations for neural networks.
On the other hand, the GPU architecture is not well suited for if-then-else branches, nor for instructions on individual data members. Also, the GPU has its own memory. The data must first be moved from the CPU memory to the GPU memory and, when the operations are complete, moved back again. This takes time, sometimes more than was gained by the GPU parallelism in the first place.
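The trade-off can be written down as simple arithmetic. The sketch below is only an illustration: the function gpu_pays_off is hypothetical, and the timings must come from your own measurements.

```c
/* Returns 1 when offloading to the GPU is expected to be faster than
   staying on the CPU, and 0 otherwise.
   All arguments are durations in the same unit, e.g. milliseconds:
     cpu_time         - time the CPU needs for the whole calculation
     gpu_compute_time - time the GPU needs for the calculation itself
     upload_time      - time to copy the input from CPU to GPU memory
     download_time    - time to copy the result back to CPU memory */
int gpu_pays_off(double cpu_time, double gpu_compute_time,
                 double upload_time, double download_time)
{
    double gpu_total = upload_time + gpu_compute_time + download_time;
    return gpu_total < cpu_time;
}
```

With made-up numbers: a job that takes 10 ms on the CPU and 1 ms on the GPU, plus 2 ms for each transfer, is worth offloading (5 ms in total). A job that takes only 3 ms on the CPU is not.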
To include the GPU functionality in your program, you must use a library. Three possible libraries are CUDA, Vulkan and OpenCL.
As is known, CUDA is specially designed for the NVIDIA GPUs with a CUDA architecture. Because the CUDA development environment fits seamlessly with Visual Studio and has excellent debugging capabilities, it has become the most widely used library in deep learning. In our list, only the Jetson Nano supports CUDA.
Vulkan is a low-level library for a wide range of computer platforms and graphic cards. Although the intended use is the acceleration of rendering graphics, the library can also be used for deep learning. For example, ncnn deep learning software uses Vulkan.
OpenCL is also a low-level GPU library, used for all kinds of graphics cards. It is an open standard, managed by the Khronos Group.
OpenCL aims to support as many different GPUs as possible. To this end, a unique set of low-level software must be hand-crafted for each GPU architecture. Only in this way can each type of GPU be addressed by the same set of high-level library commands.
The MALI GPU, used in many ARM processors, is fully supported. See our list here of Raspberry Pi alternatives with a MALI GPU on board. Downloads and other information can be found here.
The Raspberry Pi uses a Broadcom VideoCore GPU. Despite the huge sales figures, no official OpenCL version has been written for it yet.
However, the GitHub project of Daniel Steadelmann can be used as a substitute. Install instructions are found on this page.


A computer program is a large set of instructions. Each instruction tells the CPU which operation to perform on which variables. For example, load A and add B, save the result in C. In practice, many operations are followed by identical instructions, only with different variables: the same load-add-store operation, but now with D, E and F. The NEON registers inside the ARM core facilitate these types of sequences with so-called SIMD (Single Instruction Multiple Data) instructions. In other words, instead of operating on one set of variables, an instruction now works on vectors composed of several variables.
The NEON vectors shown here for the Raspberry Pi 4 are 64 bits wide (the NEON unit also supports 128-bit vectors). This gives a series of possible configurations, all of which can be manipulated with one assembly instruction. For clarity, the f(x) operation in the picture below is the same for all elements.
As you can see, working with 8-bit numbers can make your program up to 8 times faster. 16-bit data gives a gain of up to 4, and 32-bit data (floats) still doubles the performance. There is a little overhead when composing the vectors out of single bytes, and likewise for the reverse operation, where the vector is stored to separate byte or word locations. The calculation problem must, of course, be suitable for vectorization before it can be processed by NEON assembly. Luckily, deep learning, with its tensor operations, is very well suited for vectorization.
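You don't have to write assembly to try this out. As a minimal sketch, the function below uses the NEON intrinsics from arm_neon.h to add two byte arrays, eight elements per instruction in a 64-bit vector. The function name add_u8 is ours. The code falls back to a plain loop when NEON is not available, and on 32-bit ARM the compiler typically needs a switch such as -mfpu=neon before it enables the NEON path.

```c
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Element-wise addition of two byte arrays: dst[i] = a[i] + b[i].
   Sums above 255 wrap around, in both the NEON and the scalar path. */
void add_u8(uint8_t *dst, const uint8_t *a, const uint8_t *b, int n)
{
    int i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Eight 8-bit lanes fit in one 64-bit vector,
       so a single add instruction handles eight elements. */
    for (; i + 8 <= n; i += 8) {
        uint8x8_t va = vld1_u8(a + i);      /* load 8 bytes from a */
        uint8x8_t vb = vld1_u8(b + i);      /* load 8 bytes from b */
        vst1_u8(dst + i, vadd_u8(va, vb));  /* add and store 8 bytes */
    }
#endif
    /* Remaining elements, or the whole array on non-NEON targets. */
    for (; i < n; i++) {
        dst[i] = (uint8_t)(a[i] + b[i]);
    }
}
```

The load and store intrinsics are where the overhead mentioned above shows up: the data must first be gathered into a vector and written back afterwards.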
The NEON structure is somewhat similar to GPU acceleration. Both work with a single instruction on multiple data. However, the GPU only supports a few (matrix) operations, while NEON has a very flexible instruction set, giving you many different complex operations on the vectors. Needless to say, to get the most out of the NEON architecture, you need state-of-the-art assembly written by highly skilled experts.
It now becomes clear why TensorFlow Lite and other deep learning software convert the weights from floats to 8-bit numbers when running on a Raspberry Pi or another ARM-NEON based device.
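To give an idea of what such a conversion involves, here is a minimal sketch of a simple symmetric quantization. It is a generic textbook scheme, not necessarily the one TensorFlow Lite or any other framework uses, and the function name quantize_int8 is ours.

```c
#include <stdint.h>

/* Converts the float weights w[0..n-1] to 8-bit integers q[0..n-1]
   in the range [-127, 127]. Returns the scale factor, so that
   w[i] is approximately q[i] * scale. */
float quantize_int8(const float *w, int8_t *q, int n)
{
    /* Find the largest absolute weight. */
    float max_abs = 0.0f;
    for (int i = 0; i < n; i++) {
        float a = (w[i] < 0.0f) ? -w[i] : w[i];
        if (a > max_abs) max_abs = a;
    }
    /* Map max_abs to 127; use 1.0 when all weights are zero
       to avoid a division by zero. */
    float scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
    for (int i = 0; i < n; i++) {
        float x = w[i] / scale;
        /* Round to the nearest integer, halves away from zero. */
        q[i] = (int8_t)((x >= 0.0f) ? x + 0.5f : x - 0.5f);
    }
    return scale;
}
```

The price is a small rounding error per weight. In return, eight weights fit in one 64-bit NEON vector instead of two floats.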


In the following section we will discuss some frameworks that can be used by the Raspberry Pi or its alternatives.
OpenCV has recently gained an excellent deep learning module. The OpenCV DNN module runs a wide range of models: TensorFlow, Caffe, Torch, Darknet or ONNX. It only runs already trained models, so it is not possible to train networks with new data. A nice feature is that it has no other dependencies; only the OpenCV library is used when running your model. As is known, OpenCV can run on many different computers, from a Raspberry Pi or alternative to a Windows or Linux PC. All of them can now be used to run deep learning models. On top of that, hardware acceleration is possible: OpenCV DNN supports CUDA and OpenCL. More information and software examples can be found at our page here.