Deep learning algorithms for Raspberry Pi and alternatives.
This page discusses some of the commonly used techniques and algorithms to enhance deep learning applications on a Raspberry Pi or an alternative. Roughly, you can follow two strategies to speed up your network: using additional hardware and/or specific software algorithms.
First, let's look at the fundamental operation of neural networks, the convolution, also known as the tensor product. Ninety per cent of the time, deep learning software is performing this kind of operation. Needless to say, this is where the most profit is to be gained.
As is known, the convolution is a moving matrix multiplication: a kernel slides over an input layer, generating the output layer. Below is a beautiful gif from Rijul Vohra that illustrates it perfectly.
One of the obvious techniques to improve the speed of a deep learning application is the use of additional hardware. There are several options. You can use a dedicated board like the Google Coral, Jetson Nano, Khadas VIM3 or a neural stick. All these products are reviewed on this page.
However, all these solutions force you to adapt your software to the chosen option, which can be labor-intensive.
Another technique is to use the GPU facilities that most SoCs already have on board. Different software libraries can be deployed, such as CUDA, Vulkan or OpenCL. If the chosen framework supports the library, it can make a huge difference in performance. On a disappointing note, the Raspberry Pi has a Broadcom VideoCore IV GPU that is not supported by any of these libraries. This is in contrast to most ARM SoCs with their Mali GPUs.
A GPU consists of hundreds or thousands of identical small arithmetic units. They all execute the same instruction at the same time. Only the data on which the operation takes place can vary. The GPU can, therefore, process a large amount of data within a few clock cycles. Very useful when it comes to matrix or tensor calculations for neural networks.
On the other hand, the GPU architecture is not well suited for if-then-else branches, nor for instructions on individual data members. Also, the GPU has its own memory. The data must first be moved from CPU to GPU memory and, when the operations are complete, returned. This takes time, sometimes more than the time gained by the GPU's parallelism.
As said earlier, to include the GPU functionality in your program, you must use a library. Three possible libraries are CUDA, Vulkan and OpenCL.
As is known, CUDA is specially designed for the NVIDIA GPUs with a CUDA architecture. Because the CUDA development environment fits seamlessly with Visual Studio and has excellent debugging capabilities, it has become the most widely used library in deep learning. In our list, only the Jetson Nano supports CUDA.
Vulkan is a low-level library for a wide range of computer platforms and graphics cards. Although the intended use is the acceleration of rendering graphics, the library can also be used for deep learning. For example, the ncnn and MNN deep learning frameworks use Vulkan.
OpenCL is also a low-level GPU library, used for all kinds of graphics cards. It is an open-source project, just like OpenCV, and is managed by the Khronos Group.
OpenCL aims to support as many different GPUs as possible. To this end, a unique set of low-level routines must be hand-crafted for each GPU architecture. Only in this way can each type of GPU be addressed by the same set of high-level library commands.
The Raspberry Pi uses a Broadcom VideoCore GPU. Despite the huge sales figures, an official OpenCL version has never been written for it.
Almost every ARM SoC has more than one CPU core. It makes sense to employ every available core for the calculation of the tensor products. This technique is called multi-threading. A simple and elegant way to do this is with OpenMP. Almost every modern compiler has OpenMP on board. With one extra line (#pragma omp parallel for), the next for-loop is processed in parallel over all cores.
In the code above, the for-loop is executed in parallel. In the case of a quad-core ARM, like a Raspberry Pi, each core now handles only one-quarter of the array. This makes the loop, in theory, four times faster.
Caution is required when it comes to shared or global variables. These create so-called critical sections: parts of the program where the same memory location can be changed simultaneously by different threads, causing access conflicts. With an additional #pragma omp critical, OpenMP can be instructed to treat the following line as a critical section, allowing only one thread at a time to change the variable.
The number of cores determines the theoretical acceleration. In practice, some compiler overhead will slow down the execution a little. Critical sections, however, are real showstoppers: they make threads wait for each other. This can ultimately result in a multi-threaded program running slower than the single-threaded variant. Fortunately, the tensor calculations of a deep learning program are very well suited to parallelism.
That is the beauty of OpenMP: because it uses #pragma directives, the code can always be compiled, even if OpenMP is not supported. And of course the ease of use. One last remark: you don't have to download and install OpenMP, it comes with the default Raspbian g++/gcc compiler.
The only thing needed to make OpenMP work is the compiler switch -fopenmp.
$ gcc -o My_program -fopenmp My_program.c
Another threading mechanism is POSIX threads, often called pthreads. POSIX threads originate from UNIX platforms but are nowadays found in many operating systems, including Linux. There is even a special version for Windows.
Their functionality is almost identical to OpenMP, with one big difference. Where OpenMP splits loops over the threads, running a copy of the same code in each thread, pthreads allow you to give each thread its own unique functionality. Both the GNU and Clang compilers used by Raspbian support pthreads.
Most ARM cores have special registers for parallel operations, the so-called NEON architecture. A computer program is a large set of instructions. Each instruction tells the CPU which operation to perform on which variables. For example: load A, add B, and save the result in C. In practice, many instructions are followed by identical ones, only with different variables; ergo, the same load-add-store operation, but now with D, E and F. The NEON registers inside the ARM core facilitate these types of sequences with so-called SIMD (Single Instruction Multiple Data) instructions. In other words, instead of an instruction operating on one set of variables, it now works on vectors composed of multiple variables.
The NEON vectors in the Raspberry Pi 4 are 64 bits wide. This gives a series of possible configurations, all of which can be manipulated with one assembly instruction. For clarity, the f(x) operation in the picture below is the same for all elements.
As you can see, working with 8-bit numbers makes your program up to 8 times faster. 16-bit data gives an increase of 4, and 32-bit data (floats) still doubles the performance. There is some small overhead when composing the vectors out of single bytes, just like in the reverse operation, where the vector is unpacked into individual bytes or words. The calculation must, of course, be suitable for vectorization before it can be processed by the NEON assembly. Luckily, deep learning, with its tensor operations, is very well suited for vectorization.
The NEON structure is somewhat similar to the GPU acceleration. Both work with a single instruction on multiple data. However, the GPU only supports a few (matrix) operations, while the NEON assembly has a very flexible instruction set, giving you many different complex operations on the vectors. Needless to say, to get the most out of the NEON architecture, you need state-of-the-art assembly written by highly skilled experts.
It now becomes clear why TensorFlow Lite and other deep learning software convert the weights from floats to 8-bit numbers when running on a Raspberry Pi or another ARM-NEON based device.
Not only can hardware accelerate performance; specific software algorithms can also significantly improve your deep learning network.
The well-known Strassen and Winograd convolution algorithms are discussed here. The whole idea of both algorithms is to replace multiplications with additions, because a CPU can add numbers much faster than it can multiply them.
The first and the most general algorithm is Strassen's algorithm.
Suppose you have two block matrices, A and B, that are multiplied to get matrix C. Below you can see how the math works: eight multiplications and four additions.
The idea of Strassen is to combine the block matrices algebraically in such a way that the number of multiplications is reduced. As you can see in the scheme, there are now only seven multiplications.
Two important remarks. First, the whole scheme can be processed in parallel, since no M-term depends on another. Second, the scheme can easily be expanded to more elements than the two shown here.
If you look at the computational cost of multiplying two n x n matrices, you will see the exponent drop from 3 (the naive O(n^3) method) to log2(7) ≈ 2.81 for Strassen's algorithm.
Winograd and Coppersmith further improved the reduction of multiplications in the matrix convolution with an algebraic scheme.
Here we reduced the multiplications from 6 (the normal dot products) to 4. This scheme can also be expanded to larger matrices. In general, Winograd's minimal filtering algorithm F(m, r) computes m outputs of an r-tap filter with only m + r - 1 multiplications instead of m x r.
Another nice feature is the g-terms. They are all related to the kernel and do not change during the convolution. Therefore, they can all be pre-calculated, which also speeds up the code.
The numbers may seem trivial, but with Winograd properly coded, you can get a performance boost of more than a factor of 2, according to this study from Intel AI.
More and detailed information can be found in Fast Algorithms for Convolutional Neural Networks by Lavin and Gray.
Even more important than the mathematics used is the implementation of the code. Especially if you want to get the most out of simple ARM cores with their limited resources. All major frameworks have special assembly routines for their convolutions.
An example of how sophisticated this type of software development can be is shown below, where all the caches and registers are involved in calculating the Strassen M-terms.