Deep learning with RPi and alternatives - Q-engineering

Deep learning with Raspberry Pi and alternatives in 2024

Last updated: February 23, 2024


This page helps you build your deep learning model on a Raspberry Pi or an alternative like Google Coral or Jetson Nano. For more general information about deep learning and its limitations, please see deep learning. This page deals more with the general principles, so you have a good idea of how it works and on which board your network can run. Detailed step-by-step recipes for installing the software can be found at deep learning software for Raspberry Pi 4 and alternatives.
A widely used software package for deep learning is TensorFlow. Let's start with the name. What is a tensor?
You can have a list of numbers. This is called a vector in mathematics.


If you add a dimension to this list, you get a matrix.


This way you can, for example, display a black and white image. Each value represents a pixel value. The number of rows equals the height, the number of columns matches the width of the image. If you add yet again an extra dimension to the matrix, you get a tensor.


A stack of 2D matrices on top of each other. Or to put it another way, a matrix in which the individual numbers are replaced by a vector, a list of numbers. An example is an RGB picture. Each individual pixel (element in the matrix) consists of three elements; an R, G and B component. This is the most simplified definition of a tensor, an n-dimensional array of numbers.
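The step from vector to matrix to tensor can be sketched with nested Python lists. This is only an illustration of the "n-dimensional array" idea, not how TensorFlow stores its data; the helper `shape` and all values are assumptions made up for this example.

```python
# Nested Python lists standing in for n-dimensional arrays.
vector = [1, 2, 3]                      # 1D: a list of numbers
matrix = [[0, 255, 0],
          [255, 0, 255]]                # 2D: a 2x3 grayscale image (rows x columns)
# 3D: a 2x2 RGB image, every pixel is a list of three values (R, G, B)
rgb_image = [[[255, 0, 0], [0, 255, 0]],
             [[0, 0, 255], [255, 255, 255]]]


def shape(array):
    """Return the size of every dimension of a regular nested list."""
    dims = []
    while isinstance(array, list):
        dims.append(len(array))
        array = array[0]
    return tuple(dims)


print(shape(vector))     # (3,)
print(shape(matrix))     # (2, 3)
print(shape(rgb_image))  # (2, 2, 3)
```

The last shape reads as height, width and the three colour components per pixel.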
There is a subtle difference in the definition between tensors in TensorFlow and mathematics.
In mathematics, a tensor is not just a collection of numbers in a matrix. Here a tensor must obey certain transformation rules. These rules have to do with altering the coordinate system in which the tensor lives without altering its outcome. Most tensors are 3D and have the same number of elements as a Rubik's Cube. Each individual cube predicts how a physical object will deform under stress (tensor) by a set of orthogonal vectors.
If the observer takes another position in the real world, the deformations of the object itself don't change; obviously, it is still the same object. However, all your vectors or formulas will change given this new position. They will change in such a way that the result of a deformation still remains the same. Think of it as the distance between the tops of two towers. Wherever you stand, that will not change. Vectors drawn from your position to those tops will, however, shift according to your position, your origin.
Tensor matrix
There is even a third meaning in this context of a tensor, a neural tensor network.
The tensor in this special neural network establishes a relationship between two entities. A dog has a tail, a dog is a mammal, a mammal needs oxygen, etc.

Neural tensor layer
The last two definitions are only given for completeness. Many people think that TensorFlow has something to do with one of these interpretations. This is not the case.

Weight matrix

The most important building block of TensorFlow and other deep-learning software is the n-dimensional array. This section explains the use of these arrays. Every deep learning application consists of a given topology of neural nodes. Each neural node is usually constructed as shown below.

Neural node

Each input is multiplied by a weight and added together. Together with a bias, the result goes to an activation function φ. This can be a simple step operation or a more complex function such as a hyperbolic tangent.
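A single node of this kind could be sketched as follows. The input values, weights and bias are hypothetical and chosen only for illustration; the two activation functions are the step operation and the hyperbolic tangent mentioned above.

```python
import math


def neuron(inputs, weights, bias, activation=math.tanh):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(total)


def step(value):
    """Simple step activation: 1 for positive input, otherwise 0."""
    return 1.0 if value > 0 else 0.0


# Hypothetical values, chosen only for illustration.
inputs = [0.5, -1.0, 2.0]
weights = [0.4, 0.3, 0.1]
bias = 0.1

print(neuron(inputs, weights, bias))        # tanh of the weighted sum
print(neuron(inputs, weights, bias, step))  # step of the weighted sum
```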

The output is the input for the next layer in the network. A network can be made of many layers, each with thousands of individual neurons.
If you look at one layer, the same input array can be applied to different weight arrays, each with a different result, so that various features can be extracted from a single input.

Neural net
In the network above, four inputs (yellow) are all fully connected to the four neurons (blue) of the first layer. These are wired to the five neurons of the next layer, followed by another inner layer of six neurons. After two consecutive layers of four and three neurons, the output (orange) with three channels is reached.
Such a scheme results in a vector-matrix multiplication.

GPU matrix mul

Here the input layer of four values (x,y,z,w) is multiplied with the weight matrix. The weights a,b,c,d give x' at the output, the weights e,f,g,h give y', and so on. There are other ways to describe this multiplication, like

GPU formula

Where v is the input vector (x,y,z,w) and v' the output (x',y',z'). The vector-matrix multiplication is one of the most performed operations in TensorFlow, hence the name.
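In plain Python, without any deep learning library, the same vector-matrix multiplication might look like the sketch below. The function name `layer_forward` and the weight values are assumptions for this example; a real framework performs the same arithmetic on much larger arrays.

```python
def layer_forward(weights, inputs):
    """Multiply a weight matrix (one row per output) with an input vector."""
    return [sum(w * v for w, v in zip(row, inputs)) for row in weights]


# Hypothetical weights: three outputs (x', y', z') from four inputs (x, y, z, w).
weights = [
    [1, 0, 0, 0],   # a, b, c, d -> x'
    [0, 2, 0, 0],   # e, f, g, h -> y'
    [1, 1, 1, 1],   # third row  -> z'
]
v = [1, 2, 3, 4]    # input vector (x, y, z, w)

v_out = layer_forward(weights, v)
print(v_out)  # [1, 4, 10]
```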


Before connecting all the dots, first a little detour into GPU hardware. GPU stands for Graphics Processing Unit, a device initially designed to relieve the CPU from the dreary screen rendering task. Over the years, GPUs became much more powerful. Nowadays they have over 21 billion transistors and are capable of performing massively parallel computations. These calculation capabilities are needed especially in games, where every pixel on the screen is calculated. When the position of the viewer moves, for example when the hero starts running, all vertices must be recalculated. And this happens 25 times per second to get smooth transitions. Each vertex needs a rotation and a translation. The formula is:

GPU full matrix

Here (x,y,z,w) is the initial vertex position in 3D and (x',y',z',w') is the new position after the matrix operation. As you can see, this type of arithmetic is the same as for a neural network. There is another point of interest. When you look at x', it is the summation of four products (ax+by+cz+dw). y', on the other hand, is also a summation (ex+fy+gz+hw). But to calculate y', one does not need to know the values that determine x' (a,b,c, and d). They have no bearing on each other. You can calculate x' at the same time as y', and z' and w' for that matter. In theory, every calculation that does not depend on other outcomes can be performed at the same time. Hence the highly parallel architecture of the GPU. The fastest GPUs today (2024) are capable of a whopping 125 TFLOPS.
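The independence of the four outputs can be made visible with a small sketch. Each row of a hypothetical 4x4 matrix (a scaling by 2 followed by a translation, values made up for this example) is handed to its own worker. Python threads are used here only to illustrate that the rows do not need each other; they give no real speed-up, which is exactly where the GPU hardware comes in.

```python
from concurrent.futures import ThreadPoolExecutor


def dot(row, vector):
    """One output value: the sum of four products, e.g. x' = ax + by + cz + dw."""
    return sum(r * v for r, v in zip(row, vector))


# Hypothetical 4x4 transformation: scale by 2, then translate by (10, 20, 30).
matrix = [
    [2, 0, 0, 10],
    [0, 2, 0, 20],
    [0, 0, 2, 30],
    [0, 0, 0, 1],
]
vertex = [1, 2, 3, 1]  # (x, y, z, w)

# Every row only reads the shared input, so the rows can be handed to
# separate workers in any order.
with ThreadPoolExecutor(max_workers=4) as pool:
    transformed = list(pool.map(lambda row: dot(row, vertex), matrix))

print(transformed)  # [12, 24, 36, 1]
```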
This is the whole idea behind GPU acceleration. Transfer all tensors to the GPU memory and have the device perform all vector-matrix calculations in a fraction of the time it would cost the CPU. Without the impressive GPU calculation power, deep learning would hardly be possible.


Driven by the huge market potential of deep learning, some manufacturers replaced the GPU with a TPU, a Tensor Processing Unit. In addition to vector-matrix multiplication, a GPU has other tasks to do, such as vertex interpolation and shading, H.264 compression, driving HDMI monitors, etc. By using all transistors solely for tensor dot products, the throughput increases while the power consumption decreases. The first generation only works with 8-bit integers, later generations also with floating-point numbers. The TPUs on the embedded boards below are all integer based, except for the Jetson Nano. Read an in-depth article here.
GPU Pitfalls
There are a few points about GPU arithmetic that must be taken into account.
To begin, stick to matrices. The GPU architecture is designed for that kind of operation. Writing an extensive if-else structure is disastrous for a GPU and for overall performance.
Another point is that memory swaps cost a lot of efficiency. More and more, the transfer of data between the CPU memory (where the images are usually located) and the GPU memory is becoming a serious bottleneck. You read the same over and over again in every NVIDIA document: the larger the vector-matrix dot product, the faster it will be executed.
In this regard, keep in mind that the Raspberry Pi and its alternatives usually have one large RAM for both the CPU and the GPU. They simply share the same RAM chip(s). Your neural network must not only fit in the program memory, but it must also leave space in the RAM so that the CPU kernel can run. This can sometimes impose restrictions on the network or the number of objects to be recognized. Choosing another board with more RAM may be the only solution in that case. All this contrasts with the graphics card in a PC, where the GPU has its own memory bank.
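A rough back-of-the-envelope check of whether the weights fit in shared RAM could look like this. All numbers are hypothetical (a network with 60 million weights, a 1 GB board with 800 MB assumed to be in use already), and the estimate ignores activations and framework overhead, so treat it as a lower bound.

```python
def weight_memory_mb(num_parameters, bytes_per_weight):
    """Memory needed for the weights alone, in megabytes (10**6 bytes)."""
    return num_parameters * bytes_per_weight / 1_000_000


def fits_in_ram(model_mb, board_ram_mb, reserved_mb):
    """True when the model still fits after reserving RAM for the OS and application."""
    return model_mb <= board_ram_mb - reserved_mb


# Hypothetical network with 60 million weights.
params = 60_000_000
float32_mb = weight_memory_mb(params, 4)   # 240.0
int8_mb = weight_memory_mb(params, 1)      # 60.0

# Hypothetical board: 1 GB of shared RAM, 800 MB assumed to be taken already.
print(fits_in_ram(float32_mb, 1024, 800))  # False
print(fits_in_ram(int8_mb, 1024, 800))     # True
```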
Another distinction is that the GPU on a video card works with floats or half floats, sometimes also called small floats. The embedded GPU on the Raspberry Pi or the TPU on the alternative boards works with 8 or 16-bit integers. Your neural network must be adapted to these formats. If this is not possible, choose another board with floating-point arithmetic, like the Jetson Nano.
A last piece of advice: don't overclock the GPU too much. GPUs normally work at a lower frequency than the CPU. Some Mali GPUs in ARM cores run as low as 400 MHz. Overclocking can work in the winter, but the application may falter mid-summer. Remember, it's your vision application at your client that suddenly crashes, not a game you simply restart.
And of course, the comments on the page about computer vision on the Raspberry also apply here.
You cannot train a deep learning model on a Raspberry Pi or an alternative. Not unless you have planned a trip around the world. The boards lack the computing capacity to perform the huge number of floating-point multiply-adds required during training. Even a Google Coral cannot train a network because the TPU on this board works only with special pre-compiled TensorFlow networks. Only the last layer in a network can be changed slightly. And although the Jetson Nano has floating-point CUDA cores, it is still not really able to train a network in an acceptable time. NVIDIA's advice here is to let it run overnight. So, in the end, you can only import and run an already trained model on these boards.
Cloud services.
As mentioned before, training is not an option on a Raspberry Pi, nor on any other small SBC. However, there is an escape route. All major technology companies have cloud services. Many of them also include the option to run a Linux virtual machine equipped with a GPU. Now you have a state-of-the-art CPU with CUDA acceleration at your fingertips. Google has one of the best free services, with a free 15 GB on GDrive and a minimum of 12 hours of free compute time per day. Now it is possible to train your deep learning models to a certain extent with just a simple Raspberry Pi. Transfer learning (partially adjusting your weights without changing the topology) is doable because it is a relatively easy task that you can complete in a few hours. Training a complex GAN, on the other hand, takes more resources. It will likely force you to buy additional compute power.
The first step is to install an operating system, usually a Linux derivative such as Ubuntu or Debian. That is the easy part.
The hard part is installing your deep learning model. You have to figure out if any additional libraries (OpenCV) or drivers (GPU support) are needed. Please note that only the Jetson Nano supports CUDA, a package most deep learning software on a PC uses. All other boards need different GPU support if you want to accelerate the neural network. The development of GPU drivers for the Raspberry Pi or the alternatives is an ongoing process. Check the communities on the net.

The last step is reducing the neural network to acceptable proportions. The famous AlexNet originally requires 2.3 billion floating-point operations per single frame. This will never run fast on a simple single-board ARM computer or mobile device. Most models have some sort of reduction strategy. YOLO has Tiny YOLO, Caffe has Caffe2 and TensorFlow has TensorFlow Lite. They all use one or more of the following techniques.
Reduce the input size. Smaller images save a lot of computations on the first layers.  
Decrease the number of objects to classify; it trims the sizes of many internal layers.
Port the neural network from floats to bytes where possible. This also lowers the memory load considerably.
Another strategy is the reduction of the floats to single bits, an XNOR network. This fascinating idea is discussed here.
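Porting from floats to bytes can be sketched with a simple symmetric quantization scheme: all weights of a layer share one scale factor and are rounded to 8-bit integers. This is a minimal illustration under that assumption, not the exact method of TensorFlow Lite or any other framework; the function names and weight values are made up for this example.

```python
def quantize(values, num_bits=8):
    """Map floats to signed integers with one shared scale factor (symmetric scheme)."""
    q_max = 2 ** (num_bits - 1) - 1                      # 127 for 8 bits
    # Float value of one integer step; fall back to 1.0 when all values are zero.
    scale = (max(abs(v) for v in values) / q_max) or 1.0
    quantized = [round(v / scale) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers."""
    return [q * scale for q in quantized]


# Hypothetical weights of one layer.
weights = [0.5, -1.27, 0.0, 1.27]
q_weights, scale = quantize(weights)
print(q_weights)                     # [50, -127, 0, 127]
print(dequantize(q_weights, scale))  # close to the original weights
```

Every weight now takes one byte instead of four, at the price of a rounding error of at most half a step.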

Comparison of Raspberry Pi and alternatives.

Jetson Nano vs Google Coral vs Intel Neural Stick: here is the comparison. The three odd ones out in the list are the JeVois, the Intel Neural Stick, and the Google Coral USB accelerator. The first has a camera onboard and can do a lot, as you can read here.
The Intel Neural Stick and the Google Coral accelerator are USB dongles with a special TPU chip performing all tensor calculations. The Intel Neural Stick comes with a toolset to migrate a TensorFlow, Caffe or MXNet model into a working Intermediate Representation (IR) image for the Neural Stick.
The Google Coral works with special pre-compiled TensorFlow Lite networks. If the topology of the neural network and its required operations can be described in TensorFlow, it may work well on the Google Coral. However, with its scarce 1 GB of RAM, memory shortage can still be an issue.
The Google USB accelerator has its special back-end compiler converting a TensorFlow Lite file to an executable model for the dongle TPU.

The Jetson Nano is the only single-board computer with floating-point GPU acceleration. It supports most models because all frameworks, such as TensorFlow, Caffe, PyTorch, YOLO, MXNet, and others, use the CUDA GPU support library at some point. The price is also very competitive. This has everything to do with the booming deep learning market, where NVIDIA does not want to lose its prominent role.
Not all models can run on every device, most of the time due to memory shortage or incompatible hardware and/or software. In these scenarios, several workarounds are possible. However, they are time-consuming to develop, and the results are often disappointing.

Benchmarks are always subject to discussion. Some may find other FPS using the same models. It all has to do with the method used. We used Python, NVIDIA used C++, and Google used its own TensorFlow and TensorFlow Lite. The Raspberry Pi 3 B+ has a USB 2.0 interface onboard. Both neural sticks can handle USB 3.0, which means that they could perform faster. The new Raspberry Pi 4 B, on the other hand, has USB 3.0, which will result in a higher FPS compared to its predecessor.

The numbers shown in the table are purely the time it takes to execute from input to output. No other processes are taken into account like capturing and scaling images. No overclocking is used, by the way.
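Measuring only the time from input to output could be done along these lines. This is a sketch of the general idea, not the benchmark code used for the table; `dummy_model` is a hypothetical stand-in for a real network, and the frame is prepared before the clock starts so that capturing and scaling are left out.

```python
import time


def measure_fps(run_inference, frame, repeats=50, warmup=5):
    """Time only the call from input to output and return frames per second."""
    for _ in range(warmup):          # let caches and lazy initialization settle
        run_inference(frame)
    start = time.perf_counter()
    for _ in range(repeats):
        run_inference(frame)
    elapsed = time.perf_counter() - start
    return repeats / elapsed


def dummy_model(frame):
    """Hypothetical stand-in for a network: just sums the pixel values."""
    return sum(frame)


frame = [0.5] * 10_000               # pre-loaded, pre-scaled input
fps = measure_fps(dummy_model, frame)
print(f"{fps:.1f} FPS")
```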
| Model | Framework | Raspberry Pi (TF-Lite) | Raspberry Pi | Raspberry Pi + Intel Neural Stick 2 | Raspberry Pi + Google Coral USB | JeVois | Jetson Nano | Google Coral |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  |  | 14.6 FPS (Pi 3), 25.8 FPS (Pi 4) |  | 95 FPS (Pi 3), 180 FPS (Pi 4) | 105 FPS (Pi 3), 200 FPS (Pi 4) | - | 216 FPS | 200 FPS |
|  |  | 2.4 FPS (Pi 3), 4.3 FPS (Pi 4) | 1.7 FPS (Pi 3), 3 FPS (Pi 4) | 16 FPS (Pi 3), 60 FPS (Pi 4) | 10 FPS (Pi 3), 18.8 FPS (Pi 4) | - | 36 FPS | 18.8 FPS |
|  | TensorFlow | 8.5 FPS (Pi 3), 15.3 FPS (Pi 4) | 8 FPS (Pi 3), 8.9 FPS (Pi 4) | 30 FPS (Pi 3) | 46 FPS (Pi 3) | 30 FPS | 64 FPS | 130 FPS |
| SSD Mobilenet-V2 | TensorFlow | 7.3 FPS (Pi 3), 13 FPS (Pi 4) | 3.7 FPS (Pi 3), 5.8 FPS (Pi 4) | 11 FPS (Pi 3), 41 FPS (Pi 4) | 17 FPS (Pi 3), 55 FPS (Pi 4) | - | 39 FPS | 48 FPS |
| Binary model | XNOR | 6.8 FPS (Pi 3), 12.5 FPS (Pi 4) |  |  |  |  |  |  |
| Inception V4 | PyTorch | - | - | - | 3 FPS (Pi 3) | - | 11 FPS | 9 FPS |
| Tiny YOLO V3 | Darknet | 0.5 FPS (Pi 3), 1 FPS (Pi 4) | 1.1 FPS (Pi 3), 1.9 FPS (Pi 4) | - | - | 2.2 FPS | 25 FPS | - |
|  | Caffe | 4.3 FPS (Pi 3), 10.3 FPS (Pi 4) | - | 5 FPS (Pi 3) | - | - | 14 FPS | - |
| Super Resolution | PyTorch | - | - | 0.6 FPS (Pi 3) | - | - | 15 FPS | - |
|  | MXNet | 0.5 FPS (Pi 3), 1 FPS (Pi 4) | - | 5 FPS | - | - | 10 FPS | - |
|  | Caffe | - | - | 5 FPS | - | - | 18 FPS | - |
|  | TensorFlow | 2.0 FPS (Pi 3), 3.6 FPS (Pi 4) |  |  |  |  |  |  |

Raspberry Pi and deep learning.

We have placed a deep learning library and several deep learning networks on GitHub. Together with the simple C++ example code, you could build your deep learning application on a bare Raspberry Pi. It is extremely user friendly. More information on this page.
Above is an impression of a TensorFlow Lite model (MobileNetV1_SSD 300x300 with the COCO training set) running on a bare Raspberry Pi.
With a 64-bit operating system like Ubuntu, you get 24 FPS if you overclock to 1925 MHz.
With a regular 32-bit system like Raspbian, you get 17 FPS once overclocked to 2000 MHz.

Raspberry Pi and recent alternatives.

Below is a selection of the Raspberry Pi and recent alternatives suitable for implementing deep learning models. Most have extensive GPU or TPU hardware on the chip. Please note that the listed prices can fluctuate a lot. The prices shown are from just after the severe worldwide chip shortage. The GPU speed is given in TOPS, which stands for Tera Operations Per Second. The highest score will be, of course, when you are using 8-bit integers. Most suppliers give this 8-bit score. If you want to have an impression in TFLOPS (Tera Floating-Point Operations Per Second), divide the number by four. Although some GPUs, like the one in the Jetson Nano, can't process single 8-bit values, their score is still given in TOPS for comparison reasons.
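The rule of thumb for converting TOPS to TFLOPS is simple enough to put in a one-line helper. The function name is an assumption for this example, and the result is only a rough impression, as stated above; the two scores are the ones listed for these devices on this page.

```python
def tops_to_tflops(tops_int8):
    """Rule of thumb from the text: divide the 8-bit TOPS score by four."""
    return tops_int8 / 4


# Scores taken from the board list on this page.
for name, tops in [("Google Coral USB", 4.0), ("Jetson Nano", 1.88)]:
    print(f"{name}: {tops} TOPS is roughly {tops_to_tflops(tops):.2f} TFLOPS")
```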
Raspberry Pi Pico
2x Cortex-M0+ CPU
133 MHz - 264 KB
€ 5

Just some I/O, an RP2040 MCU and 2 MB of flash. Can it be used for deep learning? Barely. However, TensorFlow TinyML has some examples here.
Raspberry Pi Zero 2 W
4x Cortex-A53 CPU
VideoCore IV GPU
1.0 GHz - 512 MB
€ 15

The Raspberry Pi 3B+ on the RPi Zero footprint. Ideal for low cost, small deep learning applications in tiny housing.
Raspberry Pi 3 B+
4x Cortex-A53 CPU
VideoCore IV GPU
1.2 GHz - 1 GB
€ 40

Parent of all boards. Still one of the most sold. Lots of code and support available.
Raspberry Pi 4 B
4x Cortex-A72 CPU
VideoCore VI GPU
1.5 GHz - 1/2/4/8 GB

€ 50/€ 50/€ 60/€ 90

The successor to the Raspberry Pi 3 with a slightly faster processor, USB 3.0 and Gigabit Ethernet.
Raspberry Pi 5
4x Cortex-A76 CPU
VideoCore VII GPU
2.4 GHz - 4/8 GB
€ 70/€ 93

The successor to the Raspberry Pi 4 with a (much) faster processor, two camera ports, PCIe 2.0, USB 3.0 and Gigabit Ethernet.
Jetson Nano B01
4x Cortex-A57 CPU
128x CUDA
1.88 TOPS
1.43 GHz - 4 GB

€ 216

Identical to the Jetson Nano A02 board, except it has two camera ports, which makes it ready for binocular applications like stereo recording, depth sensing, 3D object tracking and image stitching. More NVIDIA boards.
Jetson Orin Nano 4GB
6x Cortex-A78 CPU
512x CUDA
1.43 GHz - 4/8 GB

€ 242 / € 450

The successor to the Jetson Nano, but with a lot more AI power. If you want to start deep learning at the edge, here's your board. Note that you also need a carrier board, which makes a total of € 450 for a development kit. More NVIDIA boards.
Radxa Zero 3W
4x Cortex-A55
Mali-G52 GPU
1.4 GHz - 1/2/4/8 GB

€ 16/€ 21/€ 30/€ 44

With the form factor of the Raspberry Pi Zero, this little board beats all its competitors. The RK3566 has an NPU for deep learning acceleration.
For € 21 you get a board identical to the Raspberry Pi 4, with an additional 0.6 TOPS NPU.
Note: the NPU needs 2 GB or more RAM. Due to the poor support, drivers and software may be a problem.
Rock 3C
4x Cortex-A55
Mali-G52 GPU
1.6 GHz - 1/2 GB

€ 40/€ 50

We mention the Rock 3C only because it is the cheapest fully stacked board with an RK3566.
For € 40 you get a board almost identical to the Raspberry Pi 4, with an additional 1 TOPS NPU. As usual, drivers and software are the bottleneck.
Rock 5A
4x Cortex-A76 + 4x Cortex-A55 CPU
Mali-G610 MP4 GPU
2.5 GHz - 4/8/16 GB

€ 100/€ 120/€ 160

The Rock 5A is positioned as the next-generation Raspberry Pi 4. It is built with the Rockchip RK3588, giving you better and faster CPUs. The NPU (neural processing unit) supports INT4/INT8/INT16/FP16 mixed operations. Runs Android, Debian, Ubuntu etc. Due to the weak support, software drivers can be a problem.
Rock 5B
4x Cortex-A76 + 4x Cortex-A55 CPU
Mali-G610 MP4 GPU
2.5 GHz - 4 GB

€ 150

A slightly larger board than the Rock 5A, with the same Rockchip RK3588. More I/O is available compared to the Rock 5A. Thanks to the metal housing, the cooling of the SoC is no problem.
Orange Pi 5
4x Cortex-A76 + 4x Cortex-A55 CPU
Mali-G610 MP4 GPU
2.5 GHz - 4/8/16 GB

€ 77/€ 100/€ 127

Almost identical to the Rock 5 board, with the Rockchip RK3588S. The price is slightly lower compared to the Rock 5. Runs Android, Debian, Ubuntu etc. Due to the weak support, software drivers can be a problem. Don't forget to cool your RK3588.
Google Coral
4x Cortex-A53 + 1x Cortex M4 CPU
GC7000 Lite 3D GPU
1.5 + 1.0 GHz - 1/4 GB

€ 125

Raspberry-inspired board with the Edge TPU accelerator. Note the limited RAM (1 GB), while deep learning is memory hungry.
Google Coral Mini
4x Cortex-A35 CPU
IMG PowerVR GE8300 GPU
1.3 GHz - 2 GB


A smaller, simpler and cheaper board. Less CPU performance gives lower power consumption. Usual I/O and the original Edge TPU accelerator.
Google Coral Micro
Cortex-M7 + Cortex-M4
1.0 GHz - 512 MB


A simple microcontroller with the original Edge TPU accelerator. On board is a 324x324 color camera. WiFi and Ethernet are separate add-on boards. With only 512 MB of RAM, it's like a Ferrari with a 5-gallon gas tank. It is best to run only tiny quantized (int8) TensorFlow Lite models.
Khadas VIM3
4x Cortex-A73 + 2x Cortex-A53 CPU
2.2 GHz - 2/4 GB

€ 110

Superior Raspberry Pi replacement with an 8 and 16-bit neural network processing unit.
Update Oct 2022: Sadly, at the moment, there's only one outdated framework (2020) available for the NPU, which isn't capable of running modern Yolo models.
Myriad X
16 SHAVE cores

€ 85

OpenCV AI Kit, with an integrated Sony 12 MPixel IMX378 camera and a Myriad X VPU. Suitable for most vision tasks, such as simple deep learning.
Myriad X
16 SHAVE cores

€ 125

OpenCV AI Kit with an integrated Sony 12 MPixel IMX378 camera and a Myriad X VPU. Compared to the OAK-1, it has two additional OV9282 cameras with global shutter, making it ready for depth sensing, 3D object tracking and image stitching.
Myriad X
16 SHAVE cores

€ 170

The OAK-D, but now in a beautiful housing. The Myriad X VPU is still used as a workhorse. Only the cameras have been upgraded to the Sony 13 MPixel IMX214 and two OV7251 for depth measurement.
Intel Neural Stick 2
Intel Movidius Myriad X
16 SHAVE cores

€ 81

Special Intel neural network USB 3 dongle for PC and single boards like Raspberry Pi. Accelerates tensor arithmetic enormously. Fully supported by OpenCV.
Google Coral USB
Edge TPU
4.0 TOPS

€ 82

Only the bare Google Coral Edge TPU with a USB 3.0 interface. Capable of the same as the Coral board.
Orange Pi AI Stick Lite
Lightspeeur NPU
2.8 TOPS

€ 22

The Rockchip RK3399 neural processor unit in a USB 3 dongle. Supports various deep learning models such as VGG, SSD by an Orange Pi convertor tool.
RK1808 NPU
Rockchip AI core
3.0 TOPS

€ 78 (Sold out)

The neural processor unit from the Rockchip RK3399 in a USB 3 dongle. It also has 1GB RAM and 8GB EMMC storage on board.
Sipeed Maix Go
2x RISC-V 64-bit CPU
0.5 TOPS
800 MHz - 8 MB

€ 35 (Sold out)

A very cute board with camera, microphone, speaker, I/O, USB and, on top, an NPU accelerator. Not an RPi or a Nano, but still perfect for simple deep learning tasks. Works with MicroPython. Most interesting is the low power consumption of 300 mW.
JeVois A33
4x Cortex-A7 CPU
2x Mali-400 GPU
1.35 GHz - 256 MB
€ 70

A complete 32-bit single-board computer with an integrated 1.3 MP camera. Thanks to its GPU, it is capable of deep learning and other machine vision tasks. It's also very small (32x40 mm). Read more here.
Sophon BM1880
2x Cortex-A53 + RISCV CPU
1.0 TOPS AI Core
1.5 + 1.0 GHz - 1 GB

€ 116

8-bit neural network processing unit
Google SOM
4x Cortex-A53 + 1x Cortex M4 CPU
GC7000 Lite 3D GPU
1.5 + 1.0 GHz - 1 GB

€ 90

Single tiny (40x48 mm) pluggable module with full I/O and the Edge TPU accelerator.