# Deep learning with FPGA aka Binary Neural Networks.

### Introduction.

Is it wise to run your deep learning app on an FPGA? To be honest, no, it is not.

Not only are FPGAs notoriously hard to program, even with today's tools; you also need a thorough knowledge of the hardware at hand. Although an FPGA itself costs only a few dollars, some development boards are relatively expensive. And the design can't be altered as easily as on other boards like the Google Coral, Jetson Nano, or Raspberry Pi.

There is only one good reason to use an FPGA for deep learning: handheld, battery-powered devices. An FPGA is the only hardware device capable of massive computations at a very low power consumption. A complete neural network can be implemented with a power consumption of 1 mW. A comparison with other well-known platforms is shown below.

### FPGA versus CPU.

An FPGA (**F**ield **P**rogrammable **G**ate **A**rray) is an electronic component. It consists of millions of primitive digital gates. The connections between the gates determine the functionality of the design. These connections can be programmed, hence the name 'field programmable'. FPGAs are usually used for complex, arithmetic-intensive tasks such as FFT or image stitching.

The CPU in a computer processes a long sequence of small arithmetic steps. The faster this sequence is executed, the faster the program runs.

An FPGA, on the other hand, works in parallel. One transaction at the input can cause a chain reaction spread over many gates, thereby executing many complex calculations in one step.

The CPU is flexible. If the program is changed, the functionality changes accordingly. The functionality of an FPGA is fixed by the pre-programmed wiring. During startup the connection table is loaded into the FPGA; after that it remains unchanged. Therefore an FPGA is best considered as hardware.

Another important difference is the size of binary numbers. A CPU has a fixed word size; the whole architecture uses 16-, 32- or 64-bit numbers. The numbers in an FPGA are flexible. Because all the wiring in the chip works at bit level, it is no problem if, for example, 21 bits are required.

There is one thing an FPGA and a CPU have in common: neither has much memory inside the chip. This has everything to do with the production process. High-volume memory is built with DRAM techniques. The wafers used for DRAM are not compatible with the wafer technology used to build CPUs or FPGAs, leaving no other option than SRAM. Despite its superior speed, SRAM requires more space on the die. Hence the lack of gigabytes of memory in an FPGA or CPU. Before further investigation into FPGA designs, first a small detour to neural nodes is made.

Neural Node.

A deep learning network consists of many layers. Each layer consists of many neural nodes. These nodes are the basic building block of every neural network. Below a simple graph.

The inputs X1..X4 are multiplied by their corresponding weights W1..W4. Next, the products are added up. The sum is then shaped and limited by some activation function, here the ReLU. Every input has an influence on the output; the weight determines the extent of this influence. All weights together dictate the functionality of the network, for example, which objects it will classify. Modern deep-learning networks with their many layers have a huge number of weights. They must all be easily accessible to get good performance. This is one of the many challenges of deep learning architecture.
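As a minimal sketch, such a node fits in a few lines of Python with NumPy; the input and weight values below are hypothetical examples:

```python
import numpy as np

def neural_node(x, w, bias=0.0):
    """One neural node: weighted sum of the inputs followed by a ReLU."""
    s = np.dot(x, w) + bias          # X1*W1 + X2*W2 + X3*W3 + X4*W4
    return max(0.0, s)               # ReLU: negative sums are clipped to 0

# Hypothetical inputs X1..X4 and weights W1..W4
x = np.array([0.5, -1.0, 2.0, 0.3])
w = np.array([0.8,  0.2, 0.5, -1.1])
print(neural_node(x, w))             # weighted sum is 0.87, ReLU leaves it unchanged
```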

### CPU replacement.

The first and simplest strategy is replacing the CPU with an FPGA. Below a schematic overview. There is no need to design a whole CPU in the FPGA, only the essential functionality.

The FPGA architecture consists of a memory interface that continuously fetches weights for the systolic neural-network array. The weights are still stored in an external DDR memory. The camera interface also streams the images from the camera into the systolic array. A fast systolic array can easily be designed using the Google Edge TPU structure. The last part is the necessary interface with the outside world. Many FPGAs have their connection tables in flash memory inside the chip. The optional SPI memory is used as a backup for this table during startup.

A multiplication in an FPGA is relatively easily programmed. The floating-point variant, however, uses many more digital gates than an integer variant. Because an FPGA has only a limited number of gates on board, the integer version is mostly used. Therefore all weights must be converted in advance from floating-point to integer. Luckily, the Google Edge TPU and Intel Neural Compute Stick support these operations.
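A minimal sketch of such a conversion, assuming a simple symmetric linear quantization scheme with one scale factor per tensor (the real toolchains are far more sophisticated):

```python
import numpy as np

def quantize_weights(w, bits=8):
    """Symmetric linear quantization: map float weights to signed integers.
    A simplified sketch; production quantizers use per-channel scales,
    calibration data and zero points."""
    qmax = 2 ** (bits - 1) - 1                 # e.g. 127 for int8
    scale = np.max(np.abs(w)) / qmax           # one scale factor for the tensor
    q = np.round(w / scale).astype(np.int8)    # integers stored on the device
    return q, scale                            # scale is kept to rescale results

w = np.array([0.75, -0.50, 0.25, -1.00])       # hypothetical float weights
q, scale = quantize_weights(w)                 # q = [95, -64, 32, -127]
```

On the device only the integers `q` are used; multiplying back by `scale` recovers the original weights to within the quantization error.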

The CPU replacement hardly changes the functionality or performance of the network itself. It merely reduces the power consumption by more than 75%. For example, an ECP5 FPGA from Lattice uses approximately 200 mA running at full speed.

### Binary Neural Network.

The second method reduces the power consumption even further by using the internal SRAM of the FPGA instead of the external DDR. The whole neural network is now located on a single chip. With an FPGA like the iCE40, power consumption drops to 1 mW.

To fit all the weights inside the much smaller embedded memory, they have to be compressed even further. A possible technique is reducing every weight to a single bit. This so-called binary neural node is shown below.

In a binary neural node, only the weights are replaced by a single bit number, -1 or +1.

The output is calculated just as before: first the inputs are multiplied by these -1 or +1 values, next the results are added, and finally an activation function limits the output. At first glance, it seems a fairly radical method with dubious results. In practice, however, the results are surprisingly good when the correct training procedures are followed. This probably has to do with the stronger generalization the single-bit weights force upon the network.
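A sketch of such a binary node in Python with NumPy; the example values are hypothetical, and only the weight binarization differs from a standard node:

```python
import numpy as np

def binary_node(x, w_real):
    """Neural node with binarized weights: only the sign of each weight is kept."""
    w_bin = np.where(w_real >= 0, 1, -1)   # every weight becomes -1 or +1
    s = np.dot(x, w_bin)                   # multiply-accumulate, as before
    return max(0.0, s)                     # ReLU activation

x      = np.array([0.5, -1.0, 2.0, 0.3])  # hypothetical inputs
w_real = np.array([0.8,  0.2, 0.5, -1.1]) # hypothetical real-valued weights
print(binary_node(x, w_real))             # w_bin = [1, 1, 1, -1], output 1.2
```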

The execution time on the FPGA is the same as in the standard configuration. A proper FPGA design always has a one-cycle multiplication, whether floating-point or integer. For a CPU, things are different: there, a multiplication between a real and an integer is twice as fast as a real with another real. This binary neural network not only reduces the memory load by a factor of 32, it also saves a factor of 2 in computation. Below is a graph of the memory savings of a number of well-known networks.

### XNOR network.

A second type of binary neural network reduces not only the weights to bits but also the inputs. This method provides an extreme speedup at the cost of a little accuracy. Below a schematic overview of the three types is shown.

In a standard neural network, all weights of a neural node form a column of a matrix. The scheme above shows 9 neurons, each with 7 inputs. The output is the multiplication of a vector (the input) with the matrix (the weights), also called a dot product. After some activation function, the final output is available. In a binary neural network, the weights are replaced with either a -1 or a +1, but the same operations are executed to get the output. In an XNOR network, not only the weights are replaced by bits, but the inputs can also only have the value -1 or +1. The vector-matrix multiplications can now be replaced by a simple logical XNOR operation.

As the digital diagram above shows, an XNOR gate works as a multiplier when the '0' is interpreted as '-1'. In other words, during programming consider 0 as -1.

A single XNOR operation on a 32-bit CPU now performs 32 multiplications in one clock cycle.

The next step is the accumulation of the vector-matrix results. In the case of the XNOR network, the positive outcomes are counted (a popcount). This forms an output vector of integers. After thresholding, these integers are converted back into binary numbers. The threshold is adaptive and depends on the average absolute weight and input values.
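The XNOR-plus-popcount trick can be sketched in plain Python. The two 8-element example vectors below are hypothetical; bit 1 encodes +1 and bit 0 encodes -1:

```python
def xnor_dot(a_bits, b_bits, n):
    """Dot product of two n-element {-1,+1} vectors packed as n-bit integers
    (bit 1 encodes +1, bit 0 encodes -1)."""
    xnor = ~(a_bits ^ b_bits) & ((1 << n) - 1)  # 1 where the two signs match
    matches = bin(xnor).count("1")              # popcount: number of +1 products
    return 2 * matches - n                      # matches minus mismatches

# a = [+1,-1,+1,+1,-1,-1,+1,-1], b = [+1,+1,-1,+1,-1,+1,+1,-1] (MSB first)
a = 0b10110010
b = 0b11010110
print(xnor_dot(a, b, 8))                        # 5 matches, 3 mismatches: dot = 2
```

On real hardware the popcount is a single instruction, so one XNOR plus one popcount replaces 32 (or 64) multiply-accumulates.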

Most XNOR networks do not binarize the first input layer. When an RGB image is converted to single bits right at the beginning, too much information is lost. Usually, XNOR nets do not binarize the last layer either. Some studies show great improvements when this layer remains real-valued.

It is obvious that the XNOR network runs a lot faster since no matrix multiplications are needed anymore. As mentioned earlier, an FPGA with its systolic design has little trouble with matrix multiplications anyway. A CPU configuration, on the other hand, gets a large performance boost. Now even a simple Raspberry Pi, or a smartphone, can run a deep learning network at acceptable frame rates.

### Variations.

Because deep learning is a hot topic nowadays, a lot of research is going on, also in XNOR and binary networks. There is currently no definitive standard algorithm. Many researchers have made their contribution to the field, all hoping to get the best results with the least memory and computational effort. Below just a few versions are shown. There are many more; Google is your friend here.

The first shown is the standard CNN. The input is a yellow 4x4 matrix. The orange weight matrix is 2x2, and the 3x3 convolution output I **·** W is shown in blue. Note that the input here is a matrix, while the input in the previous diagram was a vector.

The second example is the basic binary version (https://arxiv.org/abs/1511.00363). The weights are replaced by their sign. The output I **·** W approximates the previous output more or less.

The third variation is the basic XNOR neural network. Now the inputs are also replaced by their sign. The output is still the dot product of the vector-matrix multiplication. The result is a rough estimate of the actual value.

The next XNOR network tries to improve the results with an adjustable scaling factor α. This α is the average of the absolute real weight values, α = 1/n **·** Σ |W| (https://arxiv.org/abs/1602.02830). The individual outputs are still too far away from their desired values, but the average of the I Θ W **·** α matrix approximates the real-valued counterpart very well.

In the last example a second correction factor K is introduced (https://arxiv.org/abs/1603.05279). This K is the average of the absolute input values over the weight matrix dimensions, K = 1/n **·** Σ |I|. The intermediate *dot product* matrix is generated by an iteration algorithm used in the calculations. Now the output comes much closer to the real value. This accuracy has a price: the correction value K is a real-valued convolution operation on the input, and it needs to be recalculated for every new input. This delays the execution of the network. The XNOR-α-K variant is employed by AI2GO; they achieve ± 7 FPS on a Raspberry Pi.

The α and K factors serve a special purpose. Some network topologies such as ResNet add outputs from different layers together before thresholding them. Without proper scaling by the α and K factors, this would not work. If the network topology doesn't dictate layer addition, both factors can be omitted.
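The weight binarization with the α scaling factor can be sketched as follows; the example weights are hypothetical:

```python
import numpy as np

def binarize_with_alpha(w):
    """XNOR-Net style weight binarization: B = sign(W), alpha = mean(|W|),
    so that W is approximated by alpha * B."""
    alpha = np.mean(np.abs(w))          # scaling factor: average absolute weight
    b = np.where(w >= 0, 1.0, -1.0)     # the binary weights
    return b, alpha

w = np.array([0.9, -0.3, 0.6, -0.6])    # hypothetical real-valued weights
b, alpha = binarize_with_alpha(w)
# alpha = (0.9 + 0.3 + 0.6 + 0.6) / 4 = 0.6, so alpha*b = [0.6, -0.6, 0.6, -0.6]
```

Note how `alpha * b` matches each weight's sign exactly and its magnitude on average, which is precisely what the scaling factor is for.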

### Training.

It is possible to convert an already-trained network into a binary or XNOR network. The most used procedure is simply taking the sign of the real value: a positive weight gives a +1, a negative value a -1. At the same time, the average of all absolute weight values of one layer is taken. This forms the threshold used later when converting the positive counts into binary numbers.

If necessary, the network can be retrained. This is somewhat different from training a standard neural network. One needs to maintain both the real-valued weights and the binary weights. Forwards, the XNOR mechanism generates the output. The error gradient is back-propagated via the real-valued weights. At the same time, the weights are clipped between -1 and +1 to prevent them from growing to extremes. Before the next training epoch, the newly calculated real-valued weights are converted to binary values as described above. In practice there is a whole range of refinements; it is a subject still under investigation.
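A heavily simplified sketch of one such training step for a single node, assuming a squared-error loss, a straight-through gradient estimate and no activation function; all names and values are hypothetical:

```python
import numpy as np

def sign(w):
    return np.where(w >= 0, 1.0, -1.0)

def train_step(w_real, x, target, lr=0.1):
    """One training step of a single binary node: forward with sign(w),
    backward through the hidden real-valued weights (straight-through)."""
    w_bin = sign(w_real)                  # forward pass uses the binary weights
    y = np.dot(x, w_bin)                  # node output (activation omitted)
    grad_y = y - target                   # gradient of a squared-error loss
    grad_w = grad_y * x                   # straight-through: treat d sign(w)/dw as 1
    w_real = w_real - lr * grad_w         # update the hidden real weights
    return np.clip(w_real, -1.0, 1.0)     # keep weights between -1 and +1

w_new = train_step(np.array([0.5, -0.5]), np.array([1.0, 1.0]), target=2.0)
```

The binary weights are regenerated from `w_real` at each step, so small gradient updates can accumulate until a weight eventually flips sign.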

### Implementation.

We listed some software implementations below. We take no responsibility whatsoever; we didn't even test them. Nor is it clear how well suited these are for an FPGA, Raspberry Pi or another alternative board. See it more as a first step in your BNN or XNOR project.

TensorFlow:

https://github.com/itayhubara/BinaryNet.tf

https://github.com/AngusG/tensorflow-xnor-bnn

MXNet:

https://github.com/hpi-xnor/BMXNet

https://github.com/hpi-xnor/BMXNet-v2

PyTorch:

https://github.com/dizcza/lbcnn.pytorch

Python:

https://github.com/wonnado/binary-nets

https://github.com/HatDu/LBCNN

https://github.com/jaygshah/Binary-Neural-Networks

Java:

https://github.com/hpi-xnor/android-image-classification

C++:

https://github.com/hpi-xnor/ios-image-classification

Verilog (FPGA):

https://github.com/EEESlab/combinational-bnn

The spectacular memory savings and computational speedups can only be achieved by fine-tuning all your algorithms with respect to the registers, ALU and other low-level hardware. For instance, many small XNOR operations are best cast into one large single instruction with the proper matrix segmentation. Also, the popcount can be done in one instruction (POPCNT on an Intel x86 with ABM, or VCNT on ARM with NEON). In short, it is a real art. There are many articles on the internet on this subject.

The implementation in an FPGA can be streamlined by the tools supplied by the manufacturer. Nevertheless, FPGA designs are always a real challenge in many areas. Please contact us if you need any assistance.

### Conclusion.

The final results of the different methods are shown in the table. The binary network is a good solution for a low-power FPGA design. Once properly trained, it performs almost identically to a standard network. With a small accuracy penalty, the XNOR alternative can run not only on an FPGA but even on a 32-bit ARM Cortex core like the Raspberry Pi's.

### Boards and devices.

There are two major independent FPGA manufacturers: Lattice and Xilinx. Altera was recently purchased by Intel. At the moment, Intel has no specific FPGA suitable for deep learning, nor any software tools available to the public.

| Board | Hardware | Price | Description |
|---|---|---|---|
| HM01B0 UPduino | iCE40UP5K, 320x320 image sensor HM01B0, 5280 LUT, 1024 Kbit RAM, 8 DSP | € 40 | The most elementary board: an FPGA, an image sensor and two microphones. Via the USB connection, simple deep learning BNNs can be downloaded and run on a few mA. |
| iCE40 UltraPlus | 4x iCE40UP5K, LCD screen, VGA image sensor OVM7692, 8 MB flash, 5280 LUT, 1024 Kbit RAM, 8 DSP | € 70 | The iCE40 is a special very low-power, lightweight FPGA. You need only a few mA to run an application. With a proper BNN, it runs simple deep learning applications such as face recognition, key-phrase detection, or hand gesture detection. |
| ECP5 Embedded Vision | LFE5UM-85F, 84K LUT, 208 sysMEM (18 Kb), 3744 Kbit RAM, 128 18x18 MULs | € 310 | A complete FPGA embedded vision solution from Lattice: two HD cameras, a high-end FPGA and HDMI output. With the supplied tools, TensorFlow or Caffe models can run on this platform. |
| Zynq UltraScale+ MPSoC | 4x Cortex-A53, 2x Cortex-R5, 2x Mali-400, 2 GB, 504K LUT, 38 Mbit RAM, 1728 DSP slices | € 800 | Xilinx development board used for computer vision and deep learning. The chip integrates quad and dual ARM Cortex cores with programmable logic. The Baidu EdgeBoard uses the same chip, with the FPGA logic already preprogrammed as a TPU. |
| Baidu EdgeBoard | Zynq UltraScale+ MPSoC, 2.4 TOPS | € 775 | The Chinese counterpart of Google is Baidu. Just like Google, they promote deep learning; their TensorFlow is called PaddlePaddle (PArallel Distributed Deep LEarning, 飞桨), http://www.paddlepaddle.org. And just like the Google Coral Dev Board, they have their own EdgeBoard. Developing a dedicated ASIC like the Edge TPU is very expensive, so Baidu chose to work with Xilinx (Zynq UltraScale+ MPSoC). The chip integrates quad ARM Cortex-A53 cores with programmable logic preprogrammed for deep learning. http://ai.baidu.com/tech/hardware/deepkit |
