Abstract

Many modern applications that perform classification, prediction, or clustering employ neural networks (NNs) for these tasks. They are often deployed in datacenters or on mobile devices, where high performance and energy efficiency are crucial. Thanks to their massively parallel architecture, GPUs deliver a substantial performance improvement over CPUs for these applications. Nevertheless, their fixed architecture remains a limitation, especially in a field that is changing quickly and demands fast innovation.

Field-Programmable Gate Arrays (FPGAs) offer more flexibility: their architecture can be specialised to the specific neural network application in order to achieve the best performance. Furthermore, they are highly energy-efficient and therefore well suited for NNs in embedded systems or large-scale server clusters.

However, deploying neural networks on FPGAs with a focus on high performance is, like programming parallel accelerators in general, a complex task. Their flexible architecture offers many options for tuning performance, but in return demands extensive hardware-specific expertise and costly development time to be configured properly. Rather than manual development, automation should be employed here to accelerate the development process and drive down costs. Furthermore, developers do not want to manually adapt their implementations for each accelerator (e.g. CPU, GPU, FPGA); instead, performance portability is desirable.

Lift addresses these challenges by offering a high-level, functional, data-parallel language, which allows the user to develop an application efficiently and independently of the target hardware platform. Rewrite rules in the Lift compiler then open up a vast design space of possible implementations for this abstract specification. This design space is explored to find a suitable solution that satisfies the performance and energy requirements. To exploit the parallel structure of NNs, the compiler employs pipelining mechanisms and allocates distributed on-chip memory on the FPGA. Timing behaviour and scheduling are introduced until finally hardware description language (HDL) code is emitted, which can be used to generate the bitstream for the FPGA.
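To give an impression of what such a high-level functional, data-parallel specification looks like, the following sketch expresses a dot product, a core building block of NN layers, using generic zip/map/reduce primitives. It is written in plain Scala rather than actual Lift syntax and is meant only to convey the programming style; in Lift, rewrite rules would transform an expression of this kind into hardware-specific implementations.

```scala
// Illustrative sketch only: a dot product in a functional, data-parallel
// style similar in spirit to Lift's primitives (plain Scala, not Lift syntax).
object DotProductSketch {
  def dot(xs: Array[Float], ys: Array[Float]): Float =
    xs.zip(ys)                       // pair corresponding elements
      .map { case (x, y) => x * y }  // element-wise multiply (data-parallel)
      .reduce(_ + _)                 // sum the products

  def main(args: Array[String]): Unit = {
    val a = Array(1.0f, 2.0f, 3.0f)
    val b = Array(4.0f, 5.0f, 6.0f)
    println(dot(a, b)) // 32.0
  }
}
```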