
Event

PhD defence of Tzung-Han Juang – Enabling Efficient Resource Sharing with Functional IR for Mapping Neural Networks on FPGAs

Wednesday, July 8, 2026, 10:00 to 12:00
McConnell Engineering Building Room 603, 3480 rue University, Montreal, QC, H3A 0E9, CA

Abstract

The rapid development of machine learning has driven the demand for high computational performance, making Graphics Processing Units (GPUs) essential for workloads such as deep neural networks. However, alternative architectures such as Field Programmable Gate Arrays (FPGAs) remain critical in resource-constrained and power-limited settings. Despite their advantages, FPGA programming remains challenging, as both traditional Hardware Description Languages (HDLs) and modern high-level frameworks, such as Spatial or Lift-HLS, lack explicit abstractions for coarse-grained resource sharing, which limits the efficient implementation of neural network applications.

This thesis adopts a functional programming-based approach to raise the level of abstraction in FPGA accelerator design while preserving performance.

Programs are lowered into an existing functional IR that captures both parallelism and memory behavior, including asynchronous off-chip accesses and synchronous on-chip buffering. The IR is extended with coarse-grained function sharing, enabling efficient deployment of neural network workloads while exposing architectural characteristics for systematic optimization and performance analysis. Concretely, this thesis makes three main contributions.

First, hardware resource usage is reduced through coarse-grained function sharing in the functional IR. Based on Let-bindings and λ-abstractions, shared computations are represented in a function-call-based execution model. Compiler rewrite rules and transformation passes eliminate redundant hardware and generate valid design points, including optimizations such as duplicate-path removal and function fusion to reduce sharing overhead. This enables full neural network deployment on a single FPGA while achieving competitive performance compared to layer-specialized and hand-crafted designs.
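As a rough illustration of the idea (not the thesis's actual IR), the sketch below models a tiny expression language with Let-bindings and λ-abstractions and counts λ-abstractions as a stand-in for instantiated hardware units. The class names `Var`, `Lam`, `App`, `Let` and the helper `count_units` are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


# Hypothetical mini-IR for illustration only; the real IR is richer.
@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Lam:
    """A lambda abstraction, read here as one hardware function unit."""
    param: str
    body: object


@dataclass(frozen=True)
class App:
    """Function application, i.e. one call site."""
    fn: object
    arg: object


@dataclass(frozen=True)
class Let:
    """Let-binding: names a value so several call sites can refer to it."""
    name: str
    value: object
    body: object


def count_units(expr):
    """Count lambda abstractions in the tree (assumed proxy for hardware units)."""
    if isinstance(expr, Var):
        return 0
    if isinstance(expr, Lam):
        return 1 + count_units(expr.body)
    if isinstance(expr, App):
        return count_units(expr.fn) + count_units(expr.arg)
    if isinstance(expr, Let):
        return count_units(expr.value) + count_units(expr.body)
    raise TypeError(f"unknown IR node: {expr!r}")


# Placeholder layer body; a real layer would describe a convolution.
conv = Lam("x", Var("x"))

# Unshared: the layer is inlined at both call sites.
unshared = App(conv, App(conv, Var("input")))

# Shared: the layer is bound once and called twice through its name.
shared = Let("f", conv, App(Var("f"), App(Var("f"), Var("input"))))

print(count_units(unshared), count_units(shared))
```

Under this toy reading, the unshared program contains two copies of the layer while the Let-bound version contains one, which is the kind of redundancy the rewrite rules described above are meant to eliminate.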

Second, optimizations such as data partitioning have a significant impact on performance, as they directly affect data reuse patterns and the efficient utilization of hardware resources. A divide-and-conquer primitive enables the symbolic expression of partitioning strategies, with semi-automated insertion of tunable parameters. These parameters are propagated through the compiler pipeline and evaluated using a cost model, avoiding expensive synthesis-driven evaluation while enabling efficient design-space exploration. Experiments on Intel Arria 10 FPGAs demonstrate competitive performance on VGG and TinyYOLO benchmarks.
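The following sketch shows, under simplifying assumptions, how a tunable partition size might be scored by an analytical cost model instead of by synthesis. The cost formula, the parameters `load_latency`, `lanes` and `buffer_capacity`, and the function names are all hypothetical and do not come from the thesis.

```python
def divisors(n):
    """All tile sizes that split a workload of n elements evenly."""
    return [d for d in range(1, n + 1) if n % d == 0]


def estimated_cycles(n, tile, load_latency=100, lanes=4):
    """Toy cost model: each tile pays a fixed off-chip load latency
    plus ceil(tile / lanes) cycles of compute."""
    tiles = n // tile
    compute = -(-tile // lanes)  # ceiling division
    return tiles * (load_latency + compute)


def explore(n, buffer_capacity):
    """Pick the tile size with the lowest estimated cost among those
    that fit in the (assumed) on-chip buffer."""
    candidates = [t for t in divisors(n) if t <= buffer_capacity]
    return min(candidates, key=lambda t: estimated_cycles(n, t))


best = explore(1024, buffer_capacity=256)
print(best, estimated_cycles(1024, best))
```

Because each candidate is scored by a closed-form estimate, the whole design space can be swept in milliseconds; a real flow would replace the toy formula with a model calibrated to the target device.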

Finally, Let-based sharing introduces routing congestion that worsens as the number of function invocations increases. To address this issue, a novel sharing mechanism, Reduce-based sharing, improves runtime flexibility with respect to the number of layers while reducing routing congestion during synthesis. Combined with SwitchApply over instruction streams, this approach enables programmable function units with shared control and datapaths. Upper-bounded streams further enhance programmability by reducing control overhead for data-shape management, thereby improving routability. Evaluations on networks ranging from LeNet-5 to ResNet demonstrate consistently routable designs and speedups of up to 3.4× over prior work.

Overall, this thesis demonstrates that a functional IR-driven approach bridges high-level programmability and hardware efficiency, enabling scalable FPGA accelerator design. The evaluation is conducted on classical convolutional neural network models, whose core operators (convolution and fully connected layers) remain fundamental building blocks in modern machine learning workloads, supporting the broader relevance of the results. This thesis represents a step towards bridging high-level machine learning frameworks and low-level hardware design.
