CNN Accelerator on FPGA (PYNQ-Z1)

This project implements a hardware-accelerated 3-layer Fully Convolutional Neural Network (FCN) on the PYNQ-Z1 FPGA board. The accelerator is designed to run inference on handwritten digits in real time, using dedicated hardware modules for convolution, ReLU activation, pooling, and classification.

Overview

The goal of this project was to design and deploy a complete CNN architecture directly onto an FPGA. The process included:

Training the model in PyTorch to obtain quantized weights
Implementing each layer (Conv, ReLU, MaxPooling) in SystemVerilog
Designing the dataflow using the AXI4-Stream protocol
Using the PYNQ platform for visualization and software-hardware interfacing
Validating functionality in simulation with Verilator
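
When validating the SystemVerilog layers in simulation, it helps to have a small software reference to compare against. The following is a minimal sketch of one Conv → ReLU → MaxPooling stage in plain Python. The kernel size, stride 1, no padding, and 2×2 pooling window are assumptions for illustration; this README does not state the actual layer parameters, and the function names are hypothetical.

```python
def conv2d_valid(image, kernel):
    """2D cross-correlation, stride 1, no padding (assumed parameters)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(fmap):
    """Clamp negative activations to zero."""
    return [[max(0.0, v) for v in row] for row in fmap]


def max_pool2x2(fmap):
    """2x2 max pooling with stride 2; a trailing odd row or column is dropped."""
    return [
        [
            max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
            for c in range(0, len(fmap[0]) - 1, 2)
        ]
        for r in range(0, len(fmap) - 1, 2)
    ]


def stage(image, kernel):
    """One Conv -> ReLU -> MaxPooling stage, usable as a reference in testbenches."""
    return max_pool2x2(relu(conv2d_valid(image, kernel)))
```

Feeding the same input to this reference and to a Verilator testbench, then comparing outputs within an FP16 tolerance, is one way to check a layer before moving to hardware.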

Architecture

3-layer FCN: three Conv → ReLU → Pooling stages followed by a fully connected layer
Custom FP16 Arithmetic Units: Optimized for resource-constrained environments
DMA Integration: Efficient transfer between PS (ARM processor) and PL (FPGA logic)
AXI4-Stream: Used for internal data movement and external video output
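
To feed FP16 arithmetic units, trained weights have to be converted to 16-bit patterns at some point. The sketch below shows one way to do that with Python's standard `struct` module. It assumes the hardware uses the IEEE 754 binary16 layout, which may not match the project's custom FP16 units, and the one-hex-word-per-line output format is a hypothetical choice rather than the project's actual file format.

```python
import struct


def float_to_fp16_bits(value):
    """Return the IEEE 754 binary16 bit pattern of value as an integer.

    struct raises OverflowError if value is too large for binary16.
    """
    return struct.unpack("<H", struct.pack("<e", value))[0]


def fp16_bits_to_float(bits):
    """Decode a 16-bit binary16 pattern back to a Python float."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def weights_to_hex_lines(weights):
    """Format each weight as a 4-digit hex word (assumed memory-file layout)."""
    return [f"{float_to_fp16_bits(w):04x}" for w in weights]
```

For example, `weights_to_hex_lines([1.0, -2.0, 0.5])` yields `['3c00', 'c000', '3800']`. Decoding the words back with `fp16_bits_to_float` gives a quick measure of how much precision each weight loses in FP16.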

Features

Real-time inference for MNIST-like datasets
Python-PYNQ interface for control and visualization
Hardware simulation using Verilator
Modular design for easy hardware debugging and extension

Repository Structure


├── notebooks/            # Jupyter notebooks (PYNQ interface)
├── hdl/                  # SystemVerilog source files
├── sim/                  # Verilator testbenches
├── images/               # Sample outputs and diagrams
├── scripts/              # Helper scripts (e.g., packaging, conversion)
└── README.md             # Project documentation

Limitations

While the full architecture was successfully simulated in Verilator, we were unable to deploy the entire CNN to the PYNQ-Z1 board due to hardware constraints:

The complete network exceeded available DSPs and LUTs on the PYNQ-Z1.
Only individual layers could fit in isolation on the board.
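
A back-of-the-envelope estimate shows how quickly a fully parallel design can outgrow the board. The PYNQ-Z1's Zynq XC7Z020 provides 220 DSP slices. The layer shapes below are hypothetical (this README does not list the real channel counts or kernel sizes), and the model assumes roughly one DSP slice per FP16 multiplier, which ignores the LUTs needed for alignment and normalization.

```python
ZYNQ_7020_DSP_SLICES = 220  # DSP48E1 slices on the XC7Z020


def parallel_multipliers(in_channels, out_channels, kernel_size):
    """Multipliers needed to produce one output pixel per cycle for every output channel."""
    return in_channels * out_channels * kernel_size * kernel_size


# Hypothetical (in_channels, out_channels, kernel_size) for three conv layers.
example_layers = [(1, 4, 3), (4, 8, 3), (8, 16, 3)]

per_layer = [parallel_multipliers(*shape) for shape in example_layers]
total = sum(per_layer)

for shape, count in zip(example_layers, per_layer):
    fits = "fits" if count <= ZYNQ_7020_DSP_SLICES else "does not fit"
    print(f"layer {shape}: {count} multipliers ({fits} alone)")
print(f"all layers: {total} multipliers vs {ZYNQ_7020_DSP_SLICES} DSP slices")
```

Under these assumed shapes the first layer fits comfortably on its own while the later layers and the full pipeline do not, which is the same pattern described above, though the real numbers depend on the actual network.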

Future Work

If we were to revisit this project, we would consider two possible directions:

Scaling Up: Use a larger FPGA with more DSPs and logic resources so that every layer of the network fits on the device at the same time.
Dynamic Reuse via Software: Shift more control to the software layer (e.g., Jupyter notebooks), dynamically loading kernel weights and layer configurations to reuse the same convolution hardware across layers. This would reduce hardware usage but introduce a significant performance bottleneck due to slower data transfer between the software (Zynq PS) and hardware (PL) domains.
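
The dynamic-reuse idea can be sketched in software. In the example below, `SharedConvEngine` is a hypothetical stand-in for a single convolution block in the PL. It is not the PYNQ API, and it uses a 1D convolution with ReLU to stay short. The `transfers` counter models each weight load and each data round trip between PS and PL, which is where the bottleneck would come from.

```python
class SharedConvEngine:
    """Software model of one reusable convolution block (hypothetical)."""

    def __init__(self):
        self.kernel = None
        self.transfers = 0  # number of PS<->PL transfers issued so far

    def load_weights(self, kernel):
        """Model sending a new kernel to the hardware (one transfer)."""
        self.kernel = list(kernel)
        self.transfers += 1

    def run(self, data):
        """Model one send, one compute pass (1D conv + ReLU), and one receive."""
        self.transfers += 2
        k = self.kernel
        out_len = len(data) - len(k) + 1
        return [
            max(0, sum(data[i + j] * k[j] for j in range(len(k))))
            for i in range(out_len)
        ]


def run_network(engine, layer_kernels, data):
    """Reuse the same engine for every layer, reloading weights each time."""
    for kernel in layer_kernels:
        engine.load_weights(kernel)
        data = engine.run(data)
    return data


engine = SharedConvEngine()
result = run_network(engine, [[1, 1], [-1, 1], [2]], [1, 2, 3, 4])
print(result, engine.transfers)
```

Each layer costs three transfers in this model, so the transfer count grows linearly with network depth while the hardware footprint stays at a single convolution block.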
