

# Efficient Hardware Design Using HDL & HLS

Bridging Software Expertise to Hardware Acceleration

Wuqiong Zhao, May 10, 2025

Department of ECE, UC San Diego



- 01 Introduction
- 02 Hardware Architecture
- 03 Hardware Description Language (HDL)
- 04 High-Level Synthesis (HLS)
- 05 Design Automation



## Introduction

## Shifting Landscape of Computation

#### The End of an Era?

- Moore's Law is slowing (transistor density).
- Dennard scaling has ended (power density).
- $\Rightarrow$ Performance gains from general-purpose CPUs/GPUs are diminishing.

#### The Rise of Data-Intensive Workloads:

- AI/ML (e.g., LLMs), Big Data, IoT.
- These workloads demand massive computational power and efficiency.

#### The Need for Specialization:

- CPUs/GPUs are not always optimal.
- Domain-specific architectures offer a path to continued performance and efficiency gains.

#### Attempts in Industry

New hardware targets for existing AI/ML software:

- Google **TPU**
- Google hls4ml
- AMD Vitis™
- AMD FINN
- Intel OpenVINO™

#### Analogy

- Software developers write software for fixed hardware.
- Hardware designers design the hardware itself.
- Full-stack engineers can do both, in an orchestrated way!

For the task you are targeting, do you want the hardware to work differently from general-purpose hardware?



**Programmable Hardware Fabric**: Imagine a vast array of configurable logic blocks (CLBs) and programmable interconnects.

**Customization After Manufacturing**: Unlike ASICs (Application-Specific Integrated Circuits), which are fixed, FPGAs can be *reprogrammed* for different functionalities.

#### Key Advantages for Custom Architectures:

- Tailored to Algorithm: Design hardware directly for specific algorithms.
- Massive Parallelism: Exploit fine-grained parallelism inherent in algorithms.
- Low Latency: Data can flow through custom datapaths without OS overhead or general-purpose instruction processing.
- **Power Efficiency**: Optimize for specific operations, reducing overhead.

"FPGAs offer a blank canvas for digital architects."

## FPGA: Field Programmable Gate Array



The different parts of an FPGA. Adapted from https://ni.scene7.com/is/image/ni/swvvifhq55851?scl=1.

More detailed composition of an FPGA:

- Configurable Logic Blocks (CLBs): Basic building blocks of FPGAs (LUTs/FFs).
- Interconnects: Programmable connections between CLBs.
- I/O Blocks: Interfaces for external communication.
- DSP Blocks: Specialized for high-performance arithmetic operations.
- Memory Blocks: Embedded memory resources.
- Clock Management: Resources for clock distribution and management.
- Configuration Memory: Stores the configuration data for the FPGA.
- Power Management: Resources for power distribution and management.
- **Embedded Processors**: Some FPGAs include soft or hard processors for general-purpose computing.

### ASIC = Application-Specific Integrated Circuit.

ASICs are most suitable for high-volume production with *a fixed design*. Typical use cases include baseband signal processing, video encoding/decoding, and other fixed-function accelerators.

#### **FPGAs are Promising**

- With the fast evolution of AI/ML algorithms, fixed-function ASICs become outdated quickly. **Reprogramming with FPGAs** is more flexible and cost-effective;
- FPGAs are becoming more capable as technology advances;
- FPGA + fixed ASIC cores (like a video codec) is a good combination.

Just as we rarely write assembly code directly in software, we do not wire up gates by hand in hardware. Instead, we describe the hardware's *gate logic* (HDL) or its *higher-level behavior* (HLS).

HDL and HLS are tools; it is the hardware architecture design that matters.

## Hardware Architecture

## Hardware Architecture: Key to Efficiency

1. **Parallelism**: Process multiple operations simultaneously:
   - Data parallelism (vector operations, etc.);
   - Task parallelism (concurrent execution paths).
2. **Pipelining**: Overlap execution stages for higher throughput:
   - Operation pipelining (breaking complex operations into stages);
   - Strategic register insertion for improved timing and throughput.
3. **Memory Hierarchy**: Optimize data access patterns:
   - Registers, caches, local buffers, external memory;
   - Minimize expensive memory accesses.
4. **Data Flow Optimization**: Minimize data movement:
   - Local processing units near data sources;
   - Direct streaming between components.
5. **Resource Sharing**: Balance area vs. performance with reconfigurable modules.
6. **Specialization**: Custom datapaths for specific algorithms.

## Example: General Matrix Multiplication (GEMM) Using Systolic Array



Advantages of systolic array:

- Parallelism: Multiple processing elements (PEs) work simultaneously.
- **Data Locality**: Data flows through the array, reducing memory access time.
- Scalability: Can be expanded to handle larger matrices.
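To make the data movement concrete, here is a minimal cycle-level software model of a systolic array computing $C = AB$. It is only a sketch under stated assumptions: the slide does not specify the dataflow, so an output-stationary arrangement is assumed (each PE keeps one element of $C$, operands of $A$ move left to right and operands of $B$ move top to bottom, with skewed inputs). The names `systolic_gemm`, `Matrix`, and the size `N` are illustrative.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t N = 3;  // assumed array size: N x N processing elements
using Matrix = std::array<std::array<int, N>, N>;

// Cycle-level model of an (assumed) output-stationary systolic array.
// PE(i, j) accumulates C[i][j]; it only talks to its left and top neighbors.
Matrix systolic_gemm(const Matrix& A, const Matrix& B) {
    Matrix C{};      // one accumulator per PE
    Matrix a_reg{};  // A operand held in each PE, forwarded to the right
    Matrix b_reg{};  // B operand held in each PE, forwarded downward

    // The last product reaches PE(N-1, N-1) at cycle 3N - 3.
    const std::size_t total_cycles = 3 * N - 2;
    for (std::size_t t = 0; t < total_cycles; ++t) {
        Matrix a_next{};
        Matrix b_next{};
        for (std::size_t i = 0; i < N; ++i) {
            for (std::size_t j = 0; j < N; ++j) {
                // Boundary PEs read skewed inputs (row i / column j is delayed
                // by i / j cycles); inner PEs read their neighbor's register.
                const int a_in = (j == 0)
                    ? ((t >= i && t - i < N) ? A[i][t - i] : 0)
                    : a_reg[i][j - 1];
                const int b_in = (i == 0)
                    ? ((t >= j && t - j < N) ? B[t - j][j] : 0)
                    : b_reg[i - 1][j];
                C[i][j] += a_in * b_in;  // multiply-accumulate in place
                a_next[i][j] = a_in;
                b_next[i][j] = b_in;
            }
        }
        // All registers update together, like flip-flops on a clock edge.
        a_reg = a_next;
        b_reg = b_next;
    }
    return C;
}
```

In this model every operand is fetched from memory once at the array boundary and then reused by up to `N` PEs, which is the data-locality advantage listed above.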

For matrices with special characteristics, such as symmetry, we can further optimize the systolic array.

An example finite-state machine (FSM) for a coin-controlled turnstile:



For hardware design simplicity, a **2- or 3-process FSM** is recommended (personal preference). A 2-process FSM consists of:

- Sequential logic: Stores the current state.
- Combinational logic: Determines the (*i*) next state and (*ii*) outputs based on the current state and inputs.
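The two-process split can be sketched as a small software model. This is an illustration, not RTL: the transitions assume the classic turnstile (a coin unlocks it, a push locks it again), since the slide's diagram is not reproduced here, and the names `State`, `Inputs`, `next_state`, `gate_open`, and `Turnstile` are hypothetical.

```cpp
// Software model of a 2-process FSM for the coin-controlled turnstile.
enum class State { Locked, Unlocked };

struct Inputs {
    bool coin;  // a coin was inserted this cycle
    bool push;  // the arm was pushed this cycle
};

// Process 1 (combinational): next state from current state and inputs.
State next_state(State s, Inputs in) {
    switch (s) {
        case State::Locked:
            return in.coin ? State::Unlocked : State::Locked;
        case State::Unlocked:
            return in.push ? State::Locked : State::Unlocked;
    }
    return State::Locked;  // safe default for unreachable encodings
}

// Process 1 (combinational): output depends only on the current state.
bool gate_open(State s) { return s == State::Unlocked; }

// Process 2 (sequential): the state register, updated once per clock edge.
struct Turnstile {
    State state = State::Locked;  // reset value
    void clock_edge(Inputs in) { state = next_state(state, in); }
};
```

Keeping `next_state` free of side effects mirrors the hardware split: the combinational function can be reasoned about on its own, and the only storage is the single state register.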

# Hardware Description Language (HDL)

Hardware Description Language (HDL): A specialized programming language used to describe the structure and behavior of digital circuits.

Key HDLs: Verilog, VHDL, SystemVerilog.

## Key Differences from Software Languages:

- Describes *parallel* hardware structures, not sequential steps;
- Code represents physical circuit components and connections;
- Timing is explicit and critical.

Abstraction Levels:

- **Behavioral**: Algorithmic description;
- Register-Transfer Level (RTL): Data flow between registers;
- Gate-level: Logic gates and connections.

Examples are in Verilog.

Combinational logic: direct, memoryless Boolean functions.

- Output depends solely on current input values;
- No memory or state; changes propagate immediately;
- Described with continuous assignments: `assign out = a & b;`.

Sequential logic: state-holding elements triggered by clock signals.

- Updates state values at specific clock transitions (rising/falling edges);
- Stores values in registers/flip-flops between clock cycles;
- Described in `always @(posedge clk)` blocks.

## Other Key HDL Concepts:

- Modules: Encapsulation units with defined interfaces (in/out);
- Signal assignments:
  - Blocking (=): Sequential evaluation (seldom used in module design);
  - Non-blocking (<=): Parallel evaluation (crucial for sequential logic);
- **Simulation vs. Synthesis**: Not everything that simulates can be synthesized into hardware.
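The difference between blocking and non-blocking assignments is easiest to see on a register swap. The following is a software analogy only (C++ has no non-blocking assignment); `Regs`, `blocking_update`, and `nonblocking_update` are illustrative names.

```cpp
// Two registers updated inside one clocked block.
struct Regs {
    int a;
    int b;
};

// Models Verilog blocking assignments:  a = b;  b = a;
// Each statement takes effect before the next one is evaluated.
Regs blocking_update(Regs r) {
    r.a = r.b;
    r.b = r.a;  // reads the already-updated a, so no swap happens
    return r;
}

// Models Verilog non-blocking assignments:  a <= b;  b <= a;
// All right-hand sides are sampled first, then every register updates together.
Regs nonblocking_update(Regs r) {
    const Regs sampled = r;  // values as seen just before the clock edge
    r.a = sampled.b;
    r.b = sampled.a;
    return r;
}
```

Starting from `a = 1, b = 2`, the blocking version ends with both registers holding 2, while the non-blocking version swaps them, which is what two flip-flops wired to each other would do.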

## HDL Workflow on FPGAs

1. HDL Design Entry: Write RTL code in Verilog/VHDL.
2. Functional Simulation: Verify logic functionality.
3. Synthesis: Convert HDL to optimized gate-level netlist.
4. Implementation:
   - Translation: Map netlist to target device;
   - Placement: Position logic elements;
   - Routing: Connect logic elements.
5. Timing Analysis: Verify timing constraints (and locate the *critical path*).
6. Bitstream Generation: Utilize the placement constraint file.
7. Device Programming: Upload bitstream to FPGA.

#### For ASICs

This is even more complicated with additional steps for fabrication!

# High-Level Synthesis (HLS)

**High-level synthesis (HLS)** allows software developers to create hardware using familiar programming languages.

## Key Benefits:

- Program in C/C++ instead of Verilog/VHDL;
- Faster development cycle (hours vs. days);
- Higher level of abstraction;
- Easier debugging and verification;
- Software-to-hardware transform.

Reference book: *Parallel Programming for FPGAs* (hlsbook.ucsd.edu)

## AMD Vitis HLS Workflow

1. Design using C++ with directives;
2. **CSim**: C++ simulation;
3. **Syn**: synthesis to RTL;
4. **CoSim**: software-hardware co-simulation;
5. **Impl**: implementation.

## ⚠ C++ but not Exactly C++

$\rightarrow$ The conflicting nature of *sequential* software and *parallel* hardware.

## Key Differences:

- No dynamic memory allocation;
- Limited recursion support;
- Loops often unrolled into parallel hardware;
- Function calls may be inlined as circuits;
- Limited standard library support.

## Common Pitfalls:

- Sequential thinking leads to poor hardware;
- Ignoring variable bit-width optimization;
- Inefficient loop design creates bottlenecks;
- Non-synthesizable constructs.

#### Important to Remember Using HLS

HLS tools translate the algorithm as it is written. Hardware-aware coding requires understanding the architectural implications to ensure performance.
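As a rough illustration of these constraints, the sketch below writes a small FIR filter step in a hardware-friendly style: fixed-size arrays instead of dynamic allocation, loops with compile-time bounds, and explicit integer widths. The function `fir_step` and the constant `TAPS` are hypothetical. Standard fixed-width types are used so the code compiles with any C++ compiler; in Vitis HLS one would typically narrow them further with the arbitrary-precision types from `ap_int.h`, which are not used here.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t TAPS = 4;  // assumed filter length, known at compile time

// One FIR filter step. State lives in a fixed-size array owned by the caller
// (no heap allocation), and both loops have constant trip counts.
std::int32_t fir_step(std::int16_t sample,
                      const std::int16_t (&coeff)[TAPS],
                      std::int16_t (&shift_reg)[TAPS]) {
    // Shift register: in hardware this is TAPS flip-flop stages, not a memory.
    for (std::size_t i = TAPS - 1; i > 0; --i) {
        shift_reg[i] = shift_reg[i - 1];
    }
    shift_reg[0] = sample;

    // 16-bit x 16-bit products fit in 32 bits; widths are chosen on purpose.
    std::int32_t acc = 0;
    for (std::size_t i = 0; i < TAPS; ++i) {
        acc += static_cast<std::int32_t>(coeff[i]) * shift_reg[i];
    }
    return acc;
}
```

A version built on `std::vector` and `push_back` would compute the same values in software but relies on dynamic memory, which falls under the differences listed above.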

**HLS Pragmas/Directives**: Hardware-specific annotations that guide the synthesis process without changing functional behavior.

## Common Directive Categories for Vitis HLS:

- Interface: AXI, memory ports
  - `#pragma HLS INTERFACE axis port=data`
- Loop Optimization: unroll, pipeline, merge
  - `#pragma HLS PIPELINE II=1`
  - `#pragma HLS UNROLL factor=4`
- Array Optimization: partition, reshape
  - `#pragma HLS ARRAY_PARTITION variable=buffer dim=1 complete`
- Function Inlining/Dataflow:
  - `#pragma HLS DATAFLOW`
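The directives listed above are placed inside the function or loop they apply to. The following dot-product kernel is a minimal sketch of that placement; the function `dot` and the size `N` are illustrative, and the resulting initiation interval and resource usage depend on the tool version and target device. A standard C++ compiler ignores the unknown pragmas (possibly with a warning), so the function behaves the same in C simulation.

```cpp
constexpr int N = 16;  // assumed vector length

// Dot product annotated with the directives from the list above.
int dot(const int a[N], const int b[N]) {
    // Split both arrays into individual registers so that several elements
    // can be read in the same cycle.
#pragma HLS ARRAY_PARTITION variable=a dim=1 complete
#pragma HLS ARRAY_PARTITION variable=b dim=1 complete

    int acc = 0;
    for (int i = 0; i < N; ++i) {
        // Request a new loop iteration every cycle, with 4 copies of the body.
#pragma HLS PIPELINE II=1
#pragma HLS UNROLL factor=4
        acc += a[i] * b[i];
    }
    return acc;
}
```

The pragmas only change how the loop is scheduled and how the arrays are stored; the returned value is the same with or without them.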

#### Impact on Design

- Resource utilization (LUT, DSP, BRAM, etc.)
- Throughput (initiation interval)
- Latency (cycle count)
- Clock frequency (due to critical path)

#### Trade-offs

More parallelism  $\implies$  higher performance but more resources.
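The relation between initiation interval (II), latency, and throughput can be estimated with a simple first-order model: a loop whose body takes `depth` cycles needs about `depth * trip_count` cycles without pipelining and about `depth + II * (trip_count - 1)` cycles with it. The helper names below are hypothetical, and real tool reports may differ by a few cycles of loop entry and exit overhead.

```cpp
#include <cstdint>

// First-order cycle estimate for a loop that is NOT pipelined:
// each iteration finishes before the next one starts.
constexpr std::uint64_t sequential_cycles(std::uint64_t trip_count,
                                          std::uint64_t depth) {
    return trip_count * depth;
}

// First-order cycle estimate for a pipelined loop: the first result appears
// after `depth` cycles, then one more iteration completes every `ii` cycles.
constexpr std::uint64_t pipelined_cycles(std::uint64_t trip_count,
                                         std::uint64_t depth,
                                         std::uint64_t ii) {
    return trip_count == 0 ? 0 : depth + ii * (trip_count - 1);
}
```

For 1000 iterations of a 5-cycle body, this model gives 5000 cycles unpipelined, 1004 cycles at II = 1, and 2003 cycles at II = 2, which is why the initiation interval rather than the per-iteration latency dominates throughput for long loops.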

**Caveat**: HLS tools change frequently. Check the Vitis HLS User Guide (UG1399) for the latest information.

## FLAMES HLS Library: Flexible Linear Algebra with Matrix-Empowered Synthesis

# FLAMES High-Level Synthesis

Vitis HLS Support | C++14/17 | Template-Based | Header Only





© 2025 IEEE. Reprinted, with permission, from "Flexible High-Level Synthesis Library for Linear Transformations", DOI: 10.1109/TCSII.2024.3366282. [1]

The FLAMES high-level synthesis (HLS) library provides a class- and template-based interface for linear algebra. [1]

For the Neumann series approximation (NSA), $A^{-1} = \lim_{n\to\infty} \sum_{i=0}^{n} (-D^{-1}E)^{i} D^{-1}$, where $A = D + E$, $D \triangleq A \circ I$ is the diagonal part, and $E$ is the off-diagonal part. [1]

| #  | Formula                  | FLAMES HLS C++ Implementation     |
|----|--------------------------|-----------------------------------|
| 1  | $D = A \circ I$          | `auto D = mat.diagMat_();`        |
| 2  | $E = A - D$              | `auto E = mat.offDiag_();`        |
| 3  | $D_I = D^{-1}$           | `auto D_I = D.inv();`             |
| 4  | $P = -D_I E$             | `auto P = -D_I * E;`              |
| 5  | $X = P$ (Iter. 1)        | `auto X = P_ = P;`                |
| 6  | for $i = 2, \ldots, n$   | `for (int i = 2; i <= n; ++i) {`  |
| 7  | $P^{i} = P^{i-1} P$      | `P_ *= P;`                        |
| 8  | $X = X + P^{i}$          | `X += P_;`                        |
| 9  | end                      | `}`                               |
| 10 | $A^{-1} = X D_I + D_I$   | `A_inv = X * D_I + D_I;`          |

Classes are templated, and there is a little C++ metaprogramming. Overloaded functions are provided for matrix operations.
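To check the arithmetic of the ten steps in the table, here is a plain C++ sketch of the same iteration that does not use FLAMES at all. It relies on `std::array` and hand-written loops, and the names `Mat`, `matmul`, and `nsa_inverse` are illustrative and not part of the library. The series converges when the spectral radius of $D^{-1}E$ is below 1, for example for strictly diagonally dominant matrices.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t N = 3;  // assumed matrix size
using Mat = std::array<std::array<double, N>, N>;

// Dense matrix product, written out explicitly.
Mat matmul(const Mat& a, const Mat& b) {
    Mat c{};
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < N; ++j)
            for (std::size_t k = 0; k < N; ++k)
                c[i][j] += a[i][k] * b[k][j];
    return c;
}

// Neumann series approximation of A^{-1} with n iterations (steps 1-10).
Mat nsa_inverse(const Mat& A, int n) {
    Mat D_I{};  // D^{-1}: reciprocal of the diagonal of A (steps 1 and 3)
    Mat P{};    // P = -D^{-1} E, with E the off-diagonal part (steps 2 and 4)
    for (std::size_t i = 0; i < N; ++i) {
        D_I[i][i] = 1.0 / A[i][i];
        for (std::size_t j = 0; j < N; ++j)
            if (i != j) P[i][j] = -D_I[i][i] * A[i][j];
    }

    Mat X = P;      // running sum P + P^2 + ... (step 5)
    Mat P_pow = P;  // current power P^i
    for (int i = 2; i <= n; ++i) {  // steps 6-9
        P_pow = matmul(P_pow, P);
        for (std::size_t r = 0; r < N; ++r)
            for (std::size_t c = 0; c < N; ++c)
                X[r][c] += P_pow[r][c];
    }

    Mat A_inv = matmul(X, D_I);  // step 10: A^{-1} = X D_I + D_I
    for (std::size_t i = 0; i < N; ++i) A_inv[i][i] += D_I[i][i];
    return A_inv;
}
```

Because $D$ is diagonal, its inverse costs only `N` divisions, which is what makes this approximation attractive in hardware compared with a general matrix inversion.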

## Timeline Trace Example

NSA co-simulation timeline for an $8 \times 8$ real matrix with 4 iterations.

Website: flames.autohdw.com | GitHub: autohdw/flames | PDF

Hardware-friendly designs: [1]

1. Optimized RAM usage: fixing the lack of return value optimization (RVO);
2. Configurable parallelism: using pragmas to control parallelism;
3. Optimized matrix operations: function overloading.

## Limitations of FLAMES

- Just a proof of concept.
- Data streaming not considered.
- Pragma configurations are mostly limited to the global scope.

# **Design Automation**

Difficult to get AI tools to write correct & efficient implementations of hardware!

- AI models are trained on little publicly accessible RTL/HLS code;
- Hardware requires precise timing, resource awareness, and physical constraints (case-by-case implementation & optimization);
- AI tools struggle with synthesizable vs. simulation-only constructs.

#### **Current Best Practice**

Use AI for initial templates and algorithmic sketches, but rely on hardware expertise for implementation details and optimizations.

What are eDSLs? Embedded domain-specific languages: specialized languages embedded within a host language.

#### Benefits for Hardware Design:

- Combines host language power with hardware abstractions;
- Automated design generation and verification;
- Tight integration with software ecosystem;
- Type safety and compile-time checks.

#### Notable Hardware eDSLs

- Chisel (Scala-based HDL)
- AHDW [2] (Automatic HDW language for Verilog target)
- **PyTV/Verithon** (Python-templated Verilog)
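The core idea behind these tools, using a host language to generate hardware descriptions, can be shown with a deliberately tiny generator. This is not the API of Chisel, AHDW, or PyTV; `make_register` is a hypothetical helper that emits Verilog text from ordinary C++ values.

```cpp
#include <string>

// Hypothetical generator: emits a Verilog register module of a given width.
// The host language (C++) handles parameters, naming, and string assembly.
std::string make_register(const std::string& name, int width) {
    const std::string range = "[" + std::to_string(width - 1) + ":0]";
    return "module " + name + " (\n"
           "  input  wire clk,\n"
           "  input  wire " + range + " d,\n"
           "  output reg  " + range + " q\n"
           ");\n"
           "  always @(posedge clk) q <= d;\n"
           "endmodule\n";
}
```

Calling `make_register("reg8", 8)` yields a module named `reg8` with 8-bit ports. Real eDSLs go further than string assembly by adding the type checks and compile-time validation listed above.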

## AHDW VS Code Extension



AHDW VS Code Extension.

```
[AHDW] Project name set to 'example1' (version 0.1.0).
[AHDW] AHDW Sources:
[AHDW] $ example.ahdw
[AHDW] Parameter 'key1':
[AHDW] * TYPE: STRING
[AHDW] * DEFAULT: default value
[AHDW] * REQUIRED: FALSE
[AHDW] > VALUE: example (user-specified)
[MESSAGE] AHDW is powerful!
[AHDW] Processing AHDW Source 'example.ahdw' ...
[AHDW] Set output directory as 'ahdw out'.
[AHDW] Set output file name as 'file name example 1 5.v'.
[ERROR] Syntax error in 'example.ahdw' (line 10) [Cannot parse expression]:
[ERROR] module module_{{a*2}}_foo #(
[ERROR]               ~~~~~~~
```

## PyTV — Python-Templated Verilog

GitHub: autohdw/pytv | Website: docs.rs/pytv



- ``//! a = 1 + 2; # Python inline``
- ``assign wire_`a` = wire_b; // Verilog with variable/expression``
- ``/*! b = a ** 2; # Python block */``

# Conclusion

## System Integration:

- Seamless CPU/GPU-FPGA heterogeneous platforms;
- Automated hardware-software co-design frameworks;
- System-on-chip (SoC) platforms bridging SW/HW communities;
- Other diverse applications such as SmartNICs.

## Hardware-Aware Software:

- ML frameworks with native hardware acceleration paths;
- Domain-specific compilers optimizing for custom hardware;
- Automated hardware-specific code transformations;
- Cross-layer optimizations spanning SW/HW boundaries.

#### Key Takeaways:

1. With Moore's Law slowing, specialized hardware offers a new performance frontier;
2. **FPGAs** provide a flexible middle ground between general-purpose processors and ASICs;
3. **HDLs** enable direct hardware description but require hardware thinking;
4. **HLS** bridges software and hardware domains, making acceleration more accessible;
5. **Design automation** tools like FLAMES and eDSLs further simplify hardware design.

- [1] W. Zhao, C. Li, Z. Ji, Z. Guo, X. Chen, Y. You, Y. Huang, X. You, and C. Zhang, "Flexible high-level synthesis library for linear transformations," *IEEE Transactions on Circuits and Systems II: Express Briefs*, vol. 71, no. 7, pp. 3348–3352, Jul. 2024.
- [2] W. Zhao, C. Li, Z. Ji, Y. You, X. You, and C. Zhang, "Automatic timing-driven top-level hardware design for digital signal processing," in *2023 IEEE 15th International Conference on ASIC (ASICON)*, Nanjing, China, Oct. 2023.

# Thanks!

#### PDF online: https://go.wqzhao.org/hdl-hls-slides-25