# Top-Down Synthesis

To run the examples accompanying this lecture, please checkout the git repository

```
$ ssh -X <username>@eceubuntu.uwaterloo.ca
$ mkdir $HOME/ece327-lectures
$ cd $HOME/ece327-lectures
$ git clone gitlab@git.uwaterloo.ca:ece327-lectures-s19/03b-synth.git
$ cd 03b-synth/code
```

In this lecture, we consider the process of performing top-down synthesis of hardware. Given a high-level functional specification, we can explore design choices that meet functional needs. For instance, we want to develop a piece of hardware to perform 4x4 matrix-matrix multiplication. We consider two high-level strategies for address this design challenge.

## Reduction Tree

The reduction tree architecture exploits parallelism in the dot product operation.
A matrix-matrix multiplication can be decomposed into several dot product operations.
For instance if you are multiplying `A`

and `B`

matrices to compute `D=A.B+C`

, we can describe the C psuedocode.

```
for(int i=0;i<4;i++) {
for(int j=0;j<4;j++) {
D[i,j] = C[i,j];
for(int k=0;k<4;k++) {
D[i,j] += A[i,k] * B[k,j];
}
}
}
```

Here to compute `D[0,0]`

, we perform a dot product on `A[0*]`

(row) and `B[*0]`

(column).

Each dot production can operate in parallel. Within a dot product operation, we can further exploit parallelism in the multiply-add operations.

We can look at `dot_prod.sv`

file. Here we use SystemVerilog syntax to allow floating-point number representation.

```
$ cat dot_prod.sv
```

Note the use of single-precision `shortreal`

type and double-precision `real`

type. To accumulate larger precision values, we expand from single->double precision in this code.
Also note that the design is fixed to size 4x1 (a), 1x4 (b).

```
module dot_prod
(input shortreal a0,
input shortreal a1,
input shortreal a2,
input shortreal a3,
input shortreal b0,
input shortreal b1,
input shortreal b2,
input shortreal b3,
input real c,
output real d
);
```

Internally, the code performs a fixed-length 4-long dot product.

```
assign d1 = a0*b0 + a1*b1;
assign d2 = a2*b2 + a3*b3;
assign d = (d1 + d2) + c;
```

We can synthesize a variant of this design using `[31:0]`

unsigned numbers instead of `real`

(Real types can be simulated, but not directly synthesized).

```
$ vivado -s dot_prod_synth.tcl
```

We then look at `tensor_core.sv`

to assemble a hardware design for 4x4 multiplication.

```
$ cat tensor_core.sv
```

Here, we use 2D signals of `real`

and `shortreal`

type.
Multi-dimensional signals need to be defined with endianness 0->3 or 3->0 consideration.
This is not directly of any consequence for this design, but when consumed in arithmetic operations, this defines LSB->MSB directionality.
Verilog has some restrictions of what it permits on the top-level design ports for multi-dimensional signals.
In Lab4, our `systolic.sv`

file cannot have a multi-dimensional port at the top-level for Xilinx Vivado block design. This is a quirky limitation of the tool.
I would encourage you to be familiar with the simpler SystemVerilog syntax as tools will eventually mature to support this user-friendly feature.

```
module tensor_core (
input shortreal a [0:3][0:3],
input shortreal b [0:3][0:3],
input real c [0:3][0:3],
output real d [0:3][0:3]
);
```

Internally, the tensor core will generate a 2D array of `dot_product`

blocks.
We will use nested `generate`

Verilog statements to construct this.

```
genvar i,j;
generate for (i=0; i <= 3; i = i + 1) begin: row
for (j=0; j <= 3; j = j + 1) begin: col
dot_prod dot_prod_inst(
.a0(a[i][0]),
.a1(a[i][1]),
.a2(a[i][2]),
.a3(a[i][3]),
.b0(b[0][j]),
.b1(b[1][j]),
.b2(b[2][j]),
.b3(b[3][j]),
.c(c[i][j]),
.d(d[i][j]));
end
end endgenerate
```

The `generate`

statement gets unrolled statically at compile time, so the generated hardware is of fixed size.
Look back at the C pseudocode from earlier to convince yourself this does what it claims to do.

We can synthesize the complete 4x4 multiplier with Vivado using unsigned types (`[31:0]`

instead of `real`

).

```
$ vivado -s tensor_core_synth.tcl
```

## Systolic Array

A systolic array is a 2D arrangement of simple multiply-accumulate blocks. Data
is pumped into the array in `systolic`

fashion. The term `systolic`

derives its
meaning from the rhythmic pumping of blood from the heart. In the digital
design, the rhythmic pumping refers to how data is pushed into the array. This
is organized in a stream-like fashion, much like how blood flows through the
arteries and veins of the body.

We can inspect one element of the systolic array in the file `systolic_leaf.sv`

as shown below.

```
$ cat code/systolic_leaf.sv
```

The systolic streaming of `A`

matrix happens along the `a_in`

input and the `a_out`

output. For the `B`

matrix happens along the corresponding `b_in`

input and
`b_out`

output. To keep things synthesizable, we use signed 8b numbers, which
also match the precision of Google's TPU v1. Note that outputs `a_out`

and
`b_out`

are wires.

```
module systolic_leaf
(input wire clk,
input wire rst,
input wire signed [7:0] a_in,
input wire signed [7:0] b_in,
output wire signed [7:0] a_out,
output wire signed [7:0] b_out,
output wire signed [31:0] d_out
);
```

Within the block, you will see the internal organization of this simple cell

```
reg signed [31:0] d;
reg signed [7:0] a_tmp;
reg signed [7:0] b_tmp;
always@(posedge clk) begin
if (rst) begin
d <= 0;
a_tmp <= 0;
b_tmp <= 0;
end else begin
d <= d + a_tmp*b_tmp;
a_tmp <= a_in;
b_tmp <= b_in;
end
end
assign a_out = a_tmp;
assign b_out = b_tmp;
assign d_out = d;
```

Here, the `d`

signal will perform a multiply accumulate function. It does this
on registered *copies* of the horizontal and vertical inputs. We register `a_in`

into temporary register `a_tmp`

and `b_in`

into `b_tmp`

. The systolic stream
then links `a_tmp`

to outgoing wire `a_out`

and `b_tmp`

to outgoing wire
`b_out`

. Also the precision of `d`

is larger than the inputs to permit
accumulation of a large stream of numbers. Think about how long this sequence
may be before you run out of bits.

You can refer to the step-by-step snapshot of the systolic design in the lecture
slides to understand *how* the systolic array performs this computation. The key
idea is the reuse of operands across the array. This reuse reduces the need to
perform expensive memory fetches and stores on the matrix elements. The systolic
distribution pipeline fully handles data reuse for you.

```
$ vivado -s systolic_leaf_synth.tcl
```

This will generate the following datapath.

At the macro level, we synthesize the 4x4 array of systolic cells in
`systolic_complete.sv`

.

- We use a set of arrays
`horizontal_wires`

and`vertical_wires`

. We flatten a multi-dimensional signal into a 2D array to illustrate array slicing syntax in Verilog. In SystemVerilog syntax, you can use the simpler 3D array syntax. We will let you figure that out for Lab3 and Lab4.`verilog wire signed [8*SIZE*(SIZE+1)-1 : 0] horizontal_wires; wire signed [8*SIZE*(SIZE+1)-1 : 0] vertical_wires;`

- We use a nested
`generate`

loop to instantiate copies of`systolic_leaf`

cells. We also identify how to extract the correct slice of bits depending on the generate loop indices`i`

and`j`

,`verilog`

```
$ vivado -s systolic_complete_synth.tcl
```

The complete 2D array will be generated as seen below. This image does not look like a neat 4x4 arrangement of blocks due to whatever layout algorithm is being used by Vivado's internal drawing tool.