A closer look at top-down synthesis of matrix multiplication hardware blocks. We consider two design alternatives (1) reduction trees, (2) systolic arrays and show how to describe them in RTL.

Top-Down Synthesis

To run the examples accompanying this lecture, please checkout the git repository

$ ssh -X <username>@eceubuntu.uwaterloo.ca
$ mkdir $HOME/ece327-lectures
$ cd $HOME/ece327-lectures
$ git clone gitlab@git.uwaterloo.ca:ece327-lectures-s19/03b-synth.git
$ cd 03b-synth/code

In this lecture, we consider the process of performing top-down synthesis of hardware. Given a high-level functional specification, we can explore design choices that meet functional needs. For instance, we want to develop a piece of hardware to perform 4x4 matrix-matrix multiplication. We consider two high-level strategies for address this design challenge.

Reduction Tree

The reduction tree architecture exploits parallelism in the dot product operation. A matrix-matrix multiplication can be decomposed into several dot product operations. For instance if you are multiplying A and B matrices to compute D=A.B+C, we can describe the C psuedocode.

  for(int i=0;i<4;i++) {
    for(int j=0;j<4;j++) {
      D[i,j] = C[i,j];
      for(int k=0;k<4;k++) {
        D[i,j] += A[i,k] * B[k,j];

Here to compute D[0,0], we perform a dot product on A[0*] (row) and B[*0] (column).

Each dot production can operate in parallel. Within a dot product operation, we can further exploit parallelism in the multiply-add operations.

We can look at dot_prod.sv file. Here we use SystemVerilog syntax to allow floating-point number representation.

$ cat dot_prod.sv

Note the use of single-precision shortreal type and double-precision real type. To accumulate larger precision values, we expand from single->double precision in this code. Also note that the design is fixed to size 4x1 (a), 1x4 (b).

module dot_prod
    (input shortreal a0,
     input shortreal a1,
     input shortreal a2,
     input shortreal a3,
     input shortreal b0,
     input shortreal b1,
     input shortreal b2,
     input shortreal b3,
     input real c,
     output real d

Internally, the code performs a fixed-length 4-long dot product.

assign d1 = a0*b0 + a1*b1;
assign d2 = a2*b2 + a3*b3;
assign d  = (d1 + d2) + c;

We can synthesize a variant of this design using [31:0] unsigned numbers instead of real (Real types can be simulated, but not directly synthesized).

$ vivado -s dot_prod_synth.tcl


We then look at tensor_core.sv to assemble a hardware design for 4x4 multiplication.

$ cat tensor_core.sv

Here, we use 2D signals of real and shortreal type. Multi-dimensional signals need to be defined with endianness 0->3 or 3->0 consideration. This is not directly of any consequence for this design, but when consumed in arithmetic operations, this defines LSB->MSB directionality. Verilog has some restrictions of what it permits on the top-level design ports for multi-dimensional signals. In Lab4, our systolic.sv file cannot have a multi-dimensional port at the top-level for Xilinx Vivado block design. This is a quirky limitation of the tool. I would encourage you to be familiar with the simpler SystemVerilog syntax as tools will eventually mature to support this user-friendly feature.

module tensor_core (
  input shortreal a [0:3][0:3],
  input shortreal b [0:3][0:3],
  input real c [0:3][0:3],
  output real d [0:3][0:3]

Internally, the tensor core will generate a 2D array of dot_product blocks. We will use nested generate Verilog statements to construct this.

    genvar i,j;
    generate for (i=0; i <= 3; i = i + 1) begin: row
        for (j=0; j <= 3; j = j + 1) begin: col
            dot_prod dot_prod_inst(

    end endgenerate

The generate statement gets unrolled statically at compile time, so the generated hardware is of fixed size. Look back at the C pseudocode from earlier to convince yourself this does what it claims to do.

We can synthesize the complete 4x4 multiplier with Vivado using unsigned types ([31:0] instead of real).

$ vivado -s tensor_core_synth.tcl


Systolic Array

A systolic array is a 2D arrangement of simple multiply-accumulate blocks. Data is pumped into the array in systolic fashion. The term systolic derives its meaning from the rhythmic pumping of blood from the heart. In the digital design, the rhythmic pumping refers to how data is pushed into the array. This is organized in a stream-like fashion, much like how blood flows through the arteries and veins of the body.

We can inspect one element of the systolic array in the file systolic_leaf.sv as shown below.

$ cat code/systolic_leaf.sv

The systolic streaming of A matrix happens along the a_in input and the a_out output. For the B matrix happens along the corresponding b_in input and b_out output. To keep things synthesizable, we use signed 8b numbers, which also match the precision of Google's TPU v1. Note that outputs a_out and b_out are wires.

module systolic_leaf
    (input wire clk,
     input wire rst,
     input wire signed [7:0] a_in,
     input wire signed [7:0] b_in,
     output wire signed [7:0] a_out,
     output wire signed [7:0] b_out,
     output wire signed [31:0] d_out

Within the block, you will see the internal organization of this simple cell

reg signed [31:0] d;
reg signed [7:0] a_tmp;
reg signed [7:0] b_tmp;

always@(posedge clk) begin
  if (rst) begin
    d     <= 0;
    a_tmp <= 0;
    b_tmp <= 0;
  end else begin
    d <= d + a_tmp*b_tmp;
    a_tmp <= a_in;
    b_tmp <= b_in;

assign a_out = a_tmp;
assign b_out = b_tmp;
assign d_out = d;

Here, the d signal will perform a multiply accumulate function. It does this on registered copies of the horizontal and vertical inputs. We register a_in into temporary register a_tmp and b_in into b_tmp. The systolic stream then links a_tmp to outgoing wire a_out and b_tmp to outgoing wire b_out. Also the precision of d is larger than the inputs to permit accumulation of a large stream of numbers. Think about how long this sequence may be before you run out of bits.

You can refer to the step-by-step snapshot of the systolic design in the lecture slides to understand how the systolic array performs this computation. The key idea is the reuse of operands across the array. This reuse reduces the need to perform expensive memory fetches and stores on the matrix elements. The systolic distribution pipeline fully handles data reuse for you.

$ vivado -s systolic_leaf_synth.tcl

This will generate the following datapath.


At the macro level, we synthesize the 4x4 array of systolic cells in systolic_complete.sv.

  1. We use a set of arrays horizontal_wires and vertical_wires. We flatten a multi-dimensional signal into a 2D array to illustrate array slicing syntax in Verilog. In SystemVerilog syntax, you can use the simpler 3D array syntax. We will let you figure that out for Lab3 and Lab4. verilog wire signed [8*SIZE*(SIZE+1)-1 : 0] horizontal_wires; wire signed [8*SIZE*(SIZE+1)-1 : 0] vertical_wires;
  2. We use a nested generate loop to instantiate copies of systolic_leaf cells. We also identify how to extract the correct slice of bits depending on the generate loop indices i and j, verilog
$ vivado -s systolic_complete_synth.tcl

The complete 2D array will be generated as seen below. This image does not look like a neat 4x4 arrangement of blocks due to whatever layout algorithm is being used by Vivado's internal drawing tool.