



### SIGMA: A <u>Sparse and Irregular GEMM</u> <u>Accelerator with Flexible Interconnects</u> for DNN Training

<u>ERIC QIN</u><sup>\*</sup>, ANANDA SAMAJDAR<sup>\*</sup>, HYOUKJUN KWON<sup>\*</sup>, VINEET NADELLA<sup>\*</sup>, SUDARSHAN SRINIVASAN<sup>#</sup>, DIPANKAR DAS<sup>#</sup>, BHARAT KAUL<sup>#</sup>, TUSHAR KRISHNA<sup>\*</sup>

> \* GEORGIA TECH # INTEL

#### Outline

- Motivation
  - GEMMs in Deep Learning
  - Utilization on TPU (Systolic Array)
- Accelerator Requirements
- SIGMA
  - Interconnects Implementations
  - Full System Design
- Evaluation
- Conclusion

#### Outline

#### • Motivation

- GEMMs in Deep Learning
- Utilization on TPU (Systolic Array)
- Accelerator Requirements
- SIGMA
  - Interconnects Implementations
  - Full System Design
- Evaluation
- Conclusion

#### **Deep Learning Applications**



**Speech Recognition** 



Language Understanding



**Recommender Systems** 

#### **Deep Learning Applications**

# What is the key computation for these Deep Learning applications?



#### Runtime breakdown on V100 GPU



#### Runtime breakdown on V100 GPU



Matrix multiplications (GEMMs) consume around **70%** of the total runtime when training modern deep learning workloads.







Figure derived from HyPar: Towards Hybrid Parallelism for Deep Learning Accelerator Array, HPCA 2019, Song et al.



## GEMM is a key compute primitive to accelerate in hardware to speed up training.



Figure derived from HyPar: Towards Hybrid Parallelism for Deep Learning Accelerator Array, HPCA 2019, Song et al.

#### **SIMT Architectures**



**Nvidia GTX GPUs** 

#### **SIMT Architectures**



Nvidia GTX GPUs

#### **SIMD Architectures**





**Tesla FSDC** 

#### **SIMT Architectures**



**Nvidia GTX GPUs** 

#### SIMD Architectures





Tesla FSDC

#### **Systolic Architectures**



Xilinx xDNN



#### **Nvidia Tensor Cores**



**Google TPU** 

#### **SIMT Architectures**



**Nvidia GTX GPUs** 

#### SIMD Architectures



# Tesla FSDC

#### **Systolic Architectures**



Xilinx xDNN

### 12X THROUGHPUT

#### **Nvidia Tensor Cores**



**Google TPU** 

Recently, systolic array based architectures are popular for accelerating GEMMs.

#### Target comparison: Google TPU



TPU v2 - 4 chips, 2 cores per chip

TPU v3 - 4 chips, 2 cores per chip

### Our target comparison is with the Google TPU, which uses **128 x 128** systolic arrays.

#### Outline

#### • Motivation

- GEMMs in Deep Learning
- Utilization on TPU (Systolic Array)
- Accelerator Requirements
- SIGMA
  - Interconnects Implementations
  - Full System Design
- Evaluation
- Conclusion

#### Systolic Array Architectures



#### Systolic Array Architectures





| Workload    | Application                | Example Dimensions |       |        |
|-------------|----------------------------|--------------------|-------|--------|
|             |                            | Μ                  | Ν     | K      |
| GNMT        | Machine<br>Translation     | 128                | 2048  | 4096   |
|             |                            | 320                | 3072  | 4096   |
|             |                            | 1632               | 36548 | 1024   |
|             |                            | 2048               | 4096  | 32     |
| DeepBench   | General                    | 1024               | 16    | 500000 |
|             | Workload                   | 35                 | 8457  | 2560   |
| Transformer | Language<br>Understanding  | 31999              | 1024  | 84     |
|             |                            | 84                 | 1024  | 4096   |
| NCF         | Collaborative<br>Filtering | 2048               | 1     | 128    |
|             |                            | 256                | 256   | 2048   |

GEMMs used for evaluation.



| Workload    | Application                | Example Dimensions |       |        |
|-------------|----------------------------|--------------------|-------|--------|
|             |                            | Μ                  | Ν     | K      |
| GNMT        | Machine<br>Translation     | 128                | 2048  | 4096   |
|             |                            | 320                | 3072  | 4096   |
|             |                            | 1632               | 36548 | 1024   |
|             |                            | 2048               | 4096  | 32     |
| DeepBench   | General                    | 1024               | 16    | 500000 |
|             | Workload                   | 35                 | 8457  | 2560   |
| Transformer | Language<br>Understanding  | 31999              | 1024  | 84     |
|             |                            | 84                 | 1024  | 4096   |
| NCF         | Collaborative<br>Filtering | 2048               | 1     | 128    |
|             |                            | 256                | 256   | 2048   |

GEMMs used for evaluation.

#### Let's map this GEMM!





















M = 256 TPU (Systolic Array)

The streaming elements get multiplied with the stationary elements. The partial sums



# Systolic Arrays are popular because they enable efficient data reuse and are very simple to implement.





| Workload    | Application                | Example Dimensions |       |        |
|-------------|----------------------------|--------------------|-------|--------|
|             |                            | Μ                  | Ν     | K      |
| GNMT        | Machine<br>Translation     | 128                | 2048  | 4096   |
|             |                            | 320                | 3072  | 4096   |
|             |                            | 1632               | 36548 | 1024   |
|             |                            | 2048               | 4096  | 32     |
| DeepBench   | General                    | 1024               | 16    | 500000 |
|             | Workload                   | 35                 | 8457  | 2560   |
| Transformer | Language<br>Understanding  | 31999              | 1024  | 84     |
|             |                            | 84                 | 1024  | 4096   |
| NCF         | Collaborative<br>Filtering | 2048               | 1     | 128    |
|             |                            | 256                | 256   | 2048   |

GEMMs used for evaluation.

#### Let's map another GEMM!
















| M = 2048 TPU (Systolic Array) | K = 32              | <b>75%</b> of the PEs are not utilized for |  |  |
|-------------------------------|---------------------|--------------------------------------------|--|--|
|                               | $ \longrightarrow $ | this GEMM                                  |  |  |

# The rigid structure of Systolic Arrays cause PE underutilization. How can we enable the remaining PEs?











duplicate streaming data









Observation 1: GEMMs are irregular and may not align to the aspect ratio of the systolic array.

Reduce partial sum down each column.

### Sparsity in DNN Models



### Sparsity in DNN Models





#### **Transformer Sparsity - Impact on BLEU**

(The State of Sparsity in Deep Neural Networks, Gale et al., arXiv)

### Sparsity in DNN Models



Weight sparsity ranges from **40%** to **90%**. Activation sparsity is approximately **30%** to **70%** from ReLU, dropout, etc.



| Workload                     | Application                | Example Dimensions |       |        |
|------------------------------|----------------------------|--------------------|-------|--------|
|                              |                            | Μ                  | Ν     | K      |
|                              |                            | 128                | 2048  | 4096   |
| GNMT                         | Machine<br>Translation     | 320                | 3072  | 4096   |
|                              |                            | 1632               | 36548 | 1024   |
|                              |                            | 2048               | 4096  | 32     |
| DeepBench                    | General                    | 1024               | 16    | 500000 |
|                              | Workload                   | 35                 | 8457  | 2560   |
| Transformer Langua<br>Unders | Language                   | 31999              | 1024  | 84     |
|                              | Understanding              | 84                 | 1024  | 4096   |
| NCF                          | Collaborative<br>Filtering | 2048               | 1     | 128    |
|                              |                            | 256                | 256   | 2048   |

GEMMs used for evaluation.

#### Usually these GEMMs are sparse!







Multiplication with an operand that is zero is considered underutilized.





| na dei di sette e contra se secondada de sette se se i se secondada |                     |                                                        |
|---------------------------------------------------------------------|---------------------|--------------------------------------------------------|
| iviuitipiication with an operana that is zero is consid             | erea unaerutilizea. |                                                        |
| M = 2048 TPU (Systolic Array)                                       | K = 32<br>◀━►       | <b>87.5%</b> of the PEs are not utilized for this GEMM |

Weight stationary systolic arrays are underutilized for sparse GEMMs because they have to map zeros. How can we map only nonzeros stationary?















Observation 1: GEMMs are irregular and may not align to the aspect ratio of the systolic array.

Observation 2: Sparse weights cause underutilization of the PEs and require variable sized dot product accumulation.

Reduce partial sum down each column.












N = 256

Observation 1: GEMMs are irregular and may not align to the aspect ratio of the systolic array.

Observation 2: Sparse weights cause underutilization of the PEs and require variable sized dot product accumulation.

Observation 3: Large systolic arrays have significant load and reduction latency.

Reduce partial sum down each column.

N = 256

Observation 1: GEMMs are irregular and may not align to the aspect ratio of the systolic array.

Observation 2: Sparse weights cause underutilization of the PEs and require variable sized dot product accumulation.

Observation 3: Large systolic arrays have significant load and reduction latency.

Takeaway: Systolic Arrays are underutilized on emerging GEMM workloads that are both sparse and irregular.

Reduce partial sum down each column.

### Outline

- Motivation
  - GEMMs in Deep Learning
  - Utilization on TPU (Systolic Array)
- <u>Accelerator Requirements</u>
- SIGMA
  - Interconnects Implementations
  - Full System Design
- Evaluation
- Conclusion

| Requirement | Systolic Array Limitation              | SIGMA Desired Traits                                     |
|-------------|----------------------------------------|----------------------------------------------------------|
| Flexibility | <ul> <li>rigid aspect ratio</li> </ul> | <ul> <li>ability to mimic any 2D aspect ratio</li> </ul> |
|             |                                        |                                                          |
|             |                                        |                                                          |

| Requirement      | Systolic Array Limitation                                                                    | SIGMA Desired Traits                                                                                                                                |
|------------------|----------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------|
| Flexibility      | <ul> <li>rigid aspect ratio</li> </ul>                                                       | <ul> <li>ability to mimic any 2D aspect ratio</li> </ul>                                                                                            |
| Sparsity Support | <ul> <li>data forwarding every cycle<br/>requires systolic array to map<br/>zeros</li> </ul> | <ul> <li>sparsity support by mapping only<br/>nonzeros stationary</li> <li>ability to create simultaneous variable<br/>sized dot product</li> </ul> |
|                  |                                                                                              |                                                                                                                                                     |

| Requirement      | Systolic Array Limitation                                                                    | SIGMA Desired Traits                                                                                                                                |
|------------------|----------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------|
| Flexibility      | <ul> <li>rigid aspect ratio</li> </ul>                                                       | <ul> <li>ability to mimic any 2D aspect ratio</li> </ul>                                                                                            |
| Sparsity Support | <ul> <li>data forwarding every cycle<br/>requires systolic array to map<br/>zeros</li> </ul> | <ul> <li>sparsity support by mapping only<br/>nonzeros stationary</li> <li>ability to create simultaneous variable<br/>sized dot product</li> </ul> |
| Scalability      | <ul> <li>O(sqrtN) distribution</li> <li>O(sqrtN) reduction</li> </ul>                        | <ul> <li>O(1) distribution</li> <li>O(logN) reduction</li> </ul>                                                                                    |

Requirement Systolic Array Limitation SIGMA Desired Traits

# With flexible and scalable interconnects between all PEs, SIGMA can solve the three requirements.

| Scalability | <ul> <li>O(sqrtN) distribution</li> <li>O(sqrtN) reduction</li> </ul> | <ul> <li>O(1) distribution</li> <li>O(logN) reduction</li> </ul> |
|-------------|-----------------------------------------------------------------------|------------------------------------------------------------------|



#### 4 x 4 Systolic Array

- rigid aspect ratio
- fixed size dot product
- O(sqrtN) distribution and reduction



**16 PE SIGMA** 



#### 4 x 4 Systolic Array

- rigid aspect ratio
- fixed size dot product
- O(sqrtN) distribution and reduction



#### **16 PE SIGMA**

\*\* Microarchitecture details on the networks will be discussed later



#### 4 x 4 Systolic Array

- rigid aspect ratio
- fixed size dot product
- O(sqrtN) distribution and reduction

- Distribution
   network allows
   SIGMA to mimic
   any aspect ratio to
   address irregular
   GEMMs
- Ability to send any streaming element to any PE
- O(1) distribution



#### **16 PE SIGMA**

\*\* Microarchitecture details on the networks will be discussed later



#### 4 x 4 Systolic Array

- rigid aspect ratio
- fixed size dot product
- O(sqrtN) distribution and reduction

- Distribution network allows
   SIGMA to mimic any aspect ratio to address irregular
   GEMMs
- Ability to send any streaming element to any PE
- O(1) distribution



**16 PE SIGMA** 

\*\* Microarchitecture details on the networks will be discussed later



#### 4 x 4 Systolic Array

- rigid aspect ratio
- fixed size dot product
- O(sqrtN) distribution and reduction

- Distribution
   network allows
   SIGMA to mimic
   any aspect ratio to
   address irregular
   GEMMs
- Ability to send any streaming element to any PE
- O(1) distribution



\*\* Microarchitecture

details on the networks

will be discussed later

- Reduction network allows SIGMA to reduce variable sized dot products
- Addresses sparsity and irregularity
- O(logN) reduction



Set up the stationary values.

\*\* Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)





Set up the stationary values.

\*\* Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)







*Next cycle: Multicast first row of MK to the corresponding stationary elements.* 

\*\* Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)







Next cycle: Multicast second row of MK to the corresponding stationary elements.

\*\* Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)







Next cycle: Multicast third row of MK to the corresponding stationary elements.

\*\* Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)



4 x 4 Systolic Array



**16 PE SIGMA** 



Next cycle: Multicast fourth row of MK to the corresponding stationary elements.

\*\* Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)



4 x 4 Systolic Array



**16 PE SIGMA** 



After accumulation, SIGMA is done. However, the systolic array has to map the other side of the stationary matrix and stream in the MK matrix again (referred to as folding). \*\* Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)







SIGMA reduces the number of folds, which then reduces the number of memory references on the streaming matrix.

stationary)

\*\* Assuming MK







Final cycle count.



4 x 4 Systolic Array





SIGMA total runtime: **13** cycles





Final cycle count.

\*\* Assuming MK matrix is streaming and KN matrix is stationarv. (aka

# SIGMA maximizes PE utilization with its flexible interconnects for irregular GEMMs.





Set up the stationary values.

\*\* Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)







*Next cycle: Multicast first row of MK to the corresponding stationary elements.* 

\*\* Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)







Next cycle: Multicast second row of MK to the corresponding stationary elements.

\*\* Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)









Next cycle: Multicast third row of MK to the corresponding stationary elements.

\*\* Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)



4 x 4 Systolic Array



**16 PE SIGMA** 



Next cycle: Multicast fourth row of MK to the corresponding stationary elements.

\*\* Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)



4 x 4 Systolic Array



**16 PE SIGMA** 





After accumulation, SIGMA is done. However, the systolic array has to map the other part of the stationary matrix and stream in the MK matrix again (referred to as folding).

\*\* Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)



4 x 4 Systolic Array



103



Again, the systolic array has to map another part of the stationary matrix and stream MK again. \*\* Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)



4 x 4 Systolic Array



104



Final cycle count.



\*\* Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)



SIGMA total runtime: **13** cycles



105



\*\* Assuming MK matrix is streaming and KN matrix is stationarv. (aka

# SIGMA maps only nonzeros stationary; therefore, reduces the number of folds needed.





### Outline

- Motivation
  - GEMMs in Deep Learning
  - Utilization on TPU (Systolic Array)
- Accelerator Requirements
- <u>SIGMA</u>
  - Interconnects Implementations
  - Full System Design
- Evaluation
- Conclusion

### O(1) Distribution Topology

#### Unicast

(Loading Stationary Matrix)



#### **Multicast**

(Sending Streaming Matrix)


# O(1) Distribution Topology

### Unicast

(Loading Stationary Matrix)



### **Multicast**

(Sending Streaming Matrix)



# O(1) Distribution Topology

### Unicast

(Loading Stationary Matrix)



### **Multicast**

(Sending Streaming Matrix)



### O(1) Distribution Topology

Unicast

**Multicast** 

SIGMA's distribution can be either a Crossbar or Benes network. We chose Benes because the number of switches scale by O(N logN).



### O(logN) Reduction (Limitation of Adder Tree)



### O(logN) Reduction (Limitation of Adder Tree)





















### Our novel FAN topology is both lightweight and flexible.

Our novel FAN topology is both lightweight and flexible.

It can replace regular adder trees in other hardware accelerators.

Our novel FAN topology is both lightweight and flexible.

It can replace regular adder trees in other hardware accelerators.

More details such as the routing algorithm and overhead analysis can be found in the paper.

### Outline

- Motivation
  - GEMMs in Deep Learning
  - Utilization on TPU
- Accelerator Requirements
- <u>SIGMA</u>
  - Interconnects Implementations
  - Full System Design
- Evaluation
- Conclusion



Note: SIGMA Engine contains multiple SIGMA units called Flex-DPEs.

Data and Bitmap SRAM Banks

• Contains bitmap compression format of GEMM matrices.



Note: SIGMA Engine contains multiple SIGMA units called Flex-DPEs.



Data and Bitmap SRAM Banks

• Contains bitmap compression format of GEMM matrices.

#### **Global Controller**

 Logic comparisons on bitmaps to determine what nonzero stationary elements are required

Note: SIGMA Engine contains multiple SIGMA units called Flex-DPEs.



Note: SIGMA Engine contains multiple SIGMA units called Flex-DPEs.

#### Data and Bitmap SRAM Banks

• Contains bitmap compression format of GEMM matrices.

#### **Global Controller**

 Logic comparisons on bitmaps to determine what nonzero stationary elements are required

#### Sparsity Filter && Input Data Arbiter

• Reorganizes data for loading stationary elements and sending streaming elements



Note: SIGMA Engine contains multiple SIGMA units called Flex-DPEs.

#### Data and Bitmap SRAM Banks

• Contains bitmap compression format of GEMM matrices.

#### **Global Controller**

 Logic comparisons on bitmaps to determine what nonzero stationary elements are required

#### Sparsity Filter && Input Data Arbiter

• Reorganizes data for loading stationary elements and sending streaming elements

#### **Accumulation SRAM**

• Buffer for partial sum accumulations



Note: SIGMA Engine contains multiple SIGMA units called Flex-DPEs.

#### Data and Bitmap SRAM Banks

• Contains bitmap compression format of GEMM matrices.

#### **Global Controller**

• Logic comparisons on bitmaps to determine what nonzero stationary elements are required

#### Sparsity Filter && Input Data Arbiter

• Reorganizes data for loading stationary elements and sending streaming elements

#### **Accumulation SRAM**

• Buffer for partial sum accumulations

#### <u>SIGMA Engine</u>

• Compute engine



### Outline

- Motivation
  - GEMMs in Deep Learning
  - Utilization on TPU
- Accelerator Requirements
- SIGMA
  - Interconnects Implementations
  - Full System Design
- **Evaluation**
- Conclusion

### Methodology

- Hardware components are written in Verilog
- Post layout area and power numbers are on a 28nm process
- Analytical model for cycle counts assumes uniform random sparsity

| Workload    | Application                | Example Dimensions |       |        |
|-------------|----------------------------|--------------------|-------|--------|
|             |                            | Μ                  | Ν     | K      |
| GNMT        | Machine<br>Translation     | 128                | 2048  | 4096   |
|             |                            | 320                | 3072  | 4096   |
|             |                            | 1632               | 36548 | 1024   |
|             |                            | 2048               | 4096  | 32     |
| DeepBench   | General                    | 1024               | 16    | 500000 |
|             | Workload                   | 35                 | 8457  | 2560   |
| Transformer | Language<br>Understanding  | 31999              | 1024  | 84     |
|             |                            | 84                 | 1024  | 4096   |
| NCF         | Collaborative<br>Filtering | 2048               | 1     | 128    |
|             |                            | 256                | 256   | 2048   |

**GEMMs used for evaluation.** 









# SIGMA performs on average **1.8x** better than systolic array architectures for irregular GEMMs.



### SIGMA vs TPU - Sparse GEMMs



### SIGMA vs TPU - Sparse GEMMs



### SIGMA vs TPU - Sparse GEMMs





# SIGMA performs on average **5.7x** better than systolic array architectures for sparse and irregular GEMMs.



### **SIGMA vs Sparse Accelerators**



### **SIGMA Qualitative Analysis**

| Accelerator          | Limitation                                                                                                         | SIGMA Solution                                                                                                |
|----------------------|--------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------|
| <b>TPU</b> [23]      | Low utilization from no sparsity support and rigid structure                                                       | Flexible interconnects to map non-zero data and irregular GEMMs.                                              |
| <b>EIE</b> [19]      | Not scalable due to all-to-all PE broadcasts and<br>a BW link of one element per cycle                             | Partition compute to Flex-DPEs (small all-to -all networks)<br>connected by a high BW bus                     |
| SCNN [33]            | Requires partitioning to use Cartesian product on GEMMs.<br>High inter-PE communications for accumulating outputs. | Multicast GEMM partial sums close to each other<br>so they can be reduced spatially                           |
| OuterSPACE [32]      | Partial sum accum. within linked list<br>has at best O( <i>NlogN</i> ) complexity                                  | Spatial accum. with our reduction network<br>has $O(log_2N)$ complexity                                       |
| Eyeriss v2 [11]      | Limited weight dist. flexibility and linear reduction                                                              | More flexibility with shared all-to-all network and<br>spatial accumulation with novel reduction network FAN. |
| Packed Systolic [26] | Need algorithmic adjustments and still<br>contains stationary zeros                                                | Bitmap to ensure no zero-valued elements are stationary<br>and no algorithmic changes required.               |
| Cambricon-X [47]     | Basic adder tree limits multiplier utilization,<br>allows one common partial sum at a time                         | FAN enables full multiplier utilization by allowing different partial sums to be accumulated separately.      |

Table III: Qualitative Comparision of SIGMA against state-of-the-art accelerators.
# **SIGMA Qualitative Analysis**

| Accelerator     | Limitation                                                   | SIGMA Solution                                                   |
|-----------------|--------------------------------------------------------------|------------------------------------------------------------------|
| <b>TPU</b> [23] | Low utilization from no sparsity support and rigid structure | Flexible interconnects to map non-zero data and irregular GEMMs. |

# SIGMA performs on average **3x** better than state-of-the-art sparse accelerators. In depth analysis can be found in the paper.

| Cambricon-X [47] | Basic adder tree limits multiplier utilization, | FAN enables full multiplier utilization by allowing different |
|------------------|-------------------------------------------------|---------------------------------------------------------------|
|                  | allows one common partial sum at a time         | partial sums to be accumulated separately.                    |

Table III: Qualitative Comparision of SIGMA against state-of-the-art accelerators.

|                   | TPU-like Syste     | olic Engine | SIGMA                         | Engine |
|-------------------|--------------------|-------------|-------------------------------|--------|
| Technology        | Commercial 28nr    | n           | Commercial 28nm               |        |
| Clock freq.       | 500 MHz            |             | 500 MHz                       |        |
| MACs              | 16384 (128 x 128   | PEs)        | 16384 (128, 128PEs Flex-DPEs) |        |
| Data Type         | BFP16 Mult, FP32   | 2 Add       | BFP16 Mult, FP32 Add          |        |
| Peak TFLOPS       | 16                 |             | 16                            |        |
| Sparsity Support? | No                 |             | Yes                           |        |
| *Effective TFLOPS | 1.88               |             | 10.8                          |        |
| Power (W)         | 12.25 W            |             | 22.33 W                       |        |
| Eff. TFLOPS/ W    | 0.15 Eff. TFLOPS/W |             | 0.48 Eff. TFLOPS/V            | V      |
| Area (mm2)        | Total: 47.27 mm2   |             | Total: 65.10 mm2              |        |
|                   | Adder:             | 14.5 %      | Adder:                        | 10.5 % |
|                   | Multipliers:       | 59.0 %      | Multipliers:                  | 42.5 % |
|                   | Local Memory:      | 1.5 %       | Local Memory:                 | 1.0 %  |
|                   | Layout Overhead:   | 25.0 %      | Dist. NoC Overhead:           | 14.5 % |
|                   |                    |             | Red. NoC Overhead:            | 3.0 %  |
|                   |                    |             | FAN Controller:               | 1.5 %  |
|                   |                    |             | Layout Overhead:              | 27.0 % |



\*\* Effective TFLOPs is calculated by multiplying the base dense TFLOPs with the average efficiency computed across GEMMs.

|                   | TPU-like Syste     | olic Engine | SIGMA                         | Engine |
|-------------------|--------------------|-------------|-------------------------------|--------|
| Technology        | Commercial 28nr    | n           | Commercial 28nm               |        |
| Clock freq.       | 500 MHz            |             | 500 MHz                       |        |
| MACs              | 16384 (128 x 128   | PEs)        | 16384 (128, 128PEs Flex-DPEs) |        |
| Data Type         | BFP16 Mult, FP32   | 2 Add       | BFP16 Mult, FP32 Add          |        |
| Peak TFLOPS       | 16                 |             | 16                            |        |
| Sparsity Support? | No                 |             | Yes                           |        |
| *Effective TFLOPS | 1.88               |             | 10.8                          |        |
| Power (W)         | 12.25 W            |             | 22.33 W                       |        |
| Eff. TFLOPS/ W    | 0.15 Eff. TFLOPS/W |             | 0.48 Eff. TFLOPS/V            | V      |
| Area (mm2)        | Total: 47.27 mm2   |             | Total: 65.10 mm2              |        |
|                   | Adder:             | 14.5 %      | Adder:                        | 10.5 % |
|                   | Multipliers:       | 59.0 %      | Multipliers:                  | 42.5 % |
|                   | Local Memory:      | 1.5 %       | Local Memory:                 | 1.0 %  |
|                   | Layout Overhead:   | 25.0 %      | Dist. NoC Overhead:           | 14.5 % |
|                   |                    |             | Red. NoC Overhead:            | 3.0 %  |
|                   |                    |             | FAN Controller:               | 1.5 %  |
|                   |                    |             | Layout Overhead:              | 27.0 % |



\*\* Effective TFLOPs is calculated by multiplying the base dense TFLOPs with the average efficiency computed across GEMMs.

SIGMA consumes 38% more area and 82% more power than Systolic Array.

|                   | TPU-like Syste     | olic Engine | SIGMA                         | Engine |
|-------------------|--------------------|-------------|-------------------------------|--------|
| Technology        | Commercial 28nr    | n           | Commercial 28nm               |        |
| Clock freq.       | 500 MHz            |             | 500 MHz                       |        |
| MACs              | 16384 (128 x 128   | PEs)        | 16384 (128, 128PEs Flex-DPEs) |        |
| Data Type         | BFP16 Mult, FP32   | 2 Add       | BFP16 Mult, FP32              | Add    |
| Peak TFLOPS       | 16                 |             | 16                            |        |
| Sparsity Support? | No                 |             | Yes                           |        |
| *Effective TFLOPS | 1.88               |             | 10.8                          |        |
| Power (W)         | 12.25 W            |             | 22.33 W                       |        |
| Eff. TFLOPS/ W    | 0.15 Eff. TFLOPS/W |             | 0.48 Eff. TFLOPS/V            | V      |
| Area (mm2)        | Total: 47.27 mm2   |             | Total: 65.10 mm2              |        |
|                   | Adder:             | 14.5 %      | Adder:                        | 10.5 % |
|                   | Multipliers:       | 59.0 %      | Multipliers:                  | 42.5 % |
|                   | Local Memory:      | 1.5 %       | Local Memory:                 | 1.0 %  |
|                   | Layout Overhead:   | 25.0 %      | Dist. NoC Overhead:           | 14.5 % |
|                   |                    |             | Red. NoC Overhead:            | 3.0 %  |
|                   |                    |             | FAN Controller:               | 1.5 %  |
|                   |                    |             | Layout Overhead:              | 27.0 % |



\*\* Effective TFLOPs is calculated by multiplying the base dense TFLOPs with the average efficiency computed across GEMMs.

SIGMA achieves 5.7x higher effective TFLOPS for a 3.2x higher effective TFLOPS/W.

|            | TPU-like Systolic Engine | SIGMA Engine    |
|------------|--------------------------|-----------------|
| Technology | Commercial 28nm          | Commercial 28nm |
|            | FAALAN                   | FARTHE          |

# SIGMA consumes more resources but achieves higher effective TFLOPS/W.

|  | Local Memory:<br>Layout Overhead: | 1.5 %<br>25.0 % | Local Memory:<br>Dist. NoC Overhead:<br>Red. NoC Overhead:<br>FAN Controller:<br>Layout Overhead: | 42.5 %<br>1.0 %<br>14.5 %<br>3.0 %<br>1.5 %<br>27.0 % | multiplying the base dense TFLOPs with the average efficiency computed across GEMMs. |
|--|-----------------------------------|-----------------|---------------------------------------------------------------------------------------------------|-------------------------------------------------------|--------------------------------------------------------------------------------------|
|--|-----------------------------------|-----------------|---------------------------------------------------------------------------------------------------|-------------------------------------------------------|--------------------------------------------------------------------------------------|

SIGMA achieves 5.7x higher effective TFLOPS for a 3.2x higher effective TFLOPS/W.

# Outline

- Motivation
  - GEMMs in Deep Learning
  - Utilization on TPU
- Accelerator Requirements
- SIGMA
  - Interconnects Implementations
  - Full System Design
- Evaluation
- <u>Conclusion</u>

• GEMM is a key component of Deep Learning workloads, but they are often irregular and sparse.

- GEMM is a key component of Deep Learning workloads, but they are often irregular and sparse.
- High utilization from systolic arrays is challenging because of their rigid structure.

- GEMM is a key component of Deep Learning workloads, but they are often irregular and sparse.
- High utilization from systolic arrays is challenging because of their rigid structure.
- SIGMA enables high compute utilization on sparse irregular GEMMs.

- GEMM is a key component of Deep Learning workloads, but they are often irregular and sparse.
- High utilization from systolic arrays is challenging because of their rigid structure.
- SIGMA enables high compute utilization on sparse irregular GEMMs.
- SIGMA performs 5.7x better than systolic arrays and 3x better than other state-of-the-art sparse accelerators at the cost of extra hardware, specifically for the O(1) distribution and the novel FAN reduction interconnects.

- GEMM is a key component of Deep Learning workloads, but they are often irregular and sparse.
- High utilization from systolic arrays is challenging because of their rigid structure.
- SIGMA enables high compute utilization on sparse irregular GEMMs.
- SIGMA performs 5.7x better than systolic arrays and 3x better than other state-of-the-art sparse accelerators at the cost of extra hardware, specifically for the O(1) distribution and the novel FAN reduction interconnects.
- SIGMA achieves 3.2x higher Effective TFLOPS/W than Systolic Arrays.

- GEMM is a key component of Deep Learning workloads, but they are often irregular and sparse.
- High utilization from systolic arrays is challenging because of their rigid structure.
- SIGMA enables high compute utilization on sparse irregular GEMMs.
- SIGMA performs 5.7x better than systolic arrays and 3x better than other state-of-the-art sparse accelerators at the cost of extra hardware, specifically for the O(1) distribution and the novel FAN reduction interconnects.
- SIGMA achieves 3.2x higher Effective TFLOPS/W than Systolic Arrays.
- Future work: Optimizations such as power gating and software stack design.



# Material Backup

#### **Compressed bitmap equivalent** Н B Ε E G С D 0 Α **Stationary Nonzero data** 1 1 0 0 b d f а С g e **Stationary Bitmap Streaming Nonzero data** 0 **Dim. Registers** 0 0 **M-dim** 3 0 0 N-dim 4 0 0 0 0 K-dim 4 **Streaming Bitmap**

### **Example GEMM**

Μ

i)



### **Walkthrough Section**

### Get bitmap from memory

- ii) Row wise OR on streaming matrix
- iii) Element wise AND on stationary matrix
- iv) Generate stationary' bitmap
- vi) Unicast stationary values
- vii) Get source destination pairs
- viii) Multicast streaming values and reduce



### **Example GEMM**





| i) | Row wise OR on streaming matr | ix |
|----|-------------------------------|----|
| )  | Get bitmap from memory        |    |

- iii) Element wise AND on stationary matrix
- iv) Generate stationary' bitmap
- vi) Unicast stationary values
- vii) Get source destination pairs
- viii) Multicast streaming values and reduce



### **Example GEMM**





- iii) Element wise AND on stationary matrix
- iv) Generate stationary' bitmap
- vi) Unicast stationary values
- vii) Get source destination pairs
- viii) Multicast streaming values and reduce







### **Example GEMM**





- i) Get bitmap from memory
- ii) Row wise OR on streaming matrix
- iii) Element wise AND on stationary matrix
- iv) Generate stationary' bitmap
- vi) Unicast stationary values
- vii) Get source destination pairs
- viii) Multicast streaming values and reduce





### **Example GEMM**



i)



- Get bitmap from memory
- ii) Row wise OR on streaming matrix
- iii) Element wise AND on stationary matrix
- iv) Generate stationary' bitmap
- vi) Unicast stationary values
- vii) Get source destination pairs
- viii) Multicast streaming values and reduce







1

Ω

0

i)

### **Example GEMM**





# **Walkthrough Section**

- Get bitmap from memory
- ii) **Row wise OR on streaming matrix**
- iii) **Element wise AND on stationary matrix**

#### Generate stationary' bitmap iv)

- **Unicast stationary values** vi)
- vii) Get source destination pairs
- viii) Multicast streaming values and reduce



### **Example GEMM**



i)



# Walkthrough Section

- Get bitmap from memory
- ii) Row wise OR on streaming matrix
- iii) Element wise AND on stationary matrix

### iv) Generate stationary' bitmap

- vi) Unicast stationary values
- vii) Get source destination pairs
- viii) Multicast streaming values and reduce



### **Example GEMM**



i)



# Walkthrough Section

- Get bitmap from memory
- ii) Row wise OR on streaming matrix
- iii) Element wise AND on stationary matrix

### iv) Generate stationary' bitmap

- vi) Unicast stationary values
- vii) Get source destination pairs
- viii) Multicast streaming values and reduce



### **Example GEMM**

Μ







### **Example GEMM**

Μ

i)

ii)

iii)



- **Generate stationary' bitmap** iv)
- **Unicast stationary values** vi)
- **Get source destination pairs** vii)
- viii) Multicast streaming values and reduce





**Output Bitmap** 

### **Example GEMM**



### **Walkthrough Section**

- i) Get bitmap from memory
- ii) Row wise OR on streaming matrix
- iii) Element wise AND on stationary matrix
- iv) Generate stationary' bitmap
- vi) Unicast stationary values
- vii) Get source destination pairs

viii) Multicast streaming values and reduce





**Output Bitmap** 



### **Example GEMM**



# **Walkthrough Section**

- i) Get bitmap from memory
- ii) Row wise OR on streaming matrix
- iii) Element wise AND on stationary matrix
- iv) Generate stationary' bitmap
- vi) Unicast stationary values
- vii) Get source destination pairs

viii) Multicast streaming values and reduce

169



### **Example GEMM**



# **Walkthrough Section**

- i) Get bitmap from memory
- ii) Row wise OR on streaming matrix
- iii) Element wise AND on stationary matrix
- iv) Generate stationary' bitmap
- vi) Unicast stationary values
- vii) Get source destination pairs

viii) Multicast streaming values and reduce



Cycle 3: multicast 1st column of streaming matrix and reduce

### **Example GEMM**

Μ

ii)



- iii) **Element wise AND on stationary matrix**
- iv) **Generate stationary' bitmap**
- **Unicast stationary values** vi)
- vii) Get source destination pairs
- viii) Multicast streaming values and reduce



Cycle 3: multicast 1st column of streaming matrix and reduce

### **Example GEMM**

Μ

ii)



- iii) **Element wise AND on stationary matrix**
- **Generate stationary' bitmap** iv)
- **Unicast stationary values** vi)
- vii) Get source destination pairs
- viii) Multicast streaming values and reduce



*Cycle 4: multicast 2nd column of streaming matrix and reduce* 

### **Example GEMM**



viii) Multicast streaming values and reduce



Cycle 5: multicast 3rd column of streaming matrix and reduce

### **Example GEMM**

Μ

ii)

iv)

vi)



**Generate stationary' bitmap** 

viii) Multicast streaming values and reduce

**Unicast stationary values** 

vii) Get source - destination pairs

### 174



Cycle 6: multicast last column of streaming matrix and reduce

### **Example GEMM**

Μ



- ii) iii) **Element wise AND on stationary matrix**
- **Generate stationary' bitmap** iv)
- **Unicast stationary values** vi)
- vii) Get source destination pairs
- viii) Multicast streaming values and reduce



### **Example GEMM**

Μ



### **Walkthrough Section**

i) Get bitmap from memory
ii) Row wise OR on streaming matrix
iii) Element wise AND on stationary matrix
iv) Generate stationary' bitmap
vi) Unicast stationary values
vii) Get source - destination pairs
viii) Multicast streaming values and reduce





Μ







### **Example GEMM**

Μ



### **Walkthrough Section**

i) Get bitmap from memory
ii) Row wise OR on streaming matrix
iii) Element wise AND on stationary matrix
iv) Generate stationary' bitmap
vi) Unicast stationary values
vii) Get source - destination pairs
viii) Multicast streaming values and reduce



### **Example GEMM**

Μ



### **Walkthrough Section**

i) Get bitmap from memory
ii) Row wise OR on streaming matrix
iii) Element wise AND on stationary matrix
iv) Generate stationary' bitmap
vi) Unicast stationary values
vii) Get source - destination pairs
viii) Multicast streaming values and reduce

