



### ECE 6115 / CS 8803 - ICN Interconnection Networks for High Performance Systems Spring 2020 SMART NETWORK-ON-CHIP

#### Tushar Krishna

Assistant Professor School of Electrical and Computer Engineering Georgia Institute of Technology

tushar@ece.gatech.edu



# BASELINE ROUTER PIPELINE



#### Per Packet

- RC, VA  $\rightarrow$  done by Head flit
- Per Flit
  - BW, SA, BR, ST, LT



### WHY DOES ROUTER DELAY MATTER?

$$T_N = (t_r + t_w) \times H + T_c + T_s$$

- $T_N$ : Network delay
- t<sub>r</sub>: router pipeline delay
- t<sub>w</sub>: wire delay per hop
- H: number of hops
- T<sub>c</sub>: contention delay
- $T_s$ : serialization delay (for multi-flit packets) H  $T_c$

#### Which of these is static?

 $t_r \quad t_w \quad T_s$ 

# Which of these is dynamic (traffic-dependent)?

### CASE STUDY: INTEL SCC (ISSCC 2010)





# CASE STUDY: INTEL SCC (ISSCC 2010)

The 5-port virtual cut-through router (Fig. 5.7.3) used to create the 2D-mesh network employs a credit-based flow-control protocol. Router ports are packetswitched, have 16-byte data links, and can operate at 2GHz at 1.1V. Each input port has five 24-entry queues, a route pre-computation unit, and a virtual-channel (VC) allocator. Route pre-computation for the outport of the next router is done on queued packets. An XY dimension ordered routing algorithm is strictly followed. Deadlock free routing is maintained by allocating 8 virtual channels (VCs) between 2 message classes on all outgoing packets. VC0 through VC5 are kept in a free pool, while VC6 and VC7 are reserved for request classes and response classes, respectively. Input port and output port arbitrations are done concurrently using a wrapped wave front arbiter. Crossbar switch allocation is done in a single clock cycle on a packet granularity. No-load router latency is 4 clock cycles, including link traversal. Individual routers offer 64GB/s interconnect bandwidth, enabling the total network to support 256GB/s of bisection bandwidth.



# **COMMON PIPELINE OPTIMIZATIONS**

- BW + RC in parallel
  - Lookahead Routing

|  | BW | RC | VA | SA | BR | ST |  | LT |  |
|--|----|----|----|----|----|----|--|----|--|
|--|----|----|----|----|----|----|--|----|--|

- SA + VA in parallel
  - VC Select (switch output port winner selects VC from pool of free VCs)
  - Speculative VA (if VA takes long, speculatively allocate a VC while flit performs SA) (Peh and Dally, HPCA 2001)
    - If SA and VA both successful, go for ST
    - If SA or VA fails, retry next cycle
- BR + SA in parallel
  - The winner of Input Arbitration is read out and sent to the input of the crossbar speculatively
- Low-load Bypassing
  - When no flits in input buffer
    - Speculatively enter ST
    - On port conflict, speculation aborted

# EXPRESS VIRTUAL CHANNELS (ISCA 2007)

- Analogy Express Trains and Local Trains
- Flits on Express VCs do not get buffered at intermediate routers
  - Send a "lookahead" to ask local flits to wait (i.e., kill switch allocation)



### **MODERN PIPELINES**



1-cycle for arbitration (tr), 1-cycle for traversal (tw)

Used by Tilera's iMesh, Intel's Ring, NoC prototypes (Park et al., DAC 2012)



### ALTERNATE ROUTER MICROARCHITECTURES

- Output Queues
  - "Virtual" Output Queues (== Virtual Channels)
- Centralized Buffers
- Rotary Router (in paper presentations)



# SMART NOCS



### **MODERN PIPELINES**



1-cycle for arbitration (t<sub>r</sub>), 1-cycle for traversal (t<sub>w</sub>)

Used by Tilera's iMesh, Intel's Ring, NoC prototypes (Park et al., DAC 2012)



### IS THAT THE BEST WE CAN DO?



Can we remove the dependence of latency on hops?

#### Stay tuned!



#### What limits us from designing a 1-cycle network?

#### Is it the wire delay?





Wire Length =  $N \times l$ 

repeater spacing (l)

4

clock period (p)

# **DESIGNING A 1-CYCLE NETWORK**

What limits us from designing a 1-cycle network?

#### Is it the wire delay?



Glol On-chip wires fast enough to transmit across the chip
Cr within 1-2 cycles at 1GHz even as technology scales

Clock frequency expected to remain similar (power wall)





What limits us from designing a 1-cycle network?

Is it the wire delay? No!

#### Classic scaling challenge with wires

Wire-delay increases relative to logic delay

But ...

Wire-delay in cycles expected to remain constant.

Wires fast enough to transmit across chip in 1-2 cycles today and in future.



What limits us from designing a 1-cycle network?

#### Is it the wire delay? No!





#### What limits us from designing a 1-cycle network?











# A "SMART" NETWORK-ON-CHIP

#### Microarchitecture

Bypass path with clockless repeater at each router

#### Flow Control

Compete for and reserve a sequence of shared links cycle-by-cycle

Dynamically create repeated links ("SMART paths") between any two routers





# A "SMART" NETWORK-ON-CHIP

#### Microarchitecture

Bypass path with clockless repeater at each router

#### Flow Control

Compete for and reserve a sequence of shared links cycle-by-cycle

Dynamically create repeated links ("SMART paths") between any two routers

- How well does SMART perform?
  - 88-90% of the performance of an O(n<sup>2</sup>) wire fully-connected (dream) topology with an O(n) wire SMART NoC
  - Baseline Mesh needs to be clocked 5.4 times faster to match SMART

(64-core full-system simulation with real applications)

Microarchitecture and Flow Control details next!

T. Krishna et al. HPCA 2013 IEEE Computer 2013 IEEE Micro Top Picks 2014 NOCS 2014 (Best Paper Award) H Kwon et al., ISPASS 2017



### MICROARCHITECTURE: DATA PATH





# MICROARCHITECTURE: CONTROL PATH

Dedicated repeated links from every router to help setup a SMART path

**Length** =  $HPC_{max}$  hops **Width** =  $log_2(1+HPC_{max})$  bits

e.g.,  $HPC_{max} = 10-16$ 

@45nm, 1GHz, 1mm hop

**HPC**<sub>max</sub> (max Hops Per Cycle): maximum number of "hops" that the underlying wire allows the flit to traverse within a clock cycle







#### 128-bit data path



### SMART FLOW CONTROL

Assume HPC<sub>max</sub> = 3 (max Hops Per Cycle)

#### Request a path of desired length (in hops) over the SSR wires

- Intermediate routers arbitrate between control requests from various routers and setup *buffer*, *mux* and *xbar* for the data path
- No ACK has to be sent back!

#### Send the flit on the data wires

May get partial or full SMART path based on contention that cycle





### TRAVERSAL EXAMPLE 1: $R0 \rightarrow R3$

#### **Cycle 1 (Ctrl): R0 sends SSR = 3** (3-hop path request)

Assume HPC<sub>max</sub> = 3 (max Hops Per Cycle)





### TRAVERSAL EXAMPLE 1: $R0 \rightarrow R3$

**Cycle 1 (Ctrl): R0 sends SSR = 3** (3-hop path request)

Assume HPC<sub>max</sub> = 3 (max Hops Per Cycle)

All routers set *buffer*, *mux* and *xbar* for this request.





### TRAVERSAL EXAMPLE 1: $R0 \rightarrow R3$

**Cycle 1 (Ctrl): R0 sends SSR = 3** (3-hop path request)

All routers set *buffer*, *mux* and *xbar* for this request.

Assume HPC<sub>max</sub> = 3 (max Hops Per Cycle)

Cycle 2 (Data): R0 sends flit to R3



A SMART path is simply a combination of **buffer**, **mux** and **xbar** at all the intermediate routers.



**Solution:** Routers prioritize between path requests based on distance

two alternate schemes: **Prio=Local** and **Prio=Bypass** 



# EXAMPLE 2: $R0 \rightarrow R3$ *AND* $R2 \rightarrow R4$

**Prio = Local**  $\rightarrow 0$  hop > 1 hop > 2 hop ... > *HPC<sub>max</sub>* hop

#### Cycle 2 (Data): R2 sends flit to R4. R0's flit blocked at R2.



#### Cycle 4+ (Data): R2 sends blocked flit to R3.(after local arbitration + sending ctrl)

|        | R0 |        |   |        |       |    |       | R3 |        | T R4 |
|--------|----|--------|---|--------|-------|----|-------|----|--------|------|
| buffer | 0  | buffer | 0 | buffer | 0     | bı | uffer | 1  | buffer | 0    |
| mux    | Х  | mux    | Х | mux    | local | m  | ux    | Х  | mux    | Х    |
| xbar   | X  | xbar   | X | xbar   | W->E  | xŁ | par   | X  | xbar   | Х    |

ICN | Spring 2020 | M07: SMART NoC

 $\ensuremath{\textcircled{\text{C}}}$  Tushar Krishna, School of ECE, Georgia Tech

Feb 17, 2020



# EXAMPLE 2: $R0 \rightarrow R3$ AND $R2 \rightarrow R4$

**Prio = Bypass**  $\rightarrow$  *HPC<sub>max</sub>* hop > (*HPC<sub>max</sub>* - 1) hop > ... > 1 hop > 0 hop

#### Cycle 2 (Data): R0 sends flit to R3. R2's flit waits.



#### Cycle 3 (Data): R2 sends flit to R4.

|        |   |        |   |        |       |   |        | R3     |   |        |   |
|--------|---|--------|---|--------|-------|---|--------|--------|---|--------|---|
| buffer | 0 | buffer | 0 | buffer | 0     | ľ | buffer | 0      | Τ | buffer | 1 |
| mux    | X | mux    | X | mux    | local | n | nux    | bypass |   | mux    | Х |
| xbar   | X | xbar   | X | xbar   | W->E  | X | xbar   | W->E   |   | xbar   | Х |

ICN | Spring 2020 | M07: SMART NoC

 $\ensuremath{\textcircled{\text{C}}}$  Tushar Krishna, School of ECE, Georgia Tech

Feb 17, 2020

### ACHIEVABLE HOPS PER CYCLE (HPC)





# PIPELINE AND IMPLEMENTATION





# THE DEVIL IS IN THE DETAILS

#### Managing Distributed Arbitration

- Could flits get misrouted?
- Could flits not arrive when expected?

#### SMART\_2D

How can flits bypass routers at turns?

#### Buffer Management

- How is a flit guaranteed a buffer (and in the correct virtual channel) if it is stopped mid-way?
- How is buffer availability conveyed?

#### Multi-flit packets

- How does SMART guarantee that flits (head, body, tail) of a packet do not get re-ordered?
- How do flits know which VC to stop in



# THE DEVIL IS IN THE DETAILS

#### Managing Distributed Arbitration

- Could flits get misrouted?
- Could flits not arrive when expected?

#### SMART\_2D

How can flits bypass routers at turns?

#### Buffer Management

- How is a flit guaranteed a buffer (and in the correct virtual channel) if it is stopped mid-way?
- How is buffer availability conveyed?

#### Multi-flit packets

- How does SMART guarantee that flits (head, body, tail) of a packet do not get re-ordered?
- How do flits know which VC to stop in

### **MANAGING DISTRIBUTED ARBITRATION**



### MANAGING DISTRIBUTED ARBITRATION



# MANAGING DISTRIBUTED ARBITRATION

- Distributed Consensus: All routers need to take the same decision about multiple contending flits in a distributed manner
- Solution: All routers follow the same static priority between the path setup requests that they receive
  - **Prio = Local**:  $0 \text{ hop } > 1 \text{ hop } > \dots (HPC_{max}-1) \text{ hop } > HPC_{max} \text{ hop}$
  - **Prio = Bypass**:  $HPC_{max}$  hop >  $(HPC_{max}-1)$  hop > ... 1 hop > 0 hop
- Implication: a router will not receive a flit that it does not expect
- But can a router **not receive** a flit that it **does expect**?



## MANAGING DISTRIBUTED ARBITRATION

Can a router *not receive* a flit that it *does expect*?







## **IMPACT OF FALSE NEGATIVES**



**Cycle 2: Flit: R0 \rightarrow R2.** 

forced *starvation* and throughput loss

|        |       |        |        |        |       |        | R3     |        |   |
|--------|-------|--------|--------|--------|-------|--------|--------|--------|---|
| buffer | 0     | buffer | 0      | buffer | 1     | buffer | 0      | buffer |   |
| mux    | local | mux    | bypass | mux    | local | mux    | bypass | mux    | Х |
| xbar   | W->E  | xbar   | W->E   | xbar   | W->E  | xbar   | W->E   | xbar   | Х |

ICN | Spring 2020 | M07: SMART NoC © Tushar Krishna, School of ECE, Georgia Tech



# IMPACT OF FALSE NEGATIVES

Prio = Bypass increases false negatives at high-loads





# THE DEVIL IS IN THE DETAILS

- Managing Distributed Arbitration
  - Could flits get misrouted?
  - Could flits not arrive when expected?

### SMART\_2D

How can flits bypass routers at turns?

### Buffer Management

- How is a flit guaranteed a buffer (and in the correct virtual channel) if it is stopped mid-way?
- How is buffer availability conveyed?

### Multi-flit packets

- How does SMART guarantee that flits (head, body, tail) of a packet do not get re-ordered?
- How do flits know which VC to stop in



### **BYPASS AT TURNS**





# **BYPASS AT TURNS**





## **CONTROL ARBITRATION**





# PRIORITY AT OUTPUT PORT





## PRIORITY AT INPUT PORT





# THE DEVIL IS IN THE DETAILS

- Managing Distributed Arbitration
  - Could flits get misrouted?
  - Could flits not arrive when expected?
- SMART\_2D
  - How can flits bypass routers at turns?

### Buffer Management

- How is a flit guaranteed a buffer (and in the correct virtual channel) if it is stopped mid-way?
- How is buffer availability conveyed?

### Multi-flit packets

- How does SMART guarantee that flits (head, body, tail) of a packet do not get re-ordered?
- How do flits know which VC to stop in



# THE DEVIL IS IN THE DETAILS

- Managing Distributed Arbitration
  - Could flits get misrouted?
  - Could flits not arrive when expected?
- SMART\_2D
  - How can flits bypass routers at turns?

### Buffer Management

- How is a flit guaranteed a buffer (and in the correct virtual channel) if it is stopped mid-way?
- How is buffer availability conveyed?

### Multi-flit packets

- How does SMART guarantee that flits (head, body, tail) of a packet do not get re-ordered?
- How do flits know which VC to stop in



### **BUFFER MANAGEMENT**



#### How do we guarantee R2 has a free buffer / VC for the flit from R0?

Every router has free VC information about its neighbor, just like the baseline. R0 sends only if R1 has a free buffer

R1 lets the incoming flit bypass only if R2 has a free buffer, else latches it.

**Corollary:** VCid is allocated after it stops, rather than before starting, since router where it stops is not known.



# THE DEVIL IS IN THE DETAILS

- Managing Distributed Arbitration
  - Could flits get misrouted?
  - Could flits not arrive when expected?
- SMART\_2D
  - How can flits bypass routers at turns?
- Buffer Management
  - How is a flit guaranteed a buffer (and in the correct virtual channel) if it is stopped mid-way?
  - How is buffer availability conveyed?

### Multi-flit packets

- How does SMART guarantee that flits (head, body, tail) of a packet do not get re-ordered?
- How do flits know which VC to stop in



## **MULTI-FLIT PACKETS (1)**



# How do we guarantee Body/Tail flits from R0 do not bypass R2 if Head was stopped?

If R2 sees a SSR requesting bypass from a router from where it has a buffered flit, it stops all subsequent flits from that router, and buffers them in the same VC



## **MULTI-FLIT PACKETS (2)**



# How do the Body/Tail flits know which VC to stop in if VC is allocated at the router where Head stops?

Match using source\_id of all the flits of the packet



## **MULTI-FLIT PACKETS (2)**



#### What if 2 flits from different packets from the same source arrive?

Virtual Cut-Through

- enough Buffers to store all flits of a packet
- all flits of one packet should leave before flits of another packet



### **EVALUATIONS**

- Simulation Infrastructure: GEMS (Full-System) + Garnet (NoC)
- System: 64-core (8x8 Mesh)
- Technology: 45nm, 1GHz
- Networks being compared
  - **BASELINE (t<sub>r</sub>=1):** Baseline Mesh with 1-cycle router at every hop
  - SMART-<HPC<sub>max</sub>>\_1D: Always stop at turning router
    - Best case delay: 4 cycles ( $\operatorname{Req}_X \rightarrow \operatorname{Flit}_X \rightarrow \operatorname{Req}_Y \rightarrow \operatorname{Flit}_Y$ )
  - SMART-<HPC<sub>max</sub>>\_2D: Bypass turning router
    - Best case delay: 2 cycles (Req  $\rightarrow$  Flit)
  - DREAM (T<sub>N</sub>=1): Contention-less 1-cycle network (fully-connected)



SMART reduces low-load latency to 2-4 cycles **across all** traffic patterns, *independent* of average hops.





# IMPACT OF HPC<sub>MAX</sub>



(smaller cores, similar frequency)







## **PROJECT IDEAS USING SMART**

### SMART NoCs

- SMART NoCs vs High-Radix topologies
- SMART NoCs with non-minimal routes
- SMART paths for latency guarantees
- SMART NoC inside GPU
- Reserve SMART paths using hints from cache/memorycontroller/OS







# PROTOCOL-LEVEL ORDERING

- If a virtual network requires protocol-level ordering
  - Only deterministic routing allowed within that virtual network.
  - Priority should be Local > Bypass
    - Guarantees that 2 flits from the same source do not overtake each other at any router.



### SMART VS. FLATTENED BUTTERFLY



What if we exploit repeated wires and add explicit 1-cyle **physical express links**?

|                             | SMART                                   | Flattened Butterfly                                       |                                                           |                                                  |  |  |  |
|-----------------------------|-----------------------------------------|-----------------------------------------------------------|-----------------------------------------------------------|--------------------------------------------------|--|--|--|
| Ports                       | orts 5-port router                      |                                                           | 15-port router                                            |                                                  |  |  |  |
| Router Delay                | 1-2 cycle in router<br>[SA-L, SSR+SA-G] | 3-4 cycle in router<br>[8-14:1 Arbiter, Multi-stage Xbar] |                                                           |                                                  |  |  |  |
| Bisection<br>Bandwidth (BB) | 128x8<br>1-flit req<br>5-flit resp      | 128x8 [ <b>1</b> x]<br>7-flit req<br>35-flit resp         | 448x8 [ <mark>3.5x</mark> ]<br>2-flit req<br>10-flit resp | 896x8 [ <b>7x</b> ]<br>1-flit req<br>5-flit resp |  |  |  |
| Dyn Power (mW)              | 24.5                                    | 23.1 [ <b>~1x</b> ]                                       | 37 [ <b>1.5x</b> ]                                        | 58 [ <b>2.3x</b> ]                               |  |  |  |
| Area (mm x mm)              | 0.27x0.27                               | 0.47x0.47 [ <mark>3x</mark> ]                             | 0.53x0.53 [ <mark>3.9x</mark> ]                           | 0.7x0.7 [ <b>6.7x</b> ]                          |  |  |  |
| R0                          |                                         |                                                           |                                                           |                                                  |  |  |  |



## SMART VS. FLATTENED BUTTERFLY

Assume 1-cycle Flattened Butterfly Router: Highly optimistic assumption



Better to use SMART and reconfigure 1-cycle multi-hop paths based on traffic rather than use a **fixed** high-radix topology.