### A 12nm 18.1TFLOPs/W Sparse Transformer Processor with Entropy-Based Early Exit, Mixed-Precision Predication, and Fine-Grained Power Management

<u>Thierry Tambe</u><sup>1</sup>, Jeff Zhang<sup>1</sup>, Coleman Hooper<sup>1</sup>, Tianyu Jia<sup>2</sup>, Paul N. Whatmough<sup>1,3</sup>, Joseph Zuckerman<sup>4</sup>, Maico Cassel<sup>4</sup>, Erik Loscalzo<sup>4</sup>, Davide Giri<sup>4</sup>, Kenneth Shepard<sup>4</sup>, Luca Carloni<sup>4</sup>, Alexander M. Rush<sup>5</sup>, David Brooks<sup>1</sup>, Gu-Yeon Wei<sup>1</sup>

<sup>1</sup>Harvard University, Cambridge, MA, <sup>2</sup>Peking University, Beijing, China, <sup>3</sup>ARM, Boston, MA, <sup>4</sup>Columbia University, New York, NY, <sup>5</sup>Cornell University, New York, NY

Harvard John A. Paulson School of Engineering and Applied Sciences

### **ML-based NLP is applied widely**



Language Modeling & Understanding







Search Engines

### Understanding searches better than ever before

Oct 25, 2019 · 5 min read https://blog.google/products/search/search-language-understanding-bert/

### Bing delivers its largest improvement in search experience using Azure GPUs

Posted on November 18, 2019



https://azure.microsoft.com/en-us/blog/bing-delivers-its-largest-improvement-in-search-experience-using-azure-gpus/

### **NLP Growing Overhead**



Source: https://amatriain.net/blog/transformer-models-anintroduction-and-catalog-2d1e9039f376/

### Opportunities to achieve higher energy efficiency on edge devices via careful algorithm-hardware co-design

© 2023 IEEE International Solid-State Circuits Conference

### **Processor for Efficient Transformer Computation**



### **Abstracting Energy Consumption**

$$\text{Energy} \propto \alpha \, C \, V_{DD}^2 \, N_{cycles}$$

- $\alpha$: switching activity factor
- $C$: wire and device capacitance
- $V_{DD}$: supply voltage
- $N_{cycles}$: number of inference clock cycles
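The proportionality above can be explored numerically. The following is a minimal sketch, not a model of the chip: all quantities are in relative units, and the layer counts and voltages are hypothetical values chosen only to show how each term moves the total.

```python
def relative_energy(alpha=1.0, cap=1.0, vdd=1.0, n_cycles=1.0):
    """Energy up to a constant factor: alpha * C * VDD^2 * N_cycles."""
    return alpha * cap * vdd ** 2 * n_cycles


# Hypothetical baseline: 12 units of work at nominal (1.0) supply voltage.
baseline = relative_energy(vdd=1.0, n_cycles=12)

# Hypothetical case 1: same voltage, but only 4 of the 12 units are executed.
fewer_cycles = relative_energy(vdd=1.0, n_cycles=4)

# Hypothetical case 2: all 12 units, but the supply is lowered to 0.8.
lower_vdd = relative_energy(vdd=0.8, n_cycles=12)

print(f"fewer cycles: {fewer_cycles / baseline:.2f}x of baseline energy")
print(f"lower VDD:    {lower_vdd / baseline:.2f}x of baseline energy")
```

Because $V_{DD}$ enters quadratically, a modest voltage reduction competes with a large cycle-count reduction, which is why the two levers are worth treating separately.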

### **Proposed Optimization Schemes**



### Outline

#### Motivation

#### Entropy-Driven Optimizations

- Early Exit
- Latency-Aware Voltage-Frequency Scaling

#### 12nm Transformer Accelerator Architecture

- Mixed-Precision FP4/FP8 Datapath

#### Chip Measurement Results

#### Summary

### **Conventional BERT Inference**



<u>Source</u>: Devlin, Jacob, et al. "Bert: Pre-training of deep bidirectional transformers for language understanding." *arXiv preprint arXiv:1810.04805* (2018)

### **BERT Inference with Entropy-based Early Exit**



Inference exits early once the prediction entropy at a layer falls below a user-given threshold.

<u>Source</u>: Xin, Ji, et al. "DeeBERT: Dynamic early exiting for accelerating BERT inference." *arXiv preprint arXiv:2004.12993* (2020).
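The exit rule can be sketched in a few lines. This is an illustrative sketch under assumptions, not the chip's or DeeBERT's implementation: it assumes every layer has a classifier that yields logits, uses natural-log entropy, and the function names and example logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def early_exit_layer(per_layer_logits, threshold):
    """Return (layer index, probabilities) at the first layer whose
    entropy is below the threshold, or the last layer if none qualifies.
    Assumes per_layer_logits is non-empty."""
    probs = None
    for i, logits in enumerate(per_layer_logits):
        probs = softmax(logits)
        if entropy(probs) < threshold:
            return i, probs
    return len(per_layer_logits) - 1, probs


# Hypothetical two-class logits that grow more confident with depth.
logits = [[0.1, 0.0], [1.0, -1.0], [4.0, -4.0]]
print(early_exit_layer(logits, threshold=0.5)[0])  # exits at layer index 1
print(early_exit_layer(logits, threshold=0.3)[0])  # exits at layer index 2
```

A larger threshold exits earlier and saves more cycles, at the risk of accepting a less confident prediction.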

### **Significant Latency Savings**



### **Two Optimization Directions**

- Latency minimization
- Energy minimization

### **Latency Minimization**



- Accelerator operates at max frequency with early exit
- Attention head pruning and mixed-precision FP4/FP8 datapath further cut inference latency

### **Energy Minimization**



Accelerator uses entropy statistics to derate its voltage and frequency while adhering to a prescribed latency target.
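One way to picture latency-aware scaling is as a search for the slowest operating point that still meets the deadline. The sketch below is an assumption-laden illustration: the voltage/frequency table, the cycles-per-layer figure, and the idea of a predicted layer count derived from entropy statistics are all hypothetical placeholders, not measured values from the chip.

```python
# Hypothetical (vdd_volts, freq_mhz) pairs, sorted by increasing frequency.
OPERATING_POINTS = [
    (0.65, 100.0),
    (0.75, 300.0),
    (0.85, 500.0),
    (1.00, 700.0),
]


def pick_operating_point(predicted_layers, cycles_per_layer,
                         latency_target_ms, points=OPERATING_POINTS):
    """Return the lowest-frequency point that finishes the predicted
    work within the latency target; fall back to the fastest point."""
    total_cycles = predicted_layers * cycles_per_layer
    # cycles per microsecond is numerically equal to MHz
    required_mhz = total_cycles / (latency_target_ms * 1e3)
    for vdd, freq in points:
        if freq >= required_mhz:
            return vdd, freq
    return points[-1]


# A sentence predicted to exit after 4 layers can run slower (and at a
# lower voltage) than one that needs all 12 layers, for the same target.
print(pick_operating_point(4, 1_000_000, latency_target_ms=50.0))
print(pick_operating_point(12, 1_000_000, latency_target_ms=50.0))
```

The energy benefit follows from the $V_{DD}^2$ dependence: slack created by an early exit is spent on a lower supply voltage instead of on finishing sooner.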

### Outline

#### Motivation

#### Entropy-Driven Optimizations

- Early Exit
- Latency-Aware Voltage-Frequency Scaling

#### 12nm Transformer Accelerator Architecture

- Mixed-Precision FP4/FP8 Datapath

#### Chip Measurement Results

#### Summary

### **Proposed Sparse Transformer Processor**



### **Compressed Sparse Mixed-Precision Execution**



- Bit-Mask Sparse Decoder
- Mixed-Precision FP4/FP8 Datapath
- Sparse Encoder
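Bit-mask compression in general stores one presence bit per element plus the packed non-zero values. The following is a minimal software sketch of that generic scheme; the function names are hypothetical and the chip's actual storage layout is not described in the slides.

```python
def bitmask_encode(values):
    """Split a dense vector into a 0/1 mask and its packed non-zeros."""
    mask = [1 if v != 0 else 0 for v in values]
    nonzeros = [v for v in values if v != 0]
    return mask, nonzeros


def bitmask_decode(mask, nonzeros):
    """Rebuild the dense vector by consuming one non-zero per set bit."""
    it = iter(nonzeros)
    return [next(it) if bit else 0 for bit in mask]


dense = [0, 3, 0, 0, -2, 0, 7, 0]
mask, packed = bitmask_encode(dense)
print(mask)    # [0, 1, 0, 0, 1, 0, 1, 0]
print(packed)  # [3, -2, 7]
```

The storage cost is one bit per element plus the non-zero payload, so the scheme pays off once a meaningful fraction of the elements are zero.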

### **Entropy-Controlled Voltage-Frequency Scaling**



Entropy-Controlled Voltage-Frequency Scaling using:

- Open-loop free-running LDO
- Cell-based PMOS power headers
- 16 pre-characterized LUT entropy values control the LDO drive strength
- DCO powered from LDO output
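A 16-entry lookup can be sketched as a search over sorted breakpoints. This is only an illustration: the breakpoint values below are hypothetical (evenly spaced up to roughly ln 2, the maximum entropy of a two-class prediction), and how a code maps to actual LDO drive strength is left open because the slides do not specify it.

```python
import bisect

# Hypothetical table: 16 sorted entropy breakpoints from 0 to about ln(2).
LUT_ENTROPY = [i * 0.6931 / 15 for i in range(16)]


def ldo_drive_code(entropy_value, lut=LUT_ENTROPY):
    """Return the index (0..15) of the first breakpoint that is greater
    than or equal to the entropy, clamped to the last entry."""
    idx = bisect.bisect_left(lut, entropy_value)
    return min(idx, len(lut) - 1)


print(ldo_drive_code(0.0))   # 0
print(ldo_drive_code(1.0))   # 15 (clamped)
```

Because the DCO is powered from the LDO output, selecting a drive code in this way would set voltage and frequency together with a single table lookup.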

### **Proposed Sparse Transformer Processor**



### Outline

#### Motivation

#### Entropy-Driven Optimizations

- Early Exit
- Latency-Aware Voltage-Frequency Scaling

#### 12nm Transformer Accelerator Architecture

- Mixed-Precision FP4/FP8 Datapath

#### Chip Measurement Results

#### Summary

### **Efficient Number Systems**



 $\gamma \propto$  distance between values

<u>Source</u>: Zhao, Jiawei, et al. "LNS-Madam: Low-Precision Training in Logarithmic Number System Using Multiplicative Weight Update." *IEEE Transactions on Computers*, 2022.

### Tensor Multiplication in Logarithmic Number System (LNS)

$$a = sign_a \times 2^{\tilde{a}/\gamma}, \qquad b = sign_b \times 2^{\tilde{b}/\gamma}$$

$$a^{T}b = \sum XOR(sign_{a}, sign_{b}) \times 2^{(\tilde{a}+\tilde{b})/\gamma}$$

#### No Need for Multipliers! (Only Adders + Shifters)

<u>Source</u>: Zhao, Jiawei, et al. "LNS-Madam: Low-Precision Training in Logarithmic Number System Using Multiplicative Weight Update." *IEEE Transactions on Computers*, 2022.
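The multiplier-free dot product can be sketched with integer adds, shifts, and a small table. The sketch assumes a sign bit of 0 for positive and 1 for negative, a hypothetical base factor of 8, and a fixed-point table for the fractional powers $2^{r/\gamma}$; none of these parameters come from the slides.

```python
GAMMA = 8        # hypothetical base factor
FRAC_BITS = 12   # hypothetical fixed-point precision of the table

# Table of 2^(r/GAMMA) for r = 0..GAMMA-1, stored as fixed-point integers.
REMAINDER_LUT = [round(2 ** (r / GAMMA) * (1 << FRAC_BITS))
                 for r in range(GAMMA)]


def lns_dot(signs_a, exps_a, signs_b, exps_b):
    """Dot product of two LNS vectors using only adds, shifts, and a LUT."""
    acc = 0
    for sa, ea, sb, eb in zip(signs_a, exps_a, signs_b, exps_b):
        e = ea + eb                  # multiplication becomes exponent addition
        q, r = divmod(e, GAMMA)      # 2^(e/GAMMA) = 2^q * 2^(r/GAMMA)
        if q >= 0:
            term = REMAINDER_LUT[r] << q
        else:
            term = REMAINDER_LUT[r] >> -q
        acc += -term if (sa ^ sb) else term   # XOR of sign bits gives the sign
    return acc / (1 << FRAC_BITS)


# a = [+2, -1] and b = [+4, +2], so the dot product is 8 - 2 = 6.
print(lns_dot([0, 1], [8, 0], [0, 0], [16, 8]))  # 6.0
```

When the base factor is a power of two, the quotient and remainder are themselves just a shift and a mask, so no multiplier or divider is needed anywhere in the loop.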

### **Tensor Scaling**



### **Mixed-Precision MAC Datapath**



### FP8 (E4M3) MAC



### FP4/LOG4 (E3M0) MAC
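The two formats differ only in how the bits are split between exponent and mantissa. The decoder below is a sketch under assumptions: it uses an IEEE-style layout with subnormals, ignores any special values (NaN, infinity, zero encoding for E3M0), and the default biases of 7 and 3 are hypothetical choices rather than the chip's settings.

```python
def decode_minifloat(bits, exp_bits, man_bits, bias):
    """Decode a sign/exponent/mantissa bit pattern into a Python float."""
    sign = -1.0 if (bits >> (exp_bits + man_bits)) & 1 else 1.0
    exp = (bits >> man_bits) & ((1 << exp_bits) - 1)
    man = bits & ((1 << man_bits) - 1)
    frac = man / (1 << man_bits)
    if exp == 0 and man_bits > 0:
        # Subnormal: no implicit leading one.
        return sign * frac * 2.0 ** (1 - bias)
    return sign * (1.0 + frac) * 2.0 ** (exp - bias)


def decode_fp8_e4m3(bits, bias=7):
    return decode_minifloat(bits, exp_bits=4, man_bits=3, bias=bias)


def decode_fp4_e3m0(bits, bias=3):
    # With no mantissa bits, every value is a signed power of two.
    return decode_minifloat(bits, exp_bits=3, man_bits=0, bias=bias)


print(decode_fp8_e4m3(0b0_1000_100))  # 3.0
print(decode_fp4_e3m0(0b1_101))       # -4.0
```

Since an E3M0 value is always a signed power of two, it behaves like a logarithmic number, which is consistent with the FP4/LOG4 naming on the slide.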



### **Tensor Scaling Unit**



### **Steep Accuracy Loss w/ Per-Tensor Bias in FP4**

| Configuration                   | SST-2 accuracy (%) |
|---------------------------------|--------------------|
| Baseline                        | 92.2               |
| w/ FP8 per-tensor exponent bias | 92.2               |
| w/ FP4 per-tensor exponent bias | **69.0**           |

### **Per-Vector Exponent Scaling when using FP4**



To avoid steep accuracy loss, we adopt per-vector exponent bias scaling in the FP4 regime.

### Per-Vector Scaling in FP4 Averts Steep Accuracy Loss

| Configuration             | SST-2 accuracy (%) |
|---------------------------|--------------------|
| Baseline                  | 92.2               |
| w/ FP8 per-tensor scaling | 92.2               |
| w/ FP4 per-tensor scaling | 69.0               |
| w/ FP4 per-vector scaling | 88.3               |
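A toy experiment shows why a shared exponent range hurts in a 3-bit-exponent format. The sketch below is illustrative only: it rounds values to the nearest power of two within an 8-level window, treats zeros as handled elsewhere, assumes each row has at least one non-zero, and uses a made-up matrix whose rows differ widely in magnitude.

```python
import math

EXP_LEVELS = 8  # a 3-bit exponent gives 8 representable powers of two


def top_exponent(values):
    """Exponent of the power of two nearest to the largest magnitude."""
    return round(math.log2(max(abs(v) for v in values if v != 0.0)))


def quantize_e3m0(x, e_max):
    """Round x to a signed power of two within [e_max - 7, e_max]."""
    if x == 0.0:
        return 0.0  # assumption: zeros are handled outside this format
    e = round(math.log2(abs(x)))
    e = max(e_max - (EXP_LEVELS - 1), min(e_max, e))
    return math.copysign(2.0 ** e, x)


def quantize_per_tensor(matrix):
    e_max = top_exponent([v for row in matrix for v in row])
    return [[quantize_e3m0(v, e_max) for v in row] for row in matrix]


def quantize_per_vector(matrix):
    out = []
    for row in matrix:
        e_max = top_exponent(row)  # one exponent window per row
        out.append([quantize_e3m0(v, e_max) for v in row])
    return out


def mean_rel_error(ref, approx):
    pairs = [(r, a) for rr, ra in zip(ref, approx) for r, a in zip(rr, ra)]
    return sum(abs(r - a) / abs(r) for r, a in pairs) / len(pairs)


# Hypothetical matrix: one large-magnitude row, one small-magnitude row.
m = [[100.0, 50.0, 25.0], [0.001, 0.002, 0.004]]
err_tensor = mean_rel_error(m, quantize_per_tensor(m))
err_vector = mean_rel_error(m, quantize_per_vector(m))
print(f"per-tensor error: {err_tensor:.3f}, per-vector error: {err_vector:.3f}")
```

With a single window for the whole matrix, the small row is clamped to the bottom of the range and loses almost all information; giving each row its own window keeps both rows within a power-of-two rounding error.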

### **Entropy-Controlled Precision Selection**



Pre-calibrated entropy predication selects between FP4 and FP8 MAC during mixed-precision operation

### Outline

#### Motivation

#### Entropy-Driven Optimizations

- Early Exit
- Latency-Aware Voltage-Frequency Scaling

#### 12nm Transformer Accelerator Architecture

- Mixed-Precision FP4/FP8 Datapath

#### Chip Measurement Results

#### Summary

### **12nm Chip Tapeout**



© 2023 IEEE International Solid-State Circuits Conference 22.9: A 12nm 18.1TFLOPs/W Sparse Transformer Processor with Entropy-Based Early Exit, Mixed-Precision Predication and Fine-Grained Power Management

[Figure: chip summary; surviving table entries: 0.62 - 1.0, 77 - 717, 647]

### **Accelerator Efficiency**



### **Entropy Hardware Unit**







### **Measured Latency Results**



### **Measured Energy Results**



### Summary

- Large language models impose a heavy cost on low-capacity edge devices
- This work enables fine-grained sentence-level latency and energy optimizations for BERT inference aided by:
  - Entropy-based early exit
  - Entropy-based voltage/frequency scaling
  - FP4/FP8 mixed-precision MAC

#### Measurements on test chip show:

- Up to 6x latency reduction and 7x energy reduction over conventional inference
- Peak energy efficiency of 18.1 TFLOPs/W

# This Work is Dedicated to our Friend and Collaborator: Davide Giri



#### 1990 - 2021

### **Thank You!**

### Acknowledgements

- This work is supported in part by DARPA, JUMP ADA, NSF Awards 1704834 and 1718160, Intel Corp., and Arm Inc.
- We thank our DARPA collaborators from IBM, Pradip Bose, Martin Cochet, and Karthik Swaminathan for helping support this work.