Publications

Conference

2025

arXiv

GainSight: Application-Guided Profiling for Composing Heterogeneous On-Chip Memories in AI Hardware Accelerators

Peijing Li, Matthew Hung, Yiming Tan, Konstantin Hoßfeld, Jake Jiajun Cheng, Shuhan Liu, Lixian Yan, Xinxin Wang, H-S Philip Wong, and Thierry Tambe

In arXiv:2504.14866, 2025

arXiv

BlockDialect: Block-wise Fine-grained Mixed Format for Energy-Efficient LLM Inference

Wonsuk Jang and Thierry Tambe

In arXiv:2501.01144, 2025


2024

ISLPED

Accelerating DNN Execution with Adaptive N:M Pruning on Both Weight and Data

Sai Qian Zhang, Thierry Tambe, David Brooks, and Gu-Yeon Wei

In ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED), 2024

HPCA

CAMEL: Co-Designing AI Models and Embedded DRAMs for Efficient On-Device Learning

Sai Qian Zhang*, Thierry Tambe*, Nestor Cuevas, Gu-Yeon Wei, and David Brooks

In International Symposium on High-Performance Computer Architecture (HPCA), 2024

ISSCC

A 12nm Linux-SMP-Capable RISC-V SoC with 14 Accelerator Types, Distributed Hardware Power Management and Flexible NoC-based Data Orchestration

M.C. Santos, T. Jia, J. Zuckerman, M. Cochet, D. Giri, E. Loscalzo, K. Swaminathan, T. Tambe, J. Zhang, A. Buyuktosunoglu, K-L. Chiu, G-D. Guglielmo, G. Tombesi, D. Trilla, J-D. Wellman, E-Y. Yang, A. Amarnath, Y. Jing, B. Mishra, J. Park, V. Suresh, S. Adve, D. Brooks, L. Carloni, K. Shepard, and G-Y. Wei

In 2024 IEEE International Solid-State Circuits Conference (ISSCC), 2024

2023

IROS

VaPr: Variable-Precision Tensors to Accelerate Robot Motion Planning

Yu-Shun Hsiao, Siva Hari, Balakumar Sundaralingam, Jason Yik, Thierry Tambe, Charbel Sakr, Stephen Keckler, and Vijay Janapa Reddi

In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2023

ISSCC

A 12nm 18.1TFLOPs/W Sparse Transformer Processor with Entropy-Based Early Exit, Mixed-Precision Predication and Fine-Grained Power Management

Thierry Tambe, Jeff Zhang, Coleman Hooper, Tianyu Jia, Paul N. Whatmough, Joseph Zuckerman, Maico Cassel Dos Santos, Erik Jens Loscalzo, Davide Giri, Kenneth Shepard, Luca Carloni, Alexander Rush, David Brooks, and Gu-Yeon Wei

In 2023 IEEE International Solid-State Circuits Conference (ISSCC), 2023


Large language models have substantially advanced nuance and context understanding in natural language processing (NLP), further fueling the growth of intelligent conversational interfaces and virtual assistants. However, their hefty computational and memory demands make them potentially expensive to deploy on cloudless edge platforms with strict latency and energy requirements. To address this challenge, we present a 4.60 mm² sparse transformer processor (STP) that efficiently accelerates transformer workloads by tailoring its latency and energy expenditures according to the complexity of the input query it processes. Key contributions of this work are as follows: (1) A specialized datapath for entropy-based early exit assessment reduces BERT latency by up to 6.13x, e.g., inferences terminate early at an average early exit layer of 3.90 (out of 12) for the SST-2 NLP benchmark; (2) A mixed-precision (MP) FP4/FP8 MAC supports per-vector exponent biases during 4-bit floating-point (FP4) computations, allowing the processor to double its throughput while reducing its energy consumption and maintaining high inference accuracy, depending on the entropy; and (3) A fine-grained sentence-level power management scheme opportunistically scales the accelerator’s supply voltage and clock frequency while meeting an application’s end-to-end latency target. Together, the proposed STP achieves a peak efficiency of 65mJ/inf, a 7.14x energy improvement, on average, over conventional BERT inference without the key innovations.

2022

DSN

GoldenEye: A Platform for Evaluating Emerging Numerical Data Formats in DNN Accelerators

Abdulrahman Mahmoud, Thierry Tambe, Tarek Aloui, David Brooks, and Gu-Yeon Wei

In 2022 52nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2022

ICS

ASAP: Automatic Synthesis of Area-Efficient and Precision-Aware CGRAs

Cheng Tan, Thierry Tambe, Jeff (Jun) Zhang, Bo Fang, Tong Geng, Gu-Yeon Wei, David Brooks, Antonino Tumeo, Ganesh Gopalakrishnan, and Ang Li

In Proceedings of the 36th ACM International Conference on Supercomputing, 2022


2021

MICRO

EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware Multi-Task NLP Inference

Thierry Tambe, Coleman Hooper, Lillian Pentecost, Tianyu Jia, En-Yu Yang, Marco Donato, Victor Sanh, Paul Whatmough, Alexander M. Rush, David Brooks, and Gu-Yeon Wei

In MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, 2021


Transformer-based language models such as BERT provide significant accuracy improvement to a multitude of natural language processing (NLP) tasks. However, their hefty computational and memory demands make them challenging to deploy to resource-constrained edge platforms with strict latency requirements. We present EdgeBERT, an in-depth algorithm-hardware co-design for latency-aware energy optimizations for multi-task NLP. EdgeBERT employs entropy-based early exit predication in order to perform dynamic voltage-frequency scaling (DVFS), at a sentence granularity, for minimal energy consumption while adhering to a prescribed target latency. Computation and memory footprint overheads are further alleviated by employing a calibrated combination of adaptive attention span, selective network pruning, and floating-point quantization. Furthermore, in order to maximize the synergistic benefits of these algorithms in always-on and intermediate edge computing settings, we specialize a 12nm scalable hardware accelerator system, integrating a fast-switching low-dropout voltage regulator (LDO), an all-digital phase-locked loop (ADPLL), as well as high-density embedded non-volatile memories (eNVMs) wherein the sparse floating-point bit encodings of the shared multi-task parameters are carefully stored. Altogether, latency-aware multi-task NLP inference acceleration on the EdgeBERT hardware system consumes up to 7x, 2.5x, and 53x less energy compared to the conventional inference without early stopping, the latency-unbounded early exit approach, and CUDA adaptations on an Nvidia Jetson Tegra X2 mobile GPU, respectively.
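
The entropy-based early exit idea mentioned in the abstract above can be illustrated with a minimal sketch. This is not the EdgeBERT implementation: the function names, the per-layer logits input, and the threshold value are assumptions made for illustration, and the sketch does not model the DVFS or latency-target logic. It only shows the general technique of stopping at the first layer whose classifier output has low enough entropy.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def early_exit_layer(per_layer_logits, threshold):
    """Return (1-based exit layer, predicted class).

    per_layer_logits is a hypothetical input: one logit vector per layer,
    as produced by a classifier head attached to each layer.
    Inference stops at the first layer whose output entropy falls below
    the threshold; otherwise the final layer's prediction is used.
    """
    if not per_layer_logits:
        raise ValueError("need at least one layer of logits")
    probs = []
    for layer, logits in enumerate(per_layer_logits, start=1):
        probs = softmax(logits)
        if entropy(probs) < threshold:
            return layer, probs.index(max(probs))
    return len(per_layer_logits), probs.index(max(probs))


if __name__ == "__main__":
    # Confidence grows with depth; the third layer is confident enough.
    logits = [[0.1, 0.0], [1.0, 0.0], [6.0, 0.0]]
    print(early_exit_layer(logits, threshold=0.1))
```

A lower threshold makes the model exit later (more confident, more compute); a higher threshold exits earlier. How the threshold is chosen in practice is outside the scope of this sketch.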

ASPLOS

Robomorphic Computing: A Design Methodology for Domain-Specific Accelerators Parameterized by Robot Morphology

Sabrina M. Neuman, Brian Plancher, Thomas Bourgeat, Thierry Tambe, Srinivas Devadas, and Vijay Janapa Reddi

In Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2021


Robotics applications have hard time constraints and heavy computational burdens that can greatly benefit from domain-specific hardware accelerators. For the latency-critical problem of robot motion planning and control, there exists a performance gap of at least an order of magnitude between joint actuator response rates and state-of-the-art software solutions. Hardware acceleration can close this gap, but it is essential to define automated hardware design flows to keep the design process agile as applications and robot platforms evolve. To address this challenge, we introduce robomorphic computing: a methodology to transform robot morphology into a customized hardware accelerator morphology. We (i) present this design methodology, using robot topology and structure to exploit parallelism and matrix sparsity patterns in accelerator hardware; (ii) use the methodology to generate a parameterized accelerator design for the gradient of rigid body dynamics, a key kernel in motion planning; (iii) evaluate FPGA and synthesized ASIC implementations of this accelerator for an industrial manipulator robot; and (iv) describe how the design can be automatically customized for other robot models. Our FPGA accelerator achieves speedups of 8x and 86x over CPU and GPU when executing a single dynamics gradient computation. It maintains speedups of 1.9x to 2.9x over CPU and GPU, including computation and I/O round-trip latency, when deployed as a coprocessor to a host CPU for processing multiple dynamics gradient computations. ASIC synthesis indicates an additional 7.2x speedup for single computation latency. We describe how this principled approach generalizes to more complex robot platforms, such as quadrupeds and humanoids, as well as to other computational kernels in robotics, outlining a path forward for future robomorphic computing accelerators.

ISSCC

A 25mm² SoC for IoT Devices with 18ms Noise-Robust Speech-to-Text Latency via Bayesian Speech Denoising and Attention-Based Sequence-to-Sequence DNN Speech Recognition in 16nm FinFET

Thierry Tambe, En-Yu Yang, Glenn G. Ko, Yuji Chai, Coleman Hooper, Marco Donato, Paul N. Whatmough, Alexander M. Rush, David Brooks, and Gu-Yeon Wei

In 2021 IEEE International Solid-State Circuits Conference (ISSCC), 2021


2020

HotChips

A Scalable Bayesian Inference Accelerator for Unsupervised Learning

Glenn Ko, Yuji Chai, Marco Donato, Paul N. Whatmough, Thierry Tambe, Rob A. Rutenbar, Gu-Yeon Wei, and David Brooks

In 2020 IEEE Hot Chips 32 Symposium (HCS), 2020

DAC
Best Paper Award

Algorithm-Hardware Co-Design of Adaptive Floating-Point Encodings for Resilient Deep Learning Inference

Thierry Tambe, En-Yu Yang, Zishen Wan, Yuntian Deng, Vijay Janapa Reddi, Alexander Rush, David Brooks, and Gu-Yeon Wei

In 2020 57th ACM/IEEE Design Automation Conference (DAC), 2020


Conventional hardware-friendly quantization methods, such as fixed-point or integer, tend to perform poorly at very low precision as their shrunken dynamic ranges cannot adequately capture the wide data distributions commonly seen in sequence transduction models. We present an algorithm-hardware co-design centered around a novel floating-point inspired number format, AdaptivFloat, that dynamically maximizes and optimally clips its available dynamic range, at a layer granularity, in order to create faithful encodings of neural network parameters. AdaptivFloat consistently produces higher inference accuracies compared to block floating-point, uniform, IEEE-like float or posit encodings at low bit precision (≤ 8-bit) across a diverse set of state-of-the-art neural networks, exhibiting narrow to wide weight distribution. Notably, at 4-bit weight precision, only a 2.1 degradation in BLEU score is observed on the AdaptivFloat-quantized Transformer network compared to total accuracy loss when encoded in the above-mentioned prominent datatypes. Furthermore, experimental results on a deep neural network (DNN) processing element (PE), exploiting AdaptivFloat logic in its computational datapath, demonstrate per-operation energy and area that is 0.9X and 1.14X, respectively, that of an equivalent bit width NVDLA-like integer-based PE.

VLSI Symp

A 3mm² Programmable Bayesian Inference Accelerator for Unsupervised Machine Perception using Parallel Gibbs Sampling in 16nm

Glenn G. Ko, Yuji Chai, Marco Donato, Paul N. Whatmough, Thierry Tambe, Rob A. Rutenbar, David Brooks, and Gu-Yeon Wei

In 2020 IEEE Symposium on VLSI Circuits, 2020


2019

PACT

MASR: A Modular Accelerator for Sparse RNNs

Udit Gupta, Brandon Reagen, Lillian Pentecost, Marco Donato, Thierry Tambe, Alexander Rush, Gu-Yeon Wei, and David Brooks

In 2019 28th International Conference on Parallel Architectures and Compilation Techniques (PACT), 2019


Journal

2024

TODAES

Application-level Validation of Accelerator Designs Using a Formal Software/Hardware Interface

Bo-Yuan Huang, Steven Lyubomirsky, Yi Li, Mike He, Gus Henry Smith, Thierry Tambe, Akash Gaonkar, Vishal Canumalla, Andrew Cheung, Gu-Yeon Wei, Aarti Gupta, Zachary Tatlock, and Sharad Malik

ACM Trans. Des. Autom. Electron. Syst., Feb 2024


2023

JSSC

A 16-nm SoC for Noise-Robust Speech and NLP Edge AI Inference With Bayesian Sound Source Separation and Attention-Based DNNs

Thierry Tambe, En-Yu Yang, Glenn G. Ko, Yuji Chai, Coleman Hooper, Marco Donato, Paul N. Whatmough, Alexander M. Rush, David Brooks, and Gu-Yeon Wei

IEEE Journal of Solid-State Circuits, Feb 2023


The proliferation of personal artificial intelligence (AI)-assistant technologies with speech-based conversational AI interfaces is driving the exponential growth in the consumer Internet of Things (IoT) market. As these technologies are being applied to keyword spotting (KWS), automatic speech recognition (ASR), natural language processing (NLP), and text-to-speech (TTS) applications, it is of paramount importance that they provide uncompromising performance for context learning in long sequences, which is a key benefit of the attention mechanism, and that they work seamlessly in polyphonic environments. In this work, we present a 25-mm² system-on-chip (SoC) in 16-nm FinFET technology, codenamed SM6, which executes end-to-end speech-enhancing attention-based ASR and NLP workloads. The SoC includes: 1) FlexASR, a highly reconfigurable NLP inference processor optimized for whole-model acceleration of bidirectional attention-based sequence-to-sequence (seq2seq) deep neural networks (DNNs); 2) a Markov random field source separation engine (MSSE), a probabilistic graphical model accelerator for unsupervised inference via Gibbs sampling, used for sound source separation; 3) a dual-core Arm Cortex A53 CPU cluster, which provides on-demand single instruction/multiple data (SIMD) fast Fourier transform (FFT) processing and performs various application logic (e.g., expectation–maximization (EM) algorithm and 8-bit floating-point (FP8) quantization); and 4) an always-ON M0 subsystem for audio detection and power management. Measurement results demonstrate the efficiency ranges of 2.6–7.8 TFLOPs/W and 4.33–17.6 Gsamples/s/W for FlexASR and MSSE, respectively; MSSE denoising performance allowing a 6x smaller ASR model to be stored on-chip with negligible accuracy loss; and 2.24-mJ energy consumption while achieving real-time throughput, end-to-end, and per-frame ASR latencies of 18 ms.

Technical Report

2022

LATTE

Learnings from a HLS-based High-Productivity Digital VLSI Flow

Thierry Tambe, David Brooks, and Gu-Yeon Wei

Feb 2022


2019

arXiv

AdaptivFloat: A Floating-point based Data Type for Resilient Deep Learning Inference

Thierry Tambe, En-Yu Yang, Zishen Wan, Yuntian Deng, Vijay Janapa Reddi, Alexander M. Rush, David M. Brooks, and Gu-Yeon Wei

Feb 2019


Preprint

2022

arXiv

Specialized Accelerators and Compiler Flows: Replacing Accelerator APIs with a Formal Software/Hardware Interface

Bo-Yuan Huang, Steven Lyubomirsky, Yi Li, Mike He, Thierry Tambe, Gus Henry Smith, Akash Gaonkar, Vishal Canumalla, Gu-Yeon Wei, Aarti Gupta, Zachary Tatlock, and Sharad Malik

Feb 2022


Thesis

2023

Ph.D.

Architecting High Performance Silicon Systems for Accurate and Efficient On-Chip Deep Learning

Thierry Tambe

Feb 2023

PhD Dissertation, Electrical Engineering, Harvard University


The unabated pursuit of omniscient and omnipotent AI is levying hefty latency, memory, and energy taxes at all computing scales. At the same time, the twilight of Dennard scaling means traditional performance gains are no longer proportionally attained with reduction in transistor feature size, compelling a global trend towards application-based hardware specialization. Over the course of my PhD, I have built a heterogeneity of solutions co-optimized across the algorithm, architecture, and silicon stack to generate breakthrough advances in arithmetic performance, compute density and flexibility, and energy efficiency for on-chip machine learning (ML), and natural language processing (NLP) in particular. My work aims to significantly increase the application space of embedded ML computing, in both the inference and training regimes, by coalescing innovative vectors spanning the algorithm, memory subsystem, hardware architecture, and circuit layers, while tuning their designs and inter-dependencies to promote greater performance, energy efficiency, and reliability within a silicon chip system. On the algorithm front, this thesis discusses best paper award-winning work on a novel floating-point based data type, AdaptivFloat, which enables resilient quantized AI computations and is particularly suitable for NLP networks with wide parameter distributions. To evaluate the impact of AdaptivFloat on a real system, this thesis describes a 16nm chip prototype that integrates FlexASR, a programmable hardware accelerator with AdaptivFloat-based processing elements, specialized for attention-based recurrent neural networks used in speech and machine translation AI workloads. We further verify the fidelity of FlexASR to the front-end AI application via a formal hardware/software compiler interface.
Towards the goal of lowering the prohibitive energy cost of inferencing large language models on TinyML devices, this dissertation describes a principled algorithm-hardware co-design solution, validated in a 12nm chip tapeout, that accelerates Transformer workloads by tailoring the accelerator’s latency and energy expenditures according to the complexity of the input query it processes. Finally, recognizing that the overwhelming majority of the data generated during the deep learning training process exhibits a very short-lived lifetime, this thesis proposes leveraging non-conventional embedded dynamic RAMs (eDRAMs) as the main on-chip storage medium for ML training data – which, along with a tightly-coupled offering of algorithmic alterations and custom hardware specialization, yields significant energy efficiency advantages over conventional SRAMs.