A 28nm 4.96 TOPS/W End-to-End Diffusion Accelerator with Reconfigurable Hyper-Precision and Unified Non-Matrix Processing Engine

This paper presents Picasso, an end-to-end diffusion
accelerator. Picasso introduces a novel hyper-precision data type
and a reconfigurable architecture that maximize hardware
efficiency while extending dynamic range, with no loss
in accuracy. Picasso also introduces a unified engine that executes
all non-matrix operations in a streamlined processing flow, and
it minimizes end-to-end latency through sub-block pipeline
scheduling. The accelerator is fabricated in 28nm CMOS technology
and achieves an energy efficiency of 4.96 TOPS/W and a peak
performance of 9.83 TOPS. Compared with prior works, Picasso
achieves speedups of 8.4×-26.8× while improving energy and
area efficiency by 1.1×-2.8× and 3.6×-30.5×, respectively.
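
To illustrate why sub-block pipeline scheduling can reduce end-to-end latency, the following is a minimal sketch of a generic two-stage pipeline model, not Picasso's actual scheduler. It assumes that a block's matrix work and non-matrix work can each be split evenly into independent sub-blocks, so that the non-matrix stage of one sub-block overlaps with the matrix stage of the next. The function names, parameters, and timing values are hypothetical and are not taken from the paper.

```python
def sequential_latency(t_matrix: float, t_non_matrix: float) -> float:
    """Latency when the non-matrix stage waits for the whole matrix stage."""
    return t_matrix + t_non_matrix


def sub_block_pipelined_latency(
    t_matrix: float, t_non_matrix: float, num_sub_blocks: int
) -> float:
    """Latency of an idealized two-stage pipeline over evenly split sub-blocks.

    Assumption (hypothetical model): sub-blocks are independent, so the
    non-matrix stage of sub-block i runs while the matrix stage processes
    sub-block i + 1.
    """
    if num_sub_blocks < 1:
        raise ValueError("num_sub_blocks must be at least 1")
    tm = t_matrix / num_sub_blocks      # matrix time per sub-block
    tn = t_non_matrix / num_sub_blocks  # non-matrix time per sub-block
    # Fill the pipeline once, then the slower stage sets the pace.
    return tm + tn + (num_sub_blocks - 1) * max(tm, tn)


if __name__ == "__main__":
    # Arbitrary illustrative timings, not measurements from the paper.
    for k in (1, 4, 8):
        print(k, sub_block_pipelined_latency(8.0, 4.0, k))
```

Under this model, latency approaches that of the slower stage alone as the number of sub-blocks grows, which is the general motivation for scheduling at a finer granularity than whole blocks.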