Processing-in-Memory

“Create Computer Systems In Memories”

Traditionally, the CPU has been the center of a computing system, executing arithmetic and logic operations, while memory is built around it simply to load and store data. Today, due to technology scaling, compute units execute operations faster than memory units can supply the required data. As a result, computation is no longer the most time- and energy-consuming part of the system; instead, the cost of moving data to where the computation happens has become the bottleneck.
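The imbalance is easy to see in code. The minimal C sketch below (array names and sizes are illustrative) performs one addition per element while moving twelve bytes per element over the memory bus, so its runtime is set almost entirely by data movement rather than by arithmetic.

```c
#include <stddef.h>

/* Vector add: one add per 12 bytes of traffic (two 4-byte loads,
 * one 4-byte store), i.e., an arithmetic intensity of ~0.08 FLOP/byte.
 * On a modern CPU this kernel is limited by memory bandwidth, not by
 * the ALUs: the cores mostly sit idle waiting for data to arrive. */
void vector_add(const float *a, const float *b, float *c, size_t n) {
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];   /* 1 FLOP, 12 bytes moved */
}
```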

The memory-centric model takes the opposite approach to the traditional compute-centric model to address this expensive data movement. It adopts processing-in-memory (PIM) to eliminate redundant data movement: PIM integrates processing engines in or near the memory, so that computation is performed where the data already resides. This trend appears at multiple levels of the hardware system; PIM research spans both the main-memory and cache levels, which are built from DRAM and SRAM, respectively.
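The contrast can be sketched in C. The `pim_sum` function below is a hypothetical placeholder for a PIM offload interface, not a real API; the point is that in the memory-centric version only a small command and an 8-byte result cross the off-chip bus, while the array itself never leaves the memory.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical PIM offload call (illustrative only): the reduction
 * runs on processing engines placed next to the memory arrays. */
int64_t pim_sum(const int32_t *array_in_pim_memory, size_t n);

/* Compute-centric: every element crosses the off-chip bus to the CPU. */
int64_t sum_on_cpu(const int32_t *a, size_t n) {
    int64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];            /* n * 4 bytes of off-chip traffic */
    return s;
}

/* Memory-centric: only the command and the 8-byte result move. */
int64_t sum_in_memory(const int32_t *a, size_t n) {
    return pim_sum(a, n);     /* O(1) off-chip traffic */
}
```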

DRAM-based PIM 

DRAM-based PIM is an attractive way to alleviate the von Neumann bottleneck by reducing off-chip data movement. Rather than pushing data through the narrow off-chip bus to the CPU, DRAM-based PIM exploits the high internal bandwidth and parallelism of the memory to compute in place.

Thanks to DRAM's high capacity and internal bandwidth, data-intensive applications such as deep learning (DL) and data analytics can benefit from DRAM-based PIM. Memory-bound applications are of particular interest, including recommendation systems (e.g., DLRM), language models (e.g., BERT, GPT), and online analytical processing (OLAP).
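For example, the embedding lookups that dominate recommendation models such as DLRM reduce to a gather-and-sum over a large table, as in the C sketch below (the table layout and names are illustrative). Each lookup streams an entire embedding row out of memory for a single add per element, exactly the low-arithmetic-intensity, bandwidth-bound pattern that DRAM-based PIM targets.

```c
#include <stddef.h>

/* Sparse-length-sum over an embedding table: for each of k lookup
 * indices, fetch a d-dimensional row and accumulate it.  The random
 * row accesses defeat caching, so performance is bound by DRAM
 * bandwidth; summing inside DRAM avoids moving k*d floats off-chip. */
void embedding_sum(const float *table, size_t d,
                   const size_t *indices, size_t k, float *out) {
    for (size_t j = 0; j < d; j++)
        out[j] = 0.0f;
    for (size_t i = 0; i < k; i++) {
        const float *row = table + indices[i] * d;
        for (size_t j = 0; j < d; j++)
            out[j] += row[j];         /* 1 add per 4 bytes fetched */
    }
}
```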

Related Publications:

PRIMO: A Full-Stack Processing-in-DRAM Emulation Framework for Machine Learning Workloads (ICCAD 2023)

SRAM-based PIM 

SRAM-based PIM likewise removes the overhead of unnecessary data movement between the memory and the processing engines. It can shrink the distance even further by integrating computation into the SRAM cell itself. Much research modifies SRAM cells to perform simple Boolean operations, which can be composed into the multiply-and-accumulate (MAC) operations that dominate machine learning (ML) inference and training.
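As one illustration of how Boolean in-SRAM operations compose into MACs, consider binarized ML layers, where a dot product of {-1, +1} vectors reduces to a bitwise XNOR followed by a population count. The C sketch below emulates in software what modified SRAM arrays can evaluate along their bitlines, with the popcount accumulated by peripheral logic.

```c
#include <stdint.h>

/* Dot product of two 64-element {-1,+1} vectors packed one bit per
 * element (bit = 1 encodes +1, bit = 0 encodes -1).  A matching bit
 * pair contributes +1 and a mismatch contributes -1, so:
 *   dot = 2 * popcount(XNOR(a, b)) - 64
 * In-SRAM designs compute the XNOR on the bitlines; here the same
 * arithmetic is emulated with ordinary bitwise instructions. */
int binary_dot64(uint64_t a, uint64_t b) {
    uint64_t match = ~(a ^ b);                 /* bitwise XNOR */
    return 2 * __builtin_popcountll(match) - 64;
}
```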