MAT: Processing In-Memory Acceleration for Long-Sequence Attention
Time
Tuesday, December 7th, 1:30pm - 1:53pm PST
Event Type
Research Manuscript
Virtual Programs
Presented In-Person
Near-Memory and In-Memory Computing
Embedded Systems
Description
Processing attention-based machine learning models can be prohibitively costly on long sequences because of their large memory consumption. In this work, we propose MAT, a processing in-memory framework, to accelerate long-sequence attention. MAT adopts a memory-efficient processing flow for attention models that processes sub-sequences in a pipeline with a much smaller memory footprint. MAT uses two techniques, reuse-driven data layout and optimal sample scheduling, to optimize the performance of memory-efficient attention. Our experiments show that MAT provides significant speedups and energy efficiency improvements over TPU and GPU on two emerging long-sequence tasks: medical image processing and natural language processing.
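To make the idea of processing sub-sequences with a smaller memory footprint concrete, here is a minimal software sketch of chunked attention for a single query. It is not MAT's processing in-memory flow, data layout, or scheduling, which the abstract does not detail. It shows only the general technique of consuming keys and values one sub-sequence at a time while keeping a running softmax maximum, normalizer, and output accumulator. The function names, the `chunk_size` parameter, and the scaled dot-product scoring are assumptions made for illustration.

```python
import math


def full_attention(q, keys, values):
    """Reference: softmax(q . K^T / sqrt(d)) V for one query, scoring all keys at once."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim_v = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim_v)]


def chunked_attention(q, keys, values, chunk_size):
    """Same result, but keys/values are consumed one sub-sequence at a time.

    Only one chunk of scores is held at once (hypothetical helper, not MAT's API).
    """
    d = len(q)
    dim_v = len(values[0])
    running_max = -math.inf   # largest score seen so far
    running_sum = 0.0         # softmax normalizer, relative to running_max
    acc = [0.0] * dim_v       # unnormalized weighted sum of values

    for start in range(0, len(keys), chunk_size):
        k_chunk = keys[start:start + chunk_size]
        v_chunk = values[start:start + chunk_size]
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in k_chunk]

        # Rescale earlier partial results to the new maximum.
        new_max = max(running_max, max(scores))
        scale = math.exp(running_max - new_max)  # 0.0 on the first chunk
        running_sum *= scale
        acc = [a * scale for a in acc]

        # Fold this chunk into the running normalizer and accumulator.
        for s, v in zip(scores, v_chunk):
            w = math.exp(s - new_max)
            running_sum += w
            for j in range(dim_v):
                acc[j] += w * v[j]
        running_max = new_max

    return [a / running_sum for a in acc]
```

In this sketch the number of scores held at any time is bounded by `chunk_size` rather than by the sequence length, which is the property that makes long sequences tractable. The chunked result matches the full computation up to floating-point rounding.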