Softermax: Hardware/Software Co-Design of an Efficient Softmax for Transformers
Time: Tuesday, December 7th, 3:30pm - 3:50pm PST
Event Type: Research Manuscript
Virtual Programs: Presented In-Person
Approximate Computing for AI/ML
Description: Transformer-based networks are increasingly popular, achieving state-of-the-art performance on a number of tasks. This performance is largely attributed to their stacked "self-attention" layers, each of which consists of matrix multiplies as well as softmax operations. As a result, unlike in other neural networks, the softmax operation contributes significantly to the total run-time of Transformers. To address this, we propose Softermax, a hardware-friendly softmax design. Softermax consists of three components: base replacement, low-precision softmax computations, and an online normalization calculation. We demonstrate that Softermax enables a large improvement in TOPs/mm^2 with negligible impact on network accuracy.
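A minimal sketch of two of the ideas named in the description, base replacement (using 2^x in place of e^x, which maps to cheap shifts in hardware) and online normalization (a single streaming pass that tracks the running maximum and rescales the running sum, rather than separate max and sum passes). The function name and structure below are illustrative assumptions, not the paper's implementation, and this reference version uses full-precision floats rather than the paper's low-precision arithmetic.

```python
def softermax_sketch(scores):
    # Online normalization: one streaming pass over the scores,
    # maintaining a running max and a running sum of 2^(x - max).
    # Base replacement (illustrative): 2^x instead of e^x.
    running_max = float("-inf")
    running_sum = 0.0
    for x in scores:
        new_max = max(running_max, x)
        # When the running max increases, rescale the accumulated
        # sum so all terms stay relative to the new maximum.
        running_sum = running_sum * 2.0 ** (running_max - new_max) \
                      + 2.0 ** (x - new_max)
        running_max = new_max
    # Final normalization using the max and sum found in one pass.
    return [2.0 ** (x - running_max) / running_sum for x in scores]
```

Subtracting the running max before exponentiating keeps the intermediate values in [0, 1], which is what makes the low-precision computation in the paper viable; the base-2 softmax yields a different (but still valid, normalized) distribution than base-e unless the inputs are pre-scaled by log2(e).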