
Designing Spatial Architectures for Sparse Attention: STAR Accelerator via Cross-Stage Tiling

Dec 23, 2025 · 8:18
cs.AR, eess.SP

Abstract

Large language models (LLMs) rely on self-attention for contextual understanding, demanding high-throughput inference and large-scale token parallelism (LTPP). Existing dynamic sparsity accelerators falter under LTPP scenarios due to stage-isolated optimizations. Revisiting the end-to-end sparsity acceleration flow, we identify an overlooked opportunity: cross-stage coordination can substantially reduce redundant computation and memory access. We propose STAR, a cross-stage compute- and memory-efficient algorithm-hardware co-design tailored for Transformer inference under LTPP. STAR introduces a leading-zero-based sparsity prediction using log-domain add-only operations to minimize prediction overhead. It further employs distributed sorting and a sorted updating FlashAttention mechanism, guided by a coordinated tiling strategy that enables fine-grained stage interaction for improved memory efficiency and latency. These optimizations are supported by a dedicated STAR accelerator architecture, achieving up to 9.2$\times$ speedup and 71.2$\times$ energy efficiency over A100, and surpassing SOTA accelerators by up to 16.1$\times$ energy and 27.1$\times$ area efficiency gains. Further, we deploy STAR onto a multi-core spatial architecture, optimizing dataflow and execution orchestration for ultra-long sequence processing. Architectural evaluation shows that, compared to the baseline design, Spatial-STAR achieves a 20.1$\times$ throughput improvement.
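The "leading-zero-based sparsity prediction using log-domain add-only operations" mentioned above replaces the multiplications of an approximate score estimate with additions of integer log-magnitudes obtained from leading-zero counts. The NumPy sketch below illustrates that general idea only; the fixed-point width, sign handling, and all names are assumptions for illustration, not the paper's DLZS implementation.

```python
import numpy as np

WIDTH = 16      # assumed fixed-point word width
FRAC_BITS = 8   # assumed fractional bits

def log2_estimate(x):
    """Integer estimate of log2|x|, read off the position of the leading
    one bit in a fixed-point encoding (equivalently, WIDTH minus the
    leading-zero count a hardware detector would return)."""
    fixed = min(int(abs(x) * (1 << FRAC_BITS)), (1 << WIDTH) - 1)
    if fixed == 0:
        return -(WIDTH + FRAC_BITS)          # acts as "minus infinity"
    return (fixed.bit_length() - 1) - FRAC_BITS

def predicted_scores(q, K):
    """Add-only estimate of q . k_j for every key k_j: each product
    |q_d * k_jd| is approximated by 2**(log2|q_d| + log2|k_jd|), so the
    inner loop needs only additions plus one leading-zero detect per
    operand (the final power of two would be a shift in hardware)."""
    lq = np.array([log2_estimate(v) for v in q])
    sq = np.sign(q)
    scores = np.empty(len(K))
    for j, k in enumerate(K):
        lk = np.array([log2_estimate(v) for v in k])
        scores[j] = np.sum(sq * np.sign(k) * np.exp2(lq + lk))
    return scores
```

A key would then be kept or dropped by comparing this cheap estimate against a threshold or top-k cutoff before any exact attention computation is spent on it.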

Summary

This paper introduces STAR, a novel algorithm-hardware co-design for accelerating Transformer inference under large-scale token parallelism (LTPP). STAR achieves significant energy efficiency and throughput improvements by coordinating the pre-compute, top-k sort, and formal compute stages of the acceleration pipeline, minimizing redundant computation and memory access.

Key Insights

  • STAR introduces a cross-stage optimization approach for dynamic sparsity, coordinating the pre-compute, top-k sort, and formal compute stages to reduce redundant computation, a departure from stage-isolated optimizations.
  • The "differential leading zero scheme" (DLZS) efficiently predicts sparsity by replacing computationally expensive multiplications with addition operations and leading-zero detection in the log-domain.
  • The "sphere-search-aided distributed sorting" (SADS) strategy reduces the complexity of top-k selection by dividing the sequence into sub-segments, enabling tiled execution and minimizing memory access.
  • The "sorted updating FlashAttention" (SU-FA) mechanism leverages the sorting information from the SADS stage to simplify the FlashAttention process, reducing computational overhead by avoiding repeated comparisons and exponentiations.

Practical Implications

  • STAR's significant performance gains suggest it could substantially reduce the cost and energy consumption of deploying large language models (LLMs), making them more accessible and sustainable.
  • Future research should focus on detailed hardware architecture analysis, hyperparameter optimization (e.g., tile size), and exploration of adaptive tiling strategies to further enhance STAR's performance and adaptability.
  • The cross-stage optimization principle of STAR opens new avenues for algorithm-hardware co-design, potentially leading to more energy-efficient and cost-effective solutions for other computationally intensive tasks beyond Transformer inference.
  • Exploring reinforcement learning to optimize the cross-stage coordination strategy could lead to further performance improvements and more robust adaptation to different hardware platforms and model architectures.

Links & Resources

Authors

Huizheng Wang, Taiquan Wei, Hongbin Wang, Zichuan Wang, Xinru Tang, Zhiheng Yue, Shaojun Wei, Yang Hu, and Shouyi Yin

Cite This Paper

Year: 2025
Category: cs.AR
APA

Wang, H., Wei, T., Wang, H., Wang, Z., Tang, X., Yue, Z., Wei, S., Hu, Y., & Yin, S. (2025). Designing Spatial Architectures for Sparse Attention: STAR Accelerator via Cross-Stage Tiling. arXiv preprint arXiv:2512.20198.

MLA

Huizheng Wang, Taiquan Wei, Hongbin Wang, Zichuan Wang, Xinru Tang, Zhiheng Yue, Shaojun Wei, Yang Hu, and Shouyi Yin. "Designing Spatial Architectures for Sparse Attention: STAR Accelerator via Cross-Stage Tiling." arXiv preprint arXiv:2512.20198 (2025).