
Designing Spatial Architectures for Sparse Attention: STAR Accelerator via Cross-Stage Tiling

Dec 23, 2025 · 8:18
cs.AR, eess.SP

Abstract

Large language models (LLMs) rely on self-attention for contextual understanding, demanding high-throughput inference and large-scale token parallelism (LTPP). Existing dynamic sparsity accelerators falter under LTPP scenarios due to stage-isolated optimizations. Revisiting the end-to-end sparsity acceleration flow, we identify an overlooked opportunity: cross-stage coordination can substantially reduce redundant computation and memory access. We propose STAR, a cross-stage compute- and memory-efficient algorithm-hardware co-design tailored for Transformer inference under LTPP. STAR introduces a leading-zero-based sparsity prediction using log-domain add-only operations to minimize prediction overhead. It further employs distributed sorting and a sorted updating FlashAttention mechanism, guided by a coordinated tiling strategy that enables fine-grained stage interaction for improved memory efficiency and latency. These optimizations are supported by a dedicated STAR accelerator architecture, achieving up to 9.2$\times$ speedup and 71.2$\times$ energy efficiency over A100, and surpassing SOTA accelerators by up to 16.1$\times$ energy and 27.1$\times$ area efficiency gains. Further, we deploy STAR onto a multi-core spatial architecture, optimizing dataflow and execution orchestration for ultra-long sequence processing. Architectural evaluation shows that, compared to the baseline design, Spatial-STAR achieves a 20.1$\times$ throughput improvement.
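The "leading-zero-based sparsity prediction using log-domain add-only operations" mentioned above replaces the multiplications of an approximate score estimate with additions of integer log-magnitudes obtained from leading-zero counts. The NumPy sketch below illustrates that general idea only; the fixed-point width, sign handling, and all names are assumptions for illustration, not the paper's DLZS implementation.

```python
import numpy as np

WIDTH = 16      # assumed fixed-point word width
FRAC_BITS = 8   # assumed fractional bits

def log2_estimate(x):
    """Integer estimate of log2|x|, read off the position of the leading
    one bit in a fixed-point encoding (equivalently, WIDTH minus the
    leading-zero count a hardware detector would return)."""
    fixed = min(int(abs(x) * (1 << FRAC_BITS)), (1 << WIDTH) - 1)
    if fixed == 0:
        return -(WIDTH + FRAC_BITS)          # acts as "minus infinity"
    return (fixed.bit_length() - 1) - FRAC_BITS

def predicted_scores(q, K):
    """Add-only estimate of q . k_j for every key k_j: each product
    |q_d * k_jd| is approximated by 2**(log2|q_d| + log2|k_jd|), so the
    inner loop needs only additions plus one leading-zero detect per
    operand (the final power of two would be a shift in hardware)."""
    lq = np.array([log2_estimate(v) for v in q])
    sq = np.sign(q)
    scores = np.empty(len(K))
    for j, k in enumerate(K):
        lk = np.array([log2_estimate(v) for v in k])
        scores[j] = np.sum(sq * np.sign(k) * np.exp2(lq + lk))
    return scores
```

A key would then be kept or dropped by comparing this cheap estimate against a threshold or top-k cutoff before any exact attention computation is spent on it.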

Summary

This paper introduces STAR, a novel algorithm-hardware co-design for accelerating Transformer inference under large-scale token parallelism (LTPP). STAR achieves significant energy efficiency and throughput improvements by coordinating the pre-compute, top-k sort, and formal compute stages of the acceleration pipeline, minimizing redundant computation and memory access.

Key Insights

  • STAR introduces a cross-stage optimization approach for dynamic sparsity, coordinating the pre-compute, top-k sort, and formal compute stages to reduce redundant computation, a departure from stage-isolated optimizations.
  • The "differential leading zero scheme" (DLZS) efficiently predicts sparsity by replacing computationally expensive multiplications with addition operations and leading-zero detection in the log-domain.
  • The "sphere-search-aided distributed sorting" (SADS) strategy reduces the complexity of top-k selection by dividing the sequence into sub-segments, enabling tiled execution and minimizing memory access.
  • The "sorted updating FlashAttention" (SU-FA) mechanism leverages the sorting information from the SADS stage to simplify the FlashAttention process, reducing computational overhead by avoiding repeated comparisons and exponentiations.

Practical Implications

  • STAR's significant performance gains suggest it could substantially reduce the cost and energy consumption of deploying large language models (LLMs), making them more accessible and sustainable.
  • Future research should focus on detailed hardware architecture analysis, hyperparameter optimization (e.g., tile size), and exploration of adaptive tiling strategies to further enhance STAR's performance and adaptability.
  • The cross-stage optimization principle of STAR opens new avenues for algorithm-hardware co-design, potentially leading to more energy-efficient and cost-effective solutions for other computationally intensive tasks beyond Transformer inference.
  • Exploring reinforcement learning to optimize the cross-stage coordination strategy could lead to further performance improvements and more robust adaptation to different hardware platforms and model architectures.

Links & Resources

Authors

Huizheng Wang, Taiquan Wei, Hongbin Wang, Zichuan Wang, Xinru Tang, Zhiheng Yue, Shaojun Wei, Yang Hu, and Shouyi Yin

Cite This Paper

Year: 2025
Category: cs.AR
APA

Wang, H., Wei, T., Wang, H., Wang, Z., Tang, X., Yue, Z., Wei, S., Hu, Y., & Yin, S. (2025). Designing Spatial Architectures for Sparse Attention: STAR Accelerator via Cross-Stage Tiling. arXiv preprint arXiv:2512.20198.

MLA

Huizheng Wang, Taiquan Wei, Hongbin Wang, Zichuan Wang, Xinru Tang, Zhiheng Yue, Shaojun Wei, Yang Hu, and Shouyi Yin. "Designing Spatial Architectures for Sparse Attention: STAR Accelerator via Cross-Stage Tiling." arXiv preprint arXiv:2512.20198 (2025).