Conference paper
Auto-tuning Dense Vector and Matrix-vector Operations for Fermi GPUs
In this paper, we consider the automatic performance tuning of dense vector and matrix-vector operations on GPUs. Such operations form the backbone of Level 1 and Level 2 routines in the Basic Linear Algebra Subprograms (BLAS) library and are therefore of great importance in many scientific applications.
As examples, we develop single-precision CUDA kernels for the Euclidean norm (SNRM2) and matrix-vector multiplication (SGEMV). The target hardware is the most recent Nvidia Tesla 20-series (Fermi architecture). We show that auto-tuning can be applied successfully to dense vector and matrix-vector operations, achieving high performance by appropriately exploiting the fine-grained parallelism of the GPU.
Our tuned kernels deliver between 25% and 100% better performance than the current CUBLAS 3.2 library.
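To illustrate the kind of fine-grained decomposition such tuning targets, here is a minimal CUDA sketch of an SNRM2-style partial reduction. The kernel name `snrm2_partial`, the `BLOCK`/`GRID` launch parameters, and the grid-stride-plus-shared-memory strategy are assumptions for this example, not the authors' kernels or their tuning parameters.

```cuda
// Minimal sketch (not the paper's tuned code) of a parameterized SNRM2-style
// reduction kernel: each block accumulates a grid-strided slice of x into a
// partial sum of squares. BLOCK and GRID stand in for the launch parameters
// an auto-tuner would sweep.
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK 256  // candidate thread-block size (assumed power of two)
#define GRID   64  // candidate grid size

__global__ void snrm2_partial(const float *x, int n, float *block_sums) {
    __shared__ float cache[BLOCK];
    float sum = 0.0f;
    // Grid-stride loop: coalesced, fine-grained accumulation per thread.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        sum += x[i] * x[i];
    cache[threadIdx.x] = sum;
    __syncthreads();
    // Shared-memory tree reduction to one partial sum per block.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) block_sums[blockIdx.x] = cache[0];
}

int main() {
    const int n = 1 << 20;
    std::vector<float> hx(n, 1.0f);            // ||x||_2 = sqrt(n) = 1024
    float *dx, *dpart;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dpart, GRID * sizeof(float));
    cudaMemcpy(dx, hx.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    snrm2_partial<<<GRID, BLOCK>>>(dx, n, dpart);
    float hp[GRID];
    cudaMemcpy(hp, dpart, GRID * sizeof(float), cudaMemcpyDeviceToHost);
    float s = 0.0f;
    for (int i = 0; i < GRID; ++i) s += hp[i]; // final combine on the host
    printf("snrm2 = %.1f (expected %.1f)\n", sqrtf(s), sqrtf((float)n));
    cudaFree(dx); cudaFree(dpart);
    return 0;
}
```

An auto-tuner in the spirit of the paper would benchmark such a kernel over a sweep of `BLOCK` and `GRID` (and other code variants) and keep the fastest configuration for the target Fermi GPU.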
Language: | English |
---|---|
Publisher: | Springer |
Year: | 2012 |
Pages: | 619-629 |
Proceedings: | Parallel Processing and Applied Mathematics. 9th International Conference, PPAM 2011 |
Series: | Lecture Notes in Computer Science |
Journal subtitle: | 9th International Conference, PPAM 2011 |
ISBN: | 3642314635, 3642314643, 9783642314636, 9783642314643 |
ISSN: | 0302-9743 |
Types: | Conference paper |
DOI: | 10.1007/978-3-642-31464-3_63 |