//Author: Marcos Fernandez Lopez
//CESGA (Galician Supercomputing Centre)
Library for sparse matrix-vector multiplication (SpMV) with OpenCL kernels.
This library includes the following kernels and formats:
- CSR : Compressed Sparse Row. One wavefront per row. The number of work items that compute each row can be modified: a complete wavefront on AMD GPUs (64 work items) is usually not the best option, and 16 or 32 typically works better.
- COO : Coordinate. Each wavefront can compute several rows.
- ELL : ELLPACK. 1 work item per matrix row.
- HYB : hybrid ELL/COO combination. There's no specific kernel for it, just two consecutive calls to COO/ELL kernels.
- HYB-CSR : hybrid ELL/CSR combination. There's no specific kernel for it, just two consecutive calls to CSR/ELL kernels.
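To make the CSR and ELL layouts listed above concrete, here is a minimal CPU reference sketch. It is not the library's code: the struct and function names (CsrMatrix, EllMatrix, spmv_csr, csr_to_ell, spmv_ell) are hypothetical, and the column-major ELL storage with zero padding is an assumption taken from the usual description of ELLPACK in the papers cited below.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side CSR container (illustrative names only).
struct CsrMatrix {
    std::size_t rows;
    std::vector<std::size_t> row_ptr;  // size rows + 1
    std::vector<std::size_t> col_idx;  // size nnz
    std::vector<double> values;        // size nnz
};

// Hypothetical ELL container: every row is padded to `width` entries.
// Storage is column-major (slot * rows + row), an assumed layout.
struct EllMatrix {
    std::size_t rows;
    std::size_t width;
    std::vector<std::size_t> col_idx;  // padding slots point at column 0
    std::vector<double> values;        // padding slots hold 0.0
};

// y = A * x for a CSR matrix: one pass over each row's nonzeros.
std::vector<double> spmv_csr(const CsrMatrix& a, const std::vector<double>& x) {
    assert(a.row_ptr.size() == a.rows + 1);
    std::vector<double> y(a.rows, 0.0);
    for (std::size_t r = 0; r < a.rows; ++r) {
        double sum = 0.0;
        for (std::size_t k = a.row_ptr[r]; k < a.row_ptr[r + 1]; ++k) {
            sum += a.values[k] * x[a.col_idx[k]];
        }
        y[r] = sum;
    }
    return y;
}

// Pad every row to the length of the longest row.
EllMatrix csr_to_ell(const CsrMatrix& a) {
    std::size_t width = 0;
    for (std::size_t r = 0; r < a.rows; ++r) {
        width = std::max(width, a.row_ptr[r + 1] - a.row_ptr[r]);
    }
    EllMatrix e{a.rows, width,
                std::vector<std::size_t>(a.rows * width, 0),
                std::vector<double>(a.rows * width, 0.0)};
    for (std::size_t r = 0; r < a.rows; ++r) {
        for (std::size_t k = a.row_ptr[r]; k < a.row_ptr[r + 1]; ++k) {
            const std::size_t slot = k - a.row_ptr[r];
            e.col_idx[slot * a.rows + r] = a.col_idx[k];
            e.values[slot * a.rows + r] = a.values[k];
        }
    }
    return e;
}

// y = A * x for an ELL matrix: every row does exactly `width` multiply-adds.
std::vector<double> spmv_ell(const EllMatrix& e, const std::vector<double>& x) {
    assert(!x.empty());  // padding slots read x[0]
    std::vector<double> y(e.rows, 0.0);
    for (std::size_t r = 0; r < e.rows; ++r) {
        double sum = 0.0;
        for (std::size_t s = 0; s < e.width; ++s) {
            sum += e.values[s * e.rows + r] * x[e.col_idx[s * e.rows + r]];
        }
        y[r] = sum;
    }
    return y;
}
```

Both routines produce the same y. The ELL version does padded work on short rows, which is why the hybrid formats keep only the regular part of the matrix in ELL.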
For further details refer to the following papers:
"Efficient Sparse Matrix-Vector Multiplication on CUDA"
Nathan Bell and Michael Garland, "NVIDIA Technical Report NVR-2008-004", December 2008
"Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors"
Nathan Bell and Michael Garland, in "Proc. Supercomputing '09", November 2009
"Optimization of sparse matrix–vector multiplication using reordering techniques on GPUs"
J. C. Pichel, F. F. Rivera, M. Fernández and A. Rodríguez. Microprocessors and Microsystems 36, pages 65–77, 2012
The ELL format may exceed the available GPU memory. By default, if the ELL representation would be larger than 256 MB, execution
is aborted with an appropriate message. This limit can be modified in "definitions.h". In any case, a matrix that results in
such a large ELL representation typically won't give good performance results for that kernel.
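As a rough illustration of why ELL can grow so large, the sketch below estimates the ELL footprint from the row lengths. It is an assumption, not the library's actual check: it supposes each ELL slot stores one value plus one column index, and it takes 256 MB to mean 256 * 1024 * 1024 bytes. The names ell_bytes, ell_fits and kAssumedLimitBytes are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed reading of the 256 MB default (binary megabytes).
const std::uint64_t kAssumedLimitBytes = 256ull * 1024ull * 1024ull;

// Estimated ELL size: rows * (longest row) * (bytes per value + bytes per index).
std::uint64_t ell_bytes(const std::vector<std::size_t>& row_lengths,
                        std::size_t value_bytes, std::size_t index_bytes) {
    std::size_t width = 0;
    for (std::size_t len : row_lengths) {
        width = std::max(width, len);
    }
    return static_cast<std::uint64_t>(row_lengths.size()) *
           static_cast<std::uint64_t>(width) *
           static_cast<std::uint64_t>(value_bytes + index_bytes);
}

// True when the estimated ELL footprint stays within the given limit.
bool ell_fits(const std::vector<std::size_t>& row_lengths,
              std::size_t value_bytes, std::size_t index_bytes,
              std::uint64_t limit_bytes) {
    assert(limit_bytes > 0);
    return ell_bytes(row_lengths, value_bytes, index_bytes) <= limit_bytes;
}
```

Under these assumptions, a matrix with 1,000,000 rows of one nonzero each plus a single row of 100 nonzeros needs about 800 MB in ELL with 4-byte values and indices, although its nonzeros alone occupy about 8 MB. One long row is enough to exceed the limit.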
Images could be used to improve accesses to the x vector. This approach has been tried, with no visible improvement
on ATI cards and only small improvements on NVIDIA cards. This library does not use images.
Build Instructions
------------------
The first thing to do is to change the absolute paths in the first two lines of
each kernel to the appropriate paths in your system.
For NVIDIA GPUs:
g++ -O3 -I/usr/local/cuda/include/ -o matvecmul matvecmul.cpp -lOpenCL
Modify the include path (-I) so that it points to the OpenCL headers on your system.
This code has been tested under CUDA 4.2 on 64-bit Linux, NVIDIA C1060 GPU.
For AMD GPUs:
g++ -O3 -L/opt/AMD-APP-SDK-v2.4-lnx64/lib/x86_64/ -I/opt/AMD-APP-SDK-v2.4-lnx64/include/ -o matvecmul matvecmul.cpp -lOpenCL
Modify the include (-I) and library (-L) paths so that they point to the OpenCL headers and libOpenCL.so on your system.
This code has been tested under AMD SDK 2.4 on 64-bit Linux, AMD FirePro 7800
Program Usage
-------------
For local execution:
$ ./matvecmul matrix.mm
where matrix.mm is the file name of a sparse matrix in MatrixMarket format.
For remote execution in SVG (CESGA):
$ qsub -l num_proc=1,s_rt=00:10:00,s_vmem=2G,h_fsize=1G,arch=amd,gpu=1 ./Launch_spmv.sh matrix.mm
where matrix.mm is the file name of a sparse matrix in MatrixMarket format,
and Launch_spmv.sh contains the following commands:
module load amdAppSdk
make clean
make
./matvecmul /path-to-matrix.mm/$1
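For readers unfamiliar with the MatrixMarket input mentioned above, the following is a minimal sketch of a reader for the "coordinate real general" variant only. It is not the parser used by matvecmul. The names CooEntry, CooMatrix and read_matrix_market are hypothetical, and symmetric, pattern and complex files are deliberately not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical COO containers (illustrative names only).
struct CooEntry {
    std::size_t row;  // 0-based
    std::size_t col;  // 0-based
    double value;
};

struct CooMatrix {
    std::size_t rows = 0;
    std::size_t cols = 0;
    std::vector<CooEntry> entries;
};

// Reads a "%%MatrixMarket matrix coordinate real general" stream.
CooMatrix read_matrix_market(std::istream& in) {
    std::string line;
    if (!std::getline(in, line) || line.rfind("%%MatrixMarket", 0) != 0) {
        throw std::runtime_error("missing MatrixMarket banner");
    }
    if (line.find("coordinate") == std::string::npos ||
        line.find("real") == std::string::npos ||
        line.find("general") == std::string::npos) {
        throw std::runtime_error("only 'coordinate real general' is handled");
    }

    // Skip comment lines (starting with '%') and blank lines up to the size line.
    bool have_size_line = false;
    while (std::getline(in, line)) {
        if (!line.empty() && line[0] != '%') {
            have_size_line = true;
            break;
        }
    }
    if (!have_size_line) {
        throw std::runtime_error("missing size line");
    }

    CooMatrix m;
    std::size_t nnz = 0;
    std::istringstream size_line(line);
    if (!(size_line >> m.rows >> m.cols >> nnz)) {
        throw std::runtime_error("malformed size line");
    }

    // Entries are "row col value" with 1-based indices; convert to 0-based.
    m.entries.reserve(nnz);
    for (std::size_t k = 0; k < nnz; ++k) {
        std::size_t i = 0, j = 0;
        double v = 0.0;
        if (!(in >> i >> j >> v) || i < 1 || j < 1 || i > m.rows || j > m.cols) {
            throw std::runtime_error("malformed matrix entry");
        }
        m.entries.push_back({i - 1, j - 1, v});
    }
    assert(m.entries.size() == nnz);
    return m;
}
```

The result is already in coordinate (COO) form, so it maps directly onto the COO format listed earlier. The other formats would need a conversion step on the host.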
Parameters that can be changed
------------------------------
They are in the file "definitions.h"
WAVEFRONT_SIZE * : Defines how many threads compute a row (CSR kernel) or a group of elements (COO kernel). It is
important to note that it does not change the actual wavefront size, which is hardware specific.
WORKGROUP_SIZE ** : Sets the size of the workgroup. Possible values are 64, 128 and 256.
GLOBAL_SIZE : Sets the number of work-items that will be launched. For the ELL kernel it only defines a minimum value.
MAX_GPU_MEMORY : Sets the maximum size for the ELL representation of the matrix.
* If WAVEFRONT_SIZE is changed, the reduction sums in the COO and CSR kernels must be changed accordingly (the changes needed are indicated in the kernel code).
** If WORKGROUP_SIZE is changed, the wavefront and workgroup reductions in the COO kernel must be changed accordingly.
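One plausible reason for the notes above, assuming the kernels use an unrolled tree reduction over the partial sums of the work items that share a row (as in the Bell and Garland kernels), is that the number of reduction steps depends on the group size. The sketch below emulates such a reduction on the CPU. The names tree_reduce and reduction_steps are hypothetical and the code is not taken from the kernels.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Emulates a tree reduction over `partial`, one entry per work item.
// The size must be a power of two (e.g. 16, 32 or 64).
double tree_reduce(std::vector<double> partial) {
    const std::size_t n = partial.size();
    assert(n > 0 && (n & (n - 1)) == 0);
    // Each pass halves the number of active lanes: n/2, n/4, ..., 1.
    for (std::size_t offset = n / 2; offset > 0; offset /= 2) {
        for (std::size_t lane = 0; lane < offset; ++lane) {
            partial[lane] += partial[lane + offset];
        }
    }
    return partial[0];
}

// Number of halving passes needed for a group of n work items (log2 of n).
std::size_t reduction_steps(std::size_t n) {
    assert(n > 0 && (n & (n - 1)) == 0);
    std::size_t steps = 0;
    for (std::size_t offset = n / 2; offset > 0; offset /= 2) {
        ++steps;
    }
    return steps;
}
```

With 16 work items per row the reduction takes 4 passes, with 32 it takes 5, and with 64 it takes 6. If the passes are written out by hand in a kernel, one has to be added or removed whenever the group size changes.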
Examples and guidelines
-----------------------
A file "matrix.mm" is distributed with this package as an example matrix to test the library. It corresponds to
the matrix lhr10 of the University of Florida Sparse Matrix Collection.
A spreadsheet (and corresponding pdf) is included with performance measurements for different matrices and systems.
It can be used as a guide for choosing a kernel, based on how closely the sparsity pattern of the target matrix
resembles that of the matrices already tested.