4J-4
LU factorization on Cypress GPU
○酒井智哉, 松本和也, 中里直人, Stanislav Sedukhin (University of Aizu)
LU factorization is an important part of many practical problems
that rely on solving systems of linear equations.
We present performance results for LU factorization on the Cypress
GPU architecture. A Cypress GPU can execute 320 fused
multiply-add (FMA) operations per cycle in double-precision
floating point. The fastest Cypress GPU runs at 850 MHz,
i.e., its peak performance is 544 Gflop/s (one FMA
operation counts as 2 flops). Most computations for LU factorization
depend on General Matrix Multiply (GEMM). The performance of
our GEMM implementation on the Cypress GPU is close to 80% of
the peak. Our current implementation of LU factorization achieves
379 Gflop/s (69% of the peak), which is the fastest among
existing single-chip GPU implementations.
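
The peak-rate arithmetic and the dependence of LU factorization on GEMM can be illustrated with a small CPU-side sketch in plain Python. This is not the GPU implementation described above. It assumes a right-looking blocked LU without pivoting (a simplification chosen for brevity), and the names peak_gflops and blocked_lu are hypothetical helpers introduced only for this illustration.

```python
# Illustrative sketch only; NOT the authors' GPU code.
# Assumptions: right-looking blocked LU, no pivoting, dense list-of-lists matrix.

def peak_gflops(fma_per_cycle, clock_mhz, flops_per_fma=2):
    """Peak rate in Gflop/s from FMA throughput per cycle and clock in MHz."""
    return fma_per_cycle * flops_per_fma * clock_mhz / 1000.0


def blocked_lu(A, nb):
    """Factor A in place into unit-lower L and upper U (no pivoting).

    Returns (gemm_flops, other_flops) so the share of work done by the
    GEMM-style trailing update can be inspected.
    """
    n = len(A)
    gemm_flops = 0
    other_flops = 0
    for k0 in range(0, n, nb):
        k1 = min(k0 + nb, n)
        # 1) Panel factorization: unblocked LU on columns k0..k1-1.
        for k in range(k0, k1):
            for i in range(k + 1, n):
                A[i][k] /= A[k][k]
                other_flops += 1
                for j in range(k + 1, k1):
                    A[i][j] -= A[i][k] * A[k][j]
                    other_flops += 2
        # 2) Triangular solve: U12 = inv(L11) * A12 by forward substitution.
        for k in range(k0, k1):
            for i in range(k + 1, k1):
                for j in range(k1, n):
                    A[i][j] -= A[i][k] * A[k][j]
                    other_flops += 2
        # 3) GEMM-style trailing update: A22 = A22 - L21 * U12.
        for i in range(k1, n):
            for j in range(k1, n):
                s = 0.0
                for k in range(k0, k1):
                    s += A[i][k] * A[k][j]
                A[i][j] -= s
                gemm_flops += 2 * (k1 - k0)
    return gemm_flops, other_flops


if __name__ == "__main__":
    peak = peak_gflops(320, 850)  # 320 FMA/cycle at 850 MHz
    print("peak Gflop/s:", peak)
    print("LU fraction of peak:", 379 / peak)

    n, nb = 64, 8
    # Diagonally dominant test matrix, so skipping pivoting is safe here.
    A = [[float(n) if i == j else 1.0 / (1 + i + j) for j in range(n)]
         for i in range(n)]
    gemm, other = blocked_lu(A, nb)
    print("GEMM share of flops:", gemm / (gemm + other))
```

In this sketch, step 3 is the only part that maps onto GEMM. Its share of the total flop count grows as the matrix size increases relative to the block size, which is why the speed of the GEMM kernel largely determines the speed of the whole factorization.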