CUBLAS

Updated: 2024-01-30 09:06:01


The library is self-contained at the API level; that is, no direct interaction with the CUDA driver is necessary. The interface to the CUBLAS library is the header file cublas.h.

The type cublasStatus is used for function status returns.

cublasStatus cublasInit (void)

initializes the CUBLAS library and must be called before any other CUBLAS API function is invoked. It allocates hardware resources necessary for accessing the GPU.

cublasStatus cublasShutdown (void)

releases CPU-side resources used by the CUBLAS library.

cublasStatus cublasGetError (void)

returns the last error that occurred on invocation of any of the CUBLAS core functions.

cublasStatus cublasAlloc (int n, int elemSize, void **devicePtr)

creates an object in GPU memory space capable of holding an array of n elements, where each element requires elemSize bytes of storage. If the function call is successful, a pointer to the object in GPU memory space is placed in devicePtr. Note that this is a device pointer that cannot be dereferenced in host code. Function cublasAlloc() is a wrapper around cudaMalloc(). Device pointers returned by cublasAlloc() can therefore be passed to any CUDA device kernels, not just CUBLAS functions.

Note: memory allocated through CUBLAS is equivalent to memory allocated with cudaMalloc.

cublasStatus cublasFree (const void *devicePtr)

destroys the object in GPU memory space referenced by devicePtr. In short: it frees device memory.

cublasStatus cublasSetVector (int n, int elemSize, const void *x, int incx, void *y, int incy)

copies n elements from a vector x in CPU memory space to a vector y in GPU memory space. Elements in both vectors are assumed to have a size of elemSize bytes. Storage spacing between consecutive elements is incx for the source vector x and incy for the destination vector y. In general, y points to an object, or part of an object, allocated via cublasAlloc().

In short: copies vector data from the host to the device.

cublasStatus cublasGetVector (int n, int elemSize, const void *x, int incx, void *y, int incy)

copies n elements from a vector x in GPU memory space to a vector y in CPU memory space. Elements in both vectors are assumed to have a size of elemSize bytes. Storage spacing between consecutive elements is incx for the source vector x and incy for the destination vector y.

In short: copies vector data from device memory to the host.

Question: what do the parameters incx and incy mean?

incx: storage spacing between elements of x. In the manual's 1-based notation, element i (i = 0 to n-1) is x[1 + i * incx]:

i = 0: stored at x[1]
i = 1: stored at x[1 + incx]
i = 2: stored at x[1 + 2 * incx]
i = 3: stored at x[1 + 3 * incx]
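The increment convention can be tried out on the host without a GPU. The following is a minimal sketch, not the CUBLAS implementation: copy_strided is a hypothetical helper name, it assumes positive increments, and it only mirrors the parameter list of cublasSetVector to show how incx and incy select elements.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical host-only helper: copy n elements of elemSize bytes from x
 * (stride incx) to y (stride incy). Assumes incx > 0 and incy > 0. */
static void copy_strided(int n, int elemSize,
                         const void *x, int incx, void *y, int incy)
{
    const unsigned char *src = (const unsigned char *)x;
    unsigned char *dst = (unsigned char *)y;

    for (int i = 0; i < n; ++i) {
        /* The manual's x[1 + i * incx] (1-based) is x[i * incx] in C. */
        memcpy(dst + (size_t)i * (size_t)incy * (size_t)elemSize,
               src + (size_t)i * (size_t)incx * (size_t)elemSize,
               (size_t)elemSize);
    }
}
```

For example, with x = {1, 2, 3, 4, 5, 6}, n = 3, incx = 2 and incy = 1, the destination receives every second element of x: 1, 3, 5.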

cublasStatus cublasSetMatrix (int rows, int cols, int elemSize, const void *A, int lda, void *B, int ldb)

copies a tile of rows×cols elements from a matrix A in CPU memory space to a matrix B in GPU memory space. Each element requires storage of elemSize bytes. In short: copies matrix data from the host to the device.

cublasStatus cublasGetMatrix (int rows, int cols, int elemSize, const void *A, int lda, void *B, int ldb)

copies a tile of rows×cols elements from a matrix A in GPU memory space to a matrix B in CPU memory space. Each element requires storage of elemSize bytes. In short: copies matrix data from the device to the host.

int cublasIsamax (int n, const float *x, int incx)

finds the smallest index of the maximum magnitude element of single-precision vector x; that is, the result is the first i, i = 0 to n-1, that maximizes abs(x[1 + i * incx]). In short: finds the index of the element with the largest absolute value.

int cublasIsamin (int n, const float *x, int incx)

finds the smallest index of the minimum magnitude element of single-precision vector x; that is, the result is the first i, i = 0 to n-1, that minimizes abs(x[1 + i * incx]). In short: finds the index of the element with the smallest absolute value.

float cublasSasum (int n, const float *x, int incx)

computes the sum of the absolute values of the elements of single-precision vector x; that is, the result is the sum from i = 0 to n-1 of abs(x[1 + i * incx]). In short: the sum of the absolute values of n elements.

Parameters:

n      number of elements in input vector
x      single-precision vector with n elements
incx   storage spacing between elements of x
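The two reductions above can be written as plain host-side C to make the formulas concrete. This is a sketch under stated assumptions, not the library code: ref_sasum and ref_isamax are hypothetical names, and the index returned by ref_isamax is 1-based on the assumption that it follows the Fortran BLAS convention (0 meaning invalid input).

```c
/* Absolute value helper, so the sketch needs no math library. */
static float abs_f(float v)
{
    return v < 0.0f ? -v : v;
}

/* Sum of |x[i * incx]| for i = 0..n-1 (0-based C indexing).
 * Returns 0 when n <= 0 or incx <= 0. */
static float ref_sasum(int n, const float *x, int incx)
{
    float sum = 0.0f;
    if (n <= 0 || incx <= 0) return 0.0f;
    for (int i = 0; i < n; ++i) sum += abs_f(x[i * incx]);
    return sum;
}

/* Position of the first element with the largest magnitude.
 * Assumption: the result is 1-based; 0 signals n <= 0 or incx <= 0. */
static int ref_isamax(int n, const float *x, int incx)
{
    int best = 0;
    if (n <= 0 || incx <= 0) return 0;
    for (int i = 1; i < n; ++i) {
        /* Strict comparison keeps the first of several equal maxima. */
        if (abs_f(x[i * incx]) > abs_f(x[best * incx])) best = i;
    }
    return best + 1;
}
```

For x = {1, -4, 2, 4} with n = 4 and incx = 1, ref_sasum gives 11 and ref_isamax gives 2, because -4 is the first element of magnitude 4.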

Vocabulary:

scalar: adj. graded, quantitative, scalar; n. a quantity, a scalar
precision: n. exactness, accuracy; adj. exact, accurate, meticulous

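The parameters have essentially the same meaning across the vector-type and matrix-type functions.

void cublasSaxpy (int n, float alpha, const float *x, int incx, float *y, int incy)

multiplies single-precision vector x by single-precision scalar alpha and adds the result to single-precision vector y; that is, it overwrites single-precision y with single-precision alpha * x + y. For i = 0 to n-1, it replaces y[ly + i * incy] with alpha * x[lx + i * incx] + y[ly + i * incy], where

lx = 1 if incx >= 0, else lx = 1 + (1 - n) * incx

and ly is defined in a similar way using incy.

When incx >= 0 the spacing is positive, and passing 1 is usually sufficient. When incx < 0 the spacing is negative and the vector is traversed backward: with incx = -1, lx = n, so the traversal starts at the last element.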
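The role of lx and ly is easiest to see in a host-side reference version. The sketch below is an illustration, not the CUBLAS implementation: ref_saxpy is a hypothetical name, and the start offsets kx and ky are the manual's lx and ly shifted to C's 0-based indexing.

```c
/* Host-only reference for y = alpha * x + y with BLAS-style increments. */
static void ref_saxpy(int n, float alpha, const float *x, int incx,
                      float *y, int incy)
{
    if (n <= 0) return;

    /* 0-based start offsets: 0 for a non-negative increment,
     * (1 - n) * inc for a negative one, so indices never go below 0. */
    int kx = (incx >= 0) ? 0 : (1 - n) * incx;
    int ky = (incy >= 0) ? 0 : (1 - n) * incy;

    for (int i = 0; i < n; ++i) {
        y[ky + i * incy] += alpha * x[kx + i * incx];
    }
}
```

With x = {1, 2, 3}, y = {10, 20, 30} and alpha = 2, unit increments give y = {12, 24, 36}. Calling it with incx = -1 instead reads x from its last element backward and gives y = {16, 24, 32}.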

void cublasScopy (int n, const float *x, int incx, float *y, int incy)

copies the single-precision vector x to the single-precision vector y. For i = 0 to n-1, it copies x[lx + i * incx] to y[ly + i * incy]. In short: a copy function.

float cublasSdot (int n, const float *x, int incx, const float *y, int incy)

computes the dot product of two single-precision vectors. It returns the dot product of the single-precision vectors x and y if successful, and 0.0f otherwise. It computes the sum for i = 0 to n-1 of x[lx + i * incx] * y[ly + i * incy]. In short: the dot product of two vectors.

void cublasSrot (int n, float *x, int incx, float *y, int incy, float sc, float ss)

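Thinking analytically makes the material more interesting:

Reading passively without thinking is the easiest way to waste time, because it is dull and makes it hard to stay focused.

Whether the material is MIT OCW or The Interpretation of Dreams, you only gain something from it by analyzing it attentively.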

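Analysis of the CUBLAS function types:

Needed: a matrix subtraction function, plus subtraction and multiplication functions for a matrix and a vector.

On textures: once a CUDA array or linear memory is bound to a texture, the data in it can be read directly, with no transformation applied.

CUBLAS functions used so far: cublasAlloc, cublasSetVector, cublasSgemm, cublasGetError, cublasGetVector, cublasFree, cublasInit, cublasShutdown

1. Memory allocation and data transfer functions: cublasAlloc, cublasSetVector, cublasGetVector, cublasFree

2. Functions that initialize and shut down CUBLAS: cublasInit, cublasShutdown

3. Computation functions: cublasSgemm, cublasSgemv, cublasSdot, cublasSaxpy, cublasSscal

Each is analyzed in turn below and tested in an example, covering the meaning of each parameter and typical usage.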

float cublasSdot (int n, const float *x, int incx, const float *y, int incy)

computes the dot product of two single-precision vectors. It returns the dot product of the single-precision vectors x and y if successful, and 0.0f otherwise. It computes the sum for i = 0 to n-1 of x[lx + i * incx] * y[ly + i * incy]; that is, the dot product of the two vectors x and y, where

lx = 1 if incx >= 0, else lx = 1 + (1 - n) * incx

and ly is defined in a similar way using incy.

Parameters:

n      number of elements in input vectors
x      single-precision vector with n elements
incx   storage spacing between elements of x
y      single-precision vector with n elements
incy   storage spacing between elements of y

incx and incy are usually set to 1 in calls, for example:

status = cublasAlloc(n, sizeof(float), (void **)&p);
status = cublasAlloc(n, sizeof(float), (void **)&Ap);

alpha = r2 / cublasSdot(n, p, 1, Ap, 1);

1. Do indices into memory allocated with cublasAlloc start from 1?

In y[ly + i * incy] with incy = 1, i = 0 refers to y[1] and i = 1 refers to y[2]. This is the manual's 1-based (Fortran-style) notation; in C code the same elements are y[0] and y[1] relative to the pointer passed in.

2. The 1-based notation therefore does not make memory allocated with cublasAlloc different from memory allocated with cudaMalloc; as the manual states, cublasAlloc() is a wrapper around cudaMalloc().

Return value:

returns single-precision dot product (returns zero if n <= 0)

Error Status

CUBLAS_STATUS_NOT_INITIALIZED    if CUBLAS library was not initialized
CUBLAS_STATUS_ALLOC_FAILED       if function could not allocate reduction buffer
CUBLAS_STATUS_EXECUTION_FAILED   if function failed to execute on GPU
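To check the dot-product formula by hand, a host-side reference can be compared against the GPU result. The following is only a sketch: ref_sdot is a hypothetical name, it runs on the CPU, and it reports no error status, unlike the CUBLAS function.

```c
/* Host-only reference for the strided dot product.
 * Returns 0 when n <= 0, matching the documented return value. */
static float ref_sdot(int n, const float *x, int incx,
                      const float *y, int incy)
{
    float sum = 0.0f;
    if (n <= 0) return 0.0f;

    /* 0-based equivalents of the manual's lx and ly. */
    int kx = (incx >= 0) ? 0 : (1 - n) * incx;
    int ky = (incy >= 0) ? 0 : (1 - n) * incy;

    for (int i = 0; i < n; ++i) {
        sum += x[kx + i * incx] * y[ky + i * incy];
    }
    return sum;
}
```

For x = {1, 2, 3} and y = {4, 5, 6} with unit increments the result is 32. A step length such as alpha = r2 / ref_sdot(n, p, 1, Ap, 1) can then be compared with the value obtained through cublasSdot.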

The basic model by which applications use the CUBLAS library is to create matrix and vector objects in GPU memory space, fill them with data, call a sequence of CUBLAS functions, and, finally, upload the results from GPU memory space back to the host. To accomplish this, CUBLAS provides helper functions for creating and destroying objects in GPU space, and for writing data to and retrieving data from these objects.

For maximum compatibility with existing Fortran environments, CUBLAS uses column-major storage and 1-based indexing. (Take careful note of this.)

Since C and C++ use row-major storage, applications cannot use the native array semantics for two-dimensional arrays. Instead, macros or inline functions should be defined to implement matrices on top of one-dimensional arrays. For Fortran code ported to C in mechanical fashion, one may choose to retain 1-based indexing to avoid the need to transform loops. In this case, the array index of a matrix element in row i and column j can be computed via the following macro:

#define IDX2F(i,j,ld) ((((j)-1)*(ld))+((i)-1))    (1-based indexing)

Here, ld refers to the leading dimension of the matrix as allocated (the extent of the dimension that is stored contiguously), which in the case of column-major storage is the number of rows. For natively written C and C++ code, one would most likely choose 0-based indexing, in which case the indexing macro becomes

#define IDX2C(i,j,ld) (((j)*(ld))+(i))    (0-based indexing)
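A small host-side example shows what the two macros do to the memory layout. The macros are the ones quoted above; fill_colmajor is a hypothetical helper, and the value 10 * i + j is chosen only so that each entry reveals its own row and column.

```c
#define IDX2F(i,j,ld) ((((j)-1)*(ld))+((i)-1))  /* 1-based row/column */
#define IDX2C(i,j,ld) (((j)*(ld))+(i))          /* 0-based row/column */

/* Fill a rows x cols column-major matrix so that the element in 0-based
 * row i and column j holds 10 * i + j. The leading dimension is rows. */
static void fill_colmajor(float *a, int rows, int cols)
{
    for (int j = 0; j < cols; ++j) {
        for (int i = 0; i < rows; ++i) {
            a[IDX2C(i, j, rows)] = (float)(10 * i + j);
        }
    }
}
```

For a 2×3 matrix the array ends up as {0, 10, 1, 11, 2, 12}: each column is contiguous in memory. The 1-based IDX2F(2, 3, 2) and the 0-based IDX2C(1, 2, 2) both evaluate to 5, the last element.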

Work through a simple example and see from the results what these macros are for.

cublasStatus cublasInit (void)

initializes the CUBLAS library and must be called before any other CUBLAS API function is invoked. It allocates hardware resources necessary for accessing the GPU.

In short: the initialization function.

cublasStatus cublasShutdown (void)

releases CPU-side resources used by the CUBLAS library. The release of GPU-side resources may be deferred until the application shuts down.

In short: releases resources.

The API functions fall into two groups:

1. CUBLAS helper functions
2. CUBLAS core functions

cublasStatus cublasGetError (void)

returns the last error that occurred on invocation of any of the CUBLAS core functions. Reading the error status via cublasGetError() resets the internal error state to CUBLAS_STATUS_SUCCESS.

CUBLAS functions can be called directly from a .cpp file.

/* Host reference: C = alpha * A * B + beta * C for n x n column-major matrices. */
void CcublasDlg::simple_sgemm(int n, float alpha, const float *A, const float *B,
                              float beta, float *C)
{
    int i, j, k;

    for (i = 0; i < n; ++i) {
        for (j = 0; j < n; ++j) {
            float prod = 0;

            /* Entry (i, j) of A * B: sum over k of A(i, k) * B(k, j). */
            for (k = 0; k < n; ++k) {
                prod += A[k * n + i] * B[j * n + k];
            }
            C[j * n + i] = alpha * prod + beta * C[j * n + i];
        }
    }
}

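Analysis: this computes alpha * A * B + beta * C. The inner loop sums over k: the sum of A(i, k) * B(k, j) is entry (i, j) of the product AB.

Element A(i, k) is stored at A[k * n + i]; the k * n term shows that the matrix is stored column by column. This layout is consistent with the column-major storage that CUBLAS uses; the indices here are 0-based, as in the IDX2C macro.

Note how the indices are made consistent, and distinguish between memory obtained with cublasAlloc and with cudaMalloc.

malloc = M + alloc: memory allocate. Only through practice does one come to appreciate what the computer is doing.

The main() function can also take two parameters: int main(int argc, char** argv)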

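Texture memory resides in global memory; it is a region of global memory designated to be read through a cache. It is typically declared as texture texRef and then bound to an array before use; see the programming guide for details. In terms of speed, the cache makes it faster than ordinary global memory access, although the comparison is less clear-cut against global memory that is accessed with proper alignment.

Textures generally suit fixed lookup-table structures and random reads.

Texture memory is part of global memory, but it has two levels of cache, used to accelerate and filter data accesses; it is read-only.

Comment: if the constant arrays are bound to textures, then w_d, noise_d, an_d, mn_d, cos_d, and sin_d can all be read through texture fetches.

As for CUBLAS: because the operation multiplies corresponding elements of two matrices, it cannot be converted into a matrix operation. The denoising, however, can be done with just two vectors.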
