CUBLAS
Updated: 2024-01-30 09:06:01
The library is self-contained at the API level; that is, no direct interaction with the CUDA driver is necessary. The interface to the CUBLAS library is the header file cublas.h.
The type cublasStatus is used for function status returns.
cublasStatus
cublasInit (void)
initializes the CUBLAS library and must be called before any other
CUBLAS API function is invoked. It allocates hardware resources
necessary for accessing the GPU.
cublasStatus
cublasShutdown (void)
releases CPU‐side resources used by the CUBLAS library.
cublasStatus
cublasGetError (void)
returns the last error that occurred on invocation of any of the
CUBLAS core functions.
cublasStatus
cublasAlloc (int n, int elemSize, void **devicePtr)
creates an object in GPU memory space capable of holding an array of n elements, where each element requires elemSize bytes of storage. If the function call is successful, a pointer to the object in GPU memory space is placed in devicePtr. Note that this is a device pointer that cannot be dereferenced in host code. Function cublasAlloc() is a wrapper around cudaMalloc(). Device pointers returned by cublasAlloc() can therefore be passed to any CUDA device kernels, not just CUBLAS functions.
Note: memory allocated with cublasAlloc is equivalent to memory allocated with cudaMalloc.
cublasStatus
cublasFree (const void *devicePtr)
destroys the object in GPU memory space referenced by devicePtr.
Note: this frees the GPU memory.
cublasStatus
cublasSetVector (int n, int elemSize, const void *x, int incx, void *y, int incy)
copies n elements from a vector x in CPU memory space to a vector y
in GPU memory space. Elements in both vectors are assumed to have a
size of elemSize bytes. Storage spacing between consecutive elements
is incx for the source vector x and incy for the destination vector y. In general, y points to an object, or part of an object, allocated via
cublasAlloc().
Note: host-to-device copy of vector data.
cublasStatus
cublasGetVector (int n, int elemSize, const void *x, int incx, void *y, int incy)
copies n elements from a vector x in GPU memory space to a vector y
in CPU memory space. Elements in both vectors are assumed to have a
size of elemSize bytes. Storage spacing between consecutive elements
is incx for the source vector x and incy for the destination vector y.
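Note: device-to-host copy of vector data.
Question: what do the parameters incx and incy mean?
incx: storage spacing between elements of x. The elements accessed are x[1 + i * incx] for i = 0 to n-1, written in the manual's 1-based notation:
i = 0: x[1]
i = 1: x[1 + incx]
i = 2: x[1 + 2 * incx]
i = 3: x[1 + 3 * incx]
incy plays the same role for y. In C, where x points to the first element, element i is x[i * incx].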
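The meaning of incx and incy can be illustrated without a GPU. The following is a minimal host-only sketch, not the library's implementation: strided_copy is a hypothetical helper that mimics the element selection described above using 0-based C indices and positive strides.

```c
#include <assert.h>

/* Hypothetical host-only helper: copy n floats from x to y,
   reading every incx-th element of x and writing every incy-th element of y.
   Indices are 0-based; only positive strides are handled in this sketch. */
static void strided_copy(int n, const float *x, int incx, float *y, int incy)
{
    assert(incx > 0 && incy > 0);
    for (int i = 0; i < n; ++i) {
        y[i * incy] = x[i * incx];
    }
}
```

With incx = 2 and incy = 1, the call picks x[0], x[2], x[4] and packs them into y[0], y[1], y[2].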
cublasStatus
cublasSetMatrix (int rows, int cols, int elemSize, const void *A, int lda, void *B, int ldb)
copies a tile of rows×cols elements from a matrix A in CPU memory space to a matrix B in GPU memory space. Each element requires storage of elemSize bytes.
Note: host-to-device copy of matrix data.
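The role of lda and ldb (the leading dimensions of A and B) is easier to see in a host-only sketch. The helper copy_tile below is hypothetical and works entirely in CPU memory; it only illustrates how a rows×cols tile is addressed in two column-major matrices whose allocated row counts differ.

```c
#include <assert.h>

/* Hypothetical host-only helper: copy a rows-by-cols tile from column-major A
   (leading dimension lda) to column-major B (leading dimension ldb).
   Column j of A starts at A + j * lda; column j of B starts at B + j * ldb. */
static void copy_tile(int rows, int cols, const float *A, int lda,
                      float *B, int ldb)
{
    assert(lda >= rows && ldb >= rows);
    for (int j = 0; j < cols; ++j) {
        for (int i = 0; i < rows; ++i) {
            B[j * ldb + i] = A[j * lda + i];
        }
    }
}
```

For a 3×2 matrix A stored with lda = 3, copying the top 2×2 tile into B with ldb = 2 skips the third row of each column of A.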
cublasStatus
cublasGetMatrix (int rows, int cols, int elemSize, const void *A, int lda, void *B, int ldb)
copies a tile of rows×cols elements from a matrix A in GPU memory space to a matrix B in CPU memory space. Each element requires storage of elemSize bytes.
Note: device-to-host copy of matrix data.
int
cublasIsamax (int n, const float *x, int incx)
finds the smallest index of the maximum magnitude element of single-precision vector x; that is, the result is the first i, i = 0 to n-1, that maximizes abs(x[1 + i * incx]).
Note: finds the index of the element with the largest absolute value.
int
cublasIsamin (int n, const float *x, int incx)
finds the smallest index of the minimum magnitude element of single-precision vector x; that is, the result is the first i, i = 0 to n-1, that minimizes abs(x[1 + i * incx]).
Note: finds the index of the element with the smallest absolute value.
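The "first i that maximizes abs(...)" rule can be written out on the host. This is a sketch under stated assumptions: first_max_abs is a hypothetical helper, it uses 0-based C indices rather than the manual's 1-based notation, and it returns the loop counter i, which is not necessarily the value the library function reports.

```c
#include <assert.h>

/* Hypothetical host-only helper: return the smallest i in [0, n) for which
   |x[i * incx]| is largest. Ties keep the earliest position because the
   comparison below is strict. */
static int first_max_abs(int n, const float *x, int incx)
{
    assert(n > 0 && incx > 0);
    int best = 0;
    float best_val = x[0] < 0.0f ? -x[0] : x[0];
    for (int i = 1; i < n; ++i) {
        float v = x[i * incx];
        if (v < 0.0f) {
            v = -v;
        }
        if (v > best_val) {
            best_val = v;
            best = i;
        }
    }
    return best;
}
```

The minimum-magnitude variant follows by reversing the comparison.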
float
cublasSasum (int n, const float *x, int incx)
computes the sum of the absolute values of the elements of single-precision vector x; that is, the result is the sum from i = 0 to n-1 of abs(x[1 + i * incx]).
Note: sums the absolute values of n elements.
Parameters:
n number of elements in input vector
x single-precision vector with n elements
incx storage spacing between elements of x
Vocabulary:
scalar: adj. graduated, quantitative, scalar; n. a quantity, a scalar
precision: n. exactness, accuracy; adj. precise, exact, meticulous
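The parameters have essentially the same meaning across functions; what changes is whether they describe vectors or matrices.
void
cublasSaxpy (int n, float alpha, const float *x, int incx, float *y, int incy)
multiplies single-precision vector x by single-precision scalar alpha and adds the result to single-precision vector y; that is, it overwrites single-precision y with single-precision alpha * x + y. For i = 0 to n-1, it replaces y[ly + i * incy] with alpha * x[lx + i * incx] + y[ly + i * incy], where
lx = 1 if incx >= 0, else lx = 1 + (1 - n) * incx
and ly is defined in a similar way using incy.
Note: when incx >= 0 the spacing is positive, and a typical call simply passes 1. When incx < 0 the vector is traversed backward; for example, with incx = -1 the start index is lx = n.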
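The start-index rule for negative strides can be checked with a small host-only reference. This is a sketch, not the library's code: start_offset and saxpy_ref are hypothetical names, and the manual's 1-based lx is shifted by one so that it works as a 0-based C offset.

```c
#include <assert.h>

/* 0-based start offset for a vector of n elements traversed with stride inc.
   Equals the manual's lx - 1: 0 for inc >= 0, (1 - n) * inc otherwise. */
static int start_offset(int n, int inc)
{
    return inc >= 0 ? 0 : (1 - n) * inc;
}

/* Hypothetical host-only reference for y = alpha * x + y with strides. */
static void saxpy_ref(int n, float alpha, const float *x, int incx,
                      float *y, int incy)
{
    assert(n >= 0);
    int lx = start_offset(n, incx);
    int ly = start_offset(n, incy);
    for (int i = 0; i < n; ++i) {
        y[ly + i * incy] += alpha * x[lx + i * incx];
    }
}
```

With n = 3 and incx = -1 the offset is 2, so x is read as x[2], x[1], x[0], which is the last element first.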
void
cublasScopy (int n, const float *x, int incx, float *y, int incy)
copies the single‐precision vector x to the single‐precision vector y. For
i = 0 to n-1, it copies x[lx + i * incx] to y[ly + i * incy].
Note: vector copy function.
float
cublasSdot (int n, const float *x, int incx, const float *y, int incy)
computes the dot product of two single‐precision vectors. It returns
the dot product of the single‐precision vectors x and y if successful,
and 0.0f otherwise. It computes the sum for i = 0 to n-1 of x[lx + i * incx] * y[ly + i * incy].
Note: computes the dot product of two vectors.
void
cublasSrot (int n, float *x, int incx, float *y, int incy, float sc, float ss)
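Analyzing actively makes the material more interesting:
Reading passively without thinking is the easiest way to waste time, because it is dull and makes it hard to concentrate.
Whether the material is MIT OCW or The Interpretation of Dreams, it only pays off when analyzed with full attention.
Analysis of the CUBLAS function types:
Needed: a matrix subtraction function, and matrix-vector subtraction and multiplication functions.
Analysis of textures: once a CUDA array or linear memory is bound to a texture, the data can be read directly, without any transformation.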
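CUBLAS functions used so far: cublasAlloc, cublasSetVector, cublasSgemm, cublasGetError, cublasGetVector, cublasFree, cublasInit, cublasShutdown
1. Allocation and data transfer functions: cublasAlloc, cublasSetVector, cublasGetVector, cublasFree
2. Initialization and shutdown functions: cublasInit, cublasShutdown
3. Computation functions: cublasSgemm, cublasSgemv, cublasSdot, cublasSaxpy, cublasSscal
The sections below go through their usage one by one and test them in examples: the meaning of each parameter and the typical usage.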
float
cublasSdot (int n, const float *x, int incx, const float *y, int incy)
computes the dot product of two single-precision vectors. It returns the dot product of the single-precision vectors x and y if successful, and 0.0f otherwise. It computes the sum for i = 0 to n-1 of x[lx + i * incx] * y[ly + i * incy], where lx = 1 if incx >= 0, else lx = 1 + (1 - n) * incx; ly is defined in a similar way using incy.
Note: computes the dot product of the two vectors x and y.
Parameters:
n: number of elements in input vectors
x: single-precision vector with n elements
incx: storage spacing between elements of x
y: single-precision vector with n elements
incy: storage spacing between elements of y
incx and incy are usually set to 1 when calling.
Example calls:
status = cublasAlloc(n, sizeof(float), (void **)&p);
status = cublasAlloc(n, sizeof(float), (void **)&Ap);
alpha = r2 / cublasSdot(n, p, 1, Ap, 1);
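1. Does memory allocated with cublasAlloc use indices that start at 1?
Consider y[ly + i * incy] with incy = 1: in the manual's notation, i = 0 refers to y[1] and i = 1 refers to y[2]. This 1-based numbering is only the manual's Fortran-style notation. In C the device pointer refers to the first element, so i = 0 corresponds to y[0] and i = 1 to y[1].
2. Is memory allocated with cublasAlloc equivalent to memory allocated with cudaMalloc?
The 1-based formulas may suggest that it is not, but cublasAlloc() is a wrapper around cudaMalloc(), so the resulting device pointers can be used interchangeably.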
Return value:
returns single-precision dot product (returns zero if n <= 0)
Error Status
CUBLAS_STATUS_NOT_INITIALIZED if CUBLAS library was not initialized
CUBLAS_STATUS_ALLOC_FAILED if function could not allocate reduction buffer
CUBLAS_STATUS_EXECUTION_FAILED if function failed to execute on GPU
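The dot-product definition and the n <= 0 return rule can be mirrored on the host to sanity-check results. The helper sdot_ref below is a hypothetical host-only reference that handles positive strides only; it is a sketch and does not reproduce the library's error handling.

```c
#include <assert.h>

/* Hypothetical host-only reference: sum of x[i * incx] * y[i * incy]
   for i = 0 to n-1, using 0-based indices. Returns 0.0f when n <= 0. */
static float sdot_ref(int n, const float *x, int incx,
                      const float *y, int incy)
{
    if (n <= 0) {
        return 0.0f;
    }
    assert(incx > 0 && incy > 0);
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        sum += x[i * incx] * y[i * incy];
    }
    return sum;
}
```

A step length of the form alpha = r2 / dot(p, Ap), as in the example calls above, can then be checked against a host computation of the same quotient.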
The basic model by which applications use the CUBLAS library is to create matrix and vector objects in GPU memory space, fill them with data, call a sequence of CUBLAS functions, and, finally, upload the results from GPU memory space back to the host. To accomplish this, CUBLAS provides helper functions for creating and destroying objects in GPU space, and for writing data to and retrieving data from these objects.
For maximum compatibility with existing Fortran environments, CUBLAS uses column-major storage and 1-based indexing.
Since C and C++ use row-major storage, applications cannot use the native array semantics for two-dimensional arrays. Instead, macros or inline functions should be defined to implement matrices on top of one-dimensional arrays. For Fortran code ported to C in mechanical fashion, one may choose to retain 1-based indexing to avoid the need to transform loops. In this case, the array index of a matrix element in row i and column j can be computed via the following macro:
#define IDX2F(i,j,ld) ((((j)-1)*(ld))+((i)-1))   /* 1-based indexing */
Here, ld refers to the leading dimension of the matrix as allocated (the extent of the dimension that is stored contiguously), which in the case of column-major storage is the number of rows. For natively written C and C++ code, one would most likely choose 0-based indexing, in which case the indexing macro becomes
#define IDX2C(i,j,ld) (((j)*(ld))+(i))   /* 0-based indexing */
Working through a simple example and studying its output is a good way to see what these macros are for.
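One such example, as a host-only sketch: the two macros are taken from the text, while fill_colmajor is a hypothetical helper that stores the value 10 * i + j in 1-based row i and column j so that the memory order becomes visible.

```c
#include <assert.h>

#define IDX2F(i,j,ld) ((((j)-1)*(ld))+((i)-1))  /* 1-based (i, j) */
#define IDX2C(i,j,ld) (((j)*(ld))+(i))          /* 0-based (i, j) */

/* Hypothetical helper: fill an m-by-n column-major matrix with leading
   dimension ld so that 1-based element (i, j) holds 10 * i + j. */
static void fill_colmajor(float *a, int m, int n, int ld)
{
    assert(ld >= m);
    for (int j = 1; j <= n; ++j) {
        for (int i = 1; i <= m; ++i) {
            a[IDX2F(i, j, ld)] = (float)(10 * i + j);
        }
    }
}
```

For a 2×3 matrix with ld = 2, the array in memory reads 11, 21, 12, 22, 13, 23: the two rows of column 1 come first, then column 2, then column 3. The 1-based element (2, 3) and the 0-based element (1, 2) refer to the same slot.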
cublasStatus cublasInit (void)
initializes the CUBLAS library and must be called before any other CUBLAS API function is invoked. It allocates hardware resources necessary for accessing the GPU.
Note: initialization function.
cublasStatus
cublasShutdown (void)
releases CPU‐side resources used by the CUBLAS library. The release of GPU‐side resources may be deferred until the application shuts down.
Note: releases resources.
CUBLAS functions fall into two groups:
1. CUBLAS helper functions
2. CUBLAS core functions
cublasStatus
cublasGetError (void)
returns the last error that occurred on invocation of any of the CUBLAS core functions.
Reading the error status via cublasGetError() resets the internal error state to
CUBLAS_STATUS_SUCCESS.
CUBLAS functions can be called directly from a .cpp file.
void CcublasDlg::simple_sgemm(int n, float alpha, const float *A, const float *B,
                              float beta, float *C)
{
    int i, j, k;
    for (i = 0; i < n; ++i) {
        for (j = 0; j < n; ++j) {
            float prod = 0;
            /* Row i of A times column j of B, both stored column by column. */
            for (k = 0; k < n; ++k) {
                prod += A[k * n + i] * B[j * n + k];
            }
            C[j * n + i] = alpha * prod + beta * C[j * n + i];
        }
    }
}
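Analysis: the function computes alpha * AB + beta * C. The inner loop sums A(i, k) * B(k, j) over k to form the (i, j) entry of the product AB.
Element A(i, k) is stored at A[k * n + i]; the k * n term shows that the matrix is stored column by column. This layout matches the column-major storage used by CUBLAS, although the loop indices here are 0-based rather than 1-based.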
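The column-major reading can be confirmed on a 2×2 case worked by hand. The sketch below restates the loop as a free function with the hypothetical name sgemm_ref, so that it compiles outside the dialog class; it is a host-only check, not a CUBLAS call.

```c
#include <assert.h>

/* Hypothetical free-function version of the loop above:
   C = alpha * A * B + beta * C for n-by-n column-major matrices,
   where element (i, j) of a matrix M is stored at M[j * n + i]. */
static void sgemm_ref(int n, float alpha, const float *A, const float *B,
                      float beta, float *C)
{
    assert(n >= 0);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            float prod = 0.0f;
            for (int k = 0; k < n; ++k) {
                prod += A[k * n + i] * B[j * n + k];
            }
            C[j * n + i] = alpha * prod + beta * C[j * n + i];
        }
    }
}
```

Take A with rows (1, 2) and (3, 4), stored as 1, 3, 2, 4, and B with rows (5, 6) and (7, 8), stored as 5, 7, 6, 8. With alpha = 1 and beta = 0 the product has rows (19, 22) and (43, 50), so C holds 19, 43, 22, 50 in memory.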
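Pay attention to how the indices line up, and compare allocation with cublasAlloc against allocation with cudaMalloc.
malloc = m + alloc: memory allocate. Only hands-on practice shows what the computer actually does.
main() can also take two parameters: int main(int argc, char** argv)
Texture memory resides in global memory; it is a region of global memory designated as cached. It is usually declared with texture
Textures are generally suited to fixed table structures and to random reads.
Texture memory is part of global memory, but it has two levels of cache that accelerate and filter data accesses. It is read-only.
Comment: if the constant arrays are bound to textures, then w_d, noise_d, an_d, mn_d, cos_d and sin_d can all be read with texture fetches.
As for CUBLAS: multiplying the corresponding elements of two matrices (an element-wise product) cannot be expressed as a matrix operation. The denoising step, however, can be done with just two vectors.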