Automatic Tuning Matrix Multiplication Performance on Graphics Hardware
Changhao Jiang (cjiang@cs.uiuc.edu)
Lecture notes for CS498dp, Fall 2006, University of Illinois Urbana-Champaign
Outline
- Introduction to GPGPU
- GPU architecture
- Programming model and the Brook language for GPUs
- Automatic matrix multiply library generation on the GPU
- Performance results
11/7/06, CS498dp Fall 2006, University of Illinois Urbana-Champaign
Introduction
- The GPU on commodity video cards has evolved into an extremely flexible and powerful processor:
  - Programmability
  - Precision
  - Power
- This class introduces how to harness GPU power and how to automatically tune matrix multiplication performance on the GPU: "an ATLAS for the GPU".
Motivation: Computational Power
- GPUs are fast:
  - 3.0 GHz dual-core Pentium 4: 24.6 GFLOPS
  - NVIDIA GeForceFX 7800: 165 GFLOPS
  - 1066 MHz FSB Pentium Extreme Edition: 8.5 GB/s
  - ATI Radeon X850 XT Platinum Edition: 37.8 GB/s
- GPUs are getting faster, faster:
  - CPUs: 1.4x annual growth
  - GPUs: 1.7x (pixels) to 2.3x (vertices) annual growth
GPU Becomes More Powerful
[Figure: GFLOPS over time, NVIDIA GeForceFX 7800 versus a 3.0 GHz dual-core Pentium 4; courtesy Ian Buck and John Owens.]
An Aside: Computational Power
Why are GPUs getting faster so fast?
- Arithmetic intensity: the specialized nature of GPUs makes it easier to use additional transistors for computation instead of cache.
- Economics: the multi-billion-dollar video game market is a pressure cooker that drives innovation.
Motivation: Flexible and Precise
- Modern GPUs are deeply programmable:
  - Programmable pixel, vertex, and video engines
  - Solidifying high-level language support
- Modern GPUs support high precision:
  - 32-bit floating point throughout the pipeline
  - High enough precision for many (not all) applications
Motivation: The Potential of GPGPU
- GPGPU: general-purpose computation on GPUs
- An active research topic: http://www.gpgpu.org
- In short:
  - The power and flexibility of GPUs make them an attractive platform for general-purpose computation.
  - Example applications range from in-game physics simulation to conventional computational science.
  - Goal: make the inexpensive power of the GPU available to developers as a sort of computational coprocessor.
The Problem: Difficult to Use
- GPUs are designed for, and driven by, video games:
  - The programming model is unusual
  - Programming idioms are tied to computer graphics
  - The programming environment is tightly constrained
- The underlying architectures are:
  - Inherently parallel
  - Rapidly evolving (even in the basic feature set!)
  - Largely secret
- You can't simply "port" CPU code!
GPU Architecture
A simplified graphics pipeline (diagram courtesy David Luebke):
- Application (runs on the CPU): submits vertices (3D) and graphics state
- Transform & Light (vertex processor): produces transformed, lit vertices (2D)
- Assemble Primitives: produces screen-space triangles (2D)
- Rasterize: produces fragments (pre-pixels)
- Shade (fragment processor): produces final pixels (color, depth)
- Output goes to video memory (textures); render-to-texture feeds results back into the pipeline
Programmability was introduced into two stages: the vertex processor and the fragment processor.
GPU Architecture: Another View
- A small array of vertex processors (vp) feeds a rasterizer, which feeds a larger array of fragment processors (fp); each fragment processor writes a 4-component (r, g, b, a) vector per pixel to the frame buffer.
- Most of the horsepower for GPGPU comes from the fragment processors.
- The same "shader" runs synchronously on all fragment processors.
- Every fragment processor can execute SIMD instructions.
Unusual Features and Constraints of GPU Programs
- SIMD instructions with smearing and swizzling, e.g.:

    R2.rgba = R1.abgr * R3.ggab

- Limit on instruction count: a shader can contain only a limited number of instructions.
- Limit on output: no scatter operations; a shader can write only to designated memory locations, limited in number.
- Limit on branch instructions on some GPUs: "if-then-else" is compiled into predicated instructions, and "loop" is fully unrolled.
High-Level Shading Languages
- Cg, HLSL, and the OpenGL Shading Language
  - Cg: http://developer.nvidia.com/cg
  - HLSL: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/directx9_c/directx/graphics/reference/highlevellanguageshaders.asp
  - OpenGL Shading Language: http://www.3dlabs.com/support/developer/ogl2/whitepapers/index.html
- All require knowledge of computer graphics and a graphics API (OpenGL or DirectX).
GPGPU Languages
- Why do we want them? To make programming GPUs easier:
  - No need to know OpenGL, DirectX, or ATI/NV extensions
  - Simplify common operations
  - Focus on the algorithm, not on the implementation
- Sh (University of Waterloo): http://libsh.org
- Brook (Stanford University): http://graphics.stanford.edu/projects/brookgpu
Brook: a General-Purpose Streaming Language
- Stream programming model: the same kernel program (shader) operates on streams of data. An input stream flows through the shader, which may also read constants, temporary registers, and textures, to produce an output stream.
- C with stream extensions
- Cross-platform:
  - ATI & NVIDIA
  - OpenGL & DirectX
  - Windows & Linux
Streams
- A stream is a collection of records requiring similar computation, e.g. particle positions, voxels, FEM cells, ...

    Ray r<200>;
    float3 velocityfield<100,100,100>;

- Similar to arrays, but:
  - index operations are disallowed: position[i]
  - data moves through read/write stream operators:

    streamRead(r, r_ptr);
    streamWrite(velocityfield, v_ptr);
Kernels
- Kernels are functions applied to streams:
  - similar to a for_all construct
  - no dependencies between stream elements

    kernel void foo (float a<>, float b<>, out float result<>) {
      result = a + b;
    }

    float a<100>;
    float b<100>;
    float c<100>;
    foo(a, b, c);

- Equivalent C loop:

    for (i = 0; i < 100; i++)
      c[i] = a[i] + b[i];
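The kernel-over-streams semantics can also be sketched as a CPU emulation in Python (illustration only; `run_kernel` is a hypothetical helper, not part of Brook):

```python
def run_kernel(kernel, *streams):
    """Apply `kernel` independently at each element position.

    Because stream elements may not depend on one another, a runtime
    is free to execute all positions in parallel (as the GPU does)."""
    n = len(streams[0])
    assert all(len(s) == n for s in streams)
    return [kernel(*(s[i] for s in streams)) for i in range(n)]

# The Brook kernel `foo` (result = a + b) as a plain function:
def foo(a, b):
    return a + b

a = [float(i) for i in range(100)]
b = [2.0] * 100
c = run_kernel(foo, a, b)  # same result as the C loop: c[i] = a[i] + b[i]
```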
Reductions
- Compute a single value from a stream (associative operations only):

    reduce void sum (float a<>, reduce float r<>) {
      r += a;
    }

    float a<100>;
    float r;
    sum(a, r);

- Equivalent C loop:

    r = a[0];
    for (int i = 1; i < 100; i++)
      r += a[i];
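The "associative operations only" restriction exists because a parallel runtime may combine partial results in any grouping. A Python sketch (hypothetical, not a Brook primitive) makes this concrete by reducing pairwise, like a tree:

```python
def tree_reduce(op, xs):
    """Reduce by combining adjacent pairs, as a parallel runtime might.
    Only an associative `op` guarantees the same answer as a
    left-to-right sequential loop."""
    xs = list(xs)
    while len(xs) > 1:
        # Combine adjacent pairs; a lone trailing element passes through.
        xs = [op(xs[i], xs[i + 1]) if i + 1 < len(xs) else xs[i]
              for i in range(0, len(xs), 2)]
    return xs[0]

a = list(range(1, 101))
sequential = 0
for x in a:
    sequential += x
parallel = tree_reduce(lambda u, v: u + v, a)
# Addition is associative, so both groupings give 5050. A
# non-associative op such as subtraction would not: the tree grouping
# of [1, 2, 3, 4] gives (1-2)-(3-4) = 0, while the loop gives -8.
```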
Matrix-Vector Multiply

    kernel void mul (float a<>, float b<>, out float result<>) {
      result = a * b;
    }
    reduce void sum (float a<>, reduce float result<>) {
      result += a;
    }

    float matrix<20, 10>;
    float vector<1, 10>;
    float tempmv<20, 10>;
    float result<20, 1>;

    mul(matrix, vector, tempmv);
    sum(tempmv, result);

The `mul` kernel computes T = M * V elementwise (the vector V is replicated across the rows of M); the `sum` reduction then collapses each row of T to produce the result column R.
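The two-step mul/sum pipeline can be checked against a direct matrix-vector product with a plain-Python sketch (a CPU illustration, not Brook code):

```python
M, N = 20, 10
matrix = [[float(i * N + j) for j in range(N)] for i in range(M)]
vector = [float(j + 1) for j in range(N)]

# Step 1, the `mul` kernel: tempmv[i][j] = matrix[i][j] * vector[j],
# with the vector replicated across all rows.
tempmv = [[matrix[i][j] * vector[j] for j in range(N)] for i in range(M)]

# Step 2, the `sum` reduction: collapse each row of tempmv to a scalar.
result = [sum(row) for row in tempmv]

# This agrees with computing the dot products directly:
direct = [sum(matrix[i][j] * vector[j] for j in range(N)) for i in range(M)]
```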
An Automatic Matrix-Multiply Generation System
The system includes:
- A code generator: generates multiple versions in the high-level BrookGPU language, which are then compiled into low-level code.
- A search engine: searches the implementation space for the best version.
- A performance evaluator: measures the performance of the generated code.
The search engine feeds tuning parameters to the code generator; the generated program goes to the performance evaluator; performance metrics flow back to the search engine.
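The generate-evaluate-search loop can be sketched as follows (the parameter space, names, and scoring are synthetic stand-ins for illustration; the real system generates BrookGPU code and times it on the GPU):

```python
import itertools

def code_generator(params):
    # Stand-in for codegen.py: "generates" a program from parameters.
    return params

def performance_evaluator(program):
    # Stand-in for compiling, running, and timing the generated code;
    # a synthetic score replaces a real GFLOPS measurement.
    np_, mrt, channel = program
    return 1.0 / (abs(np_ - 32) + abs(mrt - 2) + abs(channel - 4) + 1)

def search_engine(space):
    # Exhaustive search over the cross product of all parameter values.
    best, best_score = None, float("-inf")
    for params in itertools.product(*space.values()):
        score = performance_evaluator(code_generator(params))
        if score > best_score:
            best, best_score = params, score
    return best

space = {"np": [1, 8, 32, 128], "mrt": [1, 2, 4], "channel": [1, 2, 4]}
best = search_engine(space)
```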
Code Generator

    $ python codegen.py -h
    codegen [options]
    options:
      -h, --help                 print this help
      -u, --unroll               unroll the kernel function (default: not unrolled)
      -p, --np=[passes]          the number of iterations executed in each
                                 invocation of the kernel (default: 1)
      -m, --mrt_height=[height]  the height of the multiple-render-target (MRT)
                                 grid (default: 1)
      -n, --mrt_width=[width]    the width of the multiple-render-target (MRT)
                                 grid (default: 1)
      -c, --channel=[channel code]
                                 how to use the multiple (up to 4) color
                                 channels of a fragment:
                                 1: 1x1 (default)  2: 1x2  3: 1x3  4: 1x4  5: 2x2
GPU Algorithms for Matrix Multiply
- Straightforward mapping of the triply nested loop onto the GPU:
  - Store the two input matrices (A and B) as two textures.
  - Store the result matrix C in the frame buffer.
  - Each execution of the shader program outputs one element of C:
    - fetches one row from matrix A,
    - fetches one column from matrix B,
    - computes the dot product, and saves the result to C.
- Problems:
  - No data reuse within the shader, hence poor performance.
  - The shader length might exceed the instruction limit when the loop is fully unrolled (forced by the lack of branch instructions).
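The straightforward mapping can be emulated on the CPU: each "shader invocation" independently computes one element of C from a full row of A and a full column of B, and nothing fetched is reused across invocations, which is exactly the data-reuse problem noted above. (A plain-Python sketch, not the generated Brook code.)

```python
def shader(i, j, A, B, K):
    # One fragment: dot product of row i of A and column j of B.
    acc = 0.0
    for k in range(K):               # the inner loop that gets unrolled
        acc += A[i][k] * B[k][j]
    return acc

M = N = K = 8
A = [[float(i + k) for k in range(K)] for i in range(M)]
B = [[float(k * j) for j in range(N)] for k in range(K)]
# The GPU runs one shader instance per output element of C:
C = [[shader(i, j, A, B, K) for j in range(N)] for i in range(M)]
```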
Generated code for "python codegen.py -p 1024":

    for (start = 0.0f; start < 1024/1; start += 1024) {
      mult(index, a, b, start, sum0, c0);
      streamSwap(sum0, c0);
    }

    kernel void mult(iter float2 index<>, float a[][], float b[][],
                     float start, float sum0[][], out float c0<>) {
      float i;
      float2 addr_a;
      float2 addr_b;
      float2 tmp_addr;
      float a_tmp00;
      float b_tmp00;
      c0 = sum0[index];
      addr_a = float2(start, index.y);
      addr_b = float2(index.x, start*1);
      for (i = 0.0f; i < 1024; i = i + 1.0f) {
        // load an element of a row from texture (matrix) a
        tmp_addr = addr_a;
        a_tmp00 = a[tmp_addr];
        // load an element of a column from texture (matrix) b
        tmp_addr = addr_b;
        b_tmp00 = b[tmp_addr];
        c0 += a_tmp00 * b_tmp00;
        addr_a.x += 1.0f;
        addr_b.y += 1.0f*1;
      }
    }
Tuning for Multiple Render Targets
- "Multiple render targets" (MRT) allow a shader to simultaneously write to multiple buffers.
- Tuning strategy: take advantage of multiple render targets to improve data reuse.
- Algorithm with multiple render targets:
  - Divide matrix C into m x n sub-matrix blocks; each block becomes one render target. A and B are logically divided accordingly.
  - Each fragment program:
    - fetches m rows from matrix A,
    - fetches n columns from matrix B,
    - computes m x n dot products.
- Downsides:
  - The shader requires more temporary registers.
  - Using multiple render targets has a performance overhead.
- Code generation command: "python codegen.py -m 2 -n 2"
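The data-reuse gain can be seen in a CPU sketch of the blocked scheme: one "fragment program" computes an m-by-n block of C, so each value fetched from A is reused n times and each value fetched from B is reused m times per k step. (Illustration only; names are hypothetical.)

```python
def block_shader(bi, bj, m, n, A, B, K):
    """One fragment program with m*n render targets: computes the
    m-by-n block of C at block coordinates (bi, bj)."""
    block = [[0.0] * n for _ in range(m)]
    for k in range(K):
        a = [A[bi * m + r][k] for r in range(m)]  # m fetches from A
        b = [B[k][bj * n + c] for c in range(n)]  # n fetches from B
        for r in range(m):
            for c in range(n):                    # m*n multiply-adds
                block[r][c] += a[r] * b[c]        # reuse a[r] and b[c]
    return block

M = N = K = 8
m = n = 2                       # as in "python codegen.py -m 2 -n 2"
A = [[float(i - k) for k in range(K)] for i in range(M)]
B = [[float(k + 2 * j) for j in range(N)] for k in range(K)]
C = [[0.0] * N for _ in range(M)]
for bi in range(M // m):        # one invocation per output block
    for bj in range(N // n):
        blk = block_shader(bi, bj, m, n, A, B, K)
        for r in range(m):
            for c in range(n):
                C[bi * m + r][bj * n + c] = blk[r][c]
```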
Data Packing
- Pack scalar data into the RGBA channels of texture memory, so each texel holds four values.
Tuning for SIMD Instructions with Data Packing
- The fragment processor supports SIMD instructions.
- Tuning strategy:
  - Use SIMD instructions to improve performance.
  - Use smearing and swizzling to do "register tiling" and improve data reuse.
- Algorithm:
  - Pack four elements into one pixel; two schemes: 1x4 vs. 2x2.
  - Each fragment program (1x4 scheme):
    - fetches one row from matrix A,
    - fetches four columns from matrix B,
    - performs a series of vector-by-matrix products.
- Question: which packing scheme performs best?
- Code generation command: "python codegen.py -c 1", where -c, --channel=[channel code] selects 1: 1x1 (default), 2: 1x2, 3: 1x3, 4: 1x4, 5: 2x2.
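A CPU sketch of the 1x4 packing scheme: four adjacent elements of a row of B are packed into one RGBA "texel", so a single fetch yields four values and each fragment produces four elements of C at once. (Illustration only; tuples stand in for texels.)

```python
def pack_1x4(B, K, N):
    # Texel t of row k holds B[k][4t .. 4t+3] in channels r, g, b, a.
    return [[tuple(B[k][4 * t + ch] for ch in range(4))
             for t in range(N // 4)] for k in range(K)]

def shader_1x4(i, t, A, Bp, K):
    # One fragment: row i of A against the four columns packed in
    # texel column t, producing four elements of C.
    acc = [0.0, 0.0, 0.0, 0.0]
    for k in range(K):
        a = A[i][k]          # scalar "smeared" across all 4 channels
        texel = Bp[k][t]     # one SIMD fetch yields 4 values
        for ch in range(4):
            acc[ch] += a * texel[ch]
    return acc

M = N = K = 8
A = [[float(i + 2 * k) for k in range(K)] for i in range(M)]
B = [[float(k * j + 1) for j in range(N)] for k in range(K)]
Bp = pack_1x4(B, K, N)
C = [[0.0] * N for _ in range(M)]
for i in range(M):
    for t in range(N // 4):
        C[i][4 * t:4 * t + 4] = shader_1x4(i, t, A, Bp, K)
```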
Tuning the Number of Passes
- Problem: the GPU's limit on instruction count prevents the dot product from being completed in one pass.
- Strategy: partition the computation into multiple passes.
- Algorithm with multiple passes; each fragment program:
  - fetches a part of a row from matrix A,
  - fetches a part of a column from matrix B,
  - performs a dot product to get a partial sum,
  - and iterates over multiple passes to get the final result.
- Downside: multi-pass execution incurs expensive overhead in copying intermediate results.
- Code generation command: "python codegen.py -p 32"
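A CPU sketch of the multi-pass scheme: each pass extends every partial dot product by `np_` more terms, reading the previous partial sums and writing updated ones. The read-then-rewrite of the whole intermediate buffer is the copying overhead noted above. (Illustration only; names are hypothetical.)

```python
def one_pass(start, np_, A, B, Csum, M, N):
    """One rendering pass: extend every partial dot product in Csum
    by np_ more terms (k = start .. start+np_-1)."""
    return [[Csum[i][j] + sum(A[i][k] * B[k][j]
                              for k in range(start, start + np_))
             for j in range(N)] for i in range(M)]

M = N = 8
K = 32
np_ = 8                         # terms per pass, the "-p" parameter
A = [[float((i + k) % 7) for k in range(K)] for i in range(M)]
B = [[float((k - j) % 5) for j in range(N)] for k in range(K)]
Csum = [[0.0] * N for _ in range(M)]
for start in range(0, K, np_):  # K / np_ = 4 passes in total
    Csum = one_pass(start, np_, A, B, Csum, M, N)
```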
Tuning Parameters
The code generator is controlled by a set of tuning parameters:
- "mrt_w", "mrt_h": how to divide matrix C across render targets
- "mc_w", "mc_h": how to pack data to use SIMD channels
- "np": how many iterations are executed in each pass
- "unroll": whether or not to use branch instructions
- "compiler": whether to use the "cgc" or "fxc" compiler
- "shader": whether to use the DirectX backend with "ps20", "ps2a", "ps2b", "ps30", or the OpenGL backend with "arbfp", "fp30", "fp40"
Search Strategy
- Searching an exponential space is time-consuming. Two techniques speed up the search:
  - Space pruning: limit the search range of parameters based on problem-specific heuristics.
  - Searching in phases, in the following order:
    1. For each compiler value
    2. For each profile value
    3. For each unroll value
    4. Search np over power-of-two values
    5. For each mc_* value
    6. For each mrt_* value
    7. Evaluate performance
    8. Recursively search np on both sides of the best np found in step 4.
- The search time drops dramatically: from 53 days (exhaustive, in theory) to 4 hours in practice, with no significant performance loss.
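Searching in phases fixes parameters one group at a time, keeping the best value found before moving to the next group, so the cost becomes the sum of the per-phase sizes instead of their product. A toy sketch (the space and the separable scoring function below are illustrative stand-ins; the real evaluator times generated code):

```python
def phased_search(space, evaluate):
    """Fix parameters phase by phase: within each phase, vary only one
    parameter (others held at their current best) and keep the winner."""
    best = {name: vals[0] for name, vals in space.items()}
    evaluations = 0
    for name, vals in space.items():      # one phase per parameter
        scores = {}
        for v in vals:                    # vary only this parameter
            scores[v] = evaluate({**best, name: v})
            evaluations += 1
        best[name] = max(scores, key=scores.get)
    return best, evaluations

def evaluate(p):
    # Toy metric with a unique optimum at np=32, mrt_w=2, channel=4.
    return -(p["np"] - 32) ** 2 - (p["mrt_w"] - 2) ** 2 - (p["channel"] - 4) ** 2

space = {"np": [1, 2, 4, 8, 16, 32, 64],
         "mrt_w": [1, 2, 4],
         "channel": [1, 2, 4, 5]}
best, cost = phased_search(space, evaluate)
exhaustive = 7 * 3 * 4   # 84 configurations, versus cost = 7 + 3 + 4 = 14
```

Phased search is only guaranteed to find the global optimum when the parameters interact weakly, which is the heuristic bet the system makes.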
Performance Data
- Compared with two expert hand-tuned implementations:
  - part of GPUBench, developed at Stanford University,
  - implemented with carefully crafted assembly code.
- The searched code achieves comparable performance on four GPU platforms:
  - On two platforms, it beats the hand-tuned version by 8% and 15% (i.e. reaches 107% and 115%).
  - On the other two platforms, it achieves 56% and 70% of the hand-tuned version.
Performance Penalties of Using a High-Level Language
- One reason for lower performance than manual tuning: the overhead of using the high-level BrookGPU language.
- To quantify it, compare the performance of the same algorithm implemented in:
  - BrookGPU
  - assembly code