ARM NEON Optimization - An Example
Filed under: Beagleboard,OMAP3530 — Nils @ 8:15 pm
Since there is so little information about NEON optimizations out there I thought I’d write a little about it.
Some weeks ago someone on the beagle-board mailing-list asked how to optimize a color to grayscale conversion for images. I haven’t done much pixel processing with ARM NEON yet, so I gave it a try. The results I got were quite spectacular, but more on this later.
For the color to grayscale conversion I used a very simple conversion scheme: a weighted average of the red, green and blue components. This conversion ignores the effect of gamma but works well enough in practice. Also I decided not to do proper rounding; it’s just an example after all. First, a reference implementation in C:
```c
void reference_convert (uint8_t * __restrict dest, uint8_t * __restrict src, int n)
{
  int i;
  for (i=0; i<n; i++)
  {
    int r = *src++; // load red
    int g = *src++; // load green
    int b = *src++; // load blue

    // build weighted average:
    int y = (r*77)+(g*151)+(b*28);

    // undo the scale by 256 and write to memory:
    *dest++ = (y>>8);
  }
}
```

Optimization with NEON Intrinsics

Let’s start optimizing the code using the compiler intrinsics. Intrinsics are nice to use because they behave just like C functions but compile to a single assembler statement. At least in theory, as I’ll show you later. Since NEON works in 64 or 128 bit registers it’s best to process eight pixels in parallel. That way we can exploit the parallel nature of the SIMD unit. Here is what I came up with:

```c
void neon_convert (uint8_t * __restrict dest, uint8_t * __restrict src, int n)
{
  int i;
  uint8x8_t rfac = vdup_n_u8 (77);
  uint8x8_t gfac = vdup_n_u8 (151);
  uint8x8_t bfac = vdup_n_u8 (28);
  n/=8;

  for (i=0; i<n; i++)
  {
    uint16x8_t  temp;
    uint8x8x3_t rgb = vld3_u8 (src);
    uint8x8_t   result;

    temp = vmull_u8 (rgb.val[0],      rfac);
    temp = vmlal_u8 (temp, rgb.val[1], gfac);
    temp = vmlal_u8 (temp, rgb.val[2], bfac);

    result = vshrn_n_u16 (temp, 8);
    vst1_u8 (dest, result);

    src  += 8*3;
    dest += 8;
  }
}
```

Let’s take a look at it step by step:

First off I load my weight factors into three NEON registers. The vdup.8 instruction does this and also replicates the byte into all 8 bytes of the NEON register.

```c
uint8x8_t rfac = vdup_n_u8 (77);
uint8x8_t gfac = vdup_n_u8 (151);
uint8x8_t bfac = vdup_n_u8 (28);
```

Now I load 8 pixels at once into three registers.

```c
uint8x8x3_t rgb = vld3_u8 (src);
```

The vld3.8 instruction is a specialty of the NEON instruction set. With NEON you can not only do loads and stores of multiple registers at once, you can de-interleave the data on the fly as well. Since I expect my pixel data to be interleaved, the vld3.8 instruction is a perfect fit for a tight loop. After the load, I have all the red components of 8 pixels in the first loaded register. The green components end up in the second and blue in the third.
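For readers without a NEON reference at hand, the effect of that de-interleaving load can be sketched in plain C. This scalar model is my own illustration, not part of the original code:

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of what one vld3.8 does: read 24 interleaved
   R,G,B,R,G,B,... bytes and split them into three 8-byte planes. */
static void deinterleave_rgb8(const uint8_t *src,
                              uint8_t r[8], uint8_t g[8], uint8_t b[8])
{
    for (int i = 0; i < 8; i++) {
        r[i] = src[3 * i + 0];  /* all 8 red components together   */
        g[i] = src[3 * i + 1];  /* all 8 green components together */
        b[i] = src[3 * i + 2];  /* all 8 blue components together  */
    }
}
```

With the channels separated like this, each subsequent multiply works on eight same-channel values at once.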
Now calculate the weighted average:

```c
temp = vmull_u8 (rgb.val[0], rfac);
temp = vmlal_u8 (temp, rgb.val[1], gfac);
temp = vmlal_u8 (temp, rgb.val[2], bfac);
```

vmull.u8 multiplies each byte of the first argument with the corresponding byte of the second argument. Each result becomes a 16 bit unsigned integer, so no overflow can happen. The entire result is returned as a 128 bit NEON register pair. vmlal.u8 does the same thing as vmull.u8 but also adds the content of another register to the result. So we end up with just three instructions for the weighted average of eight pixels. Nice.

Now it’s time to undo the scaling of the weight factors. To do so I shift each 16 bit result to the right by 8 bits, which is equivalent to a division by 256. ARM NEON has lots of instructions to do the shift, but a “narrow” variant exists as well. This one does two things at once: it does the shift and afterwards converts the 16 bit integers back to 8 bit by removing all the high bytes from the result. We get back from the 128 bit register pair to a single 64 bit register.

```c
result = vshrn_n_u16 (temp, 8);
```

And finally store the result.

```c
vst1_u8 (dest, result);
```

First Results:

How do the reference C function and the NEON optimized version compare? I did a test on my OMAP3 Cortex-A8 CPU on the beagle-board and got the following timings:

C-version: 15.1 cycles per pixel.
NEON-version: 9.9 cycles per pixel.

That’s only a speed-up of factor 1.5. I expected much more from the NEON implementation. It processes 8 pixels with just 6 instructions after all. What’s going on here? A look at the assembler output explained it all.
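As an aside, it is easy to convince yourself that the shift-and-narrow cannot lose significant bits: the weights 77, 151 and 28 sum to exactly 256, so the 16 bit intermediate never exceeds 255*256 and the narrowed result always fits in a byte. The scalar sketch below is my addition, not from the original post:

```c
#include <assert.h>
#include <stdint.h>

/* Scalar version of the per-pixel math used above: the weights are
   fixed-point approximations of the usual luma weights, scaled by
   256, and 77 + 151 + 28 == 256. */
static uint8_t gray(uint8_t r, uint8_t g, uint8_t b)
{
    int y = r * 77 + g * 151 + b * 28;  /* at most 255*256 = 65280, fits in 16 bits */
    return (uint8_t)(y >> 8);           /* undo the scale by 256; no clamp needed */
}
```

Because the weights sum to the divisor, pure white maps back to 255 and no saturation step is required anywhere in the loop.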
Here is the inner-loop part of the convert function:

```
160: f46a040f  vld3.8    {d16-d18}, [sl]
164: e1a0c005  mov       ip, r5
168: ecc80b06  vstmia    r8, {d16-d18}
16c: e1a04007  mov       r4, r7
170: e2866001  add       r6, r6, #1   ; 0x1
174: e28aa018  add       sl, sl, #24  ; 0x18
178: e8bc000f  ldm       ip!, {r0, r1, r2, r3}
17c: e15b0006  cmp       fp, r6
180: e1a08005  mov       r8, r5
184: e8a4000f  stmia     r4!, {r0, r1, r2, r3}
188: eddd0b06  vldr      d16, [sp, #24]
18c: e89c0003  ldm       ip, {r0, r1}
190: eddd2b08  vldr      d18, [sp, #32]
194: f3c00ca6  vmull.u8  q8, d16, d22
198: f3c208a5  vmlal.u8  q8, d18, d21
19c: e8840003  stm       r4, {r0, r1}
1a0: eddd3b0a  vldr      d19, [sp, #40]
1a4: f3c308a4  vmlal.u8  q8, d19, d20
1a8: f2c80830  vshrn.i16 d16, q8, #8
1ac: f449070f  vst1.8    {d16}, [r9]
1b0: e2899008  add       r9, r9, #8   ; 0x8
1b4: caffffe9  bgt       160
```

Note the store at offset 168? The compiler decides to write the three registers onto the stack. After a bunch of useless memory accesses on the GPP side, the compiler reloads them (offsets 188, 190 and 1a0) into exactly the same physical NEON registers. What do all the ordinary integer instructions do? I have no idea. Lots of memory accesses target the stack for no good reason. There is definitely no shortage of registers anywhere.

For reference: I used the GCC 4.3.3 (CodeSourcery 2009q1 lite) compiler.

NEON and assembler

Since the compiler can’t generate good code I wrote the same loop in assembler. In a nutshell I just took the intrinsic based loop and converted the instructions one by one. The loop control is a bit different, but that’s all.

```
convert_asm_neon:

    # r0: Ptr to destination data
    # r1: Ptr to source data
    # r2: Iteration count:

    push        {r4-r5,lr}
    lsr         r2, r2, #3

    # build the three constants:
    mov         r3, #77
    mov         r4, #151
    mov         r5, #28
    vdup.8      d3, r3
    vdup.8      d4, r4
    vdup.8      d5, r5

.loop:
    # load 8 pixels:
    vld3.8      {d0-d2}, [r1]!

    # do the weight average:
    vmull.u8    q3, d0, d3
    vmlal.u8    q3, d1, d4
    vmlal.u8    q3, d2, d5

    # shift and store:
    vshrn.u16   d6, q3, #8
    vst1.8      {d6}, [r0]!

    subs        r2, r2, #1
    bne         .loop

    pop         { r4-r5, pc }
```

Final Results:

Time for some benchmarking again. How does the hand-written assembler version compare? Well – here are the results:

C-version: 15.1 cycles per pixel.
NEON-version: 9.9 cycles per pixel.
Assembler: 2.0 cycles per pixel.

That’s roughly a factor of five over the intrinsic version and 7.5 times faster than my not-so-bad C implementation. And keep in mind: I didn’t even optimize the assembler loop.

My conclusion: if you want performance out of your NEON unit, stay away from the intrinsics. They are nice as a prototyping tool. Use them to get your algorithm working and then rewrite the NEON parts of it in assembler.

Btw: sorry for the ugly syntax highlighting. I’m still looking for a nice wordpress plug-in.

Comments (34)

1. Your post is very interesting. I was reading up about NEON intrinsics and I came across your post. This information is very useful. BTW I have a few doubts/comments.

1. Where are the src and dst pointers pointing to? Is it DDR? Or some faster memory like L3 memory in OMAP3? Can you achieve 2 cycles per pixel performance if the data is in DDR and the A8 accesses it through the D-cache? My guess was that, if the data is in DDR, D-cache misses would be a gating factor and would result in worse performance than 2 cycles per pixel. Can you please clarify?

2. Can you try declaring the following variables outside the loop?

```c
uint16x8_t temp;
uint8x8x3_t rgb = vld3_u8 (src);
uint8x8_t result;
```

I am wondering if declaring the variables inside the loop causes the compiler to do strange things like reading/writing from the stack every iteration. Only a guess; it isn’t necessarily right. Have you tried declaring them outside the loop?

3. Have you tried any other compiler? TI and ARM have A8 compilers that support NEON intrinsics. I have heard the TI compiler is pretty good.

Thanks and Regards
Ranjith

Comment by Ranjith Parakkal — January 11, 2010 @ 12:45 pm

2. Hi Ranjith,

The two pointers point to DDR memory.
Source is 192kb and dest 64kb in size (256*256 pixels each). No internal memory has been used, and the memory blocks are much larger than the cache. Also, as far as I know, internal or tightly coupled memory does not exist on the ARM side of the OMAP3530.

I’ll try to move the variable declaration out of the loop this evening. It shouldn’t make a difference, but well, you never know until you’ve tried.

For trying the TI compiler: that’s a good idea.

Comment by admin — January 11, 2010 @ 7:28 pm

3. Nils, Thanks a LOT for your reply. I will wait for you to try moving the declaration outside the loop and post the results here.

Just thinking out loud on how you have achieved 2 cycles per pixel. Since you are processing 8 pixels every iteration, on average the following code should take 16 cycles.

```
.loop
    vld3.8    {d0-d2}, [r1]!   -- ?? cycles

    # do the weight average:
    vmull.u8  q3, d0, d3       -- one cycle
    vmlal.u8  q3, d1, d4       -- one cycle
    vmlal.u8  q3, d2, d5       -- one cycle

    # shift and store:
    vshrn.u16 d6, q3, #8       -- one cycle
    vst1.8    {d6}, [r0]!      -- ?? cycles

    subs      r2, r2, #1       -- one cycle
    bne .loop                  -- ?? cycles
```

The vector arithmetic operations and the subs operation should account for 5 cycles out of the 16, assuming each of them is single cycle when pipelined. So that leaves about 9 cycles for the branch and the vector load and store operations. I think this should more or less account for the cache misses. I don’t know much about ARM’s cache structure; let me go read up about it. BTW there is an L3 memory on OMAP3, which is not closely coupled with the A8, but may still give better performance than cached DDR.

Comment by Ranjith Parakkal — January 12, 2010 @ 10:18 am

4. Hi, I need some help using the NEON instructions of the Cortex-A8 in my application. I am writing an image algorithm where I want to use the NEON instructions of the Cortex-A8. I have tested the program using RVDS, where I have used init.s and init_cache.s taken from the examples of RVDS.
We had also given one scatter file for placement of stack and heap while linking in RVDS. I want to run my program on a Beagle Board running Linux. My questions are:

1. Do we need to use init.s and init_cache.s in my program to run it on the Beagle Board?
2. Do we need a scatter file to run the program on the Beagle Board? If yes, how do we give a scatter file using GNU ld?
3. When we removed ’1′ and ’2′, the linker gives the error “uses VFP registers”.

Please help.
Mike

Comment by Mike — January 12, 2010 @ 12:53 pm

5. Hi Ranjith,

It’s not that simple to count cycles in mixed ARM and NEON code. The NEON pipeline is very special. In a nutshell, the NEON unit sits logically behind the ARM unit and lags 5 cycles behind execution. It also has its own instruction queue and only executes one instruction per cycle (in practice).

In the code example the following will happen: the instruction decoder decodes two instructions per cycle. The ARM pipeline can execute two instructions per cycle. NEON instructions are treated as NOPs, so to say. So all the NEON instructions flow through the ARM pipeline and do nothing. After the 5 cycles they get stuffed into the NEON instruction queue.

The NEON unit will now start to execute the queued instructions. It can only do one instruction per cycle (not exactly true, there are some pipeline possibilities, but I doubt my code uses any). Anyway, in practice the speed is just half of the ARM unit. This has the following effect on timing: during the first two or three iterations of the conversion function the ARM unit generates NEON instructions much faster than the NEON unit can execute them. The queue fills up fast. Once the queue is filled, the NEON unit will always be able to execute while the ARM unit is mostly waiting for a free queue entry. All other ARM instructions will execute in parallel with the NEON unit, and the NEON unit will never run out of work. So effectively all the ARM instructions that are mixed in between the NEON code are free.
The entire performance is dominated by the NEON timing alone. We end up with the six instructions executing within 16 cycles. I haven’t measured, but I guess that the load instruction accounts for at least 50% of the time (due to cache misses) and the data processing and store do the other half. It would be fun to run the code on zero wait-state memory. Unfortunately I don’t know about any such memory, even on the L3. Well, I could abuse the internal memory of the DSP, but I don’t know how fast access from the ARM is.

Comment by Nils — January 12, 2010 @ 4:14 pm

6. Hi Mike. I’ve never worked with RVDS, but if you use init.s and init_cache.s you’re compiling your program for use without any OS, e.g. bare-bone or boot-loader style. Since you want to execute the code on the BeagleBoard using Linux you need an executable in ELF format. Check the manual for how to generate such an output file. I’m sure it’s documented somewhere.

Comment by Nils — January 12, 2010 @ 4:20 pm

7. Hi, Thanks for your reply. I built my application without init.s, init_cache.s and the scatter file, using -mfpu=neon. It built successfully. When I ran my application on the BeagleBoard it exited giving “Illegal Instruction”, which I suspect happens when we encounter the first NEON instruction. We are using BeagleBoard version B4 and kernel 2.6.22.18. Could you tell me how to solve this issue? Please help.
Mike

Comment by Mike — January 13, 2010 @ 6:44 am

8. Hi Mike. Have you tried debugging the executable with gdb to find out which instruction triggers the “Illegal Instruction” fault? That could give some clues about what is going wrong. Maybe there is still some code in your application that tries to directly access the hardware? Moves to the co-processor that configure the cache and the like will trigger an illegal instruction fault. You may get better answers on the beagleboard mailing list. I suggest you ask there.
I’ve never worked with RVDS, and I don’t have any problems compiling my code using GCC on the PC or the BeagleBoard.

Link to the beagleboard community: http://groups.google.com/group/beagleboard

I’m using kernel 2.6.29 by the way; 2.6.22 is rather old. How come you’re still using such an old kernel?

Comment by Nils — January 13, 2010 @ 5:49 pm

9. Hi Nils, How did you get the assembler output generated? Did you use any specific compiler flag to generate the assembler output? I am using the cross compiler arm-none-linux-gnueabi-gcc to compile my programs and test them on the OMAP Zoom 2 platform. I tried using the -S option but it does not generate any assembler output file.

Thanks and Best Regards,
Venkat

Comment by Venkat — January 21, 2010 @ 2:42 pm

10. Hi Nils, I have one more question. How did you build an executable using a hand coded assembler file? Could you please let me know the steps?

Thanks and Best Regards,
Venkat

Comment by Venkat — January 21, 2010 @ 5:04 pm

11. Hi Venkat,

I generated the assembler output using the GNU disassembler:

arm-none-linux-gnueabi-objdump -d yourfile.o

will do the trick!

To compile a raw assembler file do the same as you do with your .c files, e.g.

arm-none-linux-gnueabi-gcc -c yourfile.s

will generate a yourfile.o object file. I’ve uploaded the file onto my webspace, so you can use it as a boilerplate: rgb_to_gray.s

Cheers, Nils

Comment by Nils — January 21, 2010 @ 10:03 pm

12. Nils (and everyone), I’m an open source advocate at TI for OMAP. A couple of our engineers pointed me to this article. We are working with ARM and CodeSourcery to improve the quality of ARM compilers. We are pushing everything directly to gcc, and we hope to have some real progress by the end of 2010. We’ll keep you in the loop on the progress.

Cheers, Chris P

Comment by Chris — March 8, 2010 @ 3:00 pm

13. Hi Chris, That’s great news.
I have followed the gcc-dev mailing list for two years now, and I have developed a feeling for how long and difficult some changes are and how gcc works on the inside. If you don’t mind I would like to contact you privately and give some additional insights on what goes wrong inside gcc when it comes to performant ARM code. There are a couple of low hanging fruits that would give instant performance improvements for Cortex-A8 code.

Also, since you’ve contacted me via the blog: don’t miss Mans’ blog on ARM code: http://www.hardwarebug.org He is _very_ competent when it comes to ARM and NEON. I’m sure he would be glad to share some of his insights with you as well.

Cheers, Nils

Comment by Nils — March 8, 2010 @ 4:50 pm

14. The inefficient code generated by the vldX (where X>1) intrinsics is a known problem in gcc: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43118

Comment by Samuel — March 15, 2010 @ 1:13 pm

15. Hi, I want to understand how the NEON pipeline works. I have read the Cortex-A8 technical reference manual but still want to understand instruction level scheduling.

Thanks
Saurabh

Comment by Saurabh — March 26, 2010 @ 6:21 am

16. Hi, just a speculative post under an old blog entry in case you happen to know the answer to this. I’ve been poring over various RealView documents about NEON, trying to understand the similarities and differences from Intel SSE-x. Am I right in saying that there’s no equivalent to SSE’s _mm_movemask_epi8 (which basically takes a SIMD vector and generates a standard integer bitmask based upon a predicate (in this case hardwired to be “top-bit set”) applied to each vector element)? There doesn’t seem to be, but maybe I’m missing one hidden away. This is probably an instruction that’s not that important for multimedia generation, but I’m interested in image analysis, where one often wants to see if a “thing” is in some “computationally defined” set (e.g., is an RGB pixel within some box in RGB space) and use SIMD parallelisation.
Without this kind of instruction you can do the testing in parallel, but then you’ve got to build a mask of results by manually extracting each vector element to a scalar and building the bitmap, so it seems an odd thing to leave out of an instruction set design.

Cheers, Orthochronous

Comment by Orthochronous — March 27, 2010 @ 6:01 pm

17. Hi Orthochronous,

An instruction like this seems to be missing. The best thing to simulate it looks roughly like this:

```
    # d1 = input
    # d2 = mask (128, 64, 32, 16, 8, 4, 2, 1)
    vand      d1, d1, d2
    vpadd.i8  d1, d1, d1
    vpadd.i8  d1, d1, d1
    vpadd.i8  d1, d1, d1
```

That fills each byte of d1 with the sum of all elements of d1. Since we’ve masked out the bits before the addition no overflow can happen, and we end up with a mask similar to movemask. The result will be stored in each byte of d1, but you can just extract the last byte later. The same trick can be extended to 128 bits as well.

Comment by Nils — March 27, 2010 @ 10:22 pm

18. Hi, Thanks for the insight. I’d never have thought of this approach, so it’s very useful.

Comment by Orthochronous — March 29, 2010 @ 6:10 am

19. I get some performance overhead when I add NEON intrinsics to my array operation code. The code fragments are shown below.

1. C code fragment:

```c
kvalue = SHIFTR(*Residue++, 6) + *Predicted++;
*Original++ = (byte) CLIPS(255, kvalue);
kvalue = SHIFTR(*Residue++, 6) + *Predicted++;
*Original++ = (byte) CLIPS(255, kvalue);
kvalue = SHIFTR(*Residue++, 6) + *Predicted++;
*Original++ = (byte) CLIPS(255, kvalue);
kvalue = SHIFTR(*Residue++, 6) + *Predicted++;
*Original++ = (byte) CLIPS(255, kvalue);
```

2.
NEON specific code:

```c
/* The Predicted array is of unsigned char type, so a type cast is
   done by assigning it to the integer array AryPred: */
AryPred[0] = *Predicted++;
AryPred[1] = *Predicted++;
AryPred[2] = *Predicted++;
AryPred[3] = *Predicted++;

neonResidue = vld1q_s32(Residue);

/* code for SHIFTR: */
neonResidue = vaddq_s32(neonResidue, addconst);
neonResidue = vshrq_n_s32(neonResidue, 6);

neonPredict = vld1q_s32(AryPred);
addedResult = vaddq_s32(neonResidue, neonPredict);

*Original++ = (byte) CLIPS(255, vgetq_lane_s32(addedResult, 0));
*Original++ = (byte) CLIPS(255, vgetq_lane_s32(addedResult, 1));
*Original++ = (byte) CLIPS(255, vgetq_lane_s32(addedResult, 2));
*Original++ = (byte) CLIPS(255, vgetq_lane_s32(addedResult, 3));

Residue += 4;

#define CLIPS(iF, iS) (iS > 0 ? (iS < iF ? iS : iF) : 0)
```

The following data types are used:

Residue -> int
Predicted -> unsigned char
Original -> unsigned char

The following data types are used to define the above array operations with NEON operations:

AryPred -> int
addedResult, neonPredict, neonResidue -> int32x4_t

What is the reason for the overhead when I repeat the code 4 times? Is there any memory alignment issue, or do any extra pipeline stalls come in here? Is there any performance gain if we change the data type of each array to the same type (neglecting the chance of overflow)?

Rgds
eDave

Comment by eDave — May 6, 2010 @ 12:33 pm

20. Nice post, I’ve also made the same observations when using the compiler vs hand-optimized NEON assembler. C functions that contain inline asm often produce very unpredictable and unwanted results when ‘optimized’ with both -O2 and -O3. Even -Os still suffers slightly in performance compared to a hand-optimized assembly routine. In short, we’ll get there, just not yet.

Comment by Christopher Friedt — June 6, 2010 @ 10:51 am

21. Nils, would you please tell me how to measure the timing of the code in a Linux environment? It seems that you execute the code on the BeagleBoard using Linux through an executable in ELF format.
How do you make sure the timing is not taken up by other tasks in Linux?

Thanks,
Shawn

Comment by shawn — October 20, 2010 @ 10:13 am

22. Hello, how did you measure or estimate the number of cycles each of the variants takes without looking at the assembly output?

Comment by jani — December 21, 2010 @ 12:10 pm

23. Hi, I am very new to Cortex-A9 and NEON coding. I have written NEON code for a C function and executed it on a beagle board, and the performance of the C version is better than the NEON code. What could be the problem? Please help me in this regard. My compiler options are