ARM NEON Optimization - An Example


Filed under: Beagleboard,OMAP3530 — Nils @ 8:15 pm

Since there is so little information about NEON optimizations out there I thought I’d write a little about it.

Some weeks ago someone on the beagle-board mailing-list asked how to optimize a color to grayscale conversion for images. I haven't done much pixel processing with ARM NEON yet, so I gave it a try. The results I got were quite spectacular, but more on this later.

For the color to grayscale conversion I used a very simple conversion scheme: a weighted average of the red, green and blue components. This conversion ignores the effect of gamma but works well enough in practice. Also, I decided not to do proper rounding. It's just an example after all. First, a reference implementation in C:


void reference_convert (uint8_t * __restrict dest, uint8_t * __restrict src, int n)
{
  int i;
  for (i=0; i<n; i++)
  {
    int r = *src++; // load red
    int g = *src++; // load green
    int b = *src++; // load blue

    // build weighted average:
    int y = (r*77)+(g*151)+(b*28);

    // undo the scale by 256 and write to memory:
    *dest++ = (y>>8);
  }
}

Optimization with NEON Intrinsics

Let's start optimizing the code using the compiler intrinsics. Intrinsics are nice to use because they behave just like C functions but compile to a single assembler statement. At least in theory, as I'll show you later.

Since NEON works on 64 or 128 bit registers, it's best to process eight pixels in parallel. That way we can exploit the parallel nature of the SIMD unit. Here is what I came up with:

void neon_convert (uint8_t * __restrict dest, uint8_t * __restrict src, int n)
{
  int i;
  uint8x8_t rfac = vdup_n_u8 (77);
  uint8x8_t gfac = vdup_n_u8 (151);
  uint8x8_t bfac = vdup_n_u8 (28);
  n/=8;

  for (i=0; i<n; i++)
  {
    uint16x8_t  temp;
    uint8x8x3_t rgb = vld3_u8 (src);
    uint8x8_t result;

    temp = vmull_u8 (rgb.val[0],      rfac);
    temp = vmlal_u8 (temp, rgb.val[1], gfac);
    temp = vmlal_u8 (temp, rgb.val[2], bfac);

    result = vshrn_n_u16 (temp, 8);
    vst1_u8 (dest, result);
    src  += 8*3;
    dest += 8;
  }
}

Let's take a look at it step by step:

First off I load my weight factors into three NEON registers. The vdup.8 instruction does this and also replicates the byte into all 8 bytes of the NEON register.

uint8x8_t rfac = vdup_n_u8 (77);
uint8x8_t gfac = vdup_n_u8 (151);
uint8x8_t bfac = vdup_n_u8 (28);

Now I load 8 pixels at once into three registers.

uint8x8x3_t rgb = vld3_u8 (src);

The vld3.8 instruction is a specialty of the NEON instruction set. With NEON you can not only do loads and stores of multiple registers at once, you can de-interleave the data on the fly as well. Since I expect my pixel data to be interleaved the vld3.8 instruction is a perfect fit for a tight loop.

After the load, I have all the red components of 8 pixels in the first loaded register. The green components end up in the second and blue in the third. Now calculate the weighted average:

temp = vmull_u8 (rgb.val[0],      rfac);
temp = vmlal_u8 (temp, rgb.val[1], gfac);
temp = vmlal_u8 (temp, rgb.val[2], bfac);

vmull.u8 multiplies each byte of the first argument with each corresponding byte of the second argument. Each result becomes a 16 bit unsigned integer, so no overflow can happen. The entire result is returned as a 128 bit NEON register pair. vmlal.u8 does the same thing as vmull.u8 but also adds the content of another register to the result.

So we end up with just three instructions for the weighted average of eight pixels. Nice. Now it's time to undo the scaling of the weight factors. To do so I shift each 16 bit result to the right by 8 bits, which equals a division by 256. ARM NEON has lots of instructions to do the shift, but a "narrow" variant exists as well. This one does two things at once: it does the shift and afterwards converts the 16 bit integers back to 8 bit by removing all the high-bytes from the result. We get back from the 128 bit register pair to a single 64 bit register.

result = vshrn_n_u16 (temp, 8);

And finally store the result.

vst1_u8 (dest, result);

First Results:

How do the reference C function and the NEON optimized version compare? I did a test on my OMAP3 Cortex-A8 CPU on the beagle-board and got the following timings:

C-version: 15.1 cycles per pixel.
NEON-version: 9.9 cycles per pixel.

That's only a speed-up of factor 1.5. I expected much more from the NEON implementation. It processes 8 pixels with just 6 instructions after all. What's going on here? A look at the assembler output explained it all. Here is the inner-loop part of the convert function:

 160: f46a040f  vld3.8    {d16-d18}, [sl]
 164: e1a0c005  mov       ip, r5
 168: ecc80b06  vstmia    r8, {d16-d18}
 16c: e1a04007  mov       r4, r7
 170: e2866001  add       r6, r6, #1    ; 0x1
 174: e28aa018  add       sl, sl, #24   ; 0x18
 178: e8bc000f  ldm       ip!, {r0, r1, r2, r3}
 17c: e15b0006  cmp       fp, r6
 180: e1a08005  mov       r8, r5
 184: e8a4000f  stmia     r4!, {r0, r1, r2, r3}
 188: eddd0b06  vldr      d16, [sp, #24]
 18c: e89c0003  ldm       ip, {r0, r1}
 190: eddd2b08  vldr      d18, [sp, #32]
 194: f3c00ca6  vmull.u8  q8, d16, d22
 198: f3c208a5  vmlal.u8  q8, d18, d21
 19c: e8840003  stm       r4, {r0, r1}
 1a0: eddd3b0a  vldr      d19, [sp, #40]
 1a4: f3c308a4  vmlal.u8  q8, d19, d20
 1a8: f2c80830  vshrn.i16 d16, q8, #8
 1ac: f449070f  vst1.8    {d16}, [r9]
 1b0: e2899008  add       r9, r9, #8    ; 0x8
 1b4: caffffe9  bgt       160

Note the store at offset 168? The compiler decides to write the three registers onto the stack. After a handful of useless memory accesses from the GPP side, the compiler reloads them (offsets 188, 190 and 1a0) into exactly the same physical NEON registers. What do all the ordinary integer instructions do? I have no idea. Lots of memory accesses target the stack for no good reason. There is definitely no shortage of registers anywhere. For reference: I used the GCC 4.3.3 (CodeSourcery 2009q1 lite) compiler.

NEON and assembler

Since the compiler can't generate good code I wrote the same loop in assembler. In a nutshell I just took the intrinsic-based loop and converted the instructions one by one. The loop-control is a bit different, but that's all.

convert_asm_neon:

      # r0: Ptr to destination data
      # r1: Ptr to source data
      # r2: Iteration count:

      push        {r4-r5,lr}
      lsr         r2, r2, #3

      # build the three constants:
      mov         r3, #77
      mov         r4, #151
      mov         r5, #28
      vdup.8      d3, r3
      vdup.8      d4, r4
      vdup.8      d5, r5

.loop:
      # load 8 pixels:
      vld3.8      {d0-d2}, [r1]!

      # do the weight average:
      vmull.u8    q3, d0, d3
      vmlal.u8    q3, d1, d4
      vmlal.u8    q3, d2, d5

      # shift and store:
      vshrn.u16   d6, q3, #8
      vst1.8      {d6}, [r0]!

      subs        r2, r2, #1
      bne         .loop

      pop         { r4-r5, pc }

Final Results:

Time for some benchmarking again. How does the hand-written assembler version compare? Well – here are the results:

C-version: 15.1 cycles per pixel.
NEON-version: 9.9 cycles per pixel.
Assembler: 2.0 cycles per pixel.

That's roughly a factor of five over the intrinsic version and 7.5 times faster than my not-so-bad C implementation. And keep in mind: I didn't even optimize the assembler loop.

My conclusion: If you want performance out of your NEON unit stay away from the intrinsics. They are nice as a prototyping tool. Use them to get your algorithm working and then rewrite the NEON-parts of it in assembler.

Btw: Sorry for the ugly syntax-highlighting. I'm still looking for a nice wordpress plug-in.

Comments (34)

1. Your post is very interesting. I was reading up about Neon intrinsics and I came across your post. This information is very useful. BTW I have a few doubts/comments.

1. Where are the src and dst pointers pointing to? Is it DDR? Or some faster memory like L3 memory in OMAP3? Can you achieve 2 cycles per pixel performance if data is in DDR and A8 accesses it through Dcache? My guess was that, if data is in DDR, Dcache misses would be a gating factor and result in worse performance than 2 cycles per pixel. Can you please clarify.

2. Can you try declaring the following variables outside the loop?

uint16x8_t temp;
uint8x8x3_t rgb = vld3_u8 (src);
uint8x8_t result;

I am wondering if declaring the variables inside the loop will cause the compiler to do strange things like reading/writing from the stack every iteration. Only a guess; it isn't necessarily right. Have you tried declaring them outside the loop?

3. Have you tried any other compiler? TI and ARM have A8 compilers that support Neon intrinsics. I have heard the TI compiler is pretty good.

Thanks and Regards,
Ranjith

Comment by Ranjith Parakkal — January 11, 2010 @ 12:45 pm

2. Hi Ranjith,

The two pointers point to DDR memory. Source is 192kb and dest 64kb in size (256*256 pixels each). No internal memory has been used, and the memory blocks are much larger than the cache. Also, as far as I know, internal or tightly coupled memory does not exist on the ARM-side of the OMAP3530.

I'll try to move the variable declaration out of the loop this evening. It shouldn't make a difference, but well – you never know until you've tried. As for trying the TI-compiler: that's a good idea.

Comment by admin — January 11, 2010 @ 7:28 pm

3. Nils,

Thanks a LOT for your reply. I will wait for you to try moving the declaration outside the loop and post the results here.

Just thinking out loud on how you have achieved 2 cycles per pixel.

Since you are processing 8 pixels every iteration, on average the following code should take 16 cycles:

.loop
    vld3.8    {d0-d2}, [r1]!    -- ?? cycles
    # do the weight average:
    vmull.u8  q3, d0, d3        -- one cycle
    vmlal.u8  q3, d1, d4        -- one cycle
    vmlal.u8  q3, d2, d5        -- one cycle
    # shift and store:
    vshrn.u16 d6, q3, #8        -- one cycle
    vst1.8    {d6}, [r0]!       -- ?? cycles
    subs      r2, r2, #1        -- one cycle
    bne       .loop             -- ?? cycles

The vector arithmetic operations and the subs operation should account for 5 cycles out of the 16, assuming each of them is single-cycle when pipelined. So that leaves about 11 cycles for the branch and the vector load and store operations. I think this should sort of account for the cache misses. I don't know much about ARM's cache structure. Lemme go read up about it.

BTW there is an L3 memory on OMAP3, which is not closely coupled with A8, but may still give better performance than cached-DDR.

Comment by Ranjith Parakkal — January 12, 2010 @ 10:18 am

4. Hi,

I need some help for using the Neon instructions of Cortex-A8 in my application. I am writing a image algorithm where I want to use NEON instructions of Cortex A8.

I have tested the program using RVDS, where I have used init.s and init_cache.s taken from the Examples of RVDS. We had also given one scatter file for placement of stack and heap while linking in RVDS. I want to run my program on the Beagle Board running Linux. My questions are:

1. Do we need to use init.s and init_cache.s in my program to run it on Beagle board?

2. Do we need scatter file to run the program on Beagle board? If yes, how to give scatter file using gnu ld.

3. When we removed '1' and '2', the linker gives the error "uses VFP registers".

Please help.
Mike

Comment by Mike — January 12, 2010 @ 12:53 pm

5. Hi Ranjith,

It's not that simple to count cycles in mixed ARM and NEON code. The NEON pipeline is very special. In a nutshell, the NEON unit sits logically behind the ARM unit and lags 5 cycles behind execution. It also has its own instruction queue and only executes one instruction per cycle (in practice).

In the code example the following will happen: the instruction decoder decodes two instructions per cycle. The ARM pipeline can execute two instructions per cycle. NEON instructions are treated as NOPs, so to say. So all the NEON instructions flow through the ARM pipeline and do nothing. After the 5 cycles they get stuffed into the NEON instruction queue. The NEON unit will now start to execute the queued instructions. It can only do one instruction per cycle (not exactly true, there are some pipelining possibilities, but I doubt my code uses any). Anyway, in practice the speed is just half of the ARM unit.

This has the following effect on timing: during the first two or three iterations of the conversion function the ARM unit generates NEON instructions much faster than the NEON unit can execute them. The queue gets filled up fast. Once the queue is filled, the NEON unit will always be able to execute while the ARM unit is mostly waiting for a free queue entry. All other ARM instructions will execute in parallel with the NEON unit, and the NEON unit will never run out of work.

So effectively all ARM instructions that are mixed between the NEON code are free. The entire performance is dominated by just the NEON timing. We end up with the six instructions executing within 16 cycles. I haven't measured, but I guess that the load instruction accounts for at least 50% of the time (due to cache misses) and the data-processing and store do the other half.

It would be fun to run the code on zero wait-state memory. Unfortunately I don't know about any such memory, even on the L3. Well – I could abuse the internal memory of the DSP, but I don't know how fast access from the ARM is.

Comment by Nils — January 12, 2010 @ 4:14 pm

6. Hi Mike.

I've never worked with RVDS, but if you use init.s and init_cache.s you're compiling your program for use without any OS, e.g. bare-bone or boot-loader style.

Since you want to execute the code on the BeagleBoard using Linux you need an executable in ELF-format.

Check the manual how to generate such an output-file. I’m sure it’s documented somewhere.

Comment by Nils — January 12, 2010 @ 4:20 pm

7. Hi,

Thanks for your reply.

I built my application without init.s, init_cache.s & the scatter file using -mfpu=neon. It built successfully.

When I ran my application on the beagleboard, it exited giving "Illegal Instruction", which I suspect happens when we encounter the first NEON instruction. We are using beagleboard version B4 & kernel 2.6.22.18. Could you tell me how to solve this issue?

Please help. Mike

Comment by Mike — January 13, 2010 @ 6:44 am

8. Hi Mike.

Have you tried debugging the executable with gdb to find out which instruction triggers the "Illegal Instruction" fault? That could give some clues about what is going wrong. Maybe there is still some code in your application that tries to directly access the hardware? Moves to the co-processor that configure the cache and the like will trigger an illegal instruction fault.

You may get better answers on the beagleboard mailing list. I suggest you ask there. I've never worked with RVDS, and I don't have any problems compiling my code using GCC on the PC or the beagleboard. Link to the beagleboard community:

http://groups.google.com/group/beagleboard

I'm using kernel 2.6.29 by the way – 2.6.22 is rather old. How come you're still using such an old kernel?

Comment by Nils — January 13, 2010 @ 5:49 pm

9. Hi Nils,

How did you get the assembler output generated? Did you use any specific compiler flag to generate assembler output?

I am using Cross Compiler arm-none-linux-gnueabi-gcc to compile my programs and test them on OMAP Zoom 2 platform.

I tried using the -S option, but it does not generate any assembler output file.

Thanks and Best Regards,
Venkat

Comment by Venkat — January 21, 2010 @ 2:42 pm

10. Hi Nils,

I have one more question. How did you build an executable using hand coded assembler file? Could you please let me know the steps?

Thanks and Best Regards, Venkat

Comment by Venkat — January 21, 2010 @ 5:04 pm

11. Hi Venkat,

I generated the assembler output using the gnu disassembler. arm-none-linux-gnueabi-objdump -d yourfile.o will do the trick!

To compile a raw assembler file do the same what you do with your .c files. E.g.

arm-none-linux-gnueabi-gcc -c yourfile.s

will generate a yourfile.o object file. I've uploaded the file onto my webspace, so you can use it as a boilerplate: rgb_to_gray.s

Cheers,
Nils

Comment by Nils — January 21, 2010 @ 10:03 pm

12. Nils (and everyone),

I’m an open source advocate at TI for OMAP. A couple of our engineers pointed me to this article. We are working with ARM and CodeSourcery to improve the quality of ARM compilers. We are pushing everything directly to gcc, and we hope to have some real progress by the end of 2010. We’ll keep you in the loop on the progress. Cheers, Chris P

Comment by Chris — March 8, 2010 @ 3:00 pm

13. Hi Chris,

That's great news!

I've been following the gcc-dev mailing list for two years now, and I have developed a feeling for how long or difficult some changes are and how gcc works on the inside. If you don't mind I would like to contact you privately and give some additional insights on what goes wrong inside gcc when it comes to performant ARM code. There are a couple of low-hanging fruits that would give instant performance improvements for Cortex-A8 code.

Also – since you've contacted me via the blog: don't miss Mans' blog on ARM code: http://www.hardwarebug.org He is _very_ competent when it comes to ARM and NEON. I'm sure he would be glad to share some of his insights with you as well.

Cheers,
Nils

Comment by Nils — March 8, 2010 @ 4:50 pm

14. The inefficient code generated by the vldX (where X>1) intrinsics is a known problem in gcc:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43118

Comment by Samuel — March 15, 2010 @ 1:13 pm

15. Hi,

I want to understand how the NEON pipeline works. I have read the Cortex-A8 technical reference manual but still want to understand instruction level scheduling.

Thanks,
Saurabh

Comment by Saurabh — March 26, 2010 @ 6:21 am

16. Hi, just a speculative post under an old blog entry in case you happen to know the answer to this?

I’ve been poring over various RealView documents about NEON, trying to understand the similarities and differences from Intel SSE-x. Am I right in saying that there’s no equivalent to SSE’s _mm_movemask_epi8 (which basically takes a SIMD vector and generates a standard integer bitmask based upon a predicate (in this case hardwired to be “top-bit set”) applied to each vector element)? There doesn’t seem to be but maybe I’m missing one hidden away.

This is probably an instruction that’s not that important for multimedia generation, but I’m interested in image analysis where one often wants to see if a “thing” is in some “computationally defined” set (eg, is an RGB pixel within some box in RGB space) and use SIMD parallelisation. Without this kind of instruction, you can do the testing in parallel but then you’ve got to build a mask of results by manually extracting each vector element to a scalar and build the bitmap, so it seems an odd thing to leave out of an instruction set design. Cheers,

Orthochronous

Comment by Orthochronous — March 27, 2010 @ 6:01 pm

17. Hi Orthochronous,

An instruction like this seems to be missing. The best thing to simulate it looks roughly like this:

    : d1 = input
    : d2 = mask (128, 64, 32, 16, 8, 4, 2, 1)
    vand      d1, d1, d2
    vpadd.i8  d1, d1, d1
    vpadd.i8  d1, d1, d1
    vpadd.i8  d1, d1, d1

That fills each byte of d1 with the sum of all elements of d1. Since we’ve masked out the bits before addition no overflow can happen and we end up with a mask similar to movemask. The result will be stored in each byte of d1, but you can just extract the last byte later. The same trick can be extended to 128 bits as well.

Comment by Nils — March 27, 2010 @ 10:22 pm

18. Hi,

Thanks for the insight. I’d never have thought of this approach, so it’s very useful.

Comment by Orthochronous — March 29, 2010 @ 6:10 am

19. I have some performance overhead when I add the Neon intrinsics to my array operation code.

The code fragments are shown below.

1. C Code Fragment:

kvalue = SHIFTR(*Residue++, 6) + *Predicted++;
*Original++ = (byte) CLIPS(255, kvalue);
kvalue = SHIFTR(*Residue++, 6) + *Predicted++;
*Original++ = (byte) CLIPS(255, kvalue);
kvalue = SHIFTR(*Residue++, 6) + *Predicted++;
*Original++ = (byte) CLIPS(255, kvalue);
kvalue = SHIFTR(*Residue++, 6) + *Predicted++;
*Original++ = (byte) CLIPS(255, kvalue);

2. Neon Specific Code:

/* Predicted array is unsigned char type, so a type cast is done by assigning it to the integer array AryPred */
AryPred[0] = *Predicted++;
AryPred[1] = *Predicted++;
AryPred[2] = *Predicted++;
AryPred[3] = *Predicted++;

neonResidue = vld1q_s32(Residue);

/* Code for SHIFTR */
neonResidue = vaddq_s32(neonResidue, addconst);
neonResidue = vshrq_n_s32(neonResidue, 6);

neonPredict = vld1q_s32(AryPred);
addedResult = vaddq_s32(neonResidue, neonPredict);

*Original++ = (byte) CLIPS(255, vgetq_lane_s32(addedResult, 0));
*Original++ = (byte) CLIPS(255, vgetq_lane_s32(addedResult, 1));
*Original++ = (byte) CLIPS(255, vgetq_lane_s32(addedResult, 2));
*Original++ = (byte) CLIPS(255, vgetq_lane_s32(addedResult, 3));
Residue += 4;

#define CLIPS(iF, iS) (iS > 0 ? (iS < iF ? iS : iF) : (0))

The data types of the arrays:

Residue -> int
Predicted -> unsigned char
Original -> unsigned char

The following data types are used to define the above array operations with Neon operations:

AryPred -> int
addedResult, neonPredict, neonResidue -> int32x4_t

What is the reason for the overhead when I repeat the code 4 times? Is there any memory alignment issue, or do any extra pipeline stalls come in here?

Is there any performance gain if we change the data type of each array to the same type (neglecting the chance of overflow)?

Rgds,
eDave

Comment by eDave — May 6, 2010 @ 12:33 pm

20. Nice post,

I've also made the same observations when using the compiler vs hand-optimized NEON assembler. C functions that contain inline asm often produce very unpredictable and unwanted results when 'optimized' with both -O2 and -O3. Even -Os still suffers slightly in performance compared to a hand-optimized assembly routine. In short – we'll get there, just not yet.

Comment by Christopher Friedt — June 6, 2010 @ 10:51 am

21. Nils,

would you please tell me how to measure the timing of the code in a Linux environment?

It seems that you execute the code on the BeagleBoard under Linux through an executable in ELF format.

How do you make sure the timing is not distorted by other tasks in Linux?

Thanks, Shawn

Comment by shawn — October 20, 2010 @ 10:13 am

22. Hello,

how did you measure or estimate the number of cycles each of the variants takes without looking at the assembly output?

Comment by jani — December 21, 2010 @ 12:10 pm

23. Hi,

I am very much new to Cortex-A9 and NEON coding. I have written NEON code for a C function and executed it on a beagle board; the performance of the C version is better than the NEON code.

What could be the problem? Please help me in this regard. My compiler options are
