
Why C is faster than assembly

By Colin Walls

I am very interested in the pros and cons of various programming languages for embedded applications. Although I mostly favor C, I sometimes prefer C++, and I am always open to suggestions. I started out writing assembly language and still feel that it is the “real thing”. I posted about this topic some time ago and expanded my thoughts in a more recent article on embedded.com.

I am always pleased when one of my colleagues offers to write a guest blog, as I think that different views can be interesting. Brooks Moses, who has kindly contributed a number of times before, has some very interesting views and experience to share …

The debates about which programming language is fastest are never-ending. Is Fortran faster than C? Is C++ slower? What about Java? The only thing people seem to agree on is that when you really want the absolute best performance on some critical inner loop, you use assembly code.

Like most things people say in these debates, this is wrong.

That’s a pretty audacious claim to be making, so let me back this up with some real-world numbers. The Embedded Software Division at Mentor Graphics has recently been writing a computational library for Freescale’s new e6500 Power-architecture processors. We call it the Mentor Embedded Performance Library, so it’s our reputation (as well as the reputation of Freescale’s e6500 chips) that’s on the line for this to perform well.

I’ve recently been working on the elementwise functions in this library — these are the functions that compute the square root, the cosine, and so forth for each element in an array. They’re pretty simple functions, but there are about 190 of them in the library, and we have an aggressive schedule that meant I had only a couple of weeks to get all of these done.

The e6500 has an AltiVec SIMD vector unit, so the inner loop of one of these functions looks about like this:

for (i = 0; i < N; i += 4)
{
    /* load a 4-element SIMD vector of input */
    /* process it with several AltiVec instructions */
    /* save a 4-element SIMD vector of output */
}

Whether you write that in C or assembly, it’s going to be slow if you actually write it like that. (For example, a square root function written this way takes 26 cycles for each SIMD vector, while a better-optimized version can do it in 11.) The e6500 CPU can start executing a typical AltiVec floating-point instruction every cycle, but there is a 6-cycle latency between when the instruction starts executing and when the output register can be used by another instruction. If you try to use the output register sooner than that, the CPU waits in a “pipeline stall” until it’s ready. That’s what’s happening here; each AltiVec instruction in the chain of instructions that processes the input is working with the output from the previous instruction, so the CPU is spending most of its time stalled.
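To make that concrete, here is a rough sketch, not the library’s actual code, of what such a naive loop might look like in C with GCC’s AltiVec intrinsics. The stand-in computation is a square-root approximation (a vec_rsqrte estimate, one Newton-Raphson refinement, then a multiply by the input); every instruction consumes the result of the one before it, which is exactly the dependency chain that causes the stalls:

#include <altivec.h>

/* Naive loop sketch: sqrt(x) ~= x * refined rsqrte(x), for positive
   inputs. Each instruction uses the previous result, so the CPU
   spends most of the 6-cycle latency stalled. Assumes n is a
   multiple of 4 and the arrays are 16-byte aligned. */
void vec_sqrt_naive(const float *in, float *out, int n)
{
    const vector float zero  = {0.0f, 0.0f, 0.0f, 0.0f};
    const vector float half  = {0.5f, 0.5f, 0.5f, 0.5f};
    const vector float three = {3.0f, 3.0f, 3.0f, 3.0f};
    int i;
    for (i = 0; i < n; i += 4) {
        vector float x  = vec_ld(0, in + i);        /* load 4 floats     */
        vector float e  = vec_rsqrte(x);            /* ~1/sqrt(x)        */
        vector float e2 = vec_madd(e, e, zero);     /* e*e               */
        vector float t  = vec_nmsub(x, e2, three);  /* 3 - x*e*e         */
        vector float r  = vec_madd(e, t, zero);     /* e*(3 - x*e*e)     */
        r = vec_madd(r, half, zero);                /* refined 1/sqrt(x) */
        r = vec_madd(x, r, zero);                   /* x * 1/sqrt(x)     */
        vec_st(r, 0, out + i);                      /* store 4 floats    */
    }
}

The exact instruction chain in the library’s square root is different, but the shape of the problem is the same: one long chain of dependent instructions per vector.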

Any good assembly programmer knows the way to solve that. The computation for each SIMD vector of four elements is independent, so if we partly “unroll” the loop and compute four of them on each pass through the loop, we can interleave the instructions and avoid stalls. The CPU can start the first AltiVec instruction for the first vector, and then rather than stalling on the next cycle it can start the first AltiVec instruction for the next vector, and so on. Oh, and there are some additional nuances; an e6500 CPU can start a computation instruction and a load-store instruction on the same cycle, so we want to adjust the instruction order to take advantage of that as much as we can. Then, because all these instructions are operating at once, we need to decide which register is used by which set of computations and make sure that’s all kept straight.

Or we could just let the compiler do all the instruction ordering and register allocation. This is one of the things that compilers are particularly good at and have been for decades. Compilers aren’t (currently) very good at unrolling loops or converting simple C code to AltiVec instructions, but we can write an unrolled loop in C with GCC “compiler intrinsic” functions to tell it what AltiVec instructions to use. The compiler will then find a near-optimal order for the instructions and allocate the registers in a fraction of a second rather than a fraction of an hour, and we can then take a minute or two to have a look at the generated assembly code to make sure it hasn’t done something dumb. The result will be just as good as what someone might write by hand in assembly code.
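As a hedged illustration, continuing the hypothetical vec_sqrt_naive sketch above rather than the library’s actual code, here is what a four-way unrolled version might look like. The change is almost entirely mechanical: four independent copies of the chain per pass, which the compiler is then free to interleave and assign registers for:

#include <altivec.h>

/* One Newton-Raphson refinement of a vec_rsqrte estimate, then a
   multiply by x to give sqrt(x); the same chain as the naive sketch. */
static inline vector float vsqrt4(vector float x)
{
    const vector float zero  = {0.0f, 0.0f, 0.0f, 0.0f};
    const vector float half  = {0.5f, 0.5f, 0.5f, 0.5f};
    const vector float three = {3.0f, 3.0f, 3.0f, 3.0f};
    vector float e = vec_rsqrte(x);
    vector float t = vec_nmsub(x, vec_madd(e, e, zero), three);
    e = vec_madd(vec_madd(e, t, zero), half, zero);
    return vec_madd(x, e, zero);
}

/* 4x-unrolled sketch: four independent dependency chains per pass,
   which the compiler interleaves and allocates registers for.
   Assumes n is a multiple of 16 and 16-byte-aligned arrays. */
void vec_sqrt_unroll4(const float *in, float *out, int n)
{
    int i;
    for (i = 0; i < n; i += 16) {
        vector float r0 = vsqrt4(vec_ld( 0, in + i));
        vector float r1 = vsqrt4(vec_ld(16, in + i));
        vector float r2 = vsqrt4(vec_ld(32, in + i));
        vector float r3 = vsqrt4(vec_ld(48, in + i));
        vec_st(r0,  0, out + i);
        vec_st(r1, 16, out + i);
        vec_st(r2, 32, out + i);
        vec_st(r3, 48, out + i);
    }
}

Making an eight-way or two-way variant of this really is just cut-and-paste; the compiler redoes the scheduling and register allocation each time.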

But that’s not “faster”, I hear you objecting; it’s “just as good.” Where does “faster” come in?

Earlier, I said we’d compute four vectors on each pass through the loop. Why four? It turns out that four is a reasonable guess for the best choice, but for some of these functions doing eight at once is better, while for some it’s better to do only two, or occasionally only one.

And that’s where “faster” comes in. In assembly, changing the number of vectors we do on each pass through the loop requires laboriously going through that ordering and register-allocation step all over again, and probably debugging a few bugs. But, in C, it’s a simple cut-and-paste. So simple even a Python script could do it — and that’s what I did, and generated a table of the performance results for a bunch of different options for all of the various functions. Here’s an excerpt — remember, there are over a hundred lines of this and several additional columns:

loop unroll amount       1     2     4     8
addition                 9     8     7     5
arccosine              155   260   311   357
arctangent              56    34    42    64
complex multiply        39    28    23    30
cosine                  66    53    73    82
division                21    15    11     8
error function          89   102   107   104
exponent                57    34    34    48
natural log             56    33    42    66
reciprocal              15    12     9     6
square root             26    16    13    11
tangent                149   172   189   224

Loop execution times (cycles per SIMD vector)

It’s immediately clear that for square root, division, and addition we want to do eight vectors at a time, but only two for arctangent and cosine, and only one for tangent and arccosine. If we’d written everything in assembly, there’s no way we could have done this sort of experimentation in the time we had. About the best we could have done was pick a sample function and try a few things, and then use that to guess the right number to use for all the other functions — and we’d probably get it wrong on many of them. As you can see in the table, guessing wrong can lead to measurably slower code.

So, if you want the absolute best performance on a critical inner loop, write it in C. Or Fortran, or C++; it doesn’t matter. Read the generated assembly, and adjust the C code until it looks good. Then experiment with unrolling loops and trying different variations, and measure each one until you get the best numbers. The reason C is faster than assembly is that the only way to write optimal code is to measure it on a real machine, and with C you can run many more experiments, much faster.
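The measuring itself can be nearly as cheap as the experiments. A minimal timing sketch, assuming GCC on Power and the hypothetical vec_sqrt_unroll4 above; __builtin_ppc_get_timebase() reads the CPU time base register (on toolchains without it, an mftb inline asm does the same job):

#include <stdio.h>

extern void vec_sqrt_unroll4(const float *in, float *out, int n);

/* Minimal timing sketch: average timebase ticks per SIMD vector for
   one variant. A real harness would repeat the run and keep the best. */
void measure(const float *in, float *out, int n)
{
    unsigned long long t0 = __builtin_ppc_get_timebase();
    vec_sqrt_unroll4(in, out, n);
    unsigned long long t1 = __builtin_ppc_get_timebase();
    printf("%.2f ticks per SIMD vector\n", (double)(t1 - t0) / (n / 4));
}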

Oh, and use the right algorithm; that matters more than everything else put together. Last week I was working on our floating-point sort routine, and after I had a rather good quicksort implementation working, I tried a better sort algorithm and cut the execution time by more than half — again, a win for experimentation. But that’s a story for another time!

Comments

  • Good analysis. People rush to “optimize the code” but they don’t bother to measure or look for the best algorithm.

    I have had to optimize code four times in the last 20 years, and knew, by design, where that was going to be. I got the code to work, measured it (yep, that needs to run faster) and optimized it: I either re-arranged the code to help the compiler, or cloned the code and cut out the error checking. I could run the error-checking version when time was not critical, and the really fast one when it was.

  • BTW, this architecture reminds me of the Floating Point Systems array processors. Adds took one cycle, multiplies two (and you had to push data through the pipe), and memory fetches three.

  • Can you not keep making the same argument, and claim we should generate in language N+1, where higher means more abstract? The interesting question then becomes: at what point do you hit diminishing returns in the mapping from language N+1 down to assembly? You seem to believe that point is language 1 (C); I’d be interested to hear stories, or even better experiments, from people working in languages 3 and above 🙂
    Thanks for your article.

  • I have some misgivings here, I’m afraid! I often call this “fighting the compiler”. So you write some C, disassemble the output, readjust the C, recompile, disassemble again, repeat until you get the assembler you want. What you are effectively doing is using the C compiler to write assembly code. It’s often easier to write the assembly code by hand in the first place, to be honest, since you obviously have a clear idea of what output you want the compiler to produce.

    The only clear advantage to doing it your way is that the resulting C code will recompile for other architectures, even though it might not work as well.

  • Premature Optimization is strong with this article.

    1) Programs need to know the loop length to determine which is more efficient.
    2) Latency and throughput matter. For small iteration counts, latency matters more; for large iteration counts, throughput matters more.

    The article assumes the iteration count is always large, and then tries to cover the small-iteration case by saying that sometimes a 2x unroll, or a 1x unroll (no unrolling at all), is better. With large iteration counts, 8x and 4x unrolls are good. What the article really does is let the compiler do the register renaming; it is still writing assembler code, just with the compiler handling the cut, paste, and rename that the author, despite saying it is “so simple even a Python script could do it”, could not do properly by hand.

    It’s actually so simple that any macro assembler could do it. Yes, any macro assembler could do proper loop unrolling to the appropriate level with a switch statement on iteration count.


This article first appeared on the Siemens Digital Industries Software blog at https://blogs.sw.siemens.com/embedded-software/2013/02/18/why-c-is-faster-than-assembly/