
Even if you can't write assembly like a poet, you can read disassembly like a hunter

Reading disassembly is more like reading tracks than reading a book. To read a book you have to know the language. Reading tracks, although it gets better with skill and experience, mostly requires attentiveness and imagination.

Most of the time, we read disassembly only to answer one simple question: does the compiler do what we expect it to do? In three simple exercises, I'll show you that you can often answer this question yourself, even with no prior knowledge of assembly. I'll use C++ as the source language, but what I'm trying to show is more or less universal, so it doesn't matter whether you write in C or Swift, C# or Rust. If you compile to any kind of machine code, you can benefit from understanding your compiler.

1. Compile-time computation

Any decent compiler tries to make your binary code not only correct but fast. This means doing as little work at runtime as possible. Sometimes the compiler can even conduct the whole computation at compile time, so your machine code only contains the precomputed answer.

This source code defines the number of bits in a byte and returns the size of int in bits.

static int BITS_IN_BYTE = 8;

int main() {
    return sizeof(int)*BITS_IN_BYTE;
}

The compiler knows the size of an int. Let's say for the target platform it is 4 bytes. We also set the number of bits in a byte explicitly. Since all we want is a simple multiplication, and both numbers are known during compilation, the compiler can simply compute the resulting number itself instead of generating code that computes the same number every time it runs.

This is not guaranteed by the standard, though. A compiler may or may not perform this optimization.
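
Incidentally, BITS_IN_BYTE above is a mutable static variable, which is exactly why a compiler may feel obliged to load it from memory. If you want a guarantee rather than a hopeful optimization, modern C++ lets you demand compile-time evaluation explicitly. A minimal sketch of mine, assuming C++11 and the same 4-byte int:

constexpr int BITS_IN_BYTE = 8;
constexpr int INT_BITS = sizeof(int) * BITS_IN_BYTE; // computed while compiling

int main() {
    // Fails to compile, rather than misbehaves, if the assumption is wrong.
    static_assert(INT_BITS == 32, "this sketch assumes a 4-byte int");
    return INT_BITS; // no runtime multiplication left to do
}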

Now look at the two possible disassemblies of this source code below and decide which variant does the compile-time computation and which doesn't.

BITS_IN_BYTE:
  .long 8
main:
  mov eax, DWORD PTR BITS_IN_BYTE[rip]
  cdqe
  sal eax, 2
  ret

main:
  mov eax, 32
  ret

Of course, the second one does.

On our target platform, int's size is 4 bytes, which is 32 bits, which is exactly the number in the code. You might not know that an integer function conventionally returns its output in eax, which is a register. There are quite a few registers, but the most important for us are the general-purpose ones, specifically eax, ebx, ecx, and edx. Their names are, respectively: accumulator, base, counter, and data. They are not necessarily interchangeable. You can think of them as ultrafast predefined variables of known size. For instance, rax is 64 bits long. Its lower 32 bits are accessible by the name eax, and its lower 16 bits as ax, which in its own turn consists of two bytes, ah and al. These are all parts of the same register. Registers do not reside in RAM, so you can't read or write a register by address; the square brackets in disassembly usually indicate address manipulation. For instance, mov eax, DWORD PTR BITS_IN_BYTE[rip] means: take whatever lives at the address of BITS_IN_BYTE and put it into the eax register as a double word. But the thing is, the second variant already has the answer in it, so none of this even matters.
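
If this aliasing of rax, eax, ax, ah, and al seems abstract, here is a little C++ sketch of mine that mimics it with a union. It assumes a little-endian machine such as x86, and note that reading an inactive union member is technically undefined behavior in C++, although mainstream compilers support it as a C-style idiom:

#include <cstdint>
#include <cstdio>

// Mimics how eax, ax, ah, and al alias parts of one rax register.
union FakeRax {
    std::uint64_t rax;
    std::uint32_t eax;                 // lower 32 bits
    std::uint16_t ax;                  // lower 16 bits
    struct { std::uint8_t al, ah; } b; // the two bytes of ax
};

int main() {
    FakeRax r;
    r.rax = 0x1122334455667788;
    std::printf("eax=%x ax=%x ah=%x al=%x\n",
                (unsigned)r.eax, (unsigned)r.ax,
                (unsigned)r.b.ah, (unsigned)r.b.al);
    return 0; // prints eax=55667788 ax=7788 ah=77 al=88 on x86
}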

2. Function inlining

Calling a function implies some overhead: preparing the input data in a particular order; then starting execution from another piece of memory; then preparing the output data; and then returning back.

Not that it's all too slow, but if you only want to call a function once, you don't have to actually call it. It just makes sense to copy, or "inline", the function's body into the place it's called from and skip all the formalities. Compilers can often do this for you, so you don't even have to bother.

If the compiler makes such an optimization, this code:

inline int square(int x)  {
    return x * x;
}


int main(int argc, char** argv)  {
    return square(argc);
}

Virtually becomes this:

// not really a source code, just explaining the idea
int main(int argc, char** argv)  {
    return argc * argc;
}
        

But the standard does not promise that every function marked inline will actually get inlined. It's more of a suggestion than a directive.
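
That said, if a particular function really must be inlined, most compilers accept a stronger hint than the standard keyword. A sketch of mine, assuming GCC or Clang on one side and MSVC on the other:

// Stronger-than-standard inlining requests. Still not an absolute
// guarantee, but GCC and Clang will complain if they can't comply.
#if defined(_MSC_VER)
    #define ALWAYS_INLINE __forceinline
#else
    #define ALWAYS_INLINE inline __attribute__((always_inline))
#endif

ALWAYS_INLINE int square(int x) {
    return x * x;
}

int main(int argc, char**) {
    return square(argc);
}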

Now look at these two disassembly variants below and choose the one where the function gets inlined after all.

main:
  imul edi, edi
  mov eax, edi
  ret
square(int):
  imul edi, edi
  mov eax, edi
  ret

main:
  sub rsp, 8
  call square(int)
  add rsp, 8
  ret

Not really a mystery either. It's the first one. You might not know that the instruction to call a function is really named call. It stores the address of the next instruction after the call on the stack, then sets a special register, the instruction pointer, to the function's address, and the processor hence jumps to run the function. The stack, by the way, is a piece of memory organized in a first-in-last-out fashion: if you, for instance, push rax and rbx there and then pop rax and rbx, the two values will get swapped. The function then uses ret to get the stored address back from the stack and make the processor jump back to where it was called from. But since the first disassembly doesn't contain any call to square at all, the function must have been inlined anyway.
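
You can even peek at the address that call leaves on the stack. A toy sketch of mine, assuming GCC or Clang, which provide a builtin for exactly that:

#include <cstdio>

// noinline, because an inlined function is not called,
// so it has no return address to show.
__attribute__((noinline)) void who_called_me() {
    // The address of the instruction right after the call in main.
    std::printf("will ret to %p\n", __builtin_return_address(0));
}

int main() {
    who_called_me();
    return 0;
}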

3. Loop unrolling

Just like calling functions, going in loops implies some overhead. You have to increment the counter; then compare it against some number; then jump back to the loop's beginning.

Compilers know that in some contexts it is more efficient to unroll the loop. It means that some piece of code will actually be repeated several times in a row instead of messing with the counter comparison and jumping here and there.

Let's say we have this piece of code:

int main(int argc, char**) {
    int result = 1;
    for(int i = 0; i < 3; ++i)
        result *= argc;
    return result;
}
        

The compiler has every reason to unroll such a simple loop, but it may just as well choose not to.
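
If you'd rather nudge the compiler than trust its judgment, some compilers accept unrolling pragmas. A sketch of mine, assuming Clang's spelling; GCC 8 and later spell it #pragma GCC unroll 3:

int main(int argc, char**) {
    int result = 1;
    // Ask the compiler to unroll this loop completely.
    #pragma unroll
    for (int i = 0; i < 3; ++i)
        result *= argc;
    return result;
}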

Which disassembly has the unrolled loop?

main:
  mov eax, 1
  mov ecx, 3
.LBB0_1:
  imul eax, edi
  dec ecx
  jne .LBB0_1
  ret

main:
  mov eax, edi
  imul eax, edi
  imul eax, edi
  ret

It's the second one.

You might not know that j<*> is the family of jump instructions. There is one unconditional jump, jmp, and a bunch of conditional jumps: jz (jump when zero), jg (jump when greater), or, like in our code, jne (jump when not equal). These react to boolean flags previously set by the processor: bits residing in a special register that get set by arithmetic instructions such as add or sub, or by cmp, a special instruction for comparing things. There is usually a whole family of instructions behind a simple mnemonic; add, for instance, has these relatives: fadd (floating-point addition), paddb (add packed byte integers), addss (add scalar single-precision floating-point values), and many more. Then again, the second variant clearly has a repeating pattern, while the first contains the number three, the loop's exit condition, and that alone should be enough to draw a conclusion.

Conclusion

You can argue that these examples are deliberately simplified, that they are not real-life examples. That's true to some degree: I refined them to be more demonstrative. But conceptually, they are all taken from my own practice.

Using static dispatch instead of dynamic dispatch made my image processing pipeline up to 5 times faster. Repairing broken inlining helped win back 50% of the performance of an edge-to-edge distance function. And changing a counter type to enable loop unrolling got me about a 10% performance gain on matrix transformations, which is not much, but since all it took was changing short int to size_t in one place, I think of it as a good return on investment.
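
The original matrix code isn't shown here, but the change was of this kind (the function and its names are made up for illustration):

#include <cstddef>

// Hypothetical 4x4 matrix-times-vector transformation.
void transform(const float m[4][4], const float v[4], float out[4]) {
    // The counter used to be a short int, which kept old MSVC from
    // unrolling this loop; the native-size size_t unrolls just fine.
    for (std::size_t i = 0; i < 4; ++i)
        out[i] = m[i][0] * v[0] + m[i][1] * v[1]
               + m[i][2] * v[2] + m[i][3] * v[3];
}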

Apparently, old versions of MSVC fail to unroll loops with counters of non-native type. Who would have thought? Well, even if you know this particular quirk, you can't possibly know every other quirk of every compiler out there, so looking at disassembly once in a while might be good for you.

And you don't even have to spend years learning every assembly dialect. Reading disassembly is often easier than it looks. Try it.


