Why, in 2021, does anyone need to learn about assembly language? First, reading assembly language is the way to know exactly what your program is doing. Why, exactly, is that C++ program 1 MiB (say) instead of 100 KiB? Is it possible to squeeze some more performance out of that function that gets called all the time?
For C++ in particular, it is easy to forget or just not notice some operation (e.g., an implicit conversion or a call to a copy constructor or destructor) that is implied by the source code and language semantics, but not spelled out explicitly. Looking at the assembly generated by the compiler puts everything in plain sight.
Second, the more practical reason: so far, posts on this blog haven’t required an understanding of assembly language, despite constant links to Compiler Explorer. By popular demand, however, our next topic will be parameter passing, and for that, we will need a basic understanding of assembly language. We will focus only on reading assembly language, not writing it.
Instructions
The basic unit of assembly language is the instruction. Each machine instruction is a small operation, like adding two numbers, loading some data from memory, jumping to another memory location (like the dreaded goto statement), or calling or returning from a function. (The x86 architecture has lots of not-so-small instructions as well. Some of these are legacy cruft built up over the 40-odd years of the architecture’s existence, and others are newfangled additions.)
Example 1: Vector norm
Our first toy example will get us acquainted with simple instructions. It just calculates the square of the norm of a 2D vector:
#include <cstdint>

struct Vec2 {
    int64_t x;
    int64_t y;
};

int64_t normSquared(Vec2 v) {
    return v.x * v.x + v.y * v.y;
}
and here is the resulting x86_64 assembly from clang 11, via Compiler Explorer:
imulq %rdi, %rdi
imulq %rsi, %rsi
leaq (%rsi,%rdi), %rax
retq
Let’s talk about that first instruction: `imulq %rdi, %rdi`. This instruction performs signed integer multiplication. The `q` suffix tells us that it is operating on 64-bit quantities. (In contrast, `l`, `w`, and `b` would denote 32-bit, 16-bit, and 8-bit, respectively.) It multiplies the value in the first given register (`rdi`; register names are prefixed with a `%` sign) by the value in the second register and stores the result in that second register. This is squaring `v.x` in our example C++ code.

The second instruction does the same with the value in `%rsi`, which squares `v.y`.
Next, we have an odd instruction: `leaq (%rsi,%rdi), %rax`. `lea` stands for “load effective address”: it computes the address of its first operand and stores it in the second operand. `(%rsi,%rdi)` means “the memory location pointed to by `%rsi + %rdi`”, so this is just adding `%rsi` and `%rdi` and storing the result in `%rax`. `lea` is a quirky x86-specific instruction; on a more RISC-y architecture like ARM64, we would expect to see a plain old `add` instruction.

Finally, `retq` returns from the `normSquared` function.
Registers
Let’s take a brief detour to explain what the registers we saw in our example are. Registers are the “variables” of assembly language. Unlike your favorite programming language (probably), there are a finite number of them, they have standardized names, and the ones we’ll be talking about are at most 64 bits in size. Some of them have specific uses that we’ll see later. I wouldn’t be able to rattle this off from memory, but per Wikipedia, the full list of 16 registers on x86_64 is `rax`, `rcx`, `rdx`, `rbx`, `rsp`, `rbp`, `rsi`, `rdi`, `r8`, `r9`, `r10`, `r11`, `r12`, `r13`, `r14`, and `r15`.
Example 2: The stack
Now, let’s extend our example to debug print the `Vec2` in `normSquared`:
#include <cstdint>

struct Vec2 {
    int64_t x;
    int64_t y;
    void debugPrint() const;
};

int64_t normSquared(Vec2 v) {
    v.debugPrint();
    return v.x * v.x + v.y * v.y;
}
and, again, let’s see the generated assembly:
subq $24, %rsp
movq %rdi, 8(%rsp)
movq %rsi, 16(%rsp)
leaq 8(%rsp), %rdi
callq Vec2::debugPrint() const
movq 8(%rsp), %rcx
movq 16(%rsp), %rax
imulq %rcx, %rcx
imulq %rax, %rax
addq %rcx, %rax
addq $24, %rsp
retq
In addition to the obvious call to `Vec2::debugPrint() const`, we have some other new instructions and registers! `%rsp` is special: it is the “stack pointer”, used to maintain the function call stack. It points to the top of the stack, which grows “down” (toward lower addresses) on x86. So, our `subq $24, %rsp` instruction is making space for three 64-bit integers on the stack. (In general, setting up the stack and registers at the start of a function is called the function prologue.) Then, the following two `mov` instructions store the first and second arguments to `normSquared`, which are `v.x` and `v.y` (more about how parameter passing works in the next blog post!), to the stack, effectively creating a copy of `v` in memory at address `%rsp + 8`. Next, we load the address of our copy of `v` into `%rdi` with `leaq 8(%rsp), %rdi` and then call `Vec2::debugPrint() const`.
After `debugPrint` has returned, we load `v.x` and `v.y` back into `%rcx` and `%rax`. We have the same `imulq` and `addq` instructions as before. Finally, we `addq $24, %rsp` to clean up the 24 bytes of stack space we allocated at the start of our function (this cleanup is called the function epilogue), and then return to our caller with `retq`.
Example 3: Frame pointer and control flow
Now, let’s look at a different example. Suppose that we want to print an uppercased C string and we’d like to avoid heap allocations for smallish strings. We might write something like the following:
#include <cstdio>
#include <cstring>
#include <memory>

void copyUppercase(char *dest, const char *src);

constexpr size_t MAX_STACK_ARRAY_SIZE = 1024;

void printUpperCase(const char *s) {
    auto sSize = strlen(s);
    if (sSize <= MAX_STACK_ARRAY_SIZE) {
        char temp[sSize + 1];
        copyUppercase(temp, s);
        puts(temp);
    } else {
        // std::make_unique_for_overwrite is missing on Godbolt.
        std::unique_ptr<char[]> temp(new char[sSize + 1]);
        copyUppercase(temp.get(), s);
        puts(temp.get());
    }
}
Here is the generated assembly:
printUpperCase(char const*): # @printUpperCase(char const*)
pushq %rbp
movq %rsp, %rbp
pushq %r15
pushq %r14
pushq %rbx
pushq %rax
movq %rdi, %r14
callq strlen
leaq 1(%rax), %rdi
cmpq $1024, %rax # imm = 0x400
ja .LBB0_2
movq %rsp, %r15
movq %rsp, %rbx
addq $15, %rdi
andq $-16, %rdi
subq %rdi, %rbx
movq %rbx, %rsp
movq %rbx, %rdi
movq %r14, %rsi
callq copyUppercase(char*, char const*)
movq %rbx, %rdi
callq puts
movq %r15, %rsp
leaq -24(%rbp), %rsp
popq %rbx
popq %r14
popq %r15
popq %rbp
retq
.LBB0_2:
callq operator new[](unsigned long)
movq %rax, %rbx
movq %rax, %rdi
movq %r14, %rsi
callq copyUppercase(char*, char const*)
movq %rbx, %rdi
callq puts
movq %rbx, %rdi
leaq -24(%rbp), %rsp
popq %rbx
popq %r14
popq %r15
popq %rbp
jmp operator delete[](void*) # TAILCALL
Our function prologue has gotten a lot longer, and we have some new control flow instructions as well. Let’s take a closer look at the prologue:
pushq %rbp
movq %rsp, %rbp
pushq %r15
pushq %r14
pushq %rbx
pushq %rax
movq %rdi, %r14
The `pushq %rbp; movq %rsp, %rbp` sequence is very common: it pushes the frame pointer stored in `%rbp` to the stack and saves the current stack pointer (which becomes the new frame pointer) in `%rbp`. The next three `pushq` instructions save registers that the function is about to reuse; the last one, `pushq %rax`, doesn’t save anything we care about and simply moves `%rsp` down by 8 bytes to keep the stack 16-byte aligned for the calls that follow. Finally, we save our first argument (`%rdi`) in `%r14`.
On to the function body. We call `strlen(s)` with `callq strlen` and store `sSize + 1` in `%rdi` with `leaq 1(%rax), %rdi`.
Next, we finally see our first `if` statement! `cmpq $1024, %rax` sets the flags register according to the result of `%rax - 1024`, and then `ja .LBB0_2` (“jump if above”) transfers control to the location labeled `.LBB0_2` if the flags indicate that `%rax > 1024`. In general, higher-level control-flow primitives like `if`/`else` statements and loops are implemented in assembly using conditional jump instructions.
Let’s first look at the path where `%rax <= 1024` and thus the branch to `.LBB0_2` was not taken. We have a blob of instructions to create `char temp[sSize + 1]` on the stack:
movq %rsp, %r15
movq %rsp, %rbx
addq $15, %rdi
andq $-16, %rdi
subq %rdi, %rbx
movq %rbx, %rsp
We save `%rsp` to `%r15` and `%rbx` for later use. Then, we add 15 to `%rdi` (which, remember, contains the size of our array), mask off the low 4 bits with `andq $-16, %rdi`, and subtract the result from `%rbx`, which we then put back into `%rsp`. In short, this rounds the array size up to the next multiple of 16 bytes and makes space for it on the stack.
The following block simply calls `copyUppercase` and `puts` as written in the code:
movq %rbx, %rdi
movq %r14, %rsi
callq copyUppercase(char*, char const*)
movq %rbx, %rdi
callq puts
Finally, we have our function epilogue:
movq %r15, %rsp
leaq -24(%rbp), %rsp
popq %rbx
popq %r14
popq %r15
popq %rbp
retq
We restore the stack pointer, which deallocates our variable-length array, and then `leaq -24(%rbp), %rsp` points `%rsp` back at the registers we saved in the prologue. We `popq` those saved registers, return control to our caller, and we are done.
Next, let’s look at the path where `%rax > 1024` and we branch to `.LBB0_2`. This path is more straightforward. We call `operator new[]`, save the result (returned in `%rax`) to `%rbx`, and call `copyUppercase` and `puts` as before. We have a separate function epilogue for this case, and it looks a bit different:
movq %rbx, %rdi
leaq -24(%rbp), %rsp
popq %rbx
popq %r14
popq %r15
popq %rbp
jmp operator delete[](void*) # TAILCALL
The first `mov` sets up `%rdi` with the pointer to our heap-allocated array that we saved earlier. As with the other function epilogue, we then restore the stack pointer and pop our saved registers. Finally, we have a new instruction: `jmp operator delete[](void *)`. `jmp` is just like `goto`: it transfers control to the given label or function. Unlike `callq`, it does not push a return address onto the stack, so when `operator delete[]` returns, it will transfer control directly to `printUpperCase`’s caller. In essence, we’ve combined a `callq` to `operator delete[]` with our own `retq`. This is called tail call optimization, hence the `# TAILCALL` comment helpfully emitted by the compiler.
Practical application: catching surprising conversions
I said in the introduction that reading assembly makes implicit copy and destroy operations abundantly clear. We saw some of that in our previous example, but I want to close by looking at a common C++ move semantics debate. Is it OK to take parameters by value in order to avoid having one overload for lvalue references and another overload for rvalue references? There is a school of thought that says “yes, because in the lvalue case you will make a copy anyway, and in the rvalue case it’s fine as long as your type is cheap to move”. If we take a look at an example for the rvalue case, we will see that “cheap to move” does not mean “free to move”, as much as we might prefer otherwise. If we want maximum performance, we can demonstrate that the overload solution will get us there and the by-value solution will not. (Of course, if we aren’t willing to write extra code to improve performance, then “cheap to move” is probably cheap enough.)
#include <string>

class MyString {
    std::string str;
public:
    explicit MyString(const std::string& s);
    explicit MyString(std::string&& s);
};

class MyOtherString {
    std::string str;
public:
    explicit MyOtherString(std::string s);
};

void createRvalue1(std::string&& s) {
    MyString s2(std::move(s));
}

void createRvalue2(std::string&& s) {
    MyOtherString s2(std::move(s));
}
If we look at the generated assembly (which is too long to include, even though I’ve intentionally outlined the constructors in question), we can see that `createRvalue1` does 1 move operation (inside the body of `MyString::MyString(std::string&&)`) and 1 `std::string::~string()` call (the `operator delete` before returning). In contrast, `createRvalue2` is much longer: it does a total of 2 move operations (1 inline, into the `s` parameter for `MyOtherString::MyOtherString(std::string s)`, and 1 in the body of that same constructor) and 2 `std::string::~string` calls (1 for the aforementioned `s` parameter and 1 for the `MyOtherString::str` member). To be fair, moving a `std::string` is cheap, and so is destroying a moved-from `std::string`, but neither is free in terms of CPU time or code size.
Further reading
Assembly language dates back to the late 1940s, so there are plenty of resources for learning about it. Personally, my first introduction to assembly language was in the EECS 370: Introduction to Computer Organization junior-level course at my alma mater, the University of Michigan. Unfortunately, most of the course materials linked on that website are not public. Here are what appear to be the corresponding “how computers really work” courses at Berkeley (CS 61C), Carnegie Mellon (15-213), Stanford (CS107), and MIT (6.004). (Please let me know if I’ve suggested the wrong course for any of these schools!) Nand to Tetris also appears to cover similar material, and the projects and book chapters are freely available.
My first practical exposure to x86 assembly in particular was in the context of security exploits, or learning to become a “l33t h4x0r”, as the kids used to say. If this strikes you as a more entertaining reason to learn about assembly, great! The classic work in the space is Smashing the Stack for Fun and Profit. Unfortunately, modern security mitigations complicate running the examples in that article on your own, so I recommend finding a more modern practice environment. Microcorruption is an industry-created example, or you could try finding an application security project from a college security course to follow along with (e.g., Project 1 from Berkeley’s CS 161, which seems to be publicly available currently).
Finally, there is always Google and Hacker News. Pat Shaughnessy’s “Learning to Read x86 Assembly Language” from 2016 covers the topic from the perspective of Ruby and Crystal, and there was also a recent (2020) discussion on how to learn x86_64 assembly.
Good luck, and happy hacking!