Why, in 2021, does anyone need to learn about assembly language? First, reading assembly language is the way to know exactly what your program is doing. Why, exactly, is that C++ program 1 MiB (say) instead of 100 KiB? Is it possible to squeeze some more performance out of that function that gets called all the time?
For C++ in particular, it is easy to forget or just not notice some operation (e.g., an implicit conversion or a call to a copy constructor or destructor) that is implied by the source code and language semantics, but not spelled out explicitly. Looking at the assembly generated by the compiler puts everything in plain sight.
Second, the more practical reason: so far, posts on this blog haven’t required an understanding of assembly language, despite constant links to Compiler Explorer. By popular demand, however, our next topic will be parameter passing, and for that, we will need a basic understanding of assembly language. We will focus only on reading assembly language, not writing it.
Instructions
The basic unit of assembly language is the instruction. Each machine instruction is a small operation, like adding two numbers, loading some data from memory, jumping to another memory location (like the dreaded goto statement), or calling or returning from a function. (The x86 architecture has lots of not-so-small instructions as well. Some of these are legacy cruft built up over the 40-odd years of the architecture’s existence, and others are newfangled additions.)
Example 1: Vector norm
Our first toy example will get us acquainted with simple instructions. It just calculates the square of the norm of a 2D vector:
#include <cstdint>

struct Vec2 {
    int64_t x;
    int64_t y;
};

int64_t normSquared(Vec2 v) {
    return v.x * v.x + v.y * v.y;
}
and here is the resulting x86_64 assembly from clang 11, via Compiler Explorer:
imulq %rdi, %rdi
imulq %rsi, %rsi
leaq (%rsi,%rdi), %rax
retq
Let’s talk about that first instruction: `imulq %rdi, %rdi`. This instruction performs signed integer multiplication. The `q` suffix tells us that it is operating on 64-bit quantities. (In contrast, `l`, `w`, and `b` would denote 32-bit, 16-bit, and 8-bit, respectively.) It multiplies the value in the first given register (`rdi`; register names are prefixed with a `%` sign) by the value in the second register and stores the result in that second register. This is squaring `v.x` in our example C++ code.

The second instruction does the same with the value in `%rsi`, which squares `v.y`.
Next, we have an odd instruction: `leaq (%rsi,%rdi), %rax`. `lea` stands for “load effective address”: it computes the address of its first operand and stores it in the second operand. `(%rsi,%rdi)` means “the memory location pointed to by `%rsi + %rdi`”, so this is just adding `%rsi` and `%rdi` and storing the result in `%rax`. `lea` is a quirky x86-specific instruction; on a more RISC-y architecture like ARM64, we would expect to see a plain old `add` instruction.

Finally, `retq` returns from the `normSquared` function.
Registers
Let’s take a brief detour to explain what the registers we saw in our example are. Registers are the “variables” of assembly language. Unlike your favorite programming language (probably), there are a finite number of them, they have standardized names, and the ones we’ll be talking about are at most 64 bits in size. Some of them have specific uses that we’ll see later. I wouldn’t be able to rattle this off from memory, but per Wikipedia, the full list of 16 registers on x86_64 is `rax`, `rcx`, `rdx`, `rbx`, `rsp`, `rbp`, `rsi`, `rdi`, `r8`, `r9`, `r10`, `r11`, `r12`, `r13`, `r14`, and `r15`.
Example 2: The stack
Now, let’s extend our example to debug print the `Vec2` in `normSquared`:
#include <cstdint>

struct Vec2 {
    int64_t x;
    int64_t y;
    void debugPrint() const;
};

int64_t normSquared(Vec2 v) {
    v.debugPrint();
    return v.x * v.x + v.y * v.y;
}
and, again, let’s see the generated assembly:
subq $24, %rsp
movq %rdi, 8(%rsp)
movq %rsi, 16(%rsp)
leaq 8(%rsp), %rdi
callq Vec2::debugPrint() const
movq 8(%rsp), %rcx
movq 16(%rsp), %rax
imulq %rcx, %rcx
imulq %rax, %rax
addq %rcx, %rax
addq $24, %rsp
retq
In addition to the obvious call to `Vec2::debugPrint() const`, we have some other new instructions and registers! `%rsp` is special: it is the “stack pointer”, used to maintain the function call stack. It points to the top of the stack, which grows “down” (toward lower addresses) on x86. So, our `subq $24, %rsp` instruction is making space for three 64-bit integers on the stack. (In general, setting up the stack and registers at the start of a function is called the function prologue.) Then, the following two `mov` instructions store the first and second arguments to `normSquared`, which are `v.x` and `v.y` (more about how parameter passing works in the next blog post!), to the stack, effectively creating a copy of `v` in memory at address `%rsp + 8`. Next, we load the address of our copy of `v` into `%rdi` with `leaq 8(%rsp), %rdi` and then call `Vec2::debugPrint() const`.
After `debugPrint` has returned, we load `v.x` and `v.y` back into `%rcx` and `%rax`. We have the same `imulq` and `addq` instructions as before. Finally, we `addq $24, %rsp` to clean up the 24 bytes of stack space we allocated at the start of our function (this cleanup is called the function epilogue), and then return to our caller with `retq`.
Example 3: Frame pointer and control flow
Now, let’s look at a different example. Suppose that we want to print an uppercased C string and we’d like to avoid heap allocations for smallish strings. We might write something like the following:
#include <cstdio>
#include <cstring>
#include <memory>

void copyUppercase(char *dest, const char *src);

constexpr size_t MAX_STACK_ARRAY_SIZE = 1024;

void printUpperCase(const char *s) {
    auto sSize = strlen(s);
    if (sSize <= MAX_STACK_ARRAY_SIZE) {
        char temp[sSize + 1];
        copyUppercase(temp, s);
        puts(temp);
    } else {
        // std::make_unique_for_overwrite is missing on Godbolt.
        std::unique_ptr<char[]> temp(new char[sSize + 1]);
        copyUppercase(temp.get(), s);
        puts(temp.get());
    }
}
Here is the generated assembly:
printUpperCase(char const*): # @printUpperCase(char const*)
pushq %rbp
movq %rsp, %rbp
pushq %r15
pushq %r14
pushq %rbx
pushq %rax
movq %rdi, %r14
callq strlen
leaq 1(%rax), %rdi
cmpq $1024, %rax # imm = 0x400
ja .LBB0_2
movq %rsp, %r15
movq %rsp, %rbx
addq $15, %rdi
andq $-16, %rdi
subq %rdi, %rbx
movq %rbx, %rsp
movq %rbx, %rdi
movq %r14, %rsi
callq copyUppercase(char*, char const*)
movq %rbx, %rdi
callq puts
movq %r15, %rsp
leaq -24(%rbp), %rsp
popq %rbx
popq %r14
popq %r15
popq %rbp
retq
.LBB0_2:
callq operator new[](unsigned long)
movq %rax, %rbx
movq %rax, %rdi
movq %r14, %rsi
callq copyUppercase(char*, char const*)
movq %rbx, %rdi
callq puts
movq %rbx, %rdi
leaq -24(%rbp), %rsp
popq %rbx
popq %r14
popq %r15
popq %rbp
jmp operator delete[](void*) # TAILCALL
Our function prologue has gotten a lot longer, and we have some new control flow instructions as well. Let’s take a closer look at the prologue:
pushq %rbp
movq %rsp, %rbp
pushq %r15
pushq %r14
pushq %rbx
pushq %rax
movq %rdi, %r14
The `pushq %rbp; movq %rsp, %rbp` sequence is very common: it pushes the frame pointer stored in `%rbp` to the stack and saves the current stack pointer (which becomes the new frame pointer) in `%rbp`. The next three `pushq` instructions save registers that the function is about to reuse; the last one, `pushq %rax`, doesn’t save anything we care about and simply moves `%rsp` down by 8 bytes to keep the stack 16-byte aligned for the calls that follow. Finally, we save our first argument (`%rdi`) in `%r14`.
On to the function body. We call `strlen(s)` with `callq strlen` and store `sSize + 1` in `%rdi` with `leaq 1(%rax), %rdi`.
Next, we finally see our first `if` statement! `cmpq $1024, %rax` sets the flags register according to the result of `%rax - 1024`, and then `ja .LBB0_2` (“jump if above”) transfers control to the location labeled `.LBB0_2` if the flags indicate that `%rax > 1024`. In general, higher-level control-flow primitives like `if`/`else` statements and loops are implemented in assembly using conditional jump instructions.
Let’s first look at the path where `%rax <= 1024` and thus the branch to `.LBB0_2` was not taken. We have a blob of instructions to create `char temp[sSize + 1]` on the stack:
movq %rsp, %r15
movq %rsp, %rbx
addq $15, %rdi
andq $-16, %rdi
subq %rdi, %rbx
movq %rbx, %rsp
We save `%rsp` to `%r15` and `%rbx` for later use. Then, we add 15 to `%rdi` (which, remember, contains the size of our array), mask off the low 4 bits with `andq $-16, %rdi`, and subtract the result from `%rbx`, which we then put back into `%rsp`. In short, this rounds the array size up to the next multiple of 16 bytes and makes space for it on the stack.
The following block simply calls `copyUppercase` and `puts` as written in the code:
movq %rbx, %rdi
movq %r14, %rsi
callq copyUppercase(char*, char const*)
movq %rbx, %rdi
callq puts
Finally, we have our function epilogue:
movq %r15, %rsp
leaq -24(%rbp), %rsp
popq %rbx
popq %r14
popq %r15
popq %rbp
retq
We restore the stack pointer, which deallocates our variable-length array, and then `leaq -24(%rbp), %rsp` points `%rsp` back at the registers we saved in the prologue. We `popq` those saved registers, return control to our caller, and we are done.
Next, let’s look at the path where `%rax > 1024` and we branch to `.LBB0_2`. This path is more straightforward. We call `operator new[]`, save the result (returned in `%rax`) to `%rbx`, and call `copyUppercase` and `puts` as before. We have a separate function epilogue for this case, and it looks a bit different:
movq %rbx, %rdi
leaq -24(%rbp), %rsp
popq %rbx
popq %r14
popq %r15
popq %rbp
jmp operator delete[](void*) # TAILCALL
The first `mov` sets up `%rdi` with the pointer to our heap-allocated array that we saved earlier. As with the other function epilogue, we then restore the stack pointer and pop our saved registers. Finally, we have a new instruction: `jmp operator delete[](void *)`. `jmp` is just like `goto`: it transfers control to the given label or function. Unlike `callq`, it does not push a return address onto the stack, so when `operator delete[]` returns, it will transfer control directly to `printUpperCase`’s caller. In essence, we’ve combined a `callq` to `operator delete[]` with our own `retq`. This is called tail call optimization, hence the `# TAILCALL` comment helpfully emitted by the compiler.
Practical application: catching surprising conversions
I said in the introduction that reading assembly makes implicit copy and destroy operations abundantly clear. We saw some of that in our previous example, but I want to close by looking at a common C++ move semantics debate. Is it OK to take parameters by value in order to avoid having one overload for lvalue references and another overload for rvalue references? There is a school of thought that says “yes, because in the lvalue case you will make a copy anyway, and in the rvalue case it’s fine as long as your type is cheap to move”. If we take a look at an example for the rvalue case, we will see that “cheap to move” does not mean “free to move”, as much as we might prefer otherwise. If we want maximum performance, we can demonstrate that the overload solution will get us there and the by-value solution will not. (Of course, if we aren’t willing to write extra code to improve performance, then “cheap to move” is probably cheap enough.)
#include <string>

class MyString {
    std::string str;
public:
    explicit MyString(const std::string& s);
    explicit MyString(std::string&& s);
};

class MyOtherString {
    std::string str;
public:
    explicit MyOtherString(std::string s);
};

void createRvalue1(std::string&& s) {
    MyString s2(std::move(s));
}

void createRvalue2(std::string&& s) {
    MyOtherString s2(std::move(s));
}
If we look at the generated assembly (which is too long to include, even though I’ve intentionally outlined the constructors in question), we can see that `createRvalue1` does 1 move operation (inside the body of `MyString::MyString(std::string&&)`) and 1 `std::string::~string()` call (the `operator delete` before returning). In contrast, `createRvalue2` is much longer: it does a total of 2 move operations (1 inline, into the `s` parameter for `MyOtherString::MyOtherString(std::string s)`, and 1 in the body of that same constructor) and 2 `std::string::~string` calls (1 for the aforementioned `s` parameter and 1 for the `MyOtherString::str` member). To be fair, moving a `std::string` is cheap, and so is destroying a moved-from `std::string`, but neither is free in terms of CPU time or code size.
Further reading
Assembly language dates back to the late 1940s, so there are plenty of resources for learning about it. Personally, my first introduction to assembly language was in the EECS 370: Introduction to Computer Organization junior-level course at my alma mater, the University of Michigan. Unfortunately, most of the course materials linked on that website are not public. Here are what appear to be the corresponding “how computers really work” courses at Berkeley (CS 61C), Carnegie Mellon (15-213), Stanford (CS107), and MIT (6.004). (Please let me know if I’ve suggested the wrong course for any of these schools!) Nand to Tetris also appears to cover similar material, and the projects and book chapters are freely available.
My first practical exposure to x86 assembly in particular was in the context of security exploits, or learning to become a “l33t h4x0r”, as the kids used to say. If this strikes you as a more entertaining reason to learn about assembly, great! The classic work in the space is Smashing the Stack for Fun and Profit. Unfortunately, modern security mitigations complicate running the examples in that article on your own, so I recommend finding a more modern practice environment. Microcorruption is an industry-created example, or you could try finding an application security project from a college security course to follow along with (e.g., Project 1 from Berkeley’s CS 161, which seems to be publicly available currently).
Finally, there is always Google and Hacker News. Pat Shaughnessy’s “Learning to Read x86 Assembly Language” from 2016 covers the topic from the perspective of Ruby and Crystal, and there was also a recent (2020) discussion on how to learn x86_64 assembly.
Good luck, and happy hacking!