ARM64 is a computer architecture that competes with the popular Intel x86-64 architecture used for the CPUs in desktops, laptops, and so on. ARM64 is common in mobile phones, as well as Graviton-based Amazon EC2 instances and the much ballyhooed Apple M1 chips, so knowing about it might be useful! In fact, I have almost certainly spent more time with ARM64 than x86-64 because of the iPhone.
Modern iPhones use ARM64, and apparently a huge majority of Android phones do too.
This post is an alternate version of my previous post on How to Read Assembly Language. It walks through the same examples, showing ARM64 assembly instead. Background content like explanations of instructions and registers is also rehashed for your reading convenience.
Instructions
The basic unit of assembly language is the instruction. Each machine instruction is a small operation, like adding two numbers, loading some data from memory, jumping to another memory location (like the dreaded goto statement), or calling or returning from a function. Unlike x86-64, each ARM64 instruction is exactly 4 bytes long, so you can tell how much memory a piece of ARM64 code takes up just by counting instructions.
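The fixed instruction size makes code-size arithmetic trivial. Here is a minimal sketch of that arithmetic; the names kInstructionBytes and codeSizeBytes are made up for illustration and are not part of any toolchain.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Every ARM64 instruction is encoded in exactly 4 bytes.
const std::size_t kInstructionBytes = 4;

// Hypothetical helper: bytes of machine code taken up by `instructionCount`
// ARM64 instructions.
inline std::size_t codeSizeBytes(std::size_t instructionCount) {
    assert(instructionCount <= SIZE_MAX / kInstructionBytes);  // result must fit
    return instructionCount * kInstructionBytes;
}
```

For instance, a function that compiles to three instructions occupies codeSizeBytes(3), or 12 bytes.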
Example 1: Vector norm
Our first toy example will get us acquainted with simple instructions. It just calculates the square of the norm of a 2D vector:
#include <cstdint>
struct Vec2 {
int64_t x;
int64_t y;
};
int64_t normSquared(Vec2 v) {
return v.x * v.x + v.y * v.y;
}
and here is the resulting ARM64 assembly from clang 11:
mul x8, x1, x1
madd x0, x0, x0, x8
ret
The first instruction, mul x8, x1, x1
, performs multiplication. Unlike the x86-64 assembly syntax we used previously, the destination operand is on the left. This mul
instruction squares the contents of x1
and stores the result into x8
.
Next, we have madd x0, x0, x0, x8
. madd
stands for “multiply-add”: it squares x0
, adds x8
, and stores the result in x0
.
Finally, ret
returns from normSquared
.
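If it helps to see those three instructions as ordinary code, here is a rough C++ model of them. The helper names mul, madd, and normSquaredModel are mine, and the model assumes (as the listing does) that v.x arrives in x0 and v.y in x1.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of `mul xd, xn, xm`: xd = xn * xm.
inline std::uint64_t mul(std::uint64_t xn, std::uint64_t xm) {
    return xn * xm;  // unsigned arithmetic wraps modulo 2^64
}

// Hypothetical model of `madd xd, xn, xm, xa`: xd = xn * xm + xa.
inline std::uint64_t madd(std::uint64_t xn, std::uint64_t xm, std::uint64_t xa) {
    return xn * xm + xa;
}

// Mirrors the listing line by line: x0 holds v.x and x1 holds v.y on entry.
inline std::int64_t normSquaredModel(std::int64_t x, std::int64_t y) {
    std::uint64_t x0 = static_cast<std::uint64_t>(x);
    const std::uint64_t x1 = static_cast<std::uint64_t>(y);
    const std::uint64_t x8 = mul(x1, x1);  // mul  x8, x1, x1
    x0 = madd(x0, x0, x8);                 // madd x0, x0, x0, x8
    return static_cast<std::int64_t>(x0);  // ret (the result is in x0)
}

inline void demoNormSquaredModel() {
    assert(normSquaredModel(3, 4) == 25);
    assert(normSquaredModel(-3, 4) == 25);
}
```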
Registers
Let’s take a brief detour to explain what the registers we saw in our example are. Registers are the “variables” of assembly language. Unlike variables in your favorite programming language (probably), there are a finite number of them, they have standardized names, and the ones we’ll be talking about are at most 64 bits in size. ARM64 has 31 general-purpose registers named x0
through x30
. To refer to their lower 32 bits instead of the full 64 bits, we can write w0
through w30
. There is also a dedicated sp
(stack pointer) register. Full documentation for core register names is on ARM’s website.
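To make the x/w naming concrete, here is a small sketch of one register with its two views. The Register type is hypothetical. One detail in it goes beyond what this post covers: on ARM64, writing to a w register clears the upper 32 bits of the corresponding x register, and the setW method models that.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of one general-purpose register and its two names.
struct Register {
    std::uint64_t bits = 0;

    // Reading xN yields all 64 bits.
    std::uint64_t x() const { return bits; }
    // Reading wN yields only the lower 32 bits.
    std::uint32_t w() const { return static_cast<std::uint32_t>(bits); }

    void setX(std::uint64_t value) { bits = value; }
    // Writing wN zeroes the upper 32 bits of the register.
    void setW(std::uint32_t value) { bits = value; }
};

inline void demoRegisterViews() {
    Register r;
    r.setX(0x1122334455667788ULL);
    assert(r.w() == 0x55667788u);            // w sees the low half only
    r.setW(0xAABBCCDDu);
    assert(r.x() == 0x00000000AABBCCDDULL);  // the high half was cleared
}
```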
Example 2: The stack
Now, let’s extend our example to debug print the Vec2
in normSquared
:
#include <cstdint>
struct Vec2 {
int64_t x;
int64_t y;
void debugPrint() const;
};
int64_t normSquared(Vec2 v) {
v.debugPrint();
return v.x * v.x + v.y * v.y;
}
and, again, let’s see the generated assembly:
sub sp, sp, #32
stp x29, x30, [sp, #16]
add x29, sp, #16
stp x0, x1, [sp]
mov x0, sp
bl Vec2::debugPrint() const
ldp x8, x9, [sp]
ldp x29, x30, [sp, #16]
mul x8, x8, x8
madd x0, x9, x9, x8
add sp, sp, #32
ret
We start off with a new register: sp
. Like %rsp
on x86-64, it is the “stack pointer”, used to maintain the function call stack. It points to the bottom of the stack, which grows “down” (toward lower addresses) on ARM64. So, our sub sp, sp, #32
instruction is making space for four 64-bit integers on the stack by SUBtracting from the stack pointer. Next, stp x29, x30, [sp, #16]
is SToring a Pair of registers: it is saving the old frame pointer (x29
) and link register (x30
– it contains the return address, as we’ll see below) on the stack starting at the address sp + 16
. (The square brackets denote a memory access.) We calculate the new frame pointer with add x29, sp, #16
; it is required to point to the previously-saved frame pointer and link register. This concludes the 3-instruction function prologue.
Then, the following stp x0, x1, [sp]
instruction stores the first and second arguments to normSquared
, which are v.x
and v.y
, to the stack, effectively creating a copy of v
in memory at the address in sp
. Next, we put a pointer to that copy of v
in x0
with mov x0, sp
and call Vec2::debugPrint() const
with bl
. bl
is a mnemonic for “branch with link”, and it works slightly differently from the x86-64 call
instruction: rather than pushing the return address onto the stack, it saves it in register x30
, also known as the link register or lr
.
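Here is a toy model of what bl and ret do to the program counter and the link register. The Cpu struct and the addresses are invented for illustration; a real CPU has far more state than this.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, heavily simplified CPU state: a program counter and x30.
struct Cpu {
    std::uint64_t pc = 0;   // address of the instruction being executed
    std::uint64_t x30 = 0;  // link register (lr)
};

// `bl target`: remember where to come back to, then jump.
inline void bl(Cpu& cpu, std::uint64_t target) {
    cpu.x30 = cpu.pc + 4;  // the next instruction; each instruction is 4 bytes
    cpu.pc = target;
}

// `ret`: jump to the address held in the link register.
inline void ret(Cpu& cpu) {
    cpu.pc = cpu.x30;
}

inline void demoBranchWithLink() {
    Cpu cpu;
    cpu.pc = 0x1000;  // pretend the `bl` lives at address 0x1000
    bl(cpu, 0x2000);  // call a function at address 0x2000
    assert(cpu.pc == 0x2000 && cpu.x30 == 0x1004);
    ret(cpu);         // the callee returns
    assert(cpu.pc == 0x1004);
}
```

Note that bl overwrites whatever x30 held before. That is why normSquared saves x30 to the stack in its prologue: without the saved copy, it would lose its own return address as soon as it called debugPrint.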
After debugPrint
has returned, we LoaD the Pair of registers x8
and x9
with v.x
and v.y
from the stack. We also restore the old values of the frame pointer and link register. Then, we have the same mul
and madd
instructions as in the previous example. Finally, we add sp, sp, #32
to clean up the 32 bytes of stack space we allocated at the start of our function (called the function epilogue; I would include the load of the old frame pointer and link register even though it happened to come before the mul
& madd
) and then return to our caller with ret
.
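To tie the whole listing together, here is a sketch that replays its stack traffic on a toy 128-byte stack. ToyStack, its size, and the placeholder values for the caller's x29 and x30 are all made up, and the call to debugPrint is elided (the sketch only notes that bl would clobber x30).

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical toy machine: a few 8-byte stack slots plus the stack pointer.
struct ToyStack {
    std::array<std::uint64_t, 16> slots{};  // models byte addresses 0..127
    std::uint64_t sp = 128;                 // empty stack: just past the last slot

    // `stp first, second, [sp, #offset]`
    void stp(std::uint64_t first, std::uint64_t second, std::uint64_t offset) {
        const std::uint64_t address = sp + offset;
        assert(address % 8 == 0 && address + 16 <= 128);
        slots[address / 8] = first;
        slots[address / 8 + 1] = second;
    }

    // `ldp first, second, [sp, #offset]`
    void ldp(std::uint64_t& first, std::uint64_t& second, std::uint64_t offset) const {
        const std::uint64_t address = sp + offset;
        assert(address % 8 == 0 && address + 16 <= 128);
        first = slots[address / 8];
        second = slots[address / 8 + 1];
    }
};

inline std::int64_t normSquaredOnToyStack(std::int64_t x, std::int64_t y) {
    ToyStack s;
    std::uint64_t x0 = static_cast<std::uint64_t>(x);  // v.x
    std::uint64_t x1 = static_cast<std::uint64_t>(y);  // v.y
    std::uint64_t x29 = 0xF00D, x30 = 0xCAFE;          // placeholder caller values
    const std::uint64_t callerX29 = x29, callerX30 = x30;
    std::uint64_t x8 = 0, x9 = 0;

    s.sp -= 32;            // sub  sp, sp, #32
    s.stp(x29, x30, 16);   // stp  x29, x30, [sp, #16]
    x29 = s.sp + 16;       // add  x29, sp, #16
    s.stp(x0, x1, 0);      // stp  x0, x1, [sp]
    x0 = s.sp;             // mov  x0, sp
    x30 = 0;               // bl   ... (call elided; bl overwrites x30)
    s.ldp(x8, x9, 0);      // ldp  x8, x9, [sp]
    s.ldp(x29, x30, 16);   // ldp  x29, x30, [sp, #16]
    x8 = x8 * x8;          // mul  x8, x8, x8
    x0 = x9 * x9 + x8;     // madd x0, x9, x9, x8
    s.sp += 32;            // add  sp, sp, #32
    assert(s.sp == 128 && x29 == callerX29 && x30 == callerX30);
    return static_cast<std::int64_t>(x0);  // ret
}
```

The final assert checks the point of the prologue and epilogue: by the time we return, sp, x29, and x30 all hold exactly what the caller left in them.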
Example 3: Control flow
Now, let’s look at a different example. Suppose that we want to print an uppercased C string and we’d like to avoid heap allocations for smallish strings. We might write something like the following:
#include <cstdio>
#include <cstring>
#include <memory>
void copyUppercase(char *dest, const char *src);
constexpr size_t MAX_STACK_ARRAY_SIZE = 1024;
void printUpperCase(const char *s) {
auto sSize = strlen(s);
if (sSize <= MAX_STACK_ARRAY_SIZE) {
char temp[sSize + 1];
copyUppercase(temp, s);
puts(temp);
} else {
// std::make_unique_for_overwrite is missing on Godbolt.
std::unique_ptr<char[]> temp(new char[sSize + 1]);
copyUppercase(temp.get(), s);
puts(temp.get());
}
}
Here is the generated assembly:
stp x29, x30, [sp, #-48]! // 16-byte Folded Spill
str x21, [sp, #16] // 8-byte Folded Spill
stp x20, x19, [sp, #32] // 16-byte Folded Spill
mov x29, sp
mov x19, x0
bl strlen
cmp x0, #1024 // =1024
add x0, x0, #1 // =1
b.hi .LBB0_2
add x9, x0, #15 // =15
mov x8, sp
and x9, x9, #0xfffffffffffffff0
sub x20, x8, x9
mov x21, sp
mov sp, x20
mov x0, x20
mov x1, x19
bl copyUppercase(char*, char const*)
mov x0, x20
bl puts
mov sp, x21
mov sp, x29
ldp x20, x19, [sp, #32] // 16-byte Folded Reload
ldr x21, [sp, #16] // 8-byte Folded Reload
ldp x29, x30, [sp], #48 // 16-byte Folded Reload
ret
.LBB0_2:
bl operator new[](unsigned long)
mov x1, x19
mov x20, x0
bl copyUppercase(char*, char const*)
mov x0, x20
bl puts
mov x0, x20
mov sp, x29
ldp x20, x19, [sp, #32] // 16-byte Folded Reload
ldr x21, [sp, #16] // 8-byte Folded Reload
ldp x29, x30, [sp], #48 // 16-byte Folded Reload
b operator delete[](void*)
Our function prologue has gotten a lot longer, and we have some new control flow instructions as well. Let’s take a closer look at the prologue:
stp x29, x30, [sp, #-48]! // 16-byte Folded Spill
str x21, [sp, #16] // 8-byte Folded Spill
stp x20, x19, [sp, #32] // 16-byte Folded Spill
mov x29, sp
As we saw before, we are saving the old frame pointer and link register to the stack. However, we are doing it using a more complicated store instruction: stp x29, x30, [sp, #-48]!
does two things. First, it stores x29
and x30
to the address sp - 48
. Second, it updates the stack pointer with that same sp - 48
value (that’s what the exclamation point is for; it’s the “pre-index addressing mode” described in ARM’s documentation).
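Here is a sketch of just the address arithmetic, comparing the plain offset form from the previous example with the pre-index form. The Access struct and both function names are hypothetical, and no memory is actually touched.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result of evaluating an addressing mode.
struct Access {
    std::uint64_t address;  // where the load or store happens
    std::uint64_t newBase;  // value of the base register afterwards
};

// `[base, #offset]`: plain offset; the base register is left unchanged.
inline Access offsetAddressing(std::uint64_t base, std::int64_t offset) {
    return {base + static_cast<std::uint64_t>(offset), base};
}

// `[base, #offset]!`: pre-index; the base register is updated to the same
// address that the memory access uses.
inline Access preIndex(std::uint64_t base, std::int64_t offset) {
    const std::uint64_t updated = base + static_cast<std::uint64_t>(offset);
    return {updated, updated};
}

inline void demoPreIndex() {
    // stp x29, x30, [sp, #-48]! with sp == 0x1000
    const Access a = preIndex(0x1000, -48);
    assert(a.address == 0xFD0 && a.newBase == 0xFD0);
    // stp x29, x30, [sp, #16] with sp == 0x1000 leaves sp alone
    const Access b = offsetAddressing(0x1000, 16);
    assert(b.address == 0x1010 && b.newBase == 0x1000);
}
```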
Next, we save x21
, x20
, and x19
to the stack; we will use them later and we are required to preserve their current values (in other words, they are “callee-saved” registers). Finally, we set up the new frame pointer in x29
.
(By the way, the term “spill” in the compiler-generated comments just means that we are saving registers to the stack.)
On to the function body:
mov x19, x0
bl strlen
cmp x0, #1024 // =1024
add x0, x0, #1 // =1
b.hi .LBB0_2
We save our argument, s
(stored in x0
) in x19
and call strlen
with bl
, as we saw before. When strlen
returns, we CoMPare its result against 1024 as the first step in our if
statement. This sets the NZCV register according to the result of the comparison, and then b.hi .LBB0_2
Branches to .LBB0_2
if it turns out that x0
was in fact more than 1024. Because both branches of our if
statement care about sSize + 1
and not sSize
, we add 1 to x0
(which stores sSize
) before the branch. In general, higher-level control-flow primitives like if
/else
statements and loops are implemented in assembly using conditional jump instructions.
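For the curious, here is a sketch of how cmp and b.hi cooperate through the flags. The Nzcv struct and function names are mine. The flag definitions are not spelled out in this post; they follow the architecture's rules for a subtraction, where "hi" (unsigned higher) means the carry flag is set and the zero flag is clear.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the four NZCV condition flags.
struct Nzcv {
    bool n;  // Negative: the result's top bit is set
    bool z;  // Zero: the result is zero
    bool c;  // Carry: for a subtraction, set when there was no borrow (a >= b unsigned)
    bool v;  // oVerflow: the subtraction overflowed as a signed operation
};

// `cmp a, b` computes a - b, throws the result away, and keeps only the flags.
inline Nzcv cmp(std::uint64_t a, std::uint64_t b) {
    const std::uint64_t result = a - b;
    Nzcv flags;
    flags.n = (result >> 63) != 0;
    flags.z = result == 0;
    flags.c = a >= b;
    flags.v = (((a ^ b) & (a ^ result)) >> 63) != 0;
    return flags;
}

// `b.hi` is taken when C is set and Z is clear, i.e. a > b as unsigned numbers.
inline bool branchTakenHi(const Nzcv& flags) {
    return flags.c && !flags.z;
}

inline void demoCompareAndBranch() {
    assert(!branchTakenHi(cmp(1024, 1024)));  // sSize == 1024: stack path
    assert(branchTakenHi(cmp(1025, 1024)));   // sSize == 1025: go to .LBB0_2
    assert(!branchTakenHi(cmp(0, 1024)));     // empty string: stack path
}
```

The add that sits between cmp and b.hi in the listing does not disturb any of this, because plain add does not write the flags.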
Let’s first look at the path where x0 <= 1024
and thus the branch to .LBB0_2
was not taken. We have a blob of instructions to create char temp[sSize + 1]
on the stack:
add x9, x0, #15 // =15
mov x8, sp
and x9, x9, #0xfffffffffffffff0
sub x20, x8, x9
mov x21, sp
mov sp, x20
We add 15 to x0
and put the result in x9
. Then, we mask off the lower 4 bits of x9
. Together, these two operations put the target array size rounded up to the next multiple of 16 into x9
. Then, we subtract the array size from the stack pointer, save the old stack pointer value into x21
, and set the new stack pointer value.
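The add-then-mask trick is worth seeing in isolation. Below is a minimal sketch of it; roundUpTo16 and allocateOnStack are names I made up, and the stack pointer value in the demo is arbitrary.

```cpp
#include <cassert>
#include <cstdint>

// Rounds `size` up to the next multiple of 16 (sizes that are already a
// multiple of 16 are unchanged).
inline std::uint64_t roundUpTo16(std::uint64_t size) {
    const std::uint64_t bumped = size + 15;  // add x9, x0, #15
    return bumped & 0xfffffffffffffff0ULL;   // and x9, x9, #0xfffffffffffffff0
}

// Hypothetical helper: the new stack pointer after carving out `arrayBytes`
// bytes below `sp`.
inline std::uint64_t allocateOnStack(std::uint64_t sp, std::uint64_t arrayBytes) {
    return sp - roundUpTo16(arrayBytes);     // sub x20, x8, x9
}

inline void demoRoundUp() {
    assert(roundUpTo16(1) == 16);
    assert(roundUpTo16(16) == 16);
    assert(roundUpTo16(17) == 32);
    assert(roundUpTo16(1025) == 1040);  // the largest array: sSize == 1024
    assert(allocateOnStack(0x1000, 6) == 0xFF0);
}
```

Adding 15 pushes any size that is not already a multiple of 16 past the next multiple, and clearing the low 4 bits then drops it back onto that multiple.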
The following block simply calls copyUppercase
and puts
as written in the code:
mov x0, x20
mov x1, x19
bl copyUppercase(char*, char const*)
mov x0, x20
bl puts
Finally, we have our function epilogue:
mov sp, x21
mov sp, x29
ldp x20, x19, [sp, #32] // 16-byte Folded Reload
ldr x21, [sp, #16] // 8-byte Folded Reload
ldp x29, x30, [sp], #48 // 16-byte Folded Reload
ret
We restore the stack pointer using the value of the frame pointer. Then, we load the registers we previously saved to the stack. Here we see a new “post-index” addressing mode: ldp x29, x30, [sp], #48
means to load x29
and x30
from the current value of the stack pointer, and then add 48 to it afterwards. Finally, we return control to our caller, and we are done.
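To see why restoring sp from the frame pointer works no matter how big the array was, here is a sketch that tracks only the stack pointer through this path. The function name and the starting address in the usage note are invented for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical trace of sp alone through the stack-array path of the listing.
// Returns the value of sp at the final `ret`.
inline std::uint64_t traceStackPointer(std::uint64_t sp, std::uint64_t arrayBytes) {
    assert(arrayBytes % 16 == 0);  // the listing rounds the array size up to 16

    sp -= 48;                      // stp x29, x30, [sp, #-48]!  (pre-index)
    const std::uint64_t x29 = sp;  // mov x29, sp
    sp -= arrayBytes;              // mov sp, x20  (room for temp)

    sp = x29;                      // mov sp, x29  (array discarded, whatever its size)

    // ldp x29, x30, [sp], #48: load from the current sp, then add 48 (post-index).
    const std::uint64_t loadAddress = sp;
    sp += 48;
    assert(loadAddress + 48 == sp);
    return sp;
}
```

Calling traceStackPointer(0x10000, 16) and traceStackPointer(0x10000, 1040) both give back 0x10000: the post-index load exactly undoes the pre-index store from the prologue.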
Next, let’s take a look at the path when x0 > 1024
and we branch to .LBB0_2
to allocate our array on the heap. This path is more straightforward. We call operator new[]
, save the result (returned in x0
) into x20
, and call copyUppercase
and puts
as before. We have a separate function epilogue for this case, and it looks a bit different:
mov x0, x20
mov sp, x29
ldp x20, x19, [sp, #32] // 16-byte Folded Reload
ldr x21, [sp, #16] // 8-byte Folded Reload
ldp x29, x30, [sp], #48 // 16-byte Folded Reload
b operator delete[](void*)
The first mov
sets up x0
with a pointer to our heap-allocated array that we saved earlier. As with the other function epilogue, we then restore the stack pointer, load our saved registers, and add 48 bytes back to the stack pointer. Finally, we have a new instruction: b operator delete[](void*)
. b
(for “branch”) is just like goto
: it transfers control to the given label or function. Unlike bl
, it does not save the return address for a future ret
. So, when operator delete[]
returns, it will instead transfer control to printUpperCase
’s caller. In essence, we’ve combined a bl
to operator delete[]
with our own ret
. This is called tail call optimization.
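At the source level, what matters is whether the call is the very last thing a function does. Here is a small sketch with one function whose call is in tail position and one whose call is not. The function names are made up, and scale is just a stand-in for a callee like operator delete[]. Whether a compiler actually emits b here depends on the optimization level, and with the callee visible like this it may simply inline it instead.

```cpp
#include <cassert>

// Hypothetical callee standing in for operator delete[] in the listing.
inline int scale(int value) {
    return value * 2;
}

// The call to scale is the last thing this function does, so an optimizing
// compiler is allowed to emit `b scale` instead of `bl scale` followed by `ret`.
inline int addThenScale(int a, int b) {
    return scale(a + b);
}

// Here the addition still has to run after scale returns, so the call cannot
// be replaced by a plain branch.
inline int scaleThenAdd(int a, int b) {
    return scale(a) + b;
}

inline void demoTailPosition() {
    assert(addThenScale(1, 2) == 6);
    assert(scaleThenAdd(1, 2) == 4);
}
```

In printUpperCase, the destructor of temp runs at the end of the else branch, so the call to operator delete[] is in tail position in the same way as the call in addThenScale.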
Further reading
Assembly language dates back to the late 1940s, so there are plenty of resources for learning about it. Personally, my first introduction to assembly language was in the EECS 370: Introduction to Computer Organization junior-level course at my alma mater, the University of Michigan. Unfortunately, most of the course materials linked on that website are not public. Here are what appear to be the corresponding “how computers really work” courses at Berkeley (CS 61C), Carnegie Mellon (15-213), Stanford (CS107), and MIT (6.004). (Please let me know if I’ve suggested the wrong course for any of these schools!) Nand to Tetris also appears to cover similar material, and the projects and book chapters are freely available.
My first practical exposure to ARM64 assembly in particular was through iPhone development. I already knew the general way assembly works from previous exposure in college, so I got started by just googling “ARM64 ldp instruction” (or whatever other instruction) each time and reading what it did. Over time, I remembered what I had learned and didn’t have to Google again.
If you would like a more technical walkthrough of ARM64 assembly language, there is also a “learn the architecture” guide on ARM’s website. It may help you to know that the official name for the architecture is actually AArch64, but “ARM64” seems to be much more common.