Monday, November 14, 2022

ARM: Pragmatism, Not Purity

Yet another train ride across the France-Spain border, yet another post.

In the last few weeks before the sabbatical I was working on bringing up AArch64 support for native code generation in Luau. Before that, my interaction with this architecture was limited to an occasional glance at disassembly emitted by the compiler. Coming from the x86 world and having often heard how ARM is a much cleaner architecture than x86, I expected smooth sailing and a clean, simple and unambiguous mapping between instruction bytes and assembly text.

Well, I came away from this exercise somewhat disillusioned. I expected a pure and clean mapping with separation of concerns, but instead there are many quirks of the ISA that I didn't really expect [1]. It's still less messy than x64, and I don't even disagree with the choices made - but a lot of them seem to eschew purity and instead attack the problem from a pragmatic angle, stated as "what useful instructions can we fit into a 32-bit encoding space".

Here are a few notes, in no particular order, of things that surprised me.

  • I expected there to be very few instruction "archetypes" - that is, ways to encode an instruction. Instead I ended up with 12 archetypes for 39 instructions. At least each instruction is always 32 bits!
  • It's tempting to think of ARM as RISC - each instruction is simple and does just one thing. Yet there are many fairly complex instructions: for example, add and some other instructions can carry an implicit shift left by up to 63 bits, or an implicit sign extension of the source 32-bit register, among other possibilities.
  • Loads/stores in particular are rather complicated, with not only shifts, but also the ability to pre-index or post-index the load/store - this refers to updating the base register by an immediate offset before or after the memory access. Disappointingly, with all this, loads can only scale the offset register by the size of the load, so ldr x0, [x1 + x2 * 16] is not something that can be encoded.
  • While the architecture supports address-relative loads, that requires two instructions - one to generate an address and one to load from it - with additional complexity depending on whether you need to support a large range of offsets or ±1MB suffices.
  • On the face of it, conditional branches are simple - a CMP instruction that sets flags, and B.cond that jumps based on the flags. However, there are some conditions that don't exist - e.g. unsigned comparison only supports > and <=, so for < and >= you need to flip the arguments (correction: you need to use carry conditionals, thanks Fabian!), whereas signed comparisons support all 4. There's additional complexity for floating point operations that I was hoping to not have to deal with after being traumatized by SSE handling of comparisons.
  • There's no implicit "zero" flag, so there are two additional conditional branch instructions, branch if zero and branch if not zero, that make it possible to omit a comparison instruction. That's nice; what's less nice is that the other pair of useful instructions, "jump if a specific bit is set/clear", only has a 14-bit jump offset, and as such requires even more careful handling for larger functions.
  • Speaking of offset bits, bits in load offsets are rather asymmetrical - you can load an X-byte value from an offset of X*Y where Y is in the [0, 4095] range (called scaled offset), but for negative offsets you must use a different instruction encoding that only permits unscaled offsets in the [-256, 255] byte range.
  • One additional odd thing about CMP is that this instruction doesn't truly exist - it's a mnemonic for subs zr, reg, reg/imm, where zr is register number 31. In some instructions like subs, using zr as the destination register essentially throws away the result. In some instructions, using zr as the source register produces value 0.
  • In some instructions, however, the encoding for zr (31) is instead used to refer to sp (stack pointer register). From the assembly mnemonics perspective this also results in asymmetry, where mov reg, sp is actually encoded as add reg, sp, 0, but for other registers, mov reg, reg is actually encoded as orr reg, xzr, reg [2].
  • The immediate encoding for operations like add is relatively straightforward and uses the available bits in the instruction space to encode the value. However, for bitwise operations like and, the immediate version of the instruction encodes the mask in a peculiar format that uses 13 bits to represent a subset of the full mask space, optimized for "values that you will probably want to use as masks"...
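
To make the register 31 quirk concrete, here is a minimal Python sketch of how cmp and the two flavors of mov can be lowered to "real" instructions. The helper names (subs_reg, orr_reg, add_imm and the alias wrappers) are hypothetical and not taken from the Luau assembler; only the 64-bit forms with no shift are covered, and the base opcodes are my reading of the A64 encoding tables.

```python
# Sketch: register number 31 reads as xzr or sp depending on the instruction.
# All helper names are hypothetical; 64-bit forms only, no shifts or extends.

REG_31 = 31  # the same 5-bit field value means xzr in some opcodes and sp in others


def subs_reg(rd: int, rn: int, rm: int) -> int:
    """SUBS Xd, Xn, Xm (shifted register form, LSL #0)."""
    return 0xEB000000 | (rm << 16) | (rn << 5) | rd


def orr_reg(rd: int, rn: int, rm: int) -> int:
    """ORR Xd, Xn, Xm (shifted register form, LSL #0)."""
    return 0xAA000000 | (rm << 16) | (rn << 5) | rd


def add_imm(rd: int, rn: int, imm12: int) -> int:
    """ADD Xd|SP, Xn|SP, #imm12 (immediate form, no shift)."""
    if not 0 <= imm12 < 4096:
        raise ValueError("immediate does not fit in 12 bits")
    return 0x91000000 | (imm12 << 10) | (rn << 5) | rd


def cmp_reg(rn: int, rm: int) -> int:
    """cmp xn, xm is subs xzr, xn, xm: register 31 as destination discards the result."""
    return subs_reg(REG_31, rn, rm)


def mov_reg(rd: int, rm: int) -> int:
    """mov xd, xm is orr xd, xzr, xm: register 31 as source reads as zero."""
    return orr_reg(rd, REG_31, rm)


def mov_from_sp(rd: int) -> int:
    """mov xd, sp is add xd, sp, #0: here register 31 means sp instead."""
    return add_imm(rd, REG_31, 0)


if __name__ == "__main__":
    print(f"cmp x1, x2  -> {cmp_reg(1, 2):08X}")
    print(f"mov x0, x1  -> {mov_reg(0, 1):08X}")
    print(f"mov x29, sp -> {mov_from_sp(29):08X}")
```

The three wrappers never emit a distinct opcode of their own, which is why a disassembler has to recognize the register 31 patterns if it wants to print cmp or mov back.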
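
To make the mask format for bitwise immediates a bit less mysterious, here is a minimal Python sketch of the decoding direction, loosely following the DecodeBitMasks pseudocode in the Arm architecture manual. The field names n, immr and imms match the manual; the function names decode_logical_imm and encodable_masks are made up for this illustration, and only 64-bit operands are handled.

```python
# Sketch: decode the (n, immr, imms) fields of a 64-bit logical immediate.
# Function names are hypothetical; 32-bit operands are not handled.

def decode_logical_imm(n: int, immr: int, imms: int) -> int:
    """Return the 64-bit mask for the given fields, or raise ValueError if reserved."""
    # The position of the highest set bit of n:NOT(imms) selects the element size.
    combined = (n << 6) | (~imms & 0x3F)
    length = combined.bit_length() - 1
    if length < 1:
        raise ValueError("reserved encoding")
    esize = 1 << length          # element size in bits: 2, 4, 8, 16, 32 or 64
    levels = esize - 1
    s = imms & levels            # the element holds a run of s + 1 ones ...
    r = immr & levels            # ... rotated right by r bits
    if s == levels:
        raise ValueError("reserved encoding (an all-ones element)")
    run = (1 << (s + 1)) - 1
    elem = ((run >> r) | (run << (esize - r))) & ((1 << esize) - 1)
    # Replicate the element until it fills all 64 bits.
    mask = 0
    for i in range(64 // esize):
        mask |= elem << (i * esize)
    return mask


def encodable_masks() -> set:
    """Enumerate every mask that some field combination can produce."""
    masks = set()
    for n in (0, 1):
        for immr in range(64):
            for imms in range(64):
                try:
                    masks.add(decode_logical_imm(n, immr, imms))
                except ValueError:
                    pass  # skip reserved combinations
    return masks


if __name__ == "__main__":
    print(hex(decode_logical_imm(1, 0, 7)))         # a run of 8 ones in a 64-bit element
    print(hex(decode_logical_imm(0, 0, 0b111100)))  # a 2-bit element repeated 32 times
    print(0x1234 in encodable_masks())
```

Enumerating every field combination this way yields only a few thousand distinct 64-bit masks, so values like 0xFF or 0x5555555555555555 are encodable while an arbitrary constant like 0x1234 is not and has to be materialized in a register first.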

Overall, none of these are deal-breakers and many are pretty easy to deal with - that said, I expected the A64 assembler to be much simpler than our X64 assembler (for the subset of instructions we need), and instead it's going to end up with roughly the same amount of complexity [3]. It now feels like the real benefit of the A64 ISA is the fact that the instructions are all 4 bytes in size, which makes it much easier to implement wide frontends [4] - coincidentally this of course seems to be where all the irregularities come from, as you effectively need to devise an encoding that makes it possible to unambiguously encode a large set of instructions and as much data as possible in some of them. From the perspective of writing an assembler or a disassembler, however, I'm not sure the gap is that wide otherwise...
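
As a small illustration of why the fixed instruction size helps, here is a Python sketch that splits a code buffer into instructions without decoding anything. The function name split_a64 is hypothetical, and the two sample words are the encodings of nop and ret as I understand them; A64 instructions are stored little-endian.

```python
# Sketch: with fixed 4-byte instructions, instruction boundaries are known up front.
# split_a64 is a hypothetical helper name.
import struct
from typing import List


def split_a64(code: bytes) -> List[int]:
    """Split a buffer of A64 code into 32-bit little-endian instruction words."""
    if len(code) % 4 != 0:
        raise ValueError("A64 code size must be a multiple of 4 bytes")
    return [word for (word,) in struct.iter_unpack("<I", code)]


if __name__ == "__main__":
    buf = bytes.fromhex("1f2003d5c0035fd6")  # nop; ret
    for offset, word in enumerate(split_a64(buf)):
        print(f"{offset * 4:04x}: {word:08X}")
```

Every word can be handed to a decoder independently of its neighbors, whereas on x64 the start of an instruction is only known after the length of the previous one has been worked out.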


  1. Note that I've had limited time to work with this ISA so I am likely missing a significant amount of information wrt history and reasoning.

  2. This requires additional archetypes for our implementation that wants to be able to display disassembly on the fly while building the binary stream.

  3. Neither implementation supports the full set of instructions, which is honestly a relief considering, for example, that A64 has 96 instructions for memcpy-like operations, organized in 32 groups of 3, or the number of atomic-related instruction variants.

  4. Keep in mind that I'm not a hardware engineer so who knows if this is true! I'm also omitting the simpler handling of multiple register widths, although A64 still supports 32/64 bit variants of most instructions that operate on 64-bit registers.



from Hacker News https://ift.tt/NBnVUC2
