Writing a bare-metal RISC-V application in D

2023/02/08

Categories: d osdev riscv

This post will show you how to use D to write a bare-metal “Hello world” program that targets the RISC-V QEMU simulator. In a future blog post (now available) we’ll build on this to target actual hardware: the VisionFive 2 SBC. See blog-code for the final code from this post. For a more complex example, see Multiplix, an operating system I am developing that runs on the VisionFive 2 (and Raspberry Pis).

Why D?

Recently I’ve been writing bare-metal code in C, and I’ve become a bit frustrated with the lack of features that C provides. I started searching for a good replacement and revisited D (a language I used for a project a few years ago). It turns out D has introduced a mode called betterC (which sounds like exactly what I want) that essentially disables all language features requiring the D runtime. This makes D roughly as easy to use for bare-metal programming as C. You don’t get all the features of D, but you get enough to cover everything I want (in fact, for systems programming I prefer the betterC subset of D over full D). D in betterC mode is exactly what it sounds like, and retains the feel of C. Going forward I think I will be using it instead of C in all situations where I would otherwise have used C (even outside bare-metal programming).

Here are the positives about D I value most:

  • A decent import system (no more header files and #include).
  • Automatic bounds checking, and bounded strings and arrays.
  • Methods in structs.
  • Compile-time code evaluation (run D code at compile-time!).
  • Powerful templating and generics.
  • Iterators.
  • Default support for thread-local storage.
  • Scope guards and RAII.
  • Some memory safety protections with @safe.
  • A fairly comprehensive and readable online specification.
  • An active Discord channel with people who answer my questions in minutes.
  • Both an LLVM-based compiler (LDC) and a GNU compiler (GDC), which is officially part of the GCC project.
    • And these compilers both export roughly the same flags and intrinsics as Clang and GCC respectively.
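
As a small taste, here is a contrived, betterC-compatible sketch (an illustration of my own, not code from this post) showing a few of these features:

void releaseResource() { /* e.g., unmap a buffer */ }

// Runs at compile time below thanks to CTFE.
int sumUpTo(int n) {
    int total = 0;
    foreach (i; 0 .. n + 1)
        total += i;
    return total;
}

// A tiny template, usable with any comparable type.
T maxOf(T)(T a, T b) {
    return a > b ? a : b;
}

extern (C) void example() {
    scope (exit) releaseResource(); // scope guard: cleanup runs when example() returns

    int[4] values = [1, 2, 3, 4];   // fixed-size array; out-of-bounds accesses are caught
    enum total = sumUpTo(100);      // evaluated at compile time
    int biggest = maxOf(total, values[0]);
}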

These features, combined with the lack of a runtime and the C-like feel of the language (making it easy to port previous code), make it a no-brainer for me to have D as the base choice for any project where I would otherwise use C.

Now that I’ve told you about my reasons for choosing D, let’s try using it to write a bare-metal application that targets RISC-V. If you want to follow along, the first step is to download the toolchain (the following tools should work on Linux or macOS). You’ll need three components:

  1. LDC 1.30 (the LLVM-based D compiler). Can be downloaded from GitHub. Make sure to use version 1.30.
  2. A riscv64-unknown-elf GNU toolchain. Can be downloaded from SiFive’s Freedom Tools repository.
  3. The QEMU RISC-V simulator: qemu-system-riscv64. Can be downloaded from SiFive’s Freedom Tools repository, or is usually available as part of your system’s QEMU package.

We’ll be using LDC since it ships with the ability to target riscv64. I have used GDC for bare-metal development as well, but it requires building a toolchain from source since nobody ships pre-built riscv64-unknown-elf-gdc binaries. We’ll use the GNU toolchain for assembling, linking, and for other tools like objcopy and objdump, and QEMU for simulating the hardware.

With these installed you should be able to run:

$ ldc2 --version
LDC - the LLVM D compiler (1.30.0):
...

$ riscv64-unknown-elf-ld
riscv64-unknown-elf-ld: no input files

$ qemu-system-riscv64 -h
...

CPU entrypoint

We’re writing bare-metal code, so there’s no operating system, no console, no files – nothing. The CPU just starts executing instructions at a pre-specified address after performing some initial setup. We’ll figure out what that address is later when we set up the linkerscript. For now we can just define the _start symbol as our entrypoint, and assume the linker will place the code at this label at the CPU entrypoint.

A D function requires a valid stack pointer, so before we can execute any D code we need to load the stack pointer register sp with a valid address.

Let’s make a file called start.s and put the following in it:

.section ".text.boot"

.globl _start
_start:
    la sp, _stack_start
    call dstart
_hlt:
    j _hlt

For now let’s assume _stack_start is a symbol with the address of a valid stack, and in the linkerscript we’ll set this up properly. After loading sp, we call a D function called dstart, defined in the next part.

D entrypoint

Now we can define our dstart function in dstart.d. For now we’ll just cause an infinite loop.

module dstart;

extern (C) void dstart() {
    while (1) {}
}

Linkerscript

Before we can compile this program we need a bit of linkerscript to tell the linker how our code should be laid out. We’ll need to specify the address where the text section should start (the entry address), and reserve space for all the data sections (.rodata, .data, .bss), and the stack.

Entry address

Today we’ll be targeting the QEMU virt RISC-V machine, so we have to figure out what its entrypoint is.

We can ask QEMU for a list of all the devices in the virt machine by telling it to dump its device tree:

$ qemu-system-riscv64 -machine virt,dumpdtb=virt.dtb
$ dtc virt.dtb > virt.dts

In virt.dts you’ll find the following entry:

memory@80000000 {
    device_type = "memory";
    reg = <0x00 0x80000000 0x00 0x8000000>;
};

This means that RAM starts at address 0x80000000 (everything below is special memory or inaccessible). The CPU entrypoint for the virt machine is the first instruction in RAM, stored at 0x80000000.

In the linkerscript, we need to tell the linker that it should place the _start function at 0x80000000. We do this by telling it to put the .text.boot section first in the .text section, located at 0x80000000. Then we include the rest of the .text sections, followed by read-only data, writable data, and the BSS.

In link.ld:

ENTRY(_start)

SECTIONS
{
    .text 0x80000000 : {
        KEEP(*(.text.boot))  
        *(.text*) 
    }
    .rodata : {
        . = ALIGN(8);
        *(.rodata*)
        *(.srodata*)
        . = ALIGN(8);
    }
    .data : { 
        . = ALIGN(8);
        *(.sdata*)
        *(.data*)
        . = ALIGN(8);
    } 
    .bss : {
        . = ALIGN(8);
        _bss_start = .;
        *(.sbss*)
        *(.bss*)
        *(COMMON)
        . = ALIGN(8);
        _bss_end = .;
    }

    .kstack : {
        . = ALIGN(16);
        . += 4K;
        _stack_start = .;
    }

    /DISCARD/ : { *(.comment .note .eh_frame) }
}

What is the BSS?

The BSS is a region of memory that the compiler assumes is initialized to all zeroes. Usually the static data for a program is copied directly into the ELF executable – if you have a string "hello world" in your program, those exact bytes will live somewhere in the binary (in the read-only data section). However, a lot of static data is initialized to zero, so instead of putting those zero bytes directly into the ELF file, the linker lets us save space by making a special section (the BSS) that must be initialized to all zeroes at runtime, but won’t actually contain that data in the ELF file itself. So even if you have a giant 1MB array of zeroes, your ELF binary will be small, because that section is only expanded into RAM when the application starts.

Usually the OS sets up the BSS before it launches a program, but we’re running bare-metal, so we have to do that manually in the dstart function (in the next section). To make this initialization possible, we define the _bss_start and _bss_end symbols in the linkerscript. These are symbols whose addresses will be the start and end of the BSS section respectively.

Reserving space for the stack

We also reserve one page for the .kstack section and place the _stack_start symbol at the end of it (remember, the stack grows down). The stack must be 16-byte aligned.

Compile!

Now we have everything we need to compile a basic bare-metal program.

$ ldc2 -Oz -betterC -mtriple=riscv64-unknown-elf -mattr=+m,+a,+c --code-model=medium -c dstart.d
$ riscv64-unknown-elf-as -mno-relax -march=rv64imac start.s -c -o start.o
$ riscv64-unknown-elf-ld -Tlink.ld start.o dstart.o -o prog.elf

Let’s look at some of these flags:

  • -Oz: optimize aggressively for size.
  • -betterC: enable betterC mode (disable the built-in D runtime).
  • -mtriple=riscv64-unknown-elf: build for the riscv64 bare-metal ELF target.
  • -mattr=+m,+a,+c: enable the following RISC-V extensions: m (multiply/divide), a (atomics), and c (compressed instructions).
  • --code-model=medium: code models in RISC-V control how pointers to far-away locations are constructed. The medium code model (also called medany) allows us to address any symbol located within 2 GiB of the current address, and is recommended for 64-bit programs. See the SiFive post for more information.
  • -mno-relax: disables linker relaxation in the assembler (it is already disabled by default in LDC). Linker relaxation is a RISC-V-specific optimization that allows the linker to make use of the gp (global pointer) register. I explain it in more detail in the linker relaxation section.

It’s going to get tedious to type out these commands repeatedly, so let’s create a Makefile (or a Knitfile if you’re cool):

SRC=$(wildcard *.d)
OBJ=$(SRC:.d=.o)

all: prog.bin

%.o: %.d
      ldc2 -Oz -betterC -mtriple=riscv64-unknown-elf -mattr=+m,+a,+c --code-model=medium --makedeps=$*.dep $< -c -of $@
%.o: %.s
      riscv64-unknown-elf-as -mno-relax -march=rv64imac $< -c -o $@
prog.elf: start.o $(OBJ)
      riscv64-unknown-elf-ld -Tlink.ld $^ -o $@
%.bin: %.elf
      riscv64-unknown-elf-objcopy $< -O binary $@
%.list: %.elf
      riscv64-unknown-elf-objdump -D $< > $@
run: prog.bin
      qemu-system-riscv64 -nographic -bios none -machine virt -kernel prog.bin
clean:
      rm -f *.bin *.list *.o *.elf *.dep

-include *.dep

and compile with

$ make prog.bin

The prog.bin file is a raw dump of our program. At this point it clocks in at a whopping 22 bytes.

To see the disassembled program, run

$ make prog.list
...
$ cat prog.list
prog.elf:     file format elf64-littleriscv

Disassembly of section .text:

0000000080000000 <_start>:
    80000000: 00001117                auipc   sp,0x1
    80000004: 02010113                addi    sp,sp,32 # 80001020 <_stack_start>
    80000008: 00000097                auipc   ra,0x0
    8000000c: 00c080e7                jalr    12(ra) # 80000014 <dstart>

0000000080000010 <_hlt>:
    80000010: a001                    j       80000010 <_hlt>
    ...

0000000080000014 <dstart>:
    80000014: a001                    j       80000014 <dstart>

Looks like our _start function is being linked properly at 0x80000000 and has the expected assembly!

If you try to run with

$ make run
qemu-system-riscv64 -nographic -bios none -machine virt -kernel prog.bin

it will just enter an infinite loop (press Ctrl-A Ctrl-X to quit QEMU). We still have a bit more work to do before we get output.

More setup: initializing the BSS

Now let’s modify dstart to initialize the BSS. We need to declare some extern variables so that the linker symbols _bss_start and _bss_end are available to our D code. Then we can just loop from _bss_start to _bss_end and set every byte in that range to zero. Once that’s done, our BSS is initialized and we can run arbitrary D code (including code that relies on zero-initialized globals).

extern (C) {
    extern __gshared uint _bss_start, _bss_end;

    void dstart() {
        // Zero the BSS one uint at a time (the linkerscript aligns
        // _bss_start and _bss_end to 8 bytes, so this cannot overrun).
        uint* bss = &_bss_start;
        uint* bss_end = &_bss_end;
        while (bss < bss_end) {
            *bss++ = 0;
        }

        import main;
        kmain();
    }
}

And in main.d we have our bare-metal main entrypoint:

module main;

void kmain() {}

Creating a minimal D runtime

Several D language features are unavailable because we have no runtime. For example, types such as string and size_t are undefined, and we can’t use assertions (we’ll get to those later). The first step in creating a minimal runtime is to create an object.d file. The D compiler searches for this special file and imports it automatically everywhere, so it is the place to define types like string and size_t. Here is the minimal definition I like to use, which also defines ptrdiff_t, noreturn, and uintptr.

module object;

alias string = immutable(char)[];
alias size_t = typeof(int.sizeof);
alias ptrdiff_t = typeof(cast(void*) 0 - cast(void*) 0);

alias noreturn = typeof(*null);

static if ((void*).sizeof == 8) {
    alias uintptr = ulong;
} else static if ((void*).sizeof == 4) {
    alias uintptr = uint;
} else {
    static assert(0, "pointer size must be 4 or 8 bytes");
}

Writing to the UART device

Most systems have a UART device. Generally how this works is that you write a byte to a special place in memory, and that byte is transmitted using the UART protocol over some pins on the board. To read the bytes on your host computer you need a UART-to-USB adapter plugged into the host, and then you can read from the corresponding device file (usually /dev/ttyUSB0). Today we’ll just be simulating our bare-metal code in QEMU, so you don’t need a special adapter: QEMU will emulate a UART device and print out the bytes written to its transmit register.

Enabling volatile loads/stores

When writing to device memory it is important to ensure that the compiler does not remove our loads/stores. For example, if a device is located at 0x10000000, we might write directly to that address by casting the integer to a pointer. To the compiler, it just looks like we are writing to random addresses, which might be undefined behavior or result in dead code (e.g., if we never read the value back, the compiler may determine that it can eliminate the write). We need to inform the compiler that these reads/writes of device memory must be preserved and cannot be optimized out. D uses the volatileStore and volatileLoad intrinsics for this.

We can define these in our object.d:

pragma(LDC_intrinsic, "ldc.bitop.vld") ubyte volatileLoad(ubyte* ptr);
pragma(LDC_intrinsic, "ldc.bitop.vld") ushort volatileLoad(ushort* ptr);
pragma(LDC_intrinsic, "ldc.bitop.vld") uint volatileLoad(uint* ptr);
pragma(LDC_intrinsic, "ldc.bitop.vld") ulong volatileLoad(ulong* ptr);
pragma(LDC_intrinsic, "ldc.bitop.vst") void volatileStore(ubyte* ptr, ubyte value);
pragma(LDC_intrinsic, "ldc.bitop.vst") void volatileStore(ushort* ptr, ushort value);
pragma(LDC_intrinsic, "ldc.bitop.vst") void volatileStore(uint* ptr, uint value);
pragma(LDC_intrinsic, "ldc.bitop.vst") void volatileStore(ulong* ptr, ulong value);
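
As a quick illustration of how these are meant to be used, here is a small hedged sketch (the address 0x10000000 is the QEMU UART base we will find in the next section):

void mmioWriteByte(ubyte b) {
    // Cast the raw MMIO address to a pointer at runtime.
    ubyte* device = cast(ubyte*) 0x10000000;
    // A plain "*device = b;" could, in principle, be reordered or dropped by the
    // optimizer; volatileStore guarantees the store is emitted, in program order.
    volatileStore(device, b);
}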

Controlling the UART

With that set up, let’s figure out where QEMU’s UART device is located in memory so we can write to it.

The QEMU virt machine defines a number of virtual devices, one of which is a UART device. Looking through the QEMU device tree again in virt.dts, you’ll see the following:

uart@10000000 {
    interrupts = <0x0a>;
    interrupt-parent = <0x03>;
    clock-frequency = <0x384000>;
    reg = <0x00 0x10000000 0x00 0x100>;
    compatible = "ns16550a";
};

This says that an ns16550a UART device exists at address 0x10000000.

On real hardware the UART would need to be properly initialized by writing some memory-mapped configuration registers (to set the baud rate and other options). However, the QEMU device does not require initialization. It emulates an ns16550a device, and writing to its transmit register is enough to cause a byte to be written over the UART (which appears on the console when simulating with QEMU). The transmit register for the ns16550a is the first mapped register, so it is located at 0x10000000.

In uart.d:

module uart;

struct Ns16550a(ubyte* base) {
    static void tx(ubyte b) {
        volatileStore(base, b);
    }
}

alias Uart = Ns16550a!(cast(ubyte*) 0x10000000);
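
The minimal struct above is all QEMU needs. On real hardware you would typically also wait for the transmitter to become ready before writing each byte. Here is a hedged sketch of what that might look like, using the standard 16550 register layout (the line status register at offset 5, with bit 0x20 meaning the transmit holding register is empty); the struct name is just illustrative, and you should verify the register details against your board’s datasheet:

struct Ns16550aPolled(ubyte* base) {
    enum thr = 0; // transmit holding register offset
    enum lsr = 5; // line status register offset

    static void tx(ubyte b) {
        // Spin until the transmit holding register is empty (LSR bit 0x20).
        while ((volatileLoad(base + lsr) & 0x20) == 0) {}
        volatileStore(base + thr, b);
    }
}

QEMU’s emulated UART should be effectively always ready to transmit, so this polling loop only matters on real hardware.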

Now in kmain, we can test the UART.

module main;

import uart;

void kmain() {
    Uart.tx('h');
    Uart.tx('i');
    Uart.tx('\n');
}

$ make prog.bin
$ qemu-system-riscv64 -nographic -bios none -machine virt -kernel prog.bin
hi

Press Ctrl-A Ctrl-X to quit QEMU (the program will enter an infinite loop after returning from kmain).

Making a simple print function

Now we can just wrap the Uart.tx function up with a println function and we’ll have a bare-metal Hello world! in no time.

In object.d:

import uart;

void printElem(char c) {
    Uart.tx(c);
}

void printElem(string s) {
    foreach (c; s) {
        printElem(c);
    }
}

void print(Args...)(Args args) {
    foreach (arg; args) {
        printElem(arg);
    }
}

void println(Args...)(Args args) {
    print(args, '\n');
}

And in main.d:

void kmain() {
    println("Hello world!");
}

$ make prog.bin
$ qemu-system-riscv64 -nographic -bios none -machine virt -kernel prog.bin
Hello world!

There you have it, (simulated) bare-metal hello world!

Some of the initialization we’ve done hasn’t been strictly necessary (we didn’t end up using any variables in the BSS), but it should set you up properly for writing more complex bare-metal programs. The next sections discuss some further steps.

Bonus content

Adding support for assertions and bounds-checking

If you try to use a D assert expression, you might notice that the linking step fails:

riscv64-unknown-elf-ld: dstart.o: in function `_D6dstart5kmainFZv':
dstart.d:(.text+0x3c): undefined reference to `__assert'

It is looking for an __assert function, so let’s create one in the object.d file:

size_t strlen(const(char)* s) {
    size_t n;
    for (n = 0; *s != '\0'; ++s) {
        ++n;
    }
    return n;
}

extern (C) noreturn __assert(const(char)* msg, const(char)* file, int line) {
    // convert a char pointer into a bounded string with the [0 .. length] syntax
    string smsg = cast(string) msg[0 .. strlen(msg)];
    string sfile = cast(string) file[0 .. strlen(file)];
    println("fatal error: ", sfile, ": ", smsg);
    while (1) {}
}

Now you can use assert statements!

D also supports bounds checking, and internally the compiler calls __assert when a bounds check fails, so our bounds checks now work as well.

Try this in main.d:

void kmain() {
    char[10] array;
    int x = 12;
    println(array[x]);
}

Running it gives

fatal error: main.d: array index out of bounds

Bounds-checked arrays!

This code doesn’t print the line number because that requires converting an int to a string – something left as an exercise to the reader.
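
If you do want the line number, a minimal sketch of an integer overload for printElem could look something like this (decimal only, non-negative values only):

void printElem(ulong n) {
    char[20] buf = void; // enough room for the digits of a 64-bit value
    size_t i = buf.length;
    do {
        buf[--i] = cast(char) ('0' + n % 10);
        n /= 10;
    } while (n != 0);
    printElem(cast(string) buf[i .. $]);
}

With an overload like this in object.d, __assert could also print its line argument via println.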

Enabling linker relaxation

Linker relaxation is an optimization in the RISC-V toolchain that allows globals to be accessed through the global pointer (stored in the gp register). This value is a pointer to somewhere in the data section, which allows instructions to load globals by directly offsetting from gp, instead of constructing the address of the global from scratch (which may require multiple instructions on RISC-V).

To enable linker relaxation we have to do three things:

  1. Modify the linkerscript so that it defines a symbol for the global pointer.
  2. Load the gp register with this value in the _start function.
  3. Enable linker relaxation in our compiler.

To modify the linkerscript we just add the following at the beginning of the .rodata section definition:

__global_pointer$ = . + 0x800;

This sets up the __global_pointer$ symbol (a special symbol whose value the linker assumes is loaded into gp) to point 0x800 bytes into the data segment. RISC-V loads and stores have a 12-bit signed immediate, so a single instruction can reach anything within 0x800 bytes of gp in either direction; placing gp slightly inside the data segment lets those gp-relative offsets cover most (or all) of the static data.

Next we add to _start:

.option push
.option norelax
la gp, __global_pointer$
.option pop

We need to temporarily set the norelax option here; otherwise the linker would relax this very sequence into mv gp, gp, a gp-relative access using a register we have not yet initialized.

Finally, we can remove the -mno-relax flag from the riscv64-unknown-elf-as invocation, and add -mattr=+m,+a,+c,+relax to the ldc2 invocation to enable linker relaxation in the compiler.

Removing unused functions

If you take a look at the disassembly of the program (make prog.list), you might notice there are definitions for functions that are never called. This is because those functions have been inlined, but the definitions were not removed. Functions/globals in D are always exported in the object file, even if they are marked private (I’m not really sure why). Luckily modern linkers can be pretty smart and it’s easy to have the linker remove these unused functions. Pass --function-sections and --data-sections to LDC to have it put each function/global in its own section (still within .text, .data etc.). Now if you pass the --gc-sections flag to the linker, it will remove any unreferenced sections (hence removing any unused functions/globals). With these flags I got the final “hello world” binary down to 160 bytes.

This is a basic form of optimization performed by the linker. There are more advanced forms of link-time optimization (LTO), which I won’t discuss in much detail. If you pass -flto=thin or -flto=full to LDC, the object files that it generates will be LLVM bitcode. Then you will need to invoke the linker with the LLVMgold linker plugin (or use LLD) so that it can read these files. With this method, the linker will apply full compiler optimizations across object files.

Thread-local storage and globals

Globals are thread-local by default in D. If you declare a global as int x;, then whenever you access x the compiler does so through the system’s thread pointer (on RISC-V this is stored in the tp register). So if you use a thread-local variable, you had better make sure tp points to a block of memory where x is located, and if you have multiple threads, each thread’s tp should point to a distinct thread-local block (each thread gets its own private copy of x). I won’t explain in detail how to set that up here, but briefly: you’ll need to initialize the .tdata and .tbss sections for each thread in dstart, and load tp with a pointer to the current thread’s copy of the thread-local data.

To make a global shared across all threads, you need to mark it as immutable or shared. A variable marked as shared imposes some limits, and basically forces you to mark everything it touches as shared. You can still read/write it without checks, but at least you should be able to easily know if you are accessing a shared variable (and manually verify you have the appropriate synchronization). In a future version of D it is likely that directly accessing a shared variable will be disallowed, except through atomic intrinsics. If you have a lock to protect the variable, then you will need to cast away the shared qualifier manually, which isn’t perfect but forces the programmer to acknowledge the possible unsafety of accessing the shared global. You can always use the __gshared attribute as an escape hatch, which makes the global shared but does not make any changes to the type (no limitations). A global marked as __gshared is equivalent to a C global.
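
To make those distinctions concrete, here is a small hedged example of the different kinds of globals (the names are purely illustrative):

int perThreadCount;             // thread-local by default: each thread gets its own copy, accessed via tp
shared int liveThreads;         // one copy shared by all threads; the type system flags its uses
immutable int stackSize = 4096; // immutable data is implicitly shared (it can never change)
__gshared int bootHartId;       // shared like a C global, with no extra type-level checks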

I hope this provided a simple introduction to D for bare-metal programming, and that you might consider using D instead of C in a future project as a result. This post has only covered running in a simulated environment. In a future post (now available) I’ll show how to write bare-metal code for the VisionFive 2, a recently released RISC-V SBC produced by StarFive. Stay tuned!

If you want to see a larger example, I am developing an operating system called Multiplix in D. It has support for RISC-V and AArch64, and targets the VisionFive, VisionFive 2, Raspberry Pi 3, and Raspberry Pi 4 (and likely more boards in the future). Check it out! It is still very much a work in progress, but I plan to write a post about it when it is further along.

The code from this post is available in my blog-code repository.


