This post will show you how to use D to write a bare-metal “Hello world” program that targets the RISC-V QEMU simulator. In a future blog post (now available) we’ll build on this to target actual hardware: the VisionFive 2 SBC. See blog-code for the final code from this post. For a more complex example, see Multiplix, an operating system I am developing that runs on the VisionFive 2 (and Raspberry Pis).
Why D?
Recently I’ve been writing bare-metal code in C, and I’ve become a bit frustrated with the lack of features that C provides. I started searching for a good replacement, and revisited D (a language I used for a project a few years ago). It turns out D has introduced a mode called betterC, which essentially disables all language features that require the D runtime. This makes it roughly as easy to use D for bare-metal programming as C. You don’t get all the features of D, but you get enough to cover everything I want (in fact, for systems programming I prefer the betterC subset of D over full D). D in betterC mode is exactly what it sounds like and retains the feel of C. Going forward, I think I will use it in every situation where I would otherwise have used C, even outside bare-metal work.
Here are the positives about D I value most:
- A decent import system (no more header files and #include).
- Automatic bounds checking, and bounded strings and arrays.
- Methods in structs.
- Compile-time code evaluation (run D code at compile-time!).
- Powerful templating and generics.
- Iterators.
- Default support for thread-local storage.
- Scope guards and RAII.
- Some memory safety protections with @safe.
- A fairly comprehensive and readable online specification.
- An active Discord channel with people that answer my questions in minutes.
- Both an LLVM-based compiler (LDC) and a GNU compiler (GDC), which is officially part of the GCC project.
- And these compilers both export roughly the same flags and intrinsics as Clang and GCC respectively.
These features, combined with the lack of a runtime and the C-like feel of the language (making it easy to port previous code), make it a no-brainer for me to have D as the base choice for any project where I would otherwise use C.
Now that I’ve told you about my reasons for choosing D, let’s try using it to write a bare-metal application that targets RISC-V. If you want to follow along, the first step is to download the toolchain (the following tools should work on Linux or macOS). You’ll need three different components:
- LDC 1.30 (the LLVM-based D compiler). Can be downloaded from GitHub. Make sure to use version 1.30.
- A riscv64-unknown-elf GNU toolchain. Can be downloaded from SiFive’s Freedom Tools repository.
- The QEMU RISC-V simulator: qemu-system-riscv64. Can be downloaded from SiFive’s Freedom Tools repository, and is usually also available as part of your system’s QEMU package.
We’ll be using LDC since it ships with the ability to target riscv64. I have used GDC for bare-metal development as well, but it requires building a toolchain from source since nobody ships pre-built riscv64-unknown-elf-gdc binaries. We’ll use the GNU toolchain for assembling, linking, and for other tools like objcopy and objdump, and QEMU for simulating the hardware.
With these installed you should be able to run:
$ ldc2 --version
LDC - the LLVM D compiler (1.30.0):
...
$ riscv64-unknown-elf-ld
riscv64-unknown-elf-ld: no input files
$ qemu-system-riscv64 -h
...
CPU entrypoint
We’re writing bare-metal code, so there’s no operating system, no console, no files, nothing. The CPU just starts executing instructions at a pre-specified address after performing some initial setup. We’ll figure out what that address is later when we set up the linkerscript. For now we can just define the _start symbol as our entrypoint, and assume the linker will place the code at this label at the CPU entrypoint.
A D function requires a valid stack pointer, so before we can execute any D code we need to load the stack pointer register sp with a valid address.
Let’s make a file called start.s and put the following in it:
.section ".text.boot"
.globl _start
_start:
    la sp, _stack_start
    call dstart
_hlt:
    j _hlt
For now let’s assume _stack_start is a symbol with the address of a valid stack, and in the linkerscript we’ll set this up properly. After loading sp, we call a D function called dstart, defined in the next part.
D entrypoint
Now we can define our dstart function in dstart.d. For now we’ll just cause an infinite loop.
module dstart;
extern (C) void dstart() {
while (1) {}
}
Linkerscript
Before we can compile this program we need a bit of linkerscript to tell the linker how our code should be laid out. We’ll need to specify the address where the text section should start (the entry address), and reserve space for all the data sections (.rodata, .data, .bss), and the stack.
Entry address
Today we’ll be targeting the QEMU virt RISC-V machine, so we have to figure out what its entrypoint is.
We can ask QEMU for a list of all devices in the virt machine by telling it to dump its device tree:
$ qemu-system-riscv64 -machine virt,dumpdtb=virt.dtb
$ dtc virt.dtb > virt.dts
In virt.dts you’ll find the following entry:
memory@80000000 {
device_type = "memory";
reg = <0x00 0x80000000 0x00 0x8000000>;
};
This means that RAM starts at address 0x80000000 (everything below is special memory or inaccessible). The CPU entrypoint for the virt machine is the first instruction in RAM, stored at 0x80000000.
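To make the reg property above less cryptic, here is a small sketch in D that decodes it. It assumes the node uses two 32-bit cells for the address and two for the size (which matches the four-cell layout shown); the names cellsToU64, ramBase, and ramSize are made up for this illustration.

```d
// Combines two 32-bit device tree cells (high word, low word) into one 64-bit value.
ulong cellsToU64(uint hi, uint lo) {
    return (cast(ulong) hi << 32) | lo;
}

// reg = <0x00 0x80000000 0x00 0x8000000>, read as <address-hi address-lo size-hi size-lo>.
// Assumption: #address-cells = 2 and #size-cells = 2 for this node.
enum ulong ramBase = cellsToU64(0x00, 0x80000000);
enum ulong ramSize = cellsToU64(0x00, 0x8000000);

// Both values are computed at compile time, so we can check them there too.
static assert(ramBase == 0x8000_0000);
static assert(ramSize == 128 * 1024 * 1024); // 128 MiB of RAM
```

So the entry describes 128 MiB of RAM starting at 0x80000000, and any address we hand to the linker has to fall inside that window.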
In the linkerscript, we need to tell the linker that it should place the _start function at 0x80000000. We do this by telling it to put the .text.boot section first in the .text section, located at 0x80000000. Then we include the rest of the .text sections, followed by read-only data, writable data, and the BSS.
In link.ld:
ENTRY(_start)

SECTIONS
{
    .text 0x80000000 : {
        KEEP(*(.text.boot))
        *(.text*)
    }
    .rodata : {
        . = ALIGN(8);
        *(.rodata*)
        *(.srodata*)
        . = ALIGN(8);
    }
    .data : {
        . = ALIGN(8);
        *(.sdata*)
        *(.data*)
        . = ALIGN(8);
    }
    .bss : {
        . = ALIGN(8);
        _bss_start = .;
        *(.sbss*)
        *(.bss*)
        *(COMMON)
        . = ALIGN(8);
        _bss_end = .;
    }
    .kstack : {
        . = ALIGN(16);
        . += 4K;
        _stack_start = .;
    }
    /DISCARD/ : { *(.comment .note .eh_frame) }
}
What is the BSS?
The BSS is a region of memory that the compiler assumes is initialized to all zeroes. Usually the static data for a program is directly copied into the ELF executable: if you have a string "hello world" in your program, those exact bytes will live somewhere in the binary (in the read-only data section). However, a lot of static data is initialized to zero, so instead of putting those zero bytes directly into the ELF file, the linker lets us save space by making a special section (the BSS) that must be initialized to all zeroes at runtime, but won’t actually contain that data in the ELF file itself. So even if you have a giant 1MB array of zeroes, your ELF binary will be small because that section will be expanded into RAM only when the application starts. Usually the OS sets up the BSS before it launches a program, but we’re running bare-metal, so we have to do that manually in the dstart function (in the next section). To make this initialization possible, we define the _bss_start and _bss_end symbols in the linkerscript. These are symbols whose addresses will be the start and end of the BSS section respectively.
Reserving space for the stack
We also reserve one page for the .kstack section and mark the _stack_start symbol to be located at the end of it (remember the stack grows down). The stack must be 16-byte aligned.
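The arithmetic the linker performs for .kstack can be sketched in a few lines of D. This is only an illustration of what ALIGN(16) followed by a 4K bump does; alignUp, stackSize, and stackTop are hypothetical names, not part of the program we are building.

```d
// Rounds addr up to the next multiple of alignment.
// Assumption: alignment is a power of two, as it is for ALIGN(16).
ulong alignUp(ulong addr, ulong alignment) {
    return (addr + alignment - 1) & ~(alignment - 1);
}

enum ulong stackSize = 4096; // the "4K" from the linkerscript

// Mirrors the .kstack section: align the location counter to 16 bytes,
// reserve one page, and report where _stack_start would end up.
ulong stackTop(ulong locationCounter) {
    return alignUp(locationCounter, 16) + stackSize;
}
```

For example, if everything before the stack ends at 0x80000016, then stackTop returns 0x80001020, which agrees with the _stack_start address in the disassembly shown later in this post. Since the page size is itself a multiple of 16, aligning the bottom of the stack also keeps the top aligned.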
Compile!
Now we have everything we need to compile a basic bare-metal program.
$ ldc2 -Oz -betterC -mtriple=riscv64-unknown-elf -mattr=+m,+a,+c --code-model=medium -c dstart.d
$ riscv64-unknown-elf-as -mno-relax -march=rv64imac start.s -c -o start.o
$ riscv64-unknown-elf-ld -Tlink.ld start.o dstart.o -o prog.elf
Let’s look at some of these flags:
- Oz: optimize aggressively for size.
- betterC: enable betterC mode (disable the built-in D runtime).
- mtriple=riscv64-unknown-elf: build for the riscv64 bare-metal ELF target.
- mattr=+m,+a,+c: enable the following RISC-V extensions: m (multiply/divide), a (atomics), and c (compressed instructions).
- code-model=medium: code models in RISC-V control how pointers to far away locations are constructed. The medium code model (also called medany) allows us to address any symbol located within 2 GiB of the current address, and is recommended for 64-bit programs. See the SiFive post for more information.
- mno-relax: disables linker relaxation in the assembler (it is already disabled by default in LDC). Linker relaxation is a RISC-V-specific optimization that allows the linker to make use of the gp (global pointer) register. I explain it in more detail in the linker relaxation section.
It’s going to get tedious to type out these commands repeatedly, so let’s create a Makefile (or a Knitfile if you’re cool):
SRC=$(wildcard *.d)
OBJ=$(SRC:.d=.o)

all: prog.bin

%.o: %.d
	ldc2 -Oz -betterC -mtriple=riscv64-unknown-elf -mattr=+m,+a,+c,+relax --code-model=medium --makedeps=$*.dep $< -c -of $@

%.o: %.s
	riscv64-unknown-elf-as -march=rv64imac $< -c -o $@

prog.elf: start.o $(OBJ)
	riscv64-unknown-elf-ld -Tlink.ld $^ -o $@

%.bin: %.elf
	riscv64-unknown-elf-objcopy $< -O binary $@

%.list: %.elf
	riscv64-unknown-elf-objdump -D $< > $@

run: prog.bin
	qemu-system-riscv64 -nographic -bios none -machine virt -kernel prog.bin

clean:
	rm -f *.bin *.list *.o *.elf *.dep

-include *.dep
and compile with
$ make prog.bin
The resulting prog.bin file is a raw dump of our program. At this point it clocks in at a whopping 22 bytes.
To see the disassembled program, run
$ make prog.list
...
$ cat prog.list
prog.elf: file format elf64-littleriscv
Disassembly of section .text:
0000000080000000 <_start>:
80000000: 00001117 auipc sp,0x1
80000004: 02010113 addi sp,sp,32 # 80001020 <_stack_start>
80000008: 00000097 auipc ra,0x0
8000000c: 00c080e7 jalr 12(ra) # 80000014 <dstart>
0000000080000010 <_hlt>:
80000010: a001 j 80000010 <_hlt>
...
0000000080000014 <dstart>:
80000014: a001 j 80000014 <dstart>
Looks like our _start function is being linked properly at 0x80000000 and has the expected assembly!
If you try to run with
$ make run
qemu-system-riscv64 -nographic -bios none -machine virt -kernel prog.bin
it will just enter an infinite loop (press Ctrl-A, then X to quit QEMU). We still have a bit more work to do before we get output.
More setup: initializing the BSS
Now let’s modify dstart to initialize the BSS. We need to declare some extern variables so that the linker symbols _bss_start and _bss_end are available to our D code. Then we can just loop from _bss_start to _bss_end and assign all the bytes in that range to zero. Once complete, our BSS is initialized and we can run arbitrary D code (using globals that may be initialized to zero).
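extern (C) {
    // Symbols defined in link.ld; only their addresses are meaningful.
    extern __gshared uint _bss_start, _bss_end;

    void dstart() {
        // Zero the BSS one 32-bit word at a time (both ends are 8-byte aligned).
        uint* bss = &_bss_start;
        uint* bss_end = &_bss_end;
        while (bss < bss_end) {
            *bss++ = 0;
        }

        import main;
        kmain();
    }
}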
And in main.d we have our bare-metal main entrypoint:
module main;
void kmain() {}
Creating a minimal D runtime
Several D language features are unavailable because of our lack of runtime. For example, types such as string and size_t are undefined, and we can’t use assertions (we’ll get to those later). The first step to creating a minimal runtime is to create an object.d file. The D compiler will search for this special file and import it automatically everywhere. So we can create definitions for types like string and size_t here. Here is the minimal definition I like to use, which also defines ptrdiff_t, noreturn, and uintptr.
module object;
alias string = immutable(char)[];
alias size_t = typeof(int.sizeof);
alias ptrdiff_t = typeof(cast(void*) 0 - cast(void*) 0);
alias noreturn = typeof(*null);
static if ((void*).sizeof == 8) {
alias uintptr = ulong;
} else static if ((void*).sizeof == 4) {
alias uintptr = uint;
} else {
static assert(0, "pointer size must be 4 or 8 bytes");
}
Writing to the UART device
Most systems have a UART device. Generally how this works is that you write a byte to a special place in memory, and that byte will be transmitted using the UART protocol over some pins on the board. In order to read the bytes with your host computer you need a UART to USB adapter plugged into your host, and then you can read from the corresponding device file (usually /dev/ttyUSB0) on the host computer. Today we’ll just be simulating our bare-metal code in QEMU, so you don’t need to have a special adapter. QEMU will emulate a UART device and print out the bytes written to its transmit register.
Enabling volatile loads/stores
When writing to device memory it is important to ensure that the compiler does not remove our loads/stores. For example, if a device is located at 0x10000000, we might write directly to that address by casting the integer to a pointer. To the compiler, it just looks like we are writing to random addresses, which might be undefined behavior or result in dead code (e.g., if we never read the value back, the compiler may determine that it can eliminate the write). We need to inform the compiler that these reads/writes of device memory must be preserved and cannot be optimized out. D uses the volatileStore and volatileLoad intrinsics for this.
We can define these in our object.d:
pragma(LDC_intrinsic, "ldc.bitop.vld") ubyte volatileLoad(ubyte* ptr);
pragma(LDC_intrinsic, "ldc.bitop.vld") ushort volatileLoad(ushort* ptr);
pragma(LDC_intrinsic, "ldc.bitop.vld") uint volatileLoad(uint* ptr);
pragma(LDC_intrinsic, "ldc.bitop.vld") ulong volatileLoad(ulong* ptr);
pragma(LDC_intrinsic, "ldc.bitop.vst") void volatileStore(ubyte* ptr, ubyte value);
pragma(LDC_intrinsic, "ldc.bitop.vst") void volatileStore(ushort* ptr, ushort value);
pragma(LDC_intrinsic, "ldc.bitop.vst") void volatileStore(uint* ptr, uint value);
pragma(LDC_intrinsic, "ldc.bitop.vst") void volatileStore(ulong* ptr, ulong value);
Controlling the UART
With that set up, let’s figure out where QEMU’s UART device is located in memory so we can write to it.
The QEMU virt machine defines a number of virtual devices, one of which is a UART device. Looking through the QEMU device tree again in virt.dts, you’ll see the following:
uart@10000000 {
interrupts = <0x0a>;
interrupt-parent = <0x03>;
clock-frequency = <0x384000>;
reg = <0x00 0x10000000 0x00 0x100>;
compatible = "ns16550a";
};
This says that an ns16550a UART device exists at address 0x10000000.
On real hardware the UART would need to be properly initialized by writing some memory-mapped configuration registers (for setting up the baud rate and other options). However, the QEMU device does not require initialization. It emulates an ns16550a device, and writing to its transmit register is enough to cause a byte to be written over the UART (which appears on the console when simulating with QEMU). The transmit register for the ns16550a is the first mapped register, so it is located at 0x10000000.
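On real hardware you would usually also wait until the transmitter is ready before writing each byte. As a rough sketch of what that looks like, the 16550 family exposes a line status register at offset 5 whose bit 5 is set when the transmit holding register is empty. The version below runs on a normal hosted D toolchain, so it takes volatileLoad and volatileStore from core.volatile rather than from our object.d, and it assumes byte-wide registers at consecutive addresses (some SoCs space them further apart). The names txBlocking, lsrOffset, and lsrThre are made up for this example.

```d
import core.volatile : volatileLoad, volatileStore;

enum size_t lsrOffset = 5;  // line status register, relative to the UART base
enum ubyte lsrThre = 0x20;  // bit 5: transmit holding register empty

// Spins until the UART reports that it can accept a byte, then writes it
// to the transmit register (the first register, at offset 0).
void txBlocking(ubyte* base, ubyte b) {
    while ((volatileLoad(base + lsrOffset) & lsrThre) == 0) {
        // busy-wait; the volatile load keeps this read from being optimized away
    }
    volatileStore(base, b);
}
```

Taking the base address as a runtime parameter makes the function easy to exercise against a plain byte array standing in for the device registers. In the bare-metal build the same loop could live inside the UART struct's tx function instead.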
In uart.d:
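module uart;

// Driver for an ns16550a-compatible UART whose registers start at `base`.
struct Ns16550a(ubyte* base) {
    // Transmit one byte by writing it to the transmit register.
    static void tx(ubyte b) {
        volatileStore(base, b);
    }
}

alias Uart = Ns16550a!(cast(ubyte*) 0x10000000);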
Now in kmain, we can test the UART.
module main;
import uart;
void kmain() {
Uart.tx('h');
Uart.tx('i');
Uart.tx('\n');
}
$ make prog.bin
$ qemu-system-riscv64 -nographic -bios none -machine virt -kernel prog.bin
hi
Press Ctrl-A, then X to quit QEMU (the program will enter an infinite loop after returning from kmain).
Making a simple print function
Now we can just wrap the Uart.tx function up with a println function and we’ll have a bare-metal Hello world! in no time.
In object.d:
import uart;
void printElem(char c) {
Uart.tx(c);
}
void printElem(string s) {
foreach (c; s) {
printElem(c);
}
}
void print(Args...)(Args args) {
foreach (arg; args) {
printElem(arg);
}
}
void println(Args...)(Args args) {
print(args, '\n');
}
And in main.d:
void kmain() {
println("Hello world!");
}
$ make prog.bin
$ qemu-system-riscv64 -nographic -bios none -machine virt -kernel prog.bin
Hello world!
There you have it, (simulated) bare-metal hello world!
Some of the initialization we’ve done hasn’t been strictly necessary (we didn’t end up using any variables in the BSS), but it should set you up properly for writing more complex bare-metal programs. The next sections discuss some further steps.
Bonus content
Adding support for assertions and bounds-checking
If you try to use a D assert expression, you might notice that the linking step fails:
riscv64-unknown-elf-ld: dstart.o: in function `_D6dstart5kmainFZv':
dstart.d:(.text+0x3c): undefined reference to `__assert'
It is looking for an __assert function, so let’s create one in the object.d file:
size_t strlen(const(char)* s) {
size_t n;
for (n = 0; *s != '\0'; ++s) {
++n;
}
return n;
}
extern (C) noreturn __assert(const(char)* msg, const(char)* file, int line) {
// convert a char pointer into a bounded string with the [0 .. length] syntax
string smsg = cast(string) msg[0 .. strlen(msg)];
string sfile = cast(string) file[0 .. strlen(file)];
println("fatal error: ", sfile, ": ", smsg);
while (1) {}
}
Now you can use assert statements!
D also supports bounds-checking, and internally the compiler will also call __assert when a bounds check fails. This means we also have working bounds checks now.
Try this in main.d:
void kmain() {
char[10] array;
int x = 12;
println(array[x]);
}
Running it gives
fatal error: main.d: array index out of bounds
Bounds-checked arrays!
This code doesn’t print the line number because that requires converting an int to a string, something left as an exercise to the reader.
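For readers who would like a starting point for that exercise, here is one possible sketch. It formats an int into a caller-provided buffer so that no allocation is needed; formatInt is a hypothetical helper, not something from the code above, and the 11-character buffer size is an assumption based on the widest 32-bit value (a sign plus ten digits).

```d
// Writes the decimal form of `value` into the end of `buf` and returns the
// slice that was used. Assumption: buf.length >= 11.
char[] formatInt(int value, char[] buf) {
    // Work in unsigned space so that negating int.min cannot overflow.
    uint n = value < 0 ? 0u - cast(uint) value : cast(uint) value;
    size_t i = buf.length;
    // Emit digits from least to most significant, filling the buffer backwards.
    do {
        buf[--i] = cast(char) ('0' + n % 10);
        n /= 10;
    } while (n != 0);
    if (value < 0) {
        buf[--i] = '-';
    }
    return buf[i .. $];
}
```

To use it from __assert you would declare a char[11] on the stack, pass a slice of it along with line, and print the result. That last step needs a printElem overload that accepts const(char)[], which the print code above does not define.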
Enabling linker relaxation
Linker relaxation is an optimization in the RISC-V toolchain that allows globals to be accessed through the global pointer (stored in the gp register). This value is a pointer to somewhere in the data section, which allows instructions to load globals by directly offsetting from gp, instead of constructing the address of the global from scratch (which may require multiple instructions on RISC-V).
To enable linker relaxation we have to do three things:
- Modify the linkerscript so that it defines a symbol for the global pointer.
- Load the gp register with this value in the _start function.
- Enable linker relaxation in our compiler.
To modify the linkerscript we just add the following at the beginning of the .rodata section definition:
__global_pointer$ = . + 0x800;
This sets up the __global_pointer$ symbol (a special symbol that the linker assumes is stored in gp) to point 0x800 bytes into the data segment. RISC-V load/store instructions take a 12-bit signed offset, so a single instruction can reach addresses from 2048 bytes below gp to 2047 bytes above it. This allows offsets from gp to cover most/all of static data.
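To make the reach of gp concrete, here is a tiny D sketch of the range check the linker effectively performs when deciding whether a global can be addressed relative to gp. The function name reachableFromGp is invented for illustration, and the addresses in the note below are example values only.

```d
// True if `addr` can be reached from `gp` with one load/store instruction,
// i.e. the distance fits in a 12-bit signed immediate.
bool reachableFromGp(ulong gp, ulong addr) {
    long offset = cast(long) (addr - gp);
    return offset >= -2048 && offset <= 2047;
}
```

With gp placed 0x800 bytes past the start of .rodata, the reachable window begins exactly at the start of .rodata and spans 4 KiB. For instance, if .rodata began at 0x80000018, gp would be 0x80000818 and every global from 0x80000018 up to 0x80001017 would be one instruction away.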
Next we add to _start:
.option push
.option norelax
la gp, __global_pointer$
.option pop
We need to temporarily enable the norelax option, otherwise this instruction will itself be relaxed into mv gp, gp.
Finally, we can remove the -mno-relax flag from the riscv64-unknown-elf-as invocation, and add -mattr=+m,+a,+c,+relax to the ldc2 invocation to enable linker relaxation in the compiler.
Removing unused functions
If you take a look at the disassembly of the program (make prog.list), you might notice there are definitions for functions that are never called. This is because those functions have been inlined, but the definitions were not removed. Functions/globals in D are always exported in the object file, even if they are marked private (I’m not really sure why). Luckily modern linkers can be pretty smart and it’s easy to have the linker remove these unused functions. Pass --function-sections and --data-sections to LDC to have it put each function/global in its own section (still within .text, .data etc.). Now if you pass the --gc-sections flag to the linker, it will remove any unreferenced sections (hence removing any unused functions/globals). With these flags I got the final “hello world” binary down to 160 bytes.
This is a basic form of optimization performed by the linker. There are more advanced forms of link-time optimization (LTO), which I won’t discuss in much detail. If you pass -flto=thin or -flto=full to LDC, the object files that it generates will be LLVM bitcode. Then you will need to invoke the linker with the LLVMgold linker plugin (or use LLD) so that it can read these files. With this method, the linker will apply full compiler optimizations across object files.
Thread-local storage and globals
Globals are thread-local by default in D. That means if you declare a global as int x; then whenever you access x, the compiler will do so through the system’s thread pointer (on RISC-V this is stored in the tp register). That means if you use a thread-local variable, you had better make sure tp points to a block of memory where x is located, and if you have multiple threads each thread’s tp should point to a distinct thread-local block (each thread will have its own private copy of x). I won’t explain in detail how to set that up here, but briefly, you’ll need to initialize the .tdata and .tbss sections for each thread in dstart, and load tp with a pointer to the current thread’s local .tdata.
To make a global shared across all threads, you need to mark it as immutable or shared. A variable marked as shared imposes some limits, and basically forces you to mark everything it touches as shared. You can still read/write it without checks, but at least you should be able to easily know if you are accessing a shared variable (and manually verify you have the appropriate synchronization). In a future version of D it is likely that directly accessing a shared variable will be disallowed, except through atomic intrinsics. If you have a lock to protect the variable, then you will need to cast away the shared qualifier manually, which isn’t perfect but forces the programmer to acknowledge the possible unsafety of accessing the shared global. You can always use the __gshared attribute as an escape hatch, which makes the global shared but does not make any changes to the type (no limitations). A global marked as __gshared is equivalent to a C global.
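The difference between the default and __gshared is easiest to see on a hosted system, where the D runtime sets up thread-local storage for us. The sketch below is not bare-metal code (it uses core.thread, which needs the full runtime), and the names perThread, everyone, and observeAfterWorker are invented for the example.

```d
import core.thread : Thread;

int perThread;            // thread-local by default: each thread gets its own copy
__gshared int everyone;   // a single copy shared by all threads, like a C global

// Runs a worker thread that writes to both globals, then returns what the
// calling thread sees afterwards as [perThread, everyone].
int[2] observeAfterWorker() {
    perThread = 0;
    everyone = 0;
    auto worker = new Thread({
        perThread = 1;   // only changes the worker's private copy
        everyone = 1;    // changes the one copy that every thread sees
    });
    worker.start();
    worker.join();
    return [perThread, everyone];
}
```

The caller still sees 0 in perThread because the worker wrote to its own copy, while the write to everyone is visible across threads. On bare metal the same semantics apply, except that we are the ones responsible for pointing tp at each thread's private block.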
I hope this provided a simple introduction to D for bare-metal programming, and that you might consider using D instead of C in some future project as a result. This post has only covered running in a simulated environment. In a future post I’ll show how to write bare-metal code for the VisionFive 2, a recently released RISC-V SBC produced by StarFive. Stay tuned! (now available)
If you want to see a larger example, I am developing an operating system called Multiplix in D. It has support for RISC-V and AArch64, and targets the VisionFive, VisionFive 2, Raspberry Pi 3, and Raspberry Pi 4 (and likely more boards in the future). Check it out! It is currently heavily in-progress, but I plan to make a post about it when it is further along.
The code from this post is available in my blog-code repository.