Sunday, December 27, 2020

Cosmopolitan Libc: build-once run-anywhere C library

Cosmopolitan makes C a build-once run-anywhere language, similar to Java, except it doesn't require interpreters or virtual machines be installed beforehand. Cosmo provides the same portability benefits as high-level languages like Go and Rust, but it doesn't invent a new language and you won't need to configure a CI system to build separate binaries for each operating system. What Cosmopolitan focuses on is fixing C by decoupling it from platforms, so it can be pleasant to use for writing small unix programs that are easily distributed to a much broader audience.

Getting Started

Assuming you have GCC on Linux, all you need are the five additional files named in the compile command below (ape.lds, cosmopolitan.h, crt.o, ape.o, and cosmopolitan.a):

# create simple c program on command line
echo '
  int main() {
    printf("hello world\n");
  }
' >hello.c

# run gcc compiler in freestanding mode
gcc -g -Os -static -fno-pie -nostdlib -nostdinc -o hello.com hello.c \
  -Wl,--oformat=binary -Wl,--gc-sections -Wl,-z,max-page-size=0x1000 \
  -Wl,-T,ape.lds -include cosmopolitan.h crt.o ape.o cosmopolitan.a

# ~40kb static binary (can be ~16kb w/ MODE=tiny)
./hello.com

The above command configures GCC so it outputs portable binaries that run on every Linux distro as well as Mac OS X, Windows NT, FreeBSD, and OpenBSD. For details on how this works, please read the αcτµαlly pδrταblε εxεcµταblε blog post. This novel binary format is also optional: conventional ELF binaries can be compiled instead by removing the -Wl,--oformat=binary flag.

Your program will also boot on bare metal. In other words, you've written a normal textbook C program, and thanks to Cosmopolitan's low-level linker magic, you've effectively created your own operating system which happens to run on all the existing ones as well. Now that's something no one's done before.

Performance

Cosmopolitan has been optimized by hand for excellent performance on modern desktops and servers. Compared with glibc, you should expect Cosmopolitan to be almost as fast, but with an order of magnitude smaller code size. Compared with Musl or Newlib, you can expect Cosmopolitan to generally go much faster, while having roughly the same code size, if not smaller.

In the case of the most important libc function, memcpy(), Cosmopolitan outperformed every other open source library tested. The chart below shows how quickly memory is transferred depending on the size of the copy. Since it's log scale, each grid square represents a 2x difference in performance. What makes Cosmopolitan so fast here is that it uses several different memory copying strategies. For small sizes it uses an indirect branch with overlapping moves; for medium sizes it uses SIMD vectors; and for large copies it uses nontemporal hints, which prevent cache thrashing. Other libraries usually fall short because they use a one-size-fits-all strategy. For example, Newlib goes 10x slower for the optimal block size (half L1 cache) because it always does nontemporal moves.

memcpy() performance for varying n values
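
The size-class idea described above can be sketched in portable C. This is only an illustration under stated assumptions: the names sketch_memcpy and copy_8_to_16 and the threshold SKETCH_SMALL_MAX are hypothetical, the byte loop stands in for the indirect-branch cases, and the medium and large paths simply defer to the platform memcpy rather than showing real SIMD or nontemporal code. It is not Cosmopolitan's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical threshold, chosen only for illustration. */
#define SKETCH_SMALL_MAX 16

/* Copies 8..16 bytes with two 8-byte moves that may overlap in the
   middle. Both loads happen before either store. With optimization,
   fixed-size memcpy calls like these usually become plain moves. */
static void copy_8_to_16(char *d, const char *s, size_t n) {
  uint64_t head, tail;
  assert(n >= 8 && n <= 16);
  memcpy(&head, s, 8);         /* first 8 bytes */
  memcpy(&tail, s + n - 8, 8); /* last 8 bytes; overlaps head if n < 16 */
  memcpy(d, &head, 8);
  memcpy(d + n - 8, &tail, 8);
}

/* Picks a copying strategy based on the size of the copy. */
void *sketch_memcpy(void *dest, const void *src, size_t n) {
  char *d = dest;
  const char *s = src;
  if (n < 8) {
    /* tiny: a byte loop stands in for the per-size branch targets */
    for (size_t i = 0; i < n; ++i) d[i] = s[i];
  } else if (n <= SKETCH_SMALL_MAX) {
    copy_8_to_16(d, s, n);
  } else {
    /* medium and large: placeholder for SIMD and nontemporal paths */
    memcpy(d, s, n);
  }
  return dest;
}
```

The overlapping-move trick is what lets one code path handle every length from 8 to 16 without a loop: an 11-byte copy, for instance, writes bytes 0..7 and then bytes 3..10.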

Trickle-Down Performance

Performing the best on benchmarks isn't enough. Cosmopolitan also uses a second technique that the above benchmark doesn't measure, which we call "trickle-down performance". For an example of how that works, consider the following fact about C that's often overlooked. External function calls such as the following:

memcpy(foo, bar, n);

are roughly equivalent to the following assembly, which leads compilers to assume that most CPU state is clobbered:

asm volatile("call memcpy"
             : "=a"(rax), "=D"(rdi), "=S"(rsi), "=d"(rdx)
             : "1"(foo), "2"(bar), "3"(n)
             : "rcx", "r8", "r9", "r10", "r11", "memory", "cc",
               "xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5", "xmm6");

In other words, the compiler assumes that, in calling the function, the sixteen registers listed above and all memory will be overwritten. See the System V ABI for further details. This can be problematic for frequently-called functions such as memcpy, since it inhibits many optimizations and tosses a wrench in the compiler's register allocation algorithm, thus causing stack spillage which further degrades performance while bloating the output binary size.

So what Cosmopolitan does for memcpy() and many other frequently-called core library leaf functions is define a simple macro wrapper that tells the compiler the correct subset of the ABI that's actually needed, e.g.

#define memcpy(DEST, SRC, N) ({       \
  void *Dest = (DEST);                \
  void *Src = (SRC);                  \
  size_t Size = (N);                  \
  asm("call memcpy"                   \
      : "=m"(*(char(*)[Size])(Dest))  \
      : "D"(Dest), "S"(Src), "d"(Size), \
        "m"(*(char(*)[Size])(Src))    \
      : "rcx", "xmm3", "xmm4", "cc"); \
    Dest;                             \
  })

What this means is that Cosmopolitan's memcpy() is not simply fast; it also makes unrelated code in the functions that call it faster as a side effect. When this technique was first implemented for memcpy() alone, many of the functions in the Cosmopolitan codebase had their generated code size reduced by a third.

For an example of one such function, consider strlcpy, which is the BSD way of saying strcpy:

/**
 * Copies string, the BSD way.
 *
 * @param d is buffer which needn't be initialized
 * @param s is a NUL-terminated string
 * @param n is byte capacity of d
 * @return strlen(s)
 * @note d and s can't overlap
 * @note we prefer memccpy()
 */
size_t strlcpy(char *d, const char *s, size_t n) {
  size_t slen, actual;
  slen = strlen(s);
  if (n) {
    actual = MIN(n, slen);
    memcpy(d, s, actual);
    d[actual] = '\0';
  }
  return slen;
}
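
One caveat about the listing above: when strlen(s) is at least n, it copies n bytes and then stores the terminator at d[n], one byte past the stated capacity. Here is a minimal sketch of a bounds-respecting variant, assuming the intent is the BSD strlcpy contract (copy at most n - 1 bytes, always terminate when n is nonzero). The name strlcpy_bounded is hypothetical and the code is not taken from Cosmopolitan's source. The assembly discussed next was generated from the original listing, not from this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical bounds-respecting variant of the listing above. */
size_t strlcpy_bounded(char *d, const char *s, size_t n) {
  size_t slen, actual;
  assert(s != NULL);
  slen = strlen(s);
  if (n) {
    /* leave room for the terminator: copy at most n - 1 bytes */
    actual = slen < n - 1 ? slen : n - 1;
    memcpy(d, s, actual);
    d[actual] = '\0';
  }
  return slen; /* a result >= n tells the caller the copy was truncated */
}
```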

If we compile our strlcpy function, then here's the assembly code that the compiler outputs:

/ compiled with traditional libc
strlcpy:
        push       %rbp
        mov        %rsp,%rbp
        push       %r14
        mov        %rsi,%r14
        push       %r13
        mov        %rdi,%r13
        mov        %rsi,%rdi
        push       %r12
        push       %rbx
        mov        %rdx,%rbx
        call       strlen
        mov        %rax,%r12
        test       %rbx,%rbx
        jne        1f
        pop        %rbx
        mov        %r12,%rax
        pop        %r12
        pop        %r13
        pop        %r14
        pop        %rbp
        ret
1:      cmp        %rbx,%rax
        mov        %r14,%rsi
        mov        %r13,%rdi
        cmovbe     %rax,%rbx
        mov        %rbx,%rdx
        call       memcpy
        movb       $0,0(%r13,%rbx)
        mov        %r12,%rax
        pop        %rbx
        pop        %r12
        pop        %r13
        pop        %r14
        pop        %rbp
        ret
        .endfn     strlcpy,globl
/ compiled with cosmopolitan libc
strlcpy:
        mov        %rdx,%r8
        mov        %rdi,%r9
        mov        %rsi,%rdi
        call       strlen
        test       %r8,%r8
        je         1f
        cmp        %r8,%rax
        mov        %r8,%rdx
        mov        %r9,%rdi
        cmovbe     %rax,%rdx
        call       MemCpy
        movb       $0,(%r9,%rdx)
1:      ret
        .endfn     strlcpy,globl

That's a huge improvement in generated code size. The above two compiles used the same gcc flags, and no changes to the code were needed. All that changed was that we used cosmopolitan.h (instead of the platform C library's string.h), which contains ABI specialization macros for memcpy and strlen. It's a great example of how merely choosing a better C library can systemically eliminate bloat throughout your entire codebase.

Feedback
jtunney@gmail.com


from Hacker News https://ift.tt/34N3qOY
