Sunday, March 20, 2022

Thoughts on Bootstrapping GHC (2018)

I am returning from the reproducible builds summit 2018 in Paris. The latest hottest thing within the reproducible-builds project seems to be bootstrapping: How can we build a whole operating system from just and only source code, using very little, or even no, binary seeds or auto-generated files. This is actually concern that is somewhat orthogonal to reproducibility: Bootstrappable builds help me in trusting programs that I built, while reproducible builds help me in trusting programs that others built.

And while they make good progress bootstrapping a full system from just a C compiler written in Scheme, and a Scheme interpreter written in C, that can build each other (Janneke’s mes project), and there are plans to build that on top of stage0, which starts with a 280 bytes of binary, the situation looks pretty bad when it comes to Haskell.

Unreachable GHC

The problem is that contemporary Haskell has only one viable implementation, GHC. And GHC, written in contemporary Haskell, needs GHC to be build. So essentially everybody out there either just downloads a binary distribution of GHC. Or they build GHC from source, using a possibly older (but not much older) version of GHC that they already have. Even distributions like Debian do nothing different: When they build the GHC package, the builders use, well, the GHC package.

There are other Haskell implementations out there. But if they are mature and active developed, then they are implemented in Haskell themselves, often even using advanced features that only GHC provides. And even those are insufficient to build GHC itself, let alone the some old and abandoned Haskell implementations.

In all these cases, at some point an untrusted binary is used. This is very unsatisfying. What can we do? I don’t have the answers, but please allow me to outline some venues of attack.

Retracing history

Obviously, even GHC does not exist since the beginning of time, and the first versions surely were built using something else than GHC. The oldest version of GHC for which we can find a release on the GHC web page is version 0.29 from July 1996. But the installation instructions write:

GHC 0.26 doesn’t build with HBC. (It could, but we haven’t put in the effort to maintain it.)

GHC 0.26 is best built with itself, GHC 0.26. We heartily recommend it. GHC 0.26 can certainly be built with GHC 0.23 or 0.24, and with some earlier versions, with some effort.

GHC has never been built with compilers other than GHC and HBC.

So it seems that besides GHC, only ever HBC was used to compiler GHC. HBC is a Haskell compiler where we find the sources of one random version only thanks to archive.org. Parts of it are written in C, so I looked into this: Compile HBC, use it to compile GHC-0.29, and then step for step build every (major) version of GHC until today.

The problem is that it is non-trivial to build software from the 90s using today’s compilers. I briefly looked at the HBC code base, and had to change some files from using varargs.h to stdargs.v, and this is surely just one of many similar stumbling blocks trying to build that tools. Oh, and even the hbc source state

# To get everything done: make universe
# It is impossible to make from scratch.
# You must have a running lmlc, to
# recompile it (of course).

So I learned that actually, most of it is written in LML, and the LML compiler is written in LML. So this is a dead end. (Thanks to Lennart for clearing up a misunderstanding on my side here.

Going back, but doing it differently

Another approach is to go back in time, to some old version of GHC, but maybe not all the way to the beginning, and then try to use another, officially unsupported, Haskell compiler to build GHC. This is what rekado tried to do in 2017: He use the most contemporary implementation of Haskell in C, the Hugs interpreter. Using this, he compiled nhc98 (yet another abandoned Haskell implementation), with the hope of building GHC with nhc98. He made impressive progress back then, but ran into a problem where the runtime crashed. Maybe someone is interested in picking up up from there?

Removing, simplifying, extending, in the present.

Both approaches so far focus on building an old version of GHC. This adds complexity: other tools (the shell, make, yacc etc.) may behave different now in a way that causes hard to debug problems. So maybe it is more fun and more rewarding to focus on today’s GHC? (At this point I am starting to hypothesize).

I said before that no other existing Haskell implementation can compile today’s GHC code base, because of features like mutually recursive modules, the foreign function interface etc. And also other existing Haskell implementations often come with a different, smaller set of standard libraries, but GHC assumes base, so we would have to build that as well…

But we don’t need to build it all. Surely there is much code in base that is not used by GHC. Also, much code in GHC that we do not need to build GHC, and . So by removing that, we reduce the amount of Haskell code that we need to feed to the other implementation.

The remaining code might use some features that are not supported by our bootstrapping implementation. Mutually recursive module could be manually merged. GADTs that are only used for additional type safety could be replaced by normal ones, which might make some pattern matches incomplete. Syntactic sugar can be desugared. By simplifying the code base in that way, one might be able a fork of GHC that is within reach of the likes of Hugs or nhc98.

And if there are features that are hard to remove, maybe we can extend the bootstrapping compiler or interpreter to support them? For example, it was mostly trivial to extend Hugs with support for the # symbol in names – and we can be pragmatic and just allow it always, since we don’t need a standards conforming implementation, but merely one that works on the GHC code base. But how much would we have to implement? Probably this will be more fun in Haskell than in C, so maybe extending nhc98 would be more viable?

Help from beyond Haskell?

Or maybe it is time to create a new Haskell compiler from scratch, written in something other than Haskell? Maybe some other language that is reasonably pleasant to write a compiler in (Ocaml? Scala?), but that has the bootstrappability story already sorted out somehow.

But in the end, all variants come down to the same problem: Writing a Haskell compiler for full, contemporary Haskell as used by GHC is hard and really a lot of work – if it were not, there would at least be implementations in Haskell out there. And as long as nobody comes along and does that work, I fear that we will continue to be unable to build our nice Haskell ecosystem from scratch. Which I find somewhat dissatisfying.



from Hacker News https://ift.tt/cLuwviK

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.