Saturday, July 18, 2020

Optimizing Ray Tracing in Haskell

My first Haskell program and how easily I optimized it from 33m to 17s.

A 1200x675 scene containing 533 objects, rendered with 100 samples and 5000 rays per pixel.

Background

A few weeks back, my colleague at work, Eyal Kalerdon, shared his implementation of ray-tracing-in-one-weekend in Rust, which inspired me to try this as well. I, however, chose Haskell, thinking I would learn not only ray tracing techniques but a new language as well. Haskell has been one of my favorite languages for many years, but I never got an opportunity to actually write code in it.

Anyway, the implementation is pretty straightforward, as my only job was to port the C++ implementation from the article (which we both followed) to Haskell.

Parameters And Assumptions

In this article, I’ll execute a program called ray-tracers, which is parameterized by 3 arguments passed on the command line; all other parameters (like aspectRatio and the camera settings) are hardcoded in the program. The executable ray-tracers takes the following arguments:

$ ./ray-tracers <image-width> <samples-per-pixel> <rays-per-sample>

We’ll try different values of image-width in all the experiments, keeping the other two parameters fixed at 100 and 50 respectively. Please note that image-height is computed by the program as:

imageHeight = <image-width> / aspectRatio
  where aspectRatio = 16.0 / 9.0  -- hardcoded in the program
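In Haskell, that derivation looks roughly like the sketch below (names are illustrative; the real program hardcodes aspectRatio and, as the 40x22 example later shows, rounds the height down to an integer):

aspectRatio :: Double
aspectRatio = 16.0 / 9.0

imageHeight :: Int -> Int
imageHeight imageWidth = floor (fromIntegral imageWidth / aspectRatio)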

The First (Naive) Version

The first version, ray-tracers-0.1.0, was naive in the sense that I consciously didn’t make any attempt to optimize any part of the code (though it is compiled with -O2). For example, Vec is implemented in terms of a list simply because standard list operations can be used to implement the other functionality quickly (see the sketch below), and so on.
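To give a flavor of what that looks like, here is an illustrative sketch of a list-backed Vec (not the exact code from the repository):

newtype Vec = Vec [Double]

-- Component-wise operations fall out of standard list functions almost for free.
addV :: Vec -> Vec -> Vec
addV (Vec xs) (Vec ys) = Vec (zipWith (+) xs ys)

dotV :: Vec -> Vec -> Double
dotV (Vec xs) (Vec ys) = sum (zipWith (*) xs ys)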

I made these design decisions to see how well the naive implementation performs and to get an idea of the changes required to make the program perform really really well. I also wanted to compare the readability of the performant code with the naive implementation. So let’s see how well it performs.

Kaboom! It turns out that our naive program hit the default stack size limit with image-width set to 100, and I found that GHC has an open bug which means that +RTS -K2M -RTS won’t really increase the stack size to 2M. So we have to work within this limit. Let’s run it a couple of times with lower image widths.

As we can see, the pixels-per-second statistic stays roughly the same for different values of image-width. Based on that, we can extrapolate that a 384x216 image would take roughly 33 minutes, which is too bad, even though the program is compiled with -O2. So in the rest of the article, we’ll see how this scenario improves. 🤞
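For a sense of scale, a 384x216 frame has 384 * 216 = 82,944 pixels, so 33 minutes (roughly 1,980 seconds) works out to only about 42 pixels per second.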

Our Goal

Our goal is to optimize the code gradually, step by step, and see the results at each stage; I’ll also share the commit so that you can see the diff at each stage. We’ll use 384 100 50 as the command-line arguments to our program when comparing performance. However, for inspection and other purposes, we might use smaller sizes so that the program runs quickly.

Machine and Environment

I’ll run all the tests on my MacBook Pro, which has the following configuration:

Apart from that, I use GHC 8.6.5 and cabal 3.2.2.0.

Use RTS to inspect the memory usage

One obvious reason why the program is slow is that it’s currently single-threaded. But it’s not very wise to make a program multi-threaded simply because its single-threaded performance is poor. If the single-threaded program is consuming too much memory, then the multi-threaded one will consume even more memory in a shorter time span, further degrading performance.

So I added -rtsopts to ghc-options, which enables us to control the behavior of the runtime system (RTS) when we run the program. I then rebuilt the program with -rtsopts and ran it, passing +RTS -s -RTS as arguments to the runtime system; the remaining arguments were passed to our program. The -s option tells the RTS to print some useful stats on the terminal itself, as we can see below.
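For example, an invocation for a small image would look something like this (the exact image width varies between runs):

$ ./ray-tracers +RTS -s -RTS 40 100 50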

The most interesting piece of information is this line: 1653 MB total memory in use. This is the actual memory consumed by our program. 1653 MB is too high for a 40x22 image, isn’t it? Apart from that, also note that more than 4s is spent in GC, which is almost 20% of the total time taken by the program (20s). That actually makes sense: the more memory you consume, the more work the GC has to do to reclaim it.

Laziness can be a curse sometimes: Thunks

Haskell is a non-strict language: most expressions are not evaluated right away. Instead, the compiler creates thunks to represent the unevaluated expressions, as a result of which composite expressions, such as a + b * c - d, do not reduce to their values. Rather, they build graph-like structures that compose further into even bigger graphs, consuming excessive memory and slowing down the program in the process. That explains the 1653 MB memory usage figure we saw in the previous section.
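A classic illustration of thunk buildup, not taken from the ray tracer, is summing a list with a lazy left fold:

import Data.List (foldl')

lazySum, strictSum :: [Double] -> Double
lazySum   = foldl  (+) 0   -- the accumulator becomes a chain of (+) thunks
strictSum = foldl' (+) 0   -- the accumulator is forced at every step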

Anyway, lazy computation is good when it is not known whether the result will eventually be used. But if we know that a result will be used, then laziness can be a curse. In our case, it is!

So let’s force expressions to evaluate fully, so that we don’t consume too much memory. To do that, I used the deepseq library, which provides many utilities, one of which is aptly named force. It is simple to use: just put force in front of the expressions you want to evaluate.
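As a sketch with illustrative names (not the exact code from the repo), forcing a vector addition could look like this; note that a user-defined type such as Vec needs an NFData instance, which can be derived via Generic:

{-# LANGUAGE DeriveGeneric, DeriveAnyClass #-}

import Control.DeepSeq (NFData, force)
import GHC.Generics (Generic)

data Vec = Vec [Double]
  deriving (Generic, NFData)

-- 'force' fully evaluates the result, so a value rather than a thunk is stored.
addForced :: Vec -> Vec -> Vec
addForced (Vec xs) (Vec ys) = force (Vec (zipWith (+) xs ys))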


