GPU-Motunui
03 Oct 2020

Disney Animation's Moana island dataset is a production-scale scene with memory requirements that make it challenging to render. This post summarizes some of those challenges, and describes how the GPU-Motunui project is able to efficiently render the scene on a consumer-grade GPU with less than 8GB of memory.
The Moana island
In 2018, Disney Animation released the Moana island dataset to the rendering research community. Compared to traditional research scenes, the scale of the Moana island scene is massive: the scene contains 90 million quad primitives, 5 million curves, and more than 28 million instances. All told, the island consists of over 15 billion primitives, weighing in at just under 30GB of geometry files.
The shots included with the dataset are beautiful, and showcase the amazing imagery that can be created by combining the best artists in the world with path tracing techniques and modern hardware. Here are two reference images, rendered with Disney’s proprietary Hyperion renderer:
Hyperion shotCam reference
Hyperion beachCam reference
GPU-Motunui Project
The goal of the GPU-Motunui project is to render all of the Moana shots efficiently and accurately on a consumer-grade graphics card. There are two main challenges. First, a typical graphics card has only 8GB of memory, so an out-of-core rendering solution is required to handle the large amount of geometry. Second, the scene's textures are provided in the Ptex format, and Ptex has no publicly available CUDA implementation. This project currently solves only the first problem; Ptex texture lookups are done on the CPU (although, conveniently, their cost is fully hidden by running them concurrently with GPU shadow ray tracing).
The Hyperion reference images are impossible to match exactly; for example, the varying brown and green colors along the palm tree fronds in the palmsCam shot are not provided in the dataset. Other features of the scene are possible to render but out of my initial scope, notably subdivision surfaces with their displacement maps, and a full Disney BSDF implementation.
Example of an unreproducible material variation on the palm tree frond
All ray tracing operations are run through Nvidia’s OptiX 7 API. This means GPU-Motunui gets the full benefits of available RT cores and a world-class BVH implementation. The following sections describe how GPU-Motunui maps dataset assets to OptiX data structures, and how GPU-Motunui’s out-of-core rendering solution works.
Scene representation
The Moana scene makes widespread use of multi-level instancing. In OptiX, this requires managing a three-level hierarchy of acceleration structures: two levels of IASs and a base level of GASs (Instance Acceleration Structures and Geometry Acceleration Structures, respectively). GPU-Motunui uses OptiX's AS compaction and relocation APIs to further reduce memory usage.
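As a rough sketch of the compaction flow in OptiX 7 (the `d_*` buffers and sizes are hypothetical; in practice they come from `optixAccelComputeMemoryUsage` and the allocator described later), a build emits its compacted size and is then copied into a tighter buffer:

```cpp
// Minimal sketch of OptiX 7 AS compaction; buffer names are illustrative.
// Build with the compaction flag and ask OptiX to emit the compacted size.
OptixAccelBuildOptions options = {};
options.buildFlags = OPTIX_BUILD_FLAG_ALLOW_COMPACTION;
options.operation  = OPTIX_BUILD_OPERATION_BUILD;

OptixAccelEmitDesc emitted = {};
emitted.type   = OPTIX_PROPERTY_TYPE_COMPACTED_SIZE;
emitted.result = d_compactedSize; // device pointer to a size_t

OptixTraversableHandle handle = 0;
optixAccelBuild(context, stream, &options, buildInputs, numBuildInputs,
                d_temp, tempBytes, d_output, outputBytes, &handle,
                &emitted, 1);

// Copy the emitted size back, then compact the AS into a right-sized buffer.
size_t compactedBytes = 0;
cudaMemcpy(&compactedBytes, reinterpret_cast<void*>(d_compactedSize),
           sizeof(size_t), cudaMemcpyDeviceToHost);
optixAccelCompact(context, stream, handle, d_compacted, compactedBytes, &handle);
```

Relocation works along similar lines: a compacted AS can be copied to a new device address and patched with `optixAccelRelocate`, which is what makes the snapshot scheme described later possible.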
The isHibiscus element is a good example of how a typical element in the scene is organized and built. The tree is assembled from a base model in one Wavefront .obj file (containing the trunk and branches), and four primitives: one flower and three leaf models (each with its own .obj file).
Left: The four simple primitives that will be instanced to fill out the hibiscus tree
Right: The base trunk and branches model
In OptiX, each of these models has an associated GAS, and each GAS can be subdivided into multiple build inputs. Build inputs are used to map sections of the model to information needed at shading time by indexing into OptiX’s shader binding table. These GASs form the bottom level of the hierarchy.
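A hedged sketch of what one such build input looks like (the buffer variables are hypothetical, and since OptiX 7 traces triangles natively, the dataset's quads are assumed to be split into triangles here):

```cpp
// Illustrative sketch: one build input per model section, so each section
// can index its own shader binding table record.
unsigned int geometryFlags = OPTIX_GEOMETRY_FLAG_NONE;

OptixBuildInput input = {};
input.type = OPTIX_BUILD_INPUT_TYPE_TRIANGLES;
input.triangleArray.vertexFormat     = OPTIX_VERTEX_FORMAT_FLOAT3;
input.triangleArray.vertexBuffers    = &d_vertices;   // CUdeviceptr
input.triangleArray.numVertices      = numVertices;
input.triangleArray.indexFormat      = OPTIX_INDICES_FORMAT_UNSIGNED_INT3;
input.triangleArray.indexBuffer      = d_indices;     // quads pre-triangulated
input.triangleArray.numIndexTriplets = numTriangles;
input.triangleArray.flags            = &geometryFlags;
input.triangleArray.numSbtRecords    = 1; // this section -> one SBT record
// Several such inputs are passed together to optixAccelBuild to form one GAS.
```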
Next, an IAS is used to build the full isHibiscus element. This IAS is in the middle level of the hierarchy. The figure below shows each primitive’s instances in isolation, and combined to make the full element:
Left: Isolated instances for each primitive
Right: Full isHibiscus element
Finally, a second IAS is built to track all of the element’s instances present in the scene. This second IAS is the top level of the instance hierarchy.
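A sketch of what that top-level build input might look like (the transform array and handles are hypothetical); each `OptixInstance` pairs a row-major 3x4 transform with the element IAS's traversable handle:

```cpp
// Illustrative sketch: one OptixInstance per placement of the element.
std::vector<OptixInstance> instances(numPlacements);
for (size_t i = 0; i < instances.size(); ++i) {
    OptixInstance inst = {};
    memcpy(inst.transform, objectToWorld[i], sizeof(float) * 12); // 3x4 matrix
    inst.instanceId        = static_cast<unsigned int>(i);
    inst.sbtOffset         = elementSbtOffset; // element's first SBT record
    inst.visibilityMask    = 255;
    inst.flags             = OPTIX_INSTANCE_FLAG_NONE;
    inst.traversableHandle = elementIAS; // middle-level IAS built above
    instances[i] = inst;
}

OptixBuildInput input = {};
input.type = OPTIX_BUILD_INPUT_TYPE_INSTANCES;
input.instanceArray.instances    = d_instances; // instances uploaded to the GPU
input.instanceArray.numInstances = static_cast<unsigned int>(instances.size());
```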
The shotCam view rendered with only isHibiscus instances
Although the isHibiscus element has a typical structure, there are some more complicated elements included in the dataset. The isCoral element, for example, has different base geometry and instanced primitives for each of its element instances, but the underlying primitive geometries are shared across all the element instances.
The Moana GASs and IASs alone require 18.5GB, well past the memory budget of my 8GB RTX 2070. Because OptiX has no native support for out-of-core rendering, the traditional OptiX pipeline had to be put aside for a custom solution.
Out-of-core rendering
To solve the out-of-core rendering problem, GPU-Motunui divides the scene's geometry into sections and ray traces each separately while tracking the closest hit. Replacing a traditional device trace call with a host loop has design consequences throughout the renderer, from asset loading to the core path tracing loop that sends rays through the scene.
Before rendering, the asset loading process allocates a large chunk of GPU memory (currently 6.7GB), managed by a custom allocator. The allocator serves two types of requests: output and temporary. Output memory is allocated from the left end of the block and is used for OptiX structures; temporary memory is managed as a stack growing from the right end. Managing temporary memory this way ensures that the output structures are always tightly packed.
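Here is a minimal sketch of such a two-ended arena (the class and its names are hypothetical; OptiX accel buffers want 128-byte alignment, hence the default):

```cpp
#include <cassert>
#include <cstddef>
#include <cuda.h> // CUdeviceptr

// Output allocations grow from the left and persist; temporary allocations
// form a stack on the right and are popped wholesale after each element.
struct TwoEndedArena {
    CUdeviceptr base;  // start of the ~6.7GB device block
    size_t capacity;
    size_t outputEnd;  // next free byte from the left
    size_t tempBegin;  // first used byte from the right

    TwoEndedArena(CUdeviceptr b, size_t cap)
        : base(b), capacity(cap), outputEnd(0), tempBegin(cap) {}

    static size_t alignUp(size_t x, size_t a) { return (x + a - 1) / a * a; }

    // Output memory for OptiX structures.
    CUdeviceptr allocOutput(size_t bytes, size_t align = 128) {
        size_t start = alignUp(outputEnd, align);
        assert(start + bytes <= tempBegin); // the two ends must not collide
        outputEnd = start + bytes;
        return base + start;
    }

    // Temporary build memory, stacked from the right end of the block.
    CUdeviceptr allocTemp(size_t bytes, size_t align = 128) {
        size_t start = (tempBegin - bytes) / align * align; // round down
        assert(start >= outputEnd);
        tempBegin = start;
        return base + start;
    }

    void releaseAllTemp() { tempBegin = capacity; }       // pop the whole stack
    void reset() { outputEnd = 0; tempBegin = capacity; } // after a snapshot
};
```

Because output allocations are never freed individually, the left region is exactly the range of bytes worth snapshotting to the host.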
After elements are processed into their accelerator structures on the GPU, their used memory is snapshotted onto the host, and the allocator is cleared. The process is repeated until all of the scene’s geometry is processed, resulting in the host managing a list of GPU memory snapshots. The figure below shows an example layout of GPU memory that could be snapshotted:
GPU memory layout after loading the isHibiscus element.
(Dotted arrows show that an IAS holds instances of the pointed-at AS)
As mentioned above, when it comes time to ray trace, each snapshot is processed in a loop. This means a call to `cudaMemcpy` and `optixLaunch` for each snapshot. A global buffer is maintained that tracks the depth of the current closest intersection. This value is used as the `tmax` parameter for the CUDA kernel's call to `optixTrace`, and a successful intersection updates the depth buffer for the next launch.
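Concretely, the per-bounce host loop looks roughly like the following sketch (the buffer and struct names are hypothetical, not GPU-Motunui's actual code):

```cpp
// Host side: trace the current rays against each geometry snapshot in turn.
// The persistent depth buffer carries the closest hit across launches.
for (const Snapshot& snapshot : snapshots) {
    // Restore this section of the scene into the shared device arena.
    cudaMemcpy(reinterpret_cast<void*>(d_arena), snapshot.hostData,
               snapshot.sizeInBytes, cudaMemcpyHostToDevice);

    params.handle      = snapshot.rootIAS; // this snapshot's top-level IAS
    params.depthBuffer = d_depthBuffer;    // closest depth found so far
    cudaMemcpy(reinterpret_cast<void*>(d_params), &params, sizeof(Params),
               cudaMemcpyHostToDevice);

    optixLaunch(pipeline, stream, d_params, sizeof(Params), &sbt,
                width, height, /*depth=*/1);
    cudaStreamSynchronize(stream);
}
```

On the device side, the raygen program bounds each ray with the stored depth, so a hit in the current snapshot only survives if it is closer than every hit from previous snapshots:

```cpp
// Device side (raygen program), again a sketch:
unsigned int p0, p1; // payload registers for the hit information
float tmax = params.depthBuffer[pixelIndex]; // closest hit so far
optixTrace(params.handle, rayOrigin, rayDirection,
           /*tmin=*/0.f, tmax, /*rayTime=*/0.f, OptixVisibilityMask(255),
           OPTIX_RAY_FLAG_NONE, /*sbtOffset=*/0, /*sbtStride=*/1,
           /*missSBTIndex=*/0, p0, p1);
```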
In a traditional OptiX path tracer, the entire render loop can run in device code inside a single call to `optixLaunch`; i.e., a successful intersection leads to more BSDF and shadow rays being traced in the same kernel launch. Because GPU-Motunui's design mandates multiple launches for tracing each path segment, the render loop is pulled out into host code. While this potentially diminishes OptiX's ability to efficiently schedule program execution, it also opens up opportunities for optimization, such as running Ptex texture lookups on the CPU concurrently with GPU kernels and I/O.
Shading
As with any OptiX application, GPU-Motunui makes use of the shader binding table (SBT). SBT records contain pointers to normal buffers and material attributes. The underlying data for the normal buffers is stored alongside OptiX acceleration structures and included in geometry snapshots. This ensures that GPU memory is never wasted on unreachable normal buffer data.
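A hit-group record in this scheme might look like the following sketch (the fields after the mandatory header are hypothetical stand-ins for the pointers described above; the header itself is filled by `optixSbtRecordPackHeader`):

```cpp
// Illustrative SBT record layout; everything after the OptiX header is this
// renderer's own per-geometry data.
struct __align__(OPTIX_SBT_RECORD_ALIGNMENT) HitGroupRecord {
    char header[OPTIX_SBT_RECORD_HEADER_SIZE]; // set by optixSbtRecordPackHeader
    float3* normals;    // device pointer living inside the geometry snapshot
    int     materialID; // index into the scene's material attributes
    int     textureID;  // which Ptex texture this geometry samples
};
```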
Renders
Below are GPU-Motunui renders of the seven shots included in the dataset. shotCam is the slowest to render at 18.2 seconds per sample at 1024x429 resolution, and took just over five hours total for the final image. All shots are 1024spp, capped at a maximum of five bounces, and were rendered on an Nvidia RTX 2070.
shotCam
beachCam
dunesACam
palmsCam
birdseyeCam
rootsCam
grassCam
Optimization
The initial implementation of the renderer required 42.6 seconds per sample on the shotCam shot. A few optimizations combined to significantly reduce rendering time, cutting each 1spp pass down to 18.2 seconds (a 57.3% reduction).
CPU/GPU concurrency
Tracing shadow rays on the GPU in parallel with Ptex lookups on the CPU cut rendering time by 23.4%. It was disappointing to be forced to do texture lookups on the CPU, but the time savings make up for it.
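The overlap itself is cheap to express because `optixLaunch` is asynchronous with respect to the host. In sketch form (`evaluatePtexLookups` is a hypothetical stand-in; a possible implementation is outlined in the next subsection):

```cpp
// Queue the shadow-ray launch; control returns to the host immediately.
optixLaunch(shadowPipeline, stream, d_params, sizeof(Params), &shadowSbt,
            width, height, 1);

// The CPU performs Ptex texture lookups while the GPU traces shadow rays.
evaluatePtexLookups(lookups, results);

cudaStreamSynchronize(stream); // both the GPU launch and the CPU work are done
```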
Multiple Ptex caches
Parallelizing the Ptex lookups and using multiple Ptex caches eliminated texture lookups as a bottleneck; shadow ray casting time now fully dominates the texture lookups. Empirically, spawning two threads per core (12 in total on an Intel i7-8700K) and sharing three Ptex caches among them comfortably brought the texture lookup time below the shadow ray budget, as in the sketch below. This improved the time savings to a 33.9% reduction over the baseline.
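A sketch of that partitioning, assuming the standard Ptex v2 API; the thread and cache counts mirror the numbers above, and everything else (the lookup structs, the function itself) is hypothetical:

```cpp
#include <string>
#include <thread>
#include <vector>
#include <Ptexture.h>

struct PtexLookup { std::string filename; int faceID; float u, v; };
struct RGB { float rgb[3]; };

// Hypothetical CPU-side evaluator: 12 threads share 3 Ptex caches.
void evaluatePtexLookups(const std::vector<PtexLookup>& lookups,
                         std::vector<RGB>& results) {
    const int numThreads = 12; // two per core on an i7-8700K
    const int numCaches  = 3;

    std::vector<Ptex::PtexCache*> caches;
    for (int i = 0; i < numCaches; ++i)
        caches.push_back(
            Ptex::PtexCache::create(/*maxFiles=*/0, /*maxMem=*/1ull << 30));

    std::vector<std::thread> workers;
    for (int t = 0; t < numThreads; ++t) {
        workers.emplace_back([&, t] {
            Ptex::PtexCache* cache = caches[t % numCaches]; // 4 threads per cache
            for (size_t i = t; i < lookups.size(); i += numThreads) {
                Ptex::String error;
                Ptex::PtexTexture* tex =
                    cache->get(lookups[i].filename.c_str(), error);
                if (!tex) continue;
                Ptex::PtexFilter* filter = Ptex::PtexFilter::getInstance(
                    tex, Ptex::PtexFilter::Options(Ptex::PtexFilter::f_bilinear));
                filter->eval(results[i].rgb, /*firstChan=*/0, /*nChannels=*/3,
                             lookups[i].faceID, lookups[i].u, lookups[i].v,
                             0.f, 0.f, 0.f, 0.f); // filter widths omitted
                filter->release();
                tex->release();
            }
        });
    }
    for (auto& w : workers) w.join();
    for (auto* cache : caches) cache->release();
}
```

Striding the lookup array by thread index keeps the partitioning trivial, and mapping four threads to each cache keeps contention on any one cache's internal lock low.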
Pinned memory
The acceleration structure snapshots are all saved to pinned host memory. Switching from pageable to pinned host memory increased the transfer throughput from 7.73 GB/s to 11.84 GB/s, cutting the baseline render time by 19.5%.
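The change is essentially a one-line allocation swap (a sketch; names hypothetical):

```cpp
// Pageable host memory forces the driver to stage copies through an internal
// pinned buffer; page-locked memory can be DMA'd to the GPU directly.
void* snapshotHost = nullptr;
cudaMallocHost(&snapshotHost, snapshotBytes); // pinned, instead of malloc/new

// ... fill snapshotHost while processing the element ...

// Uploads in the render loop now run at full PCIe bandwidth.
cudaMemcpy(reinterpret_cast<void*>(d_arena), snapshotHost, snapshotBytes,
           cudaMemcpyHostToDevice);
// cudaFreeHost(snapshotHost) releases it at shutdown.
```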
Future Steps
Getting this scene running on my RTX 2070 card was a very fun and rewarding project, but there are still many improvements to be made:
- Implementing the Disney BSDF
- Rendering subdivision surfaces along with displacement mapping
- More efficiently packing the acceleration structures, and optimizing ray tracing throughput
- Experimenting with how various research results hold up on production scenes (e.g., testing select path guiding techniques)