DNND 1: a Deep Neural Network Dive

A dragon rendered with path tracing in 14 ms and denoised in 5.5 ms using a CUDA port of the OIDN denoiser described in this miniseries.

This is the first episode of a mini-series where I invite you to join me on a journey of discovery in the vast world of Deep Neural Networks. We will be exploring the Deep Neural Network in the Intel® Open Image Denoise open source library (OIDN) and learning some basic concepts of Convolutional Neural Networks. We will use this understanding to put together a CUDA/CUDNN-based denoiser that produces high-quality results identical to the original implementation, all at real-time frame rates. I will be narrating this in the first person, as if I am telling a story as it unfolds, in an effort to make it more engaging than just another tutorial.

To follow along you don’t need any prior knowledge of Machine Learning. You need to be comfortable with very basic C++, but not much more than that. I expect you to know what a call stack is, what a debugger is and how to use it. Later in the series I will implement some compute kernels in CUDA. I will explain what I do along the way, but this won’t be a CUDA masterclass. I expect you to know some basic concepts of path tracing, but really all you need to know about it is that it produces noisy images, and denoising is often required to obtain a final image.


A good approach to learning is to take apart something that works and put it back together. When I was a child, a relative of mine was a clockmaker. I used to watch him work on mechanical watches, carefully opening them up, cleaning their gears of dust and residue, replacing damaged parts, and lubricating rubies. There is only one way to disassemble a mechanism, with each part taken out and placed in a compartment tray in the order of removal. There is also only one way to put it back together. The process is precise, and leaving a screw behind after reassembly is a sign of an amateur.

Photo by Pixabay on Pexels.com

Good software has parallels to clocks, as they are both made up of many parts that work together to perform a precise job. The inner workings of a program may seem mysterious, like peering through the gaps between the layers of gears in a mechanical clock. Where does the motion originate, and how does it propagate? Why do some parts seem motionless even when they serve a purpose?

However, there are also stark differences between software and clocks. Unlike clocks, it is not immediately obvious how to take software apart, and there is no one correct way to do it. Nonetheless, we do it all the time – when we study API examples, review our teammates’ work, or decipher an obscure function we wrote a year ago that only the man upstairs seems to understand.

This isn’t about clocks, nor about just any software; it’s about Artificial Neural Networks, which even the experts may not fully understand. As someone who has never taken on a Machine Learning project before, I am eager to make some strides. I have found that the beginner reading materials are not leaving a lasting impression on me, as abstract knowledge tends to fade quickly from my biological neurons without practical applications.

Why did I choose this specific project? I enjoy implementing path tracers; I like the challenge, and love the results. Path tracing is based on Monte Carlo integration, which is known to produce noisy images in a short time frame. The more samples are taken per pixel, the more the image refines, until it converges to a smooth, noise-free result. This process of convergence can take a long time, minutes or even hours. The big innovation that made real-time path tracing possible in the past few years comes down to three factors: advances in hardware-accelerated raytracing, advances in light transport and sampling techniques, and image denoising breakthroughs.


Denoising is the process of filtering an image to remove Monte Carlo sampling noise, while preserving as much of the detail in the image as possible. This tends to be achieved in a few different ways:

  • Filters and signal reconstruction techniques.
  • Machine Learning based techniques (what this series is about).
  • Machine Learning used to infer filter parameters for signal reconstruction.

Realtime path tracing would not be possible without denoising. Any real-time raytracing demo we may have seen in the past few years is as much a denoising demo as it is a raytracing one.

The very basics

I believe the most common introductory Machine Learning tutorial is the classification of handwritten digits using the MNIST training set… in a nutshell, a relatively small neural network is constructed and trained to extract specific information from a set of fixed-resolution images: which digit is depicted in the image.

A digital neuron holds a value that expresses its current state of activation. In biological neurons, connections are called synapses, which can either stimulate or inhibit the state of activation of other neurons. In artificial neural networks, however, the term synapse is not commonly used; the term weight is used instead. To evaluate a neuron, a weighted summation of its inputs is computed, where weights can be positive (stimulating) or negative (inhibiting). Additionally, a bias (a linear shift in value) can be applied to the result of the summation, with the bias also being positive or negative, adjusting the threshold of activation. The activation function determines when a neuron is considered activated; generally, a positive value is considered “active”. Several functions are commonly used to remap the output produced by weights and biases, such as clamping negative values (ReLU) or providing a smooth transition (tanh, sigmoid, etc.).

A plot of the typical neural activation function. Follow the link for the interactive graph.

I don’t know when to use one or another, but I remember reading somewhere that simpler activation functions are easier to “train”, since training relies on derivatives, and training is the most complex and expensive part of any ML project. In this series I am not going to make any attempt at understanding training; I am going to reuse the weights of an already trained network.

Back to the handwritten digits classification: there is an input neuron for each of the pixels in the source image. Conceptually these don’t have weights; they are seeded with the pixel values. This is called the input layer. A network organized in multiple layers is called a Deep Neural Network. When all neurons in a layer are connected to all neurons in the previous layer, the term Fully-Connected Network is used to refer to this type of topology. The network ends with an output layer with as many neurons as there are possible answers to the problem: the digits from 0 to 9. Any layer in between the input and the output layers is referred to as a “hidden layer”. The process of training learns the weights for each of the neurons so that when the network is evaluated, starting from the pixel values, the last 10 neurons are updated with the classification likelihood: the number in the image is a “3” with 90% probability, an “8” with 7%, a “9” with 1%, etc…

Cool, but what if the input images vary in resolution? What then? Do we need to construct and train a different network? The learnt weights and biases are specific to the network topology. How do people create networks able to handle any image? This reminds me of the naive questions that puzzled me at the beginning of my career, such as: how can an application display a GUI while waiting for a user action? Obvious to you all, but it wasn’t to me back then.

The best way for me to cement technical knowledge is to apply it to something I can appreciate, something I have a critical eye for, something I may understand from a user perspective. Most of the work I do is in graphics, and more specifically related to path tracing. A project to study a Machine Learning-based denoiser seems a natural progression and a great investment. It may seem daunting, but “we don’t do this because it’s easy, we do it because we thought it was going to be easy”, right?

Here is a teaser of the result I get at the end of this journey! Now let’s find out how hard it has been.

The final result, the DNN denoiser running in 5.5ms in CUDA on an NVIDIA RTX 6000 Ada GPU.

First step

OIDN is an open source library released by Intel® under the permissive Apache 2.0 license. Here is the link to the GitHub repository. Because the purpose of this text is educational, that seems to be a great place to start. If you want to follow the steps, go on, clone the repository and check the README.md. There are some dependencies to install. It is not critical that you do, since we are not going to use the library as-is, we are going to pull it apart, learn from it, then put it back together in a much smaller form.

Typically, when looking at new code, especially on a mid-size project such as OIDN, I tend to get the library compiled and any example that comes with it up and running. I want to verify that it works, but in this specific case I also want to compare my results against the original, to make sure I didn’t miss something important; even more so on a Machine Learning project, where there is no equation, no traditional analytic implementation that can be followed one math operation at a time, as with most graphics papers. Here, if something goes sideways, the results will look psychedelic at best.

Here is my intuition to get started: the library comes with a command line tool to denoise an input image from a file. That is excellent, it means I have something to run without having to integrate the library within some other application.

oidnDenoise is a minimal working example demonstrating how to use Intel® Open Image Denoise, which can be found at apps/oidnDenoise.cpp. It uses the C++11 convenience wrappers of the C99 API. […]

Running oidnDenoise without any arguments or the -h argument will bring up a list of command-line options.

From oidn README.md

Also, the README.md file states that the library comes with a pre-trained high quality DNN, and the ability to train it from scratch using arbitrary datasets. This sentence is particularly interesting:

The training result produced by the train.py script cannot be immediately used by the main library. It has to be first exported to the runtime model weights format, a Tensor Archive (TZA) file.

From oidn README.md

If the library can import a “tensor archive” with the DNN weights and biases, maybe we can export the built-in DNN and start from there. I search for TZA in GitHub… and Bingo!

I don’t know if the function parseTZA is called for the builtin DNN, but it is worth a try. Without making much of an effort to understand what it does, I can place a breakpoint inside that function and run the oidnDenoise tool to see if it hits. I imagine a DNN such as this not to be small, perhaps having hundreds of thousands, or even millions, of neurons. A tensor archive is probably megabytes in size, at least that is how I imagine it. There are a couple of possibilities: since the library can read user-trained .tza files, maybe the builtin DNN is stored in such files too. The second possibility is that we find some large source file with large C arrays of binary data. I guess the latter, but something makes me nervous about it: the argument is not a const void* and it may not be what we are looking for; it may be an input/output value or something else complex.

oidnDenoise -f RT --ldr input.png -type float -o result.png

I run the command in a debugger and it hits the breakpoint inside parseTZA. From there I can trace back to where that void* buffer argument originated. In my experience with various codebases, the clues to the origin of certain values are often buried in deep call stacks, with arguments passed through from function, to function, to function. In cases like these, using a debugger to initiate the exploration tends to proceed more quickly.

But that is not the case here. The callstack is short and the answer is easy to find in frame 1: the variable weights is what carries the buffer, and it is configured by selecting from member variables of some symbol named defaultWeights.

It also reads that the function returns a map of tensors (weightsMap) and stores it as the member variable net of the current class. Listing the places referring to defaultWeights, I spot its initialization. It seems these are C arrays after all.

Searching for the definition of one of these blob::weights gives me confirmation. Yep, these are ~3.5MB binary blobs. But wait a second! Something is odd, look at the file path in the image here below…

The path leads to a source file in the build folder. These C arrays must be auto-generated! Let’s search for .tza files instead, and sure enough there is a folder in the project root with a very obvious name, “weights”, with tensor archives inside. There is a python script that is called in a custom build command from CMake. Upon cross-checking in GitHub I see that the weights are cloned from the oidn-weights submodule, which has its own license agreement (also Apache 2.0). So I must list these licenses in my project if I end up reusing these tensor files.

The exploration so far has been productive! Let’s summarize the findings:

  • There are .tza files ready to reuse.
  • Some logic inside UNetFilter::init() is telling me which one should be used, depending on the combination of input images and options (color, color + albedo + normal, ldr/hdr, and more… I will explain more on this in a later post in the series).
  • Function parseTZA decodes files and constructs a map with named tensors. I have no idea what these do yet, but I’ll find out.
  • UNetFilter… a quick web search reveals that “U-Net” is a specific DNN architecture, so I have reading material to understand how this may work.
  • Function UNetFilter::init() ends with a call named buildNet(), which suggests to me that such a function contains some sequence of operations that will use the map of tensors and assemble it somehow to construct a U-Net.

Time for a break.

A closer look

The plan so far is to load the .tza files into my own program, and with that reassemble the DNN. To extract the parseTZA function I need all of its dependencies. It may result in a blob of unstructured types and functions. Likely these will come from a variety of source files, but I feel I don’t need to keep the structure: whatever comes with it, I don’t plan to reuse throughout my program. Probably I can make a single-header library to read .tza files and that’s it.

Here is the parseTZA function source; take a look at it. I don’t need to report it here in full; instead I summarize what it does in pseudocode.


    Parse the magic value, throw on failure

    Parse the version, throw if the version is not 2.x

    Parse the table offset and jump to the table

    Parse the number of tensors

    Parse the tensors in a loop:
        Parse the name of the tensor

        Parse the number of dimensions

        Parse the shape of the tensor (vector of dimensions)

        Parse the layout of the tensor, either "x" or "oihw"

        Parse the data type of the tensor, only float is supported

        Parse the offset to the tensor data

        Create a tensor using dimensions, layout, type, and pointer to data
        Add it to the map by its name.

    Return the map of tensors

In this function I am not seeing any information about how the layers of a DNN may be connected. Additionally, the fact that the tensor descriptions are returned in lexicographic order doesn’t give me any reassurance that the tensors decoded are in execution order, or even that they are all defined: some may be implicit in the technique, so we have more digging to do and more questions to ask. What does it mean that a tensor has “oihw” format? How many dimensions can a tensor have in practice? Up to now I only know a tensor is a matrix with two or more dimensions. So clearly there is a lot to it that I don’t know.

However, the idea is that a DNN is portable: I should be able to evaluate it verbatim across different DNN library implementations. For my purposes, I want to try to use NVIDIA CUDNN. Here I need to recreate the DNN topology, fill it with weights, and let the low-level library do the compute heavy lifting: in theory I could implement all the compute kernels myself, but asking around, apparently few people really do that. Let’s take this one step at a time.

Function parseTZA is not enough, I need to look into buildNet(). This is what the beginning of the function looks like.

The first few lines of the UNetFilter::buildNet implementation: https://github.com/OpenImageDenoise/oidn/blob/master/core/unet.cpp

The beginning of the buildNet() function seems more involved, but if the tensors are the list of ingredients, this function seems to contain the recipe. I don’t need to understand it right now. I have confirmed that extracting the parseTZA function is the first step I need to take.

Conclusion and next step

After the initial exploration I am confident I should be able to understand how OIDN denoiser works, extract the recipe to recreate the same DNN elsewhere, and learn a lot in the process. I hope you feel the same about this.

In the next episode I am going to create a single-header library to parse TZA files. I will familiarize myself with the applied notion of tensors and look at concepts of Convolutional Neural Networks, which are the basis of U-Nets.

Hopefully you enjoyed this first episode. I will attempt to publish one new episode per week for the duration of the series. Stay tuned!
