This is the second episode of a miniseries in the vast world of Deep Neural Networks, where I start from Intel® Open Image Denoise open-source library (OIDN) and create a CUDA/CUDNN-based denoiser that produces high-quality results at real-time frame rates. In this series, I narrate the story in first person.

The first episode was focused on a quick dive into the OIDN library source code to determine if the Deep Neural Network it contains can be easily extracted and reproduced elsewhere. The library contains ‘tensor archive’ files that can be easily decoded, which provides the data necessary to construct a DNN. I identified the decoder source code and the procedure by which tensors are connected to form the network, but also accumulated many questions about some of the more nebulous concepts: what is a tensor? How do tensors relate to Neural Networks? What is a Convolutional Neural Network? What is a U-Net? And more… It is time to get some answers.

One thing first, I want to thank two people:

- Attila Áfra, the architect behind the OIDN library, for publishing the library for anyone to study.
- Marco Salvi, for the help and guidance on theoretical notions during my exploration.

## Tensors

I will let mathematicians explain what a tensor is. However, as correct as the following explanation may be, its universal generality helps me only so much in understanding what I am dealing with here.

> Tensors are simply mathematical objects that can be used to describe physical properties, just like scalars and vectors. In fact, tensors are merely a generalization of scalars and vectors; a scalar is a zero-rank tensor, and a vector is a first-rank tensor.

— The University of Cambridge, “What is a Tensor?”

In my own words, and for the purpose of this project, a tensor is often a 3D or 4D matrix, typically of floating-point numbers; higher numbers of dimensions are possible, but I will not encounter those in this series. A 1080p RGB image is a 3D tensor with dimensions (W*H*C) 1920x1080x3, where W and H stand for *width* and *height* in pixels, respectively, and C is the number of channels: RGB. A sequence of 10 such images can be seen as a 4D tensor with dimensions (N*W*H*C) 10x1920x1080x3, where N is the number of images. Generalizing, a single image can still be considered a 4D tensor where N = 1. Dimensions here have nothing to do with Cartesian dimensions, of course; they are purely relational.

When expressing the resolution of an image, we commonly say something along the lines of “1920 by 1080”. However, if we consider how uncompressed image data is stored in memory, we observe that images are stored line by line, pixel by pixel, and channel by channel. In the case of a sequence, we have N full images, each with H rows made of W columns of pixels, each pixel made of C channels. Thus, if one had to come up with a notation to describe how images are commonly stored in memory, it could be NHWC. These letters are commonly used in Machine Learning to describe tensors, and when an image is loaded to be processed by a neural network, it is stored in a tensor with a “format” such as NHWC, though I prefer the term *data layout*.

Are there other data layouts? Sure, there are! My mind is quickly drawn to a familiar concept from the SIMD programming model (Single Instruction, Multiple Data), where different ways of storing data can result in significant differences in performance. In the context of SIMD, one may have AOS (Array of Structs) or SOA (Struct of Arrays) layouts. For example:

```
struct color
{
    float r, g, b;
};

// An AOS data layout with 64 color elements
color aos[64];

template<int size>
struct color_soa
{
    float r[size];
    float g[size];
    float b[size];
};

// An SOA data layout with 64 color elements
color_soa<64> soa;
```

In the AOS case, the data as stored in memory contains the sequence rgbrgbrgbrgb…, while in the SOA case, the data in memory appears as rrrr…gggg…bbbb… SIMD instructions prefer reading SOA data because, with each individual load and store instruction, the processor can fill wide registers with several elements accessed consecutively and sequentially. This makes good use of memory latency and bandwidth, resulting in faster processing speeds.

Back to tensors from this direct analogy: tensors may be organized in different data layouts depending on the operation we need to run on them, and on how the available hardware prefers to access the data. If NHWC is the analog of AOS, then NCHW is the analog of SOA. In an NCHW data layout, you have N images concatenated, each made of an H*W planar representation of the red channel, followed by the planar representation of the green channel, then the blue…

The ‘N’ dimension is a bit of an impostor. More than a dimension of the data, it is a way to express a batch of identical elements: a notion useful for algorithms that apply the very same computation to many individual entries, without any of them overlapping or interfering.

## Convolution

A graphics programmer is likely to be accustomed to the concept of convolution:

> In mathematics (in particular, functional analysis), *convolution* is a mathematical operation on two functions (f and g) that produces a third function (f ∗ g) that expresses how the shape of one is modified by the other. The term *convolution* refers to both the result function and to the process of computing it. It is defined as the integral of the product of the two functions after one is reflected about the y-axis and shifted. The choice of which function is reflected and shifted before the integral does not change the integral result (see commutativity). The integral is evaluated for all values of shift, producing the convolution function.

— Convolution, Wikipedia

While convolution proper is an integral of the product of two functions, in image processing the term is often used improperly; a pedantic person would say *discrete convolution* instead, which reduces to a much simpler weighted summation.

Discrete convolution is the process of sliding a weighted window (the *kernel*) over an image, pixel by pixel. As the kernel overlaps a region of pixels in the image, it provides the weights for summing those pixels, and the result is stored in the output image. Common uses of convolution in image processing include blur, sharpen, edge detection, etc.

During convolution, something happens at the edges and corners of the image. Say the kernel is a 3×3 window. When its center is aligned to the very first pixel of the input image, some of the kernel extends outside the image boundary. There are multiple ways to define how this case must be handled:

- The values of the pixels at the border implicitly extend outside of the image.
- Rings of values outside of the image are considered zeroes (this is called *padding*).
- The convolution kernel never extends outside the borders of the input image (no padding). With a 3×3 kernel, the output image has 1 pixel trimmed off each side.

By controlling *padding*, one can produce an output that is equal in size to or larger than the input image. A common configuration preserves the image resolution, which enables the application of multiple filters in succession without shrinking the output. For this configuration, the padding should be half the kernel width, rounded down to the nearest integer: for a 3×3 kernel, 1 pixel on each side; for a 5×5 kernel, 2 pixels on each side.

### Convolution and Tensors

Up to now, the diagrams have shown convolution applied to a single channel. Things become more interesting when multiple channels come into play. The simplest extension is to apply the same filter weights to all channels, as for an RGB image where the input rgb pixels are weighted by scalar weights to produce new rgb values. This is common in image processing, such as with blur filters that apply the same effect to all image channels. However, convolution filters can have channels too, allowing a different set of weights per input channel. Furthermore, a filter can express a matrix multiplication between a set of input channels and a set of filter channels, producing a set of output channels that may differ in number from the input channels.

Say I have an RGB image, and the purpose of the convolution is to extract a variety of features from it: vertical edges, horizontal edges, and two diagonal edges. To achieve this, a 4D tensor can be used as the filter. This tensor defines how to compute each of the 4 output features with respect to the 3 input channels. Each of the 12 combinations is an H*W filter: 4 outputs, times 3 inputs, times H*W filter weights. Echoing the NCHW notation, ‘O’ stands for output and ‘I’ stands for input. In our example, the filter is a tensor with data layout OIHW and dimensions [4, 3, 3, 3].

The number of elements in a tensor is given by the product of its dimensions. In this example, the filter tensor has a total of 4*3*3*3 = 108 weights. These weights connect a region of 3*3 pixels in the input tensor, across its 3 channels, to a pixel of the output tensor across its 4 channels.

In the first episode, we described Deep Neural Networks as a sequence of neural layers interconnected by weights… Connecting the dots: a 4D convolution filter whose weights are produced by an ML training process is in fact a type of neural network!

## Convolutional Neural Networks

Also referred to as ConvNets, or CNNs, Convolutional Neural Networks arise from the observation that certain types of processing should be applied consistently across the input data. If I want to identify handwritten numbers in an image, I would like a Neural Network to identify the feature independently of where it appears in the image, and independently of its resolution. If I want to denoise an image, I would like the noise to be consistently recognized as such, from the center of the image to the corners. ConvNets are practical and effective at this, as they can be expressed as a sequence of convolution filters (plus a few more types of layers I will describe in a future episode), rather than as rigid and more expensive fully-connected networks.

In a ConvNet, the input and output tensors of a convolution layer are the neurons, and the filter tensor holds the weights. Convolution instantiates the same weights across the image as the window slides, connecting the many regions of neurons to the respective output neurons. The result can be seen as a massive, and very efficiently compressed, neural network. This answers one of my naive initial questions: how can a neural network process images of arbitrary size? Now I know, and this opens the door to a whole new universe I had previously ignored. I was blind and now I see!

Time for a break.

## Extraction

Armed with some new theoretical understanding, I feel positive about extracting the *parseTZA* function from the OIDN code base. Here is the function pseudocode from the first episode:

```
parseTZA(...):
    Parse the magic value
    Parse the version
    Parse the table offset and jump to the table
    Parse the number of tensors
    Parse the tensors in a loop:
        Parse the name of the tensor
        Parse the number of dimensions
        Parse the shape of the tensor (vector of dimensions)
        Parse the layout of the tensor, either "x" or "oihw"
        Parse the data type of the tensor, only float is supported
        Parse the offset to the tensor data
        Create a tensor using dimensions, layout, type, and pointer to data
        Add it to the map by its name
    Return the map of tensors
```

I have a good understanding of what a tensor is, what “dimensions” means, and what the OIHW layout is, while the layout “x” remains mysterious… An appropriate label, though! I am going to roll up my sleeves and begin copy-pasting code into a new header file. I begin with the function itself; the function requires the types *Tensor*, *Device*, and *Exception*. Class *Device* seems to be involved…

I am not going to use any of it. Classes like this one tend to govern the computation in a complex heterogeneous system. Since I only need to read the tensors, I declare an impostor *struct Device {}* instead, and see if I can eliminate it later.

Looking at class *Tensor*, I see bits I need and bits I don’t… Here is an example of the before and after, to give you a sense of the aggressive pruning: it went from 261 lines down to 21.

The whole point of what I keep is to retain only the information decoded in the *parseTZA* function. Any dead code that comes along with the copy-paste, I quickly identify and remove. The process is recursive: *class Tensor* requires *class TensorDesc*; copy-paste the implementation, then simplify, simplify, simplify. It would be too involved to show this process unfolding, and I don’t think it would be interesting to document either; it’s merely an extraction refactoring.

This is what I am left with after a couple of hours with a machete.

```
// Note: this implementation is extracted from the OIDN library and simplified for the purpose
// of just loading the OIDN tza weights blob files. For more, check:
// https://github.com/OpenImageDenoise/oidn.git
//
// Copyright 2009-2021 Intel Corporation
// SPDX-License-Identifier: Apache-2.0
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//     http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#pragma once

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <algorithm>
#include <exception>
#include <vector>
#include <map>
#include <string>
#include <memory>

namespace oidn
{
    // Error codes
    enum class Error
    {
        None                = 0, // no error occurred
        Unknown             = 1, // an unknown error occurred
        InvalidArgument     = 2, // an invalid argument was specified
        InvalidOperation    = 3, // the operation is not allowed
        OutOfMemory         = 4, // not enough memory to execute the operation
        UnsupportedHardware = 5, // the hardware (e.g. CPU) is not supported
        Cancelled           = 6, // the operation was cancelled by the user
    };

    class Exception : public std::exception
    {
    private:
        Error error;
        const char* message;

    public:
        Exception(Error error, const char* message)
            : error(error), message(message) {}

        Error code() const noexcept
        {
            return error;
        }

        const char* what() const noexcept override
        {
            return message;
        }
    };

    enum class DataType
    {
        Float32,
        Float16,
        UInt8,
        Invalid
    };

    // Tensor dimensions
    using TensorDims = std::vector<int64_t>;

    // Tensor memory layout
    enum class TensorLayout
    {
        x,
        chw,
        oihw,
    };

    // Tensor descriptor
    struct TensorDesc
    {
        TensorDims dims;
        TensorLayout layout;
        DataType dataType;

        __forceinline TensorDesc() = default;
        __forceinline TensorDesc(TensorDims dims, TensorLayout layout, DataType dataType)
            : dims(dims), layout(layout), dataType(dataType) {}

        // Returns the number of elements in the tensor
        __forceinline size_t numElements() const
        {
            if (dims.empty())
                return 0;
            size_t num = 1;
            for (size_t i = 0; i < dims.size(); ++i)
                num *= dims[i];
            return num;
        }

        // Returns the size in bytes of an element in the tensor
        __forceinline size_t elementByteSize() const
        {
            switch (dataType)
            {
            case DataType::Float32: return 4;
            case DataType::Float16: return 2;
            case DataType::UInt8:   return 1;
            default:                return 0;
            }
        }

        // Returns the size in bytes of the tensor
        __forceinline size_t byteSize() const
        {
            return numElements() * elementByteSize();
        }
    };

    // Tensor
    class Tensor : public TensorDesc
    {
    public:
        const void* ptr; // Data is only temporarily referred, not owned

    public:
        Tensor(const TensorDesc& desc, const void* data)
            : TensorDesc(desc),
              ptr(data)
        {}

        Tensor(TensorDims dims, TensorLayout layout, DataType dataType, const void* data)
            : TensorDesc(dims, layout, dataType),
              ptr(data)
        {}

        __forceinline const void* data() { return ptr; }
        __forceinline const void* data() const { return ptr; }
    };

    // Checks for buffer overrun
    __forceinline void checkBounds(char* ptr, char* end, size_t size)
    {
        if (end - ptr < (ptrdiff_t)size)
            throw Exception(Error::InvalidOperation, "invalid or corrupted weights blob");
    }

    // Reads a value from a buffer (with bounds checking) and advances the pointer
    template<typename T>
    __forceinline T read(char*& ptr, char* end)
    {
        checkBounds(ptr, end, sizeof(T));
        T value;
        memcpy(&value, ptr, sizeof(T));
        ptr += sizeof(T);
        return value;
    }

    // Decode DNN weights from the binary blob loaded from .tza files
    int parseTZA(void* buffer, size_t size,
                 // results
                 std::map<std::string, std::unique_ptr<Tensor>>& tensorMap)
    {
        [...]
    }
} // namespace oidn
```

This listing could be simplified further by removing the use of inheritance, the granular heap allocations, the reliance on *std::unique_ptr*, and the exceptions. These are a few of the things I could spend a few more minutes cleaning up. But for now, I don’t want to modify the body of the *parseTZA* function; I can always come back to it later. I am more curious about what it’s parsing from the files, so I add some logging code, and this is what I find:

```
Tensor Name | Dimensions | Layout | BytesSize
------------------+----------------+--------+-----------
enc_conv0.weight | 32, 9, 3, 3 | oihw | 10368
enc_conv0.bias | 32 | x | 128
enc_conv1.weight | 32, 32, 3, 3 | oihw | 36864
enc_conv1.bias | 32 | x | 128
enc_conv2.weight | 48, 32, 3, 3 | oihw | 55296
enc_conv2.bias | 48 | x | 192
enc_conv3.weight | 64, 48, 3, 3 | oihw | 110592
enc_conv3.bias | 64 | x | 256
enc_conv4.weight | 80, 64, 3, 3 | oihw | 184320
enc_conv4.bias | 80 | x | 320
enc_conv5a.weight | 96, 80, 3, 3 | oihw | 276480
enc_conv5a.bias | 96 | x | 384
enc_conv5b.weight | 96, 96, 3, 3 | oihw | 331776
enc_conv5b.bias | 96 | x | 384
dec_conv4a.weight | 112, 160, 3, 3 | oihw | 645120
dec_conv4a.bias | 112 | x | 448
dec_conv4b.weight | 112, 112, 3, 3 | oihw | 451584
dec_conv4b.bias | 112 | x | 448
dec_conv3a.weight | 96, 160, 3, 3 | oihw | 552960
dec_conv3a.bias | 96 | x | 384
dec_conv3b.weight | 96, 96, 3, 3 | oihw | 331776
dec_conv3b.bias | 96 | x | 384
dec_conv2a.weight | 64, 128, 3, 3 | oihw | 294912
dec_conv2a.bias | 64 | x | 256
dec_conv2b.weight | 64, 64, 3, 3 | oihw | 147456
dec_conv2b.bias | 64 | x | 256
dec_conv1a.weight | 64, 80, 3, 3 | oihw | 184320
dec_conv1a.bias | 64 | x | 256
dec_conv1b.weight | 32, 64, 3, 3 | oihw | 73728
dec_conv1b.bias | 32 | x | 128
dec_conv0.weight | 3, 32, 3, 3 | oihw | 3456
dec_conv0.bias | 3 | x | 12
```

The mysterious “x” tensor layout is revealed: those are the biases! Not surprisingly, all the decoded tensors come in pairs of weights and biases. The weight tensors all seem to be 3×3 convolution windows, while I expected to see larger sizes. Previous questions are replaced with new questions. This is progress!

## Conclusion and next steps

I made some strides in the practical understanding of tensors and convolution, and of how these become the foundation of a variety of DNN architectures referred to as Convolutional Neural Networks. I then bit the bullet and began extracting the minimal source code I need from the OIDN library. I know there will be more snippets to extract, but hopefully those will be more mathematical in nature, and less structural.

In the next episode, I am going to study how the CNN is connected and discover some new types of layers that are currently unknown to me. Finally, I will build a theoretical understanding of the U-Net architecture and of how easily it ports to other problems and domains.

Hopefully you enjoyed this second episode. I will attempt to publish one new episode per week for the duration of the series. Stay tuned!

List of previous episodes:
