After The LEGO Movie (TLM) delivered, the company took a long breath. I knew change was coming whether we wanted it or not; part of it was my proposal, after all. I created a detailed business plan with costs, budgets and prospects. You can present some tech to your peers, but you need a plan, numbers and budget figures if you want to go beyond that at this level. It is one thing to have two people working on a project and succeed; it is another to turn a studio upside down and inside out to reap the full benefits of a revolutionary new technology.
Supervisors and executives were receptive this time. Skepticism was gone; they didn't need much convincing after what they had seen. Still, much of the company didn't know what had happened, so I was asked to prepare a live demo of Glimpse for the crew. This was in the main review theater (the "Red Theater"). I stood in the crossover space, real-time rendering behind me on the big screen, powered by my laptop. The scene was a room without a ceiling: a textured floor and white walls. At the center stood a LEGO minifig, probably Emmet, but I am not too sure about that. I moved the camera into the box from a top view. Nothing impressive… then I set the max ray depth to something like 5,000, assigned a mirror shader to the walls, and wham! A virtually infinite construct with billions of characters disappearing into the distance. The renderer was still interactive, allowing me to move the camera and move the character. People were stoked. I believe the cleaning crew had to scrape a few jaws off the floor after everyone left the room. The studio announced that we would enter a phase of complete technological refresh of the company.
When the tires of the Glimpse renderer hit the proverbial road, there was a reckoning among TDs and engineers. Up to that point rendering had been the slow bit; everything else was of puny importance in comparison… well, sort of. Simulation and a variety of other processes are also computationally intensive, but they are more a matter of latency than throughput. In a studio, 3D rendering tends to be 90% or more of the computational workload, so if there was one thing to pay attention to and optimize for, it was render time. The pipeline processes and tools had grown organically and sometimes unchecked, like weeds seen from a distance. Have you ever let a backyard go for a year or two, letting nature take over? It may not look so bad from the balcony. Then one day you step into it and it feels like a jungle; the only way through is with a machete. What was wrong?
A lot of the scene file formats were home-grown proprietary tech. We had geometry formats and a variety of sidecar files based on HDF5. They were structured in a way that looked speedy at first, when you had 3-4 files flying around, but whose lookup complexity was quadratic (or worse). Each file, however small, allocated a 10MB cache, perhaps just to read a handful of scalar values. And that was just one example. Many other processing steps involved Python scripts. Now, if you run Python to execute some glue code you are golden. But if you invoke Python from C code per asset, on ten thousand assets, your scene takes longer to load than it takes to render… at least in this new world order. Yes, it was taking far longer to open a scene file than to render it. That is not the way it should be.
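To make the failure mode concrete, here is a minimal sketch using the standard CPython embedding API. The names (`resolve_asset`, `resolve_assets`) are hypothetical, for illustration only; this is not our actual pipeline code:

```cpp
#include <Python.h>
#include <string>
#include <vector>

// Anti-pattern: spin up a fresh interpreter for every asset. On ~10,000
// assets, interpreter startup/teardown alone dominates scene load time.
void loadAssetsSlow(const std::vector<std::string>& assets) {
    for (const auto& asset : assets) {
        Py_Initialize();  // pay startup cost once per asset...
        std::string cmd = "resolve_asset('" + asset + "')";  // hypothetical glue
        PyRun_SimpleString(cmd.c_str());
        Py_Finalize();
    }
}

// Fix: one interpreter, one batched call. Better still, keep the hot path
// in C++ and use Python only as outer glue.
void loadAssetsFast(const std::vector<std::string>& assets) {
    Py_Initialize();  // pay startup cost exactly once
    PyRun_SimpleString("batch = []");
    for (const auto& asset : assets)
        PyRun_SimpleString(("batch.append('" + asset + "')").c_str());
    PyRun_SimpleString("resolve_assets(batch)");  // single round trip
    Py_Finalize();
}
```

Same work, wildly different cost: the difference between the two is precisely the difference between a scene that loads in seconds and one that loads in hours.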
The entire R&D department and the various TDs at the company were hard at work redoing the pipeline. A lot of old, crufty stuff went out of the window and new systems were put in place. Slowness in any process, in loading, in launching commands, was now looked at with different eyes.
The studio created a teaser sequence for the pipeline modernization, to test the new tech end to end. My rendering engineering crew grew: two new engineers were hired, neither with much experience in rendering, but both good C++ developers. We were four now, forming a new special-ops group in the company.
There was lots to do in the renderer. We needed programmable shading, programmable AOVs, a complete set of BxDFs, better light transport, better random number generators, better texture caching, volume rendering and in-render depth of field. On the front end we needed a serialization format for the renderer, which was then adopted directly as a format for geometry archives, materials and bindings, deltas and look files. We needed a scene graph API, scene graph procedurals, URI asset resolvers and a lot of other infrastructure. How do you eat an elephant? One slice at a time.
My obsession had been not to let the baseline performance of the renderer degrade while we added features. Think about it: who are among the most performance-obsessed (and savvy) people in the field? Game developers, I would say. In a game you have a fixed target: the FPS. If you add something to your engine or your game content and the frame rate tanks, sorry, you are not shipping! As a developer (or content creator) you'll need to pull off magic to recover. Sometimes that means sacrificing visual quality; sometimes it means massive hacks. The principle is this: if you add something, change something, anything at all that degrades performance even a little, you go back and fix it right away. If you cannot fix it, you go back to the drawing board, or you optimize something else to compensate. It is all about that FPS target.
In offline rendering we have no comparable hard requirement like the frame rate. Unfortunately, it is too easy to fall into the mental trap of self-pity, where you accept that adding features makes things slower and that such is part of life. Sorry, not for me, not in this life! The methodology I developed for myself and the team was: if anything slows down, we fix it right away, not tomorrow, not next month, because production needs it now. We don't develop something and optimize it later when it becomes apparent something is slow. I feel this can be a subject for a future post.
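The discipline can be boiled down to something as simple as the sketch below. This is a minimal illustration with hypothetical names and thresholds, not our actual test harness: every benchmark carries a recorded baseline, and a regression beyond measurement noise fails the run like any other broken test.

```cpp
#include <chrono>
#include <cstdio>
#include <cstdlib>

// Time a callable in milliseconds using a steady clock.
template <typename F>
double timeMs(F&& fn) {
    auto t0 = std::chrono::steady_clock::now();
    fn();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Treat a slowdown beyond noise exactly like a failing test: the build stays
// red until someone fixes the regression or deliberately accepts a new baseline.
void requireNoRegression(const char* name, double baselineMs, double measuredMs) {
    const double noiseTolerance = 1.02;  // allow ~2% jitter between runs
    if (measuredMs > baselineMs * noiseTolerance) {
        std::fprintf(stderr, "PERF REGRESSION in %s: %.2f ms vs %.2f ms baseline\n",
                     name, measuredMs, baselineMs);
        std::exit(EXIT_FAILURE);
    }
}

// Usage (hypothetical benchmark): the recorded baseline lives with the test.
//   requireNoRegression("bvh_build_city", 812.0, timeMs([] { buildCityBvh(); }));
```

The exact numbers don't matter; what matters is that performance becomes a gate you cannot quietly slip past.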
I have to say that a lot of that constant policing was left to me, which I was okay with; I took pride in it, I was responsible for it. So we went on, release after release; things were almost always getting faster, rarely slower. At the same time I was working really hard to shrink the memory footprint of the renderer. Then The LEGO Batman Movie (TLBM) was green-lit.
The new challenge was big. Glimpse already had quite compact data structures, but TLBM was a level up in terms of complexity. Instancing was a core feature in Glimpse and the production crew really embraced it; a lot of things were achieved through instancing. Take moss, for example: what once would have been a surface with displacement and a complex shader now became a procedural instancing system scattering micro-plants that were modeled in detail, with subsurface scattering and relatively straightforward, yet accurate, shaders. Where there was sand or mulch, only the parts in the distance were a texture, and even that texture was baked from realistic micro-structures that had been modeled, simulated and scattered. So sand in the action area of a shot was actual instanced grains of sand. The large view in the pilot shot, the one I mentioned earlier, looked gorgeous (unfortunately those images were never released to the public). The average poly count on that data set was 140,000 polygons per pixel… Yet the renderer chewed through it; it wasn't even considered a large scene, and most of it was vegetation, not LEGO geometry. The problem was Gotham City.
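To give a flavor of what "scattering micro-plants" means in data terms, here is a toy sketch, illustrative only and not the production scatter tool: the scatterer emits lightweight instance records, each just a prototype reference plus a jittered transform, while the heavy geometry exists only once per prototype.

```cpp
#include <cstdint>
#include <random>
#include <vector>

// One scattered element: a reference to a shared prototype model plus a
// per-instance transform. The detailed geometry is never duplicated.
struct Instance {
    uint32_t prototypeId;    // which moss clump / sand grain model to use
    float    transform[12];  // row-major 3x4 affine matrix
};

// Scatter instances on a jittered grid; a real tool would scatter on a
// surface with density maps, orientation and scale variation.
std::vector<Instance> scatterOnGrid(uint32_t numPrototypes, int nx, int ny,
                                    float spacing, uint32_t seed) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<float> jitter(-0.4f * spacing, 0.4f * spacing);
    std::uniform_int_distribution<uint32_t> pick(0, numPrototypes - 1);

    std::vector<Instance> out;
    out.reserve(static_cast<size_t>(nx) * ny);
    for (int y = 0; y < ny; ++y) {
        for (int x = 0; x < nx; ++x) {
            Instance inst{pick(rng), {1, 0, 0, 0,   // identity rotation/scale,
                                      0, 1, 0, 0,   // translation filled below
                                      0, 0, 1, 0}};
            inst.transform[3]  = x * spacing + jitter(rng);  // tx
            inst.transform[7]  = 0.0f;                       // ty (flat ground)
            inst.transform[11] = y * spacing + jitter(rng);  // tz
            out.push_back(inst);
        }
    }
    return out;
}
```

Millions of such records cost a few dozen bytes each, which is why fully modeled grains of sand could be cheaper than the clever shader tricks they replaced.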
The city was huge; buildings were almost real size in proportion to the minifigs. Cars had two seats side by side, unlike the cute single-seaters in TLM. Streets were wider. Everything was bigger in proportion. The large cityscape set was created to extend to the horizon, with many different neighborhoods. I don't remember whether it was built starting from a study of Gotham City and the evolution of its design and concept in the comics. I remember discussions about how the city was as important as the hero character in that comic culture, but do not take my word on this; I really don't know much about it. We were pushing about 500 million bricks in our tests, and I recall having to extend the encoding capacity of the BVH multiple times to make the data fit. The latest BVH encoding had a capacity of 2 billion instances. Not a limit you can easily breach: at 2 billion entries, the transform matrices alone use 128GB of memory; our servers had 64. And if you have ever implemented a high-performance BVH, you know there are not many bits to spare in the node encoding, because at some level you need to represent data in bits, not words, not bytes…
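A rough sketch of why the headroom keeps running out follows. This is a hypothetical node layout, not Glimpse's actual encoding: once bounds, child links and a primitive reference are packed into a cache-friendly node, the instance index gets whatever bits are left over, and the transform storage grows linearly no matter how clever the nodes are.

```cpp
#include <cstdint>

// Hypothetical 32-byte BVH node, for illustration only.
struct BvhNode {
    float    bounds[6];        // AABB; often quantized further in practice
    uint32_t leftOrFirstPrim;  // interior: child index; leaf: first instance
    uint32_t primCount : 4;    // leaf entry count
    uint32_t axis      : 2;    // split axis, for traversal ordering
    uint32_t flags     : 2;
    uint32_t spare     : 24;   // every field you widen must come out of here
};
static_assert(sizeof(BvhNode) == 32, "node must stay cache friendly");

// And the transforms are the other wall: 2 billion instances of a 4x4 float
// matrix is 2e9 * 64 bytes = 128 GB, twice the 64 GB our servers had.
```

Widening one field ripples through node size, cache behavior and traversal speed, which is why extending the capacity was never a one-line change.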
The second problem with the city asset was loading it. I remember we started at well over two hours of loading time; by the time I left, the scene loaded and was up and rendering pixels in about 40 seconds. One of the biggest problems was that instancing was used a lot, but luckily there was some hidden structure to it that could be captured and aggregated at a higher level. LEGO assets had procedural controls to jitter bricks so they looked naturally stacked, instead of perfectly, artificially aligned. The controls could be dialed in and out to make something look new and pristine, or neglected and worn down. So even the repetition of stories in a building had some level of randomization at the brick level to avoid visual repetition. The same went for the selection of textures for molding lines, scratches and human fingerprints. But even in all that chaos there was a structure that was not necessarily apparent, deliberate or known. Similar to some of the techniques that have always been at the heart of Pixar's USD, by analyzing the combination of properties and resources involved in loading an asset (with its look files and delta overrides), it was possible to discover repetitions that were designed not to be apparent, but that were there nevertheless. There was a finite set of assets created by artists and a finite number of useful combinations. On a single building not much could be saved, but across a vast city a lot could. By exploiting this, the renderer could automatically restructure the scene, reducing load time and avoiding re-processing of identical combinations, without any user input. Once I completed that implementation, I tried to scatter as many Gotham Cities as I could, without telling the renderer they were instances, to witness how the renderer would fall apart. It never did. The super-massive scene had around 53,000 Gotham Cities in it, each made of ~500 million bricks, and the average brick had half a million polygons (tessellated to the level where the LEGO logo on each stud was defined). You may do the math, just for giggles. With an occlusion integrator I could cruise around the streets and bridges at 10-20 fps at the resolution of an average-size viewport.
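(If you do the math: 53,000 cities times ~500 million bricks is roughly 26 trillion instanced bricks, and at half a million polygons per brick that is on the order of 10^19 polygons.) The de-duplication idea itself can be sketched roughly like this. It is a minimal illustration under my assumptions, not Glimpse's actual code: hash the complete "recipe" of an asset, its geometry path, look file and delta overrides, and reuse the already-built prototype whenever the same recipe reappears.

```cpp
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

struct Prototype {
    // Fully processed, render-ready geometry and materials (elided).
};

// The complete recipe that determines what an asset looks like once loaded.
// Two assets with equal recipes are implicit instances of each other.
struct AssetRecipe {
    std::string geometryPath;
    std::string lookFile;
    std::vector<std::string> deltas;  // override layers, in application order

    bool operator==(const AssetRecipe& o) const {
        return geometryPath == o.geometryPath && lookFile == o.lookFile &&
               deltas == o.deltas;
    }
};

struct RecipeHash {
    size_t operator()(const AssetRecipe& r) const {
        size_t h = std::hash<std::string>{}(r.geometryPath);
        h = h * 31 + std::hash<std::string>{}(r.lookFile);
        for (const auto& d : r.deltas) h = h * 31 + std::hash<std::string>{}(d);
        return h;
    }
};

class PrototypeCache {
public:
    std::shared_ptr<Prototype> resolve(const AssetRecipe& recipe) {
        auto it = cache_.find(recipe);
        if (it != cache_.end()) return it->second;  // hidden repetition found
        auto proto = build(recipe);                 // expensive: load + process
        cache_.emplace(recipe, proto);
        return proto;
    }

private:
    std::shared_ptr<Prototype> build(const AssetRecipe&) {
        return std::make_shared<Prototype>();  // real loading elided
    }
    std::unordered_map<AssetRecipe, std::shared_ptr<Prototype>, RecipeHash> cache_;
};
```

The point is that the artists never had to declare any of this: the structure falls out of the finite set of assets and useful combinations, and the cache discovers it for free.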
Through this process I learned deeply that every single bit counts, every clock cycle counts. Conventional CS wisdom says that 10% of your code is responsible for 90% of your compute… whatever… it isn't like that in a production renderer! If you get sloppy, thinking it won't matter, it will come back at you roaring. The aggravation is that when it happens you will have no clue what the problem is, and you will have to work really hard on large data sets to trace the problem back to that moment of weakness when one of your colleagues convinced you that what you were doing was "premature optimization".
Glimpse was not perfect. It had many problems left to solve when I left Animal Logic. Its convergence rate was not as good as it could have been, producing longer render times than it would have had otherwise. The many-lights problem was something of an Achilles' heel, and other engineers have been working on the engine in the years since; they are still working on it as I type this blog post, and they will be tomorrow. I am confident you know how maintaining a renderer in production is a full-time job and a perpetual investment for a studio; if you didn't, now you do. This is the biggest problem: it is not easy to find good rendering engineers. Not because you find "bad" rendering engineers; you just don't find them easily. This is not a field of specialization you'll stick with if 1) you don't love it, or 2) you are not good at it, because if either of the two applies you won't have the stamina to solve the hard problems. Without an unshakeable shield of passion those challenges will be torturous.
You may guess what brought me to Pixar. It's not such a big industry; people talk, studios have eyes and ears, and this type of endeavor doesn't go unseen for long. Pixar asked me to join RenderMan to lead the core engineering team. What followed might have happened anyway, but in case you haven't noticed, RenderMan has been fully interactive since R22. XPU even more so, and it is coming along: faster rendering on the CPU and the GPU… Maybe this was meant to happen.
A lot of people ask me how I ended up at Pixar. When I started, and for the entire duration of it all, I never really thought about where I would end up. I had ideas, not plans. I just tried to do my best where I was, with what I had. So if you feel you are a bit like me and I can give you a single piece of advice: perhaps don't stress about where you want to be tomorrow; think about what you can do today… life happens in the process.
I almost forgot. Stories like these are what happens at great companies such as Animal Logic and Pixar. Companies that are courageous and creative at heart. Companies that are innovative, always pushing the boundaries. Companies that take deep pride in what they do, and especially in how they do it. I am eternally grateful for the 10 years I spent working and innovating at Animal Logic. If I could go back in time, I would do it all over again.