One of the challenges in writing a Cube-Based Game* system is meshing. In order to quickly draw a large number of cubes, it helps to group large numbers of cubes together into a single mesh, which gets cached on the GPU. In every cube-based game system I've seen, each mesh corresponds to a fairly large neighborhood called a "chunk." This makes it fast and easy to cull entire meshes from being drawn.

It's common for game engines to represent the position of mesh vertices in GPU memory by assigning each vertex 3 floating point numbers, absent any other per-vertex information. Unfortunately, due to the complexity of cube meshes, this approach results in meshes which are just too large. In an absolute worst case scenario (think a 3D checkerboard, where every face of every solid cube is exposed), a 256x256x256 cubit** region could require a set of meshes with 300 million vertices between them. If every one of these vertices stored 3 floating point numbers, that's over 3 gigabytes that needs to get stored in memory.
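(If you want to check that arithmetic, here's the back-of-the-envelope version as a standalone C snippet; the 3D-checkerboard worst case is the scenario I have in mind.)

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        // Worst case: a 3D checkerboard, where half the cells hold a cube and
        // every face of every cube is exposed.
        uint64_t cells    = 256ull * 256 * 256;            // 16,777,216 cells
        uint64_t cubes    = cells / 2;                     // half of them are solid
        uint64_t faces    = cubes * 6;                     // all 6 faces exposed
        uint64_t vertices = faces * 6;                     // 2 triangles = 6 vertices per face
        uint64_t bytes    = vertices * 3 * sizeof(float);  // 3 floats per vertex

        // Prints roughly: 301989888 vertices, 3.62 GB
        printf("%llu vertices, %.2f GB\n", (unsigned long long)vertices, bytes / 1e9);
        return 0;
    }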

Now, getting my meshes to take up the absolute smallest amount of space possible on day one -- so that my cube-based game system could run on a Nintendo Switch -- would be too early of an optimization for me to do. However, I wanted to know that such an optimization would be possible down the line, once it's established that I do in fact need to make it.

Sean Barrett's open-source projects inspire a lot of my own work, and by that I mean I sometimes take some of his code and reimplement it for my own commercial project. One subject of plunder was his strangely-named stb_voxel_render.h library, which has a clever approach for compactly representing cube meshes. stb_voxel_render is one of Sean's gnarliest, most macro-heavy, hyper-configurable, feature-packed libraries, which is why I coded my own derivative of it (I also wanted +y to be up, a mistake in retrospect). But at its core, it does two very simple things:
  • It stores positions with as little precision as possible.
  • It stores some data in faces instead of vertices, since the positions of the six vertices which comprise a cube face can be computed based on the position and cardinal direction of the face.

To better explain this, let's set aside all the fancy data that stb_voxel_render.h can store in a cube face, including color, ambient occlusion, height, slope, rotation, transparency groups, position ranges that aren't 0-255 for all 3 axes, and crossfading textures. To make for a simple example, I'll store just the position of a cube face***. That lets you draw an untextured cube mesh that fits within a 256x256x256 cubit box. You can get away with using something as small and simple as this:
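(The layout and field names below are just illustrative; the point is that a whole face fits in 4 bytes.)

    #include <stdint.h>

    // One cube face: 4 bytes total.
    typedef struct
    {
        uint8_t x, y, z;   // which grid cell (0-255 on each axis) the face borders
        uint8_t faceDir;   // which of the 6 cardinal directions the face points
    } CubeFace;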

Each item in the array is a face. The x, y, and z values tell you which cell in a 3D grid the face borders, and the faceDir tells you how the face borders that cell. We take this buffer of compactly-represented cube faces and store it in the GPU's memory... we will need it later.

Now, the information that you attach to each individual vertex is... nothing! And you ask from the vertex nothing other than its index in the conceptual list of data-free vertices you've passed in. You use this index to look up the face it corresponds to, and do a little bit of modular arithmetic to get its position relative to the lowest, southmost, westmost point in the mesh's bounds. This is done in a vertex shader, but here I've written it in pseudo-C.
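(A sketch of that lookup, with illustrative names: Vec3 is a plain float triple, faceCornerTable holds the six corners of a unit quad for each cardinal direction, and CubeFace is the struct from the earlier snippet.)

    typedef struct { float x, y, z; } Vec3;

    // faceCornerTable[dir][corner]: the 6 corners (two triangles) of a unit quad
    // facing in direction dir, as offsets from the cell's minimum corner.
    extern const Vec3 faceCornerTable[6][6];

    Vec3 computeVertexPosition(const CubeFace *faces, uint32_t vertexIndex)
    {
        uint32_t faceIndex   = vertexIndex / 6;  // which face this vertex belongs to
        uint32_t cornerIndex = vertexIndex % 6;  // which corner of that face it is

        CubeFace face   = faces[faceIndex];
        Vec3     corner = faceCornerTable[face.faceDir][cornerIndex];

        // Position relative to the lowest, southmost, westmost point of the mesh's bounds.
        return (Vec3){ face.x + corner.x, face.y + corner.y, face.z + corner.z };
    }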

The buffer of cube faces can take the form of a constant/uniform array, a texture, or a StructuredBuffer/SSBO. They all work just fine, though they all involve a bit of goofiness to set up.
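(As a taste of that goofiness, here's roughly what the SSBO route looks like in plain OpenGL, which is where this system started out; faceBuffer, faces, faceCount, and the binding point are illustrative.)

    // Upload the compact face buffer as an SSBO (OpenGL 4.3+).
    GLuint faceBuffer;
    glGenBuffers(1, &faceBuffer);
    glBindBuffer(GL_SHADER_STORAGE_BUFFER, faceBuffer);
    glBufferData(GL_SHADER_STORAGE_BUFFER,
                 faceCount * sizeof(CubeFace), faces, GL_STATIC_DRAW);

    // Bind it to binding point 0 so the vertex shader can read it.
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, faceBuffer);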

So with this model, the amount of data per face is 8 bits each for the x, y, and z of the grid cell, and then 8 more bits for the face direction (we could hold more stuff in those 8 bits, but never mind that). Whereas previously we had 32x3 bits per vertex, here we use 32/6 bits per vertex. So the memory footprint of a mesh gets reduced by a factor of 18!!

Compacting my data in this way grants me some peace of mind. It took a bit of work to set up, but it was the sort of pleasant programming work where you get continuously closer to having code that does what you want it to do, and then you're done in a few hours. Beyond the future performance concerns, it makes serialization easier, it makes state dumps smaller, and it means I can more reliably do debugging without Spotify getting choppy. I recommend it.

This was first done while I was working entirely in C++ and OpenGL. But how do I do it in Unity? It took me a while to figure out a good answer. In general, Unity's graphics APIs require that all meshes adhere to a specific format, that is, the intuitive format in which each vertex gets 3 floats to store its position. If you don't have a mesh with 100K vertices, you can't issue a draw call with 100K vertices. If you don't have 300K floats stored, you don't have that mesh. It seemed like I was back to square one in terms of mesh memory consumption. This was one of the reasons I had been avoiding commercial game engines for a long time. I could just have Unity call my own native code, and not play ball with most of Unity's existing rendering system, but this seemed like a bad choice for a Unity plugin, given that I want the average Unity user to be able to use it without too much trouble.

I searched for a clever workaround. My first attempt was to store a Unity-compliant, three-floats-per-vertex mesh (hereafter referred to as a Piggy Mesh) which has as many vertices as my largest compact mesh. I could use this Piggy Mesh to issue draw calls for many compact meshes, completely ignoring the contents of the Piggy Mesh. I wouldn't be happy about having to store such a large mesh, but at least I'd only have to store one. This strategy unfortunately didn't pan out because it gave me no control over how many vertices got drawn, so I had to discard a large number of them in the vertex shader. I then tried making a smaller mesh, dubbed the Piglet Mesh. For each compact mesh, I would issue several hundred draw calls using the same Piglet Mesh each time. My vertex shader ended up looking something like this:
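(Same illustrative pseudo-C as before. pigletFirstFace is a per-draw-call constant saying which slice of the face buffer this particular draw covers, Vec4 is a plain float quadruple, and worldToClip stands in for the usual model-view-projection plumbing.)

    typedef struct { float x, y, z, w; } Vec4;

    Vec4 pigletVertex(uint32_t vertexIndex)
    {
        uint32_t faceIndex   = pigletFirstFace + vertexIndex / 6;
        uint32_t cornerIndex = vertexIndex % 6;

        if (faceIndex >= faceCount)
        {
            // This draw call overshot the end of the compact mesh: collapse the
            // vertex so its triangle is degenerate and gets discarded.
            return (Vec4){ 0, 0, 0, 1 };
        }

        CubeFace face     = faces[faceIndex];
        Vec3     corner   = faceCornerTable[face.faceDir][cornerIndex];
        Vec3     position = { face.x + corner.x, face.y + corner.y, face.z + corner.z };

        return worldToClip(position);
    }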


This resulted in fewer vertices being discarded, but it created an amount of driver overhead that I wasn't satisfied with.

I was on the toilet thinking about the problems with the Piglet Mesh, when the dots started to connect in my head...

same mesh...


multiple draw calls...


driver overhead...


a variable called "instance"...


same mesh...


multiple draw calls...


driver overhead...


a variable called "instance"...



I realized that I wanted to look at piglet videos!



Then while I was watching piglet videos, I remembered that Unity had just recently added some APIs to use GPU instancing.

GPU instancing is a feature of modern graphics cards that gives them the ability to run Total War games. It allows you to draw multiple copies of one mesh and access parameters pertaining to each particular copy... and you can do this all with a single command issued to the GPU. Taking advantage of this required very few changes to my vertex shader: I had to add some #directives and have it take in an instance id as a parameter.
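(Conceptually, the shader becomes something like the following, again in illustrative pseudo-C; in the real shader the instance id arrives via Unity's instancing directives, and FACES_PER_INSTANCE is however many faces each Piglet-sized slice covers.)

    Vec4 instancedVertex(uint32_t vertexIndex, uint32_t instanceID)
    {
        // The per-draw-call offset from the Piglet approach becomes a
        // per-instance offset, so one draw call covers the whole compact mesh.
        uint32_t faceIndex   = instanceID * FACES_PER_INSTANCE + vertexIndex / 6;
        uint32_t cornerIndex = vertexIndex % 6;

        if (faceIndex >= faceCount)
            return (Vec4){ 0, 0, 0, 1 };  // degenerate vertex, gets discarded

        CubeFace face     = faces[faceIndex];
        Vec3     corner   = faceCornerTable[face.faceDir][cornerIndex];
        Vec3     position = { face.x + corner.x, face.y + corner.y, face.z + corner.z };

        return worldToClip(position);
    }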

The only difficulty I had in doing this was when I ran into one fiasco of a bug, owing to worse-than-poor documentation on the part of Unity. The documentation for Graphics.DrawMeshInstanced states that "meshes are not culled by the view frustum." OK, first off, a frustum is a geometric primitive that doesn't cull anything, and secondly, my meshes were in fact subject to undocumented and difficult-to-reverse-engineer frustum culling. It turns out that the way to control the volume that is checked during the frustum cull is to set the bounds property of the Piglet Mesh to the maximum possible bounds of the mesh you're truly drawing once everything has gone through your vertex shader. If you're like me, and you haven't played around with Unity's Mesh class extensively, and you didn't know that it has a stateful bounding box member, I have likely saved you several hours. Unity, please fix this! You have powerful graphics capabilities now, but stuff like this will scare people away!

I haven't pinned down the optimal number of vertices to use for each instance, but I've found that anywhere between 100 and 1000 works quite well. I also don't have any performance measurements other than "just peachy."

If you're looking for a Unity project you can quickly download and play with, I'm afraid I don't have that. Fortunately, the talented Keijiro Takahashi has many. NoiseBall is a nice small example. Some exercises you can do with it are:
  • Replace the surface shader with a vertex and fragment shader.
  • Replace the call to DrawMeshInstancedIndirect with a call to DrawMeshInstanced.
  • Increase the number of triangles per instance.
  • Replace the position/normal buffers filled by NoiseBall2.compute with buffers filled by C# code.
  • Make a whole videogame.

I hope this has been an informative learning adventure! Bookmark this blog and click it every month or so for updates. Or, follow me on twitter for more frequent and somewhat stupider updates.

*For first time readers: this blog details the development of a system to render lots of textured, grid-aligned cubes, like you might see in Minecraft. For technical reasons, I call them "Cubes" instead of voxels. I do not think this distinction is particularly important, but I'm tired of seeing this topic disrupt the IRCs I frequent.
**Side length of a cube within a cube-based game system. Partly for wordplay, partly because I think 1 cube length = 0.5 meters is a good scale to use.
***Ironically, this is one thing that stb_voxel_render does not allow you to store per-cube-face, though this is planned for a future release.