I am finishing up the next pass of YAVE's light propagation system and one thing I needed to do was reduce that number of expensive computations that go into spreading light from one block into another.

Like Minecraft, the lighting in my game takes the form of a fixed, low-digit number which corresponds to a point on a non-linear curve, but there are key differences. In Minecraft, the light values range from 0 to 16 and correspond to a range from 5% to 98% screen space brightness*. In YAVE, the light values range from 0 to 255, and because YAVE does HDR rendering, these values correspond to a range from 2^-8 to 2^8, chosen because it matches up with a good range of real-world luminance values. Additionally, YAVE does a simplified version of Light Propagation Volumes aka LPV, and therefore stores 6 light values per block**, one for each orthogonal direction. In Minecraft, the relationship between light value and irradiance was arbitrary, in YAVE I've established a formula,

where L_b is the light value stored in memory, and L_hdr is the "true" light value that gets sent to the GPU before it gets tonemapped using standard HDR rendering techniques. The inverse of this is:

In photography terms, increasing a L_b by 1 means increasing the luminance of some object lit by that block, in that direction, by a sixteenth of an f-stop.

* I am simplifying this somewhat but not a lot.

** This will increase in the future to allow colored light and other shenanigans.

Another key difference from Minecraft is in the way light is transferred from block to block. In Minecraft, to update the light value of block L_h you look at the 6 neighboring lights, find the one with the highest value, subtract one, and set L_h to that value if it's greater. Expressed formula-like:

Where L_h' is the new value of the light, L_h is the old value and L_n_i is the light value of each of the neighbors.

Pretty simple, but not good enough for me! For my lighting, a variation of LPV, I need to add up the the amount of light from each neighbor which I expect to have passing through each face of a given cube. Typing all this LaTeX is kind of a pain in the butt so I'll give you a simplified version of the formula. The value that is typically written to memory for a block and direction, b_sum, is

Where b_sum is a value typically assigned to a particular block and direction, L_b_i is a light value in a neighboring block, and w_i is a floating point weight between 0 and 1.

The naive implementation of this was quite slow, and maybe you can see why. Each time I updated the light value of a block and face, I would need to perform six expf() operations and then a logf() operation. So I turned to that time-honored optimization trick, the lookup table.

The simplest optimization was to observe that since L_b_i could only take on a value between 0 and 255, I could make a lookup table to quickly compute 2^c_i. This saved me from having to compute exp2f several times within the inner loop. The savings were hardly chump change.

Next, I observed that w_i also could only take on a few values. I changed the computation around so that it instead reflected this formula:

-16*log2(w_i) would either round pretty well to a byte or it would be so small as to not significantly affect the lighting. Maybe for a physically-based renderer this would be a problem, but YAVE is designed for stylized games set in a cube world. I replaced these weights with -16*log2(w_i), and made it so they get added to L_b_i before it goes into the lookup table. This replaced a float multiply. It was of little performance benefit, although I figured it would maybe pay off later.

It turned out to pay off while I was solving the last problem with the computation, that damned log2f! Even if I used a fast implementation, that would still be a major drag on the speed. I decided it was worth it to try to execute that as infrequently as possible. I had some ideas for things I could do to remedy this. The best idea was get the log2 to the nearest sixteenth by adding the floats as normal, taking the exponent bits of the float as the whole number portion of the log2, and using a range lookup to get the fractional part.

But bit manipulation fries my brain, I was lazy, and decided that for the time being, I could take advantage of the fact that if one of the c_i' values was significantly larger than the rest, I use it as an approximation to b_sum. I made a plot using WolframAlpha to get a gauge on how damaging this would be. I was surprised by the result.

It looked as if the difference between log2(2^x + 2^y) and max(x,y) was affected solely by the difference between x and y! This meant that, in order to make my computation, all I needed to do was find the max of x and y, then subtract some value taken from a lookup table with abs(x-y) as its index. I wasn't certain that this was actually true, nor was I sure how abs(x-y) related to (log2(2^x + 2^y) - max(x,y)). WolframAlpha wasn't cooperating, so I had to break out my rusty algebra skills and figure out for myself whether this trick would work. What I ended up with was this proof: "The Proof That If You Want log_b(b^x + b^y), And X and Y Are Both Bytes, You Can Compute It Using Comparisons, Additions, and a Lookup"

Just as how I can round 16*log(w_i) to a byte without losing too much precision, I could also round this lookup-able value as well. In case I've lost you, what this means is:

- I can compute the approximate light spread between voxels in keeping with the formula for b_sum
- I can do it without exp2f
- I can do it without log2f
- I can do it without using any floats in the inner loop at all
- It involves some loss of precision, but there's no mathematical bullshitting

Pretty cool, huh?

Are you interested in adding this extremely situational code to your own project? Well good news! You can. Just copy this code, and you'll be computing 16*log3(pow(3,A/16) + pow(3,B/16)) in no time.

Want something slightly different? Turn your eyes to the python code in the block comment, set the variables to what you want them to be, and then pre-process it with a standard setup of Cog. You can efficiently sum and then compactly store powers of whatever number you desire. Powers of 2, powers of Pi, powers of your lucky number...

EDIT: As a side note, lookup tables can be implemented in SIMD code using a PSHUFB instruction. I'm hoping that the ISPC compiler will take advantage of this sometime soon.

This code will generate ugly branch. It's too complex for compiler to optimize.

But you can rewrite it like this to avoid any branches (on CL, clang and gcc): https://godbolt.org/g/kK6Buz

mmozeiko

Good catch... I wanted to show off how simple the run-time computation was and ended up making it slower.

I know this isn't a complete optimization article, since I don't have the performance numbers for a fair comparison between techniques. Maybe I'll do it, idk.