I'm Keith Judge. I've been a programmer in the games industry for nearly 13 years, mostly working on graphics and rendering technology. I'm currently working for Pitbull Studio on Unreal Engine 4.

This is my ninth post for AltDevBlogADay and it has occurred to me that I haven't actually written a single article about a graphics technique, so here's my first. I'm going to describe a common technique that uses the stencil buffer to accelerate the rendering of deferred lights. Deferred lights need to render as quickly as possible, as you may be drawing hundreds of them per frame. I haven't found a good tutorial for the technique online, so here's my attempt at writing one.

First of all, here's a screenshot from a sample scene lit by a single deferred spotlight. That looks fine, but let's see what the performance is like by capturing a frame in Intel's Graphics Performance Analyzers (GPA).

Rendering the light takes 297.2 microseconds

The draw call for the light takes 0.3 milliseconds to render, which sounds fast, but we can do better. The darker yellow area of the timing bar represents pixel shader time, and this is the majority of the time taken for this draw call. We can therefore make this light render faster if we can reduce the number of pixels drawn.

The next screenshot shows a wireframe of the light's geometry overlaid on the final render. For every pixel within the wireframe (strictly the back faces only), we run the lighting calculation. The pixel shader does a lot: it reconstructs a view space position from the depth stored in the g-buffer, reconstructs the view space normal from two channels (using Lambert azimuthal equal-area projection, if you're interested) and then calculates diffuse and specular light colour, taking into account distance and spotlight attenuation. As you can see, there are a lot of pixels within the wireframe that are black in the final image. These areas of the light's geometry don't intersect the world geometry, so we're wasting work on them.

We render more pixels for the light than are actually affected by it.
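For readers curious about the two-channel normal storage mentioned above, here is a minimal sketch of one common formulation of Lambert azimuthal equal-area encoding. It is not the shader used for this sample; the `Vec2`/`Vec3` types and the function names are hypothetical, and it is written as plain C++ rather than HLSL so it can be checked on the CPU.

```cpp
#include <cmath>

struct Vec2 { float x, y; };
struct Vec3 { float x, y, z; };

// Pack a unit-length view space normal into two values in [0, 1].
// The projection is singular at n = (0, 0, -1), so that input is not supported.
Vec2 EncodeNormal(Vec3 n)
{
    float f = std::sqrt(8.0f * n.z + 8.0f);
    return { n.x / f + 0.5f, n.y / f + 0.5f };
}

// Rebuild the unit-length normal from the two stored channels.
Vec3 DecodeNormal(Vec2 enc)
{
    float fx = enc.x * 4.0f - 2.0f;
    float fy = enc.y * 4.0f - 2.0f;
    float f = fx * fx + fy * fy;          // equals 2 * (1 - n.z)
    float g = std::sqrt(1.0f - f / 4.0f); // equals sqrt((1 + n.z) / 2)
    return { fx * g, fy * g, 1.0f - f / 2.0f };
}
```

Decoding costs one square root per pixel, which is part of the per-pixel work that the rest of this article tries to avoid on pixels the light never touches.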

If we can somehow make the GPU render only the pixels that are affected by the light, then we can make this quite a bit quicker. Luckily, there's an old technique for calculating the screen space intersection between different volumes, namely "stencil shadows". We split the render into two stages. In the first stage, we fill the stencil buffer for areas where the cone intersects the world geometry, and in the second stage we render the light, testing against the stencil buffer. To do this efficiently, we make use of double-sided stencil and the zfail stencil technique (also known as Carmack's Reverse).

The eagle-eyed amongst you may have had a warning flag go up in your head regarding patents, though I believe that the Creative patent only refers to stencil shadow rendering, not use of the stencil buffer for optimising light rendering. Perhaps someone with more legal experience can confirm or refute my thoughts in the comments.

We create a depth stencil state with the following parameters. I'm using DirectX 11 here, but this should translate directly into more or less any modern graphics API.

// Stage 1: mark the pixels where the light volume intersects the scene.
D3D11_DEPTH_STENCIL_DESC depthstencil_desc = {};
// Normal depth test, but no depth writes.
depthstencil_desc.DepthEnable = TRUE;
depthstencil_desc.DepthWriteMask = D3D11_DEPTH_WRITE_MASK_ZERO;
depthstencil_desc.DepthFunc = D3D11_COMPARISON_LESS;
depthstencil_desc.StencilEnable = TRUE;
depthstencil_desc.StencilReadMask = D3D11_DEFAULT_STENCIL_READ_MASK;
depthstencil_desc.StencilWriteMask = D3D11_DEFAULT_STENCIL_WRITE_MASK;
// Both faces always pass the stencil test and invert stencil when the depth test fails.
depthstencil_desc.FrontFace.StencilFunc = D3D11_COMPARISON_ALWAYS;
depthstencil_desc.FrontFace.StencilDepthFailOp = D3D11_STENCIL_OP_INVERT;
depthstencil_desc.FrontFace.StencilPassOp = D3D11_STENCIL_OP_KEEP;
depthstencil_desc.FrontFace.StencilFailOp = D3D11_STENCIL_OP_KEEP;
depthstencil_desc.BackFace.StencilFunc = D3D11_COMPARISON_ALWAYS;
depthstencil_desc.BackFace.StencilDepthFailOp = D3D11_STENCIL_OP_INVERT;
depthstencil_desc.BackFace.StencilPassOp = D3D11_STENCIL_OP_KEEP;
depthstencil_desc.BackFace.StencilFailOp = D3D11_STENCIL_OP_KEEP;

The important bits we're doing here are leaving depth test on as normal, setting depth writes off, enabling stencil and setting the stencil depthfail operation for both front and back faces to D3D11_STENCIL_OP_INVERT. If your graphics API does not support an INVERT operator, you can use INCREMENT on the back faces and DECREMENT on the front faces.
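If you do go the increment and decrement route, it is worth checking that the operations wrap rather than clamp. In Direct3D 11 that means `D3D11_STENCIL_OP_INCR` and `D3D11_STENCIL_OP_DECR` rather than the `_SAT` variants. The following is a small CPU-side sketch of an 8-bit stencil value for a single pixel, not real rendering code; the enum and function names are made up for illustration.

```cpp
#include <cstdint>

enum class StencilMode { Wrap, Saturate };

// 8-bit increment: wraps 255 -> 0, or clamps at 255 when saturating.
uint8_t Increment(uint8_t s, StencilMode mode)
{
    if (mode == StencilMode::Saturate && s == 255) return 255;
    return static_cast<uint8_t>(s + 1);
}

// 8-bit decrement: wraps 0 -> 255, or clamps at 0 when saturating.
uint8_t Decrement(uint8_t s, StencilMode mode)
{
    if (mode == StencilMode::Saturate && s == 0) return 0;
    return static_cast<uint8_t>(s - 1);
}

// Stencil value of one pixel after both faces of the light volume are drawn.
// Front faces decrement on depth fail, back faces increment on depth fail.
uint8_t StencilAfterLightVolume(bool frontFailsDepth, bool backFailsDepth,
                                bool frontDrawnFirst, StencilMode mode)
{
    uint8_t stencil = 0; // the buffer must start cleared
    for (int i = 0; i < 2; ++i)
    {
        bool isFront = (i == 0) == frontDrawnFirst;
        if (isFront && frontFailsDepth) stencil = Decrement(stencil, mode);
        if (!isFront && backFailsDepth) stencil = Increment(stencil, mode);
    }
    return stencil;
}
```

With wrapping, a pixel where the whole volume is hidden behind a wall (both faces fail the depth test) ends up back at zero whichever face is rasterised first. With saturation, drawing the front face first clamps the decrement at zero and the pixel is wrongly left marked.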

Then we render the light geometry with back face culling off and no pixel shader. It is important that the stencil buffer is completely clear before this is done. The stencil buffer is then filled as in the next screenshot.

Stencil buffer contents overlaid on final render. You can see it fits the areas of the image affected by the light more closely than the cone wireframe above.
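To make the first stage concrete, here is a rough CPU-side simulation of what happens to a single pixel's stencil value. It is only a sketch under simplifying assumptions: the camera is outside the light volume, depth runs from 0 (near) to 1 (far), and the function name and parameters are hypothetical.

```cpp
#include <cstdint>

// Simulates stage 1 for one pixel covered by both faces of the light volume.
// Depth test is LESS; stencil is inverted whenever a face fails the depth test.
uint8_t StencilAfterFirstStage(float sceneDepth, float frontFaceDepth, float backFaceDepth)
{
    uint8_t stencil = 0; // stencil buffer starts cleared
    const float faceDepths[] = { frontFaceDepth, backFaceDepth };
    for (float faceDepth : faceDepths)
    {
        bool depthPass = faceDepth < sceneDepth;        // D3D11_COMPARISON_LESS
        if (!depthPass)
            stencil = static_cast<uint8_t>(~stencil);   // D3D11_STENCIL_OP_INVERT on depth fail
    }
    return stencil;
}
```

There are three cases for a volume whose front face is at depth 0.3 and back face at 0.7. Scene geometry at 0.5 sits inside the volume, so only the back face fails and the pixel is marked. Geometry at 0.9 is behind the volume, so neither face fails and the stencil stays zero. Geometry at 0.1 hides the whole volume, so both faces fail and the two inverts cancel out.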

The next stage is to render the cone again with the following depth stencil state.

// Stage 2: shade only the pixels marked in stage 1, clearing stencil as we go.
D3D11_DEPTH_STENCIL_DESC depthstencil_desc = {};
// Depth testing is off, so DepthFunc has no effect in this stage.
depthstencil_desc.DepthEnable = FALSE;
depthstencil_desc.DepthWriteMask = D3D11_DEPTH_WRITE_MASK_ZERO;
depthstencil_desc.DepthFunc = D3D11_COMPARISON_LESS;
depthstencil_desc.StencilEnable = TRUE;
depthstencil_desc.StencilReadMask = D3D11_DEFAULT_STENCIL_READ_MASK;
depthstencil_desc.StencilWriteMask = D3D11_DEFAULT_STENCIL_WRITE_MASK;
// Front faces are culled in this stage, so these settings are never used.
depthstencil_desc.FrontFace.StencilFunc = D3D11_COMPARISON_ALWAYS;
depthstencil_desc.FrontFace.StencilDepthFailOp = D3D11_STENCIL_OP_KEEP;
depthstencil_desc.FrontFace.StencilPassOp = D3D11_STENCIL_OP_INVERT;
depthstencil_desc.FrontFace.StencilFailOp = D3D11_STENCIL_OP_KEEP;
// Back faces pass only where stencil differs from the reference value (0), then reset it to zero.
depthstencil_desc.BackFace.StencilFunc = D3D11_COMPARISON_NOT_EQUAL;
depthstencil_desc.BackFace.StencilDepthFailOp = D3D11_STENCIL_OP_KEEP;
depthstencil_desc.BackFace.StencilPassOp = D3D11_STENCIL_OP_ZERO;
depthstencil_desc.BackFace.StencilFailOp = D3D11_STENCIL_OP_KEEP;

The important bits here are that depth testing is off, and the stencil test for back faces is set to D3D11_COMPARISON_NOT_EQUAL. With the stencil reference value set to 0 (the second argument to OMSetDepthStencilState), this means only pixels with a non-zero stencil will be shaded. We also set the StencilPassOp for back faces to D3D11_STENCIL_OP_ZERO so the stencil buffer is cleared for the next light. The FrontFace settings are ignored as we're only going to render back faces in this stage.

Then we render the light geometry with the full pixel shader and front faces culled. The visual result is exactly the same, but what of performance? Let's see how long those two draw calls take with another GPA grab.
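The second stage can be pictured with a similar CPU-side sketch. This is an illustration of the stencil test and pass operation only, assuming a stencil reference value of 0 and the default read and write masks; `ShadeMarkedPixels` is a hypothetical name, and the "shading" is just a counter standing in for the pixel shader.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Simulates stage 2 over the pixels covered by the light's back faces.
// Returns how many pixels would run the lighting pixel shader.
std::size_t ShadeMarkedPixels(std::vector<uint8_t>& stencil, uint8_t stencilRef = 0)
{
    std::size_t shaded = 0;
    for (uint8_t& s : stencil)
    {
        if (s != stencilRef) // D3D11_COMPARISON_NOT_EQUAL against the reference value
        {
            ++shaded;        // the full lighting pixel shader would run here
            s = 0;           // D3D11_STENCIL_OP_ZERO on pass: ready for the next light
        }
        // On stencil fail the value is kept, and it is already zero.
    }
    return shaded;
}
```

Because every marked pixel is reset as it is shaded, the buffer is completely clear again afterwards, which is exactly what the first stage of the next light requires.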

Total draw time is now 266.3 microseconds

This saves 30.9 microseconds, or roughly 10% of the original draw time for this particular light, with only minor code changes. Savings will vary depending on how the light and geometry intersect, and also on where the camera is. It should only be marginally slower in the rare case where the light affects every single pixel on the screen.

I hope you find this article useful.

EDIT: Based on a few Twitter comments, I've further optimised this by changing the first stage. It originally tested depth with D3D11_COMPARISON_GREATER_EQUAL and inverted stencil on depth pass; it now tests depth with D3D11_COMPARISON_LESS and inverts stencil on depth fail instead. This is equivalent, but crucially, it allows the GPU's hierarchical depth/stencil optimisations to remain in place, further optimising the rendering, which now takes less than 0.2ms. Yay!
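The equivalence of the two first-stage variants is easy to convince yourself of on the CPU. The sketch below is only a sanity check of the logic, with hypothetical function names; it says nothing about the hardware behaviour that makes the second variant faster.

```cpp
#include <cstdint>

// Original variant: depth test GREATER_EQUAL, invert stencil when the depth test passes.
uint8_t InvertOnPassGreaterEqual(uint8_t stencil, float fragmentDepth, float sceneDepth)
{
    bool depthPass = fragmentDepth >= sceneDepth;
    return depthPass ? static_cast<uint8_t>(~stencil) : stencil;
}

// Revised variant: depth test LESS, invert stencil when the depth test fails.
uint8_t InvertOnFailLess(uint8_t stencil, float fragmentDepth, float sceneDepth)
{
    bool depthPass = fragmentDepth < sceneDepth;
    return depthPass ? stencil : static_cast<uint8_t>(~stencil);
}

// Compares both variants over a grid of depth values, including equal depths.
bool VariantsAgree(int steps)
{
    for (int i = 0; i <= steps; ++i)
    {
        for (int j = 0; j <= steps; ++j)
        {
            float fragmentDepth = static_cast<float>(i) / steps;
            float sceneDepth = static_cast<float>(j) / steps;
            if (InvertOnPassGreaterEqual(0, fragmentDepth, sceneDepth) !=
                InvertOnFailLess(0, fragmentDepth, sceneDepth))
                return false;
        }
    }
    return true;
}
```

A fragment fails a LESS test in exactly the cases where it would pass a GREATER_EQUAL test, so both variants invert the same pixels and the stencil mask is identical.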