I'm the founder and director of Overbyte, a company that specialises in providing performance optimisation solutions for game companies. Based in Adelaide, Australia, I've worked for a range of companies including Ratbag, Midway, Pandemic, EA and SCEE. I now spend my days helping other studios make their code go fast, by physically coding it myself or via training and consultation.
Posts by Tony Albrecht
  1. A Series of Lessons in Optimisation ( Counting comments... )
  2. How I got into games ( Counting comments... )
  3. Step by Step Optimisation ( Counting comments... )
  4. The Root Of All Evil ( Counting comments... )
  5. The Beauty Within ( Counting comments... )
  6. The Pain Of Debuggery ( Counting comments... )
Technology/ Code /

I've recently started writing a series of articles on optimisation on my company web site and, having just completed the second whilst being concurrently due for an #AltDevBlogADay post I thought I'd cross post it here. If you haven't read any of these optimisation lessons yet, then you might want to start with Lesson 1: Profiling and then read this article (which is Lesson 2), as they deal with iterations of the same code. If you have read them, then please pardon my profligate (cross) posting.

Before we start on the second in this series of optimisation lessons I'd like to make something clear. The type of optimisation that I am predominantly talking about in these lessons is low-level optimisation, or micro-optimisation.  This type of optimisation should only be applied once you are sure that it will definitely have a benefit, once you have exhausted all obvious higher level macro-optimisations. There have been (too many) times in my coding career when I have spotted some obviously inefficient code and spent too much time joyously optimising it only to end up with rigid, highly optimised, unmaintainable, very clever code that makes absolutely no difference to the overall performance. Or, even worse, it runs slower.

Your first task in optimisation is to figure out what is slow, then why that something is slow (and sometimes when) and then remedy that.

Understanding the sample distribution

If you can cast your minds back to Lesson 1, we had profiled our particle system and had started examining the PC samples obtained from our profiler as depicted below.

I've highlighted the two lines with the highest hits and we'll start by focusing on the second, biggest one first.

    blt 0x00011B30

From the interactive C++ to ASM page from the last lesson we can see that this is a branch command and corresponds to the following C++ code:

    if(p.m_Position<0)

To understand why this instruction is such a bottleneck we need to understand a little about how the underlying hardware functions - in particular the CPU pipeline. Modern CPUs can issue a command or two every cycle. These instructions take a variable number of cycles (depending on the instruction) to produce an output, but (and this is important), multiple instructions can be in flow at the same time - each one at a different stage in the pipeline.

For example, consider the following code which does 3 single precision floating point adds:

    fadds f4,f4,f0
 
    fadds f13,f13,f30
 
    fadds f0,f31,f29

 

If each fadds instruction takes 10 cycles, how many cycles does it take to complete these three instructions?

The answer, as with most things when you get deep enough, is "It depends". In order to simplify this problem let's assume that the registers that are being used already have the relevant data in them - this way there is no stalling as the fadds instructions wait for the registers to fill from D cache or main memory (we'll deal with that can of worms in the next lesson).

The following diagram shows some of the details of the PPU pipeline. It's from this article and is definitely worth reading if you are at all interested in the Cell BE architecture.

Part of the PPU Pipeline

Each box in this diagram corresponds to a cycle - for example, the first 4 boxes, IC1 through IC4 correspond to the time taken for the processor to fetch instructions from the Instruction Cache -  4 cycles -  followed by a two cycle dispatch and then three cycles to decode and three more to issue the instruction to either the Branch Unit, Fixed Point Unit or the Load Store Unit. All instructions have to go through this 12 cycle front end, but if this pipeline is fully populated then a new instruction can be dispatched every cycle. The time taken for an instruction to enter, pass through and then leave a pipeline is called it's latency. I won't go into too much depth on the details of CPU pipelines here - check your hardware documentation for more information (for consoles this is generally under NDA anyway) and/or have a read of this Ars Technica article.

Back to the case in hand: The FPU pipeline is 10 cycles long which means it will take 10 cycles for a single instruction to pass through that pipeline (the time taken to get the instruction to the FPU can be ignored for this calculation). A new FPU instruction can be issued every cycle after that; this means that f0 from the third fadds will be ready after cycle 12.

It is worth noting at this point that some instructions (such as divide and square root on the PPU), do not pipeline at all. This means that all the instructions after that command will stall until it has finished, reducing throughput significantly (some non-pipelined instructions take more than 80 cycles).

The more attentive amongst you may have noticed the reuse of the register f0  in the third fadds - this is not an issue as once an instruction is committed into the pipeline the input registers are no longer relevant to that instruction. The compiler will take care of dependencies for you, so you shouldn't need to worry about it, merely be aware of it.

Branching

"So," I hear you ask, "what does this have to do with the branch statement in the code we're looking at?"

Well, I'm getting to that. Branching can be an expensive operation and the PPU tries to minimise its impact by trying to predict which way the code will branch. It does this either statically (the compiler or programmer specifies which code path is most likely taken) or dynamically (the PPU keeps a table of branches and which path the most recent branches have taken and uses that to infer the most likely branch).  Check out Mike Acton's article on branching for more details.

A mispredicted branch will incur a complete pipeline flush which is quite expensive (I've not mentioned the exact value for the PPU as I'm not sure if it is covered by NDA but you should be able to infer it from the earlier pipeline diagram) and even an accurately predicted branch has a slight cost.

If we once again look at the performance graph from the first lesson we are now in a position to understand what is happening to produce this peak on the right:

Graph of execution time taken over frame number (from Lesson 1).

The flat part of the graph on the left corresponds to the expansion of the particle ball while all particles are above the collision plane. When the particles start to collide with that plane, the time taken increases dramatically, rising from around 15ms per update to nearly 60ms (when nearly all the particles are at rest on the collision plane). This increase is processing time is due to not just the extra processing when a collision occurs (three fp multiplies and a fp negate), but also due to the branch and (potential) associated pipeline flush.

Removing Branches

Fortunately, there is an instruction that can help us here; fsel (floating point select). This instruction provides a way of effectively performing our own speculative execution and then choosing one of two operands based on a third (see below):

    d = __fsel(a,b,c);
    // is the same as the following code but with no branch instruction
    if(a>=0)
         d = b
    else
        d = c;

The other benefit of this instruction is that it is pipeline friendly - it also takes 10 cycles and our pipeline is left intact once it has completed. And we can have multiple of these instructions in the pipeline at the same time.

So, let's rewrite our particle code without any classes (to make it easier to see how it maps to asm), pre-emptively calculate the bounced position and velocity and pick the resultant position and velocity via fsel() commands based on the y value of the particle's position.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
void ParticleManager::Update(float fClock)
{
   const MyVector dampVelocity(0.75f,-0.5f,0.75f);
   const MyVector v3Acceleration=MyVector(0.0f, -9.8f, 0.0f) * fClock; 	//  a*t
   const MyVector at2 = v3Acceleration*fClock*0.5f;                     // ½a*t^2
 
   MyParticle *p = &m_Particles[0];
   MyParticle *pEnd = &m_Particles[m_NumParticles];
   for (; p<pEnd; p++)
   {
      float newPosy = p->m_Position.y;
      float newPosx = p->m_Position.x;
      float newPosz = p->m_Position.z;
      float newVelx = p->m_Velocity.x;
      float newVely = p->m_Velocity.y;
      float newVelz = p->m_Velocity.z;
 
      newPosy = newVely * fClock + newPosy + at2.y;
      newPosx = newVelx * fClock + newPosx + at2.x;
      newPosz = newVelz * fClock + newPosz + at2.z;
      newVelx = newVelx + v3Acceleration.x;
      newVely = newVely + v3Acceleration.y;
      newVelz = newVelz + v3Acceleration.z;
 
      float bouncedPos_y = -1.0f*newPosy;
      float bouncedVelx = newVelx*dampVelocity.x;
      float bouncedVely = newVely*dampVelocity.y;
      float bouncedVelz = newVelz*dampVelocity.z;
 
      // choose vel based on y pos
      newVelx = __fsels(newPosy,newVelx,bouncedVelx);
      newVely = __fsels(newPosy,newVely,bouncedVely);
      newVelz = __fsels(newPosy,newVelz,bouncedVelz);
      // set y based on y pos
      newPosy = __fsels(newPosy,newPosy,bouncedPos_y);
 
      p->m_Position.x = newPosx;
      p->m_Position.z = newPosz;
      p->m_Position.y = newPosy;
 
      p->m_Velocity.x = newVelx;
      p->m_Velocity.y = newVely;
      p->m_Velocity.z = newVelz;
   }
}

And here is the assembly generated by compiling the above code.

0x000003D0  bge      0x00000458
0x000003D4  lfs      f9,0x4(r4)
0x000003D8  lfs      f10,0xC(r4)
0x000003DC  lfs      f11,0x10(r4)
0x000003E0  lfs      f12,0x14(r4)
0x000003E4  fmadds   f9,f1,f11,f9
0x000003E8  lfs      f13,0x0(r4)
0x000003EC  fadds    f0,f10,f5
0x000003F0  lfs      f31,0x8(r4)
0x000003F4  fadds    f11,f11,f4
0x000003F8  fadds    f30,f12,f5
0x000003FC  fmadds   f10,f1,f10,f13
0x00000400  addic    r5,r4,0x18
0x00000404  fmadds   f12,f1,f12,f31
0x00000408  fmadds   f13,f6,f2,f9
0x0000040C  fnmadds  f9,f6,f2,f9
0x00000410  fmuls    f31,f8,f0
0x00000414  fmuls    f29,f7,f11
0x00000418  fmuls    f28,f8,f30
0x0000041C  fadds    f10,f10,f3
0x00000420  fadds    f12,f12,f3
0x00000424  fsel     f9,f13,f13,f9
0x00000428  fsel     f0,f13,f0,f31
0x0000042C  fsel     f11,f13,f11,f29
0x00000430  fsel     f13,f13,f30,f28
0x00000434  stfs     f10,0x0(r4)
0x00000438  stfs     f12,0x8(r4)
0x0000043C  stfs     f9,0x4(r4)
0x00000440  stfs     f0,0xC(r4)
0x00000444  stfs     f11,0x10(r4)
0x00000448  stfs     f13,0x14(r4)
0x0000044C  clrldi   r4,r5,32
0x00000450  cmplw    r4,r3
0x00000454  blt      0x000003D4

If you have a HTML5 compliant browser, then you can interactively walk through the ASM and C++ code in another window by clicking on the following image:

As you can see from this assembly (particularly via the interactive walkthrough), the only branches are associated with the for loop which iterates through all of the particles.

We would expect this code to run faster as there are no more data accesses and there are less stalls in the worst case scenario (all particles mis-predicting a branch), but how much faster? The following graph shows the performance of the code in lesson 1 against this lesson's performance.

As you would expect, the performance is constant as there is no branching based on the particle's position, but what you might not have expected is that the baseline performance is better than the original solution. Even though we are calculating the bounced position and velocity in addition to the non-bounced ones for every particle, this code is still significantly faster than the branching one. Basically, we are doing more work in fewer cycles. The removal of the branch removes the need for a compare (which stalled waiting for the result of the increment of the y position) as well as the branch and, thereby allows better pipelining of the inner loop.

This is a pretty good performance improvement for a fairly small change, but finding wins like this in production code is rare. This type of optimisation really only benefits very tight inner loops; using branchless logic in higher level code will usually have no measurable performance improvement at the cost of obfuscating your code - so choose wisely.

The important things to take away from this lesson are:

  1. Instructions pipeline. Taking advantage of that can give you big performance wins
  2. Branching can be bad on certain architectures (primarily because it trashes the pipeline)

Coming up next

So where to next? Let's have a look at the PC samples of this new code and see where the bottleneck has moved to:

Samples attributed to C++ code

Most samples in this code seem to have coalesced around the calculation of the y position of the particle. The reason for this is very interesting and we'll address this in the next lesson on the Overbyte web site.

So endeth the second lesson. Please, leave a comment if you've spotted an error, need something clarified or would just like to leave some feedback.