
Everyone on the performance coding front knows it: the cache will make or break you. It hardly needs stressing that if your data doesn’t fit in the cache, and your access patterns don’t allow for a high hit rate, you are going to suffer terrible performance.

One of the major reasons for converting to DOD (data-oriented design), an almost “hyped” term these days, is that DOD makes you focus on arranging data in cache-efficient ways. However, we mostly focus on the data cache and tend to forget about the instruction cache.

The role of the instruction cache is to accelerate the fetching of instructions from main memory, exactly the way the data cache accelerates the fetching of data. After all, your code is data! The difference is that it is much harder to optimize for the instruction cache than for the data cache. This is logical if you consider that we can freely arrange data in any way we can devise; we don’t have that freedom with code layout. This makes most of us turn away from the problem and ignore it, focusing on the data cache and hoping all will be fine. Others don’t bother at all, or only go as far as minimizing code size. However, it turns out that cache misses on instruction fetches hurt just as much as regular data cache misses.

Don’t get me wrong, minimizing code size helps a lot. And yes, inlining every function will eventually hurt you. So let’s suppose we have carefully minimized code size; what else is there that affects instruction cache performance? I always like to start with an example, so let’s take a look at the code below:

void updateEntities(entity_t *entities, unsigned int numOfEntities) {
   for(unsigned int i = 0 ; i < numOfEntities ; i++){
      entity_t *p = &entities[i];
      switch(p->type){   /* dispatch to the update function for this entity type */
      case 0:
        updateEntityA(p);
        break;
      case 1:
        updateEntityB(p);
        break;
      }
   }
}

Consider now that we have two kinds of entities, each with its own update function. The above code runs over an array of entities and, depending on the entity type, calls the appropriate update function. Assuming that the entities of the two types are randomly distributed in the array, we can easily see that execution will alternate between updateEntityA() and updateEntityB(). Now suppose that only one function can fit in the instruction cache at a time. This would mean that for every entity we have a 50% probability of loading instructions from main memory. If that sounds extreme, think of it with ten different update functions alternating at random, running on a device like the iPhone. We can clearly do better than that. What would happen if we grouped together entities of the same type? We could do that by having one array for every entity type, or by keeping a single array sorted by entity type. This changes the code access pattern from:

updateEntityA(), updateEntityB(), updateEntityA(), updateEntityC(), updateEntityB(), …

to:

updateEntityA(), …, updateEntityB(), …, updateEntityC(), …

Now the CPU can update all entities of type A with the code resident in the instruction cache, then all entities of type B with that code in the cache, and so on. We have effectively reduced the cache misses from the order of the number of entities to the order of the number of entity types.
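
Spelled out in code, the one-array-per-type variant could look roughly like this. This is just a minimal sketch: the parameter names and updateEntitiesGrouped() are made up for illustration, while updateEntityA() and updateEntityB() are the same update functions as in the example above.

void updateEntitiesGrouped(entity_t *entitiesA, unsigned int numA,
                           entity_t *entitiesB, unsigned int numB) {
   /* All type-A updates run back to back, so updateEntityA() stays
      resident in the instruction cache for the whole batch... */
   for(unsigned int i = 0 ; i < numA ; i++){
      updateEntityA(&entitiesA[i]);
   }
   /* ...and updateEntityB() is only pulled in once, for its own batch. */
   for(unsigned int i = 0 ; i < numB ; i++){
      updateEntityB(&entitiesB[i]);
   }
}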

The point to be taken from the above is that there is performance to be gained by organizing data so that, when it is processed, the code follows similar paths: whether that means taking the same branches consistently, or repeatedly dispatching the same virtual method of a parent class instead of bouncing between the custom implementations of its children. Performance-critical code paths should be tested and optimized for the instruction cache along with the data cache.
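
To make the virtual-dispatch case concrete, here is a rough sketch of the same idea in plain C, with a function pointer standing in for the virtual method. The struct layout, compareByType() and updateAllEntities() are hypothetical names for illustration, and in a real game you would keep the array sorted instead of re-sorting it every frame.

#include <stdlib.h>

/* Hypothetical entity layout: the update function pointer plays the
   role of the virtual method, and entities of the same type share the
   same update function. */
typedef struct entity_s {
   int type;
   void (*update)(struct entity_s *self);
} entity_t;

static int compareByType(const void *a, const void *b) {
   return ((const entity_t *)a)->type - ((const entity_t *)b)->type;
}

void updateAllEntities(entity_t *entities, unsigned int numOfEntities) {
   /* Keep entities of the same type adjacent, so consecutive iterations
      dispatch into the same function body and its instructions stay in
      the cache between calls. */
   qsort(entities, numOfEntities, sizeof(entity_t), compareByType);
   for(unsigned int i = 0 ; i < numOfEntities ; i++){
      entities[i].update(&entities[i]);
   }
}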

Optimizing for the instruction cache can be a difficult balancing act, and since we don’t have the ability to lay out the code exactly, we must try to understand what the compiler is doing under the hood and do lots of trials and benchmarks. Inspecting the machine code emitted by the compiler is also crucial. Unroll loops and inline functions with caution; I usually do it incrementally while monitoring the performance. However, keeping the code small is not going to help you if you run all over your codebase doing “random” stuff. Keep it in order, and you shall receive!