Creator of the PathEngine pathfinding middleware library, writes about programming at upcoder.com.
Posts by Thomas Young
  1. Zero Initialisation for Classes ( Counting comments... )
  2. Avoid Resize()? ( Counting comments... )
  3. Roll-Your-Own Vector ( Counting comments... )
  4. Efficient Vectors of Vectors ( Counting comments... )
  5. Using STL Vectors ( Counting comments... )
Technology/ Code /

(Also posted to upcoder.com, third in a series of posts about Vectors and Vector based containers.)

STL vectors offer a great combination of dynamically resizing convenience and low level contiguous buffer accessing performance, but std::vector is not the only option, and custom implementations of this container can sometimes be a better fit for your specific requirements.

Roll your own

It seems like it's fairly common practice in the game development industry to replace stl::vector (and other standard containers) with custom STL-style vector or array implementations.

Some public examples are the EASTL (Electronic Arts Standard Template Library) and the Bitsquid Foundation Library, and I know that a lot of other developers also do something similar.

I think that a lot of people may have been 'got bitten' in the past with STL containers, or at least found that they were able to resolve issues or significantly improve performance by replacing STL containers with their own code. This was definitely the case with PathEngine.

We switched to a 'home grown' version of std::vector some years back as a result of some concrete run time performance issues, and saw some quite significant improvements from this change.

Two parts

There are potentially two main reasons for switching:

  • issues with the mechanism for custom memory allocation for STL containers, and
  • performance considerations

For some people memory allocation is the main reason for switching to a custom vector allocation, but this wasn't an issue for us when we first switched to our own vector implementation. At that time, memory allocation customisation in PathEngine was fairly minimal (based on overriding global new and delete and only possible for DLL builds) and we didn't use STL allocators.

The initial motivation for switching PathEngine to a custom vector was performance, and I'll be looking exclusively at the performance side of things in this post, but will come back to look at custom allocation issues separately later on.

Initial issue and expectations

The initial issue for us was with stl::vector methods showing up as significant in performance profiling, for certain operations, and with clients telling us that they needed these operations to run faster.

Most of our vectors contain simple POD (or at least trivially copyable) elements. I had read the 'Generic typed buffers' series of articles by Andrei Alexandrescu on Dr Dobbs (part 1 here, part 2 here, part 3 here, but you have to register to read more than 2 articles). and I was expecting to see the biggest improvements from type specialisations for these elements, but this turned out not to be the most significant issue.

Basic vector implementation

Let's have a look at what a basic vector implementation might look like.

We'll make some simplifications.

In particular we'll avoid adding any iterator classes and just use index values or raw pointers to elements for iteration. But we'll also ignore things like exceptions. (As with a lot of other game development code PathEngine doesn't throw or handle exceptions.) And we'll stick to C++03 for backwards compatibility.

Let's start with the following:

#include <string.h>
#include <stdlib.h>
#include <new>

template <class T>
class cVector
{
public:
    typedef typename std::size_t size_type;
private:
    size_type _size;
    size_type _capacity;
    T* _data;

    static T*
    allocate(size_type size)
    {
        return static_cast<T*>(malloc(sizeof(T) * size));
    }
    static void
    copyRange(T* begin, T* end, T* dest)
    {
        while(begin != end)
        {
            new((void*)dest) T(*begin);
            ++begin;
            ++dest;
        }
    }
    static void
    deleteRange(T* begin, T* end)
    {
        while(begin != end)
        {
            begin->~T();
            ++begin;
        }
    }

public:
    typedef T* iterator;
    typedef T value_type;

    cVector()
    {
        _size = 0;
        _capacity = 0;
        _data = 0;
    }
    ~cVector()
    {
        deleteRange(_data, _data + _size);
        free(_data);
    }

    // needs some other constructors
    // size(), empty() and a bunch of other methods!

    void
    push_back(const T& value)
    {
        if(_size != _capacity)
        {
            new((void*)(_data + _size)) T(value);
            ++_size;
            return;
        }
        size_type newCapacity = _capacity ? _capacity * 2 : 1;
        T* newData = allocate(newCapacity);
        copyRange(_data, _data + _size, newData);
        new((void*)(newData + _size)) T(value);
        deleteRange(_data, _data + _size);
        free(_data);
        _data = newData;
        _capacity = newCapacity;
        ++_size;
    }
    T&
    operator[](size_type index)
    {
        return _data[index];
    }
    const T&
    operator[](size_type index) const
    {
        return _data[index];
    }
    T*
    begin() const
    {
        return _data;
    }
    T*
    end() const
    {
        return _data + _size;
    }
};

This misses out a lot of the vector interface, obviously, but still manages to give us what are arguably the key benefits of STL style vectors, amortized constant time push back and automatic buffer memory management, in quite a small amount of code.

Note that we can continue using a load of other stuff from the standard library (such as sort algorithms), because of the way STL design decouples algorithms from actual container implementations through the iterators concept. Raw pointers into our vector buffer meet same the requirements for random access iterators as the iterators returned by std::vector.

The code is pretty straightforward, but with one slightly tricky aspect being placement new. This is kind of fundamental to vector containers, and so it's important to understand a bit about what's going on here.

We're using malloc and free for underlying buffer allocation, with constructor and destructor calls invoked separately from actual memory allocation. The call to new((void*)dest) T(*begin) constructs an object of type T at the (already allocated) memory location pointed to by dest, using T's copy constructor with argument *begin. The call to begin->~T() calls T's destructor without freeing the underlying memory.

Malloc and free will do for now, but I'll look at customising the underlying memory allocation methods in my next article.

A lot of stuff like size(), empty(), front(), back() has been omitted to save space, but can be added easily enough.

Doing it yourself

So far it's unlikely that this code will actually give us any speedups over standard library versions, at least in release builds, but there are already some interesting implications and potential advantages of 'doing it yourself' like this:

  • We can control exact container semantics, independent of compiler version
  • The code is simpler and easier to understand
  • All of the code is available to us
  • It can work out faster in debug builds
  • There are very minimal dependencies

Knowing exactly what semantics will be applied for underlying buffer allocation can be important in order to use vectors appropriately, for example it is nice to know for sure if calling clear() will free the underlying buffer (as discussed in this previous post).

If you step into the push_back operation of your standard library vector implementation in a debugger, you'll probably find this goes through a bunch of cryptically named template methods, which may then also end up with some calls to standard library binary components (for which source may not be available). It's not so easy to understand what's going on, and although this should all get optimised out in release builds, debug builds are obliged to actually perform each of these calls (so that everything makes sense in the debugger).

And the vector implementation in your standard library also probably depends on a bunch of other stuff, with the vector header pulling in a bunch of other headers. In many situations we can depend on the C++ standard library being available, and working correctly, but in some cases this may not be the case, and extra dependencies may be an issue. A case in point was with the Android NDK, particularly in early versions, where C++ standard library support was limited and it wasn't possible to depend on all of this working completely.

But besides the above points I think that although it's great just to be able to have a look under the hood, understand a bit about how this stuff works, or can work, and dispel some of the mystery that can otherwise shroud some of our standard library invocations!

Reinventing the wheel

There are a lot of reasons, conversely why not to do something like this. We're told not to 'reinvent the wheel' and this is pretty good general advice.

Standard vector implementations get a lot of testing, and there may be non-obvious corner cases or tricky situations, which we don't consider, but which are handled correctly by std::vector.

A vector implementation written by your compiler vendor may take advantage of compiler specifics, or otherwise work together with the compiler to get better performance than an implementation written in generic C++.

Cutting stuff out of the vector interface means that some standard vector based code won't work, and other semantic differences from std::vector can be a potential source of confusion.

Standard vectors may be tooled up to the teeth with stuff like debug checking instrumentation, which can help catch some tricky bugs. (But note that this is partly orthogonal to the issue of faster debug builds, because we can add some debug checking back in without the same kind of call complexity as standard vector implementations.)

Also, standard vector implementations are under development by an external party, and if we stick with the standard implementation we can get the benefits from new releases in the future without any work on our side (kind of like middleware licensing).

One case in point is C++11 move semantics. The PathEngine vector implementation was written before move semantics and so you probably won't get much benefit from implementing move constructors in the contained elements when using this class, unless we update the vector class implementation, but if we stuck with std::vector this is something we would have got 'for free'. (In practice PathEngine avoids using non-simple element types for vectors in performance critical code, so move semantics are not so important for us, but the basic point about external updates and improvements remains valid.)

Some good points on both sides of the argument then, but keeping these various considerations in mind, let's go on to look at what we can actually do with custom vector code in place..

Resize methods

We'll need to look at what goes on in the resize() methods, so lets go ahead and add these:

    void
    resize(size_type newSize)
    {
        if(newSize <= _size)
        {
            deleteRange(_data + newSize, _data + _size);
            _size = newSize;
            return;
        }
        if(newSize <= _capacity)
        {
            constructRange(_data + _size, _data + newSize);
            _size = newSize;
            return;
        }
        size_type newCapacity = newSize;
        if(newCapacity < _size * 2)
        {
            newCapacity = _size * 2;
        }
        reallocate(newCapacity);
        constructRange(_data + _size, _data + newSize);
        _size = newSize;
    }
    void
    resize(size_type newSize, const T& fillWith)
    {
        if(newSize <= _size)
        {
            deleteRange(_data + newSize, _data + _size);
            _size = newSize;
            return;
        }
        if(newSize <= _capacity)
        {
            constructRange(_data + _size, _data + newSize, fillWith);
            _size = newSize;
            return;
        }
        size_type newCapacity = newSize;
        if(newCapacity < _size * 2)
        {
            newCapacity = _size * 2;
        }
        reallocate(newCapacity);
        constructRange(_data + _size, _data + newSize, fillWith);
        for(size_type i = _size; i < newSize; i++)
        {
            ::new((void*)(_data + i)) T(fillWith);
        }
        _size = newSize;
    }

These depend on the following additional private helper methods:

    void
    reallocate(size_type newCapacity)
    {
        T* newData = allocate(newCapacity);
        copyRange(_data, _data + _size, newData);
        deleteRange(_data, _data + _size);
        free(_data);
        _data = newData;
        _capacity = newCapacity;
    }
    static void
    constructRange(T* begin, T* end)
    {
        while(begin != end)
        {
            new((void*)begin) T();
            ++begin;
        }
    }
    static void
    constructRange(T* begin, T* end, const T& fillWith)
    {
        while(begin != end)
        {
            new((void*)begin) T(fillWith);
            ++begin;
        }
    }

Potential gotcha: push back reference to an existing element

We've moved some shared logic out into that reallocate() method. This is quite similar to some code in the push_back() method, and it can be tempting to use the same reallocate call there also, but there's a subtle trap to watch out for.

The temptation is to move copy construction of the pushed element after the buffer copy and delete so we can refactor push_back() as follows:

    void
    push_back(const T& value)
    {
        if(_size != _capacity)
        {
            new((void*)(_data + _size)) T(value);
            ++_size;
            return;
        }
        size_type newCapacity = _capacity ? _capacity * 2 : 1;
        reallocate(newCapacity);
        new((void*)(newData + _size)) T(value); // this line moved after buffer delete
        ++_size;
    }

But this will now fail (potentially unpredictably!) for certain cases where the value being pushed back actually references back into the existing vector buffer, for example with something like v.push_back(v.front()).

There's some discussion about this in this stackoverflow question, but it looks like standard vector implementations all support this kind of push_back operation, and so this is something we should try to support in our custom vector.

[Update: turns out this gotcha got me again! As pointed out by DeadMG on the comments for a later post, the same issue also applies to one of the resize() methods I've posted here, e.g. with v.resize(v.size() + 3, v.back()).]

Code repetition, resizing without a fill value

There's still a lot of code repetition between the two resize methods, but we want to make sure that this is fast, and so I'm kind of treating this as the first priority. For example, we could combine these into a single method by setting a default value for the fillWith argument, (fillWith=T()) but this will most likely result in a load of unnecessary copy operations for the case where no fill value is actually required.

We've added a variation on the placement new call, in fact, for that no fill construction case. Let's look at this a bit more closely.

Construction subtleties and zero initialisation

To construct elements without a fill value in the above code, we use new((void*)begin) T(). The idea is to call the default constructor for class T with memory already allocated, but it turns out that there are some subtleties here in certain cases, notably when T is a built in type.

It turns out that when this version of placement new is called for a built in type we get 'zero initialisation', but we can rewrite this as new((void*)begin) T (no brackets!) and get 'default initialisation'.

There's an good explanation about the difference between these two initialisation types on this stackoverflow question.

Note that the same issue applies on vector construction when we do something like cVector<int> v(10), which uses quite similar code but with the corresponding constructor not shown in the example code.

When we first switched to a custom vector implementation we decided to change these construction semantics, and use default initialisation, and it turned out that this was actually the main concrete benefit in terms of actual speedups for our vector use cases.

I tested this again with the current SDK build, in a modern compiler environment (Clang 3.0, with -std=c++0x), and this still makes a significant difference in certain situations, although now mostly limited to specific load and save methods.

For example, loading unobstructed space for our flatshead_dungeon benchmark mesh:

  • with standard zero initialisation semantics takes 0.000143 seconds
  • with modified default initialisation semantics takes 0.000113 seconds

It looks like the difference in this specific case is then actually mainly down to the following code for loading persistent data into a buffer:

// _backStore is a cVector<char>
// tUnsigned32 is an integer typedef
// 'is' is an input stream object
    tUnsigned32 arraySize;
    is.openNextElementAsArray_GetSize(arraySize);
    _backStore.resize(arraySize);
    is.finishReadElementAsArray(&_backStore[0]);

So with standard vector semantics this will normally zero the buffer before loading, and this can be a noticeable overhead.

If we want _backStore to correctly report it's size, and are avoiding nasty hacks, it doesn't look like there is any way to avoid this overhead with a standard vector implementation, but we could decide just not to use a vector in this case and revert to raw buffer allocation.

Changed semantics

When using our custom vector implementation with this change we need to be aware of the changed semantics, but to be quite honest I actually wasn't aware of this aspect of std::vector behaviour before we switched (elements being zeroed on vector construction or resize) and so was already writing code as if element values were undefined.

I guess an alternative approach for the resize() method, could be to add a separate method called resize_Uninitialised(), and allow this to be chosen explicitly by the calling code, but it's not so easy to provide this kind of choice on construction.

Standalone benchmark for vector resize

Checking this against std::vector, for completeness, with the following benchmarking code (with Clang 3.0 and -std=c++0x):

template <class tVector> static int
ResizeBenchmark()
{
    int sum = 0;
    for(int i = 0; i != 400; ++i)
    {
        tVector v;
        v.resize(100000);
        v.back() += 1;
        sum += v.back();
    }
    return sum;
}

I get:

container type build time sum
std::vector debug 0.0924 seconds 400
std::vector release 0.0125 seconds 400
cVector, zero initialisation debug 0.0930 seconds 400
cVector, zero initialisation release 0.0338 seconds 400
cVector, default initialisation debug 0.0002 seconds 79801
cVector, default initialisation release 0.0001 seconds 79801

The sum variable is there to help ensure the benchmark logic is not optimised out, but also shows us that different semantics are being applied (and we can see that, in the default initialisation cases, the OS is actually giving us back the same buffer again each time through the loop).

In general stand-alone benchmarks like this should be taken with a handful of salt (because of cache issues and because the actual effects on real code performance will usually be quite different), but can be useful at least to demonstrate that there is a difference.

(Note that there can be other ways to avoid the initialisation cost shown in this benchmark, however. The next post in this series discusses this in more detail.)

Type specialisation

The other main change we made was to detect POD types and use specialised code paths for vectors with these element types.

The point is that, for built-ins and certain user defined types, constructor and destructor calls will have no effect and can be omitted, and copy construction can be replaced with memcpy.

With C++11 this becomes quite straightforward, as we can use is_trivial and is_trivially_copyable to ask the compiler whether or not the corresponding shortcuts can be applied for a given type.

Before C++11, however, we had to set this up for ourselves, and we used the following mechanism for this:

struct cPOD_Tag {};
struct cNonPOD_Tag {};
template <class T>
struct cTypeTraitsHelper
{
    typedef cNonPOD_Tag tPODType;
};
#define DECLARE_POD_TYPE(t) template<> struct cTypeTraitsHelper<t*> { typedef cPOD_Tag tPODType };
DECLARE_POD_TYPE(int)
DECLARE_POD_TYPE(char)
//...... add declarations for other types, as required
template <class T>
typename cTypeTraitsHelper<T>::tPODType PODType(T&)
{
    return typename cTypeTraitsHelper<T>::tPODType();
}

With this set up appropriately we can then switch our buffer helper methods depending on whether or not the contained element is declared as a POD type, for example:

    static void
    constructRange(T* begin, T* end, cPOD_Tag)
    {
    }
    static void
    constructRange(T* begin, T* end, cNonPOD_Tag)
    {
        while(begin != end)
        {
            new((void*)begin) T;
            ++begin;
        }
    }
    static void
    constructRange(T* begin, T* end)
    {
        constructRange(begin, end, PODType(begin));
    }

(And similarly for stuff like destruction, copy construction and buffer fill.)

This kind of type specialisation made some difference to performance historically, but in practice it turned out that the compiler is often clever enough to optimise this stuff out by itself, and the actual performance difference resulting from this change was not hugely significant.

I've tested this again in the current release (with Clang 3.0, with and without -std=c++0x)and if I remove all the POD type declarations it looks like the only stuff that actually gets affected by this significantly are now some methods for saving preprocess objects to persistence, which are not so performance critical.

Tweaking initial push_back behaviour

Another optimisation is to adjust the initial behaviour when calling push_back on an empty vector, to avoid small allocations.

After some benchmarking we changed the following line in the push_back() method:

    size_type newCapacity = _capacity ? _capacity * 2 : 1;

to

    size_type newCapacity = _capacity ? _capacity * 2 : 8;

What this does is to avoid initial allocations of size 1, 2 and 4 and jump straight to a buffer of size 8, which works out as a net performance gain for us.

If you think about what's going on in the underlying memory allocation subsystem this makes sense, because small allocations are probably usually going to be aligned and padded up to a certain minimum size anyway.

In-class buffers

Another way to avoid small allocations can be to reserve a fixed amount of space in the vector itself for use when the buffer allocation requirement is below a certain size.

I know of at least one developer that uses this technique but we don't do this currently at PathEngine.

Amortize faster!

The above code doubles vector capacity each time a new buffer is allocated to ensure amortized constant time push_back, but there's nothing special about the number 2 here. In fact, we can should be able to use any constant > 1 and still get this amortized complexity.

There's a trade-off here between number of allocations (which take time) and buffer overallocation (which cost memory, and also potentially reduce cache efficiency), but in PathEngine, changing this constant to 3 seems to give us a net benefit:

    size_type newCapacity = _capacity ? _capacity * 3 : 8;

Note that it's all the more important with this setting to make sure we avoid permanent overallocations wherever possible, e.g. by applying the 'shrink-to-fit' idiom, as discussed here.

Other optimisations and benchmarking

There are other tweaks you could try out, but this can start to be a bit hit and miss if you don't have some good high level benchmarks set up to capture the real performance situation needing to be optimised.

With benchmarks in place a sampling profiler can help you pin down areas where you can most benefit from optimisations, but watch out for situations where the real issues are with higher level vector management and using vectors effectively. Maybe a whole bunch of allocations could be avoided just by reusing a vector across calls or across frames, but a profiler isn't going to tell this directly.

Note that a number of the above tweaks target the underlying memory allocations because this is often the most significant cost associated with vector operations, and there may then be other ways to reduce these memory allocation costs.

I'll look into the issue of memory allocation customisation more generally in my next post, and we'll see, for example, that using realloc() is something that can also potentially improve performance.

Conclusion

It's good to know that we can replace std::vector, first of all. There's a bit of placement new trickery, and at least one gotcha to look out for, but overall this turns out to be not as hard as you might expect. After that, whether or not this is a good idea is something to be decided on a case by case basis!

You should have some ideas, now, about how a custom vector class can potentially improve performance and in my next post we'll see how this can also be desirable if you need a lot of control over buffer memory allocation.

Optimising your vector class is something of a low level optimisation and there are often higher level changes you can make to your program that make a lot more difference, and can even remove the need for this kind of low level optimisation.

In the case of PathEngine, the various low level vector optimisations I describe here are less significant than when we first switched because of other higher level optimisations since then that avoid or reduce costly vector operations in many cases where these were an issue, but our custom vector nevertheless still helps to improve SDK performance.

When measuring performance differences it's important to test vector operations in the actual target run-time context, and to include the effects of memory allocation and processor caching in real situations. If you think that a custom vector class could improve your application performance, then perhaps the best way to measure this could be to actually go ahead and plug in a custom vector implementation (perhaps based on the code shown above) and compare some high level benchmark timings..

** Comments: Please check the existing comment thread for this post before commenting. **