I'm a game programmer that has worked for both corporate and 3rd party companies, and contributed to hit titles including Transformers, Wall-E and Spider-man. Currently I'm doing an personal project (keeping scope small, and the passion alive) while I search for my next gig.
Posts by Paul Laska
  1. User Defined Build Command Macros in MSVS 2010... ( Counting comments... )
  2. Windows XP's Low Fragmentation Heaps... ( Counting comments... )
  3. Ready, Set, Allocate! (Part 6) ( Counting comments... )
  4. Ready, Set, Allocate! (Part 5) ( Counting comments... )
  5. Ready, Set, Allocate! (Part 4) ( Counting comments... )
  6. Ready, Set, Allocate! (Part 3) ( Counting comments... )
  7. Ready, Set, Allocate! (Part 2) ( Counting comments... )
  8. Ready, Set, Allocate! (Part 1) ( Counting comments... )
  9. Social Media Games On the Mind ( Counting comments... )
  10. Now Where Did I Allocate That?! ( Counting comments... )
Protected / Technology/ Code /

Links to other parts:

In this part I will begin to discuss the method and assumptions used to rewrite malloc.

 First comes the huge non-secret, secret.  Aligned memory is easier to allocate than un-aligned memory, because the addresses work nicely within this system.  Since it can be assumed that allocations will always be made along the alignment boundary, there is no requirement to divide and track sub-alignment size blocks, which saves time in operations and space in tracking. The downside is the allocations potentially waste space and fragment memory.  With some careful planning and usage of design patterns such as the Object Pool, the wasted space and fragmentation can be minimized.

One of the dirty little secrets about the tracking system mentioned in Part 1 is it only tracks whether a page of memory is in use or not, which doesn’t inform the free method about how much memory was previously allocated when malloc was called.  So all malloc calls should mark the memory being returned with some sort of header to denote how much memory was allocated, so the corresponding free call will have the information it needs to release the memory and update the tracking information.  In my code I use a simple header that tracks the size and the alignment.

typedef struct _TAllocationHeader
      size_t      uSize;
      u32         uAlignment;

I should mention a few typedefs I use: 

u8        - unsigned char
u16      - unsigned short
u32      - unsigned integer
l64       - long long

 Now, technically, a size_t, or 32-bit unsigned integer, for the uSize member is overkill for a 32MB memory heap (2^25 would suffice), but trying to go smaller isn’t possible with the current data types; the next data type down, a u16 only goes to 65,535.  Even if a data type existed between u16 and u32, it would remove the generalization to be able to easily switch to 512MB or larger.  The 32-bit unsigned integer can denote up to a 4GB allocation, which hopefully should be more than will be needed in one allocation; if not, adjust accordingly for your needs.

The u32, 32-bit unsigned integer, for the Alignment is also overkill, but instead of trying to conserve space using a u8, 8-bit unsigned char, the u32 is chosen because it keeps the address at the end of the header aligned within the 32-bit system.  Something to keep in mind when modifying the allocation header to suit your needs is that the size of the header should be a multiple of the alignment size (e.g. in this system, on a 32-bit platform, headers should be a multiple of 4-bytes (32-bits = 4 bytes)), this provides the benefit that memory addresses returned will be useable with operations that require memory alignment (such as those used in SIMD).

The malloc function signature looks like so:

void* my_malloc( size_t uSize, u32 uAlignment )

uSize is the amount of memory in bytes being requested, and uAlignment is the size in bytes to use for calculating address boundaries.  Calls to malloc can be redirected to use my_malloc by first ensuring malloc.h is not included, then providing a define that replaces malloc with my_malloc during the pre-processor phase of compilation, and swapping the malloc calls for the define. 

#define MYMALLOC( uSize ) my_malloc( uSize, 4 )

So here’s what malloc needs to do… 

  1. Add padding to compensate for alignment of the allocation.
  2. Align the header so the beginning address of the memory will be aligned.
  3. Add the size of the header to the allocation.
  4. Check the request to make sure it will fit in the memory being tracked.
  5. Find enough contiguous pages of memory to satisfy the allocation.
    1. Find a starting point of available pages within a tracking unit.
    2. Find whole tracking units where all pages are available and required in fulfilling the request.
    3. Find the remaining pages needed to fulfill the request.
  6. Return the memory, if enough available memory was found, otherwise return NULL.

 The first four steps feel mostly like housekeeping steps to me, so I’ve kept the explanations mostly to just a word description of what is done in code.  If more explanation is needed, please post questions.


1. Add padding to compensate for alignment of the allocation.

If the allocation request doesn’t match the alignment requirement, it’s simple enough to fix.  The additional space is calculated by taking the modulus of the size by the alignment, subtracting the remainder from the alignment, and adding the result back into the size.

uSize += (((uSize % uAlignment) > 0) ? uAlignment - (uSize % uAlignment) : 0);

2. Align the header so the beginning address of the memory will be aligned.

As an optimization, this step can be skipped, if you know that your allocation header will always take enough memory to leave the next available byte on an alignment boundary.  However, if someone calls my_malloc with an alignment size greater than the size of the allocation header, the result will be an unaligned address boundary.  The size in bytes to pad the allocation header can be obtained using a similar method to what was used to pad the allocation itself.  The final size of the allocation header is also saved so it can be used to calculate the return address.

u32 uAllocationHeaderPaddingSize = ((sizeof(TALLOCATION_HEADER) % uAlignment) > 0) ? uAlignment - sizeof(TALLOCATION_HEADER) % uAlignment : 0;
u32 uAllocationHeaderSize = sizeof(TALLOCATION_HEADER) + uAllocationHeaderPaddingSize; 

3.  Add the size of the header to the allocation.

Nothing big here, just keeping track of how much memory will be required.

uSize += uAllocationHeaderSize;

4. Check the request to make sure it will fit in the memory being tracked.

The number of pages being requested is calculated and saved.  First the number of whole pages is obtained by dividing the allocation size by the size of a page of memory.  Then any remaining memory that is required will fit into one page, so one page will be added if the allocation size modulus by the size of a page of memory is anything other than zero.

const u32 uNumPagesRequested = uSize / kMemPageSize + ((uSize % kMemPageSize) ? 1 : 0);

Next the number of tracking units requested is calculated and saved in similar fashion.  The number of whole tracking units is the number of pages requested divided by the number of pages per tracking unit.  Then any remaining pages will fit into one tracking unit, so one tracking unit will be added if the number of pages requested modulus by the number of pages per tracking unit is anything other than zero.

const u32 uNumTrackingUnitsRequested = uNumPagesRequested / kMemNumPagesPerUnit + ((uNumPagesRequested % kMemNumPagesPerUnit) ? 1 : 0);

Last the number of tracking units requested is compared to the total number of tracking units.  If there are more tracking units being requested than the total number of tracking units, the allocation failure is reported by returning NULL.

if( uNumTrackingUnitsRequested > kMemNumTrackingUnits )
      return NULL;

5. Find enough contiguous pages of memory to satisfy the allocation.

The search for pages is broken up into three stages, starting point, whole tracking units, and remaining pages.  The starting point looks for a tracking unit that either resolves the entire request or has pages available at the end and is contiguous to the next tracking unit.  The whole tracking units consist of 32 pages, at 4KB each, and by searching for available whole tracking units, the search is sped up by not having to do an individual search for each page internally.  The remaining pages stage aims to fulfill the remainder of the request by checking the tracking unit that is contiguous to either the tracking unit used for the starting point, or the last whole tracking unit, depending on which was last used.

In the next part the three stages of the search for pages will be covered.  To be continued…