Dual Issue and Software Pipelining on SPUs
Luke Hutchinson, engine programmer at Team Bondi

Last time around we looked at how instruction level parallelism can be achieved when you have an understanding of how your instructions get scheduled. But the example we used was simple enough that we could ignore an important feature of the PPU and SPU architectures: dual issue. Under the right conditions, both of these architectures can issue not one, but two instructions every clock cycle.

First up we'll look at the conditions for dual issue on an SPU, and then at optimizing loops using software pipelining.

On an SPU, instructions are split into two categories: pipeline 0 and pipeline 1 instructions. Appendix B.1.2 of the Cell Broadband Engine Programming Handbook (https://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/7A77CCDF14FE70D5852575CA0074E8ED) gives the definitive list of which pipeline each instruction belongs to, but the following simple rule of thumb covers most cases:

"Any arithmetic or logic instructions that work on word size or smaller elements are pipe 0. Any loads, stores, branches and instructions that move bits between words are pipe 1."

For example, XOR, FMA, CGT and SHLI are all pipeline 0, while LQD, SHUFB, BRNZ and ROTQBYI are all pipeline 1.

Dual issue occurs when:
(a) the program counter is 0 mod 8,
(b) the next instruction is a pipeline 0 instruction,
(c) the instruction after it is a pipeline 1 instruction, and
(d) the pipeline 1 instruction is not blocked by a register dependency.

So given the following code,

   100:   ai      $3, $3, 1
   104:   shufb   $4, $4, $4, $4
   108:   ai      $5, $5, 1
   10c:   ai      $6, $6, 1
   110:   a       $3, $3, $7
   114:   lnop
   118:   a       $5, $4, $5
   11c:   shufb   $6, $6, $6, $6

100 and 104 will dual issue.
108 and 10c will not, because both are pipe 0.
110 and 114 will dual issue, and so will 118 and 11c.

Note the LNOP at 114; NOP (a pipeline 0 instruction) and LNOP (pipeline 1) can be useful for aligning instructions for dual issue. We could also add LNOPs after the AIs at 108 and 10c; the code would still execute at the same speed, but it would be a pointless waste of precious local store.

Ok, now that we've covered the pretty simple rules for how dual issue is achieved on an SPU, it's time for the cool bit: really cranking up the speed of a loop with software pipelining. Don't confuse the name with the pipeline categories of the instructions; they are two separate things. But there is an important interaction between the two, as we shall soon see.

To cover this, we will use the following problem as an example. Let's say that our rendering pipeline is set up so that we store a compressed version of our vertex tangents in memory, and decompress them on the fly with an SPU just before rendering.

The compressed format we'll use for tangents is,

     0          10 11       20 21       30 31
    +-------------+-----------+-----------+---+
    |      X      |     Y     |     Z     | S |
    +-------------+-----------+-----------+---+

11 bits for the X component, 10 bits each for the Y and Z components, and a sign bit for the bitangent. This compressed tangent will be decompressed to four floats: the X, Y and Z components, and +/- 1 for the bitangent sign. In the vertex shader we'll reconstruct the vertex bitangent by taking the cross product of the normal and the tangent, and then use this sign to potentially flip its direction.
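The shader-side reconstruction isn't shown in this post, but in C++ terms it amounts to something like the following sketch (the real thing would live in the vertex shader, and all the names here are ours):

struct vec3 { float x, y, z; };

static vec3 cross( vec3 a, vec3 b ) {
  vec3 r = { a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x };
  return r;
}

// n is the vertex normal, t the decompressed tangent, and s the fourth
// decompressed float (+1.f or -1.f).
static vec3 reconstruct_bitangent( vec3 n, vec3 t, float s ) {
  vec3 b = cross( n, t );
  vec3 r = { b.x*s, b.y*s, b.z*s };
  return r;
}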

A simple to read, but horrifically slow, reference version of the decompression function is

reference.c++

#include <stdint.h>

void reference( void *__restrict__ out, const void *__restrict__ in,
 uint32_t count, uint32_t stride ) {
  float *p_out = (float*)out;
  const uint32_t *p_in = (uint32_t*)in;
  for ( unsigned i=0; i<count; ++i ) {
    uint32_t u32 = *p_in;
    p_in = (uint32_t*)((uintptr_t)p_in+stride);
    p_out[0] = ((float)((u32>>21)&0x7ff)/0x7ff)*2.f - 1.f;
    p_out[1] = ((float)((u32>>11)&0x3ff)/0x3ff)*2.f - 1.f;
    p_out[2] = ((float)((u32>>1) &0x3ff)/0x3ff)*2.f - 1.f;
    p_out[3] = (u32&1) ? -1.f : 1.f;
    p_out += 4;
  }
}

Now let's look at a fairly decent implementation written in intrinsics.

intrinsics.c++

#include <spu_intrinsics.h>
#include <stdint.h>

void intrinsics( void *__restrict__ out, const void *__restrict__ in,
 uint32_t count, uint32_t stride ) {

  qword p_ina = si_from_ptr( in );
  qword p_out = si_from_ptr( out );

  // Loop will be doing four at a time, so round up the input count.  This means
  // that the caller must guarantee there is enough space in the output buffer so
  // that we won't be corrupting anything important.
  qword cnt = si_from_uint( count );
  cnt = si_andi( si_ai( cnt, 3 ), -4 );

  qword strd = si_from_uint( stride );
  qword p_inb = si_a( p_ina, strd );
  qword p_inc = si_a( p_inb, strd );
  qword p_ind = si_a( p_inc, strd );
  qword strd4 = si_shli( strd, 2 );
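  // 0x3f800000 is the IEEE-754 bit pattern of 1.0f, and 0xbf800000 that of
  // -1.0f, so these two ILHUs build {1,1,1,1} and {-1,-1,-1,-1} without
  // loading from memory.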
  qword onef = si_ilhu( 0x3f80 );
  qword neg_onef = si_ilhu( 0xbf80 );

  // Here we are generating {0x10111213,0x14151617,0x18191a1b,0x1c1d1e1f}.  We
  // can save a little code space (12 bytes) by not loading it from memory.
  // Instead we take advantage of the ABI specification that the stack pointer
  // is always 16-byte aligned; this means CWD reg, 0($1) always returns the
  // constant {0x00010203,0x14151617,0x18191a1b,0x1c1d1e1f}, and ORing 0x10
  // into every byte then gives the pattern we want.  Don't use GCC's
  //    register qword sp __asm__("1");
  // syntax, as on rare occasions, this is buggy and you get the wrong register,
  // potentially this only occurs inside of C++ templates though.
  qword shufabcd; __asm__ ( "cwd %0, 0($1)" : "=r"(shufabcd) );
  shufabcd = si_orbi( shufabcd, 16 );

  // Every byte of second_qword is 0x10, the bit that flips a SHUFB control
  // byte from selecting operand a to selecting operand b.
  qword second_qword = si_ilh( 0x1010 );
  static const qword x_scale = (qword)(vec_float4){
    2.f/0x7ff, 2.f/0x7ff, 2.f/0x7ff, 2.f/0x7ff };
  static const qword yz_scale = (qword)(vec_float4){
    2.f/0x3ff, 2.f/0x3ff, 2.f/0x3ff, 2.f/0x3ff };
  static const qword shufAaBb = {
    0x00, 0x01, 0x02, 0x03,   0x10, 0x11, 0x12, 0x13,
    0x04, 0x05, 0x06, 0x07,   0x14, 0x15, 0x16, 0x17 };
  qword shufCcDd = si_orbi( shufAaBb, 8 );
  qword x3ff = si_il( 0x3ff );

  do {
    cnt = si_ai( cnt, -4 );

    // For each of the four pointers, load the two potential qwords it can
    // overlap
    qword qa0 = si_lqd( p_ina, 0  );
    qword qa1 = si_lqd( p_ina, 16 );
    qword qb0 = si_lqd( p_inb, 0  );
    qword qb1 = si_lqd( p_inb, 16 );
    qword qc0 = si_lqd( p_inc, 0  );
    qword qc1 = si_lqd( p_inc, 16 );
    qword qd0 = si_lqd( p_ind, 0  );
    qword qd1 = si_lqd( p_ind, 16 );

    // Generate SHUFB controls that will move the compressed tangents into the
    // preferred word of a register
    qword shuf_a = si_rotqby( shufabcd, p_ina );
    qword shuf_b = si_rotqby( shufabcd, p_inb );
    qword shuf_c = si_rotqby( shufabcd, p_inc );
    qword shuf_d = si_rotqby( shufabcd, p_ind );
    shuf_a = si_andc( shuf_a, si_shlqby( second_qword, si_andi( p_ina, 15 ) ) );
    shuf_b = si_andc( shuf_b, si_shlqby( second_qword, si_andi( p_inb, 15 ) ) );
    shuf_c = si_andc( shuf_c, si_shlqby( second_qword, si_andi( p_inc, 15 ) ) );
    shuf_d = si_andc( shuf_d, si_shlqby( second_qword, si_andi( p_ind, 15 ) ) );

    p_ina = si_a( p_ina, strd4 );
    p_inb = si_a( p_inb, strd4 );
    p_inc = si_a( p_inc, strd4 );
    p_ind = si_a( p_ind, strd4 );

    // Collect each of the four compressed tangents into a single qword
    qword u32a = si_shufb( qa0, qa1, shuf_a );
    qword u32b = si_shufb( qb0, qb1, shuf_b );
    qword u32c = si_shufb( qc0, qc1, shuf_c );
    qword u32d = si_shufb( qd0, qd1, shuf_d );
    qword u32ac = si_shufb( u32a, u32c, shufAaBb );
    qword u32bd = si_shufb( u32b, u32d, shufAaBb );
    qword u32 = si_shufb( u32ac, u32bd, shufAaBb );

    // Convert the compressed tangents to the output X, Y, Z components and
    // bitangent signs
    qword x = si_rotmi( u32, -21 );
    qword y = si_and( si_rotmi( u32, -11 ), x3ff );
    qword z = si_and( si_rotmi( u32,  -1 ), x3ff );
    qword s = si_or( si_shli( u32, 31 ), onef );
    x = si_cuflt( x, 0 );
    y = si_cuflt( y, 0 );
    z = si_cuflt( z, 0 );
    x = si_fma( x, x_scale,  neg_onef );
    y = si_fma( y, yz_scale, neg_onef );
    z = si_fma( z, yz_scale, neg_onef );

    // Transpose and store the output values
    qword xazaxbzb = si_shufb( x, z, shufAaBb );
    qword xczcxdzd = si_shufb( x, z, shufCcDd );
    qword yasaybsb = si_shufb( y, s, shufAaBb );
    qword ycscydsd = si_shufb( y, s, shufCcDd );
    qword outa = si_shufb( xazaxbzb, yasaybsb, shufAaBb );
    qword outb = si_shufb( xazaxbzb, yasaybsb, shufCcDd );
    qword outc = si_shufb( xczcxdzd, ycscydsd, shufAaBb );
    qword outd = si_shufb( xczcxdzd, ycscydsd, shufCcDd );
    si_stqd( outa, p_out, 0x00 );
    si_stqd( outb, p_out, 0x10 );
    si_stqd( outc, p_out, 0x20 );
    si_stqd( outd, p_out, 0x30 );
    p_out = si_ai( p_out, 0x40 );

  } while( __builtin_expect( si_to_int( cnt ), 1 ) );
}

This intrinsics version is a smidgen more than 6 times as fast as the reference version; often we'd say that's good enough and move on to another problem. But it most certainly can still be made faster. We'll use this intrinsics version as the basis for an assembler version that we'll optimize further by pipelining the loop. We still haven't explained what "pipelining the loop" means, but don't worry, we're nearly there.

First up, let's translate this function into assembly code.

assembler_0.s

        .section    .text.assembler, "ax", @progbits

        .global     assembler
        .type       assembler, @function

        # $3    void *__restrict__ out
        # $4    const void *__restrict__ in
        # $5    uint32_t count
        # $6    uint32_t stride
        .set        out,            3
        .set        in,             4
        .set        count,          5
        .set        stride,         6
assembler:

        .set        p_ina,          in
        .set        p_inb,          7
        .set        p_inc,          8
        .set        p_ind,          9
        .set        p_ina_tmp,      10
        .set        p_inb_tmp,      11
        .set        p_inc_tmp,      12
        .set        p_ind_tmp,      13

        .set        strd4,          14
        .set        onef,           15
        .set        neg_onef,       16
        .set        shufabcd,       17
        .set        second_qword,   18
        .set        x_scale,        19
        .set        yz_scale,       20
        .set        shufAaBb,       21
        .set        shufCcDd,       22
        .set        x3ff,           23

        .set        qa0,            24
        .set        qa1,            25
        .set        qb0,            26
        .set        qb1,            27
        .set        qc0,            28
        .set        qc1,            29
        .set        qd0,            30
        .set        qd1,            31
        .set        shuf_a,         32
        .set        shuf_b,         33
        .set        shuf_c,         34
        .set        shuf_d,         35
        .set        u32a,           36
        .set        u32b,           37
        .set        u32c,           38
        .set        u32d,           39
        .set        u32ac,          40
        .set        u32bd,          41
        .set        u32,            42
        .set        x,              43
        .set        y,              44
        .set        z,              45
        .set        s,              46
        .set        xazaxbzb,       47
        .set        xczcxdzd,       48
        .set        yasaybsb,       49
        .set        ycscydsd,       50
        .set        outa,           51
        .set        outb,           52
        .set        outc,           53
        .set        outd,           54

        ai          count, count, 3
        andi        count, count, -4

        shli        strd4, stride, 2
        ilhu        onef, 0x3f80
        ilhu        neg_onef, 0xbf80
        cwd         shufabcd, 0 ( $1 )
        orbi        shufabcd, shufabcd, 16
        ilh         second_qword, 0x1010
        lqr         x_scale, _x_scale
        lqr         yz_scale, _yz_scale
        lqr         shufAaBb, _shufAaBb
        orbi        shufCcDd, shufAaBb, 8
        il          x3ff, 0x3ff

        a           p_inb, p_ina, stride
        a           p_inc, p_inb, stride
        a           p_ind, p_inc, stride

    loop:
        ai          count, count, -4
        lqd         qa0, 0  ( p_ina )
        lqd         qa1, 16 ( p_ina )
        lqd         qb0, 0  ( p_inb )
        lqd         qb1, 16 ( p_inb )
        lqd         qc0, 0  ( p_inc )
        lqd         qc1, 16 ( p_inc )
        lqd         qd0, 0  ( p_ind )
        lqd         qd1, 16 ( p_ind )
        rotqby      shuf_a, shufabcd, p_ina
        rotqby      shuf_b, shufabcd, p_inb
        rotqby      shuf_c, shufabcd, p_inc
        rotqby      shuf_d, shufabcd, p_ind
        andi        p_ina_tmp, p_ina, 15
        andi        p_inb_tmp, p_inb, 15
        andi        p_inc_tmp, p_inc, 15
        andi        p_ind_tmp, p_ind, 15
        shlqby      p_ina_tmp, second_qword, p_ina_tmp
        shlqby      p_inb_tmp, second_qword, p_inb_tmp
        shlqby      p_inc_tmp, second_qword, p_inc_tmp
        shlqby      p_ind_tmp, second_qword, p_ind_tmp
        andc        shuf_a, shuf_a, p_ina_tmp
        andc        shuf_b, shuf_b, p_inb_tmp
        andc        shuf_c, shuf_c, p_inc_tmp
        andc        shuf_d, shuf_d, p_ind_tmp
        a           p_ina, p_ina, strd4
        a           p_inb, p_inb, strd4
        a           p_inc, p_inc, strd4
        a           p_ind, p_ind, strd4
        shufb       u32a, qa0, qa1, shuf_a
        shufb       u32b, qb0, qb1, shuf_b
        shufb       u32c, qc0, qc1, shuf_c
        shufb       u32d, qd0, qd1, shuf_d
        shufb       u32ac, u32a, u32c, shufAaBb
        shufb       u32bd, u32b, u32d, shufAaBb
        shufb       u32, u32ac, u32bd, shufAaBb
        rotmi       x, u32, -21
        rotmi       y, u32, -11
        rotmi       z, u32, -1
        shli        s, u32, 31
        and         y, y, x3ff
        and         z, z, x3ff
        or          s, s, onef
        cuflt       x, x, 0
        cuflt       y, y, 0
        cuflt       z, z, 0
        fma         x, x, x_scale,  neg_onef
        fma         y, y, yz_scale, neg_onef
        fma         z, z, yz_scale, neg_onef
        shufb       xazaxbzb, x, z, shufAaBb
        shufb       xczcxdzd, x, z, shufCcDd
        shufb       yasaybsb, y, s, shufAaBb
        shufb       ycscydsd, y, s, shufCcDd
        shufb       outa, xazaxbzb, yasaybsb, shufAaBb
        shufb       outb, xazaxbzb, yasaybsb, shufCcDd
        shufb       outc, xczcxdzd, ycscydsd, shufAaBb
        shufb       outd, xczcxdzd, ycscydsd, shufCcDd
        stqd        outa, 0x00 ( out )
        stqd        outb, 0x10 ( out )
        stqd        outc, 0x20 ( out )
        stqd        outd, 0x30 ( out )
        ai          out, out, 0x40
        brnz        count, loop

        bi          $0

        .size       assembler, .-assembler


        .section    .rodata.assembler, "a", @progbits
        .align      4
_x_scale:           .float  0.0009770396, 0.0009770396, 0.0009770396, 0.0009770396
_yz_scale:          .float  0.0019550342, 0.0019550342, 0.0019550342, 0.0019550342
_shufAaBb:          .long   0x00010203,   0x10111213,   0x04050607,   0x14151617

The translation was a trivial, straightforward one, but it sucks. GCC scheduled the intrinsics version much better: this initial assembly version is 1.4 times as expensive as the intrinsics one. So, using our new-found understanding of instruction latency and dual issue rules, let's schedule this code better. For brevity, we'll just show the instructions, not all the .sets and .rodata, as they stay the same.

abbreviated assembler_1.s

        .align      3
assembler:

        ai          count, count, 3 ;               lqr         x_scale, _x_scale
        andi        count, count, -4 ;              lqr         yz_scale, _yz_scale
        shli        strd4, stride, 2 ;              lqr         shufAaBb, _shufAaBb
        ilhu        onef, 0x3f80 ;                  cwd         shufabcd, 0 ( $1 )
        ilhu        neg_onef, 0xbf80 ;              hbrr        loop_branch, loop
        ilh         second_qword, 0x1010 ;          /*lnop*/
        il          x3ff, 0x3ff ;                   /*lnop*/
        orbi        shufabcd, shufabcd, 16 ;        /*lnop*/
        a           p_inb, p_ina, stride ;          /*lnop*/
        a           p_inc, p_inb, stride ;          /*lnop*/
        a           p_ind, p_inc, stride ;          /*lnop*/
        orbi        shufCcDd, shufAaBb, 8 ;         lnop

    loop:
        # pipeline 0                                # pipeline 1                                      # cycle
        ai          count, count, -4 ;              lqd         qa0, 0  ( p_ina )                     #   0
        andi        p_ina_tmp, p_ina, 15 ;          lqd         qa1, 16 ( p_ina )                     #   1
        andi        p_inb_tmp, p_inb, 15 ;          lqd         qb0, 0  ( p_inb )                     #   2
        andi        p_inc_tmp, p_inc, 15 ;          lqd         qb1, 16 ( p_inb )                     #   3
        andi        p_ind_tmp, p_ind, 15 ;          lqd         qc0, 0  ( p_inc )                     #   4
        /*nop*/ ;                                   lqd         qc1, 16 ( p_inc )                     #   5
        /*nop*/ ;                                   lqd         qd0, 0  ( p_ind )                     #   6
        /*nop*/ ;                                   lqd         qd1, 16 ( p_ind )                     #   7
        /*nop*/ ;                                   rotqby      shuf_a, shufabcd, p_ina               #   8
        a           p_ina, p_ina, strd4 ;           rotqby      shuf_b, shufabcd, p_inb               #   9
        a           p_inb, p_inb, strd4 ;           rotqby      shuf_c, shufabcd, p_inc               #  10
        a           p_inc, p_inc, strd4 ;           rotqby      shuf_d, shufabcd, p_ind               #  11
        a           p_ind, p_ind, strd4 ;           shlqby      p_ina_tmp, second_qword, p_ina_tmp    #  12
        /*nop*/ ;                                   shlqby      p_inc_tmp, second_qword, p_inc_tmp    #  13
        /*nop*/ ;                                   shlqby      p_inb_tmp, second_qword, p_inb_tmp    #  14
        nop ;                                       shlqby      p_ind_tmp, second_qword, p_ind_tmp    #  15
        andc        shuf_a, shuf_a, p_ina_tmp ;     /*lnop*/                                          #  16
        andc        shuf_c, shuf_c, p_inc_tmp ;     /*lnop*/                                          #  17
        andc        shuf_b, shuf_b, p_inb_tmp ;     shufb       u32a, qa0, qa1, shuf_a                #  18
        andc        shuf_d, shuf_d, p_ind_tmp ;     shufb       u32c, qc0, qc1, shuf_c                #  19
        /*nop*/ ;                                   shufb       u32b, qb0, qb1, shuf_b                #  20
        /*nop*/ ;                                   shufb       u32d, qd0, qd1, shuf_d                #  21
        /*nop*/ ;                                   /*lnop*/                                          #  22
        /*nop*/ ;                                   shufb       u32ac, u32a, u32c, shufAaBb           #  23
        /*nop*/ ;                                   /*lnop*/                                          #  24
        /*nop*/ ;                                   shufb       u32bd, u32b, u32d, shufAaBb           #  25
        /*nop*/ ;                                   /*lnop*/                                          #  26
        /*nop*/ ;                                   /*lnop*/                                          #  27
        /*nop*/ ;                                   /*lnop*/                                          #  28
        nop ;                                       shufb       u32, u32ac, u32bd, shufAaBb           #  29
        /*nop*/ ;                                   /*lnop*/                                          #  30
        /*nop*/ ;                                   /*lnop*/                                          #  31
        /*nop*/ ;                                   /*lnop*/                                          #  32
        rotmi       z, u32, -1 ;                    /*lnop*/                                          #  33
        rotmi       y, u32, -11 ;                   /*lnop*/                                          #  34
        rotmi       x, u32, -21 ;                   /*lnop*/                                          #  35
        shli        s, u32, 31 ;                    /*lnop*/                                          #  36
        and         z, z, x3ff ;                    /*lnop*/                                          #  37
        and         y, y, x3ff ;                    /*lnop*/                                          #  38
        cuflt       z, z, 0 ;                       /*lnop*/                                          #  39
        cuflt       x, x, 0 ;                       /*lnop*/                                          #  40
        cuflt       y, y, 0 ;                       /*lnop*/                                          #  41
        or          s, s, onef ;                    /*lnop*/                                          #  42
        /*nop*/ ;                                   /*lnop*/                                          #  43
        /*nop*/ ;                                   /*lnop*/                                          #  44
        /*nop*/ ;                                   /*lnop*/                                          #  45
        fma         z, z, yz_scale, neg_onef ;      /*lnop*/                                          #  46
        fma         x, x, x_scale,  neg_onef ;      /*lnop*/                                          #  47
        fma         y, y, yz_scale, neg_onef ;      /*lnop*/                                          #  48
        /*nop*/ ;                                   /*lnop*/                                          #  49
        /*nop*/ ;                                   /*lnop*/                                          #  50
        /*nop*/ ;                                   /*lnop*/                                          #  51
        /*nop*/ ;                                   shufb       xazaxbzb, x, z, shufAaBb              #  52
        /*nop*/ ;                                   shufb       yasaybsb, y, s, shufAaBb              #  53
        /*nop*/ ;                                   shufb       xczcxdzd, x, z, shufCcDd              #  54
        /*nop*/ ;                                   shufb       ycscydsd, y, s, shufCcDd              #  55
        /*nop*/ ;                                   /*lnop*/                                          #  56
        /*nop*/ ;                                   shufb       outa, xazaxbzb, yasaybsb, shufAaBb    #  57
        /*nop*/ ;                                   shufb       outb, xazaxbzb, yasaybsb, shufCcDd    #  58
        /*nop*/ ;                                   shufb       outc, xczcxdzd, ycscydsd, shufAaBb    #  59
        /*nop*/ ;                                   shufb       outd, xczcxdzd, ycscydsd, shufCcDd    #  60
        /*nop*/ ;                                   stqd        outa, 0x00 ( out )                    #  61
        /*nop*/ ;                                   stqd        outb, 0x10 ( out )                    #  62
        /*nop*/ ;                                   stqd        outc, 0x20 ( out )                    #  63
        nop ;                                       stqd        outd, 0x30 ( out )                    #  64
        ai          out, out, 0x40 ; loop_branch:   brnz        count, loop                           #  65

        bi          $0

        .size       assembler, .-assembler

This version is now fractionally faster than the intrinsics version, but pretty much the same. And look how many nops there are in there; this code is nowhere near fully utilizing the SPU. Ideally we'd like to be issuing two instructions every cycle. The 66-cycle loop has 132 issue slots, but only 63 instructions that do something useful against 69 nops (mostly commented out), so it is achieving less than 50% utilization of the SPU.

Finally, it's time to look at software pipelining. The idea is to make use of the spare issue slots by executing several iterations of the loop at the same time. But this is not unrolling the loop; instead, the contents of the loop are wrapped around on themselves. So let's see how it can be done.

The loop has 27 pipe 0 instructions, and 36 pipe 1 (not including any nops). That means the pipe 1 instructions are the limiting factor: only one pipe 1 instruction can issue per cycle, so the loop cannot execute in fewer than 36 cycles. First we split the loop after cycle 34, and reschedule the remaining instructions back into the start of the loop.

The ROTMI from cycle 35 can be slotted into cycle 5, the SHLI from cycle 36 into cycle 6, and so on. Now, while the SPU is executing one iteration of the loop, these instructions are working one iteration behind most of the others. Not every instruction could be scheduled in on the first wrap-around; several remained, so we wrapped them around again. The end result is three iterations of the loop pipelined inside each other.

We are going to need some prologue code before entering the loop, as otherwise the first two iterations would be using undefined values. We simply take two copies of the loop body, and remove the instructions that were pipelined in from a previous iteration. Then there is the question of an epilogue. Because all the stores were scheduled in the same iteration of the loop, we have an option: one possibility is to create an epilogue the same way we did the prologue, the other is to just execute the loop two more times. We chose to do without an epilogue to save some code space. A scalar caricature of the whole transformation is sketched below.
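For a feel of the shape of a pipelined loop without any of the SPU details, here is a minimal standard C++ sketch (entirely our own, assuming n >= 2): the store runs two iterations behind the load, with an explicit prologue, and an epilogue that drains the last two iterations.

// Computes dst[i] = 2*src[i] with the loop body split into three stages,
// each stage working on a different iteration of the original loop.
void pipelined( float *dst, const float *src, int n ) {
  float loaded   = src[1];        // prologue: the load for iteration 1
  float computed = src[0] * 2.f;  // prologue: the compute for iteration 0
  for ( int i = 2; i < n; ++i ) {
    dst[i - 2] = computed;        // store for iteration i-2
    computed   = loaded * 2.f;    // compute for iteration i-1
    loaded     = src[i];          // load for iteration i
  }
  dst[n - 2] = computed;          // epilogue: drain the two in-flight iterations
  dst[n - 1] = loaded * 2.f;
}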

Unfortunately it isn't always as "easy" as this. When wrapping the loop around like this, we need to be very careful that the resulting loop actually still does the same thing. In particular, each wrapped-around instruction must be placed before any other instruction that modifies a register it reads. We'll see an example of this problem soon.

Below is the optimized function with pipelining (again abbreviated; we haven't copied all the .set pseudo-ops).

abbreviated assembler_2.s

        .align      3
assembler:

        ai          count, count, 3 ;               lqr         x_scale, _x_scale
        andi        count, count, -4 ;              lqr         yz_scale, _yz_scale
        shli        strd4, stride, 2 ;              lqr         shufAaBb, _shufAaBb
        ilhu        onef, 0x3f80 ;                  cwd         shufabcd, 0 ( $1 )
        ilhu        neg_onef, 0xbf80 ;              hbrr        loop_branch, loop
        ilh         second_qword, 0x1010 ;          /*lnop*/
        il          x3ff, 0x3ff ;                   /*lnop*/
        orbi        shufabcd, shufabcd, 16 ;        /*lnop*/
        a           p_inb, p_ina, stride ;          /*lnop*/
        a           p_inc, p_inb, stride ;          /*lnop*/
        a           p_ind, p_inc, stride ;          /*lnop*/
        orbi        shufCcDd, shufAaBb, 8 ;         /*lnop*/

        /*nop*/ ;                                   lqd         qa0, 0  ( p_ina )
        andi        p_ina_tmp, p_ina, 15 ;          lqd         qa1, 16 ( p_ina )
        andi        p_inb_tmp, p_inb, 15 ;          lqd         qb0, 0  ( p_inb )
        andi        p_inc_tmp, p_inc, 15 ;          lqd         qb1, 16 ( p_inb )
        andi        p_ind_tmp, p_ind, 15 ;          lqd         qc0, 0  ( p_inc )
        /*nop*/ ;                                   lqd         qc1, 16 ( p_inc )
        /*nop*/ ;                                   lqd         qd0, 0  ( p_ind )
        /*nop*/ ;                                   lqd         qd1, 16 ( p_ind )
        /*nop*/ ;                                   rotqby      shuf_a, shufabcd, p_ina
        a           p_ina, p_ina, strd4 ;           rotqby      shuf_b, shufabcd, p_inb
        a           p_inb, p_inb, strd4 ;           rotqby      shuf_c, shufabcd, p_inc
        a           p_inc, p_inc, strd4 ;           rotqby      shuf_d, shufabcd, p_ind
        a           p_ind, p_ind, strd4 ;           shlqby      p_ina_tmp, second_qword, p_ina_tmp
        /*nop*/ ;                                   shlqby      p_inc_tmp, second_qword, p_inc_tmp
        /*nop*/ ;                                   shlqby      p_inb_tmp, second_qword, p_inb_tmp
        nop ;                                       shlqby      p_ind_tmp, second_qword, p_ind_tmp
        andc        shuf_a, shuf_a, p_ina_tmp ;     /*lnop*/
        andc        shuf_c, shuf_c, p_inc_tmp ;     /*lnop*/
        andc        shuf_b, shuf_b, p_inb_tmp ;     shufb       u32a, qa0, qa1, shuf_a
        andc        shuf_d, shuf_d, p_ind_tmp ;     shufb       u32c, qc0, qc1, shuf_c
        /*nop*/ ;                                   shufb       u32b, qb0, qb1, shuf_b
        /*nop*/ ;                                   shufb       u32d, qd0, qd1, shuf_d
        /*nop*/ ;                                   /*lnop*/
        /*nop*/ ;                                   shufb       u32ac, u32a, u32c, shufAaBb
        /*nop*/ ;                                   /*lnop*/
        /*nop*/ ;                                   shufb       u32bd, u32b, u32d, shufAaBb
        /*nop*/ ;                                   /*lnop*/
        /*nop*/ ;                                   /*lnop*/
        /*nop*/ ;                                   /*lnop*/
        nop ;                                       shufb       u32, u32ac, u32bd, shufAaBb
        /*nop*/ ;                                   /*lnop*/
        /*nop*/ ;                                   /*lnop*/
        /*nop*/ ;                                   /*lnop*/
        rotmi       z, u32, -1 ;                    lnop
        rotmi       y, u32, -11 ;                   /*lnop*/

        /*nop*/ ;                                   lqd         qa0, 0  ( p_ina )
        andi        p_ina_tmp, p_ina, 15 ;          lqd         qa1, 16 ( p_ina )
        andi        p_inb_tmp, p_inb, 15 ;          lqd         qb0, 0  ( p_inb )
        andi        p_inc_tmp, p_inc, 15 ;          lqd         qb1, 16 ( p_inb )
        andi        p_ind_tmp, p_ind, 15 ;          lqd         qc0, 0  ( p_inc )
  /*2*/ rotmi       x, u32, -21 ;                   lqd         qc1, 16 ( p_inc )
  /*2*/ shli        s, u32, 31 ;                    lqd         qd0, 0  ( p_ind )
  /*2*/ and         z, z, x3ff ;                    lqd         qd1, 16 ( p_ind )
  /*2*/ and         y, y, x3ff ;                    rotqby      shuf_a, shufabcd, p_ina
        a           p_ina, p_ina, strd4 ;           rotqby      shuf_b, shufabcd, p_inb
        a           p_inb, p_inb, strd4 ;           rotqby      shuf_c, shufabcd, p_inc
        a           p_inc, p_inc, strd4 ;           rotqby      shuf_d, shufabcd, p_ind
        a           p_ind, p_ind, strd4 ;           shlqby      p_ina_tmp, second_qword, p_ina_tmp
  /*2*/ cuflt       z, z, 0 ;                       shlqby      p_inc_tmp, second_qword, p_inc_tmp
  /*2*/ cuflt       x, x, 0 ;                       shlqby      p_inb_tmp, second_qword, p_inb_tmp
  /*2*/ cuflt       y, y, 0 ;                       shlqby      p_ind_tmp, second_qword, p_ind_tmp
        andc        shuf_a, shuf_a, p_ina_tmp ;     /*lnop*/
        andc        shuf_c, shuf_c, p_inc_tmp ;     /*lnop*/
        andc        shuf_b, shuf_b, p_inb_tmp ;     shufb       u32a, qa0, qa1, shuf_a
        andc        shuf_d, shuf_d, p_ind_tmp ;     shufb       u32c, qc0, qc1, shuf_c
  /*2*/ or          s, s, onef ;                    shufb       u32b, qb0, qb1, shuf_b
  /*2*/ fma         z, z, yz_scale, neg_onef ;      shufb       u32d, qd0, qd1, shuf_d
  /*2*/ fma         x, x, x_scale,  neg_onef ;      lnop
  /*2*/ fma         y, y, yz_scale, neg_onef ;      shufb       u32ac, u32a, u32c, shufAaBb
        /*nop*/ ;                                   /*lnop*/
        /*nop*/ ;                                   shufb       u32bd, u32b, u32d, shufAaBb
        /*nop*/ ;                                   /*lnop*/
        /*nop*/ ;                                   /*lnop*/
        /*nop*/ ;                                   shufb       xazaxbzb, x, z, shufAaBb            /*2*/
        /*nop*/ ;                                   shufb       u32, u32ac, u32bd, shufAaBb
        /*nop*/ ;                                   shufb       yasaybsb, y, s, shufAaBb            /*2*/
        /*nop*/ ;                                   shufb       xczcxdzd, x, z, shufCcDd            /*2*/
        /*nop*/ ;                                   shufb       ycscydsd, y, s, shufCcDd            /*2*/
        rotmi       z, u32, -1 ;                    lnop
        rotmi       y, u32, -11 ;                   shufb       outa, xazaxbzb, yasaybsb, shufAaBb  /*2*/

    loop:
        # pipeline 0                                # pipeline 1                                            # cycle
        ai          count, count, -4 ;              lqd         qa0, 0  ( p_ina )                           #   0
        andi        p_ina_tmp, p_ina, 15 ;          lqd         qa1, 16 ( p_ina )                           #   1
        andi        p_inb_tmp, p_inb, 15 ;          lqd         qb0, 0  ( p_inb )                           #   2
        andi        p_inc_tmp, p_inc, 15 ;          lqd         qb1, 16 ( p_inb )                           #   3
        andi        p_ind_tmp, p_ind, 15 ;          lqd         qc0, 0  ( p_inc )                           #   4
  /*2*/ rotmi       x, u32, -21 ;                   lqd         qc1, 16 ( p_inc )                           #   5
  /*2*/ shli        s, u32, 31 ;                    lqd         qd0, 0  ( p_ind )                           #   6
  /*2*/ and         z, z, x3ff ;                    lqd         qd1, 16 ( p_ind )                           #   7
  /*2*/ and         y, y, x3ff ;                    rotqby      shuf_a, shufabcd, p_ina                     #   8
        a           p_ina, p_ina, strd4 ;           rotqby      shuf_b, shufabcd, p_inb                     #   9
        a           p_inb, p_inb, strd4 ;           rotqby      shuf_c, shufabcd, p_inc                     #  10
        a           p_inc, p_inc, strd4 ;           rotqby      shuf_d, shufabcd, p_ind                     #  11
        a           p_ind, p_ind, strd4 ;           shlqby      p_ina_tmp, second_qword, p_ina_tmp          #  12
  /*2*/ cuflt       z, z, 0 ;                       shlqby      p_inc_tmp, second_qword, p_inc_tmp          #  13
  /*2*/ cuflt       x, x, 0 ;                       shlqby      p_inb_tmp, second_qword, p_inb_tmp          #  14
  /*2*/ cuflt       y, y, 0 ;                       shlqby      p_ind_tmp, second_qword, p_ind_tmp          #  15
        andc        shuf_a, shuf_a, p_ina_tmp ;     shufb       outb, xazaxbzb, yasaybsb, shufCcDd  /*3*/   #  16
        andc        shuf_c, shuf_c, p_inc_tmp ;     shufb       outc, xczcxdzd, ycscydsd, shufAaBb  /*3*/   #  17
        andc        shuf_b, shuf_b, p_inb_tmp ;     shufb       u32a, qa0, qa1, shuf_a                      #  18
        andc        shuf_d, shuf_d, p_ind_tmp ;     shufb       u32c, qc0, qc1, shuf_c                      #  19
  /*2*/ or          s, s, onef ;                    shufb       u32b, qb0, qb1, shuf_b                      #  20
  /*2*/ fma         z, z, yz_scale, neg_onef ;      shufb       u32d, qd0, qd1, shuf_d                      #  21
  /*2*/ fma         x, x, x_scale,  neg_onef ;      shufb       outd, xczcxdzd, ycscydsd, shufCcDd  /*3*/   #  22
  /*2*/ fma         y, y, yz_scale, neg_onef ;      shufb       u32ac, u32a, u32c, shufAaBb                 #  23
        /*nop*/ ;                                   stqd        outa, 0x00 ( out )                  /*3*/   #  24
        /*nop*/ ;                                   shufb       u32bd, u32b, u32d, shufAaBb                 #  25
        /*nop*/ ;                                   stqd        outb, 0x10 ( out )                  /*3*/   #  26
        /*nop*/ ;                                   stqd        outc, 0x20 ( out )                  /*3*/   #  27
        /*nop*/ ;                                   shufb       xazaxbzb, x, z, shufAaBb            /*2*/   #  28
        /*nop*/ ;                                   shufb       u32, u32ac, u32bd, shufAaBb                 #  29
        /*nop*/ ;                                   shufb       yasaybsb, y, s, shufAaBb            /*2*/   #  30
        /*nop*/ ;                                   shufb       xczcxdzd, x, z, shufCcDd            /*2*/   #  31
        nop ;                                       shufb       ycscydsd, y, s, shufCcDd            /*2*/   #  32
        rotmi       z, u32, -1 ;                    stqd        outd, 0x30 ( out )                  /*3*/   #  33
        rotmi       y, u32, -11 ;                   shufb       outa, xazaxbzb, yasaybsb, shufAaBb  /*2*/   #  34
  /*3*/ ai          out, out, 0x40 ; loop_branch:   brnz        count, loop                                 #  35

        bi          $0

        .size       assembler, .-assembler

Nice, 66 cycles down to 36. We are now issuing a pipe 1 instruction every single cycle. But what if we could trade some pipe 1 instructions for equivalent pipe 0 instructions? This is where some quite interesting, and somewhat counter-intuitive, optimizations come into play.

Part of the code used for generating the SHUFB controls,

        andi        p_ina_tmp, p_ina, 15
        shlqby      p_ina_tmp, second_qword, p_ina_tmp

can be replaced with

        cgtb        p_ina_tmp, cmp_addr_mod16, p_ina_mod16b
        andbi       p_ina_tmp, p_ina_tmp, 16
        a           p_ina_mod16b, p_ina_mod16b, strd4_mod16b
        andbi       p_ina_mod16b, p_ina_mod16b, 15

where cmp_addr_mod16 is preloaded with {0x100f0e0d,0x0c0b0a09,0x08070605,0x04030201}, p_ina_mod16b contains p_ina%16 in every byte, and strd4_mod16b contains (stride*4)%16 in every byte. I say "somewhat counter-intuitive", as we have locally increased the latency from 6 to 8 cycles.
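If the equivalence of the two recipes isn't obvious, it is easy to check exhaustively on the host. Below is a throwaway standard C++ program (our own sketch, not part of the SPU build) that simulates the mask byte each method produces for every possible p_ina%16 offset; the A/ANDBI pair in the new recipe merely keeps p_ina_mod16b up to date incrementally.

#include <cstdio>

int main() {
  for ( int o = 0; o < 16; ++o ) {      // o = p_ina % 16
    for ( int k = 0; k < 16; ++k ) {    // k = byte position within the qword
      // Old recipe: SHLQBY shifts the all-0x10 qword left by o bytes,
      // leaving 0x10 in byte positions 0 .. 15-o and zero elsewhere.
      int old_mask = ( k <= 15 - o ) ? 0x10 : 0x00;
      // New recipe: CGTB sets a byte to 0xff where (0x10 - k) > o, and the
      // ANDBI with 16 then keeps just the 0x10 bit.
      int new_mask = ( ( 0x10 - k ) > o ? 0xff : 0x00 ) & 0x10;
      if ( old_mask != new_mask ) {
        printf( "mismatch at o=%d, k=%d\n", o, k );
        return 1;
      }
    }
  }
  printf( "both recipes agree for all 16 offsets\n" );
  return 0;
}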

So we have traded one pipe 1 instruction for three pipe 0 instructions. If we also do this for p_inb_tmp, then we'll have 33 pipe 0 instructions and 34 pipe 1.

This was a little trickier to schedule this time, since we had more instructions in fewer slots. One problem was that the lifetime of some variables was too long. Originally, to calculate x, we had the following code,

        rotmi       x, u32, -21
        cuflt       x, x, 0
        fma         x, x, x_scale, neg_onef

That's a long 17 cycles (4 for the ROTMI, 7 for the CUFLT and 6 for the FMA), and it was preventing the

        shufb       xazaxbzb, x, z, shufAaBb

from being correctly scheduled. So what we did was break x up into several different variables,

        rotmi       x0, u32, -21
        cuflt       x1, x0, 0
        fma         x2, x1, x_scale, neg_onef

The SHUFB only needs x2, so it could then be scheduled without a problem. The register usage increased, but we just managed to stay within the registers the ABI specifies as volatile, so we didn't need to save and restore registers in the function prologue/epilogue.

Another trick in this final version is to remove some of the loop prologue by using the "trashcan stores" technique. The idea is to run the loop body extra times rather than emit prologue code; the output written by these extra iterations goes to a "trashcan" location on the stack instead of the caller's buffer. To implement this, the input argument out has been renamed to out_tmp, and in the setup code we added

        ai          out, $1, -0x40
        ai          out_tmp, out_tmp, -0x40

and to the loop, we replaced the

        # stores ...
        ai          out, out, 0x40

with

        ai          out_tmp, out_tmp, 0x40
        # stores ...
        ai          out, out_tmp, 0

After exchanging pipe 1 for pipe 0 instructions, we were left with one pipe 0 slot to spare, which is now used to copy out_tmp to out. With a second temporary variable we could have eliminated all of the prologue code, but the register copy would cost us an extra cycle.
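In scalar C++ terms the pointer dance looks like this (again a caricature with our own names, storing one float per iteration instead of four qwords):

// The loop body runs one extra time; the store from the not-yet-valid first
// iteration lands in stack scratch, just as the assembler version points
// "out" at $1 - 0x40 before entering the loop.
void trashcan_loop( float *out, const float *in, int n ) {
  float trashcan;                  // stands in for the 0x40 bytes below $1
  float *store_ptr = &trashcan;    // "out":     the first store is trash
  float *next_ptr  = out - 1;      // "out_tmp": pre-decremented in the setup
  float computed   = 0.f;          // the pipelined value, garbage on entry
  for ( int i = 0; i <= n; ++i ) { // n + 1 trips: one warm-up iteration
    next_ptr += 1;                 //   ai    out_tmp, out_tmp, 0x40
    *store_ptr = computed;         //   stqd  ... ( out )
    computed = ( i < n ) ? in[i] * 2.f : 0.f;
    store_ptr = next_ptr;          //   ai    out, out_tmp, 0
  }
}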

assembler.s

        .section    .text.assembler, "ax", @progbits

        .global     assembler
        .type       assembler, @function

        # $3    void *__restrict__ out
        # $4    const void *__restrict__ in
        # $5    uint32_t count
        # $6    uint32_t stride
        .set        out_tmp,            3
        .set        in,                 4
        .set        count,              5
        .set        stride,             6
        .align      3
assembler:

        .set        p_ina,              in
        .set        p_inb,              7
        .set        p_inc,              8
        .set        p_ind,              9
        .set        p_ina_0,            10
        .set        p_inb_0,            11
        .set        p_inc_0,            12
        .set        p_ind_0,            13

        .set        strd4,              14
        .set        onef,               15
        .set        neg_onef,           16
        .set        shufabcd,           17
        .set        second_qword,       18
        .set        x_scale,            19
        .set        yz_scale,           20
        .set        shufAaBb,           21
        .set        shufCcDd,           22
        .set        x3ff,               23

        .set        qa0,                24
        .set        qa1,                25
        .set        qb0,                26
        .set        qb1,                27
        .set        qc0,                28
        .set        qc1,                29
        .set        qd0,                30
        .set        qd1,                31
        .set        shuf_a,             32
        .set        shuf_b,             33
        .set        shuf_c,             34
        .set        shuf_d,             35
        .set        u32a,               36
        .set        u32b,               37
        .set        u32c,               38
        .set        u32d,               39
        .set        u32ac,              40
        .set        u32bd,              41
        .set        u32,                42
        .set        x,                  43
        .set        y,                  44
        .set        z,                  45
        .set        s,                  46
        .set        xazaxbzb,           47
        .set        xczcxdzd,           48
        .set        yasaybsb,           49
        .set        ycscydsd,           50
        .set        outa,               51
        .set        outb,               52
        .set        outc,               53
        .set        outd,               54
        .set        byte_splat,         57
        .set        cmp_addr_mod16,     58
        .set        strd4_mod16b,       59
        .set        p_ina_mod16b,       60
        .set        p_inb_mod16b,       61
        .set        x0,                 62
        .set        y0,                 63
        .set        z0,                 64
        .set        s0,                 65
        .set        x1,                 66
        .set        y1,                 67
        .set        z1,                 68
        .set        s1,                 69
        .set        x2,                 70
        .set        y2,                 71
        .set        z2,                 72
        .set        y3,                 73
        .set        z3,                 74
        .set        p_ina_1,            75
        .set        p_inb_1,            76
        .set        p_inc_1,            77
        .set        p_ind_1,            78
        .set        out,                79

        ilh         byte_splat, 0x0303 ;                        lqr         cmp_addr_mod16, _cmp_addr_mod16
        ai          count, count, 7 ;                           lqr         x_scale, _x_scale
        andi        count, count, -4 ;                          lqr         yz_scale, _yz_scale
        shli        strd4, stride, 2 ;                          lqr         shufAaBb, _shufAaBb
        a           p_inb, p_ina, stride ;                      cwd         shufabcd, 0 ( $1 )
        ilhu        onef, 0x3f80 ;                              hbrr        loop_branch, loop
        ilhu        neg_onef, 0xbf80 ;                          shufb       p_ina_mod16b, p_ina, p_ina, byte_splat
        ilh         second_qword, 0x1010 ;                      shufb       p_inb_mod16b, p_inb, p_inb, byte_splat
        il          x3ff, 0x3ff ;                               shufb       strd4_mod16b, strd4, strd4, byte_splat
        orbi        shufabcd, shufabcd, 16 ;                    /*lnop*/
        a           p_inc, p_inb, stride ;                      /*lnop*/
        a           p_ind, p_inc, stride ;                      /*lnop*/
        orbi        shufCcDd, shufAaBb, 8 ;                     /*lnop*/
        andbi       p_ina_mod16b, p_ina_mod16b, 15 ;            /*lnop*/
        andbi       p_inb_mod16b, p_inb_mod16b, 15 ;            /*lnop*/
        andbi       strd4_mod16b, strd4_mod16b, 15 ;            /*lnop*/
        ai          out, $1, -0x40 ;                            /*lnop*/
        ai          out_tmp, out_tmp, -0x40 ;                   /*lnop*/

        /*nop*/ ;                                               lqd         qa0, 0  ( p_ina )
        cgtb        p_ina_0, cmp_addr_mod16, p_ina_mod16b ;     lqd         qa1, 16 ( p_ina )
        cgtb        p_inb_0, cmp_addr_mod16, p_inb_mod16b ;     lqd         qb0, 0  ( p_inb )
        andi        p_inc_0, p_inc, 15 ;                        lqd         qb1, 16 ( p_inb )
        andi        p_ind_0, p_ind, 15 ;                        lqd         qc0, 0  ( p_inc )
        /*nop*/ ;                                               lqd         qc1, 16 ( p_inc )
        /*nop*/ ;                                               lqd         qd0, 0  ( p_ind )
        /*nop*/ ;                                               lqd         qd1, 16 ( p_ind )
        /*nop*/ ;                                               rotqby      shuf_a, shufabcd, p_ina
        a           p_ina, p_ina, strd4 ;                       rotqby      shuf_b, shufabcd, p_inb
        a           p_inb, p_inb, strd4 ;                       rotqby      shuf_c, shufabcd, p_inc
        a           p_inc, p_inc, strd4 ;                       rotqby      shuf_d, shufabcd, p_ind
        a           p_ind, p_ind, strd4 ;                       shlqby      p_inc_1, second_qword, p_inc_0
        andbi       p_ina_1, p_ina_0, 16 ;                      shlqby      p_ind_1, second_qword, p_ind_0
        andbi       p_inb_1, p_inb_0, 16 ;                      /*lnop*/
        andc        shuf_a, shuf_a, p_ina_1 ;                   /*lnop*/
        andc        shuf_c, shuf_c, p_inc_1 ;                   lnop
        andc        shuf_b, shuf_b, p_inb_1 ;                   shufb       u32a, qa0, qa1, shuf_a
        andc        shuf_d, shuf_d, p_ind_1 ;                   shufb       u32c, qc0, qc1, shuf_c
        /*nop*/ ;                                               shufb       u32b, qb0, qb1, shuf_b
        /*nop*/ ;                                               shufb       u32d, qd0, qd1, shuf_d
        /*nop*/ ;                                               /*lnop*/
        nop ;                                                   shufb       u32ac, u32a, u32c, shufAaBb
        /*nop*/ ;                                               /*lnop*/
        a           p_ina_mod16b, p_ina_mod16b, strd4_mod16b ;  shufb       u32bd, u32b, u32d, shufAaBb
        a           p_inb_mod16b, p_inb_mod16b, strd4_mod16b ;  /*lnop*/
        andbi       p_ina_mod16b, p_ina_mod16b, 15 ;            /*lnop*/
        andbi       p_inb_mod16b, p_inb_mod16b, 15 ;            /*lnop*/
        /*nop*/ ;                                               shufb       u32, u32ac, u32bd, shufAaBb

    loop:
        # pipeline 0                                            # pipeline 1                                        # cycle
        ai          count, count, -4 ;                          lqd         qa0, 0  ( p_ina )                       #   0
        cgtb        p_ina_0, cmp_addr_mod16, p_ina_mod16b ;     lqd         qa1, 16 ( p_ina )                       #   1
        cgtb        p_inb_0, cmp_addr_mod16, p_inb_mod16b ;     lqd         qb0, 0  ( p_inb )                       #   2
        andi        p_inc_0, p_inc, 15 ;                        lqd         qb1, 16 ( p_inb )                       #   3
        andi        p_ind_0, p_ind, 15 ;                        lqd         qc0, 0  ( p_inc )                       #   4
  /*2*/ rotmi       z0, u32, -1 ;                               lqd         qc1, 16 ( p_inc )                       #   5
  /*2*/ rotmi       y0, u32, -11 ;                              lqd         qd0, 0  ( p_ind )                       #   6
  /*2*/ rotmi       x0, u32, -21 ;                              lqd         qd1, 16 ( p_ind )                       #   7
  /*2*/ shli        s0, u32, 31 ;                               rotqby      shuf_a, shufabcd, p_ina                 #   8
        a           p_ina, p_ina, strd4 ;                       rotqby      shuf_b, shufabcd, p_inb                 #   9
        a           p_inb, p_inb, strd4 ;                       rotqby      shuf_c, shufabcd, p_inc                 #  10
        a           p_inc, p_inc, strd4 ;                       rotqby      shuf_d, shufabcd, p_ind                 #  11
        a           p_ind, p_ind, strd4 ;                       shlqby      p_inc_1, second_qword, p_inc_0          #  12
        andbi       p_ina_1, p_ina_0, 16 ;                      shlqby      p_ind_1, second_qword, p_ind_0          #  13
        andbi       p_inb_1, p_inb_0, 16 ;                      shufb       xazaxbzb, x2, z3, shufAaBb         /*3*/#  14
  /*2*/ and         z1, z0, x3ff ;                              shufb       yasaybsb, y3, s1, shufAaBb         /*3*/#  15
        andc        shuf_a, shuf_a, p_ina_1 ;                   shufb       xczcxdzd, x2, z3, shufCcDd         /*3*/#  16
        andc        shuf_c, shuf_c, p_inc_1 ;                   shufb       ycscydsd, y3, s1, shufCcDd         /*3*/#  17
        andc        shuf_b, shuf_b, p_inb_1 ;                   shufb       u32a, qa0, qa1, shuf_a                  #  18
        andc        shuf_d, shuf_d, p_ind_1 ;                   shufb       u32c, qc0, qc1, shuf_c                  #  19
  /*2*/ and         y1, y0, x3ff ;                              shufb       u32b, qb0, qb1, shuf_b                  #  20
  /*2*/ cuflt       z2, z1, 0 ;                                 shufb       u32d, qd0, qd1, shuf_d                  #  21
  /*2*/ cuflt       x1, x0, 0 ;                                 shufb       outa, xazaxbzb, yasaybsb, shufAaBb /*3*/#  22
  /*2*/ cuflt       y2, y1, 0 ;                                 shufb       u32ac, u32a, u32c, shufAaBb             #  23
  /*3*/ ai          out_tmp, out_tmp, 0x40 ;                    shufb       outb, xazaxbzb, yasaybsb, shufCcDd /*3*/#  24
        a           p_ina_mod16b, p_ina_mod16b, strd4_mod16b ;  shufb       u32bd, u32b, u32d, shufAaBb             #  25
        a           p_inb_mod16b, p_inb_mod16b, strd4_mod16b ;  shufb       outc, xczcxdzd, ycscydsd, shufAaBb /*3*/#  26
        andbi       p_ina_mod16b, p_ina_mod16b, 15 ;            shufb       outd, xczcxdzd, ycscydsd, shufCcDd /*3*/#  27
  /*2*/ fma         z3, z2, yz_scale, neg_onef ;                stqd        outa, 0x00 ( out )                 /*3*/#  28
  /*2*/ fma         x2, x1, x_scale,  neg_onef ;                shufb       u32, u32ac, u32bd, shufAaBb             #  29
  /*2*/ fma         y3, y2, yz_scale, neg_onef ;                stqd        outb, 0x10 ( out )                 /*3*/#  30
        andbi       p_inb_mod16b, p_inb_mod16b, 15 ;            stqd        outc, 0x20 ( out )                 /*3*/#  31
  /*2*/ or          s1, s0, onef ;                              stqd        outd, 0x30 ( out )                 /*3*/#  32
  /*3*/ ai          out, out_tmp, 0 ;            loop_branch:   brnz        count, loop                             #  33

        bi          $0

        .size       assembler, .-assembler



         .section    .rodata.assembler, "a", @progbits
        .align      4
_x_scale:           .float  0.0009770396, 0.0009770396, 0.0009770396, 0.0009770396
_yz_scale:          .float  0.0019550342, 0.0019550342, 0.0019550342, 0.0019550342
_shufAaBb:          .long   0x00010203,   0x10111213,   0x04050607,   0x14151617
_cmp_addr_mod16:    .long   0x100f0e0d,   0x0c0b0a09,   0x08070605,   0x04030201

So our final results are,

               cycles       speed up
    reference  0x00001f36
    intrinsics 0x0000052f   6.021
    assembler  0x00000290   12.180

When a subject is complicated, reading different explanations from different people can really help your understanding. Here on #AltDevBlogADay, Jaymin Kessler has put together some excellent videos on the same topic; if you haven't already, have a look at http://altdev.co/2011/03/16/software-pipelining-failed-video-experiment/. The "trashcan stores" technique comes from Jon Rocatis, and is described in the comment he posted on Jaymin's blog post.

If you want to mess around with this code, or verify results, the test setup used was,

test.c++

#include <spu_intrinsics.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define COMPRESSED_TANGENT( TANGENT_X, TANGENT_Y, TANGENT_Z, BITANGENT_SIGN ) \
   ((uint32_t)(((TANGENT_X)*0.5f+0.5f)*0x7ff))<<21 \
  |((uint32_t)(((TANGENT_Y)*0.5f+0.5f)*0x3ff))<<11 \
  |((uint32_t)(((TANGENT_Z)*0.5f+0.5f)*0x3ff))<<1  \
  |((BITANGENT_SIGN)<0.f)
#define PADDING 0xdecafbad
#define TEST_VALUES                                                     \
  COMPRESSED_TANGENT(+1.f, 0.f, 0.f,+1.f), PADDING, PADDING,            \
  COMPRESSED_TANGENT(-1.f, 0.f, 0.f,+1.f), PADDING, PADDING,            \
  COMPRESSED_TANGENT( 0.f,+1.f, 0.f,+1.f), PADDING, PADDING,            \
  COMPRESSED_TANGENT( 0.f,-1.f, 0.f,+1.f), PADDING, PADDING,            \
  COMPRESSED_TANGENT( 0.f, 0.f,+1.f,+1.f), PADDING, PADDING,            \
  COMPRESSED_TANGENT( 0.f, 0.f,-1.f,+1.f), PADDING, PADDING,            \
  COMPRESSED_TANGENT(+1.f, 0.f, 0.f,-1.f), PADDING, PADDING,            \
  COMPRESSED_TANGENT(-1.f, 0.f, 0.f,-1.f), PADDING, PADDING,            \
  COMPRESSED_TANGENT( 0.f,+1.f, 0.f,-1.f), PADDING, PADDING,            \
  COMPRESSED_TANGENT( 0.f,-1.f, 0.f,-1.f), PADDING, PADDING,            \
  COMPRESSED_TANGENT( 0.f, 0.f,+1.f,-1.f), PADDING, PADDING,            \
  COMPRESSED_TANGENT( 0.f, 0.f,-1.f,-1.f), PADDING, PADDING
#define TEST_VALUES_X16                                                 \
  TEST_VALUES, TEST_VALUES, TEST_VALUES, TEST_VALUES,                   \
  TEST_VALUES, TEST_VALUES, TEST_VALUES, TEST_VALUES,                   \
  TEST_VALUES, TEST_VALUES, TEST_VALUES, TEST_VALUES,                   \
  TEST_VALUES, TEST_VALUES, TEST_VALUES, TEST_VALUES
#define TEST_VALUES_X256                                                \
  TEST_VALUES_X16, TEST_VALUES_X16, TEST_VALUES_X16, TEST_VALUES_X16,   \
  TEST_VALUES_X16, TEST_VALUES_X16, TEST_VALUES_X16, TEST_VALUES_X16,   \
  TEST_VALUES_X16, TEST_VALUES_X16, TEST_VALUES_X16, TEST_VALUES_X16,   \
  TEST_VALUES_X16, TEST_VALUES_X16, TEST_VALUES_X16, TEST_VALUES_X16
static const uint32_t test_data[]={
  TEST_VALUES_X256
};
#define ARRAYLEN(ARRAY) (sizeof(ARRAY)/sizeof(*(ARRAY)))

static uint32_t results[ (ARRAYLEN(test_data)/3)*4 ]
  __attribute__((__aligned__(16)));

extern void reference( void *__restrict__ out,const void *__restrict__ in,
  uint32_t count, uint32_t stride );
extern void intrinsics( void *__restrict__ out,const void *__restrict__ in,
  uint32_t count, uint32_t stride );
extern "C" void assembler( void *__restrict__ out,const void *__restrict__ in,
  uint32_t count, uint32_t stride );

int main( int argc, char **argv ) {

  uint32_t reference_cycles;
  uint32_t intrinsics_cycles;
  uint32_t assembler_cycles;

  (void) argc;
  (void) argv;

  {
    spu_writech( SPU_WrDec, -1 );
    reference( results, test_data, ARRAYLEN(test_data)/3, 12 );
    reference_cycles = -spu_readch( SPU_RdDec );
  }

  {
    spu_writech( SPU_WrDec, -1 );
    intrinsics( results, test_data, ARRAYLEN(test_data)/3, 12 );
    intrinsics_cycles = -spu_readch( SPU_RdDec );
  }

  {
    spu_writech( SPU_WrDec, -1 );
    assembler( results, test_data, ARRAYLEN(test_data)/3, 12 );
    assembler_cycles = -spu_readch( SPU_RdDec );
  }

  printf(
    "           cycles       speed up\n"
    "reference  0x%08x\n"
    "intrinsics 0x%08x   %.3f\n"
    "assembler  0x%08x   %.3f\n",
    reference_cycles,
    intrinsics_cycles, (float)reference_cycles/(float)intrinsics_cycles,
    assembler_cycles,  (float)reference_cycles/(float)assembler_cycles );

  return 0;
}

build.sh

#! /bin/sh

set -e

spu-g++ -c -O3 -Werror -Wall -pedantic test.c++       -o test.o
spu-g++ -c -O3 -Werror -Wall -pedantic reference.c++  -o reference.o
spu-g++ -c -O3 -Werror -Wall -pedantic intrinsics.c++ -o intrinsics.o
spu-as --fatal-warnings assembler.s -o assembler.o
spu-g++ test.o reference.o intrinsics.o assembler.o -o test

scp test ${PS3_USER_IP}:/tmp/test
ssh ${PS3_USER_IP} /tmp/test

To be honest, you do need to think carefully about whether it is really the right move to optimize like this. Chances are, your scarce dev time could be put to better use optimizing something else. As we saw, the code became very fast, twice as fast as a fairly respectable intrinsics implementation, but it also became quite difficult to maintain. If you now want to make any change, you pretty much always need to reschedule the whole thing from scratch. Luckily, if you are a registered PS3 developer, Sony has a very awesome tool which takes the manual pain out of the process. But as always, to get the most from your tools, you need to understand how they work; specifically, reducing the lifetime of a variable with several temporaries, as we did for x, helps. And if you're working on other hardware where such tools don't exist, then this can get you that last elusive boost of speed you need.

Next time around, we'll have a look at branch prediction, and how to remove branches.