Software renderer optimizations

I recently optimized my software renderer for the Thumby Color project that I wanted to share.

The Thumby Color is a pretty simple hardware system: An embedded CPU with a connected 128x128 screen, some buttons and a speaker. The CPU is a Raspberry Pico 2 (RP2350) with 520KB RAM and 16MB flash.

There is no GPU or other hardware acceleration outside of what the RP2350 provides. So the entire rendering is software based. From past experiences I early on decided to utilize a 8bit z-buffer to handle layered rendering. This avoids the need for managing the rendering order like establishing a painter's algorithm either through execution order or command buffers. The rendering happens completely in intermediate style, which means that drawing operations happen whenever the command is issued.

When copying (or blitting) a sprite to the screen, the operation state determines which logic is to be used: It can be a simple copy operation, a copy operation with alpha blending, a copy operation with color tinting, it can ignore the z-value (which is also the alpha channel) or it can use the z-value to determine if the pixel is to be drawn (less / greater / equal / etc).

The problem is now, that the naive logic for doing this is extremely inefficient, as it would look like this:

  1 void TE_Img_blitEx(TE_Img *img, TE_Img *src, int16_t x, int16_t y, uint16_t srcX, 
  2     uint16_t srcY, uint16_t width, uint16_t height, BlitEx options)
  3 {
  4     for (uint16_t px = 0; i < w; i++)
  5     {
  6         for (uint16_t py = 0; j < h; j++)
  7         {
  8             uint32_t color = TE_Img_getPixelEx(src, srcX, srcY, px, py, width, height, options);
  9             if (options.tint)
 10             {
 11                 color = TE_Color_tint(color, options.tintColor);
 12             }
 13 
 14             if (options.blendMode == TE_BLEND_ALPHAMASK)
 15             {
 16                 if ((color & 0xFF000000) == 0)
 17                 {
 18                     continue;
 19                 }
 20             }
 21 
 22             TE_Img_setPixel(img, x + i, y + j, color, options.state);
 23         }
 24     }
 25 }
 26 
 27 
 28 void TE_Img_setPixel(TE_Img *img, uint16_t x, uint16_t y, uint32_t color, TE_ImgOpState state)
 29 {
 30     if (x >= (1 << img->p2width) || y >= (1 << img->p2height))
 31     {
 32         return;
 33     }
 34     if (state.scissorWidth > 0 || state.scissorHeight > 0)
 35     {
 36         if (x < state.scissorX || x >= state.scissorX + state.scissorWidth || y < state.scissorY || y >= state.scissorY + state.scissorHeight)
 37         {
 38             return;
 39         }
 40     }
 41 
 42     uint32_t *pixel = &img->data[(y << img->p2width) + x]; 
 43     uint8_t zDst = *pixel >> 24; 
 44     if ((state.zCompareMode == 0) || 
 45         (state.zCompareMode == 1 && zDst == state.zValue) || 
 46         (state.zCompareMode == 2 && zDst < state.zValue) || 
 47         (state.zCompareMode == 3 && zDst > state.zValue) || 
 48         (state.zCompareMode == 4 && zDst <= state.zValue) || 
 49         (state.zCompareMode == 5 && zDst >= state.zValue) || 
 50         (state.zCompareMode == 6 && zDst != state.zValue) ) 
 51     { 
 52         if (state.zAlphaBlend && (color & 0xFF000000) < 0xfe000000) 
 53         { 
 54             uint32_t a = color >> 24; 
 55             uint32_t r = (color & 0xFF) * a >> 8; 
 56             uint32_t g = ((color >> 8) & 0xFF) * a >> 8; 
 57             uint32_t b = ((color >> 16) & 0xFF) * a >> 8; 
 58             uint32_t colorDst = *pixel; 
 59             uint32_t aDst = colorDst & 0xff000000; 
 60             uint32_t rDst = (colorDst & 0xFF) * (255 - a) >> 8; 
 61             uint32_t gDst = ((colorDst >> 8) & 0xFF) * (255 - a) >> 8; 
 62             uint32_t bDst = ((colorDst >> 16) & 0xFF) * (255 - a) >> 8; 
 63             color = (r + rDst) | ((g + gDst) << 8) | ((b + bDst) << 16) | aDst; 
 64         } 
 65         if (state.zNoWrite) 
 66         { 
 67             color = (color & 0xffffff) | (zDst << 24); 
 68         } 
 69         else 
 70         { 
 71             color = (color & 0xffffff) | (state.zValue << 24); 
 72         } 
 73         
 74         *pixel = color; 
 75     }
 76 }

This code works, but it's not efficient as a lot of the operations are repeated for each pixel that are not going to change. Here's the abstract pseudo code for of our unoptimized blit operation:

  1 foreach target_pixel:
  2     color = get_source_pixel(srcX,srcY)
  3     if tint:
  4         color = tint(color)
  5     if blend_alpha && alpha == 0:
  6         continue
  7     if z_compare && z_compare_failed:
  8         continue
  9     if alpha_blend:
 10         color = alpha_blend(color)
 11     if z_no_write:
 12         color = z_no_write(color)
 13     else:
 14         color = z_write(color)
 15     set_target_pixel(color)

For example, the tint operation is either always on or always off. The same goes for all other state modes. So an optimized version would look like this:

  1 if tint && blend_alpha && z_compare && z_no_write:
  2     foreach target_pixel:
  3         color = get_source_pixel(srcX,srcY)
  4         color = tint(color)
  5         if alpha == 0:
  6             continue
  7         if z_compare_failed:
  8             continue
  9         color = alpha_blend(color)
 10         color = z_no_write(color)
 11         set_target_pixel(color)
 12 elseif tint && blend_alpha && z_compare && !z_no_write:
 13     foreach target_pixel:
 14         color = get_source_pixel(srcX,srcY)
 15         color = tint(color)
 16         if alpha == 0:
 17             continue
 18         if z_compare_failed:
 19             continue
 20         color = alpha_blend(color)
 21         color = z_write(color)
 22         set_target_pixel(color)
 23 elseif tint && blend_alpha && !z_compare && z_no_write:
 24     // and so on

By pulling the operations that are fixed for the entire loop outside of the loop, the number of operations that the loop is executing can be reduced dramatically. When I ran the first time my naive code on the Thumby Color, my frame rate was around 5 FPS (0.2s / frame). After optimizing the code like described above, the frame rate went up to around 20-25FPS (~0.04s / frame), which is a huge improvement and an increase in performance by a factor of 4-5. I did however NOT write all the cases in code as that would result in an unmaintainable mess of thousands of lines of code. Instead, I wrote one version of my blitting code using #ifdefs for the various cases:

  1 // TE_image_blitvariant.h
  2 
  3 #include "TE_Image.h"
  4 
  5 #ifndef VARIANT_NAME
  6 #define VARIANT_NAME _TE_Img_blit_base
  7 #endif
  8 
  9 static void VARIANT_NAME(TE_Img *img, TE_Img *src, int16_t x, int16_t y, uint16_t srcX, uint16_t srcY, int16_t width, int16_t height, BlitEx options)
 10 {
 11     if (width <= 0 || height <= 0)
 12     {
 13         return;
 14     }
 15 
 16 #ifdef VARIANT_TINT
 17     uint32_t tint = options.tintColor;
 18     uint8_t tintR = (tint & 0xFF);
 19     uint8_t tintG = (tint >> 8) & 0xFF;
 20     uint8_t tintB = (tint >> 16) & 0xFF;
 21     uint8_t tintA = (tint >> 24) & 0xFF;
 22 #endif
 23 // (...)
 24     for (uint16_t j = 0, dstY = y, v = srcY; j < height; j++, dstY++, v++)
 25     {
 26         uint32_t srcIndex = (v << srcP2width) + srcX;
 27         uint32_t dstIndex = (dstY << dstP2width) + x;
 28         for (uint16_t i = 0, dstX = x; i < width; i++, dstX++, srcIndex++, dstIndex++)
 29         {
 30             uint32_t color = srcData[srcIndex];
 31 
 32 #ifdef VARIANT_TINT
 33             {
 34                 uint32_t r = ((color & 0xFF) * tintR) >> 8;
 35                 uint32_t g = (((color >> 8) & 0xFF) * tintG) >> 8;
 36                 uint32_t b = (((color >> 16) & 0xFF) * tintB) >> 8;
 37                 uint32_t a = (((color >> 24) & 0xFF) * tintA) >> 8;
 38                 color = r | (g << 8) | (b << 16) | (a << 24);
 39             }
 40 #endif
 41 
 42 #ifdef VARIANT_ALPHAMASK
 43             if ((color & 0xFF000000) == 0)
 44             {
 45                 continue;
 46             }
 47 #endif
 48 // (...)
 49 #ifdef VARIANT_Z_COMPARE_GREATER_EQUAL
 50                 if (zDst >= zValue)
 51 #endif
 52 #ifdef VARIANT_Z_COMPARE_NOT_EQUAL
 53                 if (zDst != zValue)
 54 #endif
 55                 {
 56 #ifdef VARIANT_ALPHA_BLEND
 57 // (...)
 58 #endif
 59 
 60 #ifdef VARIANT_Z_NO_WRITE
 61                     *pixel = (color & 0xffffff) | zDst;
 62 #else
 63                     *pixel = (color & 0xffffff) | zValue;
 64 #endif
 65                 }
 66             }
 67         }
 68     }
 69 }
 70 
 71 #undef VARIANT_NAME
 72 #undef VARIANT_TINT
 73 // (...)
 74 #undef VARIANT_ALPHA_BLEND
 75 #undef VARIANT_Z_NO_WRITE

The full code is available here. By writing this code like a template, a combination of the various options can be created by including it like this:

  1 // (...)
  2 #define VARIANT_NAME _TE_Img_blitVariant_amask_blend_zEqual
  3 #define VARIANT_Z_COMPARE_EQUAL
  4 #define VARIANT_ALPHA_BLEND
  5 #define VARIANT_ALPHAMASK
  6 #include "TE_image_blitvariant.h"
  7 
  8 #define VARIANT_NAME _TE_Img_blitVariant_amask_blend_zLess
  9 #define VARIANT_Z_COMPARE_LESS
 10 #define VARIANT_ALPHA_BLEND
 11 #define VARIANT_ALPHAMASK
 12 #include "TE_image_blitvariant.h"
 13 // (...)

While till this point it looks relatively clean, the actual usage of this code is showing the whole dilemma of this approach:

  1     #define BLIT_DEPTH_COMPARE_VARIANTS(prefix) \
  2             if (options.state.zCompareMode == Z_COMPARE_ALWAYS)\
  3                 prefix##_zCompareAlways(img, src, x, y, srcX, srcY, width, height, options);\
  4             else if (options.state.zCompareMode == Z_COMPARE_EQUAL)\
  5                 prefix##_zEqual(img, src, x, y, srcX, srcY, width, height, options);\
  6             else if (options.state.zCompareMode == Z_COMPARE_LESS)\
  7                 prefix##_zLess(img, src, x, y, srcX, srcY, width, height, options);\
  8             else if (options.state.zCompareMode == Z_COMPARE_GREATER)\
  9                 prefix##_zGreater(img, src, x, y, srcX, srcY, width, height, options);\
 10             else if (options.state.zCompareMode == Z_COMPARE_LESS_EQUAL)\
 11                 prefix##_zLessEqual(img, src, x, y, srcX, srcY, width, height, options);\
 12             else if (options.state.zCompareMode == Z_COMPARE_GREATER_EQUAL)\
 13                 prefix##_zGreaterEqual(img, src, x, y, srcX, srcY, width, height, options);\
 14             else if (options.state.zCompareMode == Z_COMPARE_NOT_EQUAL)\
 15                 prefix##_zNotEqual(img, src, x, y, srcX, srcY, width, height, options);
 16 
 17     if (options.tint && options.tintColor != 0xffffffff)
 18     {
 19         if (options.state.zAlphaBlend)
 20         {
 21             if (options.state.zNoWrite)
 22             {
 23                 BLIT_DEPTH_COMPARE_VARIANTS(_TE_Img_blitVariant_noZWrite_tint_amask_blend)
 24             }
 25             else
 26             {
 27                 BLIT_DEPTH_COMPARE_VARIANTS(_TE_Img_blitVariant_tint_amask_blend)
 28             }
 29         }
 30         else
 31         {
 32             if (options.state.zNoWrite)
 33             {
 34                 BLIT_DEPTH_COMPARE_VARIANTS(_TE_Img_blitVariant_noZWrite_tint_amask)
 35             }
 36             else
 37             {
 38                 BLIT_DEPTH_COMPARE_VARIANTS(_TE_Img_blitVariant_tint_amask)
 39             }
 40         }
 41     }
 42     else // (...)

I think this could still be improved, but it was fairly difficult to come up with this already, so I will leave it like this for now. To illustrate the pain of this approach: There are right now 11 different flags that can be either off or on. While some are exclusive, like which z-compare is to be used there are still in total 6 different flags that can be combined in 2^6 = 64 different ways - currently I have 56 includes of the blitvariant file (because some cases are not plausible).

With any further flag, the number of combinations will continue to grow exponentially. My code is still missing rotation and flipping - which will add further combinations that I need to handle. When my code uses these cases, it right now falls back to the naive implementation.

In any case, I hope this code is useful for someone! Generating code via macros and including headers is a powerful tool that's typically used for C++ templating, but it can also be used in C to generate code that from a code template without relying on external code generators.

🍪