Software renderer optimizations
I recently optimized my software renderer for the Thumby Color project that I wanted to share.
The Thumby Color is a pretty simple hardware system: An embedded CPU with a connected 128x128 screen, some buttons and a speaker. The CPU is a Raspberry Pico 2 (RP2350) with 520KB RAM and 16MB flash.
There is no GPU or other hardware acceleration outside of what the RP2350 provides. So the entire rendering is software based. From past experiences I early on decided to utilize a 8bit z-buffer to handle layered rendering. This avoids the need for managing the rendering order like establishing a painter's algorithm either through execution order or command buffers. The rendering happens completely in intermediate style, which means that drawing operations happen whenever the command is issued.
When copying (or blitting) a sprite to the screen, the operation state determines which logic is to be used: It can be a simple copy operation, a copy operation with alpha blending, a copy operation with color tinting, it can ignore the z-value (which is also the alpha channel) or it can use the z-value to determine if the pixel is to be drawn (less / greater / equal / etc).
The problem is now, that the naive logic for doing this is extremely inefficient, as it would look like this:
1 void TE_Img_blitEx(TE_Img *img, TE_Img *src, int16_t x, int16_t y, uint16_t srcX,
2 uint16_t srcY, uint16_t width, uint16_t height, BlitEx options)
3 {
4 for (uint16_t px = 0; i < w; i++)
5 {
6 for (uint16_t py = 0; j < h; j++)
7 {
8 uint32_t color = TE_Img_getPixelEx(src, srcX, srcY, px, py, width, height, options);
9 if (options.tint)
10 {
11 color = TE_Color_tint(color, options.tintColor);
12 }
13
14 if (options.blendMode == TE_BLEND_ALPHAMASK)
15 {
16 if ((color & 0xFF000000) == 0)
17 {
18 continue;
19 }
20 }
21
22 TE_Img_setPixel(img, x + i, y + j, color, options.state);
23 }
24 }
25 }
26
27
28 void TE_Img_setPixel(TE_Img *img, uint16_t x, uint16_t y, uint32_t color, TE_ImgOpState state)
29 {
30 if (x >= (1 << img->p2width) || y >= (1 << img->p2height))
31 {
32 return;
33 }
34 if (state.scissorWidth > 0 || state.scissorHeight > 0)
35 {
36 if (x < state.scissorX || x >= state.scissorX + state.scissorWidth || y < state.scissorY || y >= state.scissorY + state.scissorHeight)
37 {
38 return;
39 }
40 }
41
42 uint32_t *pixel = &img->data[(y << img->p2width) + x];
43 uint8_t zDst = *pixel >> 24;
44 if ((state.zCompareMode == 0) ||
45 (state.zCompareMode == 1 && zDst == state.zValue) ||
46 (state.zCompareMode == 2 && zDst < state.zValue) ||
47 (state.zCompareMode == 3 && zDst > state.zValue) ||
48 (state.zCompareMode == 4 && zDst <= state.zValue) ||
49 (state.zCompareMode == 5 && zDst >= state.zValue) ||
50 (state.zCompareMode == 6 && zDst != state.zValue) )
51 {
52 if (state.zAlphaBlend && (color & 0xFF000000) < 0xfe000000)
53 {
54 uint32_t a = color >> 24;
55 uint32_t r = (color & 0xFF) * a >> 8;
56 uint32_t g = ((color >> 8) & 0xFF) * a >> 8;
57 uint32_t b = ((color >> 16) & 0xFF) * a >> 8;
58 uint32_t colorDst = *pixel;
59 uint32_t aDst = colorDst & 0xff000000;
60 uint32_t rDst = (colorDst & 0xFF) * (255 - a) >> 8;
61 uint32_t gDst = ((colorDst >> 8) & 0xFF) * (255 - a) >> 8;
62 uint32_t bDst = ((colorDst >> 16) & 0xFF) * (255 - a) >> 8;
63 color = (r + rDst) | ((g + gDst) << 8) | ((b + bDst) << 16) | aDst;
64 }
65 if (state.zNoWrite)
66 {
67 color = (color & 0xffffff) | (zDst << 24);
68 }
69 else
70 {
71 color = (color & 0xffffff) | (state.zValue << 24);
72 }
73
74 *pixel = color;
75 }
76 }
This code works, but it's not efficient as a lot of the operations are repeated for each pixel that are not going to change. Here's the abstract pseudo code for of our unoptimized blit operation:
1 foreach target_pixel:
2 color = get_source_pixel(srcX,srcY)
3 if tint:
4 color = tint(color)
5 if blend_alpha && alpha == 0:
6 continue
7 if z_compare && z_compare_failed:
8 continue
9 if alpha_blend:
10 color = alpha_blend(color)
11 if z_no_write:
12 color = z_no_write(color)
13 else:
14 color = z_write(color)
15 set_target_pixel(color)
For example, the tint operation is either always on or always off. The same goes for all other state modes. So an optimized version would look like this:
1 if tint && blend_alpha && z_compare && z_no_write:
2 foreach target_pixel:
3 color = get_source_pixel(srcX,srcY)
4 color = tint(color)
5 if alpha == 0:
6 continue
7 if z_compare_failed:
8 continue
9 color = alpha_blend(color)
10 color = z_no_write(color)
11 set_target_pixel(color)
12 elseif tint && blend_alpha && z_compare && !z_no_write:
13 foreach target_pixel:
14 color = get_source_pixel(srcX,srcY)
15 color = tint(color)
16 if alpha == 0:
17 continue
18 if z_compare_failed:
19 continue
20 color = alpha_blend(color)
21 color = z_write(color)
22 set_target_pixel(color)
23 elseif tint && blend_alpha && !z_compare && z_no_write:
24 // and so on
By pulling the operations that are fixed for the entire loop outside of the loop, the number of operations that the loop is executing can be reduced dramatically. When I ran the first time my naive code on the Thumby Color, my frame rate was around 5 FPS (0.2s / frame). After optimizing the code like described above, the frame rate went up to around 20-25FPS (~0.04s / frame), which is a huge improvement and an increase in performance by a factor of 4-5. I did however NOT write all the cases in code as that would result in an unmaintainable mess of thousands of lines of code. Instead, I wrote one version of my blitting code using #ifdefs for the various cases:
1 // TE_image_blitvariant.h
2
3 #include "TE_Image.h"
4
5 #ifndef VARIANT_NAME
6 #define VARIANT_NAME _TE_Img_blit_base
7 #endif
8
9 static void VARIANT_NAME(TE_Img *img, TE_Img *src, int16_t x, int16_t y, uint16_t srcX, uint16_t srcY, int16_t width, int16_t height, BlitEx options)
10 {
11 if (width <= 0 || height <= 0)
12 {
13 return;
14 }
15
16 #ifdef VARIANT_TINT
17 uint32_t tint = options.tintColor;
18 uint8_t tintR = (tint & 0xFF);
19 uint8_t tintG = (tint >> 8) & 0xFF;
20 uint8_t tintB = (tint >> 16) & 0xFF;
21 uint8_t tintA = (tint >> 24) & 0xFF;
22 #endif
23 // (...)
24 for (uint16_t j = 0, dstY = y, v = srcY; j < height; j++, dstY++, v++)
25 {
26 uint32_t srcIndex = (v << srcP2width) + srcX;
27 uint32_t dstIndex = (dstY << dstP2width) + x;
28 for (uint16_t i = 0, dstX = x; i < width; i++, dstX++, srcIndex++, dstIndex++)
29 {
30 uint32_t color = srcData[srcIndex];
31
32 #ifdef VARIANT_TINT
33 {
34 uint32_t r = ((color & 0xFF) * tintR) >> 8;
35 uint32_t g = (((color >> 8) & 0xFF) * tintG) >> 8;
36 uint32_t b = (((color >> 16) & 0xFF) * tintB) >> 8;
37 uint32_t a = (((color >> 24) & 0xFF) * tintA) >> 8;
38 color = r | (g << 8) | (b << 16) | (a << 24);
39 }
40 #endif
41
42 #ifdef VARIANT_ALPHAMASK
43 if ((color & 0xFF000000) == 0)
44 {
45 continue;
46 }
47 #endif
48 // (...)
49 #ifdef VARIANT_Z_COMPARE_GREATER_EQUAL
50 if (zDst >= zValue)
51 #endif
52 #ifdef VARIANT_Z_COMPARE_NOT_EQUAL
53 if (zDst != zValue)
54 #endif
55 {
56 #ifdef VARIANT_ALPHA_BLEND
57 // (...)
58 #endif
59
60 #ifdef VARIANT_Z_NO_WRITE
61 *pixel = (color & 0xffffff) | zDst;
62 #else
63 *pixel = (color & 0xffffff) | zValue;
64 #endif
65 }
66 }
67 }
68 }
69 }
70
71 #undef VARIANT_NAME
72 #undef VARIANT_TINT
73 // (...)
74 #undef VARIANT_ALPHA_BLEND
75 #undef VARIANT_Z_NO_WRITE
The full code is available here. By writing this code like a template, a combination of the various options can be created by including it like this:
1 // (...)
2 #define VARIANT_NAME _TE_Img_blitVariant_amask_blend_zEqual
3 #define VARIANT_Z_COMPARE_EQUAL
4 #define VARIANT_ALPHA_BLEND
5 #define VARIANT_ALPHAMASK
6 #include "TE_image_blitvariant.h"
7
8 #define VARIANT_NAME _TE_Img_blitVariant_amask_blend_zLess
9 #define VARIANT_Z_COMPARE_LESS
10 #define VARIANT_ALPHA_BLEND
11 #define VARIANT_ALPHAMASK
12 #include "TE_image_blitvariant.h"
13 // (...)
While till this point it looks relatively clean, the actual usage of this code is showing the whole dilemma of this approach:
1 #define BLIT_DEPTH_COMPARE_VARIANTS(prefix) \
2 if (options.state.zCompareMode == Z_COMPARE_ALWAYS)\
3 prefix##_zCompareAlways(img, src, x, y, srcX, srcY, width, height, options);\
4 else if (options.state.zCompareMode == Z_COMPARE_EQUAL)\
5 prefix##_zEqual(img, src, x, y, srcX, srcY, width, height, options);\
6 else if (options.state.zCompareMode == Z_COMPARE_LESS)\
7 prefix##_zLess(img, src, x, y, srcX, srcY, width, height, options);\
8 else if (options.state.zCompareMode == Z_COMPARE_GREATER)\
9 prefix##_zGreater(img, src, x, y, srcX, srcY, width, height, options);\
10 else if (options.state.zCompareMode == Z_COMPARE_LESS_EQUAL)\
11 prefix##_zLessEqual(img, src, x, y, srcX, srcY, width, height, options);\
12 else if (options.state.zCompareMode == Z_COMPARE_GREATER_EQUAL)\
13 prefix##_zGreaterEqual(img, src, x, y, srcX, srcY, width, height, options);\
14 else if (options.state.zCompareMode == Z_COMPARE_NOT_EQUAL)\
15 prefix##_zNotEqual(img, src, x, y, srcX, srcY, width, height, options);
16
17 if (options.tint && options.tintColor != 0xffffffff)
18 {
19 if (options.state.zAlphaBlend)
20 {
21 if (options.state.zNoWrite)
22 {
23 BLIT_DEPTH_COMPARE_VARIANTS(_TE_Img_blitVariant_noZWrite_tint_amask_blend)
24 }
25 else
26 {
27 BLIT_DEPTH_COMPARE_VARIANTS(_TE_Img_blitVariant_tint_amask_blend)
28 }
29 }
30 else
31 {
32 if (options.state.zNoWrite)
33 {
34 BLIT_DEPTH_COMPARE_VARIANTS(_TE_Img_blitVariant_noZWrite_tint_amask)
35 }
36 else
37 {
38 BLIT_DEPTH_COMPARE_VARIANTS(_TE_Img_blitVariant_tint_amask)
39 }
40 }
41 }
42 else // (...)
I think this could still be improved, but it was fairly difficult to come up with this already, so I will leave it like this for now. To illustrate the pain of this approach: There are right now 11 different flags that can be either off or on. While some are exclusive, like which z-compare is to be used there are still in total 6 different flags that can be combined in 2^6 = 64 different ways - currently I have 56 includes of the blitvariant file (because some cases are not plausible).
With any further flag, the number of combinations will continue to grow exponentially. My code is still missing rotation and flipping - which will add further combinations that I need to handle. When my code uses these cases, it right now falls back to the naive implementation.
In any case, I hope this code is useful for someone! Generating code via macros and including headers is a powerful tool that's typically used for C++ templating, but it can also be used in C to generate code that from a code template without relying on external code generators.