Parsing BMP Files on a Microcontroller With Almost No RAM

A 24-bit, 800×480 BMP is about 1.1 MB on disk and 1.1 MB once decoded - and the microcontroller you want to draw it on has maybe 64 KB of RAM. The naive plan (“load the image, then draw it”) loses before it starts. The whole trick to images on tiny hardware is to never hold the image. You walk the file from the SD card one row at a time, convert each pixel as it flies past, push it to the display, and forget it ever existed.

This article is a tour of exactly that, using the BMP reader we built for the SIKTEC-EPD library - a parser that runs on AVR and ESP boards, draws full-color photos onto black/white/red e-paper panels, and gets away with about two rows of working memory. BMP is a wonderful format to learn on: it’s old, it’s honest, and it puts its pixels in a flat array with no compression to fight. Let’s read one.

Why BMP Is the Friendly Format

Modern formats earn their keep by being clever. PNG deflates, JPEG does a discrete cosine transform, WebP does several things at once. Clever is great on a laptop and miserable on a microcontroller - decoding any of them means buffering, dictionaries, and math your chip would rather not do.

BMP is the opposite of clever, and that’s exactly why we love it here. A BMP is essentially a header that says “the pixels start at byte N, they’re this wide, this tall, and this many bits each,” followed by the pixels laid out in a plain grid. There’s no entropy coding to unwind. (The format does define optional run-length modes, but the files you will usually meet are uncompressed, and those are the ones this article covers.) If you can compute a byte offset and call seek(), you can read any pixel in the file without touching the others. That property - random access by arithmetic - is what makes the streaming approach possible.

The Anatomy of a BMP

Every BMP opens with a 14-byte file header. Pack it into a struct and read it in one shot:

#pragma pack(push, 1)
typedef struct BMPFileHeader {
    uint16_t  type        = 0;  // 'BM' == 0x4D42
    uint32_t  size        = 0;  // total file size
    uint32_t  reserved    = 0;
    uint32_t  array_start = 0;  // byte offset where pixels begin
} bmp_file_header_t;
#pragma pack(pop)

The #pragma pack(push, 1) is not decoration - it’s load-bearing. On 32-bit targets such as the ESP boards, the compiler will by default pad that struct so the 32-bit fields land on 4-byte boundaries, inserting two bytes after type and shoving every later field out of step with the file’s actual byte layout. 8-bit AVR has no such alignment requirement, but the pragma keeps the same code correct everywhere. Packing to 1 forces the struct to mirror the file exactly, so you can do the laziest, fastest parse there is: read 14 bytes straight off the card into the struct and trust the field names.

Two fields earn their place immediately. type must equal 0x4D42 - the ASCII letters BM, little-endian - or this isn’t a BMP we recognize, and we bail out before doing anything silly. array_start is the gift: it’s the absolute byte offset where pixel data begins, so we never have to guess how big the headers and palette turned out to be.
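
The two checks above can also be written without relying on struct packing or host byte order. What follows is a minimal sketch, not the library’s code: FileHeaderSketch, le16, le32 and parseFileHeader are hypothetical names, and the input is assumed to be the first 14 bytes of the file already sitting in a small buffer.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical mirror of the article's file header, filled field by field.
struct FileHeaderSketch {
    uint16_t type;
    uint32_t size;
    uint32_t reserved;
    uint32_t array_start;
};

// Little-endian readers that give the same result on any host byte order.
static uint16_t le16(const uint8_t* p) {
    return static_cast<uint16_t>(p[0] | (p[1] << 8));
}
static uint32_t le32(const uint8_t* p) {
    return  static_cast<uint32_t>(p[0])
         | (static_cast<uint32_t>(p[1]) << 8)
         | (static_cast<uint32_t>(p[2]) << 16)
         | (static_cast<uint32_t>(p[3]) << 24);
}

// Returns true only if buf holds a plausible BMP file header.
bool parseFileHeader(const uint8_t* buf, size_t len, FileHeaderSketch& out) {
    if (len < 14) return false;             // truncated file
    out.type        = le16(buf);
    out.size        = le32(buf + 2);
    out.reserved    = le32(buf + 6);
    out.array_start = le32(buf + 10);
    if (out.type != 0x4D42) return false;   // not 'BM'
    // Pixels cannot begin inside the file header plus the smallest
    // (12-byte) info header.
    return out.array_start >= 14 + 12;
}
```

The extra lower bound on array_start is cheap insurance: a corrupt offset is caught here rather than halfway through a draw.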

After those 14 bytes comes the DIB header (the “info header”), and here BMP shows its age - there are at least seven of them, and the only way to tell which one you’re holding is to read its first field, the header’s own size in bytes:

enum BMP_VARIANT {
    NOT_SUPPORTED           = 0,
    BITMAPCOREHEADER_12     = 12,   // ancient OS/2
    OS22XBITMAPHEADER_16    = 16,
    BITMAPINFOHEADER_40     = 40,   // the one you'll see 95% of the time
    BITMAPINFOHEADE_ILLU_56 = 56,   // a wild Illustrator-only 16bpp variant
    OS22XBITMAPHEADER_64    = 64,
    BITMAPV4HEADER_108      = 108,
    BITMAPV5HEADER_124      = 124
};

Detecting the variant is one read:

BMP_VARIANT bitmapVariant() {
    this->seekSet(BITMAP_FILEHEADER_SIZE); // jump past the 14-byte file header
    uint32_t header_size = this->read32();
    switch (header_size) {
        case 12:  return BITMAPCOREHEADER_12;
        case 40:  return BITMAPINFOHEADER_40;
        case 108: return BITMAPV4HEADER_108;
        case 124: return BITMAPV5HEADER_124;
        // ...the rest
        default:  return NOT_SUPPORTED;
    }
}

The newer headers (V4, V5) tack on color-space and gamma fields we don’t care about for drawing. So we read only the part we use and skip the rest:

// Read at most 64 bytes of the info header - everything past that is
// color-management metadata we don't need to draw pixels.
size_t read_size = def.variant <= 64 ? def.variant : 64;
this->file.read(&def.info_header, read_size);

The one variant that won’t tolerate this shortcut is the 12-byte BITMAPCOREHEADER, because its width and height are 16-bit, not 32-bit. Read it field by field or you’ll smear the layout. Old formats demand a little respect:

if (def.variant == BITMAPCOREHEADER_12) {
    def.info_header.header_size  = this->read32();
    def.info_header.width        = (uint32_t)this->read16(); // 16-bit here!
    def.info_header.height       = (uint32_t)this->read16();
    def.info_header.color_planes = this->read16();
    def.info_header.bpp          = this->read16();
}

Two Gotchas That Will Eat Your Afternoon

Before any pixels come out right, two details have to be correct. They’re the difference between “renders perfectly” and “renders as garbage that looks almost right,” which is the worst kind of bug.

Gotcha one: BMP is stored upside down. Unless the height field is negative, the rows are written bottom-to-top - the first row in the file is the bottom row of the image. A negative height flips the convention to top-down. So you normalize once and remember which way it went:

if (def.info_header.height < 0) {
    def.info_header.height = -def.info_header.height; // top-down image
    def.flip = false;
}

Then when you draw, you walk the file forward but place rows from the bottom of the display upward. Forget this and your image comes out upside down, and you’ll spend twenty minutes blaming your SD wiring.
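
The row bookkeeping is small enough to isolate. The sketch below uses hypothetical helper names (normalizeHeight, imageRowForFileRow) and assumes the raw height has been read as a signed 32-bit value; it only illustrates the mapping, not how the library structures it.

```cpp
#include <cstdint>

// Normalizes a signed header height and reports the storage order.
// A non-negative height means the file stores rows bottom-up.
uint32_t normalizeHeight(int32_t rawHeight, bool& bottomUp) {
    bottomUp = rawHeight >= 0;
    int64_t h = rawHeight;                  // widen so INT32_MIN cannot overflow
    return static_cast<uint32_t>(h < 0 ? -h : h);
}

// Maps the n-th row stored in the file to its row in the image,
// counting image rows from the top (row 0).
uint32_t imageRowForFileRow(uint32_t fileRow, uint32_t height, bool bottomUp) {
    return bottomUp ? (height - 1 - fileRow) : fileRow;
}
```

For a 480-row bottom-up image, the first row in the file lands on image row 479 and the last on row 0.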

Gotcha two: rows are padded to a 4-byte boundary. This is the classic BMP bug. Each row of pixels is rounded up so its byte length is a multiple of four. A 3-pixel-wide 24-bit image needs 9 bytes per row of actual color - but it’s stored as 12, with 3 bytes of padding you must skip. The formula that has saved a thousand parsers:

// bits-per-pixel * width, rounded up to the next multiple of 32 bits,
// expressed in bytes. This is the true on-disk stride of one row.
bmp_read.row_bit_size = ((bpp * width + 31) / 32) * 4;

Use the padded stride - never width * bytes_per_pixel - every time you compute the address of a row. Get it wrong and each row drifts a few bytes from the last, producing that unmistakable diagonal-shear smear.
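
A few worked values make the padding rule concrete. This is a small sketch with hypothetical function names (rowStride, rowAddress) wrapping the same formula; the static_asserts are checked at compile time.

```cpp
#include <cstdint>

// On-disk byte length of one row, rounded up to a multiple of 4 bytes.
constexpr uint32_t rowStride(uint32_t bpp, uint32_t width) {
    return ((bpp * width + 31u) / 32u) * 4u;
}

// Absolute file offset of the n-th stored row.
constexpr uint32_t rowAddress(uint32_t arrayStart, uint32_t fileRow, uint32_t stride) {
    return arrayStart + fileRow * stride;
}

static_assert(rowStride(24, 3)   == 12,   "9 bytes of color plus 3 of padding");
static_assert(rowStride(24, 800) == 2400, "already a multiple of 4, no padding");
static_assert(rowStride(1, 10)   == 4,    "10 bits round up to 32 bits");
static_assert(rowStride(4, 5)    == 4,    "20 bits round up to 32 bits");
```

Note that the 800-pixel case needs no padding at all, which is exactly why a parser that ignores the stride can pass its first test image and fail on the second.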

Reading Numbers Off the Card (Cheaply)

A microcontroller and a BMP file are usually both little-endian, which lets you cheat. On a little-endian target you can read a multi-byte integer by slurping the bytes straight into the variable; only on a big-endian or unknown target do you reassemble byte by byte:

uint16_t read16() {
#if (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__)
    uint16_t result;
    this->file.read(&result, 2);  // bytes already in the right order
    return result;
#else
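    // Read into named variables: the evaluation order of the two operands
    // of | is unspecified, so two read() calls in one expression may swap.
    uint8_t lo = this->file.read();   // low byte comes first in the file
    uint8_t hi = this->file.read();
    return (uint16_t)(lo | ((uint16_t)hi << 8));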
#endif
}

It’s a small thing, but on a chip reading thousands of pixels per image, skipping the shift-and-or on every value adds up - and it costs nothing but a compile-time branch.

The Palette: Small, and Stored Backwards

Images at 8 bits per pixel or fewer don’t store colors directly; they store indices into a palette table that sits between the header and the pixel array. A 4-bit image has at most 16 colors; an 8-bit image has at most 256. The size falls out of the bit depth:

if (def.info_header.bpp <= 8 && def.info_header.colors == 0) {
    def.palette_size = 1 << def.info_header.bpp; // 2^bpp entries
}

Two wrinkles. First, BMP stores palette colors as BGR(A), not RGB - blue first - so you flip the channels as you read them in. Second, you store them in whatever format your display actually wants. On a memory-tight panel that means packing each entry into a single 16-bit RGB565 value instead of a fat 32-bit one, halving the table’s footprint:

// Convert 8-8-8 down to 5-6-5: 5 bits red, 6 green, 5 blue, packed in 16 bits.
uint16_t color32to16(uint8_t R8, uint8_t G8, uint8_t B8) {
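    // Cast before shifting: int is only 16 bits on AVR, so shifting a
    // promoted int left by 11 could overflow into the sign bit.
    return (uint16_t)(((uint16_t)(R8 >> 3) << 11)
                    | ((uint16_t)(G8 >> 2) << 5)
                    |  (uint16_t)(B8 >> 3));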
}

Green gets the extra bit because the human eye is most sensitive to it - that’s the lopsided 5-6-5 split, not a typo.
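
Putting the two wrinkles together, loading a palette might look like the sketch below. It assumes the palette bytes have already been read into a buffer and that each entry is the 4-byte record (blue, green, red, reserved) used by the 40-byte and newer headers; the old 12-byte header uses 3-byte entries and is not handled. The names to565 and loadPalette565 are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// 8-8-8 down to 5-6-5, with casts so the shifts stay safe where int is 16 bits.
static uint16_t to565(uint8_t r, uint8_t g, uint8_t b) {
    return static_cast<uint16_t>(
        (static_cast<uint16_t>(r >> 3) << 11) |
        (static_cast<uint16_t>(g >> 2) << 5)  |
         static_cast<uint16_t>(b >> 3));
}

// raw points at `entries` 4-byte records in file order: B, G, R, reserved.
void loadPalette565(const uint8_t* raw, size_t entries, uint16_t* out) {
    for (size_t i = 0; i < entries; ++i) {
        const uint8_t* e = raw + i * 4;
        out[i] = to565(e[2], e[1], e[0]);   // swap BGR into RGB order
    }
}
```

A 256-entry table stored this way takes 512 bytes instead of 1024.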

The Heart of It: Streaming, Not Buffering

Here’s the payoff for all that setup. To draw the image we never allocate a framebuffer. We compute the on-disk address of the first row, seek there, read exactly one row’s worth of pixels, draw them, advance the address by one padded stride, and repeat. The only memory in play is the file’s own read buffer and a handful of locals.

void proccessUncompressed(uint32_t epd_x, uint32_t epd_y, ...) {
    int16_t  epd_row = (int16_t)epd_y + read_height - 1; // bottom-up, remember
    uint32_t address = bitmap_read.start_row_address;

    for (uint32_t r = 0; r < read_height; ++r, address += row_bit_size) {
        this->seekSet(address + column_offset_bytes); // jump to row start
        for (uint32_t c = 0; c < read_width; c += pixels_per_iteration) {
            // ...unpack one byte's worth of pixels, draw, repeat
        }
        epd_row--; // climb up the panel as we go down the file
    }
}

That pixels_per_iteration value is how we handle the packed low-bit-depth formats without branching inside the hot loop more than necessary. One byte read yields eight 1-bit pixels or two 4-bit pixels; at 8 bits and above, each iteration produces exactly one pixel:

switch (bpp) {
    case 1:  pixels_per_iteration = 8; break;
    case 4:  pixels_per_iteration = 2; break;
    default: pixels_per_iteration = 1;
}

Unpacking 1-bit pixels is a bit-walk from the most significant bit down; 4-bit pixels are two nibbles per byte; everything else is a direct read. For example, the 4-bit case shifts each nibble out and looks it up in the palette:

} else if (bpp == 4) {
    pixels = this->read8();              // one byte = two pixels
    for (int8_t p = 1; p >= 0; --p) {    // signed! a uint8_t loops forever at 0
        uint8_t idx = (pixels >> (4 * p)) & 0xF;
        epd->drawPixel(epd_col++, epd_row,
            pixelColorProccess(
                idx < palette_size ? palette[idx] : 0xFFFF, // bounds guard
                filter));
    }
}

Two details worth stealing. The loop counter is int8_t, not uint8_t - an unsigned counter comparing >= 0 never terminates, an embedded classic that compiles clean and hangs at runtime. And every palette lookup is bounds-checked (idx < palette_size ? ... : 0xFFFF): a corrupt or truncated file should draw a white pixel, not read off the end of your palette array into whatever RAM happens to live there.
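
The 1-bit walk mentioned earlier follows the same pattern. Here is a minimal sketch under the assumption that the caller collects palette indices into a small array before drawing; unpack1bpp and its count parameter (for trimming the last, partly used byte of a row) are hypothetical, not the library’s API.

```cpp
#include <cstdint>

// Expands one byte of a 1-bpp row into up to eight palette indices,
// leftmost pixel first (bit 7). `count` limits how many are written.
void unpack1bpp(uint8_t packed, uint8_t* indices, uint8_t count = 8) {
    // Signed counter, for the same reason as the 4-bit loop above.
    for (int8_t bit = 7; bit >= 0 && (7 - bit) < count; --bit) {
        indices[7 - bit] = (packed >> bit) & 0x01;
    }
}
```

The byte 0xA5 (binary 1010 0101) unpacks to the indices 1, 0, 1, 0, 0, 1, 0, 1 from left to right.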

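For the direct-color depths the body is even simpler - read a 16-, 24-, or 32-bit value and hand it to the pixel processor. A 24-bit pixel is three bytes in B, G, R order; a 16-bit pixel is RGB565 only when the header’s bit masks say so (the format’s default 16-bit layout is 5-5-5 with one unused bit); a 32-bit pixel is B, G, R plus a fourth alpha-or-unused byte we ignore.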
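
As an illustration of the direct-color cases, here is a hedged sketch of two conversions to RGB565. It is not taken from the library. One caveat it bakes in: a 16-bit BMP without bit masks in its header stores pixels as 5-5-5 (one unused top bit), so those values need their green channel widened before they match a 5-6-5 display. The function names are hypothetical.

```cpp
#include <cstdint>

// Three file bytes in B, G, R order to RGB565.
uint16_t bgr24to565(const uint8_t* px) {
    return static_cast<uint16_t>(
        (static_cast<uint16_t>(px[2] >> 3) << 11) |
        (static_cast<uint16_t>(px[1] >> 2) << 5)  |
         static_cast<uint16_t>(px[0] >> 3));
}

// 5-5-5 (unused top bit, then R, G, B) to RGB565.
uint16_t rgb555to565(uint16_t v) {
    uint16_t r = (v >> 10) & 0x1F;
    uint16_t g = (v >> 5)  & 0x1F;
    uint16_t b =  v        & 0x1F;
    // Widen green from 5 to 6 bits by repeating its top bit at the bottom,
    // so full-scale 31 maps to full-scale 63.
    uint16_t g6 = static_cast<uint16_t>((g << 1) | (g >> 4));
    return static_cast<uint16_t>((r << 11) | (g6 << 5) | b);
}
```

Treating 5-5-5 data as 5-6-5 does not crash anything; it just shifts every color toward green, which is easy to mistake for a bad filter setting.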

Filters: Turning Photos Into Three Colors

A black/white/red e-paper panel can show exactly three colors. A photo has millions. Something has to give, and that “something” is a color kernel - a function called once per pixel that maps a full-color input to one of the panel’s real colors. The reader is handed a filter object and calls its kernel method, so the mapping policy is completely decoupled from the parsing:

// Every filter implements this one method.
virtual colorBits_t kernel(uint8_t R, uint8_t G, uint8_t B) = 0;

The simplest filter is plain black-and-white thresholding: average the channels into a gray value, and snap to black or white on either side of a cutoff. The tri-color version adds a “is this pixel convincingly red?” test first:

// A pixel is "red" when red dominates both other channels by a margin.
bool reddish(int16_t R, int16_t G, int16_t B) {
    return ((R - G) > red_thresh_g) && ((R - B) > red_thresh_b);
}
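
Combining the red test with the gray threshold gives a complete tri-color kernel. The sketch below is an assumption-laden illustration: PanelColor, TriColorSketch, the margin values of 80 and the cutoff of 127 are all made up for the example and would need tuning for a real panel.

```cpp
#include <cstdint>

enum PanelColor : uint8_t { PANEL_BLACK, PANEL_WHITE, PANEL_RED };

struct TriColorSketch {
    int16_t red_thresh_g = 80;   // assumed margin of red over green
    int16_t red_thresh_b = 80;   // assumed margin of red over blue
    uint8_t level        = 127;  // assumed gray cutoff

    bool reddish(int16_t R, int16_t G, int16_t B) const {
        return ((R - G) > red_thresh_g) && ((R - B) > red_thresh_b);
    }

    // Red test first, then plain black/white thresholding on the average.
    PanelColor kernel(uint8_t R, uint8_t G, uint8_t B) const {
        if (reddish(R, G, B)) return PANEL_RED;
        uint16_t gray = static_cast<uint16_t>((R + G + B) / 3);
        return gray > level ? PANEL_WHITE : PANEL_BLACK;
    }
};
```

With these values a warm beige such as (200, 180, 170) stays white, because red leads green by only 20, while (220, 30, 30) is claimed as red.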

The quantize filter is the general case: for each pixel, find the nearest color in the panel’s palette by Euclidean distance in RGB space. Nearest-color search means a square root per comparison, and sqrt() on an AVR is painfully slow - so the library borrows the legendary Quake III fast inverse square root trick to approximate it in a few cheap operations:

#define SQRT_MAGIC_F 0x5f3759df
inline float sqrt2(const float x) {
    const float xhalf = 0.5f * x;
    union { float x; int32_t i; } u; // int32_t: a plain int is only 16 bits on AVR
    u.x = x;
    u.i = SQRT_MAGIC_F - (u.i >> 1); // the famous bit-hack initial guess
    return x * u.x * (1.5f - xhalf * u.x * u.x); // one Newton step
}

You don’t need an exact distance to pick the closest color - you only need the ordering to come out right - so the precision an approximate root gives up is precision you were never going to use. It’s the kind of trade that feels like cheating and is actually just engineering.
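
The ordering argument can be pushed one step further. Because the square root is monotonic, comparing squared distances picks the same winner with integer math only. The sketch below shows that alternative; it is a different technique from the approximate root above and is not claimed to be what the library does. Rgb, dist2 and nearestColor are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>

struct Rgb { uint8_t r, g, b; };

// Squared Euclidean distance. The maximum is 3 * 255 * 255 = 195075,
// so 32-bit arithmetic is enough.
static uint32_t dist2(Rgb a, Rgb b) {
    int32_t dr = static_cast<int32_t>(a.r) - b.r;
    int32_t dg = static_cast<int32_t>(a.g) - b.g;
    int32_t db = static_cast<int32_t>(a.b) - b.b;
    return static_cast<uint32_t>(dr * dr + dg * dg + db * db);
}

// Index of the palette entry closest to c; the first entry wins a tie.
size_t nearestColor(Rgb c, const Rgb* palette, size_t count) {
    size_t best = 0;
    uint32_t bestD = UINT32_MAX;
    for (size_t i = 0; i < count; ++i) {
        uint32_t d = dist2(c, palette[i]);
        if (d < bestD) { bestD = d; best = i; }
    }
    return best;
}
```

For a three-entry palette of black, white and pure red, the pixel (250, 10, 10) has squared distances 62700, 120075 and 225, so red wins without a root ever being taken.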

Dithering, the Tasteful Way

Thresholding a photograph to three colors looks awful - smooth gradients become hard blobs. Dithering fixes this by being honest about its mistakes: when it rounds a pixel to the nearest available color, it measures the error it just introduced and pushes that error onto neighboring pixels that haven’t been drawn yet. The errors cancel out across an area, and your eye blends the speckle back into a smooth tone. This is Floyd-Steinberg error diffusion, and it leans on the same trick of the eye as the halftone dots in old newspaper photos.

The classic Floyd-Steinberg kernel spreads the error to four not-yet-drawn neighbors with weights {7, 3, 5, 1} out of 16 - the largest single share to the pixel on the right, the rest to the three pixels in the row below:

void dither(int16_t pixels[5]) {
    int16_t old_c = pixels[0];
    int16_t quant = (old_c > level) ? white_val : 0; // snap to nearest
    int16_t err   = old_c - quant;                   // the mistake we made
    float   adj   = (float)err / 16;

    pixels[0] = (old_c > level) ? color_white : color_black;
    pixels[1] += weights[0] * adj; // right        (7/16)
    pixels[2] += weights[1] * adj; // below-left   (3/16)
    pixels[3] += weights[2] * adj; // below        (5/16)
    pixels[4] += weights[3] * adj; // below-right  (1/16)
}

The naively scary part is that error diffusion seems to need the whole image in memory - you’re writing into the row below the one you’re on. It doesn’t. Because the error only ever lands on the current row or the one directly beneath it, you only ever need two rows resident at once: the row you’re drawing and the row beneath it. The library keeps exactly that - a two-row sliding buffer - and on chips with external SRAM it parks the buffer there instead of precious internal RAM:

// Only ever two rows in flight. For an 800px-wide image that's
// 800 * 2 * sizeof(int16_t) ≈ 3.2 KB - not 1.1 MB.
int16_t *ram_buffer = new int16_t[loadWidth * 2];

That single insight - that a whole-image algorithm has a two-row working set - is what lets a chip with kilobytes of RAM dither a megabyte image and have it come out looking like a proper grayscale photo.
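
To see the two-row working set in action, here is a self-contained sketch of black/white Floyd-Steinberg dithering. It makes several assumptions that differ from the library: the source is an in-memory grayscale vector purely so the example can run on its own (on the device each row would be streamed from the card), the error is spread with integer division instead of floats, and the name ditherTwoRows and the cutoff of 127 are invented for the illustration.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Dithers an 8-bit grayscale image to 0 (black) / 1 (white).
// Only two rows of error-adjusted values are ever held: cur and next.
std::vector<uint8_t> ditherTwoRows(const std::vector<uint8_t>& gray,
                                   int width, int height, int level = 127) {
    std::vector<uint8_t> out(gray.size());
    std::vector<int16_t> cur(width), next(width);

    for (int x = 0; x < width; ++x) cur[x] = gray[x];      // prime first row

    for (int y = 0; y < height; ++y) {
        // Load the row below; past the last row the values are never used.
        for (int x = 0; x < width; ++x)
            next[x] = (y + 1 < height) ? gray[(y + 1) * width + x] : 0;

        for (int x = 0; x < width; ++x) {
            int16_t old_c = cur[x];
            int16_t quant = (old_c > level) ? 255 : 0;      // snap to nearest
            int16_t err   = old_c - quant;                  // the mistake we made
            out[y * width + x] = quant ? 1 : 0;

            if (x + 1 < width) cur[x + 1]  += err * 7 / 16; // right
            if (x > 0)         next[x - 1] += err * 3 / 16; // below-left
            next[x]                        += err * 5 / 16; // below
            if (x + 1 < width) next[x + 1] += err * 1 / 16; // below-right
        }
        std::swap(cur, next);   // the row below becomes the current row
    }
    return out;
}
```

Two adjacent pixels of gray 100 show the diffusion at work: the first rounds to black, passes 7/16 of its error (43) to the right, and the second, now at 143, rounds to white.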

Bonus: Sprites for Free

Because every pixel’s address is pure arithmetic, cropping is also free. Want just the bottom-right 32×32 tile of a sprite sheet? Add a column offset and a starting row to the address math and read only that window - the rest of the file is never touched:

void defineBitmapSprite(uint16_t columns, uint16_t rows) {
    this->sprite.columns = columns;
    this->sprite.rows    = rows;
    this->sprite.width   = this->width()  / columns; // one cell's size
    this->sprite.height  = this->height() / rows;
}

One BMP becomes an entire icon set, a font, or an animation, and drawing any single frame reads only that frame’s bytes off the card. No atlas in RAM, no unpacking step - just a different starting offset and a smaller loop.
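
The address math for one cell can be sketched as follows. This assumes a bottom-up file, a grid of equal cells, and a whole number of bytes per pixel (sub-byte depths would need a bit offset as well); SheetSketch and cellRowAddress are hypothetical names rather than the library’s own.

```cpp
#include <cstdint>

struct SheetSketch {
    uint32_t array_start;    // from the file header
    uint32_t width, height;  // whole image, in pixels
    uint32_t stride;         // padded row length in bytes
    uint8_t  bytesPerPixel;  // 1, 2, 3 or 4
    uint16_t columns, rows;  // grid of equal cells
};

// File offset of the leftmost pixel on line y of cell (col, row).
// y counts downward from the top edge of the cell.
uint32_t cellRowAddress(const SheetSketch& s, uint16_t col, uint16_t row, uint32_t y) {
    uint32_t cellW    = s.width  / s.columns;
    uint32_t cellH    = s.height / s.rows;
    uint32_t imageRow = row * cellH + y;           // counted from the image top
    uint32_t fileRow  = s.height - 1 - imageRow;   // bottom-up storage
    return s.array_start + fileRow * s.stride + col * cellW * s.bytesPerPixel;
}
```

For a 64×64, 24-bit sheet split 2×2 (stride 192, pixels at offset 54), the top line of the bottom-right cell starts at 54 + 31 × 192 + 32 × 3 = 6102.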

The Whole Philosophy in One Sentence

If there’s a single takeaway, it’s this: on tiny hardware, the file on the card is your buffer. Don’t copy the image into RAM to work on it - leave it on disk, compute the address of the exact bytes you need, stream them through a per-pixel function, and let them go. The header tells you where everything lives; padding and bottom-up order are the only traps; and even a “global” operation like dithering turns out to need just two rows at a time. That’s how a chip with 64 KB draws a megabyte photo - by being disciplined about never holding more than a sliver of it at once.

The full reader lives in src/bitmap/ of the SIKTEC-EPD library - SIKTEC_EPD_BITMAP.cpp for the parser and SIKTEC_BITMAP_FILTERS.h for the color kernels and dithering. It handles 1–32 bpp, every common header variant, and several flavors of “technically valid” BMP that GIMP and Illustrator like to emit.