2025, Sep 26 07:00

Why PDFium-rendered pages and images don't match pixel-for-pixel, and how cv2 resizing with higher DPI closes the gap

Learn why PDF and image arrays differ and how to align pipelines: render with PDFium at higher DPI, convert BGRA to RGB, and resize using cv2 for closer match.

When you render a PDF page to a numpy array and do the same for an image that contains identical content, it’s natural to expect pixel-perfect equality. Yet in practice, the arrays can diverge significantly if the pipelines aren’t truly aligned. This guide shows where the mismatch comes from and how to make the two paths comparable without abandoning PDFium.

Baseline: rendering a PDF page and loading an image

The PDF path uses PDFium to rasterize a page into a BGRA buffer, which is then exposed as a numpy array. The image path uses stb_image to decode, resize, and drop alpha to get RGB.

py::array_t<uint8_t> page_to_ndarray(FPDF_PAGE pdf_page,
                                      int out_w = 0,
                                      int out_h = 0,
                                      int render_dpi = 80) {
    int px_w, px_h;
    if (out_w > 0 && out_h > 0) {
        // Explicit target size: render straight to it
        px_w = out_w;
        px_h = out_h;
    } else {
        // Page size is in points (1/72 inch); scale by dpi to get pixels
        px_w = static_cast<int>(FPDF_GetPageWidth(pdf_page) * render_dpi / 72.0);
        px_h = static_cast<int>(FPDF_GetPageHeight(pdf_page) * render_dpi / 72.0);
    }
    FPDF_BITMAP bmp = FPDFBitmap_Create(px_w, px_h, 1); // 1 = with alpha (BGRA)
    if (!bmp) throw std::runtime_error("Failed to create bitmap");
    FPDFBitmap_FillRect(bmp, 0, 0, px_w, px_h, 0xFFFFFFFF); // opaque white background
    FPDF_RenderPageBitmap(bmp, pdf_page, 0, 0, px_w, px_h, 0, FPDF_ANNOT);
    int row_stride = FPDFBitmap_GetStride(bmp);
    uint8_t* raw_ptr = static_cast<uint8_t*>(FPDFBitmap_GetBuffer(bmp));
    // Pass the row stride so any row padding is skipped. With no base object,
    // pybind11 copies the data, so destroying the bitmap afterwards is safe.
    auto out = py::array_t<uint8_t>({px_h, px_w, 4}, {row_stride, 4, 1}, raw_ptr); // BGRA
    FPDFBitmap_Destroy(bmp);
    return out;
}
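To make the dpi arithmetic concrete, here is a minimal Python sketch of the same points-to-pixels conversion. The helper name `page_size_in_pixels` is hypothetical, and the 612×792 point page (US Letter) is only an illustrative input.

```python
def page_size_in_pixels(width_pt: float, height_pt: float, dpi: float) -> tuple[int, int]:
    """Convert a page size in PDF points (1/72 inch) to whole pixels at the given dpi."""
    # int() truncates toward zero, like static_cast<int> in the C++ code
    return int(width_pt * dpi / 72.0), int(height_pt * dpi / 72.0)


# US Letter is 612 x 792 points
letter_at_80 = page_size_in_pixels(612, 792, 80)    # (680, 880)
letter_at_300 = page_size_in_pixels(612, 792, 300)  # (2550, 3300)
```

Note that the default of 80 dpi produces a bitmap far larger than 224×224, so it only matters when no explicit output size is passed.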

In Python, the BGRA output is turned into RGB by dropping alpha and reordering channels.

rgb_from_bgra = bgra_view[:, :, [2, 1, 0]]
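As a quick sanity check of the channel reorder, the following sketch applies the same indexing to a tiny hand-made BGRA array. The pixel values are made up for illustration; `np.ascontiguousarray` is an optional extra step, added here on the assumption that downstream code prefers a C-contiguous buffer.

```python
import numpy as np

# Hypothetical 1x2 BGRA image: one pure-blue pixel, one pure-red pixel (both opaque)
bgra = np.array([[[255, 0, 0, 255],    # B=255, G=0, R=0,   A=255
                  [0, 0, 255, 255]]],  # B=0,   G=0, R=255, A=255
                dtype=np.uint8)

# Pick channels in R, G, B order; alpha (index 3) is simply not selected
rgb = bgra[:, :, [2, 1, 0]]

# Optional: force a C-contiguous copy for libraries that expect one
rgb = np.ascontiguousarray(rgb)
```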

The image path uses stb_image and stb_image_resize to force RGBA, resize to a target resolution, and convert to RGB.

py::array_t<uint8_t> load_image_rgb(const std::string& file_path,
                                   int out_w = 224,
                                   int out_h = 224) {
    int src_w, src_h, src_c;
    unsigned char* rgba_mem = stbi_load(file_path.c_str(), &src_w, &src_h, &src_c, 4);
    if (!rgba_mem) throw std::runtime_error("Failed to load image");
    std::vector<uint8_t> tmp(out_w * out_h * 4);
    stbir_resize_uint8(rgba_mem, src_w, src_h, 0,
                       tmp.data(), out_w, out_h, 0, 4);
    stbi_image_free(rgba_mem);
    py::array_t<uint8_t> rgb({out_h, out_w, 3});
    auto view = rgb.mutable_unchecked<3>();
    for (int yy = 0; yy < out_h; ++yy) {
        for (int xx = 0; xx < out_w; ++xx) {
            int p = (yy * out_w + xx) * 4;
            view(yy, xx, 0) = tmp[p + 0]; // R
            view(yy, xx, 1) = tmp[p + 1]; // G
            view(yy, xx, 2) = tmp[p + 2]; // B
        }
    }
    return rgb;
}

Why the arrays don’t match

The mismatch comes from doing two fundamentally different things. The PDF page was rendered directly to a specific size (for example, 224×224) to keep it fast, while the image was decoded at its source resolution and then downscaled. Rendering a PDF directly at a target size is not equivalent to rasterizing it at a high dpi and then downscaling: in the first case the renderer resolves the vector content straight onto the small pixel grid, while in the second a resampling filter averages a bitmap that has already been rasterized. The result: even if the visual contents are the same, per-pixel values differ.
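The effect can be reproduced without PDFium at all. The sketch below uses a toy rasterizer (a disc sampled at pixel centers, with no anti-aliasing) and a plain box filter. Both are stand-ins chosen for illustration, not what PDFium or stb_image_resize actually do, but they show that "rasterize small" and "rasterize large, then shrink" yield different pixel values for the same shape.

```python
import numpy as np


def rasterize_disc(size: int) -> np.ndarray:
    """Toy rasterizer: sample a centered black disc on white at each pixel center."""
    coords = (np.arange(size) + 0.5) / size  # pixel centers in the unit square
    xx, yy = np.meshgrid(coords, coords)
    inside = (xx - 0.5) ** 2 + (yy - 0.5) ** 2 <= 0.4 ** 2
    return np.where(inside, 0.0, 255.0)


def box_downscale(img: np.ndarray, factor: int) -> np.ndarray:
    """Average non-overlapping factor x factor blocks (a simple box filter)."""
    h, w = img.shape
    return img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))


direct = rasterize_disc(16)                           # rendered straight at 16x16
supersampled = box_downscale(rasterize_disc(128), 8)  # rendered at 128x128, then shrunk
distance = float(np.linalg.norm(direct - supersampled))
```

Here `direct` contains only 0 and 255, while `supersampled` has intermediate grays along the disc edge, so `distance` is nonzero even though both arrays depict the same disc.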

Make the pipelines comparable

The most reliable way to bring the results closer is to align the resizing strategy. Render the PDF at a higher dpi, and then downscale using the same resizer that you use for images. Among the tested options, using cv2 to resize arrays produced the closest match by Euclidean distance, and it has the practical benefit of working directly with numpy arrays. Pillow and a custom C++ resize tied for second place.

This won’t make the arrays perfectly identical. Even when the content is the same, the distance won’t be exactly zero; it gets closer as you increase the PDF render dpi. There’s a trade-off between performance and accuracy, so you’ll need to choose a dpi that fits your constraints.
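The source does not show how the Euclidean distance was computed, so the following is only one reasonable way to do it with numpy. The function name `euclidean_distance` is hypothetical. The cast to float matters: subtracting uint8 arrays directly wraps around (0 - 3 becomes 253) and would inflate the result.

```python
import numpy as np


def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    """L2 distance between two same-shaped arrays, computed in float64."""
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {a.shape} vs {b.shape}")
    # Cast first so uint8 subtraction cannot wrap around
    diff = a.astype(np.float64) - b.astype(np.float64)
    return float(np.sqrt(np.sum(diff ** 2)))


# Two tiny RGB arrays that differ by 3 in every channel of every pixel
a = np.zeros((2, 2, 3), dtype=np.uint8)
b = np.full((2, 2, 3), 3, dtype=np.uint8)
d = euclidean_distance(a, b)  # sqrt(12 * 3**2)
```

Because the raw distance grows with the number of pixels, it is only comparable between runs that use the same output size.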

A unified resizing path with cv2

The goal is simple: render the PDF at sufficiently high dpi, convert BGRA to RGB, and resize with cv2. Do the same for the image array so both go through the same numpy→cv2 path.

One practical tweak is to avoid resizing the image in C++ and let cv2 handle it, just like the PDF. Here is a revised image loader that returns RGB at the source resolution:

py::array_t<uint8_t> read_image_as_rgb(const std::string& file_path) {
    int iw, ih, ic;
    unsigned char* rgba_buf = stbi_load(file_path.c_str(), &iw, &ih, &ic, 4);
    if (!rgba_buf) throw std::runtime_error("Failed to load image");
    py::array_t<uint8_t> rgb({ih, iw, 3});
    auto dst = rgb.mutable_unchecked<3>();
    for (int y = 0; y < ih; ++y) {
        for (int x = 0; x < iw; ++x) {
            int k = (y * iw + x) * 4;
            dst(y, x, 0) = rgba_buf[k + 0]; // R
            dst(y, x, 1) = rgba_buf[k + 1]; // G
            dst(y, x, 2) = rgba_buf[k + 2]; // B
        }
    }
    stbi_image_free(rgba_buf);
    return rgb;
}

With both sources as numpy RGB arrays, the Python side can keep the resizing identical using cv2:

import cv2
# page_to_ndarray and read_image_as_rgb are the C++ bindings shown above;
# some_dpi is a render dpi of your choosing, pdf_page and path_to_image are your inputs
# PDF: render at higher dpi, then convert BGRA -> RGB
pdf_bgra = page_to_ndarray(pdf_page, out_w=0, out_h=0, render_dpi=some_dpi)
pdf_rgb = pdf_bgra[:, :, [2, 1, 0]]
pdf_resized = cv2.resize(pdf_rgb, (224, 224))  # dsize is (width, height)
# Image: decode to RGB at source size, then use the same cv2 resizer
img_rgb = read_image_as_rgb(path_to_image)
img_resized = cv2.resize(img_rgb, (224, 224))

This brings the two outputs much closer numerically. Based on Euclidean distance, cv2 delivered the best match among the options tried, while Pillow and the custom C++ image path tied for second place.
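Since the distance shrinks as the render dpi grows, one way to choose a dpi is to sweep a few candidates and keep the lowest one that is close enough. The sketch below assumes a hypothetical callable `render_at_dpi` that wraps the whole PDF path (render, BGRA to RGB, resize) and returns an array with the same shape as the reference image. The function names and the tolerance are illustrative, not part of the original pipeline.

```python
import numpy as np


def sweep_dpi(render_at_dpi, reference: np.ndarray, dpis) -> list:
    """Return (dpi, distance) pairs, one per candidate dpi.

    render_at_dpi: hypothetical callable, dpi -> array shaped like reference.
    """
    ref = reference.astype(np.float64)
    results = []
    for dpi in dpis:
        candidate = render_at_dpi(dpi).astype(np.float64)
        results.append((dpi, float(np.linalg.norm(candidate - ref))))
    return results


def lowest_dpi_within(results, tolerance: float):
    """Smallest dpi whose distance is within tolerance, or None if none qualifies."""
    for dpi, dist in sorted(results):
        if dist <= tolerance:
            return dpi
    return None
```

In practice `reference` would be the resized image array, and the tolerance would come from whatever your downstream comparison can tolerate.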

Why this matters

Many downstream tasks assume that a PDF rendition and a raster image with the same content will align pixel-for-pixel. In reality, rasterization strategies matter. Aligning how and when you resize ensures your comparisons are meaningful and your metrics reflect content rather than pipeline artifacts. If you care about tight numerical similarity, the dpi you choose for PDF rendering has a measurable impact, so it’s worth tuning.

Conclusion

Directly rendering a PDF to a target size is not the same as rasterizing at high dpi and then downscaling, and that difference alone can produce “very different arrays.” Unify the pipelines by rendering PDFs at a higher dpi, converting BGRA to RGB, and resizing with cv2. Do the same for images so both go through the same numpy→cv2 path. Expect results to be very close but not identical, and pick a dpi that balances performance and accuracy for your workload.

The article is based on a question from StackOverflow by Something Something and an answer by Something Something.