2025, Oct 17 14:00

How to Extract Missing XFA PDF Field Coordinates: Use PyMuPDF to Detect White Input Boxes from Page Drawings

Learn why PageItemUIDToLocationDataMap can miss XFA PDF fields and recover coordinates by detecting white rectangles with PyMuPDF. Code included.

Extracting coordinates of form fields from Adobe XFA PDFs often looks straightforward until a subset of fields simply refuses to appear in your data. A common route is to read /PageItemUIDToLocationDataMap under /PieceInfo, but in some files only a fraction of fields show up. The task, then, is to recover the missing coordinates without relying on commercial tooling.

Problem setup

The approach below iterates pages, pulls the /PageItemUIDToLocationDataMap from /PieceInfo[/InDesign], writes the results to CSV, and places marks on the PDF. It demonstrates the issue: on certain pages the map omits fields that are visibly present when you open the document in a reader.

import pikepdf
import fitz  # PyMuPDF
import csv

SRC_PDF = "input.pdf"
CSV_OUT = "points.csv"
PDF_OUT = "output.pdf"
MAP_KEY = "/PageItemUIDToLocationDataMap"

def pull_datamap_points(pdf_file, target_key=MAP_KEY):
    rows_out = []
    with pikepdf.open(pdf_file) as pdf:
        for idx, pg in enumerate(pdf.pages):
            piece_meta = pg.get('/PieceInfo', None)
            if piece_meta and '/InDesign' in piece_meta:
                idn = piece_meta['/InDesign']
                if target_key in idn:
                    for key, val in idn[target_key].items():
                        try:
                            uid = int(str(key).lstrip('/'))
                            kind_val = float(val[2])
                            crds = [float(n) for n in list(val)[3:7]]
                            rows_out.append([idx + 1, uid, kind_val] + crds)
                        except Exception as err:
                            print(f"Error parsing {key}:{val} ({err})")
    return rows_out

def count_pages(pdf_file):
    with pikepdf.open(pdf_file) as pdf:
        return len(pdf.pages)

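def normalize_rows(data_rows, max_pages):
    # Sample-specific y transform: 420.945 pt is half the A4 page height (841.89 pt)
    Y_BASE = 420.945

    # Sample-specific page remapping: page 1 stays, the last page becomes page 2,
    # and every page in between shifts forward by one
    map_page = lambda p: 2 if (p >= max_pages) else (p + 1 if p > 1 else p)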
    norm_rows = []
    for rec in data_rows:
        pg, uid, kind, x0, y0, x1_, y1_ = rec
        pg_fix = map_page(pg)
        y0n = round(Y_BASE - y0, 3)
        y1n = round(Y_BASE - y1_, 3)
        x0n = round(x0, 3)
        x1n = round(x1_, 3)
        height = round(abs(y1n - y0n), 1)
        norm_rows.append([pg_fix, uid, kind, x0n, y0n, x1n, y1n, height])
    return norm_rows

def order_and_pick(data_rows):
    rows_sorted = sorted(data_rows, key=lambda r: (r[0], -r[6], r[3], r[1]))
    picked = []
    for rec in rows_sorted:
        # Keep only type 4 entries that are 17 pt high (the filter used for this sample)
        if rec[2] == 4 and rec[7] == 17:
            picked.append(rec)
    return picked

def export_csv(csv_path, data_rows):
    with open(csv_path, 'w', newline='', encoding='utf-8') as f:
        w = csv.writer(f)
        w.writerow(['page', 'id', 'type', 'x1', 'y1', 'x2', 'y2', 'h'])
        w.writerows(data_rows)

def paint_points(src_pdf, out_pdf, data_rows):
    doc = fitz.open(src_pdf)
    for rec in data_rows:
        pno = int(rec[0])
        cx = rec[3]
        cy = rec[6]
        page = doc[pno - 1]
        y_mupdf = page.rect.height - cy
        page.draw_circle((cx, y_mupdf), radius=2, color=(0, 0, 0), fill=(0, 0, 0))
    doc.save(out_pdf)
    doc.close()

if __name__ == "__main__":
    raw_points = pull_datamap_points(SRC_PDF)
    pg_count = count_pages(SRC_PDF)
    mapped_points = normalize_rows(raw_points, pg_count)
    final_points = order_and_pick(mapped_points)
    export_csv(CSV_OUT, final_points)
    paint_points(SRC_PDF, PDF_OUT, final_points)
    print(f"Done. Points: {len(final_points)}; Wrote {CSV_OUT} and {PDF_OUT}")

To inspect everything the map returned before any filtering, skip order_and_pick and pass mapped_points straight to export_csv.
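To see where the map is thin, it can help to tally the unfiltered rows before applying any filter. The sketch below assumes rows shaped like the output of normalize_rows ([page, uid, kind, x0, y0, x1, y1, height]); the helper name tally_rows and the sample values are illustrative only and do not come from a real file.

```python
from collections import Counter

def tally_rows(rows):
    """Count rows per (page, type, height) so sparse pages stand out."""
    counts = Counter((r[0], r[2], r[7]) for r in rows)
    return dict(sorted(counts.items()))

# Hypothetical rows in the normalize_rows layout, not taken from a real file
sample_rows = [
    [1, 101, 4.0, 50.0, 300.0, 150.0, 283.0, 17.0],
    [1, 102, 4.0, 50.0, 270.0, 150.0, 253.0, 17.0],
    [1, 103, 2.0, 40.0, 400.0, 500.0, 100.0, 300.0],
    [3, 201, 4.0, 50.0, 300.0, 150.0, 283.0, 17.0],
]

for (page, kind, height), n in tally_rows(sample_rows).items():
    print(f"page {page}: type {kind:g}, height {height:g} -> {n} row(s)")
```

A page that is fillable in a viewer but shows few or no rows of the expected type in this tally is a candidate for the drawing-based approach.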

What is actually going on

/PageItemUIDToLocationDataMap is useful, but it is not guaranteed to contain every visible input area for all pages in such documents. In the sample, pages that are clearly fillable in a viewer still do not report all expected entries when relying solely on that map. At the same time, those missing fields are visually present on the page as shapes, which opens a separate, reliable path to discover them.

Reframing the goal from “read every XFA field from the map” to “find the white input boxes present on the page” avoids the omission. The boxes can be detected directly from the page’s drawing instructions by looking for filled rectangles with white color.

Solution: detect the white boxes from page drawings

Scanning page drawings and filtering for white-filled rectangles recovers the missing coordinates and works without proprietary tools. The code below writes the rectangle coordinates to CSV and marks each box with a small circle on the page.

import fitz  # PyMuPDF
import csv

SRC_DOC = "input.pdf"
MARKED_PDF = "output.pdf"
BOXES_CSV = "output.csv"

def is_color(color, target=(1, 1, 1)):
    return color == target

pdf = fitz.open(SRC_DOC)

# Zero-based page indices; adjust as needed, or use range(len(pdf)) to process all
pages_idx = [1]

with open(BOXES_CSV, mode="w", newline="", encoding="utf-8") as fh:
    w = csv.writer(fh)
    w.writerow(["page_num", "x0", "y0", "x1", "y1"])
    for idx in pages_idx:
        pg = pdf[idx]
        vector_ops = pg.get_drawings()
        canvas = pg.new_shape()
        for op in vector_ops:
            rc = op.get("rect")
            fc = op.get("fill")
            if rc and is_color(fc, target=(1, 1, 1)):
                x0, y0, x1, y1 = rc
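                # Mark the bottom-left corner of the box (y grows downward in PyMuPDF)
                cx, cy = x0, y1
                canvas.draw_circle((cx, cy), 2)
                w.writerow([idx, x0, y0, x1, y1])  # idx is the zero-based page index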
        canvas.finish(color=(0, 0, 1), fill=None)
        canvas.commit()

pdf.save(MARKED_PDF)
pdf.close()

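This approach is effective because the sought-after areas are represented as white rectangles in the page content stream. By collecting drawing operations and checking for fill color (1, 1, 1), the coordinates of those areas become available for further automation. The values are in PyMuPDF's page space, where the origin is the top-left corner and y increases downward, so they differ from native PDF coordinates, which start at the bottom-left.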
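Exact comparison with (1, 1, 1) skips near-white fills, and the "rect" entry reported by get_drawings() is the bounding box of a path, whatever its shape. A slightly stricter filter, sketched below, accepts a small color tolerance and keeps only paths that contain a rectangle item. It works on plain dictionaries shaped like get_drawings() output (keys "rect", "fill" and "items", with rectangle items tagged "re"); the helper names, the tolerance value and the sample data are assumptions for illustration.

```python
def is_near_white(fill, tol=0.01):
    """True if fill is an RGB triple within tol of pure white."""
    if fill is None:  # stroke-only paths have no fill
        return False
    return len(fill) == 3 and all(abs(1.0 - c) <= tol for c in fill)

def has_rect_item(path):
    """True if any draw command in the path is a rectangle ("re")."""
    return any(item[0] == "re" for item in path.get("items", []))

def white_boxes(paths, tol=0.01):
    """Return (x0, y0, x1, y1) tuples for white-filled rectangle paths."""
    boxes = []
    for path in paths:
        rect = path.get("rect")
        if rect is None:
            continue
        if is_near_white(path.get("fill"), tol) and has_rect_item(path):
            boxes.append(tuple(rect))
    return boxes

# Hand-made stand-ins for get_drawings() entries (hypothetical values)
paths = [
    {"rect": (50, 100, 150, 117), "fill": (1.0, 1.0, 1.0),
     "items": [("re", (50, 100, 150, 117))]},
    {"rect": (50, 130, 150, 147), "fill": (0.999, 0.999, 0.999),
     "items": [("re", (50, 130, 150, 147))]},
    {"rect": (50, 160, 150, 177), "fill": (0.0, 0.0, 0.0),
     "items": [("re", (50, 160, 150, 177))]},
    {"rect": (10, 10, 90, 90), "fill": (1.0, 1.0, 1.0),
     "items": [("c", (10, 50), (10, 10), (90, 10), (90, 50))]},
    {"rect": (0, 200, 300, 200), "fill": None,
     "items": [("l", (0, 200), (300, 200))]},
]

print(white_boxes(paths))
```

With PyMuPDF, white_boxes(page.get_drawings()) should work unchanged, since a fitz.Rect unpacks into four numbers. Whether the tolerance or the rectangle check helps depends on how the file was produced, so the plain equality test remains a reasonable first attempt.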

Why knowing this matters

Field discovery in complex PDFs can follow more than one path. If one metadata-based route returns partial results, directly analyzing the page drawing instructions offers a practical fallback. Once the bounding boxes are identified, you can proceed to place your own form fields and automate data entry, while staying entirely within open-source tooling.
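Before placing fields, the detected boxes usually need a stable order and a name. The sketch below sorts boxes by page, then top to bottom, then left to right (PyMuPDF's y axis grows downward), and assigns sequential names. The naming scheme and the helper plan_fields are hypothetical; the resulting rectangles could then be handed to a widget API such as PyMuPDF's Page.add_widget.

```python
def plan_fields(boxes):
    """Order boxes in reading order and give each one a field name.

    boxes: iterable of (page, x0, y0, x1, y1) in PyMuPDF coordinates.
    Returns a list of dicts with "name", "page" and "rect" keys.
    """
    plan = []
    counters = {}  # next sequence number per page
    for page, x0, y0, x1, y1 in sorted(boxes, key=lambda b: (b[0], b[2], b[1])):
        counters[page] = counters.get(page, 0) + 1
        name = f"p{page}_field{counters[page]:03d}"  # hypothetical naming scheme
        plan.append({"name": name, "page": page, "rect": (x0, y0, x1, y1)})
    return plan

# Hypothetical boxes, deliberately out of order
boxes = [
    (1, 200.0, 100.0, 300.0, 117.0),
    (1, 50.0, 100.0, 150.0, 117.0),
    (1, 50.0, 60.0, 150.0, 77.0),
    (2, 50.0, 60.0, 150.0, 77.0),
]

for field in plan_fields(boxes):
    print(field["name"], field["rect"])
```

Sorting on exact y values means boxes that sit on the same visual row but differ by a fraction of a point may interleave; rounding y0 before sorting is one possible refinement.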

Takeaways

When /PageItemUIDToLocationDataMap does not list all fields, treat the task as locating the visual input boxes instead of mining XFA maps. Extract drawing primitives with PyMuPDF, filter for white-filled rectangles, and record their coordinates. Keep the CSV export unfiltered during exploration, and only then introduce filters for types or sizes that match your specific page layout. This combination gives you a robust, non-proprietary pipeline for recovering the positions you need and moving on to automated filling.
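A size filter is one way to narrow the unfiltered list once its contents are understood, for example to drop a white page background or thin white rules that also pass the color test. The bounds below are placeholders, not values from the source file, and filter_by_size is a hypothetical helper operating on (page, x0, y0, x1, y1) rows with numeric values.

```python
def filter_by_size(rows, min_w=10.0, max_w=400.0, min_h=8.0, max_h=30.0):
    """Keep rows whose width and height fall inside the given bounds (points)."""
    kept = []
    for page, x0, y0, x1, y1 in rows:
        width = abs(x1 - x0)
        height = abs(y1 - y0)
        if min_w <= width <= max_w and min_h <= height <= max_h:
            kept.append((page, x0, y0, x1, y1))
    return kept

# Hypothetical rows: a page-sized background, a plausible input box, a thin rule
rows = [
    (1, 0.0, 0.0, 595.0, 842.0),
    (1, 50.0, 100.0, 150.0, 117.0),
    (1, 50.0, 130.0, 150.0, 130.5),
]

print(filter_by_size(rows))
```

Rows read back from the CSV arrive as strings, so they would need converting with float() before being passed to a helper like this.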

The article is based on a question from Stack Overflow by flywire and an answer by flywire.
