2025, Nov 19 03:00

From Indented HTML Tables to Nested JSON: a Path-Based, Stack-Free Strategy with BeautifulSoup in Python

Learn how to convert an indented HTML table into a clean nested JSON tree using a path-based parser in Python with BeautifulSoup—no recursion, stable hierarchy.

From indented HTML table to nested JSON: a clean path-based approach

Transforming an indented HTML table into a properly nested JSON tree looks straightforward until you try to map visual indentation to a real parent/child structure. The common temptation is to recurse over rows and push nodes onto a stack, but without a reliable way to maintain the active ancestry, grandchildren often get lost on the way. Below is a practical walkthrough that parses the Wikipedia table of industry sectors and builds a nested JSON out of it.

The problematic approach

The initial idea uses BeautifulSoup to iterate table rows, extracts indentation from the style attribute, and tries to stitch the hierarchy via recursion and a shared stack. The structure looks reasonable, but children of children don’t consistently end up in the right place.

import bs4

fp = open("./webpage.html", "r", encoding="utf8")
doc = bs4.BeautifulSoup(fp, "html.parser")

grid = doc.find(
    "table", attrs={"class": "collapsible wikitable mw-collapsible mw-made-collapsible"}
)
lines = grid.find_all("tr")

bucket = []

def dive(base_i):
    for rel_i, tr in enumerate(lines[base_i:]):
        tds = tr.find_all("td")
        if len(tds) == 0:
            continue
        if not tds[0].attrs['style']:
            continue
        depth = int(tds[0].attrs["style"].strip(';')[-3])
        node = {
            "name": tds[0].text,
            "value": tds[1].text,
            "indent": depth,
            "children": [],
        }
        while bucket:
            ancestor = bucket[-1]
            for inner_i, tr_in in enumerate(lines[base_i + rel_i + 1:]):
                tds_in = tr_in.find_all("td")
                depth_in = int(tds_in[0].attrs["style"].strip(';')[-3])
                node_in = {
                    "name": tds_in[0].text,
                    "value": tds_in[1].text,
                    "indent": depth_in,
                    "children": [],
                }

                if depth == ancestor["indent"]:
                    leaf = bucket.pop()
                    bucket[-1]["children"].append(leaf)
                    bucket.append(node)

                if ancestor["indent"] - depth == -1:
                    bucket.append(node)
                    dive(base_i + rel_i + 1)
                    ancestor["children"].append(node)

                if depth < ancestor["indent"]:
                    return
        bucket.append(node)

dive(0)

One practical detail: for rows without an explicit style, the indentation index needs a baseline. In practice, setting style="padding-left: 0em" on the first non-header cell makes that baseline explicit.

Why this breaks down

The loop mixes three moving parts at once: forward iteration through rows, a recursive descent that jumps ahead, and a mutable stack that is supposed to represent the current ancestry. The inner scanning over subsequent rows, combined with stack mutations and early returns, makes it hard to keep a consistent path from the root to the current node. When the depth increases and then decreases again, the algorithm loses track of the correct parent to attach grandchildren, which is why descendants may be misplaced or skipped.

The fix: separate parsing from tree building

A robust way forward is to decouple concerns. First, read the table into a simple stream of tuples containing the indent level, the label, and the numeric value. Second, consume that stream while maintaining a path of lists that represent the chain from the root to the current level. Whenever a row arrives with a given indent, truncate the path to that level, append the new node to the last list in the path, and extend the path with the new node’s children list. This keeps the ancestry consistent at all times.

import bs4
import re

def scan_table(markup):
    tbl = markup.find(
        "table", attrs={"class": "collapsible wikitable mw-collapsible mw-made-collapsible"}
    )
    for tr in tbl.find_all("tr"):
        tds = tr.find_all("td")
        if len(tds) != 2:
            continue
        m = re.match(r"padding-left: *(\d+)em", tds[0].attrs.get('style', ''))
        lvl = int(m[1]) if m else 0
        yield lvl, tds[0].text, int(re.sub(r",|\D.*", "", tds[1].text))

def build_tree(markup):
    root = []
    trail = [root]
    for lvl, label, amount in scan_table(markup):
        kids = []
        del trail[lvl+1:]
        trail[-1].append({
            "name": label,
            "value": amount,
            "indent": lvl,
            "children": kids,
        })
        trail.append(kids)
    return root

fp = open("./webpage.html", "r", encoding="utf8")
html = bs4.BeautifulSoup(fp, "html.parser")
result = build_tree(html)

How the path strategy works

The key is the path list. It always contains, in order, the child lists from the root down to the current depth. When a new row arrives at indent k, everything deeper than k is discarded by slicing, so the last element of the path is exactly the list to which this node should be appended. Right after that, the node’s own children list is pushed onto the path, making it the active target for potential descendants. As soon as the indent decreases, the path contracts to the correct ancestor. No ad hoc stack unwinding, no forward scanning, no recursive jumps are required.

Why this matters

Many real-world HTML tables encode hierarchy visually rather than structurally. When you must convert that into clean nested JSON for downstream use, a stable way to translate indentation into parentage is critical. A small, deterministic builder that maintains the ancestry path avoids fragile recursion, makes the intent obvious, and drastically reduces the chance of wiring children to the wrong parent.

Takeaways

Parse first, build later. Stream the rows into a normalized sequence of depth, label, and value. Maintain a path of child lists that mirrors the current depth. On each row, truncate the path to the row’s indent, append the node to the last list, and extend the path with its children. The result is a compact hierarchy builder that produces predictable nested JSON from an indented table without wrestling with complex control flow. If a baseline indent is missing, make it explicit by ensuring the first data cell starts at padding-left: 0em.

data-structures json python