2025, Oct 30 19:00

Flattening Nested JSON with Lists into a Single-Row Structure in Python: a Recursive Approach That Outperforms json_normalize

Flatten deeply nested JSON with lists into a single row in Python: a recursive flattener that goes beyond pandas.json_normalize and explode while preserving every value by index.

Flattening deeply nested JSON into a single-row structure sounds straightforward until lists of objects and multi-level nesting enter the picture. Standard tools like pandas.json_normalize and explode handle many cases, but when your payload mixes lists of dicts with further lists inside, they will often leave you with columns that still contain lists. If the end goal is a flat, single-record object with indexed keys for every nested value, you need a deterministic way to traverse and unroll the structure.

Problem in context

The JSON comes from an Elasticsearch response: a record with many scalar fields and several arrays of objects such as icdDiagnosisCodes, serviceProcedures, procedureCodeModifiers and serviceDiagnoses. Using json_normalize with a capped max_level or chaining explode did not yield a single flat row; list-valued columns remained as is.

import requests
import pandas as pd

# Pull the document from Elasticsearch (ELASTICSEARCH_URL, QUERY, config and
# CLAIMTYPE are defined elsewhere in the script).
resp = requests.get(
    ELASTICSEARCH_URL,
    data=QUERY,
    auth=(config.get('username'), config.get('password')),
    verify=False,
    headers={'Content-Type': 'application/json'}
)
blob = resp.json()

# The nested record of interest sits inside the first hit.
slice_ = blob["hits"]["hits"][0]["_source"]["response"]["data"][CLAIMTYPE]

# Normalize with a depth cap, then try to normalize a list-valued column
# separately -- neither step yields one flat row.
frame_main = pd.json_normalize(slice_, max_level=2).fillna('')
frame_nested = pd.json_normalize(frame_main['icdDiagnosisCodes'])
print(frame_nested.head(10).to_string())

Even after normalization, fields like icdDiagnosisCodes and serviceProcedures are still lists, and further attempts to explode them do not produce one flat row with dedicated columns for each nested attribute.

What’s really going on

json_normalize flattens dictionaries into columns, but it does not automatically expand every list of dicts into a set of disambiguated scalar columns: a list-valued field stays a list unless you target it explicitly with record_path, and a capped max_level stops even the dictionary flattening at a fixed depth. explode can turn a list-like column into multiple rows, one per element, but if you need a single-row representation where each nested value becomes its own field, you need to walk the structure yourself and assign stable, unique keys.
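
To see the limitation concretely, here is a minimal sketch with a made-up record (the field names are only illustrative): json_normalize flattens the nested dict into dotted columns, while the list of dicts survives as a single list-valued cell.

import pandas as pd

record = {
    "claimId": "1",
    "provider": {"city": "SOME CITY"},
    "icdDiagnosisCodes": [
        {"icdDiagnosisCode": "M25551"},
        {"icdDiagnosisCode": "M545"},
    ],
}
df = pd.json_normalize(record)
print(df.columns.tolist())                   # 'icdDiagnosisCodes' is still one column
print(type(df.loc[0, "icdDiagnosisCodes"]))  # <class 'list'>: the list of dicts sits in a single cell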

There is also a practical requirement from the data model: all values must be retained even when they repeat, and fields like procedureCodeModifiers can contain many items (up to 25). This rules out deduplication and demands a keying scheme that indexes items in order.

A focused, recursive flattener

The approach below walks the dictionary recursively, concatenating nested keys with a separator for readability and appending numeric indices for list positions. It preserves all values and disambiguates duplicates by position, which satisfies the need to keep every procedure code modifier, in order, as a field of its own.

def squash_map(obj_map, base_key=None, glue="___"):
    flat = {}
    for k, v in obj_map.items():
        new_key = k if base_key is None else base_key + glue + k
        if isinstance(v, dict):
            # Nested object: recurse, extending the key path.
            flat.update(squash_map(v, base_key=new_key, glue=glue))
        elif isinstance(v, list):
            # List: suffix the key with the item's position so duplicates survive.
            for idx, item in enumerate(v):
                if isinstance(item, dict):
                    flat.update(squash_map(item, base_key=f"{new_key}_{idx}", glue=glue))
                else:
                    flat[f"{new_key}_{idx}"] = item
        else:
            # Scalar (string, number, bool, None): keep as-is.
            flat[new_key] = v
    return flat

Here is a minimal slice of the data to illustrate how list indexes and key concatenation are applied:

sample_payload = {
    "providerCity": "SOME CITY",
    "providerSpecialtyDescription": "PHYSICAL/OCCUPATIONAL THERAPY",
    "updateDate": "YYYY-MM-DD",
    "providerNpi": "XXXXXXXXXXX",
    "icdDiagnosisCodes": [
        {
            "icdDiagnosisCode": "M25551",
            "icdDiagnosisDecimalCode": "M25.551",
            "icdDiagnosisCodeDescription": "PAIN IN RIGHT HIP"
        },
        {
            "icdDiagnosisCode": "M545",
            "icdDiagnosisDecimalCode": "M54.5",
            "icdDiagnosisCodeDescription": "LOW BACK PAIN"
        }
    ],
    "dateOfBirth": "YYYY-MM-DD"
}
from pprint import pprint
pprint(squash_map(sample_payload))
{'dateOfBirth': 'YYYY-MM-DD',
 'icdDiagnosisCodes_0___icdDiagnosisCode': 'M25551',
 'icdDiagnosisCodes_0___icdDiagnosisCodeDescription': 'PAIN IN RIGHT HIP',
 'icdDiagnosisCodes_0___icdDiagnosisDecimalCode': 'M25.551',
 'icdDiagnosisCodes_1___icdDiagnosisCode': 'M545',
 'icdDiagnosisCodes_1___icdDiagnosisCodeDescription': 'LOW BACK PAIN',
 'icdDiagnosisCodes_1___icdDiagnosisDecimalCode': 'M54.5',
 'providerCity': 'SOME CITY',
 'providerNpi': 'XXXXXXXXXXX',
 'providerSpecialtyDescription': 'PHYSICAL/OCCUPATIONAL THERAPY',
 'updateDate': 'YYYY-MM-DD'}

The display ordering in the snippet comes from the pretty printer; the insertion order of entries is preserved in the returned dictionary itself.
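
A quick check makes that concrete: iterating over the returned dict yields keys in the order they were inserted during traversal, not the alphabetical order pprint shows.

print(list(squash_map(sample_payload))[:3])
['providerCity', 'providerSpecialtyDescription', 'updateDate']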

Why this addresses the constraints

This strategy emits one flat dictionary that you can treat as a single record. Every nested list item is indexed in order, which means duplicates are retained by design. Fields like procedureCodeModifier will naturally appear as procedureCodeModifiers_0___procedureCodeModifier, procedureCodeModifiers_1___procedureCodeModifier, and so on, covering the case where there may be many modifiers. The same holds for nested arrays like serviceDiagnoses inside serviceProcedures, since indexing repeats at each level of the path.
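
As a small, hypothetical illustration of that last point (the field values below are made up, and the shape assumes serviceDiagnoses is a list nested inside each serviceProcedures item), the positional index simply reappears at every level of the path:

nested_sample = {
    "serviceProcedures": [
        {
            "procedureCode": "97110",
            "serviceDiagnoses": [
                {"serviceDiagnosisCode": "M25551"},
                {"serviceDiagnosisCode": "M545"}
            ]
        }
    ]
}
pprint(squash_map(nested_sample))
{'serviceProcedures_0___procedureCode': '97110',
 'serviceProcedures_0___serviceDiagnoses_0___serviceDiagnosisCode': 'M25551',
 'serviceProcedures_0___serviceDiagnoses_1___serviceDiagnosisCode': 'M545'}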

Why it’s worth knowing

When you need a single-row representation of a complex JSON document, especially one with lists of objects inside lists, generic flattening often falls short. A small, explicit traversal gives you a predictable column naming scheme, stable ordering via indices, and no loss of values. That is essential when your downstream consumers expect fields to be present regardless of duplicates or count, and when different payloads vary in the number of nested items.

Takeaways

If your target is a truly flat object, start by defining how keys should be constructed and how lists should be indexed. Use a dedicated traversal to produce that shape, then feed the result into your DataFrame workflow. This keeps nested lists from leaking into columns and satisfies requirements like “retain all values” and “support many modifiers” without ambiguity.
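
For example, assuming slice_ still holds the nested record fetched from Elasticsearch earlier, the flattened dict drops straight into a one-row DataFrame:

flat_record = squash_map(slice_)
frame = pd.DataFrame([flat_record])  # exactly one row, one column per flattened key
print(frame.shape)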

The article is based on a question from StackOverflow by Bhavani Kumar Metla and an answer by Swifty.