2025, Oct 24 03:00

Streaming large JSON from Zstandard-compressed .json.zstd with Python ijson (no newline-delimited JSON)

Stream large JSON from .json.zstd in Python: feed a Zstandard stream to ijson for incremental parsing without loading into memory; no newline-delimited JSON

When a JSON payload is too large to fit in memory, the only sensible approach is to stream it. Things get trickier when the JSON is stored inside a .json.zstd file and the structure isn’t newline-delimited. Iterating line by line won’t help if the entire JSON lives on a single line, and loading everything at once defeats the point. The good news: you can feed a zstd decompression stream directly to a streaming JSON parser and iterate without ever materializing the full document.

Problem setup

Assume you build a large compressed file iteratively from chunks that can be either strings or dictionaries with a text field. A minimal data sample could look like this:

[{"text": "very very"}, " very very", " very very very", {"text": " very very long"}]

Writing those chunks into a single JSON object compressed with zstd might look like this:

import zstandard as zstd

# creates a .json.zstd file holding {"text": "..."}
def dump_stream(zpath, pieces, lvl=22):
    comp = zstd.ZstdCompressor(level=lvl)
    with open(zpath, 'wb') as fp_out:
        with comp.stream_writer(fp_out) as zw:
            for pos, part in enumerate(pieces):
                if isinstance(part, dict):
                    part = part["text"]
                elif not isinstance(part, str):
                    raise ValueError(f"Unrecognized chunk {type(part)}")
                # pieces are written verbatim, so they must already be JSON-safe
                if pos == 0:
                    part = '{"text": "' + part
                zw.write(part.encode("utf-8"))
            zw.write(b'"}')  # close the string value and the object

# toy data
dump_stream(
    "test.json.zstd",
    [{"text": "very very"}, " very very", " very very very", {"text": " very very long"}],
    lvl=22,
)
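
As a quick, illustrative sanity check, decompressing the file back shows that the whole document sits on a single line (the expected output follows from the toy data above):

import zstandard as zstd

# decompress the full stream and print the single-line JSON document
with open("test.json.zstd", "rb") as fh:
    with zstd.ZstdDecompressor().stream_reader(fh) as zr:
        print(zr.read().decode("utf-8"))
# {"text": "very very very very very very very very very long"}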

A straightforward attempt to read it line by line from the zstd stream won’t work if the JSON is all on one line:

import io, json
import zstandard as zstd

def scan_zstd_lines(infile):
    dec = zstd.ZstdDecompressor()
    with open(infile, 'rb') as raw:
        with dec.stream_reader(raw) as zr:
            txt = io.TextIOWrapper(zr, encoding='utf-8')
            for ln in txt:
                if ln.strip():
                    yield json.loads(ln)

# next(scan_zstd_lines("test.json.zstd"))  # not helpful for a single-line JSON

Why line-by-line fails in this case

Iterating lines assumes newline boundaries exist in the text, but a serialized JSON document often occupies a single line. Splitting on newlines then yields the entire document as one chunk, which is exactly what we are trying to avoid. Reading fixed-size chunks with read(n) would stream the text, but you would still have to implement robust JSON tokenization on top of it to extract complete elements without breaking on commas or braces inside nested values.
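
You can verify this with the generator above: its first (and only) non-empty "line" is the entire document, already fully parsed, so nothing was actually streamed:

print(next(scan_zstd_lines("test.json.zstd")))
# {'text': 'very very very very very very very very very long'}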

Streaming the compressed JSON with ijson

ijson can consume file-like objects, including objects returned by urlopen() and io.BytesIO(). A zstd stream reader is also file-like, so you can hand it to ijson and iterate incrementally. Passing an empty prefix ('') selects the top-level JSON value in the stream.

import zstandard as zstd
import ijson

# iterate items from a zstd-compressed JSON stream
def stream_zstd_json(zpath):
    dec = zstd.ZstdDecompressor()
    with open(zpath, 'rb') as fh:
        with dec.stream_reader(fh) as zstream:
            # empty prefix selects the top-level JSON value
            yield from ijson.items(zstream, '')

# consume
fname = 'test.json.zstd'
for obj in stream_zstd_json(fname):
    print(obj)
    print('---')

This yields JSON items straight from the decompressed stream. For a single top-level object like {"text": "..."}, the iterator produces that one object once the parser has assembled it; the real streaming win comes with container values, where a prefix lets you consume elements one at a time.
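
For instance, if the file held a top-level array of objects, a minimal sketch of the same pattern (using a hypothetical arrays.json.zstd file) would pass ijson's 'item' prefix so each array element is yielded as soon as it is parsed:

import zstandard as zstd
import ijson

# hypothetical file holding a top-level array: [{"text": "..."}, {"text": "..."}, ...]
def stream_zstd_array(zpath):
    dec = zstd.ZstdDecompressor()
    with open(zpath, 'rb') as fh:
        with dec.stream_reader(fh) as zstream:
            # 'item' matches each element of a top-level array
            yield from ijson.items(zstream, 'item')

# for record in stream_zstd_array('arrays.json.zstd'):
#     print(record)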

A note on pandas

pandas can load everything at once if streaming isn't required. Note that pandas infers compression from a '.zst' suffix, so with a '.zstd' filename you likely need to pass compression='zstd' explicitly:

import pandas as pd

df = pd.read_json('test.json.zstd', compression='zstd')

Chunked reads with lines=True and chunksize=... expect newline-delimited JSON, where each JSON object is on its own line without surrounding brackets and commas. That format is incompatible with a single-line, single-object JSON.
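
For completeness, here is a hedged sketch of what a chunked read would look like if the data were newline-delimited, assuming a hypothetical records.jsonl.zst file with one object per line:

import pandas as pd

# hypothetical file: {"text": "..."}\n{"text": "..."}\n...
with pd.read_json('records.jsonl.zst', lines=True, chunksize=1000,
                  compression='zstd') as reader:
    for chunk in reader:  # each chunk is a DataFrame of up to 1000 rows
        print(len(chunk))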

Why this matters

When data is larger than memory, the parsing strategy matters as much as the compression format. Streaming the decompressed bytes into a streaming JSON parser avoids buffering the entire payload, and it does so without reformatting the source into newline-delimited JSON. This keeps the I/O model simple and prevents accidental memory blow-ups from materializing the whole document.

Takeaways

If you have a big .json.zstd and the JSON isn't newline-delimited, don't iterate lines and don't try to load the whole file. Open a Zstandard stream and hand it to ijson.items with a prefix that matches what you want to iterate: an empty prefix for the top-level value, or 'item' for the elements of a top-level array. The parser consumes the compressed stream incrementally instead of buffering the raw document. If you need chunking with pandas, make sure your data is newline-delimited JSON; otherwise, stick with a streaming parser that accepts file-like objects.

The article is based on a question from StackOverflow by jeandut and an answer by furas.