2025, Nov 13 01:00

How to parse optional trailing newlines in Lark without UnexpectedEOF or extra trees: reliable grammar designs

Learn why optional NEWLINE and EOF can fragment Lark parse trees, and how a robust grammar (coord+, split line rules) handles an optional final newline safely.

Parsing optional trailing newlines with Lark can be surprisingly tricky. A missing newline at the end of a file may change how the last line is grouped, leaving you with an extra tree when you expected exactly two. The core of the issue is not the content of the line itself, but how the grammar allows or disallows line boundaries when EOF arrives.

Minimal reproduction of the issue

The following script demonstrates how an optional newline at the end of a file can cause an extra tree in the output. The only difference between “works” and “breaks” is the presence of a final newline. As written, RAW_INPUT has no final newline, which is the case that breaks.

from lark import Lark, Transformer, Token, Discard

RAW_INPUT = """__ 95 95 36 __ 95 __ 95 __
__ __ 95 36 32 __ __ __ __"""

SYNTAX = """
    start      : NEWLINE? map+
    map        : [coord coord*] NEWLINE?
    coord      : HEX | FILL
    HEX        : ("A".."F" | DIGIT)+
    FILL       : "__"

    %import common.DIGIT
    %import common.NEWLINE
    %import common.WS_INLINE
    %ignore WS_INLINE
"""

class TreeFold(Transformer):

    def start(self, items: list) -> list:
        return items

    def NEWLINE(self, _):
        return Discard

    def coord(self, parts: list[Token]) -> str:
        return parts[0].value

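def run_parse(text: str):
    # Lark defaults to the Earley parser when no parser is specified.
    engine = Lark(SYNTAX, start='start')
    parsed = engine.parse(text)
    # start() returns the list of line nodes; print one per output line.
    for node in TreeFold().transform(parsed):
        print(node)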

run_parse(RAW_INPUT)

With a final newline present, the lines group as expected. Without it, the second line may split and form an extra tree. Making NEWLINE mandatory after each line fixes the grouping, but then any input that does not end in a newline fails with UnexpectedEOF.

Why this happens

The grammar stacks two optional endings: a line can end with NEWLINE?, and the last line can also end at EOF. Because the NEWLINE is optional, nothing forces a map to extend to a line boundary, so the same run of coords can be derived as one map or as several, and the parser has to pick. Depending on where EOF hits, that choice can fragment the last logical “line” into multiple trees. The construct [coord coord*] adds to the problem: the square brackets make the whole group optional, so map can also match zero coords. coord+ requires at least one coord and expresses the intent more clearly.

A robust grammar that tolerates optional final NL

To support both cases (files that do and do not end with a newline), split “line with NEWLINE” and “line without NEWLINE” into separate rules, and let the top-level rule accept either one as the final line. This avoids forcing a newline while keeping grouping consistent.

from lark import Lark, Transformer, Token, Discard

EXAMPLES = [
# a newline at the beginning and at the end
"""
__ 95 95 36 __ 95 __ 95 __
__ __ 95 36 32 __ __ __ __
""",

# a newline at the end but not the beginning
"""__ 95 95 36 __ 95 __ 95 __
__ __ 95 36 32 __ __ __ __
""",

# a newline at the beginning but not at the end
"""
__ 95 95 36 __ 95 __ 95 __
__ __ 95 36 32 __ __ __ __""",

# no newline at the beginning nor the end
"""__ 95 95 36 __ 95 __ 95 __
__ __ 95 36 32 __ __ __ __""",

# empty lines everywhere
"""

__ 95 95 36 __ 95 __ 95 __

__ __ 95 36 32 __ __ __ __


"""
]

GRAMMAR_OK = """
    start      : NEWLINE? mapnl* (mapnl | map)
    mapnl      : coord+ NEWLINE
    map        : coord+
    coord      : HEX | FILL
    HEX        : ("A".."F" | DIGIT)+
    FILL       : "__"

    %import common.DIGIT
    %import common.NEWLINE
    %import common.WS_INLINE
    %ignore WS_INLINE
"""

class NodeReducer(Transformer):

    def start(self, nodes: list) -> list:
        return nodes

    def NEWLINE(self, _):
        return Discard

    def coord(self, toks: list[Token]) -> str:
        return toks[0].value

def exec_parse(text: str):
    parser = Lark(GRAMMAR_OK, start='start')
    tree = parser.parse(text)
    # start() returns the list of line nodes; print one per output line.
    for node in NodeReducer().transform(tree):
        print(node)

for sample in EXAMPLES:
    exec_parse(sample)
    print()

This version consistently groups complete lines whether the input starts with a newline, ends with one, contains empty lines, or omits the final NL entirely.

An equivalent alternative

If you prefer a slightly different top rule, you can express the last line as a separate production that allows an optional trailing newline and reuse the “line with mandatory NL” form for the others.

GRAMMAR_ALT = """
    start      : NEWLINE? map* mapoptnl
    map        : coord+ NEWLINE
    mapoptnl   : coord+ NEWLINE?
    coord      : HEX | FILL
    HEX        : ("A".."F" | DIGIT)+
    FILL       : "__"

    %import common.DIGIT
    %import common.NEWLINE
    %import common.WS_INLINE
    %ignore WS_INLINE
"""

Both approaches produce the same structure, differing only in rule names.

Why this nuance matters

Relying on NEWLINE? in line rules and on EOF as an implicit terminator can lead to ambiguous endings that fragment the last record. This shows up as extra trees or, if you make NEWLINE mandatory, as an UnexpectedEOF when the file lacks a trailing newline. A commonly recommended remedy is to force a newline at the end of the input before parsing. That makes the error go away, but it is a workaround, not a structural solution. EOF handling has been a recurring pain point, and it is reassuring to know that running into it is not user error. As one practitioner put it:

The dev's suggested “fix” is “force a newline at the end of the input before parsing.” Your answer confirms that it wasn't really something I was doing wrong.

Closing notes and guidance

If a grammar must accept data with or without a trailing newline, make the last-line case explicit in the grammar rather than making NEWLINE optional everywhere. Prefer coord+ over [coord coord*] so that a line cannot match zero coords. If you control the input pipeline and want a quick workaround, appending a final newline is valid, but encoding the option in the grammar is more robust and keeps the parse trees stable regardless of how the file ends.
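For completeness, the quick workaround can be as small as the helper below. It is a minimal sketch in plain Python; ensure_trailing_newline is a hypothetical name, and it assumes the grammar you feed the result to requires a NEWLINE after every line.

```python
def ensure_trailing_newline(text: str) -> str:
    """Return text unchanged if it already ends in a newline, else append one."""
    return text if text.endswith("\n") else text + "\n"


print(repr(ensure_trailing_newline("__ 95 36\n__ 32 __")))
print(repr(ensure_trailing_newline("__ 95 36\n__ 32 __\n")))
```

Call it on the raw text just before parser.parse(...). Input that already ends in a newline, including a Windows-style "\r\n", passes through untouched.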