2025, Oct 18 03:00
Converting T-SQL to Databricks SQL: handle tokenization gaps (sqlparse) and rewrite TOP to LIMIT
Learn why sqlparse fails to tag TOP in T-SQL, breaking T-SQL to Databricks SQL rewrites, and how to detect it and convert TOP to LIMIT with reliable guards.
Converting T-SQL to Databricks SQL often looks straightforward until tokenization gets in the way. A common stumbling block is the TOP clause: it isn’t recognized as a keyword by the tokenizer you might rely on, so your control flow never switches into the transformation path. If your logic short-circuits when only common keywords are found, you may accidentally return the original query instead of rewriting TOP to LIMIT.
Problem setup
The idea is to parse the SQL, flatten the token stream, and proceed only if an uncommon keyword appears. If everything looks “common,” the code returns the input unchanged and defers any transformation. In practice this blocks the TOP handling, because TOP is never picked up as a keyword by the tokenizer.
import sqlparse
from sqlparse.sql import TokenList
from sqlparse.tokens import Keyword

def transpile(input_sql):
    ast_unit = sqlparse.parse(input_sql)[0]
    stream = TokenList(ast_unit.tokens).flatten()
    shared_kw = ["SELECT", "FROM", "DISTINCT", "WHERE", "GROUP BY", "ORDER BY",
                 "AS", "JOIN", "VALUES", "INSERT INTO", "UPDATE", "DELETE FROM",
                 "SET", "COUNT", "AVG", "MIN", "MAX", "SUM"]
    passthrough = True
    for piece in stream:
        if piece.ttype is Keyword:  # only tokens tagged as Keyword are inspected
            if piece.value.upper() not in shared_kw:
                passthrough = False
                break
    if passthrough:
        return input_sql  # nothing "uncommon" found -> query returned unchanged
The expected next step is to transform TOP into LIMIT. Instead, the function falls through and returns the original query, because the guard never flips.
Root cause
sqlparse.tokens.Keyword covers DDL and DML keywords, but not DQL constructs like TOP. Because of that, TOP isn’t tagged as a Keyword at all, so the check against your allowlist never triggers. The outcome is subtle: the loop sees no “uncommon” keyword, never breaks, and the function returns the input unchanged.
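A quick way to confirm this is to dump the flattened token stream and inspect each token’s ttype: if TOP shows up with anything other than a Keyword ttype, the guard above can never flip. A minimal sketch, where the query string is just an example:

import sqlparse
from sqlparse.sql import TokenList

sample = "SELECT TOP 10 name FROM users ORDER BY name"
for tok in TokenList(sqlparse.parse(sample)[0].tokens).flatten():
    if not tok.is_whitespace:
        print(tok.ttype, repr(tok.value))  # see how each token is classified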
Fix and revised code
To keep the pipeline moving when it encounters constructs outside the tokenizer’s Keyword coverage, add an explicit check on the raw SQL string for TOP and any other markers you care about, such as the @ character that prefixes T-SQL variables.
import re

import sqlparse
from sqlparse.sql import TokenList
from sqlparse.tokens import Keyword

def transpile(input_sql):
    ast_unit = sqlparse.parse(input_sql)[0]
    stream = TokenList(ast_unit.tokens).flatten()
    shared_kw = ["SELECT", "FROM", "DISTINCT", "WHERE", "GROUP BY", "ORDER BY",
                 "AS", "JOIN", "VALUES", "INSERT INTO", "UPDATE", "DELETE FROM",
                 "SET", "COUNT", "AVG", "MIN", "MAX", "SUM"]
    passthrough = True
    # String-level guard: catch constructs the tokenizer never tags as Keyword.
    # \bTOP\b avoids false hits on identifiers such as TOPIC; @ flags T-SQL variables.
    if re.search(r"\bTOP\b", input_sql, re.IGNORECASE) or "@" in input_sql:
        passthrough = False
    if passthrough:
        for piece in stream:
            if piece.ttype is Keyword:
                if piece.value.upper() not in shared_kw:
                    passthrough = False
                    break
    if passthrough:
        return input_sql
    # continue with transformation steps (e.g., TOP -> LIMIT) when passthrough is False
This guarantees that queries containing TOP won’t be returned untouched, so the subsequent logic can replace TOP with LIMIT as intended. For reference, the list of words, functions, and characters recognized by sqlparse.tokens.Keyword is available here: https://github.com/andialbrecht/sqlparse/blob/master/sqlparse/keywords.py.
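From there, the rewrite itself still has to happen. Below is a minimal, regex-based sketch of the TOP-to-LIMIT conversion; rewrite_top_to_limit is a hypothetical helper that only covers the simple SELECT TOP n form (no PERCENT, WITH TIES, or parenthesized expressions):

import re

def rewrite_top_to_limit(sql):
    # Find a numeric TOP clause; leave the query alone if none is present
    match = re.search(r"\bTOP\s+(\d+)\b", sql, re.IGNORECASE)
    if not match:
        return sql
    limit_n = match.group(1)
    # Remove the TOP clause, normalize whitespace, and append LIMIT at the end
    without_top = sql[:match.start()] + sql[match.end():]
    return re.sub(r"\s+", " ", without_top).strip().rstrip(";") + f" LIMIT {limit_n}"

print(rewrite_top_to_limit("SELECT TOP 10 name FROM users ORDER BY name"))
# -> SELECT name FROM users ORDER BY name LIMIT 10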
Why this matters
When building automated SQL migration tooling, your control flow often hinges on token classes and keyword detection. If the tokenizer excludes some categories, your tool may silently skip critical rewrites. Understanding what the tokenizer actually labels as Keyword avoids false assumptions and broken transformation paths.
Practical takeaways
First, verify what your tokenizer emits before wiring business rules to it. If needed, complement token-based checks with simple string scans for constructs you must catch. Second, when behavior diverges from expectations, trace the values and execution path with print debugging to see what’s really happening at runtime and confirm whether a token is recognized at all.
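As a lightweight example of that kind of diagnostic, a couple of assertions against the guard (assuming the transpile sketch above, with the string-level check in place) confirm the behavior at a glance:

top_query = "SELECT TOP 5 id FROM orders"
plain_query = "SELECT id FROM orders"

# A TOP query must not come back verbatim...
assert transpile(top_query) != top_query
# ...while a query made only of common keywords passes through unchanged.
assert transpile(plain_query) == plain_query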
In short, TOP won’t appear as a Keyword in this setup, so guard for it explicitly and let the rest of your pipeline do the conversion to LIMIT. Keep your allowlist tight, know your tokenizer’s limits, and validate with lightweight diagnostics before scaling the transformation logic.