2025, Oct 18 03:00
Converting T-SQL to Databricks SQL: handle tokenization gaps (sqlparse) and rewrite TOP to LIMIT
Learn why sqlparse fails to tag TOP in T-SQL, breaking T-SQL to Databricks SQL rewrites, and how to detect it and convert TOP to LIMIT with reliable guards.
Converting T-SQL to Databricks SQL often looks straightforward until tokenization gets in the way. A common stumbling block is the TOP clause: it isn’t recognized as a keyword by the tokenizer you might rely on, so your control flow never switches into the transformation path. If your logic short-circuits when only common keywords are found, you may accidentally return the original query instead of rewriting TOP to LIMIT.
Problem setup
The idea is to parse the SQL, flatten the token stream, and proceed only if an uncommon keyword appears. If everything looks “common,” the code returns the input unchanged and defers any transformation. In practice this blocks the TOP handling, because TOP is never picked up as a keyword by the tokenizer.
import sqlparse
from sqlparse.sql import TokenList
from sqlparse.tokens import Keyword

def transpile(input_sql):
    ast_unit = sqlparse.parse(input_sql)[0]
    stream = TokenList(ast_unit.tokens).flatten()
    shared_kw = ["SELECT", "FROM", "DISTINCT", "WHERE", "GROUP BY", "ORDER BY",
                 "AS", "JOIN", "VALUES", "INSERT INTO", "UPDATE", "DELETE FROM",
                 "SET", "COUNT", "AVG", "MIN", "MAX", "SUM"]
    passthrough = True
    for piece in stream:
        if piece.ttype is Keyword:  # only tokens tagged as Keyword are inspected
            if piece.value.upper() not in shared_kw:
                passthrough = False
                break
    if passthrough:
        return input_sql  # nothing "uncommon" found -> query returned unchanged
The expected next step is to transform TOP into LIMIT. Instead, the function falls through and returns the original query, because the guard never flips.
Root cause
sqlparse.tokens.Keyword covers DDL and DML keywords, but not DQL constructs like TOP. Because of that, TOP isn’t tagged as a Keyword at all, so the check against your allowlist never triggers. The outcome is subtle: the loop sees no “uncommon” keyword, never breaks, and the function returns the input unchanged.
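A quick way to confirm this is to dump the flattened token stream and inspect each token’s ttype: if TOP shows up with anything other than a Keyword ttype, the guard above can never flip. A minimal sketch, where the query string is just an example:

import sqlparse
from sqlparse.sql import TokenList

sample = "SELECT TOP 10 name FROM users ORDER BY name"
for tok in TokenList(sqlparse.parse(sample)[0].tokens).flatten():
    if not tok.is_whitespace:
        print(tok.ttype, repr(tok.value))  # see how each token is classified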
Fix and revised code
To keep the pipeline moving when it encounters constructs outside the tokenizer’s Keyword coverage, add an explicit check on the raw SQL string for TOP and any other markers you care about, such as the @ character that prefixes T-SQL variables.
import re

import sqlparse
from sqlparse.sql import TokenList
from sqlparse.tokens import Keyword

def transpile(input_sql):
    ast_unit = sqlparse.parse(input_sql)[0]
    stream = TokenList(ast_unit.tokens).flatten()
    shared_kw = ["SELECT", "FROM", "DISTINCT", "WHERE", "GROUP BY", "ORDER BY",
                 "AS", "JOIN", "VALUES", "INSERT INTO", "UPDATE", "DELETE FROM",
                 "SET", "COUNT", "AVG", "MIN", "MAX", "SUM"]
    passthrough = True
    # String-level guard: catch constructs the tokenizer never tags as Keyword.
    # \bTOP\b avoids false hits on identifiers such as TOPIC; @ flags T-SQL variables.
    if re.search(r"\bTOP\b", input_sql, re.IGNORECASE) or "@" in input_sql:
        passthrough = False
    if passthrough:
        for piece in stream:
            if piece.ttype is Keyword:
                if piece.value.upper() not in shared_kw:
                    passthrough = False
                    break
    if passthrough:
        return input_sql
    # continue with transformation steps (e.g., TOP -> LIMIT) when passthrough is False
This guarantees that queries containing TOP won’t be returned untouched, so the subsequent logic can replace TOP with LIMIT as intended. For reference, the list of words, functions, and characters recognized by sqlparse.tokens.Keyword is available here: https://github.com/andialbrecht/sqlparse/blob/master/sqlparse/keywords.py.
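From there, the rewrite itself still has to happen. Below is a minimal, regex-based sketch of the TOP-to-LIMIT conversion; rewrite_top_to_limit is a hypothetical helper that only covers the simple SELECT TOP n form (no PERCENT, WITH TIES, or parenthesized expressions):

import re

def rewrite_top_to_limit(sql):
    # Find a numeric TOP clause; leave the query alone if none is present
    match = re.search(r"\bTOP\s+(\d+)\b", sql, re.IGNORECASE)
    if not match:
        return sql
    limit_n = match.group(1)
    # Remove the TOP clause, normalize whitespace, and append LIMIT at the end
    without_top = sql[:match.start()] + sql[match.end():]
    return re.sub(r"\s+", " ", without_top).strip().rstrip(";") + f" LIMIT {limit_n}"

print(rewrite_top_to_limit("SELECT TOP 10 name FROM users ORDER BY name"))
# -> SELECT name FROM users ORDER BY name LIMIT 10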
Why this matters
When building automated SQL migration tooling, your control flow often hinges on token classes and keyword detection. If the tokenizer excludes some categories, your tool may silently skip critical rewrites. Understanding what the tokenizer actually labels as Keyword avoids false assumptions and broken transformation paths.
Practical takeaways
First, verify what your tokenizer emits before wiring business rules to it. If needed, complement token-based checks with simple string scans for constructs you must catch. Second, when behavior diverges from expectations, trace the values and execution path with print debugging to see what’s really happening at runtime and confirm whether a token is recognized at all.
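As a lightweight example of that kind of diagnostic, a couple of assertions against the guard (assuming the transpile sketch above, with the string-level check in place) confirm the behavior at a glance:

top_query = "SELECT TOP 5 id FROM orders"
plain_query = "SELECT id FROM orders"

# A TOP query must not come back verbatim...
assert transpile(top_query) != top_query
# ...while a query made only of common keywords passes through unchanged.
assert transpile(plain_query) == plain_query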
In short, TOP won’t appear as a Keyword in this setup, so guard for it explicitly and let the rest of your pipeline do the conversion to LIMIT. Keep your allowlist tight, know your tokenizer’s limits, and validate with lightweight diagnostics before scaling the transformation logic.