2026, Jan 13 09:00
How to Extract Columns, Tables, and Databases from Complex SQL (WHERE, JOINs, Subqueries) with sqlglot
Learn a robust way to parse complex SQL into a dataframe, resolving columns to real tables and databases using sqlglot qualification and scope traversal.
Parsing complex SQL into a structured view of columns, tables, and databases sounds straightforward until the WHERE clause and nested subqueries get involved. The goal here is to produce a dataframe that lists every referenced column along with its physical table and database, and to tag where in the query each reference appears, including joins, subqueries, and temp/created tables. The sticking points are accurate table/database resolution inside WHERE conditions and filtering out aliases so only real table names are returned.
Problem setup and a naive approach
The following function traverses the SQL AST, records columns, and tries to resolve their source tables and databases. It works for straightforward SELECTs and JOINs, but it struggles to map columns to their source tables inside WHERE clauses and other nested contexts, and it has trouble distinguishing aliases from actual table names.
import sqlglot
from sqlglot import expressions as exp
import pandas as pd

def gather_query_facts(sql_text: str) -> pd.DataFrame:
    trees = sqlglot.parse(sql_text)
    col_buffer = []
    dedup = set()
    alias_to_source = {}  # alias -> (table name, database name)
    area = None  # label for the part of the query currently being walked

    def log_col(col_node: exp.Column):
        nonlocal area
        col_name = col_node.name
        alias_name = col_node.table
        resolved_table, resolved_db = alias_to_source.get(alias_name, (alias_name, None))
        key = (col_name, resolved_table, resolved_db, area)
        if key not in dedup:
            dedup.add(key)
            col_buffer.append({
                "column": col_name,
                "table": resolved_table,
                "database": resolved_db,
                "query_section": area,
            })

    def walk_safe(val, ctx):
        if isinstance(val, exp.Expression):
            walk(val, ctx)
        elif isinstance(val, list):
            for v in val:
                if isinstance(v, exp.Expression):
                    walk(v, ctx)

    def walk(node, ctx=None):
        nonlocal area
        if not isinstance(node, exp.Expression):
            return
        if isinstance(node, exp.CTE):
            name = node.alias_or_name
            area = name
            walk(node.this, ctx=name)
            area = ctx
        elif isinstance(node, exp.Subquery):
            sub_alias = node.alias_or_name
            if sub_alias:
                alias_to_source[sub_alias] = (f"subquery_{sub_alias}", None)
            area = sub_alias
            walk(node.this, ctx=sub_alias)
            area = ctx
        elif isinstance(node, exp.Table):
            tname = node.name
            alias = node.alias_or_name or tname
            db_expr = node.args.get("db")
            dbname = db_expr.name if isinstance(db_expr, exp.Identifier) else None
            alias_to_source[alias] = (tname, dbname)
        elif isinstance(node, exp.Create):
            tname = node.this.name
            area = tname
            if node.expression:
                walk(node.expression, ctx=tname)
            area = ctx
        elif isinstance(node, exp.Insert):
            area = "final_select"
            walk(node.expression, ctx=area)
            area = ctx
        elif isinstance(node, exp.Select):
            for proj in node.expressions:
                if isinstance(proj, exp.Alias) and isinstance(proj.this, exp.Column):
                    log_col(proj.this)
                elif isinstance(proj, exp.Column):
                    log_col(proj)
        elif isinstance(node, exp.Column):
            log_col(node)
            return
        for _, child in node.args.items():
            if isinstance(child, (exp.Expression, list)):
                walk_safe(child, ctx)

    for stmt in trees:
        walk(stmt)
    return pd.DataFrame(col_buffer)
Here is a representative SQL statement that should be fully covered, including JOINs and a WHERE subquery:
CREATE TABLE TRD AS (
    SELECT
        TR.REQUEST_ID
        ,P17.THIS_WORKING
        ,P17.REQUEST_FIELD_VAL AS "AUTHORIZATION"
        ,P20.REQUEST_FIELD_VAL AS "CONTRACT PD/AOR"
    FROM ADW_VIEWS_FSICS.FSICS_IPSS_TRAINING_REQUESTS TR
    LEFT JOIN ADW_VIEWS_FSICS.FSICS_IPSS_TRNG_REQUESTS_DET P17
        ON TR.REQUEST_ID = P17.REQUEST_ID
        AND P17.REQUEST_FIELD_EXTERNAL_ID = 'IPSS_MD_PROPERTY_17'
    LEFT JOIN ADW_VIEWS_FSICS.FSICS_IPSS_TRNG_REQUESTS_DET P20
        ON TR.REQUEST_ID = P20.REQUEST_ID
        AND P20.REQUEST_FIELD_EXTERNAL_ID = 'IPSS_MD_PROPERTY_20'
    WHERE TR.REQUEST_ID IN (
        SELECT REQUEST_ID
        FROM ADW_VIEWS_FSICS.MY_TNG_REQUESTS
        WHERE EVENT_TYPE = 'BASIC'
    )
);
What goes wrong and why
Manually tracking tables through an AST and trying to associate each Column node with its physical Table quickly becomes brittle. WHERE conditions and subqueries introduce nested scopes, and aliases complicate lookups. The result is that column names are found, but their originating tables and databases may come back as null or as alias placeholders instead of real physical names. The logic above attempts to maintain a registry of aliases and sections, but it does not reliably resolve sources across all scopes where columns can appear.
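To make the failure concrete, assume the full CREATE TABLE statement above is stored in a string variable, say sql_stmt (a name used only for this sketch). Running the walker on it illustrates the problem; the exact rows depend on traversal order and the sqlglot version:

# Illustrative only: sql_stmt is assumed to hold the CREATE TABLE ... AS statement above.
naive_df = gather_query_facts(sql_stmt)
print(naive_df.to_string(index=False))
# Typical symptoms: projection columns are attributed to the aliases TR/P17 rather than
# the physical tables, and REQUEST_ID / EVENT_TYPE from the WHERE subquery come back
# with blank or missing table and database values.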
Solution: lean on sqlglot’s qualification and scope traversal
Instead of maintaining a custom alias registry and walking everything by hand, it is more robust to let the optimizer qualify columns and then traverse semantic scopes. Column qualification annotates each column with its source, while scope traversal provides the mapping of sources for that scope. This produces the expected table and database values even for WHERE clauses and nested queries, and naturally avoids returning aliases in place of real table names.
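Before applying this to the full statement, a tiny example shows what the two pieces do. This is a minimal sketch with made-up table and column names: qualify_columns rewrites unqualified column references so they carry their source table, and traverse_scope yields one scope per (sub)query whose sources dict maps each alias or table name to the underlying exp.Table node.

from sqlglot import parse_one, exp
from sqlglot.optimizer.qualify_columns import qualify_columns
from sqlglot.optimizer.scope import traverse_scope

# Toy query; sales.orders and its columns are made up for illustration.
tree = parse_one("SELECT order_id FROM sales.orders o WHERE o.total > 10")
tree = qualify_columns(tree, schema=None)
print(tree.sql())
# Prints something like: SELECT o.order_id AS order_id FROM sales.orders AS o WHERE o.total > 10

for scope in traverse_scope(tree):
    for col in scope.columns:
        source = scope.sources.get(col.table)
        if isinstance(source, exp.Table):
            # source.db is the database/schema part, source.name the physical table
            print(col.name, "->", source.db, source.name)
# order_id -> sales orders
# total -> sales orders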
from sqlglot.optimizer.qualify_columns import qualify_columns
from sqlglot.optimizer.scope import traverse_scope
from sqlglot import parse, exp
import pandas as pd

def dissect_sql_catalog(sql_text, dialect='tsql'):
    frames = []
    for node in parse(sql_text, read=dialect):
        # Annotate every column with the table (or alias) it comes from.
        node = qualify_columns(node, schema=None)
        # Section label: the created/target object if present, otherwise the statement type.
        zone = str(node.this).upper() if node.this else str(node.key).upper()
        collected = []
        # Each scope corresponds to one (sub)query and maps source names to their expressions.
        for sc in traverse_scope(node):
            for col in sc.columns:
                src = sc.sources.get(col.table)
                if isinstance(src, exp.Table):
                    # src.db is the database/schema part of a physical table reference.
                    db_name = src.db or None
                    collected.append((zone, db_name, src.name, col.name))
                else:
                    # CTEs, derived tables, or unresolved references have no physical table.
                    collected.append((zone, None, None, col.name))
        df = pd.DataFrame(collected, columns=["section", "database", "table", "columns"])
        df = df.drop_duplicates()
        frames.append(df)
    return pd.concat(frames, ignore_index=True)
This approach surfaces columns from SELECT lists, JOIN predicates, WHERE conditions, and subqueries, and it tags them with a section derived from the statement node so you can see where each reference came from.
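As a quick sanity check, point the function at the same statement (again assuming it sits in a string variable, here called sql_stmt). The commented output is illustrative, and handling of CREATE ... AS statements can vary slightly between sqlglot versions, but the WHERE-subquery columns should now resolve to the physical MY_TNG_REQUESTS table rather than coming back blank:

# Illustrative usage; sql_stmt is assumed to hold the CREATE TABLE ... AS statement above.
catalog_df = dissect_sql_catalog(sql_stmt)
print(catalog_df.to_string(index=False))
# Expected shape (values illustrative):
# section         database                         table       columns
#     TRD  ADW_VIEWS_FSICS  FSICS_IPSS_TRAINING_REQUESTS    REQUEST_ID
#     TRD  ADW_VIEWS_FSICS  FSICS_IPSS_TRNG_REQUESTS_DET  THIS_WORKING
#     TRD  ADW_VIEWS_FSICS               MY_TNG_REQUESTS    REQUEST_ID
#     TRD  ADW_VIEWS_FSICS               MY_TNG_REQUESTS    EVENT_TYPE
#     ...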
Why this matters
When you need reliable lineage or metadata extraction from SQL, correctness breaks down quickly if table and database resolution is incomplete. Transformations that depend on accurate mapping—governance checks, refactoring, query review, or impact analysis—require trustworthy attribution in every clause, especially in filters and nested statements. Using qualification and scope traversal ensures the mapping stays consistent across the entire query.
Takeaways
If you need a dataframe of columns with their physical tables and databases across complex queries, resist writing a custom AST walker that manually resolves aliases and scopes. Let column qualification and scope traversal do the heavy lifting, then normalize the result into a deduplicated dataframe. This keeps the logic compact, resilient to nested constructs, and aligned with how the query is actually interpreted.