2026, Jan 13 09:00
How to Extract Columns, Tables, and Databases from Complex SQL (WHERE, JOINs, Subqueries) with sqlglot
Learn a robust way to parse complex SQL into a dataframe, resolving columns to real tables and databases using sqlglot qualification and scope traversal.
Parsing complex SQL into a structured view of columns, tables, and databases sounds straightforward until the WHERE clause and nested subqueries get involved. The goal here is to produce a dataframe that lists every referenced column along with its physical table and database, and to tag where in the query each reference appears, including joins, subqueries, and temp/created tables. The sticking points are accurate table/database resolution inside WHERE conditions and filtering out aliases so only real table names are returned.
Problem setup and a naive approach
The following function traverses the SQL AST, records columns, and tries to resolve their source tables and databases. It works for straightforward SELECTs and JOINs, but it struggles to map columns to their source tables inside WHERE clauses and other nested contexts, and it has trouble distinguishing aliases from actual table names.
import sqlglot
from sqlglot import expressions as exp
import pandas as pd

def gather_query_facts(sql_text: str) -> pd.DataFrame:
    trees = sqlglot.parse(sql_text)
    col_buffer = []
    dedup = set()
    alias_to_source = {}  # alias -> (table name, database name)
    area = None  # label for the part of the query currently being walked

    def log_col(col_node: exp.Column):
        nonlocal area
        col_name = col_node.name
        alias_name = col_node.table
        resolved_table, resolved_db = alias_to_source.get(alias_name, (alias_name, None))
        key = (col_name, resolved_table, resolved_db, area)
        if key not in dedup:
            dedup.add(key)
            col_buffer.append({
                "column": col_name,
                "table": resolved_table,
                "database": resolved_db,
                "query_section": area,
            })

    def walk_safe(val, ctx):
        if isinstance(val, exp.Expression):
            walk(val, ctx)
        elif isinstance(val, list):
            for v in val:
                if isinstance(v, exp.Expression):
                    walk(v, ctx)

    def walk(node, ctx=None):
        nonlocal area
        if not isinstance(node, exp.Expression):
            return
        if isinstance(node, exp.CTE):
            name = node.alias_or_name
            area = name
            walk(node.this, ctx=name)
            area = ctx
        elif isinstance(node, exp.Subquery):
            sub_alias = node.alias_or_name
            if sub_alias:
                alias_to_source[sub_alias] = (f"subquery_{sub_alias}", None)
            area = sub_alias
            walk(node.this, ctx=sub_alias)
            area = ctx
        elif isinstance(node, exp.Table):
            tname = node.name
            alias = node.alias_or_name or tname
            db_expr = node.args.get("db")
            dbname = db_expr.name if isinstance(db_expr, exp.Identifier) else None
            alias_to_source[alias] = (tname, dbname)
        elif isinstance(node, exp.Create):
            tname = node.this.name
            area = tname
            if node.expression:
                walk(node.expression, ctx=tname)
            area = ctx
        elif isinstance(node, exp.Insert):
            area = "final_select"
            walk(node.expression, ctx=area)
            area = ctx
        elif isinstance(node, exp.Select):
            for proj in node.expressions:
                if isinstance(proj, exp.Alias) and isinstance(proj.this, exp.Column):
                    log_col(proj.this)
                elif isinstance(proj, exp.Column):
                    log_col(proj)
        elif isinstance(node, exp.Column):
            log_col(node)
            return
        for _, child in node.args.items():
            if isinstance(child, (exp.Expression, list)):
                walk_safe(child, ctx)

    for stmt in trees:
        walk(stmt)
    return pd.DataFrame(col_buffer)
Here is a representative SQL statement that should be fully covered, including JOINs and a WHERE subquery:
CREATE TABLE TRD AS (
    SELECT
        TR.REQUEST_ID
        ,P17.THIS_WORKING
        ,P17.REQUEST_FIELD_VAL AS "AUTHORIZATION"
        ,P20.REQUEST_FIELD_VAL AS "CONTRACT PD/AOR"
    FROM ADW_VIEWS_FSICS.FSICS_IPSS_TRAINING_REQUESTS TR
    LEFT JOIN ADW_VIEWS_FSICS.FSICS_IPSS_TRNG_REQUESTS_DET P17
        ON TR.REQUEST_ID = P17.REQUEST_ID
        AND P17.REQUEST_FIELD_EXTERNAL_ID = 'IPSS_MD_PROPERTY_17'
    LEFT JOIN ADW_VIEWS_FSICS.FSICS_IPSS_TRNG_REQUESTS_DET P20
        ON TR.REQUEST_ID = P20.REQUEST_ID
        AND P20.REQUEST_FIELD_EXTERNAL_ID = 'IPSS_MD_PROPERTY_20'
    WHERE TR.REQUEST_ID IN (
        SELECT REQUEST_ID
        FROM ADW_VIEWS_FSICS.MY_TNG_REQUESTS
        WHERE EVENT_TYPE = 'BASIC'
    )
);
What goes wrong and why
Manually tracking tables through an AST and trying to associate each Column node with its physical Table quickly becomes brittle. WHERE conditions and subqueries introduce nested scopes, and aliases complicate lookups. The result is that column names are found, but their originating tables and databases may come back as null or as alias placeholders instead of real physical names. The logic above attempts to maintain a registry of aliases and sections, but it does not reliably resolve sources across all scopes where columns can appear.
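To make the failure concrete, assume the full CREATE TABLE statement above is stored in a string variable, say sql_stmt (a name used only for this sketch). Running the walker on it illustrates the problem; the exact rows depend on traversal order and the sqlglot version:

# Illustrative only: sql_stmt is assumed to hold the CREATE TABLE ... AS statement above.
naive_df = gather_query_facts(sql_stmt)
print(naive_df.to_string(index=False))
# Typical symptoms: projection columns are attributed to the aliases TR/P17 rather than
# the physical tables, and REQUEST_ID / EVENT_TYPE from the WHERE subquery come back
# with blank or missing table and database values.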
Solution: lean on sqlglot’s qualification and scope traversal
Instead of maintaining a custom alias registry and walking everything by hand, it is more robust to let the optimizer qualify columns and then traverse semantic scopes. Column qualification annotates each column with its source, while scope traversal provides the mapping of sources for that scope. This produces the expected table and database values even for WHERE clauses and nested queries, and naturally avoids returning aliases in place of real table names.
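Before applying this to the full statement, a tiny example shows what the two pieces do. This is a minimal sketch with made-up table and column names: qualify_columns rewrites unqualified column references so they carry their source table, and traverse_scope yields one scope per (sub)query whose sources dict maps each alias or table name to the underlying exp.Table node.

from sqlglot import parse_one, exp
from sqlglot.optimizer.qualify_columns import qualify_columns
from sqlglot.optimizer.scope import traverse_scope

# Toy query; sales.orders and its columns are made up for illustration.
tree = parse_one("SELECT order_id FROM sales.orders o WHERE o.total > 10")
tree = qualify_columns(tree, schema=None)
print(tree.sql())
# Prints something like: SELECT o.order_id AS order_id FROM sales.orders AS o WHERE o.total > 10

for scope in traverse_scope(tree):
    for col in scope.columns:
        source = scope.sources.get(col.table)
        if isinstance(source, exp.Table):
            # source.db is the database/schema part, source.name the physical table
            print(col.name, "->", source.db, source.name)
# order_id -> sales orders
# total -> sales orders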
from sqlglot.optimizer.qualify_columns import qualify_columns
from sqlglot.optimizer.scope import traverse_scope
from sqlglot import parse, exp
import pandas as pd

def dissect_sql_catalog(sql_text, dialect='tsql'):
    frames = []
    for node in parse(sql_text, read=dialect):
        # Annotate every column with the table (or alias) it comes from.
        node = qualify_columns(node, schema=None)
        # Section label: the created/target object if present, otherwise the statement type.
        zone = str(node.this).upper() if node.this else str(node.key).upper()
        collected = []
        # Each scope corresponds to one (sub)query and maps source names to their expressions.
        for sc in traverse_scope(node):
            for col in sc.columns:
                src = sc.sources.get(col.table)
                if isinstance(src, exp.Table):
                    # src.db is the database/schema part of a physical table reference.
                    db_name = src.db or None
                    collected.append((zone, db_name, src.name, col.name))
                else:
                    # CTEs, derived tables, or unresolved references have no physical table.
                    collected.append((zone, None, None, col.name))
        df = pd.DataFrame(collected, columns=["section", "database", "table", "columns"])
        df = df.drop_duplicates()
        frames.append(df)
    return pd.concat(frames, ignore_index=True)
This approach surfaces columns from SELECT lists, JOIN predicates, WHERE conditions, and subqueries, and it tags them with a section derived from the statement node so you can see where each reference came from.
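As a quick sanity check, point the function at the same statement (again assuming it sits in a string variable, here called sql_stmt). The commented output is illustrative, and handling of CREATE ... AS statements can vary slightly between sqlglot versions, but the WHERE-subquery columns should now resolve to the physical MY_TNG_REQUESTS table rather than coming back blank:

# Illustrative usage; sql_stmt is assumed to hold the CREATE TABLE ... AS statement above.
catalog_df = dissect_sql_catalog(sql_stmt)
print(catalog_df.to_string(index=False))
# Expected shape (values illustrative):
# section         database                         table       columns
#     TRD  ADW_VIEWS_FSICS  FSICS_IPSS_TRAINING_REQUESTS    REQUEST_ID
#     TRD  ADW_VIEWS_FSICS  FSICS_IPSS_TRNG_REQUESTS_DET  THIS_WORKING
#     TRD  ADW_VIEWS_FSICS               MY_TNG_REQUESTS    REQUEST_ID
#     TRD  ADW_VIEWS_FSICS               MY_TNG_REQUESTS    EVENT_TYPE
#     ...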
Why this matters
When you need reliable lineage or metadata extraction from SQL, correctness breaks down quickly if table and database resolution is incomplete. Transformations that depend on accurate mapping—governance checks, refactoring, query review, or impact analysis—require trustworthy attribution in every clause, especially in filters and nested statements. Using qualification and scope traversal ensures the mapping stays consistent across the entire query.
Takeaways
If you need a dataframe of columns with their physical tables and databases across complex queries, resist writing a custom AST walker that manually resolves aliases and scopes. Let column qualification and scope traversal do the heavy lifting, then normalize the result into a deduplicated dataframe. This keeps the logic compact, resilient to nested constructs, and aligned with how the query is actually interpreted.