2025, Oct 24 13:00
Reproducing Synapse SQL/SQL Server CAST(HASHBYTES('SHA2_256') AS BIGINT) in PySpark: Right‑Most 8 Bytes, Big‑Endian
Learn why SHA-256 to BIGINT results differ between Synapse SQL and PySpark, and how to match SQL Server: use right-most 8 bytes and big-endian interpretation.
When migrating ETL logic from Synapse SQL to PySpark, it’s common to mirror an existing row identifier built from HASHBYTES. The raw SHA-256 value may match byte-for-byte between SQL and Python, yet the derived BIGINT diverges. This guide shows exactly why that happens and how to get a bit-for-bit compatible 64-bit integer.
The setup: how the key is generated in SQL
In SQL, the identifier is created by hashing a concatenation of column values with a pipe separator and then casting the hash to BIGINT:
SELECT CAST(HASHBYTES('SHA2_256', CONCAT_WS('|', [col1], [col2], ...)) AS BIGINT) AS EtlHashKey
FROM sample_table;
For a simple input like col1 = 'abc' and col2 = '123', the expected SQL result is a specific negative BIGINT. The same input in Python may produce a different 64-bit integer even though the SHA-256 hex digests are identical. That mismatch is the core problem.
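Before chasing the integer, it is worth confirming that the raw hashes agree. A minimal check, assuming the SQL columns are varchar so the bytes hashed by HASHBYTES are the same ASCII/UTF-8 bytes Python produces (with nvarchar input, HASHBYTES hashes UTF-16LE bytes and the digests will differ):
import hashlib
# The same string SQL builds with CONCAT_WS('|', 'abc', '123');
# compare against SELECT HASHBYTES('SHA2_256', 'abc|123') on the SQL side
print(hashlib.sha256('abc|123'.encode('utf-8')).hexdigest())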
Reproducing the issue in Python
The following Python snippet demonstrates a seemingly correct approach that still yields a different BIGINT than SQL. It joins the values with a pipe separator, substitutes empty strings for None (note that SQL Server's CONCAT_WS simply skips NULLs and their separators, so rows containing NULLs need extra care), encodes the result as UTF-8, and the SHA-256 hex digest it produces matches SQL's HASHBYTES output for the test input.
import hashlib
import struct

def calc_row_key(*parts):
    # Join the values with a pipe separator, substituting '' for None
    merged = '|'.join(['' if p is None else str(p) for p in parts])
    # 32-byte SHA-256 digest of the UTF-8 encoded string
    digest = hashlib.sha256(merged.encode('utf-8')).digest()
    # Take the FIRST 8 bytes and read them as a little-endian signed 64-bit integer
    key64 = struct.unpack('<q', digest[:8])[0]
    return key64
This produces a valid signed 64-bit integer, but it does not match SQL’s CAST(HASHBYTES(...) AS BIGINT).
Why the numbers diverge
SHA-256 returns 32 bytes. The discrepancy comes from how those 32 bytes are reduced to 8 and how those 8 bytes are interpreted. When SQL Server casts a varbinary longer than 8 bytes to BIGINT, it truncates on the left, keeping the right-most 8 bytes, and reads them as a big-endian signed integer. The Python code above instead takes the first 8 bytes and unpacks them as little-endian.
Once the correct slice and byte order are applied, the resulting 64-bit integer aligns with SQL Server’s CAST behavior. The raw hash values matched all along; the difference was only in which 8 bytes were chosen and how they were read.
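A quick way to see both readings side by side is int.from_bytes on the same digest; the sketch below uses the illustrative input 'abc|123':
import hashlib

digest = hashlib.sha256(b'abc|123').digest()  # 32 bytes

# What the snippet above computes: first 8 bytes, little-endian
first8_le = int.from_bytes(digest[:8], byteorder='little', signed=True)

# What SQL Server's CAST(... AS BIGINT) computes: last 8 bytes, big-endian
last8_be = int.from_bytes(digest[-8:], byteorder='big', signed=True)

print(first8_le, last8_be)  # two different 64-bit integers from the same hash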
The fix: match SQL’s CAST semantics
To replicate CAST(HASHBYTES('SHA2_256', ...) AS BIGINT), take the last 8 bytes of the SHA-256 digest and unpack them as a signed 64-bit integer in big-endian order.
import hashlib
import struct

def build_etl_key(*fields):
    # Join the values with a pipe separator, substituting '' for None
    joined = '|'.join(['' if f is None else str(f) for f in fields])
    # 32-byte SHA-256 digest of the UTF-8 encoded string
    hbytes = hashlib.sha256(joined.encode('utf-8')).digest()
    # Take the LAST 8 bytes and read them as a big-endian signed 64-bit integer,
    # matching CAST(... AS BIGINT) in SQL Server
    etl_key = struct.unpack('>q', hbytes[-8:])[0]
    return etl_key
This change aligns the Python result with SQL Server’s BIGINT derived from HASHBYTES. The crucial details are the right-most 8 bytes and big-endian interpretation.
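In PySpark the corrected function can be wrapped as a UDF. A minimal sketch, assuming a DataFrame df with columns col1 and col2 (the names are illustrative):
from pyspark.sql import functions as F
from pyspark.sql.types import LongType

# Register the corrected Python function as a UDF returning a 64-bit integer
build_etl_key_udf = F.udf(build_etl_key, LongType())

df = df.withColumn('EtlHashKey', build_etl_key_udf(F.col('col1'), F.col('col2')))
One thing to verify on real data: str() on non-string columns must produce exactly the text SQL's CONCAT_WS would, otherwise the digests, and therefore the keys, drift apart.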
Why it’s important to know this
Reducing a 32-byte SHA-256 digest to an 8-byte integer discards three quarters of the information, which makes collisions more likely than with the full digest. The practice persists in some pipelines for legacy or schema reasons, but it's worth understanding the trade-off. If the full hash can be stored as binary (or as its hex string), the conversion step disappears and the full collision resistance of SHA-256 is preserved.
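If storing the full digest is an option, Spark's built-in sha2 and concat_ws can produce the 64-character hex string without a UDF; a sketch with the same illustrative column names:
from pyspark.sql import functions as F

# sha2(..., 256) returns the hex-encoded SHA-256 of the concatenated string
df = df.withColumn(
    'EtlHashFull',
    F.sha2(F.concat_ws('|', F.col('col1'), F.col('col2')), 256)
)
As with the UDF, NULL handling and numeric-to-string formatting should be checked against the SQL side before relying on the values.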
Practical wrap-up
If you must produce the same BIGINT that SQL Server yields for CAST(HASHBYTES('SHA2_256', ...) AS BIGINT), compute the SHA-256, take the right-most 8 bytes, and unpack them as a signed 64-bit big-endian value. Keep in mind that a BIGINT captures only a subset of the hash’s information, so if your system allows it, prefer storing the full 32-byte digest to preserve uniqueness properties.
This article is based on a question on Stack Overflow by Santosh Reddy Kommidi and an answer by Charlieface.