2026, Jan 02 13:00

Detect Arithmetic Relationships Between DataFrame Columns in Pandas Using Vectorized NumPy and isclose

Learn a fast, vectorized approach in Pandas with NumPy to detect column relationships (add, subtract, multiply, divide) using numpy.isclose and safe division.

When you need to uncover arithmetic relationships across a dataframe — for example, finding whether some target column equals the result of an operation on two other numeric columns — a straightforward nested-loop approach quickly becomes slow and clunky. The task gets more delicate with floats and when multiple relationships per target can exist, such as price = discount + final_price and price = total / quantity. Below is a practical, vectorized way to detect additions, subtractions, multiplications, and divisions between pairs of columns against a target, while remaining careful with floating-point comparisons.

Problem

You have a subset of numerical columns and want to iterate over all pairs, testing whether a target column equals the result of an operation between the pair. The naive structure often looks like this and raises the question of how to compare a computed series against a target series efficiently:

from itertools import combinations
# numeric_subset holds the names of the numeric columns under test
# iterate over all pairs of numeric columns
for a_col, b_col in combinations(numeric_subset, 2):
    # iterate over possible targets
    for goal in numeric_subset:
        # skip trivial self-comparisons
        if goal in (a_col, b_col):
            continue
        # how to compare goal with a_col op b_col efficiently?

What is really going on

The direct approach repeatedly performs Pandas operations inside nested loops and then checks equality. That creates overhead and doesn’t handle floats robustly: exact equality is fragile once floats are involved, so a tolerance-based comparison is safer. There is also an algebraic detail to consider: addition and multiplication are commutative and need only one ordering, while subtraction and division require both orderings to be checked. Because the columns can mix integers and floats, and more than one relationship can apply to the same target column, the detection logic must cover all four operations and both directions where necessary.
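
As a quick illustration of why exact float equality is fragile, here is a minimal sketch (with made-up values) showing a sum that plain equality misses but numpy.isclose catches:

import numpy as np
import pandas as pd
# two columns whose sum equals the target only up to rounding error
toy = pd.DataFrame({"a": [0.1, 1.5], "b": [0.2, 2.5], "total": [0.3, 4.0]})
# exact equality fails on the first row because 0.1 + 0.2 != 0.3 in binary floats
print((toy["a"] + toy["b"] == toy["total"]).tolist())          # [False, True]
# a tolerance-based comparison treats both rows as matches
print(np.isclose(toy["a"] + toy["b"], toy["total"]).tolist())  # [True, True]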

Solution: vectorize with NumPy and compare via isclose

A fast approach is to push as much work as possible into NumPy. Compute operations on NumPy arrays, generate boolean masks via numpy.isclose, and count matches. For division, avoid division by zero by using numpy.divide with a where mask and a floating fallback. This keeps the number of Pandas operations low and leverages the performance of NumPy’s C-based core.

import numpy as np
import pandas as pd
from itertools import combinations
def scan_relations(frame, num_fields, tol=1e-8):
    findings = []
    specs = {
        "+": [(lambda u, v: u + v, "lhs+rhs")],  # commutative: rhs+lhs not needed
        "-": [
            (lambda u, v: u - v, "lhs-rhs"),
            (lambda u, v: v - u, "rhs-lhs")
        ],
        "*": [(lambda u, v: u * v, "lhs*rhs")],  # commutative: rhs*lhs not needed
        "/": [
            # guard against divide by zero
            (lambda u, v: np.divide(u, v, out=np.full_like(u, np.nan, dtype=float), where=(v != 0)), "lhs/rhs"),
            (lambda u, v: np.divide(v, u, out=np.full_like(u, np.nan, dtype=float), where=(u != 0)), "rhs/lhs")
        ],
    }
    for left, right in combinations(num_fields, 2):
        arr_left = frame[left].values
        arr_right = frame[right].values
        for variants in specs.values():
            for fn, tag in variants:
                result = fn(arr_left, arr_right)
                for candidate in num_fields:
                    if candidate in (left, right):
                        continue
                    target_vals = frame[candidate].values
                    mask = np.isclose(result, target_vals, atol=tol, equal_nan=False)
                    hit_count = int(mask.sum())
                    row_count = len(frame)
                    findings.append({
                        "lhs": left,
                        "rhs": right,
                        "op": tag,
                        "candidate": candidate,
                        "hits": hit_count,
                        "rows": row_count,
                        "ratio": hit_count / row_count,
                    })
    return pd.DataFrame(findings)
# --- Example ---
example = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [2, 2, 2, 2],
    "z": [2, 4, 6, 8],
    "w": [2, 0, 1, 8],
})
num_fields = ["x", "y", "z", "w"]
report = scan_relations(example, num_fields)
print(report[report["hits"] > 0].sort_values("ratio", ascending=False))
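
On this toy frame, the rows with ratio 1.0 correspond to z = x*y, x = z/y, and y = z/x; every other combination only matches on a subset of rows, which is exactly what the hits and ratio columns are there to surface.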

Why this works and what to watch

Using NumPy arrays avoids a large number of Pandas operations in inner loops. For comparisons, numpy.isclose with a specified tolerance is the right tool when floats are involved. Addition and multiplication do not need both argument orders, which saves redundant work, whereas subtraction and division are checked in both directions. Division is guarded with numpy.divide and a where condition to prevent invalid operations and to keep the output numeric.
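
To make the division guard concrete, here is a small standalone sketch (with arbitrary arrays) showing how the where/out pattern leaves NaN where the divisor is zero, and how those rows then simply fail the isclose check:

import numpy as np
u = np.array([2.0, 0.0, 1.0, 8.0])
v = np.array([2.0, 2.0, 0.0, 4.0])
# rows where v == 0 keep the NaN fill value instead of triggering a divide-by-zero warning
safe = np.divide(u, v, out=np.full_like(u, np.nan), where=(v != 0))
print(safe)  # [1. 0. nan 2.]
# with equal_nan=False (the default), NaN never matches, so zero-divisor rows never count as hits
print(np.isclose(safe, np.array([1.0, 0.0, 5.0, 2.0]), atol=1e-8))  # [ True  True False  True]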

Complexity and performance

The loops traverse all pairs of m numeric columns, which is m(m−1)/2 and behaves like O(m²). For each pair, every remaining column can be a target, adding another factor O(m). The comparison against each target scans n rows, so it costs O(n). Overall, the time grows on the order of O(n · m³). That means runtime increases cubically with the number of numeric columns and linearly with the number of rows. On wide tables, this can become significant, which is why minimizing overhead and vectorizing the hot path is important.
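
To get a feel for that growth, the short sketch below (an estimate, not a benchmark) counts the row-level isclose scans that scan_relations performs, assuming the six operation variants defined above (one each for + and *, two each for - and /):

from math import comb
def isclose_scans(m, variants=6):
    # pairs of columns * operation variants * remaining target columns
    return comb(m, 2) * variants * (m - 2)
for m in (5, 10, 20, 40):
    print(m, isclose_scans(m))
# 5 180
# 10 2160
# 20 20520
# 40 177840
# each scan touches all n rows, hence the O(n * m^3) overall cost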

Why you want this in your toolbox

When datasets embed arithmetic relationships — sums, differences, products, or quotients — an automated detector helps validate pipelines, surface data quality issues, and recover implicit business rules. The vectorized approach scales better and handles mixed integer and float columns safely by design. It also produces a compact report with hits and match ratios so you can rank and investigate only meaningful candidates.

Takeaways

Favor NumPy over repeated Pandas operations inside nested loops. Compare floats with numpy.isclose using an explicit tolerance. Account for commutativity in addition and multiplication and check both directions for subtraction and division, while protecting against divide-by-zero. Be mindful of the O(n · m³) behavior as the number of numeric columns grows, and filter to relationships with nonzero matches to focus your analysis.