2026, Jan 02 19:00

Build DuckDB per-timestamp price histograms that sum traded size (not counts) for tick-level market data

Learn why DuckDB's histogram(price) returns counts, not volume, and how to sum traded size per price by timestamp using GROUP BY + MAP(ARRAY_AGG) for tick data.

When you build per-timestamp histograms in DuckDB over tick-level data, a subtle trap awaits: the built-in histogram aggregator counts occurrences of values, while many market data tasks require summing a companion metric, such as size, for each price level. If you expect the histogram values to reflect total traded size per price, a straightforward call to histogram(price) won’t deliver what you want.

Problem setup

Consider a stream of rows with three fields: price, size, and timestamp. The goal is to generate, for every timestamp, a histogram keyed by price where each value equals the total size traded at that price within that timestamp.

Code that looks right but isn’t

The following query groups by timestamp and calls histogram(price). It returns a map of price to counts per timestamp, not a sum of size per price:

result_frame = conn.query("""
  SET enable_progress_bar = true;
  SELECT
    timestamp,
    histogram(price) AS price_hist
  FROM 'data/tickdata.parquet'
  GROUP BY timestamp
  ORDER BY timestamp
""")

The output appears structurally correct, yet the numbers are frequency counts. For financial data analysis, that’s often not what you need.

Why this happens

histogram(price) is an aggregator that produces a map of distinct values to their counts. It is designed to answer “how many occurrences of each price are there?” not “what is the total size accumulated at each price?” If your intent is to accumulate size per price within each timestamp, you must first compute SUM(size) for every (timestamp, price) pair and only then assemble the result into a map.

The fix: aggregate first, then map

The correct approach is a two-step aggregation. First, produce one row per timestamp and price with the total size. Second, convert those grouped results into a map for each timestamp. The following query does that:

WITH rolled AS (
  SELECT
    timestamp,
    price,
    SUM(size) AS size_total
  FROM 'data/tickdata.parquet'
  GROUP BY timestamp, price
)
SELECT
  timestamp,
  MAP(ARRAY_AGG(price), ARRAY_AGG(size_total)) AS price_size_map
FROM rolled
GROUP BY timestamp
ORDER BY timestamp;

This yields, for each timestamp, a map where keys are price levels and values are the summed sizes for those prices. The logic matches the expected “sum by price within timestamp” behavior.

Why it matters

In market microstructure work, treating counts as if they were notional or volume can skew downstream analytics and risk metrics. Distinguishing between frequency-based histograms and value-weighted aggregations prevents misleading insights, avoids incorrect backtests, and preserves the integrity of derived features for models. The difference is just one function call, but the impact on interpretation is material.

Takeaways

When you need a histogram keyed by price where values are totals, not counts, don't rely on histogram(price). Aggregate the payload you care about first, then convert the grouped keys and values into a map. The pattern is simple and robust: GROUP BY timestamp, price, compute SUM(size) AS size_total, then MAP(ARRAY_AGG(price), ARRAY_AGG(size_total)). This small shift keeps your analytics aligned with the question you're actually asking.