2025, Nov 04 11:00
Stop RSS memory creep in FastAPI with Trino, PyArrow and Polars by writing Parquet to S3 in a short‑lived subprocess
Fix high RSS memory in a FastAPI service using Trino, PyArrow and Polars. Deletes and gc fall short—solve it by writing Parquet to S3 in a subprocess.
Running a long‑lived FastAPI service that pulls from Trino, transforms data with PyArrow and Polars, and writes Parquet to S3 often looks straightforward—until RSS memory stays high after each request. Deleting objects, forcing gc, and even prodding the allocator may not bring usage back to baseline. Below is a concise walkthrough of the failure pattern and a pragmatic, production‑friendly fix.
Minimal reproduction of the issue
The data path is simple: request in, read from Trino, build a PyArrow table, push to S3 with write_to_dataset, then try to free memory. Yet the process's memory doesn't drop back to baseline.
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.fs as pafs

# Fetch rows from Trino and transpose them into columns
rows_blob = grab_trino_rows()
tbl_arrow = pa.table(
    list(zip(*rows_blob)),
    names=column_names,  # placeholder: column names from the Trino result (must include 'organisation')
)

# Initialize the S3 filesystem
s3fs = pafs.S3FileSystem()

# Upload to S3 as a partitioned Parquet dataset
pq.write_to_dataset(
    tbl_arrow,
    root_path=f"{RESULT_STORAGE_BUCKET}/{s3_storage_path}",
    partition_cols=["organisation"],
    filesystem=s3fs,
)

# Attempt to free memory
del rows_blob
del tbl_arrow
Empirical checks showed memory peaking and then sticking near the peak even after deletions.
{
  "resource_stats": {
    "memory_mb": {
      "start": 140.30,
      "peak": 589.03,
      "end": 587.01
    }
  }
}
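For context, numbers like these can be collected in-process. The sketch below is illustrative rather than the original instrumentation: it assumes a Linux host with psutil installed, samples the current RSS via psutil, and reads the lifetime peak from resource.getrusage.

import resource
import psutil  # assumed available; used only for RSS sampling

def current_rss_mb():
    # Current resident set size of this process, in MiB
    return psutil.Process().memory_info().rss / (1024 * 1024)

def peak_rss_mb():
    # Lifetime peak RSS; ru_maxrss is reported in KiB on Linux
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

start = current_rss_mb()
# ... run the Trino -> PyArrow -> S3 path here ...
end = current_rss_mb()
print({"memory_mb": {"start": round(start, 2), "peak": round(peak_rss_mb(), 2), "end": round(end, 2)}})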
What’s really going on
In this pipeline, the memory pressure doesn’t just live in Python objects. The workload exercises native layers—Arrow buffers, Parquet writers, filesystem bindings, and allocator behavior—which means that after Python references are dropped, the process can still hold on to large arenas of native memory. Manual gc calls don’t change that outcome. Even deeper cleanup, like multiple gc passes combined with malloc_trim, only partially reduced RSS. Releasing the PyArrow memory pool helped Arrow’s counters but didn’t fully bring the process back to baseline, which aligns with the observation that native memory fragmentation or external library allocations can keep the process resident set high.
Collected observations were consistent across approaches. Explicit del and gc showed no meaningful drop. Advanced cleanup with malloc_trim showed partial success but not a full return to the starting point. A periodic cleanup routine reduced PyArrow’s internal accounting but left RSS elevated.
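For reference, the deeper cleanup described above looks roughly like the sketch below (assuming a Linux/glibc host and a PyArrow version that exposes MemoryPool.release_unused). Even this combination only partially lowered RSS in practice.

import ctypes
import gc
import pyarrow as pa

def aggressive_cleanup():
    # Several GC passes to break any lingering reference cycles
    for _ in range(3):
        gc.collect()
    # Ask PyArrow's default memory pool to hand unused memory back
    pa.default_memory_pool().release_unused()
    # Ask glibc to return free heap arenas to the OS (no-op elsewhere)
    try:
        ctypes.CDLL("libc.so.6").malloc_trim(0)
    except OSError:
        pass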
The practical fix: isolate work in a subprocess
A reliable way to avoid accumulation in a long‑running service is to execute the heavy data path in a short‑lived subprocess. When the subprocess exits, the operating system reclaims all its memory—Python objects, native pools, allocator arenas, everything—without relying on heuristics inside the main service process.
import gc
import ctypes
from multiprocessing import Process

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.fs as pafs

def child_exec(payload, s3_key_prefix, bucket_uri):
    # Build the Arrow table inside the short-lived child process
    tbl_arrow = pa.table(
        list(zip(*payload)),
        names=column_names,  # placeholder: column names from the Trino result (must include 'organisation')
    )

    # Write to S3 as a partitioned Parquet dataset
    s3fs = pafs.S3FileSystem()
    pq.write_to_dataset(
        tbl_arrow,
        root_path=f"{bucket_uri}/{s3_key_prefix}",
        partition_cols=["organisation"],
        filesystem=s3fs,
    )

    # Optional in-process cleanup before exit; the OS reclaims everything on exit anyway
    gc.collect()
    try:
        ctypes.CDLL("libc.so.6").malloc_trim(0)  # glibc only
    except Exception:
        pass

# Prepare inputs in the parent process
payload = grab_trino_rows()
s3_prefix = "output/"
bucket_uri = "s3://my-bucket"

# Execute the heavy data path in a separate, short-lived process
proc = Process(target=child_exec, args=(payload, s3_prefix, bucket_uri))
proc.start()
proc.join()
This keeps FastAPI’s main process lean across requests. Even if native libraries or the allocator hold on to arenas while the subprocess runs, everything is returned to the OS when it exits, so there’s no cumulative climb over time.
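In a route handler, the subprocess can be wired in without blocking the event loop. The sketch below is one way to do it, not the original service's code: the /export path, the run_export helper, and grab_trino_rows are placeholders, and run_in_threadpool keeps the blocking join() off the async loop.

from fastapi import FastAPI
from fastapi.concurrency import run_in_threadpool
from multiprocessing import Process

app = FastAPI()

def run_export(payload, s3_prefix, bucket_uri):
    # child_exec is the worker function shown above
    proc = Process(target=child_exec, args=(payload, s3_prefix, bucket_uri))
    proc.start()
    proc.join()
    return proc.exitcode

@app.post("/export")
async def export():
    payload = grab_trino_rows()  # placeholder, as in the earlier snippets
    exit_code = await run_in_threadpool(run_export, payload, "output/", "s3://my-bucket")
    return {"ok": exit_code == 0}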
Why you should care
Services that move large Arrow tables and Parquet partitions can hit high transient peaks. If those peaks keep ratcheting your baseline upward, you’ll eventually trip container limits or suffer unpredictable latency under memory pressure. Process isolation turns a leaky‑looking pipeline into a bounded one, with a clear upper limit per request.
Additional notes from the field
Two observations proved useful alongside process isolation. First, measuring RSS immediately after del can be misleading; allocator behavior means visible drops can take a short while, so add a small delay before sampling. Second, it’s possible to monitor subprocesses and collect their results while keeping the parent process tight. In some cases, breaking uploads into batches helped operationally, while still keeping the subprocess boundary.
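As a rough illustration of both points (assuming psutil is installed; child_with_result and the two-second delay are illustrative choices, not the original code), a multiprocessing.Queue can carry a small status payload back from the child while the heavy data stays confined to it, and RSS is sampled only after a short pause.

import time
from multiprocessing import Process, Queue
import psutil  # assumed available; used only for RSS sampling

def child_with_result(payload, out):
    # ... build the Arrow table and write the Parquet dataset here ...
    out.put({"rows": len(payload), "status": "ok"})

payload = grab_trino_rows()  # placeholder, as in the earlier snippets
queue = Queue()
proc = Process(target=child_with_result, args=(payload, queue))
proc.start()
result = queue.get()  # collect the child's result before joining
proc.join()

time.sleep(2)  # small delay so allocator releases become visible before sampling
rss_mb = psutil.Process().memory_info().rss / (1024 * 1024)
print(result, f"parent RSS after request: {rss_mb:.1f} MiB")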
Conclusion
If a FastAPI route that reads from Trino, processes with PyArrow and Polars, and writes Parquet to S3 doesn’t free memory back to baseline, assume native allocations are in play. Deleting Python objects, running gc, calling malloc_trim, and even releasing PyArrow’s memory pool may partially help but won’t guarantee a return to the starting point. Running the heavy path in a short‑lived subprocess is the most robust way to ensure memory is fully reclaimed between requests. Measure with a slight delay, keep the main process slender, and let the OS do the final cleanup when the subprocess exits.