2025, Sep 27 13:00

Polars partitioned Parquet to Amazon S3: resolving pyarrow ACCESS_DENIED and local staging in v1.33

Learn why partitioned Parquet writes to Amazon S3 failed in polars (pyarrow ACCESS_DENIED, local staging) and how v1.33 fixes write_parquet and sink_parquet.

Writing partitioned Parquet to Amazon S3 from polars can behave in surprising ways depending on the backend you select. Two symptoms show up in practice: with the pyarrow engine disabled, files are staged locally before upload and those local artifacts are never cleaned up; with the pyarrow engine enabled, the upload may fail with an ACCESS_DENIED error that mentions anonymous multipart uploads. Below is a compact walkthrough of the issue and the current state of the fix.

Repro in code

The following example writes a partitioned DataFrame to S3 with explicit AWS credentials passed via storage_options. Toggling the backend changes the outcome.

aws_sess = ...  # your boto3 session, e.g. boto3.Session()
creds = aws_sess.get_credentials()
io_opts = {
    "aws_access_key_id": creds.access_key,
    "aws_secret_access_key": creds.secret_key,
    "aws_session_token": creds.token,
    "aws_region": "eu-west-1",
}
# bucket_name, prefix_path, part_cols and tbl are assumed to be defined:
# tbl is a polars DataFrame, part_cols a list of partition column names.
s3_uri = f"s3://{bucket_name}/{prefix_path}"
tbl.write_parquet(
    s3_uri,
    partition_by=part_cols,
    storage_options=io_opts,
    use_pyarrow=...,  # True | False, different behavior
)

What’s actually going on

With use_pyarrow set to False, polars creates local copies of the data before uploading to S3 and those files are not cleaned up automatically. With use_pyarrow set to True, the write can fail with an error like “Uploading to <file> FAILED with error When initiating multiple part upload for key <directory> in bucket <bucket>: AWS Error ACCESS_DENIED during CreateMultipartUpload operation: Anonymous users cannot initiate multipart uploads. Please authenticate.” This behavior was a known issue affecting both pl.DataFrame.write_parquet and pl.LazyFrame.sink_parquet. The discussion is tracked publicly under polars#23114 for sink_parquet and polars#23221 for write_parquet.
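
The ACCESS_DENIED message suggests that, on the affected versions, the credentials passed via storage_options never reach pyarrow's S3 filesystem, so the multipart upload is attempted anonymously. For those versions, one possible workaround is to bypass polars' upload path and let pyarrow write the partitioned dataset itself with an explicitly configured filesystem. The sketch below reuses the placeholder names from the repro (creds, bucket_name, prefix_path, part_cols, tbl); it is illustrative, not the library's prescribed fix.

import pyarrow.fs as pafs
import pyarrow.parquet as pq

# Build an S3 filesystem with explicit credentials so nothing is anonymous.
s3_fs = pafs.S3FileSystem(
    access_key=creds.access_key,
    secret_key=creds.secret_key,
    session_token=creds.token,
    region="eu-west-1",
)

# Convert the polars DataFrame to an Arrow table and write the partitioned
# dataset directly; root_path takes "bucket/prefix" without the "s3://"
# scheme when a filesystem object is supplied.
pq.write_to_dataset(
    tbl.to_arrow(),
    root_path=f"{bucket_name}/{prefix_path}",
    partition_cols=part_cols,
    filesystem=s3_fs,
)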

Resolution and updated behavior

The fix has landed. As of polars v1.33, the described problems no longer occur and both pl.LazyFrame.sink_parquet and pl.DataFrame.write_parquet behave as expected when writing partitioned Parquet to S3.

If you are on a version prior to v1.33, you can reproduce the behavior with the earlier snippet. On v1.33 and above, the same call path works as intended. For example:

aws_sess = ...  # boto3 session
creds = aws_sess.get_credentials()
io_opts = {
    "aws_access_key_id": creds.access_key,
    "aws_secret_access_key": creds.secret_key,
    "aws_session_token": creds.token,
    "aws_region": "eu-west-1",
}
s3_uri = f"s3://{bucket_name}/{prefix_path}"
# polars DataFrame instance
tbl.write_parquet(
    s3_uri,
    partition_by=part_cols,
    storage_options=io_opts,
    use_pyarrow=True  # behaves as expected on v1.33+
)
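
The lazy sink path is covered by the same fix. Below is a minimal sketch of the sink_parquet counterpart; it assumes the same placeholders as above and expresses the partitioning with polars' PartitionByKey scheme, which may need adjusting to your polars version.

import polars as pl

# Lazy counterpart (sketch): partitioned sink straight to S3 on v1.33+.
lf = tbl.lazy()
lf.sink_parquet(
    pl.PartitionByKey(s3_uri, by=part_cols),  # partition scheme instead of a partition_by kwarg
    storage_options=io_opts,
)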

Why this matters

Partitioned data lakes are sensitive to correctness and consistency guarantees during writes. Local staging that leaves files behind can waste disk space and complicate operational hygiene. Failed multipart uploads to S3 stall pipelines and create ambiguity in monitoring. Knowing that the behavior was tied to specific versions of polars helps you diagnose symptoms quickly and remove unnecessary workarounds.

Practical takeaways

If you are hitting either the local-staging leftovers or the ACCESS_DENIED multipart error when writing partitioned Parquet to S3, check your polars version. Upgrading to v1.33 or later addresses the issue for both the eager write path and the lazy sink path. For background and historical context, refer to the public threads at polars#23114 and polars#23221.
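
A quick runtime guard can make that version requirement explicit in pipeline code; this is just an illustrative check against the v1.33 threshold mentioned above.

import polars as pl

# Fail fast if the running polars predates the fix described above (v1.33).
major, minor, *_ = (int(p) for p in pl.__version__.split(".")[:2])
if (major, minor) < (1, 33):
    raise RuntimeError(
        f"polars {pl.__version__} predates the partitioned S3 write fix; upgrade to >=1.33"
    )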

In short, rely on the fixed behavior in v1.33+ and keep your data writing code simple and explicit about storage_options and partitioning. That keeps your S3 workloads predictable and reduces maintenance overhead.

The article is based on a question from StackOverflow by FBruzzesi and an answer by FBruzzesi.