2025, Dec 15 17:00
Prevent GIL-induced event loop stalls: CPU-bound tasks in asyncio, FastAPI, and Starlette
Learn why threads still hold the GIL during CPU-bound work in asyncio/FastAPI and how ProcessPoolExecutor keeps the event loop responsive. Includes code examples and notes on Python 3.13's free-threaded mode.
CPU-bound work and asyncio often collide on one shared resource: the Global Interpreter Lock. If you push heavy synchronous computations into a ThreadPoolExecutor from a FastAPI or Starlette app, the event loop still competes for the same GIL. With enough busy worker threads, the loop can be delayed, and your service can feel sluggish even though you “moved” the work off the main thread.
Reproducing the issue in code
The following example shows a CPU-bound job invoked via ThreadPoolExecutor and a contrasting variant that uses a ProcessPoolExecutor. The logic is identical; only the execution strategy differs.
from fastapi import FastAPI
import concurrent.futures
import asyncio
from multiprocessing import current_process
from threading import current_thread

api = FastAPI()

def heavy_cpu_work():
    pid = current_process().pid
    tid = current_thread().ident
    t_name = current_thread().name
    p_name = current_process().name
    print(f"{pid} - {p_name} - {tid} - {t_name}")
    pow(365, 100000000000000)

# This WILL block the event loop (because of pow())
@api.get("/gil-blocking")
async def gil_blocking():
    evt = asyncio.get_running_loop()
    with concurrent.futures.ThreadPoolExecutor() as exec_pool:
        out = await evt.run_in_executor(exec_pool, heavy_cpu_work)
    return "OK"

# This WON'T block the event loop
@api.get("/gil-non-blocking")
async def gil_non_blocking():
    evt = asyncio.get_running_loop()
    with concurrent.futures.ProcessPoolExecutor() as exec_pool:
        out = await evt.run_in_executor(exec_pool, heavy_cpu_work)
    return "OK"

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(api)
In the “gil-blocking” endpoint, pow() performs a large computation that does not release the GIL; the main event loop can’t run Python bytecode while a worker thread holds the lock. In contrast, the “gil-non-blocking” endpoint offloads the same job to another process and therefore another GIL, keeping the loop responsive.
What is actually happening with the GIL
When any thread holds the GIL, no other thread, including the one that drives the asyncio event loop, can execute Python bytecode until the lock is released, either voluntarily or when the interpreter's switch interval elapses. CPython enforces this time-slicing periodically; the default switch interval is 5 ms. You can inspect the current setting as follows:
import sys
print(sys.getswitchinterval()) # 0.005
This floating-point value expresses the ideal duration of a thread's timeslice and can be changed with sys.setswitchinterval().
Note that the actual time a thread holds the GIL can be longer, especially when long-running native functions or methods are executing. Also, which thread acquires the GIL at the end of the interval is the operating system's decision; the interpreter has no scheduler of its own.
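As a small sketch of my own, the interval can also be changed at runtime; note that this tunes thread switching in general and does not give the event loop's thread any priority:
import sys

sys.setswitchinterval(0.001)    # request more frequent switching (1 ms)
print(sys.getswitchinterval())  # 0.001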
There is no internal priority that favors the event loop’s thread over worker threads. That means multiple busy CPU-bound workers can win the GIL repeatedly, and the loop can be delayed. Moreover, the automatic release itself is best-effort and not guaranteed; certain native operations can effectively monopolize the GIL for long stretches.
The pow(365, 100000000000000) example is instructive precisely because it is one of those cases that will not release the GIL while it runs, so everything else is blocked until it finishes or the process boundary is introduced.
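To see the stall outside of a web framework, here is a minimal, self-contained sketch of my own (not part of the endpoints above): a heartbeat task measures how late it wakes up while a worker thread runs a single large pow() call. The exponent is deliberately much smaller than in the endpoint example so the script finishes in a few seconds.
import asyncio
import concurrent.futures
import time

def big_pow():
    # One long-running C-level operation; the GIL is held until it returns.
    pow(365, 2_000_000)

async def heartbeat():
    while True:
        start = time.perf_counter()
        await asyncio.sleep(0.1)
        late = time.perf_counter() - start - 0.1
        print(f"heartbeat late by {late * 1000:.1f} ms")

async def main():
    loop = asyncio.get_running_loop()
    hb = asyncio.create_task(heartbeat())
    with concurrent.futures.ThreadPoolExecutor() as pool:
        # While big_pow() runs in the worker thread, the heartbeat stops printing
        # and reports one large delay once the GIL is released again.
        await loop.run_in_executor(pool, big_pow)
    hb.cancel()

asyncio.run(main())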
Answering the practical questions
To answer the practical questions directly: when a worker thread holds the GIL, the event loop cannot execute Python bytecode until the lock is yielded, either at the switch interval or voluntarily. Time-slicing can temporarily starve the loop if many workers contend for the lock, and CPython has no scheduler or priority mechanism that prefers the loop's thread in such situations.
A reliable way out: processes, not threads
If you must run CPU-bound code, move it to another process. A separate process means a separate GIL, so the event loop in your main process stays responsive. The earlier example shows the contrast. The logic of the CPU-bound function does not change; only the executor type changes.
from fastapi import FastAPI
import concurrent.futures
import asyncio
from multiprocessing import current_process
from threading import current_thread

api = FastAPI()

def heavy_cpu_work():
    pid = current_process().pid
    tid = current_thread().ident
    t_name = current_thread().name
    p_name = current_process().name
    print(f"{pid} - {p_name} - {tid} - {t_name}")
    pow(365, 100000000000000)

@api.get("/gil-non-blocking")
async def gil_non_blocking():
    evt = asyncio.get_running_loop()
    with concurrent.futures.ProcessPoolExecutor() as exec_pool:
        out = await evt.run_in_executor(exec_pool, heavy_cpu_work)
    return "OK"
Using multiple server workers or processes has the same effect: each worker has its own event loop and its own GIL, allowing multiple requests to proceed in parallel without blocking each other.
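As a quick illustration (assuming the application above is saved as main.py; the module name is my assumption), uvicorn can be started with several worker processes either from the command line with --workers or programmatically:
import uvicorn

if __name__ == "__main__":
    # The app must be given as an import string when workers > 1.
    # Equivalent CLI: uvicorn main:api --workers 4
    uvicorn.run("main:api", workers=4)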
Important nuances and trade-offs
If your CPU-bound code does not contain long native operations like the pow() call above, you can still use a ThreadPoolExecutor: the event loop will eventually get time to run thanks to the switch interval. That said, you should not expect parallel speedups for pure-Python CPU-bound work with threads. ThreadPoolExecutor shines for blocking I/O, not for CPU-heavy code.
The default number of ThreadPoolExecutor workers is derived from min(32, os.cpu_count() + 4), while ProcessPoolExecutor defaults to the number of CPUs. The defaults may or may not be right for your workload. With threads, having more workers than CPUs is common for I/O-bound tasks, but too many active threads can cause excessive context switching. For CPU-bound code, processes match the hardware more naturally. Benchmarking with realistic loads is the sensible way to pick max_workers.
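One reasonable arrangement, sketched here as an assumption rather than a prescription, is to create a single ProcessPoolExecutor at application startup with an explicit max_workers, instead of constructing a new pool inside every request handler as the compact examples above do:
import asyncio
import concurrent.futures
import os
from fastapi import FastAPI

api = FastAPI()
pool = None  # created once at startup, shared by all requests

@api.on_event("startup")
def create_pool():
    global pool
    # Match the number of CPUs for CPU-bound work; benchmark before changing it.
    pool = concurrent.futures.ProcessPoolExecutor(max_workers=os.cpu_count())

@api.on_event("shutdown")
def close_pool():
    pool.shutdown()

def heavy_cpu_work():
    pow(365, 100000000000000)

@api.get("/gil-non-blocking")
async def gil_non_blocking():
    loop = asyncio.get_running_loop()
    await loop.run_in_executor(pool, heavy_cpu_work)
    return "OK"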
When heavy jobs risk delaying request handling, you can introduce a queue to control concurrency, return immediately with a request identifier, and let clients poll for status or receive results over websockets. This decouples the request-response path from long-running computation and preserves responsiveness.
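A bare-bones version of that pattern might look like the sketch below; the endpoint names, the in-memory job_store, and the smaller workload are illustrative assumptions, and a real service would likely use a proper task queue and persistent storage:
import asyncio
import concurrent.futures
import uuid
from fastapi import FastAPI, HTTPException

api = FastAPI()
pool = concurrent.futures.ProcessPoolExecutor(max_workers=2)
job_store: dict[str, asyncio.Future] = {}

def heavy_cpu_work() -> int:
    return sum(i * i for i in range(10_000_000))

@api.post("/jobs")
async def submit_job():
    loop = asyncio.get_running_loop()
    job_id = str(uuid.uuid4())
    # run_in_executor returns an asyncio.Future we can poll later.
    job_store[job_id] = loop.run_in_executor(pool, heavy_cpu_work)
    return {"job_id": job_id}

@api.get("/jobs/{job_id}")
async def job_status(job_id: str):
    future = job_store.get(job_id)
    if future is None:
        raise HTTPException(status_code=404, detail="unknown job")
    if not future.done():
        return {"status": "running"}
    return {"status": "done", "result": future.result()}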
On free-threaded CPython in Python 3.13
Python 3.13 introduces an experimental free-threaded mode where the GIL can be disabled. The feature is not enabled by default and requires a separate executable or a build with --disable-gil. The docs note the potential for parallel speedups on multi-core hardware, but also warn about the experimental status and a substantial single-threaded performance hit.
This is an experimental feature and therefore is not enabled by default. The free-threaded mode requires a different executable, usually called python3.13t or python3.13t.exe. Pre-built binaries marked as free-threaded can be installed as part of the official Windows and macOS installers, or CPython can be built from source with the --disable-gil option.
Free-threaded execution allows for full utilization of the available processing power by running threads in parallel on available CPU cores. While not all software will benefit from this automatically, programs designed with threading in mind will run faster on multi-core hardware. The free-threaded mode is experimental and work is ongoing to improve it: expect some bugs and a substantial single-threaded performance hit.
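If you experiment with a free-threaded build, a quick runtime check (using attributes that, to my knowledge, exist in CPython 3.13) shows whether the interpreter was built without the GIL and whether the GIL is currently active:
import sys
import sysconfig

# 1 on a free-threaded build, 0 (or None on older versions) otherwise
print(sysconfig.get_config_var("Py_GIL_DISABLED"))
# False when running with the GIL disabled (3.13+ only)
if hasattr(sys, "_is_gil_enabled"):
    print(sys._is_gil_enabled())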
Why this matters
Async servers are only as responsive as their event loops. If a worker monopolizes the GIL, coroutine scheduling, callbacks, and timeouts all slip. Latency spikes become visible even at modest loads. Understanding how the GIL and the switch interval govern execution ensures you avoid subtle stalls and protect user-perceived performance.
Takeaways
For CPU-bound work, prefer a ProcessPoolExecutor or additional worker processes so the event loop is not starved. Be aware that there is no thread priority in CPython that privileges the loop thread. Some native computations will not release the GIL and can block everyone else. If you stay with threads, do so for I/O-bound tasks, and tune max_workers only after measuring the real workload. When requests kick off long jobs, decouple the response path via a queue and expose status retrieval or push completion via websockets. Keep an eye on Python 3.13’s free-threaded mode as it matures, but treat it as experimental for now.