2025, Dec 04 07:00

Python process pool to run a multithreaded C++ executable with a strict 3-process concurrency cap

Learn how to cap concurrency at three while running a multithreaded C++ tool per sample using Python ThreadPool and subprocess.run, with tqdm and GNU parallel.

When you need to launch a multithreaded C++ executable across many inputs while keeping a strict cap on concurrency, a simple process pool in Python handles the orchestration cleanly. The goal here is to keep three external executables in flight at all times; since each executable spawns 10 threads, the system-wide total stays at the intended 30 threads.

Baseline: what we start with

The starting point is a Python driver that reads sample names from JSON and wants to keep exactly three tasks in flight. The sketch below shows the intended control flow without a working scheduler yet.

import json

if __name__ == "__main__":
    # Path to the C++ binary that spawns 10 threads internally
    TOOL_BIN = "./GenerateTrimmedNTuple"
    with open("samples.json") as f:
        sample_ids = json.load(f)
    task_list = [[TOOL_BIN, s_id] for s_id in sample_ids]

    # Keep three external processes active at any time
    active = 0
    while active < 3:
        # launch a task and increment counter
        # on task completion, decrement counter
        pass  # placeholder: no working scheduler yet

What’s really going on

You want to cap the number of concurrently running executables to three, regardless of how many samples there are. Each executable spawns 10 threads on its own, so this orchestration level ensures you don’t exceed 30 threads total. The missing piece is a reliable way to launch external commands in parallel and be notified as they finish.
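Before building the scheduler by hand, it is worth noting that the standard library already covers this launch-and-notify pattern. As a hedged sketch (the `python -c` placeholder commands stand in for the real `./GenerateTrimmedNTuple <sample>` invocations), `concurrent.futures.ThreadPoolExecutor` bounds the number of concurrent launches and reports completions as they happen:

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_one(argv):
    """Launch one external process and block until it exits."""
    subprocess.run(argv, check=True)
    return argv

if __name__ == "__main__":
    # Placeholder commands; in the real driver each would be
    # ["./GenerateTrimmedNTuple", sample_id]
    commands = [[sys.executable, "-c", f"print('sample {i}')"] for i in range(6)]

    # max_workers=3 caps how many processes run at once
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = [pool.submit(run_one, cmd) for cmd in commands]
        for fut in as_completed(futures):
            print("finished:", fut.result())
```

Each worker thread merely blocks on its subprocess, so three workers means at most three external processes alive at any moment.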

Practical solution with subprocess and a ThreadPool

The combination of subprocess.run to execute the external binary and multiprocessing.pool.ThreadPool to bound concurrency provides exactly what is needed. The pool maps your argument lists to worker functions, runs up to three at a time, and yields results as soon as individual processes finish.

import json
import multiprocessing.pool as mp_pool
import subprocess

def run_job(argv):
    # Blocks until the external process exits; raises on a non-zero exit
    subprocess.run(argv, check=True)
    return argv

if __name__ == "__main__":
    # C++ executable: spawns 10 threads per run
    BIN_PATH = "./GenerateTrimmedNTuple"
    with open("samples.json") as f:
        items = json.load(f)
    cmd_args = [[BIN_PATH, name] for name in items]

    concurrency = 3
    with mp_pool.ThreadPool(concurrency) as executor:
        for finished_cmd in executor.imap_unordered(run_job, cmd_args):
            print(f"command: {finished_cmd} finished!")

Using check=True makes subprocess.run raise CalledProcessError on a non-zero exit; the exception propagates out of the pool's result iterator and terminates the whole driver. If you prefer to continue on failures, drop check=True and explore other subprocess.run parameters such as capture_output to manage output streams explicitly.
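For the continue-on-failure route, one possible shape (run_job_tolerant is an illustrative name, not part of the original driver) returns the exit code alongside the command instead of raising:

```python
import subprocess

def run_job_tolerant(argv):
    # Without check=True a non-zero exit is reported, not raised;
    # capture_output=True keeps the tool's stdout/stderr for later inspection
    proc = subprocess.run(argv, capture_output=True, text=True)
    return argv, proc.returncode
```

Inside the same pool, the consuming loop can then branch on the returned code, for example printing a warning for non-zero exits while the remaining samples keep running.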

Optional progress feedback

If you want a progress bar, wrap the iterator with tqdm. The total is the number of commands to run.

import tqdm
# ... inside the same context and with the same variables as above:
for finished_cmd in tqdm.tqdm(executor.imap_unordered(run_job, cmd_args), total=len(cmd_args)):
    print(f"command: {finished_cmd} finished!")

Why this approach matters

This setup guarantees that at most three executables are running concurrently, which, given each spawns 10 threads, helps keep the overall thread count aligned with your system envelope. Unlike juggling multiple shells or panes, the pool coordinates lifecycle, collects completions as soon as they happen, and keeps the pipeline saturated without manual supervision.

Shell-first alternatives

If you prefer to stay in the shell, there’s also a direct route using GNU parallel with jq for extracting JSON values. For a list of values in samples.json, the following starts three concurrent runs:

parallel -j3 -a <(jq -r '.[]' samples.json) ./GenerateTrimmedNTuple

And if your JSON holds a dictionary and you need just its keys, extract them with jq and pipe them straight into parallel:

jq -r 'keys | .[]' sample2.json | parallel -j3 ./GenerateTrimmedNTuple

Wrap-up

For controlled parallel execution of an external C++ program per sample, a small ThreadPool around subprocess.run is straightforward and robust. It enforces a fixed concurrency level, surfaces failures predictably with check=True, and can be paired with a progress bar for visibility. Keep your inputs as a list, feed them to the pool, and let the runner keep three processes active until all samples complete.