2025, Nov 09 05:00

Enable Agent Tool Discovery for MCP on Cloud Run using Streamable HTTP and Google ID Token Auth

Connect an MCP server on Google Cloud Run to LLM agents using streamable HTTP and Authorization: Bearer with a Google identity token to enable tool discovery.

When you move an MCP server from a local stdio process to a managed HTTP endpoint on Google Cloud Run, your LLM agent suddenly loses easy access to those tools. Calling a tool directly over HTTP works, but letting the agent discover and invoke tools from natural language prompts is the whole point. The missing piece is an MCP client that speaks streamable HTTP and can forward an identity token as an Authorization header.

Baseline: MCP server on Cloud Run

The MCP server exposes a single addition tool and listens with the streamable HTTP transport. This runs as a container on Cloud Run.

import asyncio
import os
from fastmcp import FastMCP, Context

svc = FastMCP("MCP Server on Cloud Run")

@svc.tool()
async def sum_two(x: int, y: int, meta: Context) -> int:
    """Use when two integers need addition; supply both inputs as parameters."""
    await meta.debug(f"[sum_two] {x}+{y}")
    out = x + y
    await meta.debug(f"result={out}")
    return out

if __name__ == "__main__":
    asyncio.run(
        svc.run_async(
            transport="streamable-http",
            host="0.0.0.0",
            port=int(os.getenv("PORT", 8080)),
        )
    )

Direct call from a Python client works, but the agent can’t plan

A simple client can reach the server, fetch an ID token for the Cloud Run URL, and invoke the tool by name. That proves the deployment and authentication are fine, yet it doesn’t let an LLM decide when to call the tool based on a free-form prompt.

from fastmcp import Client
import asyncio
import google.oauth2.id_token
import google.auth.transport.requests
import os
import sys

argv = sys.argv
if len(argv) != 3:
    sys.stderr.write(f"Usage: python {argv[0]} <a> <b>\n")
    sys.exit(1)

arg_a = argv[1]
arg_b = argv[2]

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'C:\\Path\\to\\file.json'
base_url = "https://mcp-server-url-from-cloud-run"

req = google.auth.transport.requests.Request()
id_tok = google.oauth2.id_token.fetch_id_token(req, base_url)

cfg = {
    "mcpServers": {
        "cloud-run": {
            "transport": "streamable-http",
            "url": f"{base_url}/mcp/",
            "headers": {
                "Authorization": f"Bearer {id_tok}",
            },
        }
    }
}

cli = Client(cfg)

async def main():
    async with cli:
        print("Connected")
        a_val = int(arg_a)
        b_val = int(arg_b)
        res = await cli.call_tool(
            name="sum_two",
            arguments={"x": a_val, "y": b_val},
        )
        print(res)

if __name__ == "__main__":
    asyncio.run(main())
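When the connection fails with a 401 or 403 instead, a quick debugging step is to inspect the token's aud claim, which must match the Cloud Run service URL that was passed as the audience to fetch_id_token. A minimal sketch that decodes a JWT payload without verifying the signature (for local debugging only; the function name is illustrative, not part of any library):

```python
import base64
import json

def jwt_payload(token: str) -> dict:
    """Decode the payload segment of a JWT without signature verification.

    Useful only for debugging, e.g. to check that the aud claim matches
    the Cloud Run service URL; never use this to trust a token.
    """
    payload_b64 = token.split(".")[1]
    # JWT segments are base64url-encoded without padding; restore it first.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))
```

For a valid Cloud Run identity token, jwt_payload(id_tok)["aud"] should equal the service URL used as the audience.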

What actually blocks the agent

Local setups often rely on stdio to wire an MCP server straight into an agent runtime, so tools are auto-discovered and used during reasoning. Once the server sits behind an HTTP endpoint on Cloud Run, the host needs an MCP client that can connect over streamable HTTP and attach a Google identity token in the Authorization header. Without that, the agent framework can’t see or authenticate to the remote tool registry, so it never plans tool calls from natural language.

The working approach: MultiServerMCPClient with bearer token header

The fix is to instantiate an MCP client that natively supports HTTP transport and custom headers, then pass the Cloud Run identity token as Authorization: Bearer. MultiServerMCPClient from langchain-mcp-adapters does exactly that, so the Cloud Run deployment stays untouched and no local proxy is needed.

import asyncio
import os
import google.oauth2.id_token
import google.auth.transport.requests
from langgraph.prebuilt import create_react_agent
from langchain_openai import ChatOpenAI
from langchain_mcp_adapters.client import MultiServerMCPClient

os.environ["OPENAI_API_KEY"] = "OpenAI_API_Key"
agent_llm = ChatOpenAI(
    model="gpt-4o",
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
)

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'C:\\Path\\To\\file.json'
service_url = "https://mcp-server-url"
http_req = google.auth.transport.requests.Request()
bearer_jwt = google.oauth2.id_token.fetch_id_token(http_req, service_url)

remote_cfg = {
    "cloud-run": {
        "transport": "streamable_http",
        "url": f"{service_url}/mcp/",
        "headers": {
            "Authorization": "Bearer " + bearer_jwt,
        }
    }
}

mcp_pool = MultiServerMCPClient(remote_cfg)

async def drive():
    # Discover the remote tools over streamable HTTP, then hand them to the agent.
    toolset = await mcp_pool.get_tools()
    orchestrator = create_react_agent(agent_llm, toolset)
    prompt_text = "What is 4 + 8?"
    agent_result = await orchestrator.ainvoke(
        {"messages": [("user", prompt_text)]}
    )
    # The final message in a ReAct run is the agent's answer.
    print(agent_result["messages"][-1].content)

if __name__ == "__main__":
    asyncio.run(drive())

This lets the agent discover the remote MCP tool and call it when the prompt calls for arithmetic. The run prints the expected answer, for example 4 + 8 = 12.

Why this matters

Tool use only becomes valuable when the LLM decides, from unstructured input, that a call is necessary and then routes to the right capability. Exposing an MCP server over Cloud Run doesn’t change your tools; it changes how the agent must connect and authenticate. Once the client can speak streamable HTTP and pass the Authorization bearer token, you regain the same seamless tool orchestration you had locally, without introducing non-scalable proxies.

Takeaways

Keep the MCP server on Cloud Run as is; focus on the client side. Use a client that supports streamable HTTP and custom headers. Acquire a Google identity token for the Cloud Run URL using the service-account key referenced by GOOGLE_APPLICATION_CREDENTIALS, and attach it as Authorization: Bearer. With that, the agent can load the remote tool registry and invoke tools directly from natural language prompts.

The article is based on a question from StackOverflow by Sachu and an answer by Sachu.