Agentic Builders
Pydantic AI caps Claude's tool budget, Microsoft ships an Azure ARM MCP, and benchmarks get reward-hacked.

2026-05-08

Issue #5 · 12 min read · By Ben

SDKs and MCP servers got loud today, and Mozilla quietly dropped the headline number of the year.

Mornin'. Mozilla apparently pointed Claude Mythos preview at the Firefox codebase last month and watched monthly security fixes jump from roughly 25 to 423. That is not a typo, and it is not a vendor blog. If you've been waffling on whether to let an agent loose on your CVE backlog, your activation energy just got a lot lower. The bugs, per Simon Willison, are very good.

-Ben

In today's newsletter:

  • Cap Claude's tool-use spend
  • Microsoft's first-party Azure MCP
  • Realtime flips, sandbox tightens
  • LangChain backports a CVE

BUDGET CAPS

Pydantic AI hands you a kill switch for Claude tool calls

via GitHub

Capping how much an agent burns per turn used to require duct tape and a callback. Pydantic AI v1.92.0, cut at 01:18 UTC this morning, finally puts a knob on it.

The headline change is first-class support for Anthropic's task budget parameter, which lets you bound a single Claude run's tool-use spend at the SDK level instead of inside your own control flow. There are two other notes worth your eye: the new runtime output_retries override deprecates the old retries argument, and the release fixes streaming-response cleanup on cancellation plus a few MCP session lifecycle bugs.

  • v1.92.0 adds Anthropic task budget support, exposed as a per-run cap on Claude tool spend
  • runtime output_retries override replaces the old retries argument
  • streaming cancellation and MCP session handling bugs squashed

Why it matters: if you run Pydantic AI against Claude in production, this is the first cost-bounding lever you can wire in without rewriting your agent loop. read more.
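
For contrast, here is roughly what the old duct-tape approach looks like: a hand-rolled cap that your own loop has to check before every tool call. This is a minimal sketch, not Pydantic AI's or Anthropic's API. ToolBudget, BudgetExceeded and run_tools are hypothetical names, and the per-call costs are made-up numbers.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when a charge would push a run past its cap."""


@dataclass
class ToolBudget:
    # Hypothetical helper for illustration; not a Pydantic AI or Anthropic type.
    max_tokens: int
    spent: int = 0

    def charge(self, tokens: int) -> None:
        """Record spend, refusing any charge that would pass the cap."""
        if tokens < 0:
            raise ValueError("tokens must be non-negative")
        if self.spent + tokens > self.max_tokens:
            raise BudgetExceeded(
                f"charge of {tokens} would exceed cap of {self.max_tokens} "
                f"(already spent {self.spent})"
            )
        self.spent += tokens

    @property
    def remaining(self) -> int:
        return self.max_tokens - self.spent


def run_tools(costs: list[int], budget: ToolBudget) -> int:
    """Simulate tool calls with known costs; return how many completed."""
    completed = 0
    for cost in costs:
        try:
            budget.charge(cost)
        except BudgetExceeded:
            break  # stop the loop instead of overspending
        completed += 1
    return completed


budget = ToolBudget(max_tokens=100)
completed = run_tools([40, 40, 40], budget)
```

The point of an SDK-level cap is that this bookkeeping no longer lives in your control flow. Check the v1.92.0 release notes for the actual parameter name before wiring anything in.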


INFRA AGENTS

Microsoft ships a first-party MCP server for Azure infra

via TECHCOMMUNITY.MICROSOFT.COM

Microsoft has been quietly shipping MCP servers for a while, but none was dedicated to Azure Resource Manager itself. As of this week, that gap is closed.

The Azure Resource Manager MCP server entered public preview, giving any MCP-aware agent a first-party endpoint for Azure Resource Graph queries plus the full ARM template deployment lifecycle. It is deliberately separate from the existing Azure MCP Server, scoped specifically to infrastructure operations: resource discovery, compliance checks, deployment kickoff and monitoring.

Auth flows through your Azure tenant, so IAM and RBAC apply the way you would expect. Install link is at aka.ms/JoinARMMCP.

  • public preview, remote MCP server, owned by Microsoft
  • covers Azure Resource Graph queries and ARM deployment lifecycle
  • IAM and RBAC inherited from your Azure tenant, no community shim required

Why it matters: agents wired into Copilot or Claude Code can now do real Azure infra work through Microsoft's own pipe, with permissions tied to the tenant instead of a side-channel token. read more.
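
To make the Resource Graph piece concrete, here is a sketch of the JSON-RPC message an MCP client sends when it calls a tool. The tools/call envelope follows the MCP spec. The tool name query_resource_graph and its query argument are assumptions for illustration, since the preview's actual tool names are not listed here. The query string itself is ordinary Resource Graph KQL.

```python
import json


def build_tool_call(request_id: int, tool_name: str, arguments: dict) -> dict:
    """Build an MCP tools/call request (JSON-RPC 2.0 envelope)."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Count virtual machines per region using Azure Resource Graph KQL.
vm_count_query = (
    "Resources "
    "| where type =~ 'microsoft.compute/virtualmachines' "
    "| summarize count() by location"
)

# "query_resource_graph" and "query" are hypothetical names, not the
# server's documented tool schema.
request = build_tool_call(1, "query_resource_graph", {"query": vm_count_query})
wire_payload = json.dumps(request)
```

In practice your MCP client library builds this envelope for you. Listing the server's tools first (the tools/list method) tells you the real names and argument schemas.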


SDK CHURN

openai-agents-python v0.17.0 flips Realtime defaults and tightens the sandbox

via GitHub

Less than 24 hours after v0.16.1 mopped up a flurry of footguns, OpenAI's Agents SDK shipped a minor release that changes default behavior, so treat it as a deliberate pin upgrade rather than a routine bump.

v0.17.0 flips RealtimeAgent's default model to gpt-realtime-2 and narrows the sandbox: local source materialization now confines reachable files to the base directory unless you explicitly grant more. There is also a fix for a Responses context-management parameter collision.

What you'll feel

  • RealtimeAgent default flips to gpt-realtime-2, matching the openai-python v2.36.0 release
  • sandbox no longer materializes sources outside the base directory by default
  • Responses context-management parameter collision fixed

Why it matters: if your code depends on the old Realtime default or on the sandbox seeing files above its base directory, this is not a drop-in bump. Read the notes first. read more.
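
If you want a feel for what "confined to the base directory" means, here is a minimal sketch of the general technique using only pathlib. It is not the SDK's implementation; resolve_inside is a hypothetical helper and /srv/project is a placeholder path.

```python
from pathlib import Path


def resolve_inside(base_dir: str, requested: str) -> Path:
    """Resolve `requested` against `base_dir`, rejecting anything outside it."""
    base = Path(base_dir).resolve()
    # Joining with an absolute `requested` discards `base`, which the
    # containment check below then rejects.
    candidate = (base / requested).resolve()
    if candidate != base and base not in candidate.parents:
        raise PermissionError(f"{requested!r} escapes base directory {base}")
    return candidate


allowed = resolve_inside("/srv/project", "src/app.py")

try:
    resolve_inside("/srv/project", "../secrets.env")
    escaped = True
except PermissionError:
    escaped = False
```

Resolving before checking is the important part: it collapses ".." segments and follows symlinks, so the comparison runs against the real target rather than the string the caller supplied.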


SECURITY PATCH

langchain-core 0.3.86 backports a path-traversal CVE

via GitHub

If you are still pinned to langchain-core 0.3.x because the 1.x migration sticker shock is real, you have a security pull to do this morning.

langchain-core 0.3.86, paired with langchain 0.3.30, backports CVE-2026-34070 (path traversal) plus the loads / dumps hardening from the 1.x line. The release also cleans up hub deprecation paths.

The 1.x branch already shipped the fix. The 0.3.x branch did not, until yesterday.

  • CVE-2026-34070 path-traversal fix backported from 1.x
  • loads / dumps hardening picked up alongside
  • hub deprecation paths cleaned up

Why it matters: plenty of production stacks are still on 0.3.x because the 1.x migration is not free. This is a must-pull, not an optional one. read more.
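
A quick way to audit pins is a check along these lines. It is a sketch that only judges the 0.3.x line, because the 1.x release that carries the fix is not named above. is_unpatched_03x is a hypothetical helper.

```python
import re


def is_unpatched_03x(version: str) -> bool:
    """True if `version` is a langchain-core 0.3.x release older than 0.3.86.

    Says nothing about 1.x: the 1.x version carrying the fix is not
    stated here, so those versions always return False.
    """
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor) == (0, 3) and patch < 86


stale = [v for v in ["0.3.79", "0.3.86", "1.0.0"] if is_unpatched_03x(v)]
```

To check a live environment, feed it importlib.metadata.version("langchain-core") from the standard library.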


FOOTGUN OF THE WEEK

The footgun

Eight major public agent benchmarks can be gamed to roughly 100% by exploiting reward-signal leaks rather than actually solving the task.

How it manifests

Berkeley's RDI lab posted a finding (re-circulating heavily this week) that SWE-bench, WebArena, OSWorld, GAIA, Terminal-Bench, FieldWorkArena, CAR-bench and one more all yield to reward hacking. If you pick a framework or a model from a leaderboard number, you may be selecting for the team that most aggressively hard-coded around the eval, not the agent that actually works on your problem. Vendor blog posts citing single-number scores are the worst offenders here.

How to avoid it

Re-run any benchmark you cite internally with a held-out task variant the public leaderboard never sees. For your own internal evals, assume your reward signal will be hacked by your own agents within weeks of deployment, and design adversarial probes in from day one.
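
One cheap version of that check: score the same agent on the public tasks and on your held-out variants, then flag a large drop. This is a sketch with made-up results. The 0.15 threshold is an arbitrary assumption to tune for your own evals, and both function names are hypothetical.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of tasks passed."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(results) / len(results)


def suspicious_gap(
    public: list[bool], held_out: list[bool], max_drop: float = 0.15
) -> bool:
    """True if the held-out pass rate trails the public one by more than max_drop."""
    return pass_rate(public) - pass_rate(held_out) > max_drop


# Illustrative numbers only: perfect on public tasks, 60% on held-out variants.
public_results = [True] * 10
held_out_results = [True] * 6 + [False] * 4
flagged = suspicious_gap(public_results, held_out_results)
```

A gap alone does not prove reward hacking, since held-out variants can simply be harder. It does tell you which leaderboard numbers deserve a closer look before you cite them.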

via Berkeley RDI


WHAT ELSE IS SHIPPING

  • openai-agents-python v0.16.1 - stabilizes chat-completions stream output indexes, validates MCP require_approval policies, restores session history after compaction-replacement failures, rejects corrupt Dapr session state.
  • langchain 0.3.30 - paired security backport with the langchain-core 0.3.86 CVE fix.
  • langgraph-cli 0.4.25 - adds studio deploy for one-command push from a local LangGraph project to LangGraph Studio.
  • openai-python v2.36.0 - manual updates plus gpt-realtime-2 support in the official Python SDK, matching the Agents SDK default flip.
  • llm-gemini 0.31 - plugin update for Simon Willison's llm CLI marking Gemini 3.1 Flash-Lite as GA.
  • jj v0.41.0 - new release of the Jujutsu VCS that's increasingly popular in Claude Code and agent-coding workflows.
  • Mojo v1.0.0b1 - first 1.0 beta of Modular's AI-systems language; quiet thread on Lobsters but a milestone tag.
  • Mozilla x Claude Mythos: 423 Firefox security fixes in April - Firefox security bug fixes jumped from 20-30 a month to 423 once Mozilla pointed Claude Mythos preview at the codebase. Worth a read on harness design for security agents.
