Agentic Builders
2026-05-08
SDKs and MCP servers got loud today, and Mozilla quietly dropped the headline number of the year.
Mornin'. Mozilla apparently pointed Claude Mythos preview at the Firefox codebase last month and watched monthly security fixes jump from roughly 25 to 423. That is not a typo, and it is not a vendor blog. If you've been waffling on whether to let an agent loose on your CVE backlog, your activation energy just got a lot lower. The bugs, per Simon, are very good.
-Ben
In today's newsletter:
- Cap Claude's tool-use spend
- Microsoft's first-party Azure MCP
- Realtime flips, sandbox tightens
- LangChain backports a CVE
BUDGET CAPS
Pydantic AI hands you a kill switch for Claude tool calls
via GitHub
Capping how much an agent burns per turn used to require duct tape and a callback. Pydantic AI v1.92.0, cut at 01:18 UTC this morning, finally puts a knob on it.
The headline change is first-class support for Anthropic's task budget parameter, which lets you bound a single Claude run's tool-use spend at the SDK level instead of inside your own control flow. There are two other notes worth your eye: the new runtime output_retries override deprecates the old retries argument, and the release fixes streaming-response cleanup on cancellation plus a few MCP session lifecycle bugs.
- v1.92.0 adds Anthropic task budget support, exposed as a per-run cap on Claude tool spend
- runtime output_retries override replaces the old retries argument
- streaming cancellation and MCP session handling bugs squashed
Why it matters: if you run Pydantic AI against Claude in production, this is the first cost-bounding lever you can wire in without rewriting your agent loop.
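For stacks that cannot take the upgrade yet, the duct-tape version of a per-run cap looks roughly like the sketch below. It is plain Python and does not use Pydantic AI's or Anthropic's actual API: ToolBudget, BudgetExceeded, run_tools, and the per-call cost numbers are hypothetical names made up for illustration.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when a tool call would push a run past its cap."""


@dataclass
class ToolBudget:
    # Hypothetical helper: tracks cumulative tool spend for one agent run.
    limit: float
    spent: float = 0.0

    def charge(self, cost: float) -> None:
        # Refuse the call before it runs rather than discovering the overrun after.
        if self.spent + cost > self.limit:
            raise BudgetExceeded(
                f"call costing {cost} would exceed cap {self.limit} "
                f"(already spent {self.spent})"
            )
        self.spent += cost


def run_tools(costs, budget):
    """Charge each tool call in order; stop at the first one the budget refuses.

    Returns the number of calls that were allowed to run.
    """
    completed = 0
    for cost in costs:
        try:
            budget.charge(cost)
        except BudgetExceeded:
            break
        completed += 1
    return completed
```

With a cap of 10 and three calls costing 4 each, the first two run and the third is refused. The point of an SDK-level budget is that this bookkeeping no longer has to live in your own loop.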
INFRA AGENTS
Microsoft ships a first-party MCP server for Azure infra
via TECHCOMMUNITY.MICROSOFT.COM
Microsoft's been quietly shipping MCP servers for everything but Azure proper. As of this week, that gap is closed.
The Azure Resource Manager MCP server entered public preview, giving any MCP-aware agent a first-party endpoint for Azure Resource Graph queries plus the full ARM template deployment lifecycle. It is deliberately separate from the existing Azure MCP Server, scoped specifically to infrastructure operations: resource discovery, compliance checks, deployment kickoff and monitoring.
Auth flows through your Azure tenant, so IAM and RBAC apply the way you would expect. Install link is at aka.ms/JoinARMMCP.
- public preview, remote MCP server, owned by Microsoft
- covers Azure Resource Graph queries and ARM deployment lifecycle
- IAM and RBAC inherited from your Azure tenant, no community shim required
Why it matters: agents wired into Copilot or Claude Code can now do real Azure infra work through Microsoft's own pipe, with permissions tied to the tenant instead of a side-channel token.
SDK CHURN
openai-agents-python v0.17.0 flips Realtime defaults and tightens the sandbox
via GitHub
Less than 24 hours after v0.16.1 mopped up a flurry of footguns, OpenAI's Agents SDK shipped a minor release that actually changes behavior, so treat it as a deliberate pin bump, not a routine upgrade.
v0.17.0 flips RealtimeAgent's default model to gpt-realtime-2 and narrows the sandbox: local source materialization now confines reachable files to the base directory unless you explicitly grant more. There is also a fix for a Responses context-management parameter collision.
What you'll feel
- RealtimeAgent default flips to gpt-realtime-2, matching the openai-python v2.36.0 release
- sandbox no longer materializes sources outside the base directory by default
- Responses context-management parameter collision fixed
Why it matters: if your code depends on the old Realtime default or on the sandbox seeing files above its base directory, this is not a drop-in bump. Read the notes first.
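If you want to see what "confined to the base directory" means in practice, here is a minimal sketch of that kind of check using only the standard library. It is not the Agents SDK's implementation; the confine function and the example paths are hypothetical.

```python
from pathlib import Path


def confine(base_dir: str, requested: str) -> Path:
    """Resolve `requested` against `base_dir` and reject anything that escapes it."""
    base = Path(base_dir).resolve()
    # resolve() collapses ".." segments and follows symlinks, so the check
    # below compares real locations, not the spelling of the path.
    candidate = (base / requested).resolve()
    if candidate != base and base not in candidate.parents:
        raise PermissionError(f"{requested!r} resolves outside {base}")
    return candidate
```

A relative path such as src/main.py comes back resolved under the base, while ../secrets.env or an absolute path elsewhere raises PermissionError. Code that relied on reaching above the base directory is the kind that would need an explicit grant after this release.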
SECURITY PATCH
langchain-core 0.3.86 backports a path-traversal CVE
via GitHub
If you are still pinned to langchain-core 0.3.x because the 1.x migration sticker shock is real, you have a security pull to do this morning.
langchain-core 0.3.86, paired with langchain 0.3.30, backports CVE-2026-34070 (path traversal) plus the loads / dumps hardening from the 1.x line. The release also cleans up hub deprecation paths.
The 1.x branch already shipped the fix. The 0.3.x branch did not, until yesterday.
- CVE-2026-34070 path-traversal fix backported from 1.x
- loads/dumps hardening picked up alongside
- hub deprecation paths cleaned up
Why it matters: plenty of production stacks are still on 0.3.x because the 1.x migration is not free. This is a must-pull, not an optional one.
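A quick way to audit pins is to compare version strings against the backport release. The sketch below assumes plain numeric versions like 0.3.85 (no rc or post suffixes) and only covers the 0.3.x line, since the exact 1.x release carrying the fix is not stated here. The is_patched_03x helper is a hypothetical name.

```python
FIXED_03X = (0, 3, 86)  # langchain-core release carrying the backport


def parse_version(text: str) -> tuple:
    # Assumption: purely numeric dotted versions, e.g. "0.3.85".
    return tuple(int(part) for part in text.split("."))


def is_patched_03x(installed: str) -> bool:
    """Return True if a 0.3.x langchain-core version includes the backport."""
    version = parse_version(installed)
    if version[:2] != (0, 3):
        raise ValueError("this check only covers the 0.3.x line")
    # Tuple comparison orders 0.3.100 after 0.3.86, unlike string comparison.
    return version >= FIXED_03X
```

To check a live environment, feed it the string returned by importlib.metadata.version("langchain-core") from the standard library.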
FOOTGUN OF THE WEEK
The footgun
Eight major public agent benchmarks can be gamed to roughly 100% by exploiting reward-signal leaks rather than actually solving the task.
How it manifests
Berkeley's RDI lab posted a finding (re-circulating heavily this week) that SWE-bench, WebArena, OSWorld, GAIA, Terminal-Bench, FieldWorkArena, CAR-bench and one more all yield to reward hacking. If you pick a framework or a model from a leaderboard number, you may be selecting for the team that most aggressively hard-coded around the eval, not the agent that actually works on your problem. Vendor blog posts citing single-number scores are the worst offenders here.
How to avoid it
Re-run any benchmark you cite internally with a held-out task variant the public leaderboard never sees. For your own internal evals, assume your reward signal will be hacked by your own agents within weeks of deployment, and design adversarial probes in from day one.
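One cheap probe is to compare pass rates on the public tasks against your held-out variants and flag a large gap. The sketch below is an illustration, not a method from the Berkeley finding: pass_rate, leak_gap, and the 0.10 tolerance are hypothetical choices you would tune for your own evals.

```python
def pass_rate(results):
    """Fraction of tasks passed; `results` is a list of booleans."""
    if not results:
        raise ValueError("need at least one result")
    return sum(results) / len(results)


def leak_gap(public, held_out, tolerance=0.10):
    """Compare public-set and held-out pass rates.

    Returns (gap, suspicious). A public score well above the held-out score
    suggests the agent is exploiting something specific to the public tasks.
    """
    gap = pass_rate(public) - pass_rate(held_out)
    return gap, gap > tolerance
```

An agent that passes 10 of 10 public tasks but only 5 of 10 held-out variants yields a gap of 0.5 and gets flagged. A small gap does not prove the eval is clean, only that this particular probe found nothing.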
WHAT ELSE IS SHIPPING
- openai-agents-python v0.16.1 - stabilizes chat-completions stream output indexes, validates MCP require_approval policies, restores session history after compaction-replacement failures, rejects corrupt Dapr session state.
- langchain 0.3.30 - paired security backport with the langchain-core 0.3.86 CVE fix.
- langgraph-cli 0.4.25 - adds studio deploy for one-command push from a local LangGraph project to LangGraph Studio.
- openai-python v2.36.0 - manual updates plus gpt-realtime-2 support in the official Python SDK, matching the Agents SDK default flip.
- llm-gemini 0.31 - plugin update for Simon Willison's llm CLI marking Gemini 3.1 Flash-Lite as GA.
- jj v0.41.0 - new release of the Jujutsu VCS that's increasingly popular in Claude Code and agent-coding workflows.
- Mojo v1.0.0b1 - first 1.0 beta of Modular's AI-systems language; quiet thread on Lobsters but a milestone tag.
- Mozilla x Claude Mythos: 423 Firefox security fixes in April - Firefox security bug fixes jumped from the usual 20 to 30 a month to 423 once Mozilla pointed Claude Mythos preview at the codebase. Worth a read on harness design for security agents.
INTERESTING CONVERSATIONS
Interesting conversations we're following
- Agents need control flow, not more prompts on Hacker News - 507 points, 250 comments on the framing fight of the moment: code-as-orchestrator vs LLM-as-orchestrator.
- AlphaEvolve: Gemini-powered coding agent scaling impact across fields on Hacker News - 307 points, 132 comments dissecting DeepMind's autonomous-coding-agent multi-domain results.
- AI slop is killing online communities on Hacker News - 733 points, 624 comments. The meta-conversation engineers are having about what their own tools are doing to the open web.
- How to make SSE token streams resumable, cancellable, and multi-device on Hacker News - practitioner writeup on the unglamorous infra under any chat or agent UI.
- addyosmani/agent-skills surges on GH Trending (Python) - +1,794 stars today (around 34.2k total), reusable skill packs riding the Claude Code skills wave.
- Fission-AI/OpenSpec on GH Trending (TS) - spec-driven workflow for AI coding assistants trending hard at around 46k stars; the spec, not the prompt, is the primary artifact.