Agentic Builders
2026-05-07
Busy day on the agent beat: AWS MCP goes GA, two flagship SDKs ship majors, and Claude agents start dreaming.
Mornin'. I spent the morning ripping out a hand-rolled boto3 wrapper because AWS just shipped one MCP tool, call_aws, that points at 15,000+ AWS API operations behind a single GA announcement. It is genuinely funny watching a year of "let me write a custom tool for that" evaporate before lunch. While you are at it, pin your model defaults today: OpenAI's Agents SDK just quietly swapped the out-of-the-box brain on every agent you forgot to lock down.
-Ben
In today's newsletter:
- AWS MCP swallows 15,000 APIs
- OpenAI Agents flips its defaults
- Anthropic SDK lands Managed Agents
- Claude agents start dreaming
- Simon admits the line collapsed
PLATFORM PLAY
AWS MCP Server hits GA with 15,000 APIs behind one tool
via Amazon Web Services
AWS just stuffed its entire control plane into a single tool call. The MCP server graduated from preview, and call_aws is now a one-stop drive-thru window for 15,000+ AWS API operations, IAM-scoped and ready for any MCP-aware agent to chew through.
Out of preview, the server picks up IAM context-key support, a sandboxed Python run_script tool for "go compute this thing without making me write a Lambda", and a curated "Skills" model that replaces preview's Agent SOPs with AWS-maintained guidance. It runs in us-east-1 and eu-central-1 at no extra charge.
What's actually new versus preview
- Single call_aws tool fronts the full AWS API surface, scoped per request via IAM context keys
- New run_script tool gives agents sandboxed Python without you provisioning the infra
- "Skills" replaces Agent SOPs with curated, AWS-maintained workflows the model can pull from
Why it matters: Agents in Claude Code, Cursor, and Kiro can finally hit current AWS APIs with scoped credentials and AWS-blessed docs, instead of you babysitting boto3 wrappers and feeding the model stale 2024 API shapes. read more.
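For a feel of the shape, here is a minimal client-side sketch using the open-source MCP Python SDK. The server launch command and the call_aws argument schema are assumptions on my part, so check the GA docs before copying:

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Hypothetical launch command for the AWS MCP server; the real binary name
# and flags are whatever the GA docs say.
server = StdioServerParameters(command="aws-mcp-server", args=["--region", "us-east-1"])

async def main():
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # call_aws fronts the full API surface; this argument shape is
            # an illustrative assumption, not the documented schema.
            result = await session.call_tool(
                "call_aws",
                {"service": "s3", "operation": "ListBuckets", "parameters": {}},
            )
            print(result.content)

asyncio.run(main())
```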
DEFAULT DRIFT
OpenAI Agents SDK quietly rewires every new agent
via GitHub
A minor-looking version bump, a totally different default brain. openai-agents-python v0.16.0 swaps the SDK's stock model to gpt-5.4-mini, lets you pass max_turns=None to disable the loop ceiling, and exposes tool execution concurrency as a first-class config knob.
It also adds server-prefixed MCP tool naming, which sounds tiny until the moment you wire up two MCP servers that both export a search tool. Anyone riding the defaults inherits all of the above on the next install.
- Default model flips to gpt-5.4-mini on fresh agents
- max_turns=None removes the turn ceiling for long-running loops
- Tool execution concurrency is now configurable without custom plumbing
- MCP tools get server prefixes, killing one whole class of name collisions
Why it matters: If you do not pin a model in your agent config, your agents' behavior changed today. Pin explicitly, then turn the new concurrency knob up. read more.
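The defensive move, sketched below against the SDK's public Agent/Runner surface: pin the model string so a future default flip is a no-op for your code. Treat the max_turns=None behavior as release-notes-only until you have verified it yourself:

```python
from agents import Agent, Runner

# Pin the model explicitly so a changed SDK default cannot silently swap
# the brain under your agent.
agent = Agent(
    name="deploy-helper",
    instructions="You help with AWS deployments.",
    model="gpt-5.4-mini",  # pinned deliberately, not inherited from the SDK
)

# Per the v0.16.0 notes, max_turns=None now lifts the loop ceiling; a finite
# cap is still the safer choice for anything unattended.
result = Runner.run_sync(agent, "Summarize yesterday's deploy failures.", max_turns=20)
print(result.final_output)
```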
SDK MILESTONE
anthropic-sdk-python v0.100.0 makes Managed Agents first-class
via GitHub
The Python SDK stopped pretending Managed Agents was a side product. v0.100.0 (yes, three digits past the dot) ships typed support for multi-agent orchestration, outcomes graders, webhooks, and vault validation, all as first-class types instead of a hand-rolled HTTP client you maintain yourself.
That makes the SDK surface for the Managed Agents feature Anthropic announced at Code w/ Claude actually usable, and sets the contract the next round of frameworks will build against.
- Multiagent orchestration types are exported and typed end to end
- Outcomes graders get first-class SDK methods, not raw JSON payloads
- Webhook delivery and vault validation typed in the same release
Why it matters: If you have been waiting to wire Managed Agents into a Python codebase, this is the version that does not require you to write the HTTP layer yourself. LangChain and pydantic-ai will be backporting against this contract within the week. read more.
AGENT MEMORY
Claude agents now "dream" and grade their own homework
via SiliconANGLE
Anthropic just shipped the two missing primitives every team running production agents has been duct-taping for a year. Managed Agents now retain cross-run memory of recurring mistakes and shared team preferences (the marketing word is "dreaming"), and Outcomes lets you define a success bar with a separate grader model.
Netflix and Wisedocs are the cited launch users. The pitch: agents stop repeating the same mistake at run 47 because they remember it from run 12, and you stop reviewing every output by hand because a grader model decides whether the run actually cleared the bar.
- Cross-run memory persists corrections and team preferences between sessions
- Outcomes graders replace ad-hoc human review with a programmatic success bar
- Netflix and Wisedocs are the named launch deployments
Why it matters: "The agent keeps making the same mistake" and "how do we know it actually succeeded" are the two questions every prod-agent post-mortem starts with. They now have first-party primitives instead of homegrown glue. read more.
PRACTITIONER PULSE
Simon Willison: the line between vibe coding and agentic engineering just collapsed
The writer who popularized "vibe coding" is publicly admitting he has stopped reviewing the code his agents write, even for production systems. The post hit 646 points and 721 comments on HN inside 21 hours, which tells you he is not the only one who quietly noticed.
Willison's framing: "If you can go from producing 200 lines of code a day to 2,000 lines of code a day, what else breaks?" The answer is that review and accountability scale at 1x while throughput just went 10x, and his own discipline collapsed when he was not looking.
- Willison concedes he no longer reads model-generated production code in full
- Frames the gap as a "normalization of deviance" risk creeping into every agent-using team
- HN thread: 646 points, 721 comments in 21 hours
Why it matters: Set a review policy for your team before your team sets one for you, because the policy your team is converging on is "ship it, the agent looked confident." read more.
TERM OF THE DAY
Term of the day
Context engineering
Definition: The discipline of deciding what goes into (and stays out of) an agent's context window across a multi-turn run, so the model has what it needs to act without drowning in noise it does not need.
The phrase started replacing "prompt engineering" in practitioner posts about six months ago, once people noticed the real bottleneck on long agent runs is not the prompt template but the cumulative token spend of tool outputs, retrieved docs, and scratchpad notes. It is now the umbrella term for system-prompt curation, tool-output sandboxing, retrieval design, and run-time pruning.
Seen in the wild: Cursor 3.3 just shipped a per-agent context-usage breakdown across rules, skills, MCPs, and subagents, and Salesforce's Data 360 MCP server collapses ~200 REST endpoints into three facade tools to keep the window from blowing up.
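One lever, as a library-free sketch: run-time pruning that caps each tool output and evicts the oldest non-system turns once a token budget is exceeded. The budgets and the 4-chars-per-token estimate are assumptions to tune:

```python
MAX_TOOL_CHARS = 4_000     # cap per tool output (assumption)
BUDGET_TOKENS = 100_000    # total context budget (assumption)

def estimate_tokens(text: str) -> int:
    return len(text) // 4  # rough heuristic; use a real tokenizer in production

def prune(history: list[dict]) -> list[dict]:
    pruned = []
    for msg in history:
        content = msg["content"]
        # Sandbox noisy tool output: keep the head, drop the rest.
        if msg.get("role") == "tool" and len(content) > MAX_TOOL_CHARS:
            content = content[:MAX_TOOL_CHARS] + "\n[... truncated ...]"
        pruned.append({**msg, "content": content})
    # Evict the oldest non-system turns until the run fits the budget.
    while sum(estimate_tokens(m["content"]) for m in pruned) > BUDGET_TOKENS:
        for i, m in enumerate(pruned):
            if m.get("role") != "system":
                del pruned[i]
                break
        else:
            break  # only the system prompt is left; nothing more to drop
    return pruned
```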
WHAT ELSE IS SHIPPING
What else is shipping
- openai-agents-python v0.16.1 - same-day patch stabilizing chat-completions stream output indexes and tightening MCP policy validation.
- langchain v1.3.0a2 - first 1.3 alpha with schema-resolution fixes and v3 streaming events.
- openai-python v2.35.0 / v2.35.1 - image-generation API refresh, removal of the legacy Python CLI; a same-day patch fixes an imagegen size enum regression.
- pydantic-ai v1.91.0 - adds gpt-image-2 options and DeepSeek v4-flash / v4-pro variants, fixes MCP history replay with empty tool arguments.
- agno v2.6.5 - Gemini Multimodal File Search, Gmail/Calendar context providers, plus a security fix for an IDOR on AgentOS MCP tool handlers (user_id was not bound).
- Cursor 3.3 - per-agent context-usage breakdown across rules, skills, MCPs, and subagents, the long-awaited "why is my agent context blowing up" surface.
- Salesforce Data 360 MCP Server (Developer Preview) - open-source MCP server collapsing ~200 Data 360 REST operations into three facade tools (search, payload_examples, execute).
- Harvey Legal Agent Benchmark (LAB) - open-source domain-specific agent benchmark for legal work, a template for vertical evals.
INTERESTING CONVERSATIONS
Interesting conversations we're following
- Show HN: Tilde.run, an agent sandbox with a transactional, versioned filesystem on Hacker News - 178 points, 121 comments. GitHub, S3, and Drive mount as a single versioned ~/sandbox; agent runs commit atomically on clean exit or roll back on failure.
- ZAYA1-8B: 8B MoE with 760M active params matching DeepSeek-R1 on math on Hacker News - 46 points, 39 comments doing the verification work on R1-class claims at a fraction of active params.
- ProgramBench: can LMs rebuild programs from scratch? on Hacker News - 87 points, 43 comments debating whether spec-to-program reconstruction is the right eval for coding agents.
- Open weights are quietly closing up, and that's a problem on Lobsters - argument that "open weights" labels are getting steadily more restrictive in practice, relevant to anyone banking on local or OSS models.
- r/ClaudeAI verdict on Claude Design: "container soup" on r/ClaudeAI - within hours of launch, practitioners dubbed the visual signature of pills, cards, serif fonts, and blinking status dots "container soup", and reported that 2-3 prompts can exhaust weekly Pro limits.
- "Gaslightus 4.7": Opus 4.7 regressions thread on r/ClaudeCode - ~1.7K-upvote thread reports Opus 4.7 inventing files and defending hallucinated test results across 10 turns; Anthropic's migration note says 4.7 "takes the instructions literally" so loose prompts now bite.
Also from TinyIdeas Media
Agentic Business
For operators
What’s shipping in agentic AI, decoded for operators. Adoptable today vs. demoware.
Agentic Builders
For engineers
Frameworks, OSS, MCP servers. Concrete releases, not press releases.
Agentic Quality
For QA teams
AI-native testing tools, evals, reliability patterns. No benchmark vibes.
Was this email forwarded to you? Sign up here.