Agentic Builders
2026-05-07
Busy day on the agent beat: AWS MCP goes GA, two flagship SDKs ship majors, and Claude agents start dreaming.
Mornin'. I spent the morning ripping out a hand-rolled boto3 wrapper because AWS just shipped one MCP tool, call_aws, that points at 15,000+ AWS API operations behind a single GA announcement. It is genuinely funny watching a year of "let me write a custom tool for that" evaporate before lunch. While you are at it, pin your model defaults today: OpenAI's Agents SDK just quietly swapped the out-of-the-box brain on every agent you forgot to lock down.
-Ben
In today's newsletter:
- AWS MCP swallows 15,000 APIs
- OpenAI Agents flips its defaults
- Anthropic SDK lands Managed Agents
- Claude agents start dreaming
- Simon admits the line collapsed
PLATFORM PLAY
AWS MCP Server hits GA with 15,000 APIs behind one tool
via Amazon Web Services
AWS just stuffed its entire control plane into a single tool call. The MCP server graduated from preview, and call_aws is now a one-stop drive-thru window for 15,000+ AWS API operations, IAM-scoped and ready for any MCP-aware agent to chew through.
Out of preview, the server picks up IAM context-key support, a sandboxed Python run_script tool for "go compute this thing without making me write a Lambda", and a curated "Skills" model that replaces preview's Agent SOPs with AWS-maintained guidance. It runs in us-east-1 and eu-central-1 at no extra charge.
What's actually new versus preview
- Single call_aws tool fronts the full AWS API surface, scoped per request via IAM context keys
- New run_script tool gives agents sandboxed Python without you provisioning the infra
- "Skills" replaces Agent SOPs with curated, AWS-maintained workflows the model can pull from
Why it matters: Agents in Claude Code, Cursor, and Kiro can finally hit current AWS APIs with scoped credentials and AWS-blessed docs, instead of you babysitting boto3 wrappers and feeding the model stale 2024 API shapes. read more.
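For a feel of the shape, here is a minimal client-side sketch using the open-source MCP Python SDK. The server launch command and the call_aws argument schema are assumptions on my part, so check the GA docs before copying:

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Hypothetical launch command for the AWS MCP server; the real binary name
# and flags are whatever the GA docs say.
server = StdioServerParameters(command="aws-mcp-server", args=["--region", "us-east-1"])

async def main():
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # call_aws fronts the full API surface; this argument shape is
            # an illustrative assumption, not the documented schema.
            result = await session.call_tool(
                "call_aws",
                {"service": "s3", "operation": "ListBuckets", "parameters": {}},
            )
            print(result.content)

asyncio.run(main())
```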
DEFAULT DRIFT
OpenAI Agents SDK quietly rewires every new agent
via GitHub
A minor-looking version bump, a totally different default brain. openai-agents-python v0.16.0 swaps the SDK's stock model to gpt-5.4-mini, lets you pass max_turns=None to disable the loop ceiling, and exposes tool execution concurrency as a first-class config knob.
It also adds server-prefixed MCP tool naming, which sounds tiny until the moment you wire up two MCP servers that both export a search tool. Anyone riding the defaults inherits all of the above on the next install.
- Default model flips to gpt-5.4-mini on fresh agents
- max_turns=None removes the turn ceiling for long-running loops
- Tool execution concurrency is now configurable without custom plumbing
- MCP tools get server prefixes, killing one whole class of name collisions
Why it matters: If you do not pin a model in your agent config, your agents' behavior changed today. Pin explicitly, then turn the new concurrency knob up. read more.
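The defensive move, sketched below against the SDK's public Agent/Runner surface: pin the model string so a future default flip is a no-op for your code. Treat the max_turns=None behavior as release-notes-only until you have verified it yourself:

```python
from agents import Agent, Runner

# Pin the model explicitly so a changed SDK default cannot silently swap
# the brain under your agent.
agent = Agent(
    name="deploy-helper",
    instructions="You help with AWS deployments.",
    model="gpt-5.4-mini",  # pinned deliberately, not inherited from the SDK
)

# Per the v0.16.0 notes, max_turns=None now lifts the loop ceiling; a finite
# cap is still the safer choice for anything unattended.
result = Runner.run_sync(agent, "Summarize yesterday's deploy failures.", max_turns=20)
print(result.final_output)
```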
SDK MILESTONE
anthropic-sdk-python v0.100.0 makes Managed Agents first-class
via GitHub
The Python SDK stopped pretending Managed Agents was a side product. v0.100.0 (yes, three digits past the dot) ships typed support for multi-agent orchestration, outcomes graders, webhooks, and vault validation, all as first-class types instead of a hand-rolled HTTP client you maintain yourself.
That makes the SDK surface for the Managed Agents feature Anthropic announced at Code w/ Claude actually usable, and sets the contract the next round of frameworks will build against.
- Multiagent orchestration types are exported and typed end to end
- Outcomes graders get first-class SDK methods, not raw JSON payloads
- Webhook delivery and vault validation typed in the same release
Why it matters: If you have been waiting to wire Managed Agents into a Python codebase, this is the version that does not require you to write the HTTP layer yourself. LangChain and pydantic-ai will be backporting against this contract within the week. read more.
AGENT MEMORY
Claude agents now "dream" and grade their own homework
via SiliconANGLE
Anthropic just shipped the two missing primitives every team running production agents has been duct-taping for a year. Managed Agents now retain cross-run memory of recurring mistakes and shared team preferences (the marketing word is "dreaming"), and Outcomes lets you define a success bar with a separate grader model.
Netflix and Wisedocs are the cited launch users. The pitch: agents stop repeating the same mistake at run 47 because they remember it from run 12, and you stop reviewing every output by hand because a grader model decides whether the run actually cleared the bar.
- Cross-run memory persists corrections and team preferences between sessions
- Outcomes graders replace ad-hoc human review with a programmatic success bar
- Netflix and Wisedocs are the named launch deployments
Why it matters: "The agent keeps making the same mistake" and "how do we know it actually succeeded" are the two questions every prod-agent post-mortem starts with. They now have first-party primitives instead of homegrown glue. read more.
PRACTITIONER PULSE
Simon Willison: the line between vibe coding and agentic engineering just collapsed
The writer who popularized "vibe coding" is publicly admitting he has stopped reviewing the code his agents write, even for production systems. The post hit 646 points and 721 comments on HN inside 21 hours, which tells you he is not the only one who quietly noticed.
Willison's framing: "If you can go from producing 200 lines of code a day to 2,000 lines of code a day, what else breaks?" The answer is that review and accountability scale at 1x while throughput just went 10x, and his own discipline collapsed when he was not looking.
- Willison concedes he no longer reads model-generated production code in full
- Frames the gap as a "normalization of deviance" risk creeping into every agent-using team
- HN thread: 646 points, 721 comments in 21 hours
Why it matters: Set a review policy for your team before your team sets one for you, because the policy your team is converging on is "ship it, the agent looked confident." read more.
TERM OF THE DAY
Term of the day
Context engineering
Definition: The discipline of deciding what goes into (and stays out of) an agent's context window across a multi-turn run, so the model has what it needs to act without drowning in noise it does not need.
The phrase started replacing "prompt engineering" in practitioner posts about six months ago, once people noticed the real bottleneck on long agent runs is not the prompt template but the cumulative token spend of tool outputs, retrieved docs, and scratchpad notes. It is now the umbrella term for system-prompt curation, tool-output sandboxing, retrieval design, and run-time pruning.
Seen in the wild: Cursor 3.3 just shipped a per-agent context-usage breakdown across rules, skills, MCPs, and subagents, and Salesforce's Data 360 MCP server collapses ~200 REST endpoints into three facade tools to keep the window from blowing up.
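One lever, as a library-free sketch: run-time pruning that caps each tool output and evicts the oldest non-system turns once a token budget is exceeded. The budgets and the 4-chars-per-token estimate are assumptions to tune:

```python
MAX_TOOL_CHARS = 4_000     # cap per tool output (assumption)
BUDGET_TOKENS = 100_000    # total context budget (assumption)

def estimate_tokens(text: str) -> int:
    return len(text) // 4  # rough heuristic; use a real tokenizer in production

def prune(history: list[dict]) -> list[dict]:
    pruned = []
    for msg in history:
        content = msg["content"]
        # Sandbox noisy tool output: keep the head, drop the rest.
        if msg.get("role") == "tool" and len(content) > MAX_TOOL_CHARS:
            content = content[:MAX_TOOL_CHARS] + "\n[... truncated ...]"
        pruned.append({**msg, "content": content})
    # Evict the oldest non-system turns until the run fits the budget.
    while sum(estimate_tokens(m["content"]) for m in pruned) > BUDGET_TOKENS:
        for i, m in enumerate(pruned):
            if m.get("role") != "system":
                del pruned[i]
                break
        else:
            break  # only the system prompt is left; nothing more to drop
    return pruned
```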
WHAT ELSE IS SHIPPING
What else is shipping
- openai-agents-python v0.16.1 - same-day patch stabilizing chat-completions stream output indexes and tightening MCP policy validation.
- langchain v1.3.0a2 - first 1.3 alpha with schema-resolution fixes and v3 streaming events.
- openai-python v2.35.0 / v2.35.1 - image-generation API refresh, removal of the legacy Python CLI; a same-day patch fixes an imagegen size enum regression.
- pydantic-ai v1.91.0 - adds gpt-image-2 options and DeepSeek v4-flash / v4-pro variants, fixes MCP history replay with empty tool arguments.
- agno v2.6.5 - Gemini Multimodal File Search, Gmail/Calendar context providers, plus a security fix for an IDOR on AgentOS MCP tool handlers (user_id was not bound).
- Cursor 3.3 - per-agent context-usage breakdown across rules, skills, MCPs, and subagents, the long-awaited "why is my agent context blowing up" surface.
- Salesforce Data 360 MCP Server (Developer Preview) - open-source MCP server collapsing ~200 Data 360 REST operations into three facade tools (search, payload_examples, execute).
- Harvey Legal Agent Benchmark (LAB) - open-source domain-specific agent benchmark for legal work, a template for vertical evals.
INTERESTING CONVERSATIONS
Interesting conversations we're following
- Show HN: Tilde.run, an agent sandbox with a transactional, versioned filesystem on Hacker News - 178 points, 121 comments. GitHub, S3, and Drive mount as a single versioned ~/sandbox; agent runs commit atomically on clean exit or roll back on failure.
- ZAYA1-8B: 8B MoE with 760M active params matching DeepSeek-R1 on math on Hacker News - 46 points, 39 comments doing the verification work on R1-class claims at a fraction of active params.
- ProgramBench: can LMs rebuild programs from scratch? on Hacker News - 87 points, 43 comments debating whether spec-to-program reconstruction is the right eval for coding agents.
- Open weights are quietly closing up, and that's a problem on Lobsters - argument that "open weights" labels are getting steadily more restrictive in practice, relevant to anyone banking on local or OSS models.
- r/ClaudeAI verdict on Claude Design: "container soup" on r/ClaudeAI - within hours of launch, practitioners dubbed the visual signature of pills, cards, serif fonts, and blinking status dots "container soup", and reported that 2-3 prompts can exhaust weekly Pro limits.
- "Gaslightus 4.7": Opus 4.7 regressions thread on r/ClaudeCode - ~1.7K-upvote thread reports Opus 4.7 inventing files and defending hallucinated test results across 10 turns; Anthropic's migration note says 4.7 "takes the instructions literally" so loose prompts now bite.
Also from TinyIdeas Media
Agentic Business
For operators
What’s shipping in agentic AI, decoded for operators. Adoptable today vs. demoware.
Agentic Builders
For engineers
Frameworks, OSS, MCP servers. Concrete releases, not press releases.
Agentic Quality
For QA teams
AI-native testing tools, evals, reliability patterns. No benchmark vibes.
Was this email forwarded to you? Sign up here.