Agentic Builders
2026-05-05
Quiet on framework releases, loud on where agents actually break.
Mornin'. Spent last night watching Claude rearrange clap dynamics in Ableton via a brand-new MCP server, which is either the best or the worst use of a frontier model depending on the hour. Meanwhile a fresh state-of-AI report ran the same class of models on actual financial trades and 21 of 24 configs went unprofitable. So, you know, stick to sidechain compression.
-Ben
In today's newsletter:
- Claude takes over Ableton
- Claude Code patches MCP gremlins
- Where agents quietly fall apart
- LangGraph cracks open its checkpoints
- Agent tests escape flaky hell
TOOLS GO WEIRD
An MCP server just turned Ableton Live into a tool call
via GitHub
The MCP gold rush has, until now, mostly meant "wrap another developer tool in JSON-RPC." This week somebody plugged a professional DAW into Claude.
The ableton-mcp-extended server exposes Ableton Live's playback, MIDI editing, device chains, and audio generation (via an ElevenLabs hookup) as MCP tools. The creator's demo has them telling Claude to "make a self-reflective song" and then iterating with notes like "improve the clap dynamics," with the agent moving MIDI clips and tweaking effect chains in real time.
The interesting bit is not the music: it is that the loop (describe intention, agent acts, listen, iterate) generalizes to any domain that exposes a clean tool API. Treat the DAW as a stand-in for your CAD tool, your video editor, your CRM.
- covers playback, MIDI, device control, and ElevenLabs audio generation
- works with Claude or Cursor over MCP
- HN discussion is live
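For a sense of the shape, here is a minimal sketch of an MCP tool using the official Python SDK's FastMCP helper. The tool name and body are hypothetical stand-ins; the real ableton-mcp-extended server exposes a much richer surface against Live's control API.

```python
# Hypothetical sketch of the MCP-tool pattern, not the real server's code.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("daw-demo")

@mcp.tool()
def scale_clip_velocities(track: int, clip: int, factor: float) -> str:
    """Scale MIDI note velocities in a clip, e.g. to soften clap dynamics."""
    # A real server would forward this to the DAW's control API; returning a
    # plain-text result is enough for the agent's describe/act/listen loop.
    return f"Scaled velocities in track {track}, clip {clip} by {factor:.2f}"

if __name__ == "__main__":
    mcp.run(transport="stdio")  # Claude or Cursor connects over stdio
```

Swap the tool body for a CAD, video-editing, or CRM call and the loop from the demo carries over unchanged.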
Why it matters: this is the template for multimodal creative agents, and it ships as a working OSS server today. read more.
MCP PLUMBING
Claude Code 2.1.128 quietly fixes the MCP papercuts you hate most
via Anthropic
If you have ever watched a stdio MCP server eat your quoted arguments and emit cursed JSON, the new Claude Code release reads like a personal apology.
The 2.1.128 changelog is mostly stability work on the MCP client surface. The /mcp command now reports tool counts and flags dead servers, so you can see at a glance which integration silently fell over. Stdio servers stop corrupting arguments that contain spaces or shell metacharacters, which was the bug behind a lot of "my server works in isolation but not in Claude" tickets.
And tool results that mix images with structured content actually preserve the images now, instead of dropping them on the floor when the response also includes JSON.
- /mcp reports tool counts and flags dead servers
- stdio argument parsing fixed for spaces and metacharacters
- image plus structured-content responses no longer drop the image
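The changelog does not spell out the root cause, but the argument-corruption bug is a familiar class: splitting a configured command string on whitespace instead of applying shell quoting rules. An illustrative Python sketch of the failure mode, not Claude Code's actual client code:

```python
# Naive whitespace splitting corrupts quoted argv entries; shlex honors
# shell quoting, so spaces and metacharacters survive intact.
import shlex

cmd = 'node server.js --workspace "My Projects/demo" --filter "a && b"'

print(cmd.split())
# ['node', 'server.js', '--workspace', '"My', 'Projects/demo"', ...]

print(shlex.split(cmd))
# ['node', 'server.js', '--workspace', 'My Projects/demo', '--filter', 'a && b']
```

The second form is what a stdio server expects to see in its argv; anything resembling the first explains "works in isolation, breaks in the client."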
Why it matters: the broader MCP ecosystem only feels good when the client behaves; this release closes a stack of long-standing client bugs. read more.
FOUND IT
Came across Nathan Benaich's May state-of-AI, and the agent results split clean down the middle
via Nathan Benaich
Not a release, just a piece I finally got around to reading this week. Nathan Benaich's monthly state-of-AI is the rare report with a load-bearing footnote: agents are great, except where they are catastrophically bad.
On the success side, Anthropic's Project Deal closed 186 transactions across 500+ listings, and Ramp-style procurement workflows shaved 16% off costs. Bounded environment, deterministic outcome, agent wins.
On the failure side, KellyBench-style adversarial financial trading collapsed: 21 of 24 frontier model configurations went unprofitable across the run. Same models, different terrain, dramatically different results.
The takeaway for builders
- structurally bounded workflows with clear success criteria are the green zone
- open-ended reasoning over noisy adversarial inputs is still the red zone
- "agentic" is a property of the environment as much as the model
Worth bookmarking if: you are scoping an agent for production. This is the data you cite when someone wants to point it at the stock market. read more.
CHECKPOINT VISIBILITY
LangGraph finally gives you a peek at writes history
via LangChain
Every LangGraph debugger has, at some point, opened the saver internals with a flashlight and a prayer. Pre-release 1.2.0a7 ships an actual API door.
The headline addition is a public get_writes_history() on the saver, which exposes checkpoint write patterns without making you import private modules. It lands alongside a delta cadence rework that promises more efficient state tracking across runs, useful if your graph is fanning out into hundreds of branches per session.
It is still alpha, which means breakage is fair game, but introspection over checkpoint writes has been one of the quiet pain points of running real agentic workloads at any scale.
- public get_writes_history() saver API for checkpoint introspection
- delta cadence rework for tracking state changes more efficiently
- shipping as 1.2.0a7 pre-release on GitHub Releases
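Usage should look roughly like the sketch below. The method name comes from the release notes; the signature is an assumption (scoping by thread config, like the saver's other lookup methods) and may shift before 1.2.0 stabilizes.

```python
# Sketch against the 1.2.0a7 pre-release; get_writes_history() is named in
# the release notes, but the exact signature here is an assumption.
from langgraph.checkpoint.memory import MemorySaver

saver = MemorySaver()
# ... compile your graph with checkpointer=saver and run a few sessions ...

config = {"configurable": {"thread_id": "session-42"}}
for write in saver.get_writes_history(config):
    print(write)  # what each checkpoint write recorded, and when
```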
Why it matters: if you maintain a LangGraph deployment, you can finally answer "what did this run actually write, and when" without forking the framework. read more.
TEST DETERMINISM
TrainForgeTester wants to end the LLM-judge coin flip
via GitHub
Agent testing has been a choose-your-own-misery: either you let an LLM grade outputs on a 0 to 1 scale that drifts every run, or you write golden-path string matches that shatter on a stray whitespace. TrainForgeTester picks a third door.
The v0.1.0 release splits validation into two lanes. Anything structural (tool calls, arguments, ordering) is checked in plain Python equality, so flakiness has nowhere to hide. Anything semantic (was the response on-topic, did the assistant stay in character) gets handed to an LLM, but only as a binary yes-or-no question instead of a fuzzy score.
It is opinionated in a useful way: validate what the agent does deterministically, validate what it says with a yes/no judge, and stop pretending the rubric in between is reliable.
- structural checks via Python equality on tool invocations and arguments
- LLM judgment limited to binary semantic questions, not 0-1 scales
- v0.1.0 source available on GitHub
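TrainForgeTester's own API isn't reproduced here, but the two-lane pattern it implements is easy to sketch. Everything below is generic: the expected trace and the ask_judge callable are hypothetical.

```python
# Generic sketch of the two-lane pattern, not TrainForgeTester's actual API.

def check_structure(trace: list[dict]) -> bool:
    """Deterministic lane: tool names, arguments, ordering checked exactly."""
    expected = [
        {"tool": "search_listings", "args": {"query": "warehouse", "max": 10}},
        {"tool": "create_offer", "args": {"listing_id": "L-7", "amount": 4200}},
    ]
    return trace == expected  # plain equality: no rubric, nowhere to drift

def check_semantics(response: str, ask_judge) -> bool:
    """Semantic lane: the judge gets a binary question, never a 0-1 scale."""
    verdict = ask_judge(
        "Answer strictly YES or NO: does this response stay on-topic for a "
        "procurement negotiation?\n\n" + response
    )
    return verdict.strip().upper().startswith("YES")
```

Only the second lane touches a model, and even that collapses to a boolean, which is what keeps CI from flipping color on identical agent behavior.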
Why it matters: if your CI is currently green-then-red-then-green for the same agent, this is the pattern to copy. read more.
PRIME NUMBER
Prime number
21 of 24
Frontier model trading configurations that went unprofitable in the adversarial trading benchmark from Nathan Benaich's May state-of-AI report, a hard data point on where agentic systems still fall over.
- the same report logged 186 successful transactions across 500+ listings on the bounded side (Anthropic's Project Deal)
- procurement workflows in the report showed 16% cost reductions
- failure mode clusters on open-ended, adversarial reasoning, not on tool use
WHAT ELSE IS SHIPPING
What else is shipping
- CrewAI 1.14.5a2 - pre-release restores task output in finally blocks, preserves outputs across async batch flushes, and stops LLM stop-words from leaking between agents.
INTERESTING CONVERSATIONS
Interesting conversations we're following
- Agent-Eval (Claude Skill) on Hacker News - a Show HN for a Claude Skill that builds eval systems for agents, aimed at teams without dedicated evals infra.
- Agent Skills on Hacker News - debate on structuring agent instructions as markdown workflows vs. prose essays, with side arguments about context token cost.
- Audience of One on isene.org - one engineer building a custom Rust/Assembly desktop OS with Claude Code, arguing personal software is now a few-evenings project.