Agentic Builders

2026-05-05

Issue #2 · 5 min read · By Ben

Quiet on framework releases, loud on where agents actually break.

Mornin'. Spent last night watching Claude rearrange clap dynamics in Ableton via a brand-new MCP server, which is either the best or the worst use of a frontier model depending on the hour. Meanwhile a fresh state-of-AI report ran the same class of models on actual financial trades and 21 of 24 configs went unprofitable. So, you know, stick to sidechain compression.

-Ben

In today's newsletter:

  • Claude takes over Ableton
  • Claude Code patches MCP gremlins
  • Where agents quietly fall apart
  • LangGraph cracks open its checkpoints
  • Agent tests escape flaky hell

TOOLS GO WEIRD

An MCP server just turned Ableton Live into a tool call

ableton-mcp-extended repository

via GitHub

The MCP gold rush has, until now, mostly meant "wrap another developer tool in JSON-RPC." This week somebody plugged a professional DAW into Claude.

The ableton-mcp-extended server exposes Ableton Live's playback, MIDI editing, device chains, and audio generation (via an ElevenLabs hookup) as MCP tools. The creator's demo has them telling Claude to "make a self-reflective song" and then iterating with notes like "improve the clap dynamics," with the agent moving MIDI clips and tweaking effect chains in real time.

The interesting bit is not the music: it is that the loop (describe intention, agent acts, listen, iterate) generalizes to any domain that exposes a clean tool API. Treat the DAW as a stand-in for your CAD tool, your video editor, your CRM.

  • covers playback, MIDI, device control, and ElevenLabs audio generation
  • works with Claude or Cursor over MCP
  • HN discussion is live
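
Under the hood that loop is plain JSON-RPC: MCP clients invoke server tools via a `tools/call` request. Here is a minimal sketch of what such a request might look like; the tool name and arguments are hypothetical stand-ins, not taken from the ableton-mcp-extended repo:

```python
import json

def make_tool_call(call_id: int, tool: str, arguments: dict) -> str:
    """Build an MCP-style JSON-RPC 2.0 tools/call request as a JSON string."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": call_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })

# Hypothetical example: ask the DAW server to nudge clap velocities.
req = make_tool_call(1, "set_clip_velocity", {"track": "Claps", "scale": 1.15})
```

Swap the tool name for `update_crm_record` or `trim_video_clip` and nothing else changes, which is the whole point of the generalization argument above.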

Why it matters: this is the template for multimodal creative agents, and it ships as a working OSS server today. read more.


MCP PLUMBING

Claude Code 2.1.128 quietly fixes the MCP papercuts you hate most

Claude Code repository

via Anthropic

If you have ever watched a stdio MCP server eat your quoted arguments and emit cursed JSON, the new Claude Code release reads like a personal apology.

The 2.1.128 changelog is mostly stability work on the MCP client surface. The /mcp command now reports tool counts and flags dead servers, so you can see at a glance which integration silently fell over. Stdio servers stop corrupting arguments that contain spaces or shell metacharacters, which was the bug behind a lot of "my server works in isolation but not in Claude" tickets.

And tool results that mix images with structured content actually preserve the images now, instead of dropping them on the floor when the response also includes JSON.

  • /mcp reports tool counts and flags dead servers
  • stdio argument parsing fixed for spaces and metacharacters
  • image plus structured-content responses no longer drop the image
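
The class of bug behind that second fix is easy to reproduce in isolation: naive whitespace splitting mangles quoted arguments that a POSIX-aware tokenizer handles correctly. A quick illustration with a made-up command line (this is the failure pattern, not Claude Code's actual internals):

```python
import shlex

cmd = 'npx my-server --root "/Users/ben/My Sets" --verbose'

naive = cmd.split()          # breaks the quoted path into two tokens
correct = shlex.split(cmd)   # honors the quotes: one token for the path

print(naive)    # ['npx', 'my-server', '--root', '"/Users/ben/My', 'Sets"', '--verbose']
print(correct)  # ['npx', 'my-server', '--root', '/Users/ben/My Sets', '--verbose']
```

The naive version is exactly the "works in isolation but not in Claude" shape: the server binary sees a path it can never open.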

Why it matters: the broader MCP ecosystem only feels good when the client behaves; this release closes a stack of long-standing client bugs. read more.


FOUND IT

Came across Nathan Benaich's May state-of-AI, and the agent results split clean down the middle

State of AI: May 2026

via Nathan Benaich

Not a release, just a piece I finally got around to reading this week. Nathan Benaich's monthly state-of-AI is the rare report with a load-bearing footnote: agents are great, except where they are catastrophically bad.

On the success side, Anthropic's Project Deal closed 186 transactions across 500+ listings, and Ramp-style procurement workflows shaved 16% off cost. Bounded environment, deterministic outcome, agent wins.

On the failure side, KellyBench-style adversarial financial trading collapsed: 21 of 24 frontier model configurations went unprofitable across the run. Same models, different terrain, dramatically different results.

The takeaway for builders

  • structurally bounded workflows with clear success criteria are the green zone
  • open-ended reasoning over noisy adversarial inputs is still the red zone
  • "agentic" is a property of the environment as much as the model

Worth bookmarking if: you are scoping an agent for production. This is the data you cite when someone wants to point it at the stock market. read more.


CHECKPOINT VISIBILITY

LangGraph finally gives you a peek at writes history

LangGraph repository

via LangChain

Every LangGraph debugger has, at some point, opened the saver internals with a flashlight and a prayer. Pre-release 1.2.0a7 ships an actual API door.

The headline addition is a public get_writes_history() on the saver, which exposes checkpoint write patterns without making you import private modules. It lands alongside a delta cadence rework that promises more efficient state tracking across runs, useful if your graph is fanning out into hundreds of branches per session.

It is still alpha, which means breakage is fair game, but introspection over checkpoint writes has been one of the quiet pain points of running real agentic workloads at any scale.

  • public get_writes_history() saver API for checkpoint introspection
  • delta cadence rework for tracking state changes more efficiently
  • shipping as 1.2.0a7 pre-release on GitHub Releases
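
To get a feel for what checkpoint-write introspection buys you, here is a stdlib-only toy saver that logs every write. The real LangGraph saver and the exact `get_writes_history()` signature will differ; this only sketches the shape of the idea:

```python
from dataclasses import dataclass, field

@dataclass
class WriteRecord:
    """One recorded channel write, tagged by checkpoint."""
    checkpoint_id: str
    channel: str
    value: object

@dataclass
class ToySaver:
    """Toy checkpoint saver that keeps an inspectable log of writes."""
    _history: list = field(default_factory=list)

    def put_writes(self, checkpoint_id: str, writes: dict) -> None:
        for channel, value in writes.items():
            self._history.append(WriteRecord(checkpoint_id, channel, value))

    def get_writes_history(self, checkpoint_id=None) -> list:
        """Return all writes, or only those for one checkpoint."""
        if checkpoint_id is None:
            return list(self._history)
        return [w for w in self._history if w.checkpoint_id == checkpoint_id]

saver = ToySaver()
saver.put_writes("ckpt-1", {"messages": ["hi"], "plan": "draft"})
saver.put_writes("ckpt-2", {"messages": ["hi", "there"]})
```

With that log in hand, "what did this run actually write, and when" becomes a filter, not an archaeology dig through private modules.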

Why it matters: if you maintain a LangGraph deployment, you can finally answer "what did this run actually write, and when" without forking the framework. read more.


TEST DETERMINISM

TrainForgeTester wants to end the LLM-judge coin flip

TrainForgeTester repository

via GitHub

Agent testing has been a choose-your-own-misery: either you let an LLM grade outputs on a 0 to 1 scale that drifts every run, or you write golden-path string matches that shatter on a stray whitespace. TrainForgeTester picks a third door.

The v0.1.0 release splits validation into two lanes. Anything structural (tool calls, arguments, ordering) is checked in plain Python equality, so flakiness has nowhere to hide. Anything semantic (was the response on-topic, did the assistant stay in character) gets handed to an LLM, but only as a binary yes-or-no question instead of a fuzzy score.

It is opinionated in a useful way: validate what the agent does deterministically, validate what it says with a yes/no judge, and stop pretending the rubric in between is reliable.

  • structural checks via Python equality on tool invocations and arguments
  • LLM judgment limited to binary semantic questions, not 0-1 scales
  • v0.1.0 source available on GitHub
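
The two-lane split is easy to copy even without the library. A sketch of the pattern, with a stub judge standing in for a real LLM call; the function names are mine, not TrainForgeTester's API:

```python
def check_structural(expected_calls, actual_calls) -> bool:
    """Lane 1: deterministic -- plain equality on tool calls and arguments."""
    return expected_calls == actual_calls

def check_semantic(question: str, response: str, judge) -> bool:
    """Lane 2: hand the LLM a binary yes/no question, never a 0-1 rubric."""
    answer = judge(f"Answer strictly YES or NO. {question}\n\nResponse:\n{response}")
    return answer.strip().upper().startswith("YES")

def stub_judge(prompt: str) -> str:
    # Stand-in for a model call, just for illustration.
    return "YES" if "refund" in prompt else "NO"

expected = [{"tool": "lookup_order", "args": {"order_id": "A-42"}}]
actual = [{"tool": "lookup_order", "args": {"order_id": "A-42"}}]

ok_structure = check_structural(expected, actual)
ok_semantics = check_semantic(
    "Did the assistant address the refund request?",
    "Your refund for order A-42 has been issued.",
    stub_judge,
)
```

Everything in lane 1 fails loudly and reproducibly; lane 2 can still flake, but a yes/no flip is far easier to spot and re-run than a score drifting from 0.71 to 0.64.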

Why it matters: if your CI is currently green-then-red-then-green for the same agent, this is the pattern to copy. read more.


PRIME NUMBER

21 of 24

Frontier model trading configurations that went unprofitable in the report's KellyBench-style adversarial trading runs, a hard data point on where agentic systems still fall over.

  • same study logged 186 successful transactions across 500+ listings on the bounded side
  • procurement workflows in the report showed 16% cost reductions
  • failure mode clusters on open-ended, adversarial reasoning, not on tool use

via Nathan Benaich


WHAT ELSE IS SHIPPING

  • CrewAI 1.14.5a2 - pre-release restores task output in finally blocks, preserves outputs across async batch flushes, and stops LLM stop-words from leaking between agents.

INTERESTING CONVERSATIONS

Interesting conversations we're following

  • Agent-Eval (Claude Skill) on Hacker News - a Show HN for a Claude Skill that builds eval systems for agents, aimed at teams without dedicated evals infra.
  • Agent Skills on Hacker News - debate on structuring agent instructions as markdown workflows vs. prose essays, with side arguments about context token cost.
  • Audience of One on isene.org - one engineer building a custom Rust/Assembly desktop OS with Claude Code, arguing personal software is now a few-evenings project.

Also from TinyIdeas Media