Agentic Business
2026-05-07
Code w/ Claude 2026 dominated the day: a SpaceX-backed capacity unlock, a Managed Agents SDK landing, and serious new public benchmarks from Harvey and Grafana.
Mornin'. Anthropic doubled Claude Code's rate limits this morning and pinned the credit on 220,000 fresh NVIDIA GPUs sitting in a SpaceX-built facility called Colossus 1. If your agent loop has spent the last six months pinging the 5-hour cap by lunchtime, today is the first day it gets to keep working. The compute arms race is now visible from orbit.
-Ben
In today's newsletter:
- SpaceX GPUs juice Claude caps
- Managed Agents land in Python SDK
- Harvey's all-pass legal benchmark
- Grafana grades observability agents
- DeepMind buys into EVE Online
RATE LIMIT REPRIEVE
Anthropic doubles Claude Code limits, credits a SpaceX compute deal
via Anthropic
Anthropic just yanked the leash off Claude Code's rate limits, and the slack came pre-packaged with a rocket company's logo on the box.
At the Code w/ Claude 2026 keynote, Anthropic announced 2x the per-window quota on Pro, Max, Team and Enterprise tiers and killed peak-hour throttling for paid Pro and Max users. Opus API rate limits got a meaningful bump in the same swing.
The headroom comes from a freshly signed SpaceX Colossus 1 deal that puts 300+ MW of capacity, roughly 220,000 NVIDIA GPUs, online inside the next month.
- 2x per-window quota across Pro, Max, Team and Enterprise
- Peak-hour reductions removed for Pro and Max
- 300+ MW / 220,000+ GPUs spinning up within 30 days, per Anthropic's announcement
Why it matters: the single loudest complaint about Claude Code (hitting the 5-hour cap mid-session and choking during US business hours) just got materially addressed for paying users. Read more.
SDK MILESTONE
anthropic-sdk-python crosses v0.100 with Managed Agents support
via GitHub
Anthropic's Python SDK quietly hit a round-number version, and the headline isn't the digit; it's the multi-agent plumbing it now exposes.
v0.100.0 adds first-class Managed Agents APIs for orchestrating multiple agents and capturing their outcomes, alongside webhook configuration changes and vault validation. It's the canonical SDK catching up to the hosted product surface Anthropic has been previewing for months.
Translation for builders: you can now spawn, observe and react to multi-agent runs without writing your own orchestrator shim or scraping a private API.
- Managed Agents APIs for multi-agent runs and their outcomes are now first-class in the SDK
- Webhook config adjusted in the same release for event-driven agent flows
- Full release notes on GitHub Releases
Why it matters: teams building on Anthropic's hosted agent stack get supported multi-agent orchestration in the box, which makes "did anyone here roll their own runner" a much rarer standup question. Read more.
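To make "orchestrator shim" concrete, here is a minimal sketch of the kind of hand-rolled runner teams have been writing themselves, the pattern the SDK's Managed Agents surface is meant to replace. All names and structures here are illustrative, not the SDK's actual API:

```python
# A hand-rolled multi-agent runner: spawn agents, capture outcomes,
# fire a webhook-style callback per result. Illustrative only.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable

@dataclass
class Outcome:
    agent: str
    status: str   # "ok" or "error"
    result: str

def run_agents(tasks: dict[str, Callable[[], str]],
               on_outcome: Callable[[Outcome], None]) -> list[Outcome]:
    """Run each agent concurrently, record its outcome, notify a callback."""
    outcomes = []
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn) for name, fn in tasks.items()}
        for name, fut in futures.items():
            try:
                outcomes.append(Outcome(name, "ok", fut.result()))
            except Exception as exc:
                outcomes.append(Outcome(name, "error", str(exc)))
    for outcome in outcomes:
        on_outcome(outcome)   # stand-in for a webhook delivery
    return outcomes

seen: list[Outcome] = []
results = run_agents(
    {"researcher": lambda: "3 sources found",
     "writer": lambda: "draft complete"},
    on_outcome=seen.append,
)
```

With the v0.100 release, this bookkeeping (spawning, outcome capture, event delivery) moves into the supported SDK surface instead of living in everyone's private utils module.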
LEGAL BENCHMARK
Harvey ships LAB, a brutal all-pass benchmark for legal agents
via Harvey
Harvey just dropped a public bar exam for legal agents, and the grading rubric is the kind of thing that makes vendor demo decks sweat.
The Legal Agent Benchmark (LAB) covers 1,250+ agent tasks across 24 practice areas, scored against 75,000+ expert-written rubric criteria. The twist: it uses an "all-pass" model, where a task only counts as complete if every single criterion is satisfied. One M&A example carries 57 criteria across 9 issues.
That's a hard departure from short-horizon contract Q&A benchmarks, which have started to look like saturated flashcards. LAB asks whether an agent can produce the kind of long-form work-product a partner would actually delegate.
- 1,250+ tasks, 24 practice areas, 75,000+ rubric criteria
- All-pass scoring: miss one criterion, fail the task
- Methodology and sample tasks on the Harvey blog
Why it matters: there's now a public, externally legible bar for any agent claiming to do "real legal work," and the all-pass rule is harsh enough to expose how brittle today's pipelines are over long, document-heavy assignments. Read more.
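All-pass scoring is easy to state in code, which is part of why it bites so hard. A generic sketch of the scoring rule (not Harvey's implementation):

```python
def all_pass_score(tasks: list[list[bool]]) -> float:
    """All-pass scoring: a task counts as complete only if every
    rubric criterion on it is satisfied. Each task is a list of
    per-criterion pass/fail booleans."""
    passed = sum(1 for criteria in tasks if all(criteria))
    return passed / len(tasks)

# A task with 57 criteria fails outright if even one is missed:
tasks = [
    [True] * 57,            # perfect M&A task -> pass
    [True] * 56 + [False],  # 56 of 57 criteria -> still a fail
]
all_pass_score(tasks)  # 0.5
```

Under partial-credit scoring that second task would score 98%; under all-pass it scores zero, which is exactly the gap between "mostly right" and "delegable."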
SRE SHOWDOWN
Grafana's o11y-bench scores agents on real observability work
via Grafana Labs
Grafana stood up an open benchmark for SRE agents at GrafanaCON 2026, and the early scoreboard delivers a quietly inconvenient verdict: reliability beats raw smarts.
o11y-bench grades agents on the actual on-call workflow: querying metrics, correlating logs and traces, hypothesis-testing against a live system. It's vendor-neutral, public, and aimed straight at one of the few "agents in production" use cases people are willing to put on a status page.
The early leaderboard finds that consistency, not capability, separates the top tier. Claude Opus 4.7 (high reasoning) lands in second.
- Real workflows: metrics queries, log/trace correlation, hypothesis testing
- Reliability is the discriminator; raw capability is not
- Methodology and leaderboard from Grafana Labs
Why it matters: if you're picking a model for SRE automation, this is the first public scorecard that maps to your actual workflow, and "pick the consistent one, not the smartest one" is a real reframe of how teams have been choosing. Read more.
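The capability-vs-consistency distinction is worth seeing in numbers. A toy sketch in the spirit of pass^k metrics (hypothetical agents and data, not o11y-bench's actual scoring):

```python
def pass_rate(runs: list[list[bool]]) -> float:
    """Raw capability: fraction of individual attempts that succeed."""
    attempts = [ok for task in runs for ok in task]
    return sum(attempts) / len(attempts)

def consistency_rate(runs: list[list[bool]]) -> float:
    """Reliability: fraction of tasks where EVERY repeated attempt
    succeeds, i.e. tasks the agent never flubs."""
    return sum(all(task) for task in runs) / len(runs)

# Three trials per task for two hypothetical agents:
flashy = [[True, True, False], [True, False, True], [True, True, True]]
steady = [[True, True, True], [True, True, True], [False, False, False]]

pass_rate(flashy)         # ~0.78: higher raw capability
pass_rate(steady)         # ~0.67
consistency_rate(flashy)  # ~0.33
consistency_rate(steady)  # ~0.67: the one you want on call
```

The "flashy" agent wins on raw pass rate but the "steady" one wins on tasks it always completes, and for paging someone at 3 a.m., the second number is the one that matters.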
SIM ENVIRONMENT
DeepMind buys a stake in EVE Online's maker
via Unsplash
Google DeepMind decided that the best multi-agent simulation environment money can buy was already running, and it's been running for two decades inside a spaceship MMO.
Per Bloomberg, DeepMind has taken a minority stake in CCP Games, the studio behind EVE Online, to study "player-driven systems" inside one of the largest persistent multi-agent economies ever built. EVE has run live since 2003, with player coalitions, currencies, wars and corporate espionage all emerging without scripted prompts.
The notable bit isn't the dollars, it's the direction: a frontier lab is buying access to a real living simulation rather than spinning up another synthetic gridworld.
- Minority equity stake, not a research grant
- Target: player-driven economic and coalition dynamics inside EVE
- First reported by Bloomberg
Why it matters: agent research is leaning harder into massive multi-agent simulation, and downstream papers (and capabilities) will be shaped by EVE's economy and coalition dynamics rather than another bespoke benchmark. Read more.
TERM OF THE DAY
Term of the day
Agent Skills
Definition: a reusable, file-based bundle of instructions, tools and examples that an agent loads on demand to do a specific kind of task well; in effect, scoped expertise modules instead of one giant system prompt.
The term crystallized around Anthropic's "Skills" feature for Claude and has since leaked into the broader practitioner vocabulary as a generic pattern for keeping agents focused and stopping context bloat. It's contested: critics say Skills are just prompts in a folder with a marketing name, while proponents argue the file-system convention is exactly what makes them composable across teams and agents.
Seen in the wild: addyosmani/agent-skills hit GitHub trending today (+3,058 stars) as a community-curated library of "production-grade engineering skills for AI coding agents."
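For the file-based part, a minimal sketch of the convention (the folder name, steps and wording below are illustrative, following the published SKILL.md pattern of a folder with YAML frontmatter plus instructions):

```markdown
<!-- commit-messages/SKILL.md -->
---
name: commit-messages
description: Write conventional commit messages from a staged diff.
---

# Commit messages

1. Run `git diff --staged` to see what changed.
2. Use the `type(scope): summary` format, imperative mood, under 72 chars.
3. Note breaking changes in a `BREAKING CHANGE:` footer.
```

The agent only pulls the folder into context when a task matches the description line, which is the whole anti-bloat argument in one file.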
WHAT ELSE IS SHIPPING
What else is shipping
- pydantic-ai v1.91.0 - adds OpenAI image options and DeepSeek model support, plus YAML-dataset and tool-argument fixes.
- agno v2.6.5 - multimodal Gemini file search, Gmail/Calendar context providers, MongoDB scheduler, workflow-condition error handling.
- openai-agents-python v0.16.1 - a quiet patch on OpenAI's agents SDK with minor fixes only.
- crewAI 1.14.5a3 (pre-release) - status endpoint moves to /status/{kickoff_id}, breaking any client hitting the old path, plus a gitpython security bump.
- ROME - red-team rewriting of unsafe agent trajectories; synthesizes 300 deceptive OOD evals from 100 unsafe runs, exposing safety judges that overfit to surface patterns.
- SciResearcher - automated pipeline that synthesizes science tasks from academic evidence to train long-horizon tool-using research agents.
- CreativityBench - 14K affordance-based tasks targeting the "MacGyver" gap where agents can call tools but can't repurpose them.
- ServiceNow + Accenture - a Forward Deployed Engineering program aimed squarely at the enterprise pilot-to-production gap.
INTERESTING CONVERSATIONS
Interesting conversations we're following
- Vibe coding and agentic engineering are getting closer than I'd like on Hacker News - 646 points, 721 comments. A respected practitioner concedes he's merging unreviewed agent diffs, and the thread is the live debate over whether "I read every line" is still a sustainable posture.
- Show HN: Tilde.run, agent sandbox with a transactional, versioned filesystem on Hacker News - 178 points, 121 comments. Pitches every agent run as a rollback-able transaction over a unified versioned FS; commenters debating whether this is the missing primitive for letting agents touch real data.
- addyosmani/agent-skills on GitHub trending - +3,058 stars today. Community-curated "production-grade engineering skills for AI coding agents" riding the Claude Skills wave.
- Hmbown/DeepSeek-TUI on GitHub trending - +5,787 stars today. A Rust terminal coding agent for DeepSeek; the Claude-Code-style TUI pattern getting rebuilt against open weights.
- ProgramBench: Can Language Models Rebuild Programs from Scratch? on Hacker News - 87 points, 43 comments. New benchmark probing whether LLMs can reconstruct full programs from spec, directly testing the agentic-engineering ceiling people are arguing about this week.
- vercel-labs/open-agents on GitHub trending - +406 stars today. Vercel's open template for cloud agents; another data point for "agents-as-deployable-apps."
Also from TinyIdeas Media
Agentic Business
For operators
What’s shipping in agentic AI, decoded for operators. Adoptable today vs. demoware.
Agentic Builders
For engineers
Frameworks, OSS, MCP servers. Concrete releases, not press releases.
Agentic Quality
For QA teams
AI-native testing tools, evals, reliability patterns. No benchmark vibes.
Was this email forwarded to you? Sign up here.