Agentic Business
2026-05-07
Code w/ Claude 2026 dominated the day: a SpaceX-backed capacity unlock, a Managed Agents SDK landing, and serious new public benchmarks from Harvey and Grafana.
Mornin'. Anthropic doubled Claude Code's rate limits this morning and pinned the credit on 220,000 fresh NVIDIA GPUs sitting in a SpaceX-built facility called Colossus 1. If your agent loop has spent the last six months pinging the 5-hour cap by lunchtime, today is the first day it gets to keep working. The compute arms race is now visible from orbit.
-Ben
In today's newsletter:
- SpaceX GPUs juice Claude caps
- Managed Agents land in Python SDK
- Harvey's all-pass legal benchmark
- Grafana grades observability agents
- DeepMind buys into EVE Online
RATE LIMIT REPRIEVE
Anthropic doubles Claude Code limits, credits a SpaceX compute deal
via Anthropic
Anthropic just yanked the leash off Claude Code's rate limits, and the slack came pre-packaged with a rocket company's logo on the box.
At the Code w/ Claude 2026 keynote, Anthropic announced 2x the per-window quota on Pro, Max, Team and Enterprise tiers and killed peak-hour throttling for paid Pro and Max users. Opus API rate limits got a meaningful bump in the same swing.
The headroom comes from a freshly signed SpaceX Colossus 1 deal that puts 300+ MW of capacity, roughly 220,000 NVIDIA GPUs, online inside the next month.
- 2x per-window quota across Pro, Max, Team and Enterprise
- Peak-hour reductions removed for Pro and Max
- 300+ MW / 220,000+ GPUs spinning up within 30 days, per Anthropic's announcement
Why it matters: the single loudest complaint about Claude Code (hitting the 5-hour cap mid-session and choking during US business hours) just got materially addressed for paying users. Read more.
SDK MILESTONE
anthropic-sdk-python crosses v0.100 with Managed Agents support
via GitHub
Anthropic's Python SDK quietly hit a round-number version, and the headline isn't the digit; it's the multi-agent plumbing it now exposes.
v0.100.0 adds first-class Managed Agents APIs for orchestrating multiple agents and capturing their outcomes, alongside webhook configuration changes and vault validation. It's the canonical SDK catching up to the hosted product surface Anthropic has been previewing for months.
Translation for builders: you can now spawn, observe and react to multi-agent runs without writing your own orchestrator shim or scraping a private API.
- Managed Agents APIs for multi-agent runs and their outcomes are now first-class in the SDK
- Webhook config adjusted in the same release for event-driven agent flows
- Full release notes on GitHub Releases
Why it matters: teams building on Anthropic's hosted agent stack get supported multi-agent orchestration in the box, which makes "did anyone here roll their own runner" a much rarer standup question. Read more.
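To make "orchestrator shim" concrete, here is a minimal sketch of the kind of hand-rolled runner teams have been writing themselves, the pattern the SDK's Managed Agents surface is meant to replace. All names and structures here are illustrative, not the SDK's actual API:

```python
# A hand-rolled multi-agent runner: spawn agents, capture outcomes,
# fire a webhook-style callback per result. Illustrative only.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable

@dataclass
class Outcome:
    agent: str
    status: str   # "ok" or "error"
    result: str

def run_agents(tasks: dict[str, Callable[[], str]],
               on_outcome: Callable[[Outcome], None]) -> list[Outcome]:
    """Run each agent concurrently, record its outcome, notify a callback."""
    outcomes = []
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn) for name, fn in tasks.items()}
        for name, fut in futures.items():
            try:
                outcomes.append(Outcome(name, "ok", fut.result()))
            except Exception as exc:
                outcomes.append(Outcome(name, "error", str(exc)))
    for outcome in outcomes:
        on_outcome(outcome)   # stand-in for a webhook delivery
    return outcomes

seen: list[Outcome] = []
results = run_agents(
    {"researcher": lambda: "3 sources found",
     "writer": lambda: "draft complete"},
    on_outcome=seen.append,
)
```

With the v0.100 release, this bookkeeping (spawning, outcome capture, event delivery) moves into the supported SDK surface instead of living in everyone's private utils module.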
LEGAL BENCHMARK
Harvey ships LAB, a brutal all-pass benchmark for legal agents
via Harvey
Harvey just dropped a public bar exam for legal agents, and the grading rubric is the kind of thing that makes vendor demo decks sweat.
The Legal Agent Benchmark (LAB) covers 1,250+ agent tasks across 24 practice areas, scored against 75,000+ expert-written rubric criteria. The twist: it uses an "all-pass" model, where a task only counts as complete if every single criterion is satisfied. One M&A example carries 57 criteria across 9 issues.
That's a hard departure from short-horizon contract Q&A benchmarks, which have started to look like saturated flashcards. LAB asks whether an agent can produce the kind of long-form work-product a partner would actually delegate.
- 1,250+ tasks, 24 practice areas, 75,000+ rubric criteria
- All-pass scoring: miss one criterion, fail the task
- Methodology and sample tasks on the Harvey blog
Why it matters: there's now a public, externally legible bar for any agent claiming to do "real legal work," and the all-pass rule is harsh enough to expose how brittle today's pipelines are over long, document-heavy assignments. Read more.
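All-pass scoring is easy to state in code, which is part of why it bites so hard. A generic sketch of the scoring rule (not Harvey's implementation):

```python
def all_pass_score(tasks: list[list[bool]]) -> float:
    """All-pass scoring: a task counts as complete only if every
    rubric criterion on it is satisfied. Each task is a list of
    per-criterion pass/fail booleans."""
    passed = sum(1 for criteria in tasks if all(criteria))
    return passed / len(tasks)

# A task with 57 criteria fails outright if even one is missed:
tasks = [
    [True] * 57,            # perfect M&A task -> pass
    [True] * 56 + [False],  # 56 of 57 criteria -> still a fail
]
all_pass_score(tasks)  # 0.5
```

Under partial-credit scoring that second task would score 98%; under all-pass it scores zero, which is exactly the gap between "mostly right" and "delegable."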
SRE SHOWDOWN
Grafana's o11y-bench scores agents on real observability work
via Grafana Labs
Grafana stood up an open benchmark for SRE agents at GrafanaCON 2026, and the early scoreboard delivers a quietly inconvenient verdict: reliability beats raw smarts.
o11y-bench grades agents on the actual on-call workflow: querying metrics, correlating logs and traces, hypothesis-testing against a live system. It's vendor-neutral, public, and aimed straight at one of the few "agents in production" use cases people are willing to put on a status page.
The early leaderboard finds that consistency, not capability, separates the top tier. Claude Opus 4.7 (high reasoning) lands in second.
- Real workflows: metrics queries, log/trace correlation, hypothesis testing
- Reliability is the discriminator; raw capability is not
- Methodology and leaderboard from Grafana Labs
Why it matters: if you're picking a model for SRE automation, this is the first public scorecard that maps to your actual workflow, and "pick the consistent one, not the smartest one" is a real reframe of how teams have been choosing. Read more.
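The capability-vs-consistency distinction is worth seeing in numbers. A toy sketch in the spirit of pass^k metrics (hypothetical agents and data, not o11y-bench's actual scoring):

```python
def pass_rate(runs: list[list[bool]]) -> float:
    """Raw capability: fraction of individual attempts that succeed."""
    attempts = [ok for task in runs for ok in task]
    return sum(attempts) / len(attempts)

def consistency_rate(runs: list[list[bool]]) -> float:
    """Reliability: fraction of tasks where EVERY repeated attempt
    succeeds, i.e. tasks the agent never flubs."""
    return sum(all(task) for task in runs) / len(runs)

# Three trials per task for two hypothetical agents:
flashy = [[True, True, False], [True, False, True], [True, True, True]]
steady = [[True, True, True], [True, True, True], [False, False, False]]

pass_rate(flashy)         # ~0.78: higher raw capability
pass_rate(steady)         # ~0.67
consistency_rate(flashy)  # ~0.33
consistency_rate(steady)  # ~0.67: the one you want on call
```

The "flashy" agent wins on raw pass rate but the "steady" one wins on tasks it always completes, and for paging someone at 3 a.m., the second number is the one that matters.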
SIM ENVIRONMENT
DeepMind buys a stake in EVE Online's maker
via Unsplash
Google DeepMind decided that the best multi-agent simulation environment money can buy was already running, and it's been running for two decades inside a spaceship MMO.
Per Bloomberg, DeepMind has taken a minority stake in CCP Games, the studio behind EVE Online, to study "player-driven systems" inside one of the largest persistent multi-agent economies ever built. EVE has run live since 2003, with player coalitions, currencies, wars and corporate espionage all emerging without scripted prompts.
The notable bit isn't the dollars, it's the direction: a frontier lab is buying access to a real living simulation rather than spinning up another synthetic gridworld.
- Minority equity stake, not a research grant
- Target: player-driven economic and coalition dynamics inside EVE
- First reported by Bloomberg
Why it matters: agent research is leaning harder into massive multi-agent simulation, and downstream papers (and capabilities) will be shaped by EVE's economy and coalition dynamics rather than another bespoke benchmark. Read more.
TERM OF THE DAY
Term of the day
Agent Skills
Definition: a reusable, file-based bundle of instructions, tools and examples that an agent loads on demand to do a specific kind of task well; in effect, scoped expertise modules instead of one giant system prompt.
The term crystallized around Anthropic's "Skills" feature for Claude and has since leaked into the broader practitioner vocabulary as a generic pattern for keeping agents focused and stopping context bloat. It's contested: critics say Skills are just prompts in a folder with a marketing name, while proponents argue the file-system convention is exactly what makes them composable across teams and agents.
Seen in the wild: addyosmani/agent-skills hit GitHub trending today (+3,058 stars) as a community-curated library of "production-grade engineering skills for AI coding agents."
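For the file-based part, a minimal sketch of the convention (the folder name, steps and wording below are illustrative, following the published SKILL.md pattern of a folder with YAML frontmatter plus instructions):

```markdown
<!-- commit-messages/SKILL.md -->
---
name: commit-messages
description: Write conventional commit messages from a staged diff.
---

# Commit messages

1. Run `git diff --staged` to see what changed.
2. Use the `type(scope): summary` format, imperative mood, under 72 chars.
3. Note breaking changes in a `BREAKING CHANGE:` footer.
```

The agent only pulls the folder into context when a task matches the description line, which is the whole anti-bloat argument in one file.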
WHAT ELSE IS SHIPPING
What else is shipping
- pydantic-ai v1.91.0 - adds OpenAI image options and DeepSeek model support, plus YAML-dataset and tool-argument fixes.
- agno v2.6.5 - multimodal Gemini file search, Gmail/Calendar context providers, MongoDB scheduler, workflow-condition error handling.
- openai-agents-python v0.16.1 - a quiet patch on OpenAI's agents SDK with minor fixes only.
- crewAI 1.14.5a3 (pre-release) - status endpoint moves to /status/{kickoff_id}, breaking any client hitting the old path, plus a gitpython security bump.
- ROME - red-team rewriting of unsafe agent trajectories; synthesizes 300 deceptive OOD evals from 100 unsafe runs, exposing safety judges that overfit to surface patterns.
- SciResearcher - automated pipeline that synthesizes science tasks from academic evidence to train long-horizon tool-using research agents.
- CreativityBench - 14K affordance-based tasks targeting the "MacGyver" gap where agents can call tools but can't repurpose them.
- ServiceNow + Accenture - a Forward Deployed Engineering program aimed squarely at the enterprise pilot-to-production gap.
INTERESTING CONVERSATIONS
Interesting conversations we're following
- Vibe coding and agentic engineering are getting closer than I'd like on Hacker News - 646 points, 721 comments. A respected practitioner concedes he's merging unreviewed agent diffs, and the thread is the live debate over whether "I read every line" is still a sustainable posture.
- Show HN: Tilde.run, agent sandbox with a transactional, versioned filesystem on Hacker News - 178 points, 121 comments. Pitches every agent run as a rollback-able transaction over a unified versioned FS; commenters debating whether this is the missing primitive for letting agents touch real data.
- addyosmani/agent-skills on GitHub trending - +3,058 stars today. Community-curated "production-grade engineering skills for AI coding agents" riding the Claude Skills wave.
- Hmbown/DeepSeek-TUI on GitHub trending - +5,787 stars today. A Rust terminal coding agent for DeepSeek; the Claude-Code-style TUI pattern getting rebuilt against open weights.
- ProgramBench: Can Language Models Rebuild Programs from Scratch? on Hacker News - 87 points, 43 comments. New benchmark probing whether LLMs can reconstruct full programs from spec, directly testing the agentic-engineering ceiling people are arguing about this week.
- vercel-labs/open-agents on GitHub trending - +406 stars today. Vercel's open template for cloud agents; another data point for "agents-as-deployable-apps."
Also from TinyIdeas Media
Agentic Business
For operators
What’s shipping in agentic AI, decoded for operators. Adoptable today vs. demoware.
Agentic Builders
For engineers
Frameworks, OSS, MCP servers. Concrete releases, not press releases.
Agentic Quality
For QA teams
AI-native testing tools, evals, reliability patterns. No benchmark vibes.
Was this email forwarded to you? Sign up here.