Agentic QA
2026-05-11
Quiet day on the beat, but Grafana cut k6 v2.0.0 and your auto-bumping CI is about to find out.
Mornin'. Most teams I see treat their load-testing tool the way they treat the office plant: leave it alone, water it once a quarter, hope nobody notices. Then Grafana cuts a k6 major on a sleepy Monday and the next auto-merged Renovate PR quietly rewrites your Go module path, your CLI flags, and your executor config in one go. One thing to do this week: pin k6 to a minor, read the v2.0.0 notes, and budget an hour for the migration before your CI does it for you at 3am.
-Ben
In today's newsletter:
- Grafana drops a k6 grenade
- Bugbot's new recall dial
- Naming the agent failure mode
- Jest patches the monorepo bite
BREAKING THE BUMP
k6 v2.0.0 ships, and CI auto-upgrades walk into a wall
via GitHub
Grafana's load-testing tool just graduated to a major version, which is another way of saying every pipeline that bumps k6 on green is about to learn what "completes the deprecation cleanup" really means.
The v2.0.0 release finishes off the v1 deprecation list in one swing: the Go module path is updated, a batch of CLI commands are removed, executor behavior changes, and a few config shapes no longer parse. Individually each item is small. Stacked together, they are the kind of thing a Renovate PR cheerfully merges before anyone reads the diff.
If you treat k6 as plumbing rather than a product, this is the week to put a pin in it and schedule the migration on purpose.
- Pin to the last v1.x minor in your Dockerfile or action before the next Renovate run.
- Audit your scripts for removed CLI flags and the deprecated executor knobs.
- Re-baseline your perf thresholds after the bump. Executor changes can shift numbers even when scripts still run.
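On the re-baselining point, k6 thresholds live in the script's exported options, which makes them easy to diff before and after the bump. A minimal sketch (the target URL is k6's public demo site; the threshold numbers are example baselines, not recommendations):

```javascript
import http from 'k6/http';

// Example k6 script with explicit thresholds. After the v2 bump, run it
// against a known-good build and compare the numbers before trusting the
// old limits; executor changes can shift them even when scripts still run.
export const options = {
  vus: 10,
  duration: '1m',
  thresholds: {
    // Example baselines only; re-derive these from a post-bump run.
    http_req_duration: ['p(95)<500'], // 95th percentile under 500ms
    http_req_failed: ['rate<0.01'],   // under 1% failed requests
  },
};

export default function () {
  http.get('https://test.k6.io');
}
```

Checking the threshold diff into the same PR as the version bump keeps the "why did p(95) move" conversation attached to the change that caused it.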
Why it matters: Auto-upgrade pipelines are great until a major version turns a quiet Tuesday into a perf-regression hunt. Plan the bump, do not let it ambush you.
RECALL DIAL
Cursor's Bugbot gets effort tiers, and a real recall knob
via Cursor
Bot reviewers have spent the last year doing the AI equivalent of a polite nod on every PR. Cursor just gave Bugbot a volume slider, and it is calibrated in bugs per run.
The latest changelog adds selectable effort tiers to Bugbot: Default, High, and a Custom natural-language mode that lets admins describe what "thorough" should look like on their repo. Cursor's own numbers: Default finds about 0.7 bugs per run, of which 79% get fixed at merge time. High effort pushes that to roughly 0.95 bugs per run, at the cost of latency and tokens.
The interesting bit is not the raw recall lift. It is that QA leads finally have a knob that matches how humans actually review: light touch on routine churn, full-court press on the release branch and the auth code.
- Route risky paths (auth, payments, migrations) through High effort via CODEOWNERS or path rules.
- Keep Default on the firehose of formatting and dependency PRs to control spend.
- Use Custom mode to encode your team's actual pet peeves ("flag any test that mocks the database," "complain about untyped fetches").
Why it matters: A tunable AI reviewer is finally something a QA org can put a policy around, instead of a single bot doing one job badly across every diff.
FAILURE MODE NAMED
"Constraint decay" is the agent-eval gap QA already feels
Every QA engineer who has watched an LLM agent ace the demo and faceplant on the third real ticket now has a phrase to put on the postmortem.
A fresh arXiv paper from earlier this week introduces "constraint decay," a measurable degradation in multi-file backend code generation as functional and structural requirements accumulate. The authors show that the same models that crush loose, single-shot specs lose ground in a fairly predictable curve as you stack constraints: package layout, public API shape, error handling, persistence boundaries, and so on.
This is the gap most agent leaderboards politely ignore. SWE-bench-style evals reward "did the patch land," not "did the patch land without quietly violating four other constraints the team takes for granted."
Why this lands for testers
- Gives you a benchmark frame to argue against "but it passed the demo" with data instead of vibes.
- Suggests where to invest test coverage when agents touch your repo: structural invariants, not just unit behavior.
- Hints that adversarial constraint-stacking is the next useful eval move for in-house agent harnesses.
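To make the invariant-coverage idea concrete, here is what a constraint-stacking check in an in-house harness might look like. Everything below (the file paths, the createServer API, the "no pg outside src/db/" rule) is a hypothetical example, not taken from the paper:

```javascript
// Sketch of a constraint check for agent-generated patches: `files` maps
// path -> source text, and each rule encodes a structural invariant the
// team takes for granted. Returns human-readable violations.
function checkConstraints(files) {
  const violations = [];

  // Persistence boundary: database access stays behind src/db/.
  for (const [path, src] of Object.entries(files)) {
    if (!path.startsWith('src/db/') && /require\(['"]pg['"]\)/.test(src)) {
      violations.push(`${path}: direct database import outside src/db/`);
    }
  }

  // Public API shape: the entry point must keep exporting createServer.
  const entry = files['src/index.js'] || '';
  if (!/module\.exports\s*=\s*\{[^}]*createServer/.test(entry)) {
    violations.push('src/index.js: createServer missing from public API');
  }

  return violations;
}
```

Run against every agent patch, a check like this measures exactly the thing "did the patch land" evals miss: whether the change quietly violated a constraint nobody restated in the ticket.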
Why it matters: Naming a failure mode is half the battle. The other half is testing for it before the agent ships to prod.
MONOREPO MEND
Jest 30.4.2 patches the CJS/ESM bug that bites real monorepos
via GitHub
If your monorepo has ever stared at a "named import undefined" error from a perfectly normal CJS module, the Jest team just sent you a Saturday-night present.
The v30.4.2 patch follows last week's 30.4.0 and 30.4.1 line and fixes named imports from CJS modules whose module.exports is a function with own-property exports. That is exactly the shape you hit when ESM test code reaches into a CJS dep that exports a callable plus a handful of helpers.
Not glamorous. But this is the kind of fix you want landed before the next CI run, especially if you have been carrying a workaround alias or a custom resolver to paper over it.
- Bump Jest, then delete any local "fix the named import" wrappers and re-run.
- Re-check your moduleNameMapper entries that were doing CJS interop duty.
- If you skipped 30.4.0/30.4.1 waiting for the dust to settle, 30.4.2 is the one to land.
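For reference, the export shape in question looks like this. The module and property names are invented for illustration:

```javascript
// A CJS module whose module.exports is a function that carries its own
// named exports: a callable plus a handful of helpers.
const createClient = function createClient(url) {
  return { url };
};
createClient.version = '1.2.3';
createClient.parse = (raw) => JSON.parse(raw);

module.exports = createClient;

// ESM test code consumes this as:
//   import createClient, { parse, version } from 'fake-client';
// Before 30.4.2, Jest's interop could leave `parse` and `version`
// undefined, because own properties of a function-valued module.exports
// were not surfaced as named exports.
```

If that import pattern appears anywhere in your test suite, this patch is the difference between real named bindings and a morning of "undefined is not a function."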
Why it matters: Quiet interop bugs eat hours of dev time and never look like bugs in dashboards. Patches like this pay back instantly.
THE WEEK AHEAD
The week ahead
- TUE · MAY 12 Antithesis Breakpoint 2026, the deterministic-simulation testing conference, kicks off. Their "Testing Techniques" doc that lit up Lobsters this week is the warm-up reading.
If you ship anything stateful (queues, ledgers, schedulers), Breakpoint is the one to skim talk titles from this week. Deterministic simulation is the closest thing the industry has to a real answer for "we cannot reproduce it in CI," and the techniques translate even if you never adopt the tool itself.
WHAT ELSE IS SHIPPING
What else is shipping
- Puppeteer v24.43.1 - Same-day patch on v24.43.0 bumping puppeteer-core to 24.43.1 and @puppeteer/browsers to 2.13.2, with a BiDi URL-restriction fix.
- Vitest v4.1.6 - Bug fixes for browser-mode screenshot resolution paths and concurrent test execution.
- Claude Code v2.1.138 - Internal-fixes point release keeping the agentic-coding loop stable for CI and test-generation use.
- Claude Code v2.1.136 - About 65 fixes covering plugin Stop and UserPromptSubmit hooks, MCP schema validation, and @-file-picker correctness. Matters if you wire Claude Code into CI gates.
- Mythical Man Month (Bliki) - Fowler revisits Brooks framed around modern AI-augmented teams. The line engineering leads will quote back against "just add agents" pressure.
- Bas Dijkstra restarts his test-automation newsletter - Meta announcement, but a useful follow for practitioner commentary outside vendor marketing.
INTERESTING CONVERSATIONS
Interesting conversations we're following
- An AI coding agent, used to write code, needs to reduce your maintenance costs on Hacker News - 281 points in 16 hours on James Shore reframing AI-coding ROI around maintenance-to-feature ratio. Commenters note AI code "optimizes for compiles and passes the happy path rather than the boring stuff."
- Testing an AI Agent Harness over a Few Weekends on Hacker News - Dev built a shell-loop harness driving Cursor CLI plus Kimi K2.5 through plan, implement, test, and refine across 47 requirements. About 1B tokens, under $100. Concrete cost data for self-testing agent loops.
- What is random generation? on Lobsters (testing tag) - Fresh primer on random-data generation as a testing methodology. Useful for the property-based-testing crowd.
- LLMs Corrupt Your Documents When You Delegate (DELEGATE-52) on Hacker News - 472 points this week on a 52-domain benchmark showing frontier models corrupt about 25% of document content over long delegation workflows. Paper at arxiv.org/abs/2604.15597.
- Show HN: adamsreview, multi-agent PR reviews for Claude Code on Hacker News - 67 points on an OSS multi-agent PR-review harness on top of Claude Code. Relevant if you are building QA-via-AI-reviewer workflows.
Also from TinyIdeas Media
- Agentic Business - For operators - What’s shipping in agentic AI, decoded for operators. Adoptable today vs. demoware.
- Agentic Builders - For engineers - Frameworks, OSS, MCP servers. Concrete releases, not press releases.
- Agentic Quality - For QA teams - AI-native testing tools, evals, reliability patterns. No benchmark vibes.
Was this email forwarded to you? Sign up here.