Agentic QA
2026-05-06
Quiet beat day, but Claude Code patched itself twice before lunch and Puppeteer finally clicked the checkbox.
Mornin'. Most teams I see treat "the test failed" as gospel, then spend three sprints chasing a phantom regression that turned out to be the harness. That's why today's arXiv drop on false failures in LLM code-translation evals stuck out: a real chunk of the red CI you're staring at on AI-generated code is the test rig misjudging semantically identical output, not the model getting it wrong. One thing to try this week: before you ship the next "AI broke our tests" Slack message, sanity-check whether your assertion is actually comparing behavior or just bytes.
-Ben
In today's newsletter:
- Puppeteer locators learn checkboxes
- Claude Code's same-day double patch
- Your LLM eval is lying
- Vitest 5 beta breaks stuff
FORM PRIMITIVES
Puppeteer 24.43 finally teaches its locators about checkboxes and radios
via GitHub
For years, automating a checkbox click in Puppeteer has been the QA equivalent of parallel parking: doable, but rarely on the first try, and somebody always ends up writing a custom helper.
The new puppeteer-core 24.43.0 rolls to Chrome 148.0.7778.56 and extends the locator API to operate on checkboxes and radios directly. That closes a long-standing ergonomic gap that pushed test authors back to raw page.click calls plus ad-hoc waits, which is exactly how flakes get born.
The release also lands an allowlist implementation for locator operations and bumps Firefox support to 150.0, while the sibling browsers package (2.13.1) patches the WebUIReloadButton experiment.
- Auto-waiting locators now cover the most common form primitives end-to-end.
- Allowlist support lets you constrain what locator operations can touch.
- Firefox 150.0 is supported alongside the Chrome roll.
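To make the change concrete, a before/after sketch. page.locator(...).click() is the existing auto-waiting API; per the release notes the locator path now handles checkbox and radio inputs directly, though the exact surface may differ from what's shown, and the page and selectors here are illustrative:

    import puppeteer from 'puppeteer';

    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://example.com/signup'); // illustrative URL

    // The old way: a raw click plus a hand-rolled wait (the flake factory).
    await page.waitForSelector('#accept-terms');
    await page.click('#accept-terms');

    // The locator way: waits for the element to be visible and stable before
    // acting, and as of 24.43 the locator path covers checkbox/radio inputs.
    await page.locator('input[type="checkbox"]#accept-terms').click();

    await browser.close();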
Why it matters: One of the top reasons teams fall back to raw clicks and brittle waits just got removed from the menu. read more.
HOTFIX HUSTLE
Claude Code shipped twice in six hours to unbreak Windows VS Code
via GitHub
Anthropic spent its Wednesday morning doing the dev-tools equivalent of a fire drill: two Claude Code releases in roughly six hours, the second shipped specifically to patch what the first broke.
v2.1.129 went out at 01:40 UTC. v2.1.131 followed at 07:47 UTC, patching a createRequire polyfill bug whose hardcoded build path in the bundled SDK had silently bricked the VS Code extension on Windows. If your Windows agents stopped activating Claude Code overnight, this is your fix.
The earlier release wasn't just a stub either. It added --plugin-url for fetching plugin zips, a CLAUDE_CODE_FORCE_SYNC_OUTPUT env var for terminal output sync, gated gateway /v1/models discovery behind CLAUDE_CODE_ENABLE_GATEWAY_MODEL_DISCOVERY=1, and stopped the Mantle endpoint from dropping the x-api-key header on auth.
The bits that matter for test harnesses
- Windows CI agents running the VS Code extension are unbroken.
- CLAUDE_CODE_FORCE_SYNC_OUTPUT kills interleaved-stdout flakes when capturing Claude Code from a test runner.
- Gateway model discovery is now opt-in, so existing pipelines won't surprise-call new endpoints.
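If you shell out to the CLI from a Node-based harness, flipping that flag looks roughly like this. A minimal sketch: the env var name comes from the release notes, claude -p is the CLI's non-interactive print mode, and the prompt is illustrative:

    import { execFileSync } from 'node:child_process';

    // Run Claude Code non-interactively (-p prints the response and exits)
    // with synchronized output, so captured stdout doesn't interleave.
    const out = execFileSync('claude', ['-p', 'explain the failing assertion'], {
      env: { ...process.env, CLAUDE_CODE_FORCE_SYNC_OUTPUT: '1' },
      encoding: 'utf8',
    });
    console.log(out);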
Why it matters: Anyone running Claude Code inside a CI or test loop just had a silent Windows regression and a stdout-flake foot-gun fixed in the same morning. read more.
FALSE NEGATIVES
Your LLM code-translation eval is over-counting failures
via Unsplash
Turns out a meaningful chunk of the red on your AI code-translation dashboard isn't the model whiffing. It's your test harness crying wolf.
A new arXiv paper argues that a sizable share of "failures" in LLM-based code-translation evals are false failures: tests fail not because the translated code is wrong, but because the harness misjudges semantically equivalent output (different floats, different ordering, different exception text, same behavior).
The contribution is a methodology for separating genuine translation defects from harness-induced false negatives, which is exactly the diagnostic step most internal eval pipelines skip.
- If you run LLM-as-judge or differential testing on AI-generated code, your failure counts are probably overstated.
- Useful framework for sanitizing internal eval pipelines before you ship a "regression" report up the chain.
- Pairs naturally with mutation-style sanity checks on the harness itself.
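To put "comparing behavior, not bytes" into code, here's a sketch of what a behavior-level assertion helper can look like. The tolerance and the order-insensitivity target two of the false-failure classes above (floats, ordering); every name here is illustrative, not from the paper:

    // Behavior-level equality: tolerate float drift, and ignore array order
    // for set-like outputs where order isn't part of the contract. A
    // byte-exact comparison would flag both cases as failures even when the
    // translated code behaves identically.
    function semanticallyEqual(a: unknown, b: unknown, eps = 1e-9): boolean {
      if (typeof a === 'number' && typeof b === 'number') {
        return Math.abs(a - b) <= eps;
      }
      if (Array.isArray(a) && Array.isArray(b)) {
        if (a.length !== b.length) return false;
        const key = (x: unknown) => JSON.stringify(x);
        const as = [...a].sort((x, y) => key(x).localeCompare(key(y)));
        const bs = [...b].sort((x, y) => key(x).localeCompare(key(y)));
        return as.every((x, i) => semanticallyEqual(x, bs[i], eps));
      }
      return JSON.stringify(a) === JSON.stringify(b);
    }

    // expect(semanticallyEqual(translated(input), reference(input))).toBe(true);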
Why it matters: Before you tell the team an AI tool got worse this week, prove the harness didn't. read more.
VERSION VERTIGO
Vitest 5.0 beta is the breaking-change release you should read before pinning
via GitHub
A week after the agent reporter landed in 4.1, Vitest's v5 beta is out, and it's the kind of release where letting Renovate auto-bump on a Friday is a self-own.
The headline breakages: the attachments directory has been restructured, the sequential test option is gone in favor of concurrent, and expect is now inlined. There are good additions too: merge-reports for multi-environment runs and expanded browser-mode capability.
- sequential removal will turn perfectly-good test files into red CI overnight if you don't update.
- expect inlining changes how custom matchers and shared assertion utilities resolve.
- Merge-reports finally make multi-environment runs reportable as one artifact.
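If Renovate manages your dependencies, one way to make the v5 bump opt-in instead of automatic is a package rule like this. A minimal renovate.json sketch; the rule keys are standard Renovate options, the package list is yours to adjust:

    {
      "packageRules": [
        {
          "matchPackageNames": ["vitest"],
          "matchUpdateTypes": ["major"],
          "dependencyDashboardApproval": true
        }
      ]
    }

Majors then queue on the dependency dashboard until a human approves them, which is exactly the moment to read that changelog.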
Why it matters: Read the changelog now, on your terms, instead of at 9am Monday when the bot has already opened the PR. read more.
TOOL OF THE DAY
cargo-affect
Maps a Git diff to the minimum subset of Rust workspace tests that need to run. The Nx and Bazel "affected" pattern, finally for Cargo monorepos.
cargo install cargo-affect
Instead of cargo test --workspace on every PR, run cargo affect test --since origin/main and only the crates touched by the diff get exercised. If your Rust monorepo CI is currently a 25-minute coffee break, this is the 30-minute experiment that pays for itself by lunch.
WHAT ELSE IS SHIPPING
- Storybook v10.4.0-alpha.17 - Inlines an @storybook/docs-mdx replacement, adds a Metro AST codemod for React Native init, ships @storybook/tanstack-react, and fixes agentic onboarding to preserve sample content.
- Selenium nightly (ca9b244) - Rust chromedriver version-handling fix and a Python Edge service argument to inherit browser I/O streams. Small, but relevant if you're chasing flaky driver-version mismatches.
- HEJ-Robust - A robustness benchmark stress-testing LLM program-repair systems against perturbed inputs. Useful when evaluating auto-fix tooling claims.
- Randomized and diverse input-state generation for quantum program testing - A niche test-input generation method for quantum programs, one for the "what's now testable that wasn't" file.
- ProgramBench - Benchmark probing end-to-end program reconstruction by LLMs. A harder bar than line-level completion evals.
INTERESTING CONVERSATIONS
Interesting conversations we're following
- Show HN: open-source CLI to generate UI tests from user flows on Hacker News - Small thread (10 points, 3 comments) but the day's only fresh AI-test-generation Show HN, and a CLI-first entry into a SaaS-dominated category.
- Code coverage in CI/CD: what it really tells you and what it doesn't on Hacker News - Argues coverage % is a presence-of-execution metric, not a quality one, and pushes mutation testing and assertion density instead.
- cargo-affect: plan affected Rust workspace tests from a Git diff on Hacker News - Brings Nx-style affected-test selection to Cargo. Early signal on whether the Rust monorepo crowd actually wants this.
- Antithesis publishes its testing-techniques guide on Hacker News - Practitioner-oriented reference on deterministic-simulation testing, surfaced as a how-to-escape-flake reading list.
Also from TinyIdeas Media
- Agentic Business - For operators. What's shipping in agentic AI, decoded for operators. Adoptable today vs. demoware.
- Agentic Builders - For engineers. Frameworks, OSS, MCP servers. Concrete releases, not press releases.
- Agentic Quality - For QA teams. AI-native testing tools, evals, reliability patterns. No benchmark vibes.
Was this email forwarded to you? Sign up here.