Agentic QA
Puppeteer learns checkboxes, Claude Code hot-fixes twice in 6 hours, and your LLM eval is lying.


2026-05-06

Issue #3 · 12 min read · By Ben

Quiet day on the beat, but Claude Code patched itself twice before lunch and Puppeteer finally clicked the checkbox.

Mornin'. Most teams I see treat "the test failed" as gospel, then spend three sprints chasing a phantom regression that turns out to be the harness. That's why today's arXiv drop on false failures in LLM code-translation evals stuck out: a real chunk of the red CI you're staring at on AI-generated code is the test rig misjudging semantically equivalent output, not the model getting it wrong. One thing to try this week: before you ship the next "AI broke our tests" Slack message, sanity-check whether your assertion is actually comparing behavior or just bytes.
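That behavior-versus-bytes check can be as small as this Node sketch. The captured outputs, the sort-before-compare rule, and the 1e-9 tolerance are made-up stand-ins for whatever your harness actually compares:

```javascript
// "Bytes" vs "behavior" when checking AI-generated output.
// Hypothetical harness captures; substitute your own comparator rules.
const expected = '[0.1, 0.2, 0.30000000000000004]';
const actual   = '[0.3000000000000000444, 0.2, 0.1]'; // same values, different order/precision

// Byte comparison: flags a "failure" even though behavior is identical.
const bytesEqual = expected === actual;

// Behavior comparison: parse, normalize ordering, compare with a float tolerance.
const parse = (s) => JSON.parse(s).slice().sort((a, b) => a - b);
const close = (a, b) => Math.abs(a - b) < 1e-9;
const behaviorEqual = (() => {
  const e = parse(expected), a = parse(actual);
  return e.length === a.length && e.every((v, i) => close(v, a[i]));
})();

console.log({ bytesEqual, behaviorEqual }); // { bytesEqual: false, behaviorEqual: true }
```

If the byte check fails but the behavior check passes, the Slack message writes itself differently.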

-Ben

In today's newsletter:

  • Puppeteer locators learn checkboxes
  • Claude Code's same-day double patch
  • Your LLM eval is lying
  • Vitest 5 beta breaks stuff

FORM PRIMITIVES

Puppeteer 24.43 finally teaches its locators about checkboxes and radios


via GitHub

For years, automating a checkbox click in Puppeteer has been the QA equivalent of parallel parking: doable, but rarely on the first try, and somebody always ends up writing a custom helper.

The new puppeteer-core 24.43.0 rolls to Chrome 148.0.7778.56 and extends the locator API to operate on checkboxes and radios directly. That closes a long-standing ergonomic gap that pushed test authors back to raw page.click calls plus ad-hoc waits, which is exactly how flakes get born.

The release also lands an allowlist implementation for locator ops and Firefox 150.0 updates, while the sibling browsers package 2.13.1 patches the WebUIReloadButton experiment.

  • Auto-waiting locators now cover the most common form primitives end-to-end.
  • Allowlist support lets you constrain what locator operations can touch.
  • Firefox 150.0 is supported alongside the Chrome roll.

Why it matters: One of the top reasons teams fall back to raw clicks and brittle waits just got removed from the menu. read more.
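The gap being closed looks like this in practice, written as helpers over a Puppeteer `page` (no browser launched here). `locator().click()` is the existing auto-waiting API; the exact name and signature of the new 24.43 checkbox/radio operation is an assumption to confirm in the release notes before relying on it:

```javascript
// Old flake-prone pattern vs. the auto-waiting locator API.
// Both take an already-launched Puppeteer `page`; '#accept-terms' is illustrative.
async function checkTermsOldStyle(page) {
  // Manual wait plus raw click: races with re-renders and detached nodes,
  // the classic source of "works locally, flakes in CI".
  await page.waitForSelector('#accept-terms');
  await page.click('#accept-terms');
}

async function checkTermsWithLocator(page) {
  // Auto-waiting locator: retries until the element is visible, stable,
  // and enabled before interacting, so no hand-rolled waits.
  await page.locator('#accept-terms').click();
}
```

The win in 24.43 is that form state (checked/unchecked, radio selection) now lives behind the same retrying locator machinery instead of the raw-click path on the left.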


HOTFIX HUSTLE

Claude Code shipped twice in six hours to unbreak Windows VS Code


via GitHub

Anthropic spent its Wednesday morning doing the dev-tools equivalent of a fire drill: two Claude Code releases in roughly six hours, the second a hotfix for a Windows-breaking regression.

v2.1.129 went out at 01:40 UTC. v2.1.131 followed at 07:47 UTC, patching a createRequire polyfill bug whose hardcoded build path in the bundled SDK had silently bricked the VS Code extension on Windows. If your Windows agents stopped activating Claude Code overnight, this is your fix.

The earlier release wasn't just a stub either. It added --plugin-url for fetching plugin zips, a CLAUDE_CODE_FORCE_SYNC_OUTPUT env var for terminal output sync, gated gateway /v1/models discovery behind CLAUDE_CODE_ENABLE_GATEWAY_MODEL_DISCOVERY=1, and stopped the Mantle endpoint from dropping the x-api-key header on auth.

The bits that matter for test harnesses

  • Windows CI agents running the VS Code extension work again.
  • CLAUDE_CODE_FORCE_SYNC_OUTPUT kills interleaved-stdout flakes when capturing Claude Code from a test runner.
  • Gateway model discovery is now opt-in, so existing pipelines won't surprise-call new endpoints.

Why it matters: Anyone running Claude Code inside a CI or test loop just had a silent Windows regression and a stdout-flake foot-gun fixed in the same morning. read more.
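For a Node-based harness, the two relevant knobs are environment variables. A sketch, where the variable names come from the release notes above but the helper and the commented-out invocation are illustrative, not part of the CLI:

```javascript
// Build the environment for invoking Claude Code from a test runner.
// Env var names per the 2.1.129/2.1.131 release notes; helper is hypothetical.
const claudeTestEnv = (base = process.env) => ({
  ...base,
  // Force synchronous terminal output so captured stdout isn't interleaved.
  CLAUDE_CODE_FORCE_SYNC_OUTPUT: '1',
  // Gateway model discovery is opt-in; leave it unset/off unless you need it.
  CLAUDE_CODE_ENABLE_GATEWAY_MODEL_DISCOVERY: '0',
});

// Usage with a child process (not executed here):
// const { spawnSync } = require('node:child_process');
// spawnSync('claude', ['-p', 'run the flaky suite'], { env: claudeTestEnv() });

console.log(claudeTestEnv({}).CLAUDE_CODE_FORCE_SYNC_OUTPUT); // "1"
```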


FALSE NEGATIVES

Your LLM code-translation eval is over-counting failures


via Unsplash

Turns out a meaningful chunk of the red on your AI code-translation dashboard isn't the model whiffing. It's your test harness crying wolf.

A new arXiv paper argues that a sizable share of "failures" in LLM-based code-translation evals are false failures: tests that fail not because the translated code is wrong, but because the harness misjudges semantically equivalent output (different float formatting, different ordering, different exception text, same behavior).

The contribution is a methodology for separating genuine translation defects from harness-induced false negatives, which is exactly the diagnostic step most internal eval pipelines skip.

  • If you run LLM-as-judge or differential testing on AI-generated code, your failure counts are probably overstated.
  • Useful framework for sanitizing internal eval pipelines before you ship a "regression" report up the chain.
  • Pairs naturally with mutation-style sanity checks on the harness itself.

Why it matters: Before you tell the team an AI tool got worse this week, prove the harness didn't. read more.
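One cheap approximation of that proof: run a reference implementation you already know is correct through your harness and see if it passes. A toy sketch, where the naive string-equality comparator stands in for a real harness and all names are illustrative:

```javascript
// Detect harness-induced false failures with a known-good oracle.
// If a known-correct reference "fails", the harness is the problem, not the model.
const naiveHarness = (output, expected) => output === expected;

// Reference output that is semantically correct but formatted differently.
const expected = 'result: 0.3';
const knownGoodReference = 'result: 0.30000000000000004'; // same value, float-printed

const harnessIsTrustworthy = naiveHarness(knownGoodReference, expected);

if (!harnessIsTrustworthy) {
  console.log('False-failure risk: harness rejects a known-correct reference.');
}
```

Run this kind of probe before every "the model regressed" report; when the oracle fails, the failure counts on the dashboard are upper bounds, not measurements.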


VERSION VERTIGO

Vitest 5.0 beta is the breaking-change release you should read before pinning


via GitHub

A week after the agent reporter landed in 4.1, Vitest's v5 beta is out, and it's the kind of release where letting Renovate auto-bump on a Friday is a self-own.

The headline breakages: the attachments directory has been restructured, the sequential test option is gone in favor of concurrent, and expect is now inlined. There are good additions too: merge-reports for multi-environment runs and expanded browser-mode capability.

  • Removing sequential will turn perfectly good test files into red CI overnight if you don't update.
  • expect inlining changes how custom matchers and shared assertion utilities resolve.
  • Merge-reports finally make multi-environment runs reportable as one artifact.

Why it matters: Read the changelog now, on your terms, instead of at 9am Monday when the bot has already opened the PR. read more.
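Until you have read it, a conservative pin is one line of package.json. A sketch using an npm tilde range, which accepts patches but not 4.2 or 5.0; whether Renovate still opens major-bump PRs depends on your Renovate config:

```json
{
  "devDependencies": {
    "vitest": "~4.1.0"
  }
}
```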


TOOL OF THE DAY

Tool of the day

cargo-affect

Maps a Git diff to the minimum subset of Rust workspace tests that need to run. The Nx and Bazel "affected" pattern, finally for Cargo monorepos.

cargo install cargo-affect

Instead of cargo test --workspace on every PR, run cargo affect test --since origin/main and only the crates touched by the diff get exercised. If your Rust monorepo CI is currently a 25-minute coffee break, this is the 30-minute experiment that pays for itself by lunch.

repo / docs


WHAT ELSE IS SHIPPING

What else is shipping

  • Storybook v10.4.0-alpha.17 - Inlines an @storybook/docs-mdx replacement, adds a Metro AST codemod for React Native init, ships @storybook/tanstack-react, and fixes agentic onboarding to preserve sample content.
  • Selenium nightly (ca9b244) - Rust chromedriver version-handling fix and a Python Edge service argument to inherit browser I/O streams. Small, but relevant if you're chasing flaky driver-version mismatches.
  • HEJ-Robust - A robustness benchmark stress-testing LLM program-repair systems against perturbed inputs. Useful when evaluating auto-fix tooling claims.
  • Randomized and diverse input-state generation for quantum program testing - A niche test-input generation method for quantum programs, one for the "what's now testable that wasn't" file.
  • ProgramBench - Benchmark probing end-to-end program reconstruction by LLMs. A harder bar than line-level completion evals.

INTERESTING CONVERSATIONS

Interesting conversations we're following

Also from TinyIdeas Media

  • Agentic Business (for operators) - What's shipping in agentic AI, decoded for operators. Adoptable today vs. demoware.
  • Agentic Builders (for engineers) - Frameworks, OSS, MCP servers. Concrete releases, not press releases.
  • Agentic Quality (for QA teams) - AI-native testing tools, evals, reliability patterns. No benchmark vibes.

Was this email forwarded to you? Sign up here.
