Code & Prompts - Week 4
Benchmark wars, agent skill ecosystems, harness engineering, and breaking through hostile web environments
---
🎯 SWE-rebench Exposes AI Benchmark Gaming
A new benchmark called SWE-rebench reveals that many Chinese AI labs have been optimizing their models specifically for popular benchmarks rather than building genuinely capable systems. By using fresh GitHub tasks that haven't appeared in any training data, SWE-rebench exposes the real gap between frontier models and benchmark-tuned ones.
MiniMax M2.5 scored 80.2% on original SWE-bench (vs Opus 4.6's 80.8%), but on fresh problems? MiniMax dropped to 39.6% while Opus hit 51.7%, a 12-point gap. Claude Code with Opus 4.6 leads at 52.9%, followed by Opus 4.6 alone (51.7%), GPT-5.2 variants (~51%), and Sonnet 4.5 (47.1%). The entire top 10 is Anthropic, OpenAI, and Google.
Chinese models collapse on unseen problems: Kimi K2 (43.8%), GLM-5 (42.1%), Qwen3-Coder-Next (40.0%), MiniMax M2.5 (39.6%), Kimi K2.5 (37.9%). The thread argues this is classic Chinese tech strategy: copy architecture, overfit on public tests, but lack the compute, talent, and research culture to build truly frontier models.
→ Read the thread
My take: Every LLM vendor presents data with benchmarks that are far from standardized, often rigged to show their model is best. The reality is that the best benchmark is testing how models perform for your specific use case. But the idea of creating benchmarks that can't be gamed is really interesting and valuable.
---
🛠️ Skills.sh: The Open Agent Skills Ecosystem
A centralized directory for discovering and installing reusable capabilities for AI coding agents. Skills are procedural knowledge packages that extend what agents can do; install them with a single command: `npx skills add <owner/repo>`.
Works with 20+ agents including Claude Code, Codex, Cursor, Windsurf, Cline, Goose, VS Code, and GitHub Copilot. The leaderboard shows 59K+ total installs tracked, with top skills like find-skills (231K), vercel-react-best-practices (134K), web-design-guidelines (101K), and remotion-best-practices (91K).
Categories span coding best practices (React, Next.js, Vue, Flutter) to specialized tools (PDF/PPTX/DOCX handling, SEO audit, marketing, systematic debugging, test-driven development). Skills are GitHub-hosted, making them transparent, forkable, and community-driven.
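Under the open Agent Skills format, a skill is essentially a folder containing a SKILL.md whose frontmatter tells the agent when to load it. A minimal sketch (the skill name and body here are invented for illustration, not an actual entry from the directory):

```markdown
---
name: commit-conventions
description: Use when writing commit messages in this repository
---

# Commit conventions

- Use the imperative mood ("Add", not "Added").
- Keep the subject line under 72 characters.
- Reference the issue number in the body when one exists.
```

Because that's just a markdown file in a GitHub repo, anyone can read, fork, and adapt it, which is what makes the ecosystem community-driven.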
→ Browse skills
My take: Skills have become the fundamental way to "teach" our agents specific tasks. They're part of what we need to do as developers in this new world: provide the tools, context, and loops that let our agents work better and for longer.
---
🔧 Sentry Opens Their Agent Skills as Claude Code Plugins
Sentry has open-sourced their internal agent skills as Claude Code plugins, following the Agent Skills open format. Available skills include code-review (guidelines and checklist), commit (message conventions), create-pr, find-bugs, iterate-pr (until CI passes), security-review (OWASP), claude-settings-audit, brand-guidelines, doc-coauthoring, code-simplifier, skill-creator, and skill-scanner.
The repository follows a marketplace structure where plugins can be installed independently. Sentry practices "vendoring": copying and adapting external skills into their repo with proper attribution, which keeps behavior consistent, enables customization, and improves reliability.
Installation via Claude Code plugin (`claude plugin marketplace add getsentry/skills`) or skills.sh (`npx skills add getsentry/skills`). Works across Claude Code, Cursor, Cline, GitHub Copilot, and other compatible agents.
→ GitHub repo
My take: Not every skill here is ready to adopt as-is, but they're excellent templates to base your own skills on, adapted to your specific needs, coding styles, and team conventions.
---
🎯 Harness Engineering: Leveraging Codex in an Agent-First World
OpenAI ran a five-month experiment: building and shipping an internal product with zero manually written lines of code. Every line (application logic, tests, CI config, documentation, observability, tooling) was written by Codex. They estimate it took a tenth of the time of traditional development.
Three engineers (now seven) averaged 3.5 PRs per day, producing roughly 1 million lines of code across 1,500 merged pull requests. The product has daily internal users and external alpha testers.
Key learnings: engineers no longer write code; they design environments, specify intent, and build feedback loops. "Harness engineering" means breaking down goals, prompting agents, and asking "what capability is missing?" when something fails. The repository serves as the system of record, with AGENTS.md as a table of contents and a structured docs/ directory. Architecture is mechanically enforced through strict layers and custom lints. AI slop gets "garbage collected" via golden principles and background Codex tasks that scan the codebase daily.
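The "mechanically enforced architecture" idea is easy to sketch as a custom lint: a script that parses each module's imports and fails CI when a lower layer reaches into a higher one. This is a toy illustration, not OpenAI's actual tooling; the layer names and ordering are invented:

```python
import ast

# Lower layers must not import from higher layers.
# Layer names and their ordering are illustrative only.
LAYER_ORDER = ["domain", "services", "api"]  # domain is lowest


def layer_of(module: str):
    """Return the layer index of a dotted module path, or None if unlayered."""
    top = module.split(".")[0]
    return LAYER_ORDER.index(top) if top in LAYER_ORDER else None


def violations(source: str, source_layer: str):
    """Find imports in `source` that reach *upward* from `source_layer`."""
    bad = []
    own = LAYER_ORDER.index(source_layer)
    for node in ast.walk(ast.parse(source)):
        names = []
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        for name in names:
            target = layer_of(name)
            if target is not None and target > own:
                bad.append(name)
    return bad


# Example: domain code importing from the api layer is a violation.
print(violations("from api.routes import router\nimport os", "domain"))
# → ['api.routes']
```

Run over every file in the repo on each PR, a check like this gives the agent an immediate, unambiguous failure signal, which is exactly the kind of feedback loop the article describes.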
Given a single prompt, Codex can now: validate codebase → reproduce bug → record video demo → implement fix → validate fix → record resolution video → open PR → respond to feedback → remediate failures → merge.
→ Read the article
My take: This connects directly with skills and the concept that programmers must now configure agents for success; that's the new job where writing code used to be. I love the idea of harness engineering.
---
⚙️ 12 Ways to Customize Claude Code
Boris Cherny (Claude Code team) shares how engineers customize their setups. Core insight: customizability is why developers fall in love with the product: great defaults plus deep configurability.
The 12 customizations:
1. Configure your terminal (theme, notifications, shift+enter, vim mode)
2. Adjust effort level (/model for low/medium/high)
3. Install plugins, MCPs, and skills (/plugin)
4. Create custom agents (drop .md files in .claude/agents/)
5. Pre-approve permissions (/permissions with wildcards)
6. Enable sandboxing (/sandbox for file/network isolation)
7. Add a status line (model, directory, context, cost)
8. Customize keybindings (/keybindings)
9. Set up hooks (auto-route permissions, nudge continuations, logging)
10. Customize spinner verbs
11. Use output styles ("explanatory" for learning, "learning" for coaching)
12. Check settings.json into git (37 settings + 84 env vars)
Settings can be scoped: per-codebase, per-folder, per-user, or enterprise-wide. Settings live reload.
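As a concrete example of the permissions and hooks items, a project-scoped settings.json might look like this. A sketch based on the documented schema; the specific matcher and commands are illustrative, not recommendations:

```json
{
  "permissions": {
    "allow": ["Bash(npm run test:*)"]
  },
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          { "type": "command", "command": "npx prettier --write ." }
        ]
      }
    ]
  }
}
```

Checked into git, a file like this means every teammate (and every agent session) gets the same pre-approved commands and the same auto-formatting after each edit.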
→ Read the thread
My take: I already knew many of these, but several were new to me or I simply hadn't applied them yet. Everything connects with harness engineering in my head.
---
🦊 Camofox-Browser: An OpenClaw Browser That Doesn't Get Blocked
An OpenClaw plugin that lets agents browse sites that normally block automation: X, Product Hunt, Amazon, and more. Built on Camoufox, a Firefox fork that spoofs browser fingerprints at the C++ level rather than JavaScript, giving higher probability of passing detection systems.
The problem with JS-level evasion: any property you override in JavaScript can be inspected in JavaScript (property descriptors, prototype chains, and function toString() all leak). C++ is the right layer because those changes are invisible to JavaScript inspection. Camoufox intercepts window geometry, navigator fields, WebGL (GPU fingerprints), WebRTC IP masking, audio fingerprints, the battery API, and speech synthesis, and generates Bézier curve-based mouse trajectories.
Wrapped as a server for LLM efficiency: a Google results page goes from ~500KB of HTML to a ~5KB accessibility tree (100x reduction). Provides accessibility snapshots, element refs (e1, e2, e3), and macros (@google_search, @youtube_search, @amazon_search).
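The compaction idea is easy to sketch: walk the page, keep only the interactive elements, and hand the model short refs instead of raw HTML. A toy Python version of the concept (not Camofox's actual implementation or output format):

```python
from html.parser import HTMLParser

# Elements an agent can act on; everything else is dropped from the snapshot.
INTERACTIVE = {"a", "button", "select", "textarea"}


class Snapshot(HTMLParser):
    """Collapse an HTML page into (ref, tag, label) triples like ('e1', 'a', 'Results')."""

    def __init__(self):
        super().__init__()
        self.refs = []       # accumulated (ref, tag, label) triples
        self._open = None    # interactive tag currently collecting its text
        self._text = ""

    def handle_starttag(self, tag, attrs):
        if tag in INTERACTIVE:
            self._open, self._text = tag, ""

    def handle_data(self, data):
        if self._open:
            self._text += data.strip()

    def handle_endtag(self, tag):
        if tag == self._open:
            ref = f"e{len(self.refs) + 1}"
            self.refs.append((ref, tag, self._text))
            self._open = None


# A huge page reduces to two actionable refs the agent can click by name.
page = "<div><p>Huge page of markup...</p><a href='/x'>Results</a><button>Search</button></div>"
snap = Snapshot()
snap.feed(page)
print(snap.refs)  # → [('e1', 'a', 'Results'), ('e2', 'button', 'Search')]
```

The real tool does far more (roles, states, frames), but the payoff is the same: the model sees a few labeled refs instead of hundreds of kilobytes of markup.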
Installation: `npm install @askjo/camofox-browser` or `openclaw plugins install @askjo/camofox-browser`. Exposes tools: camofox_create_tab, camofox_snapshot, camofox_click, camofox_type, camofox_navigate, camofox_scroll, camofox_screenshot.
→ Read the post → GitHub repo
My take: Online research is one of my key steps when working with agents, but many websites are a hostile environment. Camofox promises to solve this. I haven't tried it yet, but I plan to test it today with OpenClaw and see how to integrate it into my Claude Code setup.
---
That's it for this week. See you next Sunday.
– David

