GPT-5.4 Shows Real Agent Reliability Gains — Here's the Data
Early benchmarks on GPT-5.4 show 15–20% better multi-step tool-use accuracy and stronger few-shot agent learning. Worth testing against your current stack.
Read post →
Writing
Thoughts on AI, LLMs, shipping products, and the tools I use every day.
xAI completed training on Grok V9-Medium, a 1.5T-parameter model targeting coding and agentic tasks. Expected public release: mid-June 2026.
Read post →
Anthropic's Claude 4 family raises the bar on reasoning, tool use, and context length. Here's what changes for developers building AI-powered products.
Read post →
Most AI prototypes die in demo mode. Here's the decision framework I use to take a product from 'cool demo' to deployed — and what I've learned building Orion, the HBS Campus Guide, and other live apps.
Read post →
I've run both models through the same real-world tasks I use in production. Here's an honest comparison — not benchmarks, but what matters when you're shipping a product.
Read post →