AI in 2026: A Practitioner's Guide to What Comes Next
Software 2.0 easily automates what you can verify.
— Andrej Karpathy, Verifiability
We're a few days into 2026. The hype cycles have settled into something more useful: pattern recognition. Read Phil Schmid's predictions, Rob Toews' Forbes analysis, and Karpathy's 2025 year-in-review together, and a practical framework emerges: not just what might happen, but how to think about your work every day.
The Daily Question: Is This Verifiable?
Here's a mental tool you can use immediately. Before delegating any task to an LLM or agent, ask: can the output be verified? The answer predicts how much you can trust the result.
Verifiable means the environment is resettable (you can try again), efficient (many attempts are possible), and rewardable (there's a clear signal for success). Code compilation, test suites, mathematical proofs, structured data extraction: these are highly verifiable. The model has practiced millions of variations against clear feedback. Creative writing, strategic recommendations, nuanced judgment calls: these are weakly verifiable. The model is generalizing from patterns, not optimizing against ground truth.
This isn't about capability limits in the abstract. It's about your workflow right now. When you ask an agent to refactor a function, you can verify the result: tests pass, types check, behavior unchanged. Trust accordingly. When you ask an agent for architectural advice, there's no verification signal. The output might be brilliant or subtly wrong in ways that only surface months later. Adjust your review depth accordingly.
The verifiability question also guides where to invest automation effort. If you're building internal tools, prioritize automating tasks where you can construct verification. A data pipeline that validates outputs against schema? Highly automatable. A report that requires judgment about what's important? Keep humans in the loop longer.
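As a minimal sketch of that kind of verification gate, assuming pydantic v2 and a record schema whose fields are invented purely for the example, the pipeline only accepts model output that parses cleanly:

```python
# Minimal sketch: validate LLM output against a schema before it enters the pipeline.
# Assumes pydantic v2; the record and its fields are illustrative, not a real schema.
from pydantic import BaseModel, ValidationError


class ExtractedRecord(BaseModel):
    company: str
    funding_round: str
    amount_usd: float


def validate_llm_output(raw_json: str) -> ExtractedRecord | None:
    """Return a parsed record if the model's output matches the schema, else None."""
    try:
        return ExtractedRecord.model_validate_json(raw_json)
    except ValidationError:
        # Verification failed: route to retry, fallback, or human review.
        return None
```

Anything that fails validation goes to a retry, a fallback, or a human; nothing passes silently into downstream systems.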
Architecting for Jagged Intelligence
Karpathy's year-in-review crystallizes something practitioners have felt: LLMs are simultaneously genius polymaths and confused grade schoolers. They spike in capability wherever verification is possible (math, code, puzzles) and remain surprisingly weak elsewhere. This jaggedness isn't a bug being fixed; it's a fundamental characteristic of how these systems are optimized.
The practical implication: stop designing agents as if capability were uniform. A shallow agent that handles everything will hit the weak spots constantly. Deep agents with specialized sub-agents work because each sub-agent operates in a narrower capability band where the model is more reliably strong.
When you're designing an agent system, map the task to the jagged frontier. Which sub-tasks are verifiable? Those can be delegated with confidence. Which require judgment, context, or common sense? Those need deep context engineering, human oversight and/or explicit guardrails. The Deep Agents pattern with explicit to-do tracking helps because it creates verification infrastructure: each completed task is a checkpoint you can inspect. The to-do list isn't just project management; it's a verification artifact that compensates for the model's weakness in long-horizon planning.
Here's a concrete pattern: break complex tasks into steps, give each step a verifiability score, and use that score to decide the verification pattern, the level of context engineering, and whether a human needs to stay in the loop. A research agent finds sources (verifiable: did it return relevant URLs?). An extraction agent pulls key facts (verifiable: do the facts appear in the sources?). A synthesis agent combines facts into a narrative (less verifiable: requires human review). By isolating the less verifiable step, you know exactly where to focus attention.
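Here is one way that routing could look in code. It's a rough sketch, not any particular framework's API; the threshold and field names are chosen purely for illustration:

```python
# Sketch of the step decomposition above: each step declares how verifiable it is,
# and that score decides whether a machine check is enough or a human must review.
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    name: str
    run: Callable[[dict], dict]                # the agent call for this step
    verify: Optional[Callable[[dict], bool]]   # machine check, if one exists
    verifiability: float                       # 0.0 (pure judgment) .. 1.0 (fully checkable)


HUMAN_REVIEW_THRESHOLD = 0.5  # below this, a human signs off


def run_pipeline(steps: list[Step], state: dict) -> dict:
    for step in steps:
        state = step.run(state)
        if step.verify and not step.verify(state):
            raise RuntimeError(f"Verification failed at step: {step.name}")
        if step.verifiability < HUMAN_REVIEW_THRESHOLD:
            state[f"{step.name}_needs_human_review"] = True  # flag for the review queue
    return state
```

In the research/extraction/synthesis example, the first two steps would carry high scores and machine checks, while synthesis would sit below the threshold and always be flagged for review.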
Your Review Skills Are the Moat
Schmid predicts engineers will spend 99% of their time reviewing, evaluating, and thinking. This sounds dramatic but matches what's already happening. The question isn't whether this shift occurs; it's whether you're building the right review muscles.
Code review changes when AI writes the code. You're no longer checking if a junior developer understood the requirements. You're checking if the model's pattern-matching produced something that looks correct but has subtle issues: wrong assumptions baked in, edge cases missed, abstractions that don't fit your codebase's conventions, dangerous fallbacks... The failure modes are different, so your review approach needs to adapt.
Concretely: develop a review checklist for AI-generated code; it won't be the same as the one you use for human-written code. Does it handle the actual edge cases in your system, or generic ones? Does it use your existing abstractions, or reinvent them? Does it match the error-handling patterns in your codebase? AI code often looks clean but doesn't integrate well. Your value is in catching that. Your merge request review Definition of Done should include this checklist, agreed on collaboratively by the team.
Plus, trace and store everything you can. Most agent frameworks now support observability that shows which tools were called, what intermediate results looked like, where decisions were made. When an agent produces a wrong answer, the bug is usually in the reasoning and tool selection chains, not the final step. Build the habit of tracing failures back to their source.
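A framework-agnostic version of that habit can be as small as a decorator. This sketch records every tool call, its inputs, and its result to a local trace; the in-memory store and the truncation limit are arbitrary choices for the example:

```python
# Minimal tracing sketch: wrap tool functions so every call, its inputs, and its
# output land in a local trace you can replay after a failure.
import functools
import time

TRACE: list[dict] = []  # in practice, write to disk or your observability backend


def traced(tool_fn):
    @functools.wraps(tool_fn)
    def wrapper(*args, **kwargs):
        record = {"tool": tool_fn.__name__, "args": repr(args),
                  "kwargs": repr(kwargs), "ts": time.time()}
        try:
            result = tool_fn(*args, **kwargs)
            record["result"] = repr(result)[:500]  # truncate large outputs
            return result
        except Exception as exc:
            record["error"] = repr(exc)
            raise
        finally:
            TRACE.append(record)
    return wrapper
```

Decorate each tool with `@traced` and a wrong final answer becomes a chain of recorded decisions you can walk backwards.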
Benchmarks Are Broken, Build Your Own Evals
Karpathy notes his growing distrust of benchmarks. The core issue: benchmarks are verifiable environments by construction, so they're immediately susceptible to optimization. Labs grow capability "jaggies" to cover benchmark pockets in embedding space. Training on the test set has become an art form.
For practitioners, this means published benchmarks tell you little about how a model will perform on your specific tasks. The solution: build your own evaluation sets. Collect examples from your actual use cases. Include the edge cases that matter for your domain. Run new model versions against your evals before adopting them.
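A first version of such an eval set needs no tooling at all. In this sketch, `call_model` stands in for whatever client you use and the cases are illustrative placeholders; the point is a single score you can compare across model versions:

```python
# Minimal eval harness sketch: your own cases, your own checks, run against any model.
from typing import Callable

EVAL_CASES = [
    {"prompt": "Extract the invoice total from: <your real example here>",
     "check": lambda out: "total" in out.lower()},
    {"prompt": "Summarize this support ticket: <your real example here>",
     "check": lambda out: 0 < len(out) < 800},
]


def run_evals(call_model: Callable[[str], str]) -> float:
    passed = 0
    for case in EVAL_CASES:
        output = call_model(case["prompt"])
        if case["check"](output):
            passed += 1
    return passed / len(EVAL_CASES)  # compare this score across model versions
```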
This isn't optional overhead; it's core infrastructure. Teams with good evals can confidently adopt new models when they help and reject them when they don't. Teams without evals are guessing and vibe testing. And there's no need to switch models every time a new "frontier" release appears: the switch should be worth it, and validated against your evals, before anything ships.
For agent systems, evals are even more critical. An agent that works 90% of the time might fail catastrophically on the 10% that matters most to your users. Build evals that specifically target your failure modes. If your agent handles customer requests, include the adversarial cases, the ambiguous requests, the edge cases where wrong answers are costly. The multi-agent patterns we've explored make this easier: each sub-agent can be evaluated independently against its specific success criteria.
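Extending the harness to agents, a sketch like this (the agents, tags, and checks are again illustrative) scores each sub-agent separately and breaks results down by failure mode, so a regression on the adversarial cases can't hide behind a good average:

```python
# Sketch: evaluate each sub-agent independently, with cases tagged by failure mode.
from collections import defaultdict
from typing import Callable

SubAgent = Callable[[str], str]

CASES = [
    {"agent": "extraction", "tag": "adversarial", "prompt": "<case input>",
     "check": lambda out: "refund approved" not in out.lower()},
    {"agent": "extraction", "tag": "ambiguous", "prompt": "<case input>",
     "check": lambda out: "clarify" in out.lower()},
    {"agent": "synthesis", "tag": "costly", "prompt": "<case input>",
     "check": lambda out: len(out) > 0},
]


def run_agent_evals(agents: dict[str, SubAgent]) -> dict[tuple[str, str], float]:
    scores: dict[tuple[str, str], list[bool]] = defaultdict(list)
    for case in CASES:
        output = agents[case["agent"]](case["prompt"])
        scores[(case["agent"], case["tag"])].append(case["check"](output))
    return {key: sum(vals) / len(vals) for key, vals in scores.items()}
```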
Edge Deployment Is Coming (But Which Edge?)
Two distinct "edge" paradigms are emerging, and conflating them leads to confusion.
Edge Apps (Cursor, Claude Code): The model still runs in the cloud, but the agent lives on your machine. As Karpathy notes, Anthropic got this right with Claude Code: the distinction isn't about where the AI computation runs but about everything else around it: your already-booted computer with its installation, context, data, secrets, configuration, and low-latency interaction. These edge apps orchestrate cloud LLM calls while accessing your local environment. OpenAI's early Codex efforts missed this by focusing on cloud containers orchestrated from ChatGPT instead of localhost. The "ghost" lives on your computer even though its brain is in the cloud.
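In code, the split is easy to picture. This sketch is purely illustrative, with `cloud_complete` standing in for whatever hosted model API you call: the loop, the files, and the environment stay local, and only the prompt the agent assembles leaves the machine:

```python
# Sketch of the edge-app split: agent and context local, model call remote.
import os
import pathlib


def cloud_complete(prompt: str) -> str:
    """Placeholder for a cloud LLM API call (any hosted provider)."""
    raise NotImplementedError


def local_agent_step(task: str, repo_root: str = ".") -> str:
    # The agent reads local context (file names, environment) on your machine
    # and sends only the prompt it constructs to the cloud model.
    files = [str(p) for p in pathlib.Path(repo_root).glob("*.py")]
    context = (f"Task: {task}\n"
               f"Local files: {files}\n"
               f"Shell: {os.environ.get('SHELL', 'unknown')}")
    return cloud_complete(context)
```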
Full Edge (SLMs on device): Schmid predicts personal agents migrate to edge-first architectures with Small Language Models running locally. Here both the agent AND the model run on device. This is a fundamentally different architecture with different constraints and benefits.
Edge apps already change the feedback loop significantly. A pure cloud agent processes your request and returns a result; you accept or reject it. An edge app like Cursor observes you correcting its output in real-time and accumulates that signal through its local context. This tighter feedback enables personalization that cloud-only agents can't match.
Full edge with SLMs amplifies this further. If you're building for this future, start experimenting with smaller models now. The patterns that work with Claude or GPT-5 may not work with a 7B parameter model. Context windows are smaller. Reasoning is shallower. You'll need more explicit structure, more aggressive task decomposition, more careful prompt engineering. The deep agent patterns become even more important: explicit to-do tracking compensates for weaker implicit planning, specialized sub-agents keep each model call within its capability band.
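One hedged sketch of what that looks like in practice, with an arbitrary character budget standing in for a real token limit and `slm_complete` as a placeholder for a local model call:

```python
# Sketch: explicit decomposition and a hard context budget per call, instead of
# one long instruction. The budget and prompt format are illustrative.
from typing import Callable

MAX_PROMPT_CHARS = 4_000  # rough proxy for a small model's usable context


def decompose_for_slm(task: str, subtasks: list[str],
                      slm_complete: Callable[[str], str]) -> list[str]:
    """Run each subtask as its own short, explicitly structured call."""
    results = []
    for i, subtask in enumerate(subtasks, start=1):
        prompt = (
            f"Overall task: {task}\n"
            f"Step {i} of {len(subtasks)}: {subtask}\n"
            "Answer only this step, in at most 5 bullet points."
        )
        assert len(prompt) <= MAX_PROMPT_CHARS, "decompose further before calling the model"
        results.append(slm_complete(prompt))
    return results
```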
The privacy and latency benefits of full edge are obvious. The learning benefits from tighter feedback loops (available in both edge paradigms) matter more in the long run.
Content Provenance Becomes Infrastructure
Schmid predicts "AI Slop" creates a premium for human-created content, with cryptographic signatures (C2PA) becoming standard for proving human origin. The default assumption flips: content is AI until proven human.
The broader pattern: as AI-generated content becomes indistinguishable in quality, origin becomes the differentiator. Systems that can answer "who made this and how?" will have advantages that quality-focused systems lack. This intersects with agent development too: when an agent synthesizes information from multiple sources, knowing the provenance of each piece is crucial.
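A minimal version of that bookkeeping, without any real C2PA tooling and with illustrative field names, is simply carrying the source along with every fact the agent extracts:

```python
# Sketch: attach provenance to every extracted fact so the synthesized output can
# answer "who made this and how?". Field names and the example are illustrative.
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class SourcedFact:
    claim: str
    source_url: str
    retrieved_at: datetime
    extractor: str  # which sub-agent or model version produced it


def cite(fact: SourcedFact) -> str:
    return f"{fact.claim} [source: {fact.source_url}, retrieved {fact.retrieved_at:%Y-%m-%d}]"


example = SourcedFact(
    claim="Cryptographic provenance signatures are gaining adoption.",
    source_url="https://example.com/provenance-report",
    retrieved_at=datetime.now(timezone.utc),
    extractor="extraction-agent-v1",
)
```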
The Industry Context
Toews predicts Anthropic goes public, AGI discourse cools, and China makes chip progress. What does this mean for daily work?
More capital in the ecosystem means more competition on APIs and tools. Expect pricing pressure, new entrants, and faster iteration on developer experience. The practical response: don't over-invest in any single provider's ecosystem. Build or use abstractions that let you swap models.
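A thin abstraction is usually enough. This sketch uses a structural interface with placeholder providers (not real SDK wrappers), so application code never imports a vendor client directly:

```python
# Sketch of a provider abstraction so models can be swapped behind one interface.
from typing import Protocol


class ChatModel(Protocol):
    def complete(self, prompt: str) -> str: ...


class ProviderA:
    def complete(self, prompt: str) -> str:
        raise NotImplementedError  # wrap one vendor's SDK here


class ProviderB:
    def complete(self, prompt: str) -> str:
        raise NotImplementedError  # wrap another vendor's SDK here


def answer(model: ChatModel, question: str) -> str:
    # Application code depends only on the ChatModel interface,
    # so swapping providers is a one-line change at construction time.
    return model.complete(question)
```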
The cooling AGI discourse is a gift. It creates space for practical building without the distraction of existential speculation. Sutskever estimates five to twenty years; Karpathy estimates ten. These timelines suggest years of productive work ahead on systems that are useful today. The deep agent patterns, the evaluation infrastructure, the edge deployment strategies: these investments pay off regardless of when or whether AGI arrives.
Also, China's chip progress affects supply chains and may influence export controls. If you're planning infrastructure that depends on specific hardware, build in flexibility. The geopolitical situation is fluid; your architecture shouldn't be.
Working With Ghosts
Karpathy frames LLMs as humanity's "first contact" with non-animal intelligence. We're not evolving animals; we're summoning ghosts. The optimization pressure is completely different: human neural nets optimized for tribal survival in the jungle, LLM neural nets optimized for imitating text and collecting rewards on puzzles.
This reframe is practically useful. When you're debugging an agent, you're not diagnosing why a junior developer made a mistake. You're understanding why a ghost optimized for text prediction and puzzle rewards produced this particular output. The failure modes are different. The ghost doesn't "misunderstand" in the human sense; it pattern-matches to something that looked similar in training.
The agent patterns we build are protocols for working with ghosts. The orchestrator pattern treats the ghost as capable but forgetful, benefiting from external memory. The multi-agent pattern treats it as a specialist that spikes in narrow domains. The to-do tool creates verification artifacts that help the ghost track its own progress. These aren't just engineering choices; they're implicit theories about what these entities are.
Refining these theories through deployment experience will help you build better systems than teams that rely on benchmarks or on intuitions borrowed from human intelligence.
Building for 2026
The verifiability question should become instinct. Before delegating, ask it. The answer shapes everything: how much you trust the output, how deeply you review, whether automation is even worth pursuing.
Trace your failures. Every agent breakdown has a reasoning chain behind it. Finding where that chain snapped builds the intuition that benchmarks can't give you. Your evals should grow from these failures: real examples, real edge cases, real production weirdness. They're your compass when new models drop and the hype cycle spins up again.
Stay loose on providers, tight on patterns. The deep agent architectures, knowledge (context) engineering, the verification infrastructure, the edge deployment experiments: these transfer across whatever model or API dominates next quarter. The ecosystem rewards flexibility; brittleness will cost you.
The intelligence we're working with is jagged, optimized for verification, and fundamentally different from human reasoning. Those who internalize this and build accordingly will have significant advantages. The predictions for 2026 are interesting. The framework for navigating them is essential.
PA,