Karpathy for Finance & Ops
Andrej Karpathy (founding member of OpenAI and astute namer of things) gave a talk at Sequoia Capital called "From Vibe Coding to Agentic Engineering." It was nominally geared toward software devs, but the ideas are useful for any function where judgment, systems, and verification all matter. Especially finance and ops. Here are the key points, and a look at the agentic finance stack his framework implies.
Frontier intelligence is jagged, not general
Frontier models aren't generally smart in the way the marketing implies. They are unevenly smart: peaks of capability in some directions, with real gaps right next to those peaks. Karpathy calls this jagged intelligence. The popular story is a smooth intelligence curve climbing toward general competence. Karpathy's view is more nuanced. And more honest.
His example is this car-wash prompt: "I want to go to a car wash to wash my car and it's 50 meters away. Should I drive or should I walk?" State-of-the-art models (Karpathy specifically calls out Opus 4.7) answer walk. Karpathy's reaction:
"How is it possible that state-of-the-art Opus 4.7 will simultaneously refactor a 100,000-line codebase or find zero-day vulnerabilities and yet tells me to walk to this car wash? This is insane."
The same model that produces a clean DCF will, in the next prompt, confidently produce a lopsided balance sheet for you and call it "done". And even when it lands the right answer, there's no guarantee it'll get there the same way next time. Fluency isn't reliability.
Capability spikes come from training data, not general reasoning
Each new frontier model arrives as a black box. Karpathy's phrase: users are "slightly at the mercy" of what the labs put into the training and reinforcement-learning mixes. You don't get a manual. You don't know which capabilities were trained deeply and which were left rough. You find out by using the thing.
His example is chess. GPT-4 looked dramatically better at chess than its predecessor, not because reasoning got uniformly stronger, but because a lot more chess data appears to have made it into pretraining. A specific capability spiked. General reasoning did not.
The translation for finance: do not infer from "this model is great at analyzing SEC filings" that it will accurately understand your multi-entity consolidation, your revenue recognition policy, your chart of accounts. Public filings are well-represented in training data. Your company's specific operating model is not.
The context window is the lever
"What's in the context window is your lever over the interpreter."
In Karpathy's Software 3.0 framing, the model is the interpreter (the engine that runs your instructions, the way Excel's formula engine runs the formulas you write), and the context window is where the program lives. The work isn't typing a clever prompt. It's arranging everything the model needs to see before it generates a token. What you put in front of it is the engineering.
Finance-AI
The picture above is the finance-team version of an agentic system. Walk it left to right.
Business context is everything you put in front of the model so it understands your business. Two layers. The durable layer lives in an LLM-wiki: company and team objectives, the GAAP codification, regulatory updates, historical data, KPI dictionary, accounting workpapers, the financial model. The periodic layer sits on top: forecast scenarios, board materials, contracts, insurance, team retros. Together they're what Karpathy calls the lever — "what's in the context window is your lever over the interpreter." This is bigger than a prompt; it's a curated, versioned, growing "second brain".
The coding harness is where context meets the queue of work. A specific task — a board memo, an FP&A update, a variance analysis, a risk pack — gets composed and routed off that same surface. The harness is how an operator says "this is the task, this is the context, here's what good looks like, go." It's the difference between a fresh ChatGPT tab and a system that knows the company.
The interpreter is the frontier model plus the surfaces it can act on — Google Sheets, NetSuite, Slack, Supabase, GitHub. The connectors are: MCP for governed reads (e.g., pulling GL detail directly into the context window), standard APIs for writes, CLI for the dev-flavored work in between. The agent doesn't simply "have access" — it has named protocols pointing at named systems.
Quality bars are how the output gets ready to ship. Verification against an explicit standard — accuracy, completeness, consistency, variance explanation, format & clarity, source traceability, narrative quality — using agent subchecks alongside human review. Reconciliations, redlines, sign-offs, gold-standard examples — the same conventions a senior teammate would apply, written down so the agent does them every time.
Business context is easy to treat as exhaust — securely archived and forgotten. Treat it as input instead, and every analytical query will be handled smarter than the one before it.
This is where the compounding lives. The mechanical loop: when a verifiable deliverable is finalized — a tied-out close package, a reconciled flux, an adopted forecast scenario, an after-action retro — it all routes back into the business context layer. Today's verified output becomes tomorrow's context. The LLM-wiki gets richer, and the next query starts from a better baseline. Teams that invest in this loop get better outputs every month. Teams that keep starting from a fresh ChatGPT tab get the same mid answers forever.
Verifiability is the trust gate
The threshold for putting agentic work in front of a CFO has two axes. Context lets you instruct. Verifiability lets you trust. Without context, the work is uninformed. Without verifiability, you still have to redo the work from scratch to trust the answer.
The ceiling for people who do this well is very high
The talk's title names the move: from vibe coding to agentic engineering. Vibe coding is casual prompt-and-iterate: chat with a model, accept the output, move on. Agentic engineering is the discipline of coordinating these spiky models with real engineering rigor.
"There is a very high ceiling on agentic engineer capability."
He's blunt: people who get good at this can hit "a lot more than 10x," well past the old 10x-engineer trope. That's a compounding claim. Context engineering, model calibration, verification design, and team memory all stack. They get better the longer you run them.
Human in the loop
Karpathy closes with a constraint that lands as firmly for finance operators as it does for engineers:
"You can outsource your thinking but you can't outsource your understanding."
A controller signs the rep letter. An auditor signs the opinion. A CFO stands behind the board materials. The buck stops with a human. An LLM in the workflow doesn't change that. What's changing is how much else our teams can accomplish with the right compounding AI system.
Technology stack links
Tools used: links to the official repos and websites.
Subscribe for thoughtful, cutting-edge rants from Gary
and the opportunity to beta-test new finance superpowers and parenting superpowers