Most enterprise AI agent purchases in 2026 are made under three pressures at once. The board wants to ship agents. The security team wants to slow down. The vendor wants to close the quarter. The buyer in the middle has thirty days to figure out which platform will still be the right answer in three years.
This is the checklist we wish every buyer would use on us. It is the same checklist Clarm Atlas was designed to pass. If a vendor cannot answer most of these with a live demo and a real audit log, they are selling a roadmap, not a substrate.
The numbers that should frame every demo
Before the first vendor call, internalize the three numbers that define the 2026 agent-buying environment:
- ~8x. The share of enterprise applications integrating task-specific AI agents is projected to go from under 5% at the start of 2025 to roughly 40% by the end of 2026. The pressure to ship is real and not going away.
- 88%. The share of organizations that shipped agents in the last year and reported a confirmed or suspected security incident. The pressure to slow down is also real.
- 14%. The share of agents that reach production with full security and IT approval. Most agent deployments live in a gray zone where executive confidence (around 82%) far exceeds the controls actually in place.
Every question below exists to close that gap between the 82% confidence number and the 14% reality.
The substrate questions
These are the architectural choices that decide whether the platform you buy in May 2026 is still the platform you trust in May 2028. If a vendor cannot answer these on a single call, they are not ready for an enterprise deployment.
1. Source receipts: does every answer point to the document it came from?
Ask for a live demo. Ask the agent a question. The answer should arrive with a document name, a section, and a version number attached, not a footnote-style citation rendered after the fact. If the citation is generated by a separate process from the retrieval, two failure modes appear: citations that look real but do not match what the model actually used, and citations that disappear when the model paraphrases. Source-receipt-on-retrieval is a substrate property. Source-receipt-on-render is a feature you can lose at any time.
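To make the distinction concrete, here is a minimal sketch of source-receipt-on-retrieval in Python. Every name in it (Receipt, Chunk, Answer, retrieve, answer) is hypothetical and chosen for illustration, and the keyword matching stands in for a real retriever. The point it shows is structural: the receipt travels with the chunk from the moment of retrieval, so the answer cannot exist without it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Receipt:
    document: str
    section: str
    version: str


@dataclass(frozen=True)
class Chunk:
    text: str
    receipt: Receipt  # attached at retrieval time, never reconstructed later


@dataclass(frozen=True)
class Answer:
    text: str
    receipts: tuple[Receipt, ...]


def retrieve(query: str, index: list[Chunk]) -> list[Chunk]:
    """Toy keyword retrieval: return chunks that share a word with the query."""
    words = set(query.lower().split())
    return [c for c in index if words & set(c.text.lower().split())]


def answer(query: str, index: list[Chunk]) -> Answer:
    chunks = retrieve(query, index)
    if not chunks:
        # No source, no answer: refuse rather than respond without a receipt.
        raise LookupError("no supporting document found")
    # A real system would pass the chunks to an LLM; this sketch just joins them.
    text = " ".join(c.text for c in chunks)
    return Answer(text=text, receipts=tuple(c.receipt for c in chunks))
```

In the demo, the equivalent question is whether the receipt fields are populated by the retrieval step itself or filled in by a later rendering pass.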
2. Approval gate: what can the agent do without a human reviewing first?
The single highest-leverage architectural question. The right answer is: nothing that reaches a customer, a CRM record, a regulator, a contract, or a payment system. The agent drafts; an operator approves; only then does anything move. If the vendor offers an “autonomous mode” toggle, ask what happens to the audit trail when it is on. If the answer is “the audit log still captures the action”, the answer is wrong, because what you wanted was for the action not to happen.
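A draft-then-approve gate can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of any vendor's implementation: Draft, ApprovalGate, propose, and approve_and_run are hypothetical names. What it shows is that the side effect is held as an uncalled function until an operator signs off, so an unapproved action never happens rather than merely being logged.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass(eq=False)  # compare drafts by identity, not by field values
class Draft:
    description: str
    execute: Callable[[], str]  # the external side effect, not yet run
    approved_by: Optional[str] = None


@dataclass
class ApprovalGate:
    pending: list[Draft] = field(default_factory=list)

    def propose(self, description: str, execute: Callable[[], str]) -> Draft:
        """The agent calls this. Nothing external happens here."""
        draft = Draft(description, execute)
        self.pending.append(draft)
        return draft

    def approve_and_run(self, draft: Draft, operator: str) -> str:
        """The operator calls this. Only now does the action run."""
        if draft not in self.pending:
            raise ValueError("draft is not awaiting approval")
        self.pending.remove(draft)
        draft.approved_by = operator
        return draft.execute()
```

The design choice worth probing in a demo is whether the agent has any code path to the external system that does not pass through the equivalent of approve_and_run.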
3. Audit trail: can your auditor replay a specific agent action a year from now?
Ask for the audit-log schema and a sample log entry. The log should be append-only, structured (not free-text), and queryable by tenant, agent, action type, and time range. Critical: it should capture every retrieval, every LLM call (prompt, model, response), every approval decision, and every external action. If any of those are missing, the audit story has a gap your compliance team will find at the worst possible time.
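As a reference point for what "structured and queryable" means, here is a minimal in-memory sketch in Python. The field names and the AuditEntry and AuditLog classes are assumptions made for illustration, not a schema from any real product. A production log would also need append-only enforcement at the storage layer, which an in-memory list cannot provide.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class AuditEntry:
    timestamp: datetime
    tenant: str
    agent: str
    action_type: str  # "retrieval", "llm_call", "approval", or "external_action"
    payload: dict     # structured detail, e.g. prompt, model, and response


def utc_now() -> datetime:
    """Timezone-aware timestamps keep time-range queries unambiguous."""
    return datetime.now(timezone.utc)


class AuditLog:
    """Exposes append and query only; there is no update or delete method."""

    def __init__(self) -> None:
        self._entries: list[AuditEntry] = []

    def append(self, entry: AuditEntry) -> None:
        self._entries.append(entry)

    def query(self, tenant: str, agent: Optional[str] = None,
              action_type: Optional[str] = None,
              start: Optional[datetime] = None,
              end: Optional[datetime] = None) -> list[AuditEntry]:
        # Tenant is mandatory; the other filters narrow the result if given.
        return [
            e for e in self._entries
            if e.tenant == tenant
            and (agent is None or e.agent == agent)
            and (action_type is None or e.action_type == action_type)
            and (start is None or e.timestamp >= start)
            and (end is None or e.timestamp < end)
        ]
```

When the vendor shows a sample entry, check that each of the four action types appears as a distinct, filterable value rather than as text buried in a message field.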
4. Tenant isolation: where is it enforced?
The right answer is at the database layer, not at the application layer. Application-layer tenant filters are one developer mistake away from cross-tenant data leakage; database-layer filters are not. Ask: “If a developer forgot to add the tenant filter to a query, what would happen?” The correct answer is “the query would return nothing” or “the query would fail”, not “our QA process catches that.”
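One way to picture database-layer isolation is a database per tenant, sketched below with Python's built-in sqlite3 module. This is only one possible approach, and the TenantStore class and tickets table are hypothetical; row-level security policies in a database such as PostgreSQL are another common route to the same property. The sketch shows the behavior the question is probing for: a query with no tenant filter at all still cannot see another tenant's rows.

```python
import sqlite3


class TenantStore:
    """Gives each tenant its own database, so isolation does not depend on
    every query remembering a WHERE tenant_id = ? clause."""

    def __init__(self) -> None:
        self._dbs: dict[str, sqlite3.Connection] = {}

    def connection_for(self, tenant: str) -> sqlite3.Connection:
        if tenant not in self._dbs:
            # Each ":memory:" connection is a separate, private database.
            conn = sqlite3.connect(":memory:")
            conn.execute(
                "CREATE TABLE tickets (id INTEGER PRIMARY KEY, body TEXT)"
            )
            self._dbs[tenant] = conn
        return self._dbs[tenant]


store = TenantStore()
store.connection_for("acme").execute(
    "INSERT INTO tickets (body) VALUES (?)", ("acme secret",)
)
store.connection_for("globex").execute(
    "INSERT INTO tickets (body) VALUES (?)", ("globex secret",)
)

# The developer "forgot" the tenant filter. Only acme's rows exist here.
unfiltered = store.connection_for("acme").execute(
    "SELECT body FROM tickets"
).fetchall()
```

Whatever mechanism the vendor uses, the test is the same: run the unfiltered query and see what comes back.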
5. Bring-your-own LLM: can you switch the model without rebuilding the substrate?
Model portability is becoming a hard procurement requirement. The EU AI Act is rolling out. Vendor concentration risk is on procurement scorecards. The cost of switching from Claude to an OpenAI model to a self-hosted open-source model in two years should be hours of configuration, not a six-month re-platform. Ask: “If we wanted to swap the LLM provider tomorrow, what changes?” The right answer is one line in a config file. The wrong answer is a roadmap discussion.
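The "one line in a config file" answer implies a particular shape of code, sketched here in Python. The provider classes are deliberately fake stand-ins (they make no network calls and wrap no real SDK), and LLMProvider, PROVIDERS, build_provider, and the llm_provider config key are all hypothetical names. What matters is that the workflow code depends only on the interface.

```python
from typing import Callable, Protocol


class LLMProvider(Protocol):
    def complete(self, prompt: str) -> str: ...


class EchoProvider:
    """Stand-in for one vendor's model."""

    def complete(self, prompt: str) -> str:
        return f"[echo] {prompt}"


class UppercaseProvider:
    """Stand-in for a different vendor's model."""

    def complete(self, prompt: str) -> str:
        return prompt.upper()


# Registry mapping a config value to a provider factory.
PROVIDERS: dict[str, Callable[[], LLMProvider]] = {
    "echo": EchoProvider,
    "upper": UppercaseProvider,
}


def build_provider(config: dict) -> LLMProvider:
    name = config["llm_provider"]
    if name not in PROVIDERS:
        raise ValueError(f"unknown LLM provider: {name!r}")
    return PROVIDERS[name]()


def draft_reply(provider: LLMProvider, question: str) -> str:
    # Workflow code sees only the interface, never a vendor SDK.
    return provider.complete(f"Answer briefly: {question}")
```

Swapping providers here means changing the value of llm_provider and nothing else. If the vendor's retrieval, audit, or approval code imports a specific model SDK directly, the swap is not a config change.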
6. Marketplace approval policy: what happens when a third party publishes a skill?
The OpenClaw ClawHub incident in early 2026 is the canonical lesson. ClawHub ran with no approval gate on uploaded skills. Antiy CERT confirmed roughly 1,184 malicious skills across the registry at peak. The Atomic macOS Stealer payload landed on developer workstations through skill packages whose listings looked benign. Ask every vendor: “If we install a third-party connector or skill, who reviews it? Can it update itself? Can my own team review the diff before it goes live?” If the answer is “the vendor curates the marketplace”, ask for the curation policy in writing.
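A review gate that also covers self-updates can be approximated by pinning each approved skill to a content hash. The Python sketch below is an assumption-laden illustration, not how any particular marketplace works: SkillRegistry, approve, and can_install are hypothetical names. It uses SHA-256 from the standard library so that any change to the package bytes, including a silent update, invalidates the approval.

```python
import hashlib


class SkillRegistry:
    """Allows installation only of the exact bytes a reviewer approved."""

    def __init__(self) -> None:
        # skill name -> (approved sha256 digest, reviewer)
        self._approved: dict[str, tuple[str, str]] = {}

    @staticmethod
    def digest(package: bytes) -> str:
        return hashlib.sha256(package).hexdigest()

    def approve(self, name: str, package: bytes, reviewer: str) -> None:
        """Record that a named reviewer signed off on these exact bytes."""
        self._approved[name] = (self.digest(package), reviewer)

    def can_install(self, name: str, package: bytes) -> bool:
        approved = self._approved.get(name)
        if approved is None:
            return False  # never reviewed
        # An updated package has a different digest and needs a fresh review.
        return approved[0] == self.digest(package)
```

The useful follow-up question for a vendor is who holds the equivalent of approve: the vendor, your own team, or nobody.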
7. Per-tenant audit export: can you hand a regulator the evidence package?
The audit log is necessary but not sufficient. The export pipeline is the part that turns the log into a SOC 2 evidence package, a GDPR right-of-access response, a FINMA review submission, or a HIPAA breach-notification artifact. Ask for a sample export. If the vendor cannot show one, the compliance team will have to write the export tooling themselves, and that work always slips past the date the regulator already chose.
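For a sense of what a per-tenant export involves at minimum, here is a small Python sketch. It assumes audit entries are available as dictionaries with a "tenant" key, which is an illustrative assumption, and export_tenant and the package fields are hypothetical names. Real evidence packages carry far more (scoping, signatures, retention metadata); this shows only the core of filtering to one tenant and attaching a checksum so the recipient can verify the file was not altered.

```python
import hashlib
import json


def export_tenant(entries: list[dict], tenant: str) -> dict:
    """Build a JSON Lines export of one tenant's entries plus a checksum."""
    rows = [e for e in entries if e["tenant"] == tenant]
    # sort_keys makes the serialization, and therefore the hash, deterministic.
    body = "\n".join(json.dumps(r, sort_keys=True) for r in rows)
    return {
        "tenant": tenant,
        "entry_count": len(rows),
        "sha256": hashlib.sha256(body.encode("utf-8")).hexdigest(),
        "jsonl": body,
    }
```

When the vendor shows a sample export, check that it is scoped to a single tenant by construction and that it can be regenerated for an arbitrary date range.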
8. Failure-mode transparency: what does the vendor say did not work?
A vendor that publishes only a curated highlight reel is harder to evaluate than one that publishes what they retired and why. Ask to see the “what we tried and stopped doing” list. The shape of that list is the most accurate predictor of how the vendor will behave when something goes wrong in your deployment.
The customer-evidence questions
Substrate properties matter for the next three years. Customer evidence matters for the next thirty days.
9. Name a customer who has been on this platform for at least 12 months and added at least three workflows since go-live.
A pilot that ran for three months and a customer that has been compounding for twelve months are two completely different things. The 12-month customer has lived through the failure modes, the LLM swap, the change in compliance posture, the integration that broke. Ask what changed over the year and whether the substrate held.
10. What was the path that customer took from one agent to many?
The right answer looks like Legacy, a healthcare growth team running on Atlas: email-only support deflection at go-live, then web chat once the knowledge base proved itself, then voice agents on inbound calls, then agents integrated with CRM and kit-ordering systems from the connector catalogue. Twelve months in, total case volume across all channels is roughly 8x what it was at go-live. The customer’s team has been in the approval seat the entire 12 months. If a vendor cannot describe a customer’s expansion pattern in this much detail, the vendor probably does not have one.
11. What is the operator’s daily work?
Ask to see a recording or a screenshare of a real operator using the platform. The operator’s daily work is the part of the platform you are actually buying. If the operator spends an hour a day clicking through queues that should auto-route, the platform is making your team slower, not faster. If the operator spends an hour a day signing off on work that the agent drafted with the source attached, the platform is doing what it is supposed to do.
The procurement questions
12. Cancel-anytime, or annual commit?
Monthly cancel-anytime is a feature, not a flexibility concession. It aligns the vendor’s incentives with shipping value continuously. Annual commits are fine if the price is right, but they do less work for the buyer than they do for the vendor.
13. What does the pilot commit you to?
The right shape is: a fixed-fee 4-6 week pilot that builds the first workflow on your data, with the pilot fee crediting against the first months of subscription if you proceed. The wrong shape is: a free pilot that locks you into a 12-month contract on signature.
14. Deployment model: cloud, single-tenant, on-prem, air-gapped?
Match this to your data classification, not to the vendor’s preferred deployment shape. Regulated industries should expect the on-prem option to be available; if it is not, the vendor is not built for a buyer like you.
What to ignore
- Model accuracy benchmarks. Every vendor cherry-picks. Run the test on your own data.
- “Powered by GPT-N” marketing. The substrate matters more than the underlying model. The model is the part you swap.
- Total-agents-shipped vanity metrics. What matters is how many agents reach production with full security review (industry-wide: about 14%).
- Analyst quadrant placement. Useful as a filter, useless as a decision. Buy from the vendor whose architecture answers the substrate questions, not the vendor with the best slide deck.
The shortest version of this guide
One line: buy substrate, not features. Features ship every quarter; substrate decides what is possible to ship in three years.
For the Atlas answer to each question above, read “What Is Atlas?” For the cautionary tale that anchors this guide, read “The Agent-Security Moment.” For the lived proof that the substrate-first model holds over a year of real production, read “How Legacy Went 8x in 12 Months on Atlas.”