The Three Questions Your AI Vendor Can't Answer


You're evaluating AI tools. The demos are impressive. The sales engineers are confident. The case studies are compelling. The pricing seems reasonable.

And you're about to buy something that won't work in your environment.

Not because the vendor is lying. Not because the technology is bad. But because the questions you're asking can't reveal whether the system will actually solve your problem.

Here's what's happening—and what to ask instead.

The Questions Everyone Asks (That Don't Help)

Standard vendor evaluation questions:

  • What's your accuracy on benchmark datasets?
  • What industries do you serve?
  • Do you have SOC 2 compliance?
  • Can you integrate with our existing systems?
  • What's your uptime guarantee?

These are reasonable questions. They're also almost useless.

They tell you whether the vendor is competent at basic operations. They don't tell you whether the AI will work once it encounters your actual data, your actual edge cases, your actual constraints.

Because vendors optimize their answers for questions they know you'll ask.

They have prepared responses. They have benchmark results. They have integration documentation. They've done this hundreds of times.

What they haven't done is figure out whether their system's fundamental assumptions match your reality.

Question 1: "What assumptions about data distribution are embedded in your model?"

What vendors will say:

  • "Our model is trained on diverse, representative data"
  • "We use industry-standard datasets"
  • "We continuously update with new data"

What this doesn't tell you:

  • Whether "diverse" means diverse in ways that matter for YOUR use case
  • What was systematically excluded from training data
  • Whether "representative" means representative of your deployment population
  • What happens when your data violates their training assumptions

Why this matters:

AI models compress patterns from training data. If your deployment environment has different distributional properties than the training environment, the model's predictions may be confidently wrong.

Example:

A clinical decision support tool trained on academic medical center data may perform well in benchmarks but fail in community hospital settings—not because the model is bad, but because patient populations, documentation practices, available diagnostics, and care patterns differ systematically.

The vendor's accuracy metrics don't reveal this because they measured performance on data similar to what the model saw during training.

What to actually ask:

"Can you describe the data generation process that produced your training set? What populations, contexts, or scenarios were underrepresented or absent? How do you detect when deployment data violates distributional assumptions you made during training?"

If they can't answer this specifically, they don't actually know when their model's predictions are trustworthy.
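To make this concrete: here's a minimal sketch of the kind of check a credible answer implies, assuming you can get a reference sample of the vendor's training or validation data plus a recent slice of your own. It compares each numeric feature's distribution and flags the ones that diverge. The test choice, column handling, and 0.1 threshold are illustrative assumptions, not any vendor's actual tooling.

```python
# Minimal sketch: per-feature drift check between a reference sample
# (what the model was trained/validated on) and your deployment data.
# The KS test and 0.1 threshold are illustrative assumptions.
import pandas as pd
from scipy.stats import ks_2samp

def drift_report(reference: pd.DataFrame, deployment: pd.DataFrame,
                 threshold: float = 0.1) -> pd.DataFrame:
    """Flag numeric features whose deployment distribution diverges
    from the reference distribution (two-sample Kolmogorov-Smirnov)."""
    rows = []
    for col in reference.select_dtypes("number").columns:
        if col not in deployment.columns:
            continue
        result = ks_2samp(reference[col].dropna(), deployment[col].dropna())
        rows.append({"feature": col,
                     "ks_statistic": round(result.statistic, 3),
                     "p_value": round(result.pvalue, 4),
                     "flagged": result.statistic > threshold})
    return pd.DataFrame(rows).sort_values("ks_statistic", ascending=False)

# Anything flagged here is input the model never really "saw":
# report = drift_report(vendor_reference_sample, our_last_30_days)
# print(report[report["flagged"]])
```

The point isn't this particular test. It's that a vendor with a real answer can show you something equivalent running against their own training distribution, not just describe it.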

Question 2: "How does your system distinguish between confident-and-correct versus confident-and-wrong predictions?"

What vendors will say:

  • "Our model outputs confidence scores"
  • "You can set thresholds based on your risk tolerance"
  • "We provide uncertainty quantification"

What this doesn't tell you:

  • Whether high confidence scores actually correlate with accuracy on YOUR data
  • How the model behaves on inputs it's never seen before
  • Whether "uncertainty quantification" means the model knows what it doesn't know
  • What happens when the model confidently extrapolates beyond its training regime

Why this matters:

Models trained through optimization can produce high-confidence outputs for inputs that violate their training distribution. The confidence score reflects how well the input matches patterns the model learned—not whether the prediction is actually correct.

This is particularly dangerous in high-stakes domains where confident-but-wrong outputs get trusted.

Example:

A fraud detection system may flag legitimate transactions from a newly acquired customer segment with high confidence—not because these transactions are actually fraudulent, but because they don't match patterns the model learned from the historical customer base. The model is confidently applying learned patterns to a context where those patterns don't apply.

What to actually ask:

"How do you validate that confidence scores remain calibrated when deployed on data that differs from your training distribution? What mechanisms detect when the model is making confident predictions on out-of-distribution inputs? Can you show me examples where your system correctly indicated high uncertainty rather than producing a confident wrong answer?"

If they can't demonstrate this, their confidence scores are performance metrics, not epistemic indicators.
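You can run a rough version of this check yourself during a pilot, assuming the vendor will score a labeled slice of your own data: bin predictions by stated confidence and compare against observed accuracy. The sketch below is one way to do it; the field names and ten-bin choice are assumptions for illustration.

```python
# Minimal sketch: check whether the vendor's confidence scores are
# calibrated on YOUR labeled data (not their benchmark).
import numpy as np

def calibration_table(confidence: np.ndarray, correct: np.ndarray,
                      n_bins: int = 10):
    """Compare stated confidence to observed accuracy per bin and
    return the rows plus the expected calibration error (ECE)."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    rows, ece = [], 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        upper = (confidence < hi) if hi < 1.0 else (confidence <= hi)
        mask = (confidence >= lo) & upper
        if not mask.any():
            continue
        avg_conf = confidence[mask].mean()
        accuracy = correct[mask].mean()
        ece += mask.mean() * abs(avg_conf - accuracy)
        rows.append((f"{lo:.1f}-{hi:.1f}", int(mask.sum()),
                     round(float(avg_conf), 3), round(float(accuracy), 3)))
    return rows, ece

# confidence: the model's stated probability per prediction.
# correct: 1 if the prediction matched your ground-truth label, else 0.
# rows, ece = calibration_table(scores, labels_correct)
# A large gap between confidence and accuracy in any bin is exactly the
# "confident-and-wrong" failure mode this question is probing.
```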

Question 3: "What happens when our operating context changes in ways you didn't anticipate?"

What vendors will say:

  • "We support continuous learning and model updates"
  • "Our system adapts to new patterns"
  • "You can retrain with your own data"

What this doesn't tell you:

  • Whether their "adaptation" means learning new patterns or just reinforcing existing ones
  • How you'll know when the model's assumptions have been violated
  • What mechanisms prevent drift between documented behavior and actual behavior
  • Who is accountable when the model fails in novel circumstances

Why this matters:

Deployment environments change. New regulations, new competitors, new customer behaviors, new edge cases. The model that worked six months ago may be systematically wrong today—and you won't know until failures accumulate.

Example:

A pricing optimization model trained during stable market conditions may continue to produce confident recommendations during supply chain disruption—but those recommendations are based on causal relationships that no longer hold. The model hasn't learned new patterns; it's applying old patterns to a changed reality.

What to actually ask:

"How do you detect when the causal relationships your model learned are no longer valid? What governance mechanisms ensure deployed behavior matches documented behavior as conditions change? If our context shifts in ways you didn't anticipate during development, how do we know to stop trusting the outputs?"

If they can't specify monitoring for context drift, not just performance drift, you're buying a system that will degrade silently.
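As one sketch of what "explicit triggers" can look like in practice: write the documented assumptions down as executable checks and route violations to a human instead of silently accepting the score. The specific assumptions below (a price range, a set of known segments) are hypothetical examples, not anyone's production rules.

```python
# Minimal sketch: documented assumptions as executable checks, with an
# explicit "stop trusting the output" path when they don't hold.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Assumption:
    name: str
    holds: Callable[[dict], bool]  # True if the assumption holds for this input

ASSUMPTIONS = [
    Assumption("unit_price_in_training_range",
               lambda x: 1.0 <= x["unit_price"] <= 500.0),
    Assumption("customer_segment_seen_in_training",
               lambda x: x["segment"] in {"retail", "smb", "enterprise"}),
]

def guarded_prediction(model_predict, record: dict) -> dict:
    """Return the model output only when documented assumptions hold;
    otherwise escalate to human review instead of trusting the score."""
    violated = [a.name for a in ASSUMPTIONS if not a.holds(record)]
    if violated:
        return {"decision": "needs_human_review", "violated": violated}
    return {"decision": "model", "output": model_predict(record)}

# A record from a context the model never saw gets escalated, not scored:
# print(guarded_prediction(lambda r: 0.97,
#                          {"unit_price": 4200.0, "segment": "gov"}))
```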

Why Vendors Struggle With These Questions

It's not that they're hiding something.

It's that these questions reveal problems they haven't solved:

Most AI development optimizes for performance on known benchmarks. Vendors measure what they can measure: accuracy on test sets, speed, integration compatibility, uptime.

What they don't measure (because it's harder):

  • Distributional assumptions they're making
  • Conditions under which their model's logic breaks down
  • Epistemic validity outside the training regime
  • Mechanisms for detecting when predictions shouldn't be trusted

This isn't unique to any one vendor. It's a structural feature of how AI systems are currently built and evaluated.

Models compress training data into representations that predict well on similar data. That's what they're optimized for. Asking them to also know when they're operating outside their valid regime requires a different architecture, one most vendors haven't built.

What Good Answers Actually Look Like

If a vendor CAN answer these questions well, here's what you'll hear:

On distributional assumptions:

  • Specific descriptions of training data limitations
  • Documented cases where the model is known to fail
  • Mechanisms for detecting when deployment data differs from training data
  • Honest acknowledgment of what populations/contexts weren't represented

On confidence calibration:

  • Evidence that confidence scores track accuracy on out-of-distribution inputs
  • Examples of the model correctly expressing uncertainty
  • Mechanisms for flagging extrapolation beyond the training regime
  • Separation between "model confidence" and "epistemic validity"

On context change:

  • Monitoring for distributional shift, not just performance drift
  • Documented assumptions that must hold for predictions to be valid
  • Explicit triggers for human review when assumptions are violated
  • Governance for when to stop trusting the system

If you hear vague reassurances instead of specific mechanisms, the vendor hasn't solved these problems.
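To make the "separation between model confidence and epistemic validity" point concrete, here's a hypothetical sketch: carry both on every prediction so downstream code can't mistake one for the other.

```python
# Hypothetical sketch of the separation: the score the model produced vs.
# whether that score should be trusted at all, carried together.
from dataclasses import dataclass, field

@dataclass
class Prediction:
    value: str                 # what the model predicted
    model_confidence: float    # the model's own score (0-1)
    in_distribution: bool      # did the input pass drift/assumption checks?
    assumptions_violated: list = field(default_factory=list)

    @property
    def trustworthy(self) -> bool:
        # High confidence alone is not enough; the input also has to be
        # inside the regime the confidence was validated on.
        return self.in_distribution and not self.assumptions_violated

p = Prediction(value="approve", model_confidence=0.98,
               in_distribution=False, assumptions_violated=["new_segment"])
print(p.model_confidence, p.trustworthy)  # 0.98 False: confident, not valid
```

The design choice is small but forcing: anyone consuming the prediction has to decide what to do when confidence is high and validity is not.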

What To Do With This Information

These questions won't disqualify every vendor.

Most won't have perfect answers. The goal isn't perfection; it's understanding where the gaps are so you can decide whether they matter for your use case.

Some deployments can tolerate uncertainty about distributional assumptions.

Some can't.

The vendor's answers tell you:

  • Whether they understand their own system's limitations
  • How much epistemic risk you're inheriting
  • What monitoring you'll need to build yourself
  • Whether their confidence in the demo is warranted

Then you can make an informed decision:

  • Deploy with appropriate oversight
  • Build additional validation mechanisms
  • Limit deployment to contexts where assumptions hold
  • Or recognize the system won't work in your environment and walk away

What you can't do is assume the demo performance will transfer to production.

Not without understanding what assumptions make that transfer valid.

The Real Evaluation Question

Before asking about accuracy, integration, or pricing, ask:

"Under what conditions does this system's fundamental logic break down, and how will we know when we've encountered those conditions?"

If the vendor can't answer that, you're not evaluating an AI system.

You're buying a black box and hoping it works.


Need help evaluating whether an AI tool will actually work in your environment? Let's look at what questions your specific context requires. Contact me here.

Jen