Case Study: Why a High AA-Omniscience Benchmark and a Low Vectara Number Led to the Wrong Product Decision

How a 120-Person SaaS Company Chose Models Based on Two Numbers

In March 2024 a midsize SaaS company—120 employees, $18M ARR, 10-person customer operations team—had to decide on a core model for its new customer-facing knowledge assistant. Two numbers from the vendor slides stood out: Vectara scored a modest 58/100 on the vendor's summarization benchmark; AA-Omniscience scored 92/100 on the same benchmark. The decision felt straightforward. AA-Omniscience's headline number suggested superior summarization quality. The product team approved integration with AA-Omniscience, with an initial monthly inference budget of $8,000.

Why did leadership treat those two numbers as decisive? The sales decks were clear, the vendor demo was polished, and internal stakeholders wanted a single simple metric that could justify the budget. The core assumption: higher benchmark score equals better real-world performance. What followed showed the danger of that assumption.

Why High Summarization Scores Hid a Larger Failure Mode

Within three weeks of deployment the support team began escalating odd issues. User-facing answers from AA-Omniscience were polished and concise on most queries, matching the vendor's ROUGE and BLEU-style indicators. But the team noticed two recurring problems.

    Confident but incorrect answers: The model produced answers that looked authoritative yet contained factual errors tied to product-specific details—SKU names, recent policy changes, and rare error codes.
    Low admission of ignorance: When the knowledge base lacked the necessary facts, AA-Omniscience rarely said "I don't know" or asked for clarification. It instead produced plausible-sounding guesses that increased follow-up tickets.

We measured the issue. Over 500 real support queries sampled in April 2024, AA-Omniscience produced verifiably incorrect answers 18% of the time. Vectara, run in parallel on the same set with a different prompt, produced incorrect answers 6% of the time but gave an explicit "I don't know" or asked for more context on 22% of queries versus AA-Omniscience's 3%. The difference wasn't captured by the vendor summarization benchmark.

Why did that happen? Benchmarks that reward fluency and overlap with target text will favor models that optimally compress and paraphrase training-like inputs. They rarely penalize confident hallucinations or reward calibrated refusals. The result: a model that shines on ROUGE can still fail at the practical requirement of admitting uncertainty.
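The gap between the two models can be made concrete by tabulating per-query outcomes. Below is a minimal sketch of the kind of tally used here; the label lists are illustrative stand-ins constructed to match the April sample rates quoted above, not the raw data.

```python
from collections import Counter

def outcome_rates(labels):
    """Fraction of queries falling into each outcome category.

    labels: one string per query, each "correct", "incorrect",
    or "refusal" (an explicit "I don't know" / clarifying question).
    """
    counts = Counter(labels)
    total = len(labels)
    return {k: counts.get(k, 0) / total
            for k in ("correct", "incorrect", "refusal")}

# Illustrative labels matching the April sample rates (100 queries):
aa_labels = ["incorrect"] * 18 + ["refusal"] * 3 + ["correct"] * 79
vectara_labels = ["incorrect"] * 6 + ["refusal"] * 22 + ["correct"] * 72

print(outcome_rates(aa_labels)["incorrect"])     # 0.18
print(outcome_rates(vectara_labels)["refusal"])  # 0.22
```

A summarization benchmark never sees the "refusal" column at all, which is exactly why the difference stayed invisible until the team tallied it themselves.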

A Data-Driven Validation Plan: Measure Honesty, Not Just Fluency

The product and ML teams paused the rollout and designed a validation plan focused on operational metrics rather than headline benchmark scores. The plan had three pillars:

1. End-to-end, in-domain evaluation. Build a test corpus of 1,200 real tickets and engineering logs covering rare errors and policy edge cases.
2. Calibration and refusal metrics. Label each output "correct", "incorrect", or "appropriate refusal". Compute a Brier score and a refusal precision metric.
3. Cost and latency modeling. Track inference spend per 1,000 queries and median latency under production load.

We tested two models in specific versions and on recorded dates to keep results reproducible: Vectara Retrieval-Augmented Model v0.9 (tests run 2024-03-12) and AA-Omniscience LLM v2.1 (tests run 2024-03-15). We kept prompts, retrieval windows, and top-k retrieval constant across runs. Human raters (n=12) labeled outputs blind to model ID, assigning the three labels and noting whether the response contained a fabricated fact tied to an internal doc.


Key evaluation metrics we recorded:

    Accuracy (human-verified correct answers)
    Hallucination rate (proportion of answers with fabricated facts)
    Refusal rate (proportion of answers where the model correctly declined to answer)
    Calibration via Brier score (probabilistic confidence vs. correctness)
    Cost per 1,000 queries and median latency
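The two less-familiar metrics in that list, Brier score and refusal precision, are simple to compute. Here is a minimal sketch under the labeling scheme described above; the input formats are assumptions, not the team's exact schema.

```python
def brier_score(confidences, correct):
    """Mean squared gap between stated confidence (0-1) and actual
    correctness (True/False). Lower is better; 0 is perfect calibration."""
    return sum((c - float(ok)) ** 2
               for c, ok in zip(confidences, correct)) / len(confidences)

def refusal_precision(responses):
    """Of all refusals, the fraction raters judged appropriate.

    responses: list of (refused: bool, appropriate: bool) per query.
    """
    judged = [appropriate for refused, appropriate in responses if refused]
    return sum(judged) / len(judged) if judged else 0.0

# A perfectly calibrated pair of answers scores 0.0:
print(brier_score([1.0, 0.0], [True, False]))

# Two refusals, one judged appropriate -> precision 0.5:
print(refusal_precision([(True, True), (True, False), (False, True)]))
```

A low refusal rate with high refusal precision is the target: the model declines rarely, but when it does, it is right to.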

Implementing the Internal Evaluation: A 90-Day Timeline

We executed the validation plan over 90 days. Below is the week-by-week breakdown used by the team.

Weeks 1-2: Build the in-domain corpus

Tasks: extract 1,200 representative queries from support tickets (last 12 months), tag 200 edge-case tickets (rare errors), anonymize data. Output: a test corpus with metadata for product version, region, and severity.

Weeks 3-4: Create the test harness

Tasks: implement a runner that sends identical prompts and retrieved passages to each model, captures tokens, latency, and raw output. Add synthetic prompts that probe for refusal (e.g., "If you don't know, say so"). Output: reproducible evaluation scripts and logging format.
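A runner of this shape can be sketched in a few lines. The `model_client` callables below are hypothetical stand-ins for whatever API wrappers your vendors provide; the point is that every model receives identical inputs and every output is logged in one reproducible format.

```python
import json
import time

def run_case(model_client, prompt, passages):
    """Send one (prompt, retrieved passages) pair to a model and
    capture latency plus the raw output for later blind labeling.

    model_client: hypothetical callable (prompt, passages) -> str.
    """
    start = time.perf_counter()
    output = model_client(prompt, passages)
    latency_ms = (time.perf_counter() - start) * 1000
    return {"prompt": prompt, "output": output,
            "latency_ms": round(latency_ms, 2)}

def run_suite(model_clients, cases, log_path):
    """Run every case against every model; one JSON line per result."""
    with open(log_path, "w") as log:
        for name, client in model_clients.items():
            for case in cases:
                record = run_case(client, case["prompt"], case["passages"])
                record["model"] = name
                log.write(json.dumps(record) + "\n")
```

Writing JSON lines rather than a single document means a crashed run still leaves every completed result on disk.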

Weeks 5-6: Prompt and retrieval tuning

Tasks: set consistent retrieval window (top-5 docs, 4,000 token window), tune prompts for concise answers and explicit refusal. Output: fixed prompt templates for both models.
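A fixed template along these lines makes the refusal behavior explicit rather than hoped-for. The wording below is a hypothetical reconstruction, not the team's actual template; the top-5 truncation mirrors the retrieval window stated above.

```python
# Hypothetical prompt template enforcing concise answers and explicit refusal.
REFUSAL_TEMPLATE = """You are a support assistant. Answer using ONLY the
retrieved passages below. If the passages do not contain the answer,
reply exactly: "I don't know. Could you share more detail?"

Passages:
{passages}

Question: {question}
Concise answer:"""

def build_prompt(question, passages):
    # Enforce the fixed top-5 retrieval window before formatting.
    joined = "\n---\n".join(passages[:5])
    return REFUSAL_TEMPLATE.format(passages=joined, question=question)
```

Keeping one template for both models is what makes the refusal-rate comparison fair: any difference in refusals reflects the model, not the prompt.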

Weeks 7-8: Human evaluation

Tasks: train 12 raters on label definitions, run blind labeling on 1,200 outputs. Output: labeled dataset and inter-rater reliability (Krippendorff's alpha = 0.78).

Weeks 9-10: Analysis and A/B setup

Tasks: compute metrics, build an A/B test plan for hallucination rates in production with 10k live queries, define success thresholds (reduce hallucination below 5%, refusal precision > 85%). Output: decision checklist and A/B instrumentation.
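The decision checklist reduces to a single predicate over the measured metrics. This is a minimal sketch; the defaults mirror the success thresholds stated in the plan, and the metric-dict keys are assumed names.

```python
def meets_rollout_thresholds(metrics,
                             max_hallucination=0.05,
                             min_refusal_precision=0.85):
    """True when a candidate pipeline clears both A/B success criteria:
    hallucination below 5% and refusal precision above 85%."""
    return (metrics["hallucination_rate"] < max_hallucination
            and metrics["refusal_precision"] > min_refusal_precision)

# Using the case-study numbers: the hybrid pipeline passes,
# the AA-Omniscience-only baseline does not.
print(meets_rollout_thresholds(
    {"hallucination_rate": 0.04, "refusal_precision": 0.90}))  # True
print(meets_rollout_thresholds(
    {"hallucination_rate": 0.18, "refusal_precision": 0.55}))  # False
```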

Weeks 11-13: Production A/B and rollout

Tasks: run the A/B test for two weeks, collect telemetry, iterate on prompt tuning. Output: a final deployment plan for a hybrid retrieval-plus-calibrated-answer pipeline.

From Confident Hallucinations to Measurable Improvements: Results After 6 Months

What did the tests and subsequent changes produce? We switched from an AA-Omniscience-only deployment to a hybrid pipeline: Vectara retrieval + conservative answer generation with AA-Omniscience when retrieval confidence was high. We also enforced a strict refusal policy in prompts.
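The routing logic at the heart of the hybrid pipeline can be sketched as follows. The `retrieve` and `generate` callables and the 0.7 threshold are hypothetical placeholders; the actual confidence signal and cutoff were tuned during the A/B phase.

```python
def answer(query, retrieve, generate, min_confidence=0.7):
    """Hybrid pipeline sketch: retrieval first, free-form generation
    only when retrieval confidence clears a threshold, and an explicit
    refusal otherwise.

    retrieve: hypothetical callable query -> (passages, confidence)
    generate: hypothetical callable (query, passages) -> str
    """
    passages, confidence = retrieve(query)
    if confidence < min_confidence:
        # Enforced refusal policy: never guess on weak retrieval.
        return "I don't know. Could you share more detail about the issue?"
    return generate(query, passages)
```

The key design choice is that the refusal is produced by the pipeline, not left to the model's discretion, which is what made the refusal rate controllable.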

| Metric | AA-Omniscience Only (initial) | Vectara-Only | Hybrid (post-implementation) |
|---|---|---|---|
| Hallucination rate (sample) | 18% | 6% | 4% |
| Refusal rate | 3% | 22% | 19% |
| Refusal precision | 55% | 89% | 90% |
| Brier score (lower is better) | 0.32 | 0.12 | 0.10 |
| Monthly inference cost | $8,200 | $2,400 | $5,600 |
| Monthly escalations to human support | 1,260 | 810 | 640 |
| CSAT (post-interaction, 5-point) | 3.8 | 4.1 | 4.3 |

Concrete outcomes in six months:

    Hallucination rate dropped from 18% to 4% in production.
    Human escalations fell by 49% relative to the AA-only baseline.
    Customer satisfaction rose from 3.8 to 4.3.
    Monthly inference spend decreased by 32% compared with the AA-only deployment.

Which change moved the needle most? The biggest wins came from two actions: (1) routing low-retrieval-confidence queries away from free-form generation and (2) enforcing explicit refusal language for ambiguous prompts. Those two changes cut hallucinations sharply while keeping fluent summarization when facts were present.

3 Critical Lessons About Benchmark Numbers and Real-World Performance

What should teams learn from this case?

1. One metric rarely tells the operational story. Benchmarks that reward overlap or fluency do not measure calibrated honesty. Ask vendors for refusal and calibration metrics, and test these yourself on in-domain data.
2. Measure hallucination and refusal explicitly. Add labels for "faked fact" and "appropriate refusal" to any evaluation. Compute Brier scores or expected calibration error rather than relying solely on ROUGE or BLEU analogs.
3. Evaluate the whole pipeline, not just the model. Retrieval quality, prompt phrasing, and cutoff thresholds change risk profiles. A lower-scoring retrieval-first model can produce safer, more accurate outcomes when paired with conservative generation rules.

Do you value polished answers or correct answers? Many teams implicitly prioritize fluency because it's visible in demos. Real users care about correctness and trust. Which metric aligns with your product's business risk?

How Your Team Can Run the Same Checks Before Signing a Contract

If you're evaluating vendors, replicate our validation approach. Here are concrete steps you can run in 4-8 weeks with 1-2 engineers and an analyst.

1. Create a 1,000-2,000 query in-domain test set extracted from real tickets, anonymized.
2. Define labels: correct, incorrect, fabricated fact, appropriate refusal. Train 6-12 raters and compute inter-rater reliability.
3. Run identical prompts and retrieval settings across candidate models. Record full outputs, token usage, and latency.
4. Compute hallucination rate, refusal rate, Brier score, and cost per 1,000 queries. Compare against operational thresholds.
5. Run a small live A/B test (1-5% traffic) to confirm offline results under real user behavior.
6. Model the cost tradeoffs: what is acceptable monthly spend to reduce hallucination by X percentage points?
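For the cost-tradeoff step, a useful summary number is extra monthly dollars per percentage point of hallucination removed. A minimal sketch, using the case-study figures for the hybrid pipeline versus the cheaper Vectara-only option as the worked example:

```python
def cost_per_point(candidate_spend, baseline_spend,
                   baseline_hallucination, candidate_hallucination):
    """Extra monthly spend per percentage point of hallucination removed,
    comparing a candidate pipeline against a baseline.

    Spends are monthly dollars; hallucination rates are fractions (0-1).
    """
    extra_spend = candidate_spend - baseline_spend
    drop_pp = (baseline_hallucination - candidate_hallucination) * 100
    return extra_spend / drop_pp

# Hybrid ($5,600/mo, 4% hallucination) vs Vectara-only ($2,400/mo, 6%):
# ($5,600 - $2,400) / 2 points = $1,600 per point per month.
print(round(cost_per_point(5600, 2400, 0.06, 0.04)))
```

Whether $1,600 per point is worth paying depends on what an escalated ticket and eroded user trust cost you, which is exactly the question step 6 asks you to answer.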

Ask vendors these questions during the sales cycle:

    How does your model express uncertainty? Can it be prompted to refuse?
    Do you report calibrated confidence metrics like Brier score or ECE on customer data?
    Can you share per-query token cost and latency at the 95th percentile under load?
    What benchmarks do you publish that test for hallucination specifically? Are those public and reproducible?

What if vendors refuse to provide raw scores or allow testing?

That's a red flag. Insist on a trial or a limited pilot with performance clauses. If a vendor still refuses, you can run a black-box evaluation using your own test set and contract a short pilot. The cost of a careful pilot is typically a fraction of the potential cost of increased escalations and lost user trust.

Comprehensive Summary

This case shows a common trap: teams pick models based on a single high-level benchmark number and assume that number captures all operational risks. In our example AA-Omniscience's 92/100 summarization score created a false sense of security. Real-world data revealed a high hallucination rate and an inability to admit ignorance. A Vectara-based retrieval-first approach scored lower on vendor summarization numbers but proved safer and more cost-effective when evaluated end-to-end.


Key takeaways:

    Benchmarks are useful but incomplete. They should be one input among several.
    Measure the things that matter in production: hallucination rate, refusal precision, calibration, cost, and latency.
    Build short pilots that reproduce vendor claims before wide rollout.
    Use blind human labeling and clear failure definitions.

Ask yourself: Which failures are acceptable for your product? How much incorrect information will users tolerate before trust erodes? The answers to those questions should guide your model selection and the metrics you demand from vendors.
