When a Support Platform Rolled Out AI Summaries: Priya's Story
Priya ran product for a mid-size customer support platform. In March 2024 she greenlit a plan: replace human-written first-pass summaries with model-generated summaries for incoming support tickets. The motivation was simple: speed up triage and reduce average handle time. The experiment used two models in an A/B test: Vectara (internal label v1.2.0, test run 2024-04-18) for the summary path, and a stacked pipeline using third-party retrieval with an AA-Omniscience model (v3.0.7, test run 2024-04-20) as a second variant. For reference, the team also ran checks with OpenAI gpt-4.1 (test run 2024-05-03) as a baseline for generative quality.
Early offline metrics looked excellent. Automated ROUGE and BERTScore numbers favored the Vectara path for short-form summaries and AA-Omniscience returned higher factuality scores on the internal benchmark. Meanwhile, user satisfaction surveys during the 2-week pilot showed a small increase in perceived clarity. The team moved to a scaled rollout with 10k tickets per day. That is when the problems began.
The Hidden Cost of Relying Only on High Summary Scores
Within two weeks of rollout Priya started seeing two classes of failure: missed critical facts in long documents and an inability of the models to admit ignorance for out-of-scope questions. This led to incorrect routing and a spike in escalations. The initial cost analysis that justified the rollout missed three items that proved decisive.
- Evaluation bias: the offline dataset used for tuning was skewed toward short tickets (median 350 words), while production tickets had a long-tail distribution with many multi-attachment and legal-sounding requests.
- Overconfidence: AA-Omniscience showed strong "omniscience" behavior on short passages but seldom responded with "I don't know" when the answer was absent from the documents.
- Uncaptured operational variance: latency and retry behavior differed across models, affecting throughput and human-in-the-loop costs.
As it turned out, the team had treated summary scores as a single-dimension decision criterion. That saved time in the short run but introduced hidden costs in triage quality, escalation labor, and lost revenue from delayed responses.
Why Traditional Evaluation Benchmarks Often Fall Short in Production
Benchmarks are useful, but they can mask the failure modes that matter in production. Below are the methodological problems we found while auditing the pilot.
1. Dataset distribution mismatch
The tuning dataset had 92% tickets under 1,000 tokens. Production distribution had 30% tickets over 2,500 tokens with embedded attachments and threaded history. Models that excel on short passages lost recall on long inputs because prompt truncation or retrieval sparsity removed critical context.
2. Metric misalignment
ROUGE and lexical similarity metrics reward surface-level overlap, not factual completeness or proper "no answer" behavior. AA-Omniscience scored high on ROUGE in our tests (mean ROUGE-L 0.48 on the short dataset, test date 2024-04-20), but its ability to correctly say "not present" on validation examples was only 62%.
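A targeted "no-answer" check of the kind described here is straightforward to implement. Below is a minimal sketch in Python; the field names (`answer_present`, `model_output`) and the refusal phrase list are illustrative assumptions, not artifacts from the pilot.

```python
# Hypothetical sketch: score "no-answer" accuracy alongside overlap metrics.
# Field names and refusal phrases are illustrative assumptions.

REFUSAL_TOKENS = {"not present", "not found", "i don't know"}

def is_refusal(output: str) -> bool:
    """Treat an output as a refusal if it matches a known refusal phrase."""
    return output.strip().lower() in REFUSAL_TOKENS

def no_answer_accuracy(examples):
    """Fraction of answer-absent examples where the model correctly refused."""
    absent = [ex for ex in examples if not ex["answer_present"]]
    if not absent:
        return None
    correct = sum(1 for ex in absent if is_refusal(ex["model_output"]))
    return correct / len(absent)

examples = [
    {"answer_present": False, "model_output": "Not found"},
    {"answer_present": False, "model_output": "The refund was issued on May 2."},  # fabricated
    {"answer_present": True,  "model_output": "Customer requests invoice copy."},
]
print(no_answer_accuracy(examples))  # 0.5 on this toy set
```

A metric like this runs on the same validation set as ROUGE, so the two numbers can be reported side by side instead of letting surface overlap stand in for factual restraint.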
3. Evaluation contamination and prompt leakage
Some evaluation items had overlap with model training corpora or prompt templates used during development. That introduced optimistic bias that did not generalize to unseen enterprise documents containing internal shorthand or policy text.
4. Stability over time
OpenAI gpt-4.1 responses showed small but measurable changes across repeated tests in May 2024. That led the team to track a drift pattern: a 4% drop in extractive accuracy on a stable test set between 2024-05-03 and 2024-05-20. We flagged this as potential model-side calibration or a difference in API default system behavior. That is relevant because production SLAs depend on predictable performance; a vendor changing behavior without warning shifts operational risk.
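A drift check of this sort can be automated as a canary against a frozen gold set. The sketch below is a simplified illustration, assuming per-example predictions and a drop threshold matching the 4% figure above; the run data is made up.

```python
# Illustrative drift canary: compare extractive accuracy on a frozen gold set
# across two dated runs and flag drops beyond a threshold. Run data is made up.

def accuracy(preds, gold):
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)

def drift_alert(baseline_acc, current_acc, threshold=0.04):
    """Return True when accuracy fell by more than the allowed threshold."""
    return (baseline_acc - current_acc) > threshold

gold = ["a", "b", "c", "d", "e"]
run_may03 = ["a", "b", "c", "d", "e"]   # baseline run: accuracy 1.00
run_may20 = ["a", "b", "c", "d", "x"]   # later run: accuracy 0.80

base = accuracy(run_may03, gold)
curr = accuracy(run_may20, gold)
print(drift_alert(base, curr))  # True: a 20% drop exceeds the 4% budget
```

The important design choice is that the gold set is frozen: any accuracy movement then isolates model-side or API-side change rather than data change.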
This led to a core observation: single-number benchmark wins do not equal production readiness unless evaluation covers distribution shift, admission-of-ignorance, and API stability.

How One Engineering Team Discovered the Real Solution to Model Overconfidence and Distribution Shift
Priya’s team instituted a three-step remediation that exposed what mattered in practical terms. They documented versions and test dates for reproducibility: Vectara v1.2.0 (2024-04-18), AA-Omniscience v3.0.7 (2024-04-20), and OpenAI gpt-4.1 (2024-05-03 initial baseline, repeat tests 2024-05-20).
1. Revise evaluation to reflect production inputs. They sampled 10k production tickets, stratified by token length and attachment count, then annotated a 1,000-ticket slice for the factual elements expected in summaries.
2. Introduce "refusal" and "evidence citation" checks. Each summary had to either cite source snippets or return an explicit "not found" token, forcing models to choose humility over fabricating answers.
3. Run cost-sensitivity and latency simulations. They measured tokens, retries, and human-in-the-loop triage cost to compute net economic impact.

As a result, they discovered a trade-off: Vectara produced concise summaries quickly but missed deeply nested facts; AA-Omniscience produced longer outputs with more citations but had a higher token burn and rarely refused to answer. OpenAI gpt-4.1 offered balanced phrasing and better calibration on refusal in some cases, but showed degradation in extractive accuracy across repeated API calls between the early and late May tests.
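The stratified sampling described above can be sketched in a few lines of Python. The bucket edges and ticket field names below are assumptions for illustration, not the team's actual schema.

```python
# Hedged sketch of stratified sampling of production tickets by token length
# and attachment count. Bucket edges and field names are assumptions.
import random

def stratum(ticket):
    """Assign a ticket to a (length bucket, attachment bucket) stratum."""
    tokens = ticket["token_count"]
    length = "short" if tokens < 1000 else "medium" if tokens < 2500 else "long"
    attach = "with_attachments" if ticket["attachments"] > 0 else "plain"
    return (length, attach)

def stratified_sample(tickets, per_stratum, seed=7):
    """Draw up to per_stratum tickets from each stratum, reproducibly."""
    rng = random.Random(seed)
    buckets = {}
    for t in tickets:
        buckets.setdefault(stratum(t), []).append(t)
    sample = []
    for members in buckets.values():
        rng.shuffle(members)
        sample.extend(members[:per_stratum])
    return sample

tickets = ([{"token_count": 300, "attachments": 0}] * 3
           + [{"token_count": 3000, "attachments": 1}] * 3)
sample = stratified_sample(tickets, per_stratum=2)
print(len(sample))  # 4: two tickets from each of the two strata present
```

Sampling per stratum, rather than uniformly, is what guarantees the long-tail tickets that broke the first rollout are represented in the evaluation set.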
Methodology note: the team logged all tests with exact model identifiers and timestamps so they could audit any drift. For example, every gpt-4.1 run included the header: model=gpt-4.1 | run=2024-05-03T14:22Z so deviations could be traced to API changes or network retries. This mattered when renegotiating SLOs with vendors.

From Frequent Escalations to Measured Stability: Real Results and Costs
After remediation, the team ran a second scaled trial for 30 days and collected the following operational metrics. All cost calculations reflect real invoices or simulated per-token pricing used internally during May 2024. Assumptions are explicit so you can adjust them to your context.
Key assumptions used in the cost model
- Daily volume: 10,000 tickets.
- Average production ticket input: 2,500 tokens (long tail accounted for).
- Average summary length: 350 tokens.
- Pricing used (example): gpt-4.1 simulated at $0.03 per 1k prompt tokens and $0.06 per 1k completion tokens. Vectara and AA-Omniscience pricing came from vendor quotes: $0.018 and $0.035 per 1k total tokens respectively. These were the rates used for internal decisioning in May 2024; your actual costs may differ.
- Human-in-the-loop cost: $25 per hour for reviewers, with a capacity of 600 tickets per hour when summaries require manual review.
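The assumptions above plug directly into a simple daily-cost function. The sketch below uses the stated volumes and prices; the per-model manual-review rates passed at the bottom are illustrative placeholders, not figures from the trial.

```python
# Sketch of the cost model using the stated assumptions. The review_rate
# values passed below are illustrative placeholders, not trial figures.

TICKETS_PER_DAY = 10_000
PROMPT_TOKENS = 2_500      # average production ticket input
COMPLETION_TOKENS = 350    # average summary length
REVIEWER_RATE = 25.0       # dollars per hour
REVIEWER_THROUGHPUT = 600  # tickets reviewed per hour

def daily_cost(prompt_price, completion_price, review_rate):
    """Total daily dollars: token spend plus human-in-the-loop review."""
    token_cost = TICKETS_PER_DAY * (
        PROMPT_TOKENS / 1000 * prompt_price
        + COMPLETION_TOKENS / 1000 * completion_price
    )
    reviewed = TICKETS_PER_DAY * review_rate
    review_cost = reviewed / REVIEWER_THROUGHPUT * REVIEWER_RATE
    return token_cost + review_cost

# Vendor quotes price total tokens at one rate, so pass it for both sides.
print(round(daily_cost(0.03, 0.06, review_rate=0.10), 2))    # gpt-4.1: 1001.67
print(round(daily_cost(0.018, 0.018, review_rate=0.05), 2))  # Vectara: 533.83
print(round(daily_cost(0.035, 0.035, review_rate=0.20), 2))  # AA-Omniscience: 1080.83
```

Even with made-up review rates, the structure makes the point from the text: token price alone does not rank the vendors once human review enters the total.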
As you can see, raw token cost is only one part of the equation. Vectara had the lowest total dollars/day in this scenario because its summaries were shorter and the human review rate was lower. Meanwhile, AA-Omniscience incurred higher token spend and required more manual review due to its propensity to assert facts without clear citations.
During the 30-day trial the incident rate due to incorrect triage dropped from 2.1% to 0.45% for the Vectara pipeline after adding citation checks. For AA-Omniscience the incident rate fell from 3.0% to 1.1% once forced refusal thresholds were implemented, but at higher operational cost. The gpt-4.1 path stabilized at 0.6% incident rate but showed a small uptick in April to May when internal tests registered a 4% drop in extractive recall on a frozen gold set.
Why Conflicting Numbers Exist and How to Read Them
You will see different vendors publish different accuracy and latency numbers. Here is why they conflict and how to interpret them in procurement or design decisions.
- Different test sets: vendors optimize on their own public or private test suites. If your ticket distribution differs, vendor claims are not directly comparable.
- Metric selection: surface metrics like ROUGE favor fluency. Factuality and "no-answer" accuracy require targeted checks and adversarial examples.
- Versioning and hidden defaults: vendors change models, tokenization, or system prompts. Small default changes can shift output calibration. Always track model id and test timestamp in logs.
- Sampling and randomness: some vendors report averaged results; others report best-case medians. You need confidence intervals, not single-point statistics.
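Confidence intervals do not require any special tooling: a bootstrap over per-example correctness scores is enough. A minimal sketch, assuming a list of 0/1 scores (for instance the 62% "no-answer" accuracy figure from the audit on a hypothetical 100-item set):

```python
# Sketch: report a bootstrap confidence interval instead of a single-point
# accuracy number. Input is a list of per-example 0/1 correctness scores.
import random

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=11):
    """Percentile bootstrap CI for the mean of binary correctness scores."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        resample = [rng.choice(scores) for _ in scores]
        means.append(sum(resample) / len(resample))
    means.sort()
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

scores = [1] * 62 + [0] * 38  # e.g. 62% accuracy on a 100-item set
lo, hi = bootstrap_ci(scores)
print(f"accuracy 0.62, 95% CI [{lo:.2f}, {hi:.2f}]")
```

On 100 items the interval is wide, which is exactly the point: two vendors whose single-point scores differ by a few percent may be statistically indistinguishable on your data.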
In our work we found that the practical way to compare is: (1) run each candidate model against a stratified sample of your production data, (2) measure refusal and citation behavior, and (3) compute end-to-end operational cost including human review, escalations, and latency penalties.
Interactive Self-Assessment: Is Your Summarization Pipeline Production Ready?
Use this quick checklist to score your readiness. Give yourself 1 point for each "yes."
1. Have you tested models on a sample of your actual production documents that preserves the long-tail characteristics? (Yes/No)
2. Do your metrics include a "no-answer" or refusal rate measure? (Yes/No)
3. Are model identifiers and API call timestamps logged for every inference? (Yes/No)
4. Have you measured end-to-end cost including human review and escalations? (Yes/No)
5. Do you have an automated canary to detect model drift in accuracy? (Yes/No)

Scoring: 5 points: ready for cautious production use with continued monitoring. 3-4 points: you need tighter evaluation and runbook work. 0-2 points: significant risk of hidden cost and misrouting exists.
Practical Recommendations Priya's Team Implemented
- Enforce citation-first summaries: require at least one in-document evidence snippet per fact for high-stakes tickets.
- Implement a tiered pipeline: lightweight summary models for short tickets, heavier citation-enabled flows for long or complex tickets.
- Log exact model versions and timestamps for each inference and store a 30-day sample of outputs to debug drift.
- Add a "refusal budget": force models to explicitly decline when evidence is missing, and track refusal trends as a quality signal.
- Negotiate vendor SLAs that include change notification and rollback windows for model updates that affect output semantics.
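The tiered pipeline and refusal budget above can be sketched together. The routing threshold, pipeline labels, and budget band below are assumptions chosen for illustration, not values from Priya's deployment.

```python
# Illustrative tiered routing plus refusal-budget check. Thresholds,
# pipeline labels, and the budget band are assumptions for illustration.

LIGHTWEIGHT = "lightweight-short-summary"
CITATION_FLOW = "citation-enabled-long-flow"

def route(ticket, token_threshold=1000):
    """Send long or attachment-heavy tickets to the citation-enabled flow."""
    if ticket["token_count"] > token_threshold or ticket["attachments"] > 0:
        return CITATION_FLOW
    return LIGHTWEIGHT

def within_refusal_budget(refusal_rate, low=0.02, high=0.15):
    """A refusal rate far outside the expected band is a quality signal:
    too low suggests fabrication, too high suggests retrieval failure."""
    return low <= refusal_rate <= high

print(route({"token_count": 400, "attachments": 0}))   # lightweight-short-summary
print(route({"token_count": 3200, "attachments": 2}))  # citation-enabled-long-flow
print(within_refusal_budget(0.0))                      # False: suspiciously confident
```

Treating a near-zero refusal rate as an alarm, not a win, is the design choice that caught AA-Omniscience's overconfidence in the trial.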
These changes reduced the daily manual review cost and cut incident rate substantially. Meanwhile, the team maintained a running, versioned benchmark and published a monthly "accuracy by ticket length" report so product and legal teams could see trends.
Final Notes on Decision-Making
When you evaluate model vendors, do not let unfamiliar names or single-number claims sway you. Priya’s story shows that Vectara, AA-Omniscience, and OpenAI gpt-4.1 each had strengths and weaknesses that only became visible under production-like conditions. Quantify those trade-offs with your own data, track versions and timestamps, and include human cost in your economic model.
Meanwhile, keep an eye on vendor-side changes. The small gpt-4.1 drift we observed between 2024-05-03 and 2024-05-20 was a reminder that models evolve. That evolution can be helpful but must be treated as an operational variable, not an afterthought.
Quick checklist before full rollout
- Run stratified production-sample tests for at least 14 days.
- Measure refusal and citation accuracy explicitly.
- Include human review cost and escalation cost in total-cost-of-ownership calculations.
- Log model id and timestamp for every inference.
- Set up drift canaries and vendor change notifications.
This is a practical, data-first approach to avoid being surprised by hidden costs. If you want, I can help design a reproducible test harness (including script snippets and a template for logging model ids and timestamps) that your engineering team can run against 1,000 production samples. Tell me the data access constraints and the models you plan to compare and I will draft a concrete plan with reproducible test steps and cost templates.