June 11, 2025
Author: Jeff Wu
Key Contributors: Lily Sierra, Nour Alkhatib, Steven Gross, Jacquelene Obeid, Kyle Swint, Monta Shen, Gary Song, Riddhima Sejpal, Jatin Jain, Shishir Kumar Prasad, Ayesha Saleem
Millions of people rely on Instacart to fulfill their grocery needs. While most orders go smoothly, issues sometimes come up - like missing items, delayed deliveries, refund questions, or membership inquiries. In these moments, customers expect quick and consistent help, especially when something has gone wrong.
Traditionally, resolving these issues meant connecting customers with human agents - a process that often involved frustrating wait times, frequently at moments inconvenient for our customers. The rise of AI-powered chatbots has made it possible for us to build our own support chatbot to transform the customer experience by handling these requests faster and at scale. But building the chatbot is just one part of the challenge - what's even more critical is how we evaluate it. A chatbot can only be as good as our ability to measure whether it's actually helping customers in real conversations.
To meet this challenge, we developed the LLM-Assisted Chatbot Evaluation (LACE) framework - an automated system designed to evaluate chatbot interactions effectively and reliably. LACE helps us spot issues, monitor performance, and surface actionable insights that improve the quality of our customer support.
Chatbot Evaluation Flow with LACE

In the sections that follow, we'll walk through how we built and implemented LACE - from defining evaluation criteria, to building and validating the framework, to applying it in production to improve chatbot quality over time.
Creating an effective evaluation system begins with clearly defining what "good" looks like. We approached this challenge from the top down, identifying five key dimensions critical to successful chat interactions.
Each of these dimensions contains granular criteria addressing specific aspects of the user experience. For example, under "Answer Correctness," we evaluate contextual relevancy, factual correctness, consistency, and usefulness. Together, these create a comprehensive picture of whether the chatbot is providing truly helpful information.
We evaluate each chat session - defined as a full, multi-turn conversation between a customer and the support chatbot - against the set of binary (True/False) criteria across five key dimensions. Individual criterion scores are aggregated to generate a total score for that session, providing a holistic quality assessment. The diagram below illustrates how each dimension maps to specific, measurable criteria.
Evaluating chats via binary scoring across five key dimensions

Our evaluation framework wasn't built in isolation. Through internal pilot studies, collaboration with industry experts, and hands-on experience with LLM-based evaluation prompts, we crafted criteria that minimize ambiguity and align closely with our goals.
We learned that it's important to strike a balance between precision and practicality. While our criteria are designed to isolate distinct aspects of chat quality, in practice, LLMs can struggle to draw clean boundaries between closely related categories. This can lead to situations where multiple criteria are marked as failed, even when only one is clearly at fault. Rather than enforcing rigid separation, we prioritize identifying the primary issue affecting the interaction. To support this, we collect not only binary True/False scores but also rationale outputs from the LLM, which help surface the model's reasoning and guide future refinement.
Our experiments also revealed binary scoring to be more effective than granular scales. While a 1-10 scale might seem more precise, binary evaluations provide greater consistency, simplicity, and alignment with human judgment. This streamlined approach requires less extensive prompt engineering while maintaining robust performance - a lesson that reinforced our philosophy of seeking practical effectiveness over theoretical perfection.
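To make this concrete, here is a minimal sketch (not our production code) of how binary per-criterion results and their rationales could roll up into a session-level score; the field and function names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class CriterionResult:
    dimension: str   # e.g. "Answer Correctness"
    criterion: str   # e.g. "contextual_relevancy"
    passed: bool     # binary True/False judgment from the evaluator
    rationale: str   # free-form explanation returned alongside the score

def session_score(results: list[CriterionResult]) -> float:
    """Fraction of criteria that passed; 1.0 means a fully clean session."""
    return sum(r.passed for r in results) / len(results) if results else 0.0

def failed_criteria(results: list[CriterionResult]) -> list[CriterionResult]:
    """Failures (with rationales) to surface for deeper investigation."""
    return [r for r in results if not r.passed]
```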
With our criteria established, we explored three distinct LLM-based methods to ensure comprehensive assessment of chatbot performance:
Direct Prompting represents our baseline approach, where an LLM evaluates chatbot interactions based on our predefined criteria. For example, we might ask, "Did the chatbot provide answers that correctly reflect the context of the prior conversation?" and the LLM responds with a simple "yes" or "no" based on the criteria. While efficient, this method doesn't give the LLM room to refine its assessment, potentially limiting evaluation depth.
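As a rough illustration, a Direct Prompting evaluator might look like the sketch below; `call_llm` is a placeholder for the actual LLM client, and the prompt wording is illustrative rather than our production prompt.

```python
# Sketch of the Direct Prompting baseline. `call_llm` stands in for whatever
# chat-completion client is actually used; it is not a real API.
def call_llm(prompt: str) -> str:
    ...  # placeholder: send `prompt` to an LLM and return its text reply

DIRECT_PROMPT = """You are evaluating a customer-support chatbot conversation.

Criterion: Did the chatbot provide answers that correctly reflect the context
of the prior conversation?

Conversation:
{conversation}

Answer with a single word: "yes" or "no"."""

def direct_evaluation(conversation: str) -> bool:
    reply = call_llm(DIRECT_PROMPT.format(conversation=conversation))
    return reply.strip().lower().startswith("yes")
```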
Agentic Evaluation via Reflection introduces a sophisticated two-step process. First, the LLM performs an initial evaluation of the conversation, just like it would in Direct Prompting. However, it then reflects on its initial assessment to identify potential biases or gaps, adjusting and refining its judgment. This self-reflection mechanism helps deliver more nuanced, accurate evaluations by allowing the model to catch and correct its own limitations [1, 2, 3].
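A minimal sketch of this reflection loop, reusing the same `call_llm` placeholder and illustrative prompts, might look like this:

```python
# Sketch of the two-step reflection flow: an initial judgment followed by a
# self-critique pass that may revise it. Prompts are illustrative only.
def call_llm(prompt: str) -> str:
    ...  # placeholder for an actual LLM client call

def reflective_evaluation(conversation: str, criterion: str) -> str:
    # Step 1: initial evaluation, as in Direct Prompting.
    initial = call_llm(
        f"Evaluate the conversation against this criterion: {criterion}\n\n"
        f"Conversation:\n{conversation}\n\n"
        "Give a verdict (pass/fail) and a short rationale."
    )
    # Step 2: self-reflection -- look for biases or gaps, then finalize.
    return call_llm(
        "You previously produced this evaluation:\n"
        f"{initial}\n\n"
        "Reflect on it: identify possible biases, missed context, or gaps, "
        "then output a final verdict (pass/fail) and rationale."
    )
```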
Agentic Evaluation via Debate creates a virtual courtroom where three distinct agents - each simulated by an LLM - evaluate the chatbot's performance from different perspectives. The evaluation is conducted in three sequential steps:
1. A Customer agent assesses the conversation from the customer's point of view.
2. A Support agent assesses the same conversation from the support side.
3. A Judge agent reviews both assessments alongside the original chat and issues the final verdict.
The Customer and Support agents run independently and in parallel, without access to each other's assessments. The Judge sees both assessments and the original chat. This structured flow allows the Judge to synthesize multiple viewpoints while avoiding inter-agent bias, resulting in a more balanced evaluation that captures nuances human reviewers might miss [4].
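A simplified sketch of this debate flow, again with placeholder prompts and the `call_llm` stand-in, could look like the following:

```python
# Sketch of the debate-style flow: two perspective agents assessed
# independently, then a judge that sees both assessments plus the original
# chat. Prompt wording is illustrative.
def call_llm(prompt: str) -> str:
    ...  # placeholder for an actual LLM client call

def debate_evaluation(conversation: str, criterion: str) -> str:
    # Steps 1 and 2: independent assessments (run in parallel in practice).
    customer_view = call_llm(
        "Acting as the customer, assess whether this support conversation "
        f"satisfies the criterion '{criterion}'. Explain briefly.\n\n{conversation}"
    )
    support_view = call_llm(
        "Acting as the support agent, assess whether this conversation "
        f"satisfies the criterion '{criterion}'. Explain briefly.\n\n{conversation}"
    )
    # Step 3: the judge synthesizes both viewpoints and the raw chat.
    return call_llm(
        f"You are an impartial judge. Criterion: {criterion}\n\n"
        f"Customer perspective:\n{customer_view}\n\n"
        f"Support perspective:\n{support_view}\n\n"
        f"Original conversation:\n{conversation}\n\n"
        "Return a final pass/fail verdict with a one-sentence rationale."
    )
```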
Decoupling Evaluation from Structured Output
We structured our implementation to generate free-form text first, then convert it to structured JSON in a separate step. This design serves two purposes. First, it allows the model to focus on producing the highest-quality evaluation output without being constrained by JSON formatting, which can negatively affect performance due to restricted decoding [5]. Second, while many models support structured outputs, their ability to produce reliable and consistently formatted JSON varies. We use JSON as the final output format because our evaluation requires multiple structured fields - for example, a True/False rating for each criterion along with a brief explanation. Representing this information in JSON makes post-processing much easier and more reliable.

By decoupling the reasoning and formatting steps, we gain greater flexibility: we can use a strong reasoning model - such as o1-preview, which was our best-performing option at the time but lacked consistent JSON formatting capabilities - to generate the free-form explanation. Then, we can use a separate model or a simple rule-based parser to convert that output into structured JSON.
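As a rough sketch of this two-step flow, with `call_llm` standing in for whichever model handles each step and an illustrative JSON schema:

```python
# Sketch of the decoupled pipeline: a strong reasoning model writes a
# free-form evaluation, then a separate, formatting-only step converts it
# into the JSON record the pipeline needs.
import json

def call_llm(prompt: str) -> str:
    ...  # placeholder for an actual LLM client call

def evaluate_then_format(conversation: str, criterion: str) -> dict:
    # Step 1: unconstrained reasoning -- no JSON requirement imposed here.
    free_form = call_llm(
        f"Evaluate the conversation against '{criterion}'. Think step by step "
        f"and end with 'Verdict: True' or 'Verdict: False'.\n\n{conversation}"
    )
    # Step 2: formatting only -- a cheaper model (or a rule-based parser)
    # turns the free-form text into structured JSON.
    formatted = call_llm(
        "Convert the evaluation below into JSON with keys "
        '"criterion", "passed" (boolean), and "explanation". '
        "Return only the JSON.\n\n" + free_form
    )
    return json.loads(formatted)
```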
Example of Decoupled Evaluation and Formatting for a Single Criterion (Contextual Relevance)

Prompt Formatting
To minimize performance variability, we paid special attention to how we structured our prompts and adopted industry best practices - preparing them in Markdown format with clearly organized sections to delineate content and convey instructions [6]. This approach aligns with findings in existing literature showing that prompt formatting can significantly impact LLM output quality [7].
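For illustration, a Markdown-structured evaluation prompt in this spirit might look like the template below; the section names and wording are ours, not the production prompt.

```python
# Illustrative Markdown-structured evaluation prompt with clearly delimited
# sections, filled in with placeholder values.
EVAL_PROMPT_TEMPLATE = """# Role
You are a strict evaluator of customer-support chatbot conversations.

# Criterion
{criterion_name}: {criterion_definition}

# Conversation
{conversation}

# Instructions
1. Quote the parts of the conversation relevant to the criterion.
2. Reason briefly about whether the criterion is met.
3. End with `Verdict: True` or `Verdict: False`.
"""

prompt = EVAL_PROMPT_TEMPLATE.format(
    criterion_name="Contextual relevancy",
    criterion_definition="Responses correctly reflect the prior conversation.",
    conversation="[chat transcript goes here]",
)
```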
For our LLM-based evaluations to drive meaningful improvements, they must closely mirror human assessments. To achieve this alignment, we implemented an iterative validation process where human evaluators rated a carefully selected set of customer conversations using the same criteria as our LACE system. We then compared their ratings to those generated by LACE. When we identified misalignments, we used this feedback to refine our evaluation framework.
We repeated this human-LACE comparison and refinement cycle multiple times until we achieved strong alignment between human judgment and our automated evaluation system, ensuring our LLM-based assessments truly reflect real-world quality expectations. This process served to bootstrap the initial system and is reused as needed to validate updates and catch regressions.
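A minimal sketch of the comparison step, assuming human and LACE labels are keyed by session and criterion, might look like this:

```python
# Sketch of the alignment check: compare human labels with LACE labels on the
# same sampled conversations and report per-criterion agreement. The data
# layout -- dicts keyed by (session_id, criterion) -- is assumed for
# illustration; low-agreement criteria are the ones refined in the next cycle.
from collections import defaultdict

def per_criterion_agreement(
    human: dict[tuple[str, str], bool],  # (session_id, criterion) -> label
    lace: dict[tuple[str, str], bool],
) -> dict[str, float]:
    matches: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for key, human_label in human.items():
        if key in lace:
            _, criterion = key
            totals[criterion] += 1
            matches[criterion] += int(human_label == lace[key])
    return {c: matches[c] / totals[c] for c in totals}
```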
Through iterative refinement, we observed that evaluation criteria - spanning all five dimensions - naturally grouped into three categories based on complexity. Below, we highlight representative criteria from specific dimensions to illustrate the distinct challenges each group poses.
Simple Criteria rely on universal standards that don't require deep contextual understanding. For example, evaluating professionalism (a criterion under the Compliance dimension) through tone and politeness assessments proved straightforward to measure and improve. Using an agentic, debate-style approach, we achieved near-perfect accuracy on these criteria, capturing our requirements more precisely than alternative methods.
Context-Dependent Criteria present greater challenges, requiring the chatbot to interpret responses based on Instacart-specific behaviors and operational nuances. For example, assessing contextual relevancy (under answer correctness) means evaluating whether the chatbot correctly integrates conversation history while demonstrating understanding of Instacart's business model and processes.
These criteria demanded close attention to conversational details and interpretation of information that might contradict surface meaning. Consider a scenario where a user mentions, "The shopper said my card was declined. My payment method is [a digital wallet], and I have…" The chatbot must recognize that Instacart shoppers use company-authorized cards, not the customer's personal payment method - a crucial operational detail for contextually relevant responses. A failure to consider this, such as responding with a discussion about the specific digital wallet, would indicate a lack of contextual understanding.
To address these challenges, we embedded specialized knowledge into the chatbot's prompts through a static template. Evaluation results indicate the agentic debate-style approach is highly effective for capturing these nuances, achieving over 90% accuracy. Remaining errors typically stem from ambiguous scenarios or gaps in the embedded knowledge base, emphasizing the need for ongoing refinement.
Managing Instacart-specific knowledge poses challenges, as overloaded prompts can dilute effectiveness by reducing the signal-to-noise ratio. To address this, we are exploring dynamic prompt construction paired with real-time knowledge retrieval, similar to Retrieval-Augmented Generation (RAG). This approach ensures the chatbot retrieves relevant knowledge on demand, keeping prompts concise and focused while preserving strong evaluation performance.
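A toy sketch of this direction, using simple keyword overlap in place of a real retriever, might look like the following; the knowledge snippets shown are illustrative, and a production system would more likely use embedding-based retrieval.

```python
# Minimal sketch of the retrieval idea: instead of packing every operational
# fact into a static template, fetch only the snippets relevant to the chat.
KNOWLEDGE_BASE = [
    "Shoppers pay with a company-authorized card, not the customer's own payment method.",
    "Refunds for missing items are issued to the original payment method.",
    # ...more Instacart-specific operational facts...
]

def retrieve_relevant_facts(conversation: str, k: int = 3) -> list[str]:
    convo_words = set(conversation.lower().split())
    scored = [
        (len(convo_words & set(fact.lower().split())), fact)
        for fact in KNOWLEDGE_BASE
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [fact for overlap, fact in scored[:k] if overlap > 0]

def build_dynamic_prompt(conversation: str, criterion: str) -> str:
    facts = "\n".join(f"- {f}" for f in retrieve_relevant_facts(conversation))
    return (
        f"# Relevant Instacart knowledge\n{facts}\n\n"
        f"# Criterion\n{criterion}\n\n"
        f"# Conversation\n{conversation}\n\n"
        "End with `Verdict: True` or `Verdict: False`."
    )
```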
Subjective Criteria, such as answer conciseness (under chat efficiency), proved most challenging to evaluate due to varying human preferences. What feels concise to one person might seem overly brief to another. We also observed that LLMs often apply stricter standards for brevity than humans do, creating misalignment between model outputs and human expectations.
Instead of removing these subjective measures entirely, we retain them as a high-level directional check - to flag any large or unexpected changes. Rather than invest significant effort in refining ambiguous evaluation criteria (a low-ROI path), we focus on directly improving the chatbot's behavior through prompt refinement and model fine-tuning. This practical approach consistently delivers better outcomes with equal or less effort.
With robust evaluation criteria and a reliable evaluation engine validated against human judgment, we're now able to operationalize quality monitoring at scale. LACE powers our continuous evaluation pipeline, enabling us to analyze chat logs and drive targeted improvements across our AI-powered customer support system.
Using stratified sampling based on topic distribution, LACE feeds into dashboards that let us:
- track overall quality scores and how they trend over time
- break down criterion-level failures by conversation topic
- drill into individual failed chats for deeper investigation
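A minimal sketch of the topic-stratified sampling step, under the assumption that each evaluated chat carries a `topic` label, might look like this:

```python
# Sketch of stratified sampling by topic: sample chats in proportion to each
# topic's share of overall volume so the dashboard metrics reflect real
# traffic. The `topic` field and sample size are illustrative assumptions.
import random

def stratified_sample(chats: list[dict], total: int, seed: int = 0) -> list[dict]:
    rng = random.Random(seed)
    by_topic: dict[str, list[dict]] = {}
    for chat in chats:
        by_topic.setdefault(chat["topic"], []).append(chat)
    sample: list[dict] = []
    for group in by_topic.values():
        n = max(1, round(total * len(group) / len(chats)))
        sample.extend(rng.sample(group, min(n, len(group))))
    return sample
```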
To illustrate LACE's practical impact, consider this chat session that our system flagged for issues with answer correctness and efficiency:
[ASSISTANT]: Hello! I'm Instacart's virtual assistant. What can I help you with today?
[USER]: I have another question about this order
[ASSISTANT]: Could you please tell me more about your question or issue with the order? This will help me assist you better.
[USER]: I'm attempting to contact the shopper and I'm not getting a response.
[ASSISTANT]: To chat with your shopper while your order is in progress on the iOS app, you can follow these steps:
1. Open the Instacart app.
2. Go to **Orders**.
3. Tap the current order.
4. Tap the chat bubble icon in the upper right corner to send a message.
If you have in-app notifications turned on, you'll receive an alert when your shopper replies. Please ensure that notifications are enabled to not miss any updates.
[USER]: Representative
[ASSISTANT]: Our specialists are available to help, but there may be a wait. Would you like to see if I can help first, or get in touch with a specialist?
[USER]: Continue where we left off
[ASSISTANT]: Our specialists are available to help, but there may be a wait. Would you like to see if I can help first, or get in touch with a specialist?
LACE automatically identified two key issues in this interaction, based on evaluation criteria across our core dimensions:
- Answer Correctness: when the user asked to "Continue where we left off," the assistant failed to pick up the conversational context and simply repeated its previous message.
- Chat Efficiency: the repeated, boilerplate response added an unnecessary turn without moving the customer any closer to a resolution.
By surfacing these dimension-level failures, LACE enables targeted improvements that make AI interactions more context-aware, efficient, and helpful - qualities that directly enhance customer experience.
Our dashboard provides multi-layered insights into chatbot performance, from top-level score trends over time to detailed breakdowns of criterion-level failures across conversation topics. A dedicated "Failed Chat" section highlights interactions that didn't meet our standards, categorizing them by failure reason and linking them to corresponding chat segments for deeper investigation.
In the example above, failures under the Answer Correctness and Chat Efficiency dimensions revealed a gap in how the model handled conversational continuity. By updating the assistant to better recognize and respond to continued user prompts without repetition, we reduced inefficient interactions and saw clearer, more helpful conversations as a result.
Building LACE has transformed how we approach chatbot development. By creating an automated evaluation framework that closely mirrors human judgment, we've established a feedback loop that enables rapid iteration and continuous improvement of our customer support systems.
The lessons we've learned - about balancing precision with pragmatism, leveraging agentic LLM evaluation techniques, and focusing efforts where they yield the greatest impact - extend beyond chatbot development. They illuminate broader principles about effective AI evaluation and deployment in real-world settings where customer experience remains paramount.
As we continue refining LACE and our chatbot systems, we're excited by the possibilities for creating even more responsive, helpful, and natural customer interactions. Through methodical evaluation and thoughtful application of advanced LLM capabilities, we're building support experiences that combine the efficiency of automation with the understanding and context-awareness customers expect.
References

1. Large Language Models are Better Reasoners with Self-Verification
2. Can LLMs Critique and Iterate on Their Own Outputs?
3. Self-Reflection Outcome is Sensitive to Prompt Construction
4. Improving Factuality and Reasoning in Language Models through Multiagent Debate
5. Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models
6. CO-STAR Framework
7. Does Prompt Formatting Have Any Impact on LLM Performance?
Instacart is the leading grocery technology company in North America, partnering with more than 1,800 national, regional, and local retail banners to deliver from more than 100,000 stores across more than 15,000 cities in North America.