How I Measure My Dify Chatbot Quality with Scenario Testing

7 min read
dify · chatbot-testing · llm · quality-assurance

What I did

I designed multi-turn conversation scenarios for a Dify chatbot, ran them automatically via the API, and measured response quality quantitatively.

If you've built chatbots with Dify, you've probably noticed this: single-turn Q&A works fine, but once users get into 3-4 turn conversations, quality drops noticeably. So I built automated tests — multi-turn scenarios with expected responses, fired against Dify's API — to catch these problems before they reach production.
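The core of the idea is small: Dify's `chat-messages` endpoint accepts a `conversation_id`, so replaying a scripted multi-turn scenario is just a loop that threads that ID through successive requests. Below is a minimal sketch of such a runner (the helper names and the injectable `send` parameter are mine, added so the loop can be exercised without a live Dify instance):

```python
import json
from urllib import request

DIFY_URL = "https://api.dify.ai/v1/chat-messages"  # self-hosted installs use their own host

def post_chat(api_key: str, query: str, user: str, conversation_id: str = "") -> dict:
    """Send one turn to Dify's chat-messages endpoint and return the parsed JSON."""
    payload = {
        "inputs": {},
        "query": query,
        "response_mode": "blocking",
        "conversation_id": conversation_id,  # empty string starts a new conversation
        "user": user,
    }
    req = request.Request(
        DIFY_URL,
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

def run_scenario(api_key: str, turns: list[str], user: str = "test-runner", send=post_chat) -> list[str]:
    """Replay scripted user turns in one conversation; return the bot's answers in order."""
    conversation_id, answers = "", []
    for query in turns:
        data = send(api_key, query, user, conversation_id)
        conversation_id = data["conversation_id"]  # thread later turns through the same conversation
        answers.append(data["answer"])
    return answers
```

Passing a different `send` callable makes the loop unit-testable; in real runs the default `post_chat` does the HTTP call.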


Background: existing eval tools and the remaining gap

Dify has official integrations with several observability and evaluation tools. These tools aren't just for tracing — they also have evaluation capabilities.

| Tool | Evaluation features |
| --- | --- |
| LangSmith | Datasets + Evaluators, LLM-as-Judge, human feedback |
| Langfuse | Datasets, LLM-as-Judge, human feedback, custom scores |
| Opik | LLM-as-Judge, 8 conversation-specific metrics, dataset evaluation |
| Arize AX | LLM-as-Judge, Session Evals, human annotation |
| Phoenix | LLM-as-Judge, Evaluator Hub |

These tools can, for example, run an application against a dataset of {input, expected_output} pairs and compare scores before and after changes. However, none of them seem to support designing and executing multi-turn conversation scenarios to check quality end-to-end.


What I wanted

Here's what I was looking for:

  • Evaluate multi-turn conversations: Test entire conversation flows (not just single Q&A), including context retention and information consistency across turns
  • Design branching based on bot responses: Create scenarios where the user's next question depends on what the bot actually said in the previous turn
  • Score each turn with LLM-as-Judge: After running a scenario, automatically evaluate each turn's response on criteria like semantic accuracy and context retention
  • Run tests repeatedly and automatically: Define scenarios once, run them as many times as needed, so quality issues that single manual tests miss get caught through continuous testing
  • Auto-generate scenarios from Dify DSL: Writing scenarios shouldn't be the bottleneck — just paste a Dify app's flow definition (YAML) and have test scenarios generated from its structure
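To make the wish list concrete, here is one hypothetical shape such a scenario could take (this is an illustrative structure, not ConvoProbe's actual schema): scripted user turns, an expected-response description per turn, and optional branches keyed on what the bot actually says.

```python
# Illustrative scenario structure (not ConvoProbe's real file format).
scenario = {
    "name": "refund-flow",
    "turns": [
        {"user": "I want a refund for order #1234.",
         "expect": "Asks for the order date or confirms the order exists."},
        {"user": "I bought it two weeks ago.",
         "expect": "States whether the order is within the refund window.",
         "branches": {
             "bot says refund is possible": {"user": "Great, how do I proceed?"},
             "bot says refund window passed": {"user": "Can I get store credit instead?"},
         }},
    ],
}

def linear_turns(s: dict) -> list[str]:
    """Flatten the scripted (non-branching) user messages for a quick smoke run."""
    return [t["user"] for t in s["turns"]]
```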

I originally built a tool to do all of this for my own use. After using it heavily, it turned out to be more broadly useful than expected, so I published it as ConvoProbe.

A note on the Dify community's approach to quality: I searched the Dify forum and GitHub Discussions to see how others handle chatbot quality. The results were surprising:

| Search | Count |
| --- | --- |
| Forum posts about chatbot quality evaluation | 0 |
| GitHub Discussions about testing/validation | 3 |
| GitHub Issues about regressions after updates | 211 |
| GitHub Issues about observability/tracing | 524 |

There's plenty of discussion about observability and regressions, but almost none about systematically evaluating quality.

What ConvoProbe does

1. Evaluate multi-turn conversations

ConvoProbe evaluates entire multi-turn conversations, not just individual Q&A pairs.

Single-turn tests can verify whether individual answers are correct. But in real chatbot usage, problems emerge at turn 3 or 4 — the bot loses context, mixes up information, or contradicts what it said earlier. ConvoProbe lets you verify things like "does the bot at turn 4 correctly reference what it said at turn 1?"

2. Design conversation scenarios visually

You build conversation structures in a GUI — much like designing flows in Dify itself. For each turn, you set the user's message and the expected response.

3. Design dynamic branching based on bot responses

Real conversations aren't linear. What the user asks next depends on what the bot just said.

ConvoProbe uses an LLM to evaluate the bot's response at runtime and dynamically determines which branch to follow. Static dataset evaluation can't express this kind of "output-dependent branching."
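A minimal sketch of what output-dependent branching looks like, assuming the judge is any callable that maps a question to one of the branch labels (in practice that would be an LLM call; here it is injected so the selection logic stands alone):

```python
def pick_branch(bot_answer: str, branches: dict[str, dict], judge) -> dict:
    """Ask a judge which branch condition the bot's answer satisfies.

    `judge(question) -> str` must return one of the branch labels verbatim.
    """
    labels = list(branches)
    question = (
        "Which of these descriptions best matches the assistant's reply?\n"
        f"Reply: {bot_answer}\nOptions: {labels}\nAnswer with one option verbatim."
    )
    label = judge(question)
    if label not in branches:
        raise ValueError(f"judge returned unknown branch: {label!r}")
    return branches[label]
```

The chosen branch supplies the next scripted user turn, so the scenario adapts to what the bot actually said.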

4. Auto-generate scenarios from Dify DSL

Paste your Dify app's DSL (the YAML flow definition) into ConvoProbe, and it analyzes the flow structure to auto-generate test scenarios.

No need to design scenarios from scratch. For existing Dify apps, you can start testing immediately. Generated scenarios can be run as-is or edited in the GUI.

5. Score each turn with LLM-as-Judge

When a scenario runs, each turn's response is automatically scored on the following criteria:

| Criterion | What it measures |
| --- | --- |
| Semantic alignment | Does the actual response convey the expected meaning and information? |
| Completeness | Does the actual response cover all key points from the expected answer? |
| Accuracy | Is the information in the actual response factually correct? |
| Relevance | Is the actual response directly relevant to the question? |

In addition, the entire conversation is evaluated holistically and displayed as an overall score.
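One straightforward way to roll per-criterion judge scores up into per-turn and conversation-level numbers is a plain average (the 0.0-1.0 scale and the averaging scheme here are illustrative assumptions, not necessarily how ConvoProbe weights them):

```python
CRITERIA = ("semantic_alignment", "completeness", "accuracy", "relevance")

def turn_score(scores: dict[str, float]) -> float:
    """Average the four per-criterion judge scores (assumed 0.0-1.0) for one turn."""
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    return sum(scores[c] for c in CRITERIA) / len(CRITERIA)

def conversation_score(turn_scores: list[float]) -> float:
    """Roll turn scores up into one overall number: here, the mean across turns."""
    return sum(turn_scores) / len(turn_scores)
```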


What scenario testing reveals

Running multi-turn scenario tests surfaces quality problems that are otherwise hard to catch:

Quality degrades over multiple turns

A chatbot that looks fine on single-turn tests can fall apart after 3-4 turns. RAG-based chatbots are especially prone to this — as conversations progress, the bot's ability to determine which retrieved information is relevant starts to drift.

If you only test single turns, you'll miss this entirely.

Context loss is silent

When a bot "forgets" earlier conversation history, there's no crash or error. It just generates a plausible-sounding but incorrect response.

To verify whether "turn 4 correctly references turn 1," you need to intentionally design and execute that conversation flow as a test scenario.
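The crudest form of that check is a substring match against a fact established early in the conversation (a real judge would use an LLM so paraphrases count; this sketch only catches literal mentions):

```python
def references_earlier_fact(later_answer: str, fact: str) -> bool:
    """Crude context-retention check: does a later answer still mention a fact
    established in an earlier turn? Case-insensitive literal match only."""
    return fact.lower() in later_answer.lower()
```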

Workflow updates cause regressions

Updating a Dify workflow — changing a system prompt, adjusting RAG retrieval parameters — can silently break conversation patterns that were working before.

Running the same scenarios before and after a change lets you catch degradation before it reaches production.
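The comparison itself is simple once each run produces a score per scenario. A sketch, assuming scores in 0.0-1.0 and an arbitrary tolerance for judge noise:

```python
def find_regressions(before: dict[str, float], after: dict[str, float],
                     tolerance: float = 0.05) -> list[str]:
    """Return scenarios whose score dropped by more than `tolerance` between
    two runs. The tolerance absorbs run-to-run judge variance (value illustrative)."""
    return [
        name for name, old in before.items()
        if name in after and old - after[name] > tolerance
    ]
```

Wiring this into CI turns "did my prompt change break anything?" into a failing check instead of a user report.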


How ConvoProbe fits with existing tools

ConvoProbe isn't a replacement for Langfuse or LangSmith — it's complementary.

| Phase | Tool | Role |
| --- | --- | --- |
| During development | ConvoProbe | Run scenario tests to verify it's safe to ship |
| Before release | ConvoProbe | Compare scenario scores before/after changes (regression testing) |
| In production | Langfuse / LangSmith / Opik | Tracing, cost monitoring, post-hoc evaluation of real conversations |
| When issues surface | ConvoProbe | Create a scenario that reproduces the problem, fix, re-test |

Langfuse helps you discover problems. ConvoProbe helps you prevent them from recurring.


Try it

To get started, ConvoProbe needs only a Dify API key and an LLM API key for the evaluation step.

https://convoprobe.vercel.app

Ship chatbots that actually work. Test multi-turn conversations before your users do.
