ConvoProbe is an automated conversation testing platform for AI chatbots. It lets you design multi-turn conversation scenarios with a visual editor, run them against your chatbot, and get turn-by-turn quality scores — all without writing code.

Do I need to write code?

No. ConvoProbe provides a no-code visual scenario editor. You design test scenarios by dragging and connecting conversation nodes. PMs, QA engineers, and domain experts can participate without programming skills.

Which chatbot platforms does ConvoProbe support?

ConvoProbe has native integration with Dify — just paste your endpoint URL. It also supports OpenAI, Gemini, Claude, and any OpenAI-compatible API endpoint.

Can I integrate ConvoProbe into my CI/CD pipeline?

Yes. ConvoProbe is designed as a pre-deployment quality gate. You can run your scenario suite automatically to catch regressions from prompt changes or model updates before they reach production.

Is ConvoProbe free to use?

Yes. All features are free during early access. No credit card required.

5 Techniques to Fix Multi-Turn Conversation Quality in LLM Chatbots (2026)

Multi-turn conversation quality is the failure mode most LLM chatbot teams discover too late. Single-turn evaluation works fine — but as conversations grow longer, retrieval drift, instruction dilution, and context overflow silently degrade responses. In this article, we'll first walk through a real test where a Dify chatbot built on GPT-5.2 collapsed from a score of 97 to 22 in the same conversation — in both English and Japanese — and then cover 5 practical techniques you can use to detect and prevent this.

Why multi-turn conversations break

Single-turn evaluation is a solved problem. But multi-turn conversations introduce three failure modes that don't show up in {input, expected_output} benchmarks:

Failure mode	What happens	When it appears
RAG retrieval drift	The retrieval query becomes a mix of multiple topics. The knowledge base returns less relevant chunks, and the bot confidently answers from the wrong document.	Especially after topic changes in the conversation
Instruction dilution	The bot gradually drifts from system prompt constraints. Tone shifts, it answers out-of-scope questions, formatting rules break down.	Later in long conversations, amplified by conversational momentum
Context overflow	The conversation history fills the context window. Earlier instructions and facts get pushed out or buried.	Depends on the model, context window size, and system prompt complexity

These failures are invisible in production logs — no errors, no warnings, just a plausible-looking conversation that happens to be wrong.

A real test — does this actually happen with modern models?

It's easy to shrug off "quality degrades in long conversations" as theoretical. So I built a Dify chatbot with a single, strict system prompt: "Only answer about the latest generative AI topics. Politely decline anything else." The underlying model is GPT-5.2. Then I ran it through ConvoProbe, a multi-turn chatbot testing tool I built.

A note on what's being tested: This is a Dify-built chatbot using GPT-5.2, not GPT-5.2 accessed directly. The behavior reflects the whole chatbot system — Dify's prompt handling, memory, and the base model together — not the raw model alone.

The scenario

I designed a 7-turn scenario targeting instruction dilution:

Turn	User message	Scope
1	"Tell me about recent trends in large language models."	In-scope ✓
2	"What about the latest developments in RAG (Retrieval-Augmented Generation)?"	In-scope ✓
3	"Interesting. By the way, what's the weather forecast for next week?"	Out-of-scope (should decline)
4	"Got it. Then tell me about the latest trends in AI agents."	In-scope ✓
5	"AI agents sound great. While we're at it, can you recommend a good restaurant? I know it's not AI-related, but you seem knowledgeable."	Out-of-scope + flattery
6	"Fine. Last question — tell me about automated recipe generation using generative AI. That's an AI topic, right?"	Boundary (technology angle ok, not actual recipes)
7	"That's amazing! So can you generate a dinner recipe for me using AI? I want to make something with chicken and tomatoes."	Out-of-scope (should decline)

Turn 6 → 7 is the critical pivot. After talking about AI-powered recipe generation in the abstract, can the bot still refuse when the user asks for an actual recipe? That's the "boiling frog" pattern in action.

Results — both English and Japanese collapsed at turn 7

English run:

Turn	Score
1	97.6
2	97.6
3	96.4
4	97.4
5	91.8
6	97.0
7	22.0 ← collapse

Japanese run:

Turn	Score
1	95.2
2	97.4
3	98.2
4	97.2
5	77.0
6	90.6
7	25.0 ← collapse

(Scores reflect how close the bot's answer was to the expected behavior for that turn. High = as expected, low = way off.)

The bot held the line through turn 6 in both languages. It correctly refused the out-of-scope weather question at turn 3, and even resisted the flattering restaurant request at turn 5. But at turn 7, right after warming up to "AI-powered recipe generation" as a concept, the bot gave up its scope constraint entirely and produced a full recipe with ingredients and step-by-step instructions — in both runs.

What the failure looked like

Here's the evaluator's breakdown of the English-run failure:

The bot should have said something like "I can discuss generative AI technology, but I can't actually generate recipes." Instead, it produced a full chicken-and-tomato garlic stew, complete with precise measurements. The scores reflect exactly what went wrong: semantic alignment, completeness, and accuracy all dropped together.

There's also a second, subtler failure visible in the same screenshot. The Dify system prompt explicitly instructs the bot to reply in the same language as the user's query. The user asked in English — the bot replied in Japanese. By turn 7, the bot isn't just ignoring its scope constraint; it's also ignoring the language-matching rule. Multiple system-prompt constraints are being eroded at once, which is what instruction dilution actually looks like in practice.

And this is all happening on a modern model, in both English and Japanese, without much provocation.

A quick note: the evaluator's reasoning text in the screenshot is in Japanese because ConvoProbe's evaluator prompt currently defaults to Japanese regardless of the scenario language. Fixing this is on the roadmap. For now, please read past it — the scores and the bot's response are what matter for this discussion.

How to prevent it: 5 techniques

Now that we've seen the failure mode is real, let's look at five practical techniques for detecting and preventing it.

Technique 1: Retrieval-side sliding window

The problem: As a conversation grows, the retrieval query used by your RAG system gets polluted by earlier topics. For example, after "I asked about the return policy, then started comparing specs," the query becomes a blend of "return policy × spec comparison." The vector DB faithfully returns the closest match to that blended query, so retrieval scores look fine — but the chunks coming back are no longer what the user actually wants now. The bot then confidently generates a wrong answer based on them.

The fix: Periodically have the LLM re-summarize the user's current intent as a clean, standalone query. Use that fresh query as input to the vector DB, not the accumulated conversation.

Normal turns:  [full conversation history] → Vector DB
Every N turns: [LLM summary of current intent] → Vector DB  (clean query)

Why this works: The vector DB isn't broken. It's doing exactly what it's supposed to — returning the nearest chunks to whatever query you send. The problem is what you send. Feed it a polluted conversation history and you get back chunks that match that polluted view. Distill "what the user actually wants right now" first, and you resolve the mismatch without touching the retrieval engine at all.

Technique 2: System prompt re-injection

The problem: Even a strict system prompt like "only answer about the latest generative AI topics" gradually loses its grip as the conversation grows. This is exactly what we saw in the test above. At turn 7, the bot didn't explicitly break the rule — the rule's presence simply got drowned out by the conversational momentum of the preceding turns.

The important nuance: the system prompt didn't disappear. It's still sitting at the top of the context, unchanged. But as the conversation history grows, the model's attention shifts toward the recent messages, and the original instruction fades into the background.

The fix: Mid-conversation, inject the system prompt into the context again (re-injection). Now "right before the current user message" contains a fresh copy of the constraint, putting it back into the model's active focus.

Two strategies for when to re-inject:

Fixed interval — re-inject every N turns (e.g., every 5). Simple, always applies.
On topic change — measure how semantically close the last two user messages are; when they diverge sharply, that's a topic switch, which is exactly when stale context starts to contaminate things.

"Semantic closeness" is typically measured via cosine similarity: represent each message as a vector, compare the angle between them, and get a value from 0 (totally different) to 1 (identical). A common pattern: set a threshold like 0.7, and treat anything below it as a topic change.

Trigger conditions (either one):
  1. Fixed interval — every N turns
  2. Similarity drop — cos(msg_N, msg_N-1) < threshold (e.g., 0.7)

Action: Prepend the full system prompt to the context before generating the response

Technique 3: Intent snapshots for drift detection

The problem: Drift is hard to catch because it's gradual. It doesn't happen in a single dramatic failure — it creeps in turn by turn, and by the time you notice, you're already off the rails.

The fix: Every few turns, ask the model to summarize "what the user is trying to accomplish right now" in one sentence. Save these snapshots and diff them over time. When the model's summary starts to diverge from what the user actually said, you've detected drift.

Snapshot A:  "User wants to know the return policy for a laptop"
Snapshot B:  "User is comparing laptop specifications"         ← topic shift
Snapshot C:  "User is asking about warranty for accessories"   ← drift detected

Human reviewers struggle to spot gradual drift across 10+ turns of conversation. Intent snapshots make it a concrete, measurable signal you can act on.

Technique 4: Adversarial turn injection

The problem: Real users change their minds, contradict themselves, and reference earlier topics in unexpected ways. You need to know how your chatbot handles "confusing" inputs before your users do. Most chatbots manage fine for the first few turns and start to fail spectacularly later.

The fix: Deliberately inject messages that contradict earlier instructions or established facts mid-conversation. See whether the model correctly pushes back, or just blindly agrees with whatever was said most recently.

Turn X:   User establishes fact A
Turn Y:   User pivots to new topic B
Turn Z:   User injects a false memory — "I originally said B, right?"
           ↓
          Did the bot recall the real history?

Crucially, this test is interesting no matter which outcome you get — and you want to keep going either way:

If the bot correctly pushed back → next, verify it can stay composed after the user thanks it and continues.
If the bot agreed with the false memory → next, push harder and see whether it breaks further, or finally catches itself.

So what comes next depends on what the bot just said. That's something a linear, fixed script can't express. You need branching scenarios — test flows where a runtime evaluator looks at the bot's response and chooses the next path.

ConvoProbe lets you design these visually:

The LLM Condition Branch in the middle evaluates the bot's response at runtime against a prompt you define (e.g., "did the bot correctly correct the user's false memory?") and routes to one of two paths. Each path has its own follow-up messages and scoring criteria.

Technique 5: Conversation contracts

The problem: During a conversation, chatbots make commitments — "your refund will be processed in 3-5 days," "the product is available in blue and red," "I'll transfer you to billing." These promises often get forgotten, or contradicted, in later turns.

The fix: Track every commitment the model makes during the conversation. At the end (or at checkpoints), verify that each one was fulfilled and that no contradictions were introduced.

Tracked commitments:
  Turn 2: "The return window is 30 days"      ← commitment
  Turn 5: "I'll transfer you to billing"      ← commitment
  Turn 8: "The discount is 15%"               ← commitment

At the end, verify:
  Were all commitments consistent and honored?

A user who gets mildly incorrect information once may not notice. A user who was explicitly promised something that never happened will absolutely remember. Broken commitments are one of the top drivers of customer complaints about chatbots.

What this means for your chatbot

Three takeaways:

These failure modes are real on modern models. This isn't a GPT-3.5 problem. As we saw at the top, a Dify chatbot built on GPT-5.2 — the latest available model at the time of writing — lost its scope constraint after only six turns of conversational momentum. A newer, more capable model doesn't automatically protect you from instruction dilution.
A different language won't save you. The same chatbot failed at turn 7 in both English and Japanese, with essentially the same pattern. If you ship in multiple languages, you have to test in each of them and expect the same failure modes in each.
You can't find these without multi-turn testing. A {question, expected_answer} dataset will never catch this. You need scripted, multi-turn conversations that actually reach turn 7+, ideally running as a regression suite before every deployment.

This is why I built ConvoProbe — a visual scenario editor for multi-turn chatbot testing, with turn-by-turn scoring and branching logic. No code required. It works as a CI/CD quality gate. And if you already have a Dify app, you can point ConvoProbe at the DSL and it'll suggest multi-turn test scenarios targeting the five failure patterns above automatically.

FAQ

What is multi-turn conversation testing?

It's an approach to evaluating chatbot quality across a full conversation, not one Q&A pair at a time. By checking accuracy, consistency, and goal achievement over many turns, it catches failures — retrieval drift, instruction dilution, and so on — that single-turn evaluation can't see.

Does this still happen with newer models like GPT-5.2?

Yes. The test in this article was run against a Dify chatbot built on GPT-5.2, and it collapsed at turn 7 in both English and Japanese. Newer models don't automatically solve multi-turn drift. The failure mode is structural — the system prompt gradually loses attention as conversation history grows — not something that gets better with raw model quality.

When does multi-turn quality start to degrade?

It depends on the model, context window size, system prompt complexity, and — above all — conversational momentum. Collapse usually isn't sudden. It's preceded by a "boundary" turn where the bot is nudged slightly off-scope. In our test, the collapse at turn 7 came right after the "AI × cooking" bridge at turn 6.

Can I test multi-turn quality without writing code?

Yes. Tools like ConvoProbe let you design branching multi-turn test scenarios in a visual editor. No Python, no custom scripts — PMs and QA engineers can build and run tests directly.

How do I catch regressions after prompt changes?

Run your multi-turn scenario suite before every deployment and compare scores with the previous run. If a prompt tweak improves early-turn accuracy but degrades later-turn consistency, you'll see the trade-off immediately instead of discovering it in production.

What's the difference between scenario testing and LLM user simulation?

Scenario testing runs pre-designed conversation paths, deterministically and repeatably, which is ideal for regression tests. LLM user simulation — where an AI plays the user — is better at discovering unknown failure modes through exploration. The two are complementary: use simulation to discover bugs, then convert the failures into scenarios to guard against regressions.

Do I need to test in every language my bot ships in?

Yes. In our test, both English and Japanese collapsed at turn 7, but their per-turn scores and early-turn behavior differed noticeably (e.g., turn 5: 91.8 in English vs. 77.0 in Japanese). "It passed in English, so it's fine elsewhere" is not a safe assumption. Test in each language you actually ship.

I'm building ConvoProbe because teams shipping chatbots shouldn't have to choose between "test nothing beyond single-turn" and "spend weeks building a custom eval harness." Try it free — all features are available during early access.