Instruction Tuning for Multi-Domain Dialogue Generation in LLMs

Abstract

This paper presents a systematic study of instruction tuning for large language models (LLMs) applied to multi-domain dialogue generation. While instruction tuning enhances zero-shot generalization, the behavior of instruction-tuned models across diverse application domains remains underexplored. We curate a multi-domain dataset covering healthcare, finance, legal consulting, travel planning, and education. Using this dataset, we fine-tune and evaluate three open-source LLMs (LLaMA 2-13B, Falcon-7B, and Mistral-7B) on instruction-based dialogue tasks. To assess semantic alignment between user intent and model response, we introduce the Task-Semantic Alignment Score (TSAS), a novel embedding-based evaluation metric. Experimental results show that Mistral-7B achieves the best balance of accuracy, coherence, and safety, outperforming the other models on BLEU, ROUGE, MAUVE, and TSAS. We further analyze failure modes such as hallucination and instruction misinterpretation, and show that domain-aware tuning and alignment-sensitive metrics are essential for reliable deployment of LLMs in real-world, multi-domain settings.
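The abstract describes TSAS only as an embedding-based metric for scoring how well a response matches the user's intent; it does not give the formula. The following is a minimal sketch under the assumption that TSAS reduces to cosine similarity between sentence embeddings of the instruction and the response; the encoder choice and the function name are illustrative assumptions, not the authors' definition.

```python
# Hypothetical sketch of an embedding-based alignment score in the spirit of
# TSAS. The paper's abstract does not specify the metric's formula, so the
# cosine-similarity formulation and encoder choice here are assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

# Any general-purpose sentence encoder could be substituted; this model
# choice is an assumption for illustration only.
_encoder = SentenceTransformer("all-MiniLM-L6-v2")

def task_semantic_alignment(instruction: str, response: str) -> float:
    """Score semantic alignment between a user instruction and a model response.

    Returns cosine similarity in [-1, 1]; higher values mean the response
    stays closer to the intent expressed in the instruction.
    """
    embeddings = _encoder.encode([instruction, response])
    a, b = embeddings[0], embeddings[1]
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Example usage:
# score = task_semantic_alignment(
#     "Summarize the side effects of ibuprofen for a patient leaflet.",
#     "Common side effects include stomach upset, heartburn, and dizziness.",
# )
```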