Evaluation of Quantitative and Qualitative Metrics for Assessing Hallucination Phenomena in Large Language Models
Abstract
Hallucination phenomena in large language models have drawn considerable attention because of their impact on reliability, trustworthiness, and interpretability. Transformer-based architectures have demonstrated remarkable capabilities in language generation, question answering, and dialogue systems, yet the appearance of fabricated details during inference raises concerns about the internal mechanisms that steer these models toward erroneous responses. Uncertainties inherent in model training and data representation create conditions in which hallucinated content emerges, often masked by a fluency and coherence that belie its inaccuracy. Researchers have proposed numerous strategies to identify and evaluate such outputs, giving rise to a broad array of quantitative and qualitative metrics. Quantitative measures provide numerical or probabilistic characterizations, typically reflecting how far generated token distributions deviate from reference truths. Qualitative assessments emphasize the interpretive dimension, examining user perceptions, contextual expectations, and semantic coherence. This paper systematically evaluates these methodologies by examining the principal metric families and highlighting the conditions under which each approach offers robust insight into generative behavior. The resulting synthesis clarifies methodological distinctions, identifies synergies among evaluation frameworks, and points to promising analytical pathways for more accurate interpretation of model outputs. An integrated perspective on hallucination assessment can guide the principled development and deployment of reliable language models.