
Gradient-Guided Adversarial Sample Construction for Robustness Evaluation in Language Model Inference

Abstract

This study addresses the challenge of adversarial robustness that large language models face in natural language inference tasks and proposes a gradient-guided adversarial sample generation method. The method introduces an inference sensitivity scoring mechanism that uses the model's internal gradient information to precisely identify the input regions most sensitive to the reasoning outcome, enabling the selection of effective perturbation positions. In parallel, a semantics-preserving perturbation strategy is designed to achieve the attack objective while maintaining the semantic consistency and contextual coherence of the original text. Concretely, the method extracts embedding representations of the input text, constructs a perturbation priority ranking by combining gradient magnitude with semantic attention weights, and generates high-quality adversarial samples under the dual constraints of semantic similarity and contextual consistency. Across varied input conditions, including different perturbation-position strategies, text lengths, and multilingual scenarios, the method demonstrates strong attack efficiency, semantic preservation, and generalization stability. Experimental results show that the proposed approach significantly improves attack success rates while maintaining a low perturbation rate, and that the generated texts remain natural and readable. These findings validate the effectiveness and applicability of the proposed mechanisms for text-level adversarial sample construction.
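The scoring-and-ranking step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the mixing coefficient `alpha`, the normalization of the two signals, and the cosine-similarity threshold standing in for the dual semantic constraints are all assumptions, since the abstract does not specify how gradient magnitude and attention weights are combined.

```python
import numpy as np

def sensitivity_scores(grad_norms: np.ndarray,
                       attn_weights: np.ndarray,
                       alpha: float = 0.5) -> np.ndarray:
    """Combine per-token gradient magnitude and attention weight.

    `alpha` is a hypothetical mixing coefficient; both signals are
    normalized to sum to 1 before mixing (an assumption).
    """
    g = grad_norms / (grad_norms.sum() + 1e-9)
    a = attn_weights / (attn_weights.sum() + 1e-9)
    return alpha * g + (1.0 - alpha) * a

def perturbation_priority(grad_norms, attn_weights, alpha: float = 0.5):
    """Return token indices ordered from most to least sensitive."""
    scores = sensitivity_scores(np.asarray(grad_norms, dtype=float),
                                np.asarray(attn_weights, dtype=float),
                                alpha)
    return np.argsort(scores)[::-1].tolist()

def passes_semantic_constraints(orig_emb, adv_emb,
                                sim_threshold: float = 0.9) -> bool:
    """Accept a candidate only if sentence-embedding cosine similarity
    stays above a threshold -- a simple stand-in for the paper's dual
    semantic-similarity / contextual-consistency constraints."""
    orig_emb = np.asarray(orig_emb, dtype=float)
    adv_emb = np.asarray(adv_emb, dtype=float)
    cos = orig_emb @ adv_emb / (
        np.linalg.norm(orig_emb) * np.linalg.norm(adv_emb) + 1e-9)
    return bool(cos >= sim_threshold)

# Toy usage: token 1 carries both the largest gradient norm and the
# largest attention mass, so it is ranked first for perturbation.
order = perturbation_priority([0.1, 0.9, 0.2], [0.2, 0.6, 0.2])
print(order)  # token 1 ranked first
```

In a real attack loop, `grad_norms` would come from the norm of the loss gradient with respect to each token's embedding, and `attn_weights` from the model's attention maps; candidate substitutions at the top-ranked positions would then be kept only if they pass the semantic-constraint check.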
