Joint Cross-Modal Representation Learning of ECG Waveforms and Clinical Reports for Diagnostic Classification
Abstract
Electrocardiogram (ECG) diagnostic classification is of high practical value in clinical screening and triage. However, waveform-only modeling often fails to exploit the clinical semantic information contained in physician reports and tends to produce unstable discrimination when report wording varies or waveforms are noisy. This paper proposes a multimodal diagnostic classification framework that jointly takes 12-lead ECG waveforms and physician report text as input. A dual-path encoder extracts temporal-morphological features from the waveform and semantic representations from the report, and projects both into a shared semantic space for alignment. To integrate information adaptively, a gated fusion mechanism dynamically allocates the contribution of each modality according to the joint state of the two representations, producing a shared representation for classification. In parallel, cross-modal consistency constraints cross-verify the waveform and the text at the diagnostic-semantic level, reducing the bias that can arise from fusing heterogeneous information. Finally, a lightweight classification head outputs the probability distribution over diagnostic categories, and the model is optimized end to end. Comparative experiments show that the proposed method outperforms representative baselines on multiple evaluation metrics, exhibiting stronger discriminative ability and more stable overall performance. These results validate the effectiveness of joint representation learning, gated fusion, and consistency constraints for ECG multimodal diagnostic classification.
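To make the fusion step concrete, the sketch below shows one plausible way the gated fusion, lightweight classification head, and a cross-modal consistency term could be realized in PyTorch. It is a minimal illustration under stated assumptions, not the paper's implementation: the module name GatedFusion, the dimension d_model, and the cosine-similarity consistency loss are hypothetical choices introduced here for clarity.

```python
# Illustrative sketch only: assumes the waveform and report embeddings have
# already been projected into a shared d_model-dimensional space.
import torch
import torch.nn as nn


class GatedFusion(nn.Module):
    """Hypothetical gated fusion: a sigmoid gate conditioned on the joint
    state of both modality representations weights their combination."""

    def __init__(self, d_model: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.Sigmoid(),
        )

    def forward(self, z_wave: torch.Tensor, z_text: torch.Tensor) -> torch.Tensor:
        # z_wave, z_text: (batch, d_model) aligned embeddings.
        g = self.gate(torch.cat([z_wave, z_text], dim=-1))
        # Element-wise convex combination: g weights the waveform path,
        # (1 - g) weights the report path.
        return g * z_wave + (1.0 - g) * z_text


# Example usage with random embeddings (d_model and n_classes are assumptions).
d_model, n_classes = 256, 5
fusion = GatedFusion(d_model)
head = nn.Linear(d_model, n_classes)  # lightweight classification head
z_w, z_t = torch.randn(8, d_model), torch.randn(8, d_model)
logits = head(fusion(z_w, z_t))  # (8, n_classes) class scores

# One possible instantiation of a cross-modal consistency constraint:
# encourage the two aligned embeddings to agree in direction.
consistency_loss = 1.0 - nn.functional.cosine_similarity(z_w, z_t, dim=-1).mean()
```

The gate makes the modality weighting input-dependent, so a noisy waveform or an uninformative report can be down-weighted per sample, while the consistency term penalizes disagreement between the two aligned representations before fusion.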