XEmoRAG: Cross-Lingual Emotion Transfer with Controllable Intensity Using Retrieval-Augmented Generation

Abstract

Zero-shot emotion transfer in cross-lingual speech synthesis refers to generating speech in a target language whose emotion is conveyed by reference speech in a different source language. The task remains challenging due to the scarcity of parallel multilingual emotional corpora, the presence of foreign-accent artifacts, and the difficulty of separating emotion from language-specific prosodic features. In this paper, we propose XEmoRAG, a novel framework that enables zero-shot emotion transfer from Chinese to Thai with an LLM-based model, without relying on parallel emotional data. XEmoRAG extracts language-agnostic emotional embeddings from Chinese speech and retrieves emotionally matched Thai utterances from a curated emotional database, enabling controllable emotion transfer without explicit emotion labels. In addition, a flow-matching alignment module minimizes pitch and duration mismatches, preserving natural prosody, speaker timbre, and emotional coherence in the synthesized cross-lingual speech. Experimental results show that XEmoRAG synthesizes expressive and natural Thai speech using only Chinese reference audio, without requiring explicit emotion labels. These results highlight XEmoRAG's capability for flexible and low-resource emotional transfer across languages.
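
To give a concrete picture of the retrieval step described above, the minimal sketch below selects emotionally matched Thai utterances by cosine similarity between a language-agnostic emotion embedding of the Chinese reference and precomputed embeddings in a Thai emotional database. The function names, embedding dimension, and database layout are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal retrieval sketch, assuming a precomputed Thai emotional database of
# (utterance_id, emotion_embedding) pairs and a Chinese reference that has
# already been mapped to a language-agnostic emotion embedding.
# All names, dimensions, and the similarity measure are illustrative.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def retrieve_emotional_prompts(ref_embedding: np.ndarray,
                               thai_database: list[tuple[str, np.ndarray]],
                               top_k: int = 1) -> list[str]:
    """Return ids of the Thai utterances whose emotion embeddings are closest
    to the Chinese reference embedding."""
    scored = [(utt_id, cosine_similarity(ref_embedding, emb))
              for utt_id, emb in thai_database]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [utt_id for utt_id, _ in scored[:top_k]]

# Toy usage with random placeholder embeddings:
rng = np.random.default_rng(0)
database = [(f"thai_{i:04d}.wav", rng.standard_normal(256)) for i in range(1000)]
reference = rng.standard_normal(256)
print(retrieve_emotional_prompts(reference, database, top_k=3))
```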

Figure 1: Cross-lingual emotional speech synthesis system based on retrieval-augmented generation and flow matching. The user input includes a Chinese emotional reference speech, target Thai text, and a specified emotion intensity level (strong, normal, or weak).



Cross-Lingual Emotion Comparison

This section compares XEmoRAG with two state-of-the-art baseline systems, Typhoon[1] and DelightfulTTS[2], on the Chinese-to-Thai transfer task to highlight its cross-lingual emotion transfer ability. For each test case, we present: (1) the original Chinese emotional reference audio, (2) the emotionally matched Thai utterance retrieved from our emotional database, and (3) synthesized outputs from all three systems for direct comparison. The evaluation spans four key emotions (anger, happiness, sadness, and fatigue) across different textual contexts and uses no explicit emotion labels, showcasing XEmoRAG's consistent ability to preserve emotional authenticity while maintaining natural prosody in Thai.

Emotion Intensity Control

This section demonstrates XEmoRAG's ability to control emotion intensity at three levels: weak, normal, and strong. We present: (1) the Chinese reference audio that establishes the base emotion, followed by (2) three synthesized Thai versions at the weak, normal, and strong intensity levels. All samples use identical textual content to isolate the effect of intensity, with emotional strength adjusted through our retrieval-augmented generation framework.
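
As a purely illustrative sketch of how a weak/normal/strong switch could be exposed on top of a retrieval pipeline, the example below interpolates the reference emotion embedding toward a neutral embedding before retrieval. The scale values and the `neutral_embedding` input are assumptions made for illustration; they are not claimed to be XEmoRAG's actual mechanism.

```python
# Illustrative only: one simple way to realize weak/normal/strong intensity in a
# retrieval-augmented setup is to interpolate the reference emotion embedding
# toward (weak) or away from (strong) a neutral embedding before retrieval.
# The scale values and the neutral embedding are assumptions, not XEmoRAG's
# exact mechanism.
import numpy as np

INTENSITY_SCALE = {"weak": 0.5, "normal": 1.0, "strong": 1.5}

def scale_emotion_embedding(ref_embedding: np.ndarray,
                            neutral_embedding: np.ndarray,
                            level: str) -> np.ndarray:
    """Rescale the reference embedding relative to the neutral embedding
    according to the requested intensity level."""
    alpha = INTENSITY_SCALE[level]
    return neutral_embedding + alpha * (ref_embedding - neutral_embedding)

# The scaled embedding would then be passed to the retrieval step, e.g.
# retrieve_emotional_prompts(scale_emotion_embedding(ref, neutral, "strong"), db)
```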

Flow Matching for Natural Prosody and Timbre

This ablation study validates the flow-matching module's contribution by comparing: (1) output from the original X-Codec2 decoder in Llasa[3] with (2) output from our flow-matched version under identical input conditions. Each comparison pair includes: (a) the Chinese reference audio, (b) the retrieved Thai emotional prompt, and (c) both synthesis outputs. The results demonstrate the flow-matching module's improvements in naturalness, expressiveness, and timbre blending.
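
For context on what a flow-matched decoder does at inference time, the generic sketch below integrates a learned velocity field with Euler steps from Gaussian noise at t = 0 to conditioned acoustic features at t = 1. The velocity network, conditioning, feature shape, and step count are illustrative assumptions rather than the module's actual configuration.

```python
# Generic flow-matching inference sketch: integrate a learned velocity field
# dx/dt = v(x, t, cond) from Gaussian noise (t = 0) to acoustic features (t = 1).
# The velocity network, conditioning, feature shape, and step count are
# illustrative assumptions, not the actual XEmoRAG configuration.
import numpy as np

def flow_matching_sample(velocity_fn, cond: np.ndarray, shape: tuple,
                         num_steps: int = 32, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)            # start from noise at t = 0
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        x = x + dt * velocity_fn(x, t, cond)  # one Euler step along the flow
    return x

# Toy usage with a dummy velocity field that simply pulls x toward the condition:
condition = np.ones((80, 100))                # e.g. prompt-derived features
features = flow_matching_sample(lambda x, t, c: c - x, condition, shape=(80, 100))
```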

[1] K. Pipatanakul et al., "Typhoon 2: A Family of Open Text and Multimodal Thai Large Language Models", arXiv preprint arXiv:2412.13702, 2024.
[2] Y. Li et al., "Zero-shot emotion transfer for cross-lingual speech synthesis", IEEE ASRU Workshop, pp. 1-8, 2023. DOI: 10.1109/ASRU57964.2023.10389625
[3] Z. Ye et al., "Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis", arXiv preprint arXiv:2502.04128, 2025.