Qifu Technology's intelligent speech team has reached another milestone: its multimodal emotion computing paper, "Qieemo: Multimodal Emotion Recognition Based on the ASR Backbone," has been officially accepted by ASRU 2025, a flagship event in the speech field. The acceptance makes Qifu Technology one of the few fintech companies with research accepted at all three top global speech venues (ICASSP, Interspeech, ASRU), placing the company in the first tier of global speech technology R&D.
As a flagship conference in the audio understanding domain, ASRU (IEEE Workshop on Automatic Speech Recognition and Understanding) is held biennially and represents the highest level of research in global audio understanding.
The core value of the paper accepted at ASRU 2025 lies in constructing a universally applicable theoretical framework rather than merely a task-specific model. From a mathematical modeling perspective, the paper innovatively builds a general feature fusion theoretical framework centered on ASR models as the core backbone, systematically demonstrating the essential contributions and key mechanisms of multi-level features from pre-trained ASR model encoders for downstream audio understanding tasks. This framework breaks away from conventional approaches of adding network layers or fine-tuning parameters on existing models, deeply exploring the essence of speech representation and the underlying logic of cross-modal applications, providing a novel and solid theoretical foundation for multimodal emotion recognition and broader speech understanding tasks.
The resulting Qieemo model is an implementation of this theoretical framework. Built upon widely available pre-trained ASR (Automatic Speech Recognition) model components, it extracts text-related speech posterior probability features and frame-aligned emotional features, then fuses features from different layers of the ASR model through proprietary multimodal fusion and cross-modal attention modules. This design gives Qieemo strong transferability and scalability. Its core idea, using deep, frame-aligned features extracted from an ASR backbone as the foundation for multimodal fusion, extends beyond emotion computing: it also offers a fundamental tool and a new research paradigm for related downstream tasks such as liveness detection and semantic understanding, and for cross-industry intelligent interaction scenarios in education, healthcare, entertainment, and other sectors. More importantly, during real-time interaction, Qieemo provides not only the corresponding text but also deeper emotional insight.
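To make the fusion mechanism concrete, the sketch below shows generic scaled dot-product cross-modal attention, in which frame-aligned emotional features attend over ASR posterior-derived text features. This is a minimal illustrative sketch of the general technique the article describes, not the paper's actual module; all function names, feature names, and dimensions here are assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def cross_modal_attention(query_feats, key_value_feats):
    """Let one modality (query) attend over another (key/value).

    query_feats:     (T_q, d), e.g. frame-aligned emotional features
    key_value_feats: (T_k, d), e.g. posterior-probability text features
    Returns fused features of shape (T_q, d).
    """
    d = query_feats.shape[-1]
    # Scaled dot-product scores between every query and key frame.
    scores = query_feats @ key_value_feats.T / np.sqrt(d)  # (T_q, T_k)
    weights = softmax(scores, axis=-1)                     # rows sum to 1
    return weights @ key_value_feats                       # (T_q, d)


# Illustrative shapes only: 50 emotion frames, 20 text frames, dim 64.
rng = np.random.default_rng(0)
emotion_feats = rng.standard_normal((50, 64))
text_feats = rng.standard_normal((20, 64))
fused = cross_modal_attention(emotion_feats, text_feats)
```

In a real system the fused output would be passed to an emotion classification head; the actual Qieemo modules are proprietary, so this only conveys the shape of the idea.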
Qieemo enables machines to genuinely "understand" the emotion in human speech. The technology improves recognition accuracy by over 15% compared with traditional methods and holds up in complex scenarios, delivering a 4% relative improvement over MSMSER, the previous state-of-the-art single-modal solution. For the first time, this gives intelligent customer service genuine "emotional understanding" capability and sets a new state-of-the-art benchmark in the emotion computing field. This performance leap stems from deep insight into underlying speech features and their mechanisms rather than from simply increasing model complexity.
From a business value perspective, this technology can directly empower the entire financial services process: in intelligent customer service scenarios, real-time recognition of user emotional fluctuations enables dynamic adjustment of service strategies to improve user satisfaction; in credit assessment processes, combining speech emotional features with text information allows more precise judgment of user credit status and reduces risk costs. More importantly, the theoretical foundation and framework design established by Qieemo constructs a higher-performance, more adaptable underlying platform for intelligent voice interaction in finance and broader domains.
Unlike most fintech companies, which rely on open-source technology or external partnerships, Qifu Technology insists on full-chain independent R&D in core artificial intelligence areas. It invests continuously in cutting-edge fields such as speech recognition and emotion computing, and has built a complete system from algorithm design to engineering implementation. Crucially, Qifu Technology has chosen a deeper, more fundamental R&D route: while the industry generally focuses on stacking layers onto existing neural network architectures or trying different combinations, Qifu Technology returns to first principles, investigating the underlying mathematical principles and mechanisms of speech signal processing, feature representation, and fusion. This persistent pursuit of basic theory and original frameworks yields significant advantages in technical depth, application flexibility, and long-term competitiveness.
Fei Haojun, Chief Algorithm Scientist at Qifu Technology, stated: "Completing the trifecta of top conferences is not an endpoint, but the starting point for Qifu Technology's speech technology ecosystem. The establishment of the Qieemo model marks a crucial step in building fundamental speech understanding capabilities. It not only serves our own financial scenarios, but its theoretical core and design philosophy have the potential to benefit peers and cross-industry applications. We will continue exploring the convergence point between speech technology and human-machine collaboration, persisting in innovation in basic theory and core frameworks, making fintech not only precise but also empathetic, and allowing the broader intelligent world to benefit from our deep understanding of underlying logic."