Transformer-Based Spiking Neural Networks for Multimodal Audiovisual Classification

  • Lingyue Guo
  • Zeyu Gao
  • Jinye Qu
  • Suiwu Zheng
  • Runhao Jiang
  • Yanfeng Lu
  • Hong Qiao

Research output: Contribution to journal › Article › peer-review

22 Scopus citations

Abstract

Spiking neural networks (SNNs), as brain-inspired neural networks, have received noteworthy attention due to their advantages of low power consumption, high parallelism, and high fault tolerance. While SNNs have shown promising results on uni-modal tasks, their deployment in multimodal audiovisual classification remains limited, and their ability to capture correlations between the visual and audio modalities needs improvement. To address these challenges, we propose a novel model called the spiking multimodal transformer (SMMT), which combines SNNs and Transformers for multimodal audiovisual classification. The SMMT model integrates uni-modal subnetworks for the visual and auditory modalities with a novel spiking cross-attention module for fusion, strengthening the correlation between the visual and audio modalities. This approach achieves competitive accuracy on multimodal classification tasks at low energy consumption, making it an effective and energy-efficient solution. Extensive experiments on a public event-based data set (N-TIDIGIT&MNIST-DVS) and two self-made audiovisual data sets of real-world objects (CIFAR10-AV and UrbanSound8K-AV) demonstrate the effectiveness and energy efficiency of the proposed SMMT model in multimodal audiovisual classification tasks. Our constructed multimodal audiovisual data sets can be accessed at https://github.com/Guo-Lingyue/SMMT.
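To make the fusion idea concrete, the sketch below shows one plausible form of spiking cross-attention: binary spike features from one modality form the queries while the other modality supplies keys and values, and the attended result is passed through a leaky integrate-and-fire (LIF) neuron so the fused output is again a spike train. All dimensions, weights, and LIF parameters here are hypothetical illustrations, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N_V, N_A, D = 4, 6, 5, 8  # timesteps, visual tokens, audio tokens, feature dim

# Hypothetical binary spike features standing in for uni-modal SNN outputs.
vis = (rng.random((T, N_V, D)) < 0.3).astype(float)
aud = (rng.random((T, N_A, D)) < 0.3).astype(float)

Wq = rng.normal(size=(D, D))
Wk = rng.normal(size=(D, D))
Wv = rng.normal(size=(D, D))

def lif(current, v_th=1.0, tau=2.0):
    """Leak-and-integrate input current over time; spike and hard-reset at threshold."""
    v = np.zeros(current.shape[1:])
    spikes = np.zeros_like(current)
    for t in range(current.shape[0]):
        v = v / tau + current[t]
        spikes[t] = (v >= v_th).astype(float)
        v = np.where(v >= v_th, 0.0, v)  # hard reset after a spike
    return spikes

# Cross-attention: visual queries attend over audio keys/values at each timestep.
q = vis @ Wq
k = aud @ Wk
v = aud @ Wv
scores = q @ k.transpose(0, 2, 1) / np.sqrt(D)        # (T, N_V, N_A)
scores -= scores.max(-1, keepdims=True)               # numerically stable softmax
attn = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
fused = lif(attn @ v)                                 # binary fused spikes, (T, N_V, D)
print(fused.shape)
```

A symmetric branch with audio queries attending to visual keys/values could be added, with the two fused spike trains concatenated before a classification head; energy efficiency comes from the binary activations, which replace most multiplications with additions on neuromorphic hardware.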

Original language: English
Pages (from-to): 1077-1086
Number of pages: 10
Journal: IEEE Transactions on Cognitive and Developmental Systems
Volume: 16
Issue number: 3
DOIs
State: Published - 1 Jun 2024
Externally published: Yes

UN SDGs

This output contributes to the following UN Sustainable Development Goals (SDGs)

  1. SDG 7 - Affordable and Clean Energy

Keywords

  • Audiovisual classification
  • audiovisual data sets
  • multimodal recognition
  • spiking neural network (SNN)
