TY - JOUR
T1 - Generalized Zero-Shot Learning via Multi-Modal Aggregated Posterior Aligning Neural Network
AU - Chen, Xingyu
AU - Li, Jin
AU - Lan, Xuguang
AU - Zheng, Nanning
N1 - Publisher Copyright:
© 2021 IEEE.
PY - 2022
Y1 - 2022
N2 - The visual-semantic gap between the visual space (visual features) and the semantic space (semantic attributes) is one of the main problems in the Generalized Zero-Shot Learning (GZSL) task. The essence of this problem is that the manifold structures of these two spaces are inconsistent, which makes it difficult to learn embeddings that unify visual features and semantic attributes for similarity measurement. In this work, we tackle this problem by proposing a multi-modal aggregated posterior aligning neural network based on Wasserstein Auto-Encoders (WAE), which learns a shared latent space for visual features and semantic attributes. The key to our approach is that the aggregated posterior distribution of the latent representations encoded from the visual features of each class is encouraged to align with a Gaussian distribution predicted from the corresponding semantic attribute in the latent space. On one hand, requiring the latent manifolds of visual features and semantic attributes to be consistent preserves the inter-class associations between seen and unseen classes. On the other hand, the aggregated posterior of each class is directly defined as a Gaussian in the latent space, which provides a reliable way to synthesize latent features for training classification models. Using the AWA1, AWA2, CUB, aPY, FLO, and SUN benchmark datasets, we conducted extensive comparative evaluations that demonstrate the advantages of our method over state-of-the-art approaches.
AB - The visual-semantic gap between the visual space (visual features) and the semantic space (semantic attributes) is one of the main problems in the Generalized Zero-Shot Learning (GZSL) task. The essence of this problem is that the manifold structures of these two spaces are inconsistent, which makes it difficult to learn embeddings that unify visual features and semantic attributes for similarity measurement. In this work, we tackle this problem by proposing a multi-modal aggregated posterior aligning neural network based on Wasserstein Auto-Encoders (WAE), which learns a shared latent space for visual features and semantic attributes. The key to our approach is that the aggregated posterior distribution of the latent representations encoded from the visual features of each class is encouraged to align with a Gaussian distribution predicted from the corresponding semantic attribute in the latent space. On one hand, requiring the latent manifolds of visual features and semantic attributes to be consistent preserves the inter-class associations between seen and unseen classes. On the other hand, the aggregated posterior of each class is directly defined as a Gaussian in the latent space, which provides a reliable way to synthesize latent features for training classification models. Using the AWA1, AWA2, CUB, aPY, FLO, and SUN benchmark datasets, we conducted extensive comparative evaluations that demonstrate the advantages of our method over state-of-the-art approaches.
KW - Aggregated posterior distribution alignment
KW - generalized zero-shot learning
KW - multi-modal neural network
UR - https://www.scopus.com/pages/publications/85098749855
U2 - 10.1109/TMM.2020.3047546
DO - 10.1109/TMM.2020.3047546
M3 - Article
AN - SCOPUS:85098749855
SN - 1520-9210
VL - 24
SP - 177
EP - 187
JO - IEEE Transactions on Multimedia
JF - IEEE Transactions on Multimedia
ER -