TY - GEN
T1 - High-level spatial modeling in convolutional neural network with application to pedestrian detection
AU - Liu, Feng
AU - Huang, Yongzhen
AU - Yang, Wankou
AU - Sun, Changyin
N1 - Publisher Copyright:
© 2015 IEEE.
PY - 2015/6/19
Y1 - 2015/6/19
N2 - Convolutional neural networks (CNNs) have achieved great success in many vision tasks. A key to this success is their ability to automatically learn both high-level and low-level features. In general, low-level features have small receptive fields and appear multiple times at different locations of an object, while high-level semantic features have relatively large receptive fields and appear only once at a specific location. However, a traditional CNN treats these two kinds of features in the same manner, i.e., learning them by the convolution operation, which can be approximately viewed as accumulating the probabilities that a feature appears at different locations. This strategy is reasonable for low-level features but not for high-level semantic ones, especially in pedestrian detection, where a local feature can be shared across locations but a semantic part, e.g., a head, appears only once per person. To jointly model the spatial structure and appearance of high-level semantic features, we propose a new module that learns spatially weighted max pooling in a CNN. The proposed method is evaluated on several pedestrian detection databases, and the experimental results show that it achieves much better performance than a traditional CNN.
AB - Convolutional neural networks (CNNs) have achieved great success in many vision tasks. A key to this success is their ability to automatically learn both high-level and low-level features. In general, low-level features have small receptive fields and appear multiple times at different locations of an object, while high-level semantic features have relatively large receptive fields and appear only once at a specific location. However, a traditional CNN treats these two kinds of features in the same manner, i.e., learning them by the convolution operation, which can be approximately viewed as accumulating the probabilities that a feature appears at different locations. This strategy is reasonable for low-level features but not for high-level semantic ones, especially in pedestrian detection, where a local feature can be shared across locations but a semantic part, e.g., a head, appears only once per person. To jointly model the spatial structure and appearance of high-level semantic features, we propose a new module that learns spatially weighted max pooling in a CNN. The proposed method is evaluated on several pedestrian detection databases, and the experimental results show that it achieves much better performance than a traditional CNN.
UR - https://www.scopus.com/pages/publications/84938334229
U2 - 10.1109/CCECE.2015.7129373
DO - 10.1109/CCECE.2015.7129373
M3 - Conference contribution
AN - SCOPUS:84938334229
T3 - Canadian Conference on Electrical and Computer Engineering
SP - 778
EP - 783
BT - 2015 IEEE 28th Canadian Conference on Electrical and Computer Engineering, CCECE 2015
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2015 28th IEEE Canadian Conference on Electrical and Computer Engineering, CCECE 2015
Y2 - 3 May 2015 through 6 May 2015
ER -