Group emotion recognition in the wild has received much attention in the computer vision community. It is a very challenging problem because of the interactions among varying numbers of people and the different kinds of occlusion involved. According to research in human cognition and behavior, background and facial expression play a dominant role in the perception of a group's mood. Hence, in this paper, we propose a novel approach that combines these two features for image-based group emotion recognition with feature correlation enhancement. The feature enhancement is reflected mainly in two parts. For facial expression feature extraction, we plug non-local blocks into the Xception network to enhance the correlation between features at different positions in the low-level layers, which avoids the rapid loss of position information seen in traditional CNNs and effectively enhances the network's feature representation capability. For global scene information, we build a bilinear convolutional neural network (B-CNN) consisting of VGG16 networks to model local pairwise feature interactions in a translationally invariant manner. The experimental results show that the fused features effectively improve recognition performance.
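To make the phrase "local pairwise feature interactions in a translationally invariant manner" concrete, the following is a minimal sketch of bilinear pooling in plain Python. It is an illustration, not the implementation used in this work. The function name `bilinear_pool`, the toy feature values, and the signed square-root and L2 normalisation steps are assumptions here; that post-processing is common for B-CNN descriptors but is not specified in the text above. In practice the two inputs would be convolutional feature maps from the two VGG16 streams, flattened over spatial locations.

```python
import math
from typing import List

Matrix = List[List[float]]


def bilinear_pool(feat_a: Matrix, feat_b: Matrix) -> List[float]:
    """Sum-pool the outer products of two per-location feature sets.

    feat_a: N locations x C1 channels; feat_b: N locations x C2 channels.
    Returns a flattened descriptor of length C1 * C2.
    """
    if len(feat_a) != len(feat_b):
        raise ValueError("both streams must cover the same locations")
    c1, c2 = len(feat_a[0]), len(feat_b[0])
    pooled = [0.0] * (c1 * c2)
    # Accumulate the outer product a_i * b_j at every location.
    for a, b in zip(feat_a, feat_b):
        for i in range(c1):
            for j in range(c2):
                pooled[i * c2 + j] += a[i] * b[j]
    # Assumed post-processing: signed square root, then L2 normalisation.
    pooled = [math.copysign(math.sqrt(abs(v)), v) for v in pooled]
    norm = math.sqrt(sum(v * v for v in pooled))
    return [v / norm for v in pooled] if norm > 0 else pooled


# Toy example: 3 spatial locations, 2 channels per stream (made-up values).
stream_a = [[1.0, 0.0], [0.5, 2.0], [0.0, 1.0]]
stream_b = [[2.0, 1.0], [1.0, 0.0], [0.0, 3.0]]
descriptor = bilinear_pool(stream_a, stream_b)

# Reordering the locations (same order in both streams) leaves the
# descriptor unchanged, because the pooling sums over positions.
shuffled = bilinear_pool(stream_a[::-1], stream_b[::-1])
```

Because the outer products are summed over all locations, the descriptor records which pairs of channels co-occur but not where they occur, which is the sense in which the pooled representation is orderless and therefore insensitive to translation.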