Media aesthetic assessment is a key technique in computer vision, which is widely applied in computer game rendering, video/image classification. Low-level and high-level features fusion-based video aesthetic assessment algorithms have achieved impressive performance, which outperform photo- and motion-based algorithms, however, these methods only focus on aesthetic features of single-frame while ignore the inherent relationship between adjacent frames. Therefore, we propose a novel video aesthetic assessment framework, where structural cues among frames are well encoded. Our method consists of two components: aesthetic features extraction and structure correlation construction. More specifically, we incorporate both low-level and high-level visual features to construct aesthetic features, where salient regions are extracted for content understanding. Subsequently, we develop a structure correlation-based algorithm to evaluate the relationship among adjacent frames, where frames with similar structure property should have a strong correlation coefficient. Afterwards, a kernel multi-SVM is trained for video classification and high aesthetic video selection. Comprehensive experiments demonstrate the effectiveness of our method. (c) 2020 Elsevier Inc. All rights reserved.