Developing intelligent visual systems for next-generation smart classrooms has become an active area of research in computer vision. Advances in computer vision and deep learning have enabled systems capable of automatically classifying students' behavior and providing feedback to teachers, and several vision-based methods have recently been proposed for this purpose. However, most existing works do not integrate multiple visual cues, such as facial expressions and body poses, even though doing so can improve classification accuracy. Moreover, these methods cannot easily be extended to provide behavior feedback for individual students. This paper attempts to fill these research gaps by proposing a novel automated system based on multiple visual cues that monitors and reports both individual students' behavior and overall class behavior. First, the system detects and tracks each student in the input classroom video frames. Then, it extracts body pose, mobile-phone proximity, and facial features using the OpenPose and Py-Feat frameworks and concatenates them into a single feature vector. This vector is fed into a trained behavior model, which classifies each student's behavior as "positive" or "negative." The individual labels are then aggregated frame by frame to estimate the overall class behavior. The behavior model was developed by training a customized neural network architecture on our newly developed dataset, the "Classroom Spontaneous Student Behavior Dataset." The model trained on the concatenated features achieved 91.30% training accuracy and 90.80% validation accuracy, outperforming the models trained on individual features as well as other relevant methods. Additionally, we empirically analyzed the proposed system's computational complexity and demonstrated its output on a sample classroom video.
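To make the fusion-and-classification step concrete, the following is a minimal sketch of one plausible realization. The network `BehaviorNet`, the feature dimensions, the label mapping, and the majority-vote aggregation rule are all illustrative assumptions: the abstract does not specify the customized architecture, the exact OpenPose/Py-Feat feature sets, or the aggregation scheme, so random vectors stand in for real extracted features.

```python
import numpy as np
import torch
import torch.nn as nn

# Illustrative feature dimensions (assumed; not given in the abstract).
POSE_DIM = 50   # e.g., flattened OpenPose body keypoints
PROX_DIM = 1    # e.g., a scalar mobile-proximity cue
FACE_DIM = 20   # e.g., Py-Feat facial action-unit intensities

class BehaviorNet(nn.Module):
    """Stand-in for the paper's customized architecture: a small MLP
    over the concatenated feature vector with two output classes
    (label 0 = "positive", label 1 = "negative")."""
    def __init__(self, in_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 2),
        )

    def forward(self, x):
        return self.net(x)

def classify_student(pose, prox, face, model):
    """Concatenate the three cues into one feature vector and classify,
    mirroring the fusion step described in the abstract."""
    feats = torch.from_numpy(np.concatenate([pose, prox, face])).float()
    with torch.no_grad():
        return model(feats.unsqueeze(0)).argmax(dim=1).item()

def aggregate_frame(labels):
    """Estimate overall class behavior for one frame by majority vote
    over per-student labels (one plausible rule; the abstract does not
    specify the actual aggregation scheme)."""
    return "positive" if np.mean(labels) < 0.5 else "negative"

if __name__ == "__main__":
    model = BehaviorNet(POSE_DIM + PROX_DIM + FACE_DIM).eval()
    # Random stand-ins for per-student features that OpenPose / Py-Feat
    # would extract from a tracked student in one video frame.
    students = [
        (np.random.rand(POSE_DIM),
         np.random.rand(PROX_DIM),
         np.random.rand(FACE_DIM))
        for _ in range(5)
    ]
    labels = [classify_student(p, m, f, model) for p, m, f in students]
    print("frame-level class behavior:", aggregate_frame(labels))
```

In this sketch the per-frame class label is simply the majority of the individual predictions; a deployed system could equally report the fraction of students labeled "negative" per frame as a continuous engagement signal.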