Recognition of human activity is an active research area. It draws on the Internet of Things, sensor-based methods, machine learning, and deep learning techniques to support application domains such as home monitoring, robotics, surveillance, and healthcare. However, researchers face challenges such as high time complexity, long model execution times, and limited classification accuracy. This paper introduces a novel approach to overcome these issues using deep learning transformer models: ViT (Vision Transformer), DeiT (Data-efficient image Transformers), and the SwinV2 transformer for image-based datasets (Stanford40 and MPII Human Pose), and the VideoMAE transformer for the video-based UCF101 and HMDB51 datasets. These approaches achieve remarkable accuracy in classifying human activities. Evaluations of ViT, DeiT, and SwinV2 on Stanford40 yield accuracies of 90.8%, 90.7%, and 88%, respectively; on the MPII Human Pose dataset, the corresponding accuracies are 87%, 85.6%, and 87.1%. In addition, applying the VideoMAE transformer to video-based activity recognition achieves remarkable accuracies of 94.15% on the UCF101 dataset and 78.44% on the HMDB51 dataset. These findings emphasize the efficacy of the attention-based transformer models (ViT, DeiT, SwinV2, and VideoMAE) and the novelty of this work, as no prior results have been reported for these datasets with attention-based transformers.
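As a minimal illustration of the kind of fine-tuning setup the abstract describes, the sketch below adapts a pretrained ViT image classifier from the Hugging Face `transformers` library to a 40-way activity label set matching Stanford40; the checkpoint name, hyperparameters, and the random tensors standing in for preprocessed images are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch: fine-tuning a pretrained ViT for 40-class activity recognition
# (e.g., Stanford40). Checkpoint, learning rate, and dummy batch are
# assumptions for illustration only.
import torch
from transformers import ViTForImageClassification

# Load a pretrained ViT and attach a fresh 40-way classification head
# in place of the original ImageNet-1k head.
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224",   # assumed checkpoint
    num_labels=40,
    ignore_mismatched_sizes=True,    # new head size differs from ImageNet-1k
)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Dummy batch standing in for preprocessed images:
# ViT-Base expects 224x224 RGB inputs.
pixel_values = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 40, (8,))

# One training step; passing `labels` makes the model compute
# cross-entropy loss internally.
model.train()
outputs = model(pixel_values=pixel_values, labels=labels)
outputs.loss.backward()
optimizer.step()
print(f"loss: {outputs.loss.item():.4f}")
```

The same pattern extends to DeiT, SwinV2, and VideoMAE via their corresponding `transformers` classes, with VideoMAE taking clips of frames rather than single images.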