Pretraining on a large dataset is the first stage of many computer vision tasks such as classification, detection, and segmentation. Conventional pretraining relies on large datasets with human annotations. In this context, self-supervised learning, which pretrains models on unlabeled data, has become increasingly attractive. Throughout the development of self-supervised learning, image-level contrastive representation learning has emerged as a highly effective approach for general transfer learning. However, its image-level representations may lack the specificity required by a particular downstream task, compromising performance on that task. Recently, SoCo, an object-level self-supervised pretraining framework, was proposed for object detection. To achieve object-level pretraining, SoCo adopts the traditional selective search algorithm to generate object proposals, which incurs high space and time costs and also prevents end-to-end training from reaching a global optimum. In this work, we propose an end-to-end object-level contrastive pretraining framework for detection that obtains object proposals from the pretraining network itself. Specifically, we use the heat map computed from the features of the last backbone convolutional layer as semantic cues to roughly localize objects, and then generate promising proposals with center-suppressed sampling and multiple cropping strategies. Experimental results show that our method achieves better performance with significantly lower training space and time costs.
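To make the proposal-generation step concrete, the following is a minimal sketch, not the authors' released code, of how heat-map-guided localization with center-suppressed sampling and multiple cropping could look in PyTorch. The function name `heatmap_proposals` and all parameter values (`num_centers`, `scales`, `suppress_radius`) are illustrative assumptions.

```python
import torch

def heatmap_proposals(feat, img_size, num_centers=4,
                      scales=(0.2, 0.4), suppress_radius=0.15):
    """Sketch: feat is a (C, H, W) tensor from the last backbone conv layer.
    Returns proposal boxes as (x1, y1, x2, y2) in image-pixel coordinates."""
    C, H, W = feat.shape
    # Channel-averaged activations serve as a coarse object-ness heat map.
    heat = feat.mean(dim=0)
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-6)

    probs = heat.flatten().clone()
    boxes = []
    ys = torch.arange(H, dtype=torch.float32).view(-1, 1)
    xs = torch.arange(W, dtype=torch.float32).view(1, -1)
    for _ in range(num_centers):
        if probs.sum() <= 0:
            break
        # Sample a proposal center, favoring high-activation locations.
        idx = torch.multinomial(probs, 1).item()
        cy, cx = divmod(idx, W)
        # Multiple cropping: boxes of several scales around each center.
        for s in scales:
            bw = bh = s * img_size
            x_c = (cx + 0.5) / W * img_size
            y_c = (cy + 0.5) / H * img_size
            boxes.append((max(0.0, x_c - bw / 2), max(0.0, y_c - bh / 2),
                          min(float(img_size), x_c + bw / 2),
                          min(float(img_size), y_c + bh / 2)))
        # Center suppression: zero out a disc around the sampled center
        # so subsequent samples spread toward other objects.
        dist = ((ys - cy) ** 2 + (xs - cx) ** 2).sqrt()
        probs = probs * (dist > suppress_radius * max(H, W)).flatten().float()
    return boxes

# Example usage with a dummy feature map from a 224x224 input image.
feat = torch.rand(256, 7, 7)
boxes = heatmap_proposals(feat, img_size=224)
```

Because the proposals come from the pretraining network's own features rather than an external selective search pass, no proposal cache needs to be stored and the whole pipeline can be optimized end to end.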