Multi-agent collaborative perception, an emerging technology in intelligent driving, has attracted considerable attention in recent years. Despite advancements in previous works, challenges remain due to inevitable localization errors, data sparsity, and bandwidth limitations. To address these challenges, a collaborative detection and tracking method, CoTrack, is proposed to balance perception effectiveness with communication efficiency. Specifically, a spatio-temporal aggregation module is presented, consisting of a spatial cross-agent collaboration submodule and a temporal ego-agent enhancement submodule. The former dynamically integrates spatial semantics from multiple agents to alleviate the feature misalignment caused by localization errors, while the latter captures the historical context of the ego-agent to compensate for the insufficiency of single-frame observations resulting from data sparsity. Additionally, an unsupervised feature compressor is designed to reduce communication volume. Furthermore, a two-stage online association strategy is developed to improve the matching success rate of detection-track pairs in the collaborative tracking task. Experimental results on both simulated and real-world datasets demonstrate that CoTrack achieves state-of-the-art performance in collaborative 3D object detection and tracking while maintaining robustness in harsh and noisy environments.
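The abstract does not specify the two-stage association strategy, but its general flavor can be illustrated with a minimal sketch: high-confidence detections are matched to tracks first, and the remaining low-confidence detections are then matched only against tracks left over from the first stage. All function names, the greedy center-distance cost, and the thresholds below are illustrative assumptions, not details from the paper.

```python
import math


def associate_two_stage(tracks, detections, conf_thresh=0.5, dist_gate=2.0):
    """Hypothetical two-stage online association sketch (not the paper's method).

    tracks:     list of predicted track centers (x, y)
    detections: list of ((x, y), confidence) pairs
    Returns a list of matched (track_index, detection_index) pairs.
    """

    def greedy_match(track_ids, det_ids):
        # Enumerate candidate pairs sorted by center distance, smallest first.
        cands = sorted(
            (math.dist(tracks[t], detections[d][0]), t, d)
            for t in track_ids for d in det_ids
        )
        pairs, used_t, used_d = [], set(), set()
        for cost, t, d in cands:
            # Gate out distant pairs and enforce one-to-one matching.
            if cost > dist_gate or t in used_t or d in used_d:
                continue
            pairs.append((t, d))
            used_t.add(t)
            used_d.add(d)
        return pairs, used_t

    high = [i for i, (_, c) in enumerate(detections) if c >= conf_thresh]
    low = [i for i, (_, c) in enumerate(detections) if c < conf_thresh]

    # Stage 1: high-confidence detections compete for all tracks.
    stage1, used_t = greedy_match(range(len(tracks)), high)
    # Stage 2: low-confidence detections only see the unmatched tracks,
    # which recovers pairs that a single confidence-filtered pass would drop.
    remaining = [t for t in range(len(tracks)) if t not in used_t]
    stage2, _ = greedy_match(remaining, low)
    return stage1 + stage2
```

In this sketch, the second stage is what raises the matching success rate: a low-confidence detection near an otherwise-unmatched track is still associated instead of being discarded outright.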