Referring video object segmentation (RVOS) is an active research topic in cross-media understanding that spans video and language. It aims to segment the entities in a given video that a textual description refers to. Unlike conventional visual segmentation tasks, which depend on pre-defined classes, RVOS must interpret the given expression to locate and segment the referred entities without the help of pre-defined classes. Because the textual expressions are free-form and no pixel-wise mask is provided as a reference, RVOS is more challenging than conventional video segmentation. Although RVOS is a relatively new task in cross-modal understanding, it has important application prospects in many areas (e.g., security monitoring, vehicle tracking, and person re-identification), and a growing number of methods have been proposed for it. This paper roughly divides the existing solutions into four categories according to their research approaches: dynamic-convolution-based, attention-based, multi-level-information-learning-based, and end-to-end sequence-prediction-based methods. Qualitative and quantitative performance comparisons are then presented and analyzed. Lastly, the paper summarizes several issues in current methods and offers suggestions for further improving RVOS performance in future work. © 2025 Journal of Computer Engineering and Applications Beijing Co., Ltd.; Science Press. All rights reserved.