Vision-based multi-target monitoring systems can provide a comprehensive evaluation of the structural safety of bridges. However, their application to bridges in the field has been constrained by the trade-off between field of view (FOV) and accuracy, and by the effects of camera orientation and complex backgrounds on measurement quality. This study introduces a robust monocular vision-based monitoring system (RMVMS) for multi-target displacement measurement. First, a method for determining the system configuration is developed to balance FOV against accuracy. Next, a hybrid network structure, ConvTransNet, is introduced to mitigate disturbance from complex backgrounds. In addition, a novel multi-target displacement transformation model (MDTM) is proposed to correct errors arising from camera orientation. A boundary loss function and an RMSProp learning rate schedule are adopted during training, with which ConvTransNet performs best at a precision-recall (P-R) threshold of 0.45. In a test on a 4-meter laboratory-scale bridge model, ConvTransNet outperformed existing segmentation models on a custom dataset formatted according to the Pascal VOC 2012 standard, and MDTM reduced orientation-induced errors from 17.93% to 1.53%. The efficiency and robustness of RMVMS were further validated on a tied arch bridge, where the root mean square error (RMSE) and normalized RMSE (NRMSE) remained below 0.162 mm and 3.63%, respectively, confirming its capability for precise multi-target displacement monitoring in field applications.
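For readers who wish to reproduce the reported error metrics, the following is a minimal sketch of how RMSE and NRMSE might be computed for a single target's displacement record. It rests on assumptions not stated above: the NRMSE is taken to be normalized by the range of the reference signal (other conventions divide by the mean or the standard deviation), the function names `rmse` and `nrmse_percent` are hypothetical, and the sample values are illustrative only and are not data from this study.

```python
import numpy as np


def rmse(measured, reference):
    """Root mean square error between two displacement series (same units)."""
    measured = np.asarray(measured, dtype=float)
    reference = np.asarray(reference, dtype=float)
    if measured.shape != reference.shape:
        raise ValueError("measured and reference must have the same shape")
    return float(np.sqrt(np.mean((measured - reference) ** 2)))


def nrmse_percent(measured, reference):
    """RMSE normalized by the range of the reference signal, in percent.

    Assumption: range normalization. The abstract does not state which
    normalization convention the authors used.
    """
    reference = np.asarray(reference, dtype=float)
    span = float(reference.max() - reference.min())
    if span == 0.0:
        raise ValueError("reference signal has zero range")
    return 100.0 * rmse(measured, reference) / span


# Illustrative values only (mm): a reference sensor and a vision-based estimate.
reference_mm = [0.0, 1.0, 2.0, 3.0, 4.0]
vision_mm = [0.1, 0.9, 2.1, 2.9, 4.1]

print(f"RMSE  = {rmse(vision_mm, reference_mm):.3f} mm")
print(f"NRMSE = {nrmse_percent(vision_mm, reference_mm):.2f} %")
```

In a multi-target setting, the same two functions would be applied per target, and the largest value across targets would correspond to a "below" bound of the kind reported above.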