This study introduces the crossing intention and trajectory with Transformer networks (CITraNet) prediction model, which processes multimodal input data of vulnerable road users (VRUs), such as pedestrians and bicyclists, whose behavior is inherently unpredictable. Unlike traditional approaches that rely on sequence transduction and Gaussian distribution-based models, CITraNet employs Transformer networks and a Gumbel distribution. First, we use multi-head attention and feed-forward layers to extract hidden features from historically observed data, which allows effective parallelization and handling of long-range dependencies. Second, CITraNet features a Transformer-based Gumbel distribution network that draws on extreme value theory to improve the model's ability to predict the range of possible trajectories, replacing conventional Gaussian distribution models that struggle with discrete and non-linear data. The effectiveness and accuracy of CITraNet are validated on the Taiwan pedestrian (TaPed) dataset, as well as on the publicly available JAAD and PIE datasets. The model's deterministic and stochastic trajectory predictions are assessed over short (0.5 s), medium (1.0 s), and long (1.5 s) horizons to gauge predictive accuracy across varying durations. The results demonstrate that CITraNet outperforms previous benchmarks.

Note to Practitioners: This study advances the capabilities of advanced driver assistance systems (ADAS) by introducing a vision-based, risk-aware method to predict the future crossing intentions and trajectories of pedestrians and bicyclists. While current ADAS technologies, such as pedestrian collision warning (PCW) systems, excel at detection, they fall short in predicting trajectories and intentions. This limitation can lead to inadequate braking response times and distances, increasing the risk of accidents.
Our approach offers timely predictions that support early collision warnings and enable automatic braking systems to stop vehicles safely, thereby preventing collisions. This innovation presents significant business opportunities in the development of hardware and software vision components for vehicle manufacturers, automotive component suppliers, and chip designers. However, a notable limitation of our current model is its exclusive focus on pedestrians as traffic agents. Future enhancements will aim to incorporate social interactions among all traffic participants, including vehicles and motorcyclists, to deliver a more comprehensive safety solution.
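
To make the stochastic evaluation described in the abstract more concrete, the following is a minimal sketch, not the CITraNet network itself. It draws Gumbel-distributed trajectory samples around a predicted path and scores them at the 0.5 s, 1.0 s, and 1.5 s horizons. Several details are assumptions made only for illustration: the 30 fps frame rate, the best-of-K average and final displacement errors (ADE and FDE) as metrics, the scale parameter `beta`, and all function names.

```python
import math
import random

FPS = 30  # assumed frame rate; the abstract does not state one


def sample_gumbel(mu, beta, rng):
    """Draw one Gumbel(mu, beta) sample by inverse-transform sampling."""
    u = max(rng.random(), 1e-12)  # avoid log(0) when u == 0
    return mu - beta * math.log(-math.log(u))


def displacement_errors(pred, truth):
    """Return (ADE, FDE) for two equal-length lists of (x, y) points."""
    dists = [math.dist(p, t) for p, t in zip(pred, truth)]
    return sum(dists) / len(dists), dists[-1]


def best_of_k(center_path, truth, beta, k, rng):
    """Sample k paths with Gumbel noise located at center_path.

    Returns the lowest ADE and the lowest FDE over the k samples
    (each minimum is taken independently). Note that a Gumbel
    distribution is skewed: its mean is mu + 0.5772 * beta, not mu.
    """
    best_ade = best_fde = float("inf")
    for _ in range(k):
        sample = [
            (sample_gumbel(x, beta, rng), sample_gumbel(y, beta, rng))
            for x, y in center_path
        ]
        ade, fde = displacement_errors(sample, truth)
        best_ade = min(best_ade, ade)
        best_fde = min(best_fde, fde)
    return best_ade, best_fde


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical ground truth and prediction, in pixel coordinates.
    truth = [(float(t), 0.5 * t) for t in range(45)]
    center_path = [(x + 0.3, y - 0.2) for x, y in truth]
    for seconds in (0.5, 1.0, 1.5):
        n = round(seconds * FPS)  # number of future frames in this horizon
        ade, fde = best_of_k(center_path[:n], truth[:n], beta=0.5, k=20, rng=rng)
        print(f"{seconds:.1f} s: ADE={ade:.3f}, FDE={fde:.3f}")
```

In a learned model, the location and scale of each Gumbel distribution would be produced by the network rather than fixed by hand as they are here.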