Human trajectories are closely linked to urban dynamics and functions. Learning urban region representations from human trajectories can therefore help us understand cities, support various downstream tasks such as land-use classification and crime prediction, and ultimately assist city planning and management. However, previous works fall short of fully mining the information embodied in human trajectories and neglect the data sparsity of some regions. We propose a multi-view graph contrastive learning (MVGCL) framework that learns urban region representations from human trajectories and the spatial adjacency between urban regions. First, we construct an outflow view based on human trajectories and devise data augmentation techniques to construct an inflow view. Then, we construct a spatial view based on the spatial adjacency between regions. Moreover, we use an MLP to extract spatial features from the spatial view and design two graph encoders to learn complete region representations. Finally, we employ a dual-multiplet loss function based on graph contrastive learning to maximize node consistency across views. Extensive experiments on Manhattan regions demonstrate that MVGCL outperforms state-of-the-art methods.
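The cross-view node-consistency objective can be illustrated with a minimal InfoNCE-style sketch. The abstract does not specify the dual-multiplet loss, so the function name `info_nce_loss`, the temperature `tau`, and the toy embeddings below are illustrative assumptions, not the paper's actual formulation:

```python
import numpy as np

def info_nce_loss(z1, z2, tau=0.5):
    """InfoNCE-style contrastive loss treating each node's embedding
    in the other view as its positive, and all other nodes as negatives."""
    # L2-normalize node embeddings from the two views.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = z1 @ z2.T / tau  # pairwise cross-view cosine similarities
    # Log-softmax over each row; the diagonal holds the positive pairs.
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))          # 8 regions, 16-dim embeddings
aligned = info_nce_loss(z, z)         # consistent views -> lower loss
mismatch = info_nce_loss(z, rng.normal(size=(8, 16)))
```

Minimizing such a loss pulls each region's embeddings from the two views together while pushing apart embeddings of different regions, which is the node-consistency behavior the framework aims for.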