Cross-situational word learning (CSWL), the ability to resolve word-referent ambiguity across encounters, is a powerful mechanism found in infants, children, and adults. Yet we know little about what predicts individual differences in CSWL, especially across different mapping structures, such as when each referent has a single name (1:1 mapping structure) or two names (2:1 mapping structure). Here, we investigated how multilingual experience and working memory skills (visuo-spatial and phonological) contributed to CSWL of 1:1 and 2:1 structures. Monolingual (n = 78) and multilingual (n = 106) adults completed CSWL tasks with 1:1 and 2:1 structures, a symmetry span task, and a listening span task. Results from path models showed that multilingualism predicted visuo-spatial working memory but not CSWL. Additionally, phonological working memory predicted CSWL accuracy for the 1:1 structure but not for the 2:1 structure. These findings highlight the importance of considering language experience and cognitive skills together to better understand the factors that promote individual CSWL skills.