The paper adopts a mixed-method approach (online and eye-tracking experiments) to investigate which image-text relation in multimodal texts, namely image-subordinate-to-text (IST) or text-subordinate-to-image (TSI), creates a stronger visual or verbal mental representation in second language (L2) learners. Thirty-eight Hungarian L2 learners with B1 English proficiency participated in the online experiment, reading and responding to multimodal texts with IST and TSI relations. In the eye-tracking experiment, the gaze patterns of L2 learners (N=9) were examined while they read IST and TSI multimodal texts. The initial results reveal that while a semantic gap between image and text encourages more intermodal interactions and longer eye fixations, redundancy, that is, the duplication of information across image and text, also fosters a strong mental model of the meaning. The present research may contribute to the development of a more comprehensive model of L2 multimodal and multimedia learning.