The recent coronavirus disease (COVID-19) became a pandemic that affected the entire globe. During the pandemic, we have observed a spike in mental health conditions such as anxiety, stress, and depression. Depression is a major contributor to the global burden of disease, yet it often goes undetected because of low awareness and people's unwillingness to consult a doctor. Nowadays, however, people extensively use online social media platforms to express their emotions and thoughts. Hence, social media platforms have become a large data source that can be utilized for detecting depression and mental illness. Existing approaches, however, often overlook data sparsity in tweets and the multimodal nature of social media. In this article, we propose a novel multimodal framework that combines textual, user-specific, and image analysis to detect depression among social media users. To provide sufficient context about a user's emotional state, we propose: 1) an extrinsic feature derived from the URLs present in tweets and 2) extraction of the textual content embedded in images posted in tweets. We also extract five sets of features belonging to different modalities to describe a user. In addition, we introduce a deep learning model, the visual neural network (VNN), to generate embeddings of user-posted images, which are used to create the visual feature vector for prediction. We contribute a curated COVID-19 dataset of depressed and nondepressed users for research purposes and demonstrate the effectiveness of our model in detecting depression during the COVID-19 outbreak. Our model outperforms existing state-of-the-art methods on a benchmark dataset by 2%-8% and produces promising results on the COVID-19 dataset. Our analysis highlights the impact of each modality and provides valuable insights into users' mental and emotional states.
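To make the fusion idea concrete, the sketch below shows one possible way such a multimodal framework could be wired together in PyTorch: a small CNN standing in for the VNN produces image embeddings, which are concatenated with textual and user-level feature vectors before classification. The module names, layer layout, and feature dimensions here are illustrative assumptions, not the architecture reported in the article.

```python
# Minimal illustrative sketch of a late-fusion multimodal classifier.
# All dimensions and the CNN layout of the "VNN" are assumptions for
# illustration; the paper's actual design may differ.
import torch
import torch.nn as nn

class VisualNeuralNetwork(nn.Module):
    """Toy CNN that maps a user-posted image to a fixed-size embedding."""
    def __init__(self, embed_dim: int = 64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(32, embed_dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        x = self.conv(images).flatten(1)   # (batch, 32)
        return self.fc(x)                  # (batch, embed_dim)

class MultimodalDepressionClassifier(nn.Module):
    """Concatenates textual, user-level, and visual feature vectors and
    predicts depressed vs. nondepressed."""
    def __init__(self, text_dim: int = 128, user_dim: int = 32, img_dim: int = 64):
        super().__init__()
        self.vnn = VisualNeuralNetwork(embed_dim=img_dim)
        self.head = nn.Sequential(
            nn.Linear(text_dim + user_dim + img_dim, 64), nn.ReLU(),
            nn.Linear(64, 2),              # logits for the two classes
        )

    def forward(self, text_feats, user_feats, images):
        img_embed = self.vnn(images)
        fused = torch.cat([text_feats, user_feats, img_embed], dim=1)
        return self.head(fused)

# Example forward pass on random tensors standing in for a batch of users.
model = MultimodalDepressionClassifier()
logits = model(torch.randn(4, 128), torch.randn(4, 32), torch.randn(4, 3, 64, 64))
print(logits.shape)  # torch.Size([4, 2])
```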