Atrial fibrillation (AF) is a type of cardiac arrhythmia affecting more than 37 million adults worldwide. This elusive disease is a major risk factor for ischemic stroke and is associated with increased morbidity and mortality. It is clinically diagnosed using medical-grade electrocardiogram (ECG) sensors in ambulatory settings. The recent emergence of consumer-grade wearables equipped with photoplethysmography (PPG) sensors has shown considerable promise for non-intrusive continuous monitoring in free-living conditions. However, the scarcity of large-scale public PPG datasets acquired from wearable devices hinders the development of intelligent automatic AF detection algorithms that are robust to motion artifacts, saturated ambient noise, inter- and intra-subject differences, and limited training data. In this work, we present a deep learning framework that combines convolutional layers with a bidirectional long short-term memory network (CNN-BiLSTM) and an attention mechanism to effectively distinguish raw AF rhythms from normal sinus rhythms (NSR). For robustness, we derive heart rate variability (HRV) and pulse rate variability (PRV) features and feed them to the framework as auxiliary inputs. A larger teacher model is trained using the MIT-BIH Arrhythmia ECG dataset. Through transfer learning (TL), its learned representation is adapted to a compressed student variant (32x smaller) using knowledge distillation (KD) for classifying AF with the UMass and MIMIC-III datasets of PPG signals. As a result, the student model yields average improvements in accuracy, sensitivity, F1 score, and Matthews correlation coefficient of 2.0%, 15.05%, 11.7%, and 9.85%, respectively, across both PPG datasets. Additionally, we employ Gradient-weighted Class Activation Mapping (Grad-CAM) to confer a notion of interpretability to the model decisions.
We conclude that combining TL and KD, i.e., pre-trained initialization, allows learned ECG concepts to be reused in data-scarce PPG scenarios. This can reduce resource usage and enable deployment on edge devices.
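To make the teacher-to-student transfer concrete, the following is a minimal sketch of a per-sample knowledge distillation loss. It assumes the commonly used formulation that blends a temperature-softened KL divergence against the teacher with a cross-entropy term on the true label; the abstract does not state the exact loss, so the function names, the temperature of 4.0, the weight `alpha`, and the class indexing (0 for NSR, 1 for AF) are illustrative assumptions rather than the settings used in this work.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.5):
    """Blend of soft-target KL divergence and hard-label cross-entropy.

    temperature and alpha are hypothetical defaults, not values from the paper.
    label is the true class index (assumed here: 0 = NSR, 1 = AF).
    """
    # Soft targets: both distributions are softened with the same temperature.
    p_teacher = softmax(teacher_logits, temperature)
    p_student_soft = softmax(student_logits, temperature)

    # KL(teacher || student) on the softened distributions.
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student_soft) if pt > 0.0)

    # Hard-label cross-entropy uses the unsoftened student output.
    p_student = softmax(student_logits)
    ce = -math.log(p_student[label])

    # The T^2 factor keeps soft-target gradients on a comparable scale.
    return alpha * temperature ** 2 * kl + (1.0 - alpha) * ce
```

In this sketch, a student whose logits match the teacher's incurs no soft-target penalty and is trained only by the label term, while disagreement with the teacher raises the loss even when the hard-label prediction is unchanged.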