Facial action unit (AU) detection provides precise measurements of facial appearance variations and holds great significance in affective computing, human-computer interaction, and negotiation. Subject-invariant AU detection remains challenging, primarily due to distribution variations among individuals. More importantly, the inherent subtlety and localized nature of facial actions frequently allow interference factors, particularly those related to individual identity, to dominate the learned representations. To tackle these issues, we propose a novel knowledge-driven hierarchical feature alignment (KHFA) framework that investigates the multifaceted consistency within facial action representations. AUs unambiguously define the facial appearance variations induced by specific groups of facial muscle movements, while the intrinsic physiological interconnections between these muscles impose substantial constraints on the correlations between different AUs. KHFA therefore introduces a dual classwise alignment scheme to balance consistency within the same class and coherence across different categories. Furthermore, the similarity of sample-level AU combinations reflects the semantic proximity of global features within the feature space; KHFA thus incorporates inter-sample relationships via a multilabel alignment scheme to enhance the coherence of semantic information across samples. Finally, a hybrid attention mechanism equipped with an importance-aware feature fusion layer is proposed to capture nuanced spatial features specific to individual AUs and to embed AU correlations. Extensive experiments on two benchmark datasets, BP4D and DISFA, show that KHFA outperforms state-of-the-art methods, underscoring the effectiveness and superiority of our approach.
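To make the multilabel alignment idea concrete, the following is a minimal sketch, not the paper's implementation: it assumes a PyTorch setting in which pairwise similarity of global features is encouraged to mirror pairwise similarity of the multi-hot AU label vectors. The function name, the use of cosine similarity on both sides, the MSE matching term, and the weighting factor `lambda_align` are illustrative assumptions, not details taken from KHFA.

```python
import torch
import torch.nn.functional as F

def multilabel_alignment_loss(features: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch: align inter-sample feature similarity with
    sample-level AU label-combination similarity.

    features: (B, D) global feature vectors for a mini-batch.
    labels:   (B, C) multi-hot AU occurrence vectors (C = number of AUs).
    """
    # Pairwise cosine similarity between global features.
    f = F.normalize(features, dim=1)
    feat_sim = f @ f.t()                       # (B, B)

    # Pairwise cosine similarity between multi-hot AU label combinations.
    l = F.normalize(labels.float(), dim=1)     # all-zero rows stay zero
    label_sim = l @ l.t()                      # (B, B)

    # Encourage semantic proximity in feature space to reflect label similarity.
    return F.mse_loss(feat_sim, label_sim)

# Illustrative usage: added to the detection objective with a hypothetical weight.
# loss = detection_loss + lambda_align * multilabel_alignment_loss(global_feats, au_labels)
```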