CMVC+: A Multi-View Clustering Framework for Open Knowledge Base Canonicalization via Contrastive Learning

Times cited: 0
Authors
Yang, Yang [1]
Shen, Wei [2]
Shu, Junfeng [2]
Liu, Yinan [3]
Curry, Edward [1]
Li, Guoliang [4]
Affiliations
[1] Univ Galway, Insight SFI Res Ctr Data Analyt, Galway H91 AEX4, Ireland
[2] Nankai Univ, Coll Comp Sci, DISSec, Tianjin 300350, Peoples R China
[3] Northeastern Univ, Sch Comp Sci & Engn, Shenyang 110819, Peoples R China
[4] Tsinghua Univ, Dept Comp Sci, Beijing 100190, Peoples R China
Funding
Science Foundation Ireland; National Natural Science Foundation of China
Keywords
Contrastive learning; Clustering algorithms; Knowledge based systems; Organizations; Electronic mail; Data mining; Ontologies; Information retrieval; Indexes; Training; Open knowledge base canonicalization; multi-view clustering; contrastive learning;
DOI
10.1109/TKDE.2025.3543423
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Open information extraction (OIE) methods extract large numbers of OIE triples <noun phrase, relation phrase, noun phrase> from unstructured text, which compose large open knowledge bases (OKBs). Noun phrases and relation phrases in such OKBs are not canonicalized, which leads to scattered and redundant facts. Two views of knowledge (i.e., a fact view based on the fact triple and a context view based on the fact triple's source context) are found to provide complementary information that is vital to the task of OKB canonicalization, which clusters synonymous noun phrases and relation phrases into the same group and assigns them unique identifiers. To leverage these two views of knowledge jointly, we propose CMVC+, a novel unsupervised framework for canonicalizing OKBs without the need for manually annotated labels. Specifically, we propose a multi-view CHF K-Means clustering algorithm to mutually reinforce the clustering of view-specific embeddings learned from each view by considering the clustering quality in a fine-grained manner. Furthermore, we propose a novel contrastive learning module to refine the learned view-specific embeddings and further enhance the canonicalization performance. We demonstrate the superiority of our framework through extensive experiments on multiple real-world OKB data sets against state-of-the-art methods.
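The canonicalization task described in the abstract (grouping synonymous phrases using two views and assigning each group a unique identifier) can be illustrated with a minimal sketch. This is not the CMVC+ algorithm: the multi-view CHF K-Means step and the contrastive learning module are replaced here by a plain average of per-view cosine similarities followed by threshold-based merging. The function name `canonicalize`, the threshold value, and the toy two-dimensional embeddings are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def canonicalize(phrases, fact_view, context_view, threshold=0.8):
    """Assign a cluster identifier to each phrase (hypothetical helper).

    Two phrases are merged when the average of their fact-view and
    context-view cosine similarities reaches the threshold. This simple
    rule stands in for the clustering step and is an assumption, not the
    method from the paper.
    """
    parent = list(range(len(phrases)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(phrases)):
        for j in range(i + 1, len(phrases)):
            sim = 0.5 * (
                cosine(fact_view[i], fact_view[j])
                + cosine(context_view[i], context_view[j])
            )
            if sim >= threshold:
                parent[find(j)] = find(i)

    # Map each cluster root to a compact identifier 0, 1, 2, ...
    root_to_id = {}
    ids = {}
    for i, phrase in enumerate(phrases):
        root = find(i)
        if root not in root_to_id:
            root_to_id[root] = len(root_to_id)
        ids[phrase] = root_to_id[root]
    return ids


# Toy, hand-made embeddings (hypothetical data, one row per phrase).
phrases = ["NYC", "New York City", "Paris"]
fact_view = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
context_view = [[1.0, 0.1], [1.0, 0.0], [0.1, 1.0]]

ids = canonicalize(phrases, fact_view, context_view)
print(ids)
```

With these toy vectors, "NYC" and "New York City" are close in both views and share an identifier, while "Paris" receives its own.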
Pages: 2296-2310
Page count: 15