A Framework for Large-Scale Synthetic Graph Dataset Generation

Citations: 0
Authors
Darabi, Sajad [1]
Bigaj, Piotr [1]
Majchrowski, Dawid [1]
Kasymov, Artur [1, 2]
Morkisz, Pawel [1, 3]
Fit-Florea, Alex [1]
Affiliations
[1] NVIDIA, Santa Clara, CA 95050 USA
[2] Jagiellonian Univ, Fac Math & Comp Sci, Doctoral Sch Exact & Nat Sci, PL-31007 Krakow, Poland
[3] AGH Univ Krakow, Fac Appl Math, PL-30059 Krakow, Poland
Keywords
Generators; Mathematical models; Biological system modeling; Deep learning; Iron; Complexity theory; Synthetic data; Stochastic processes; Social networking (online); Learning systems; Big data applications; graph neural networks (GNNs); machine learning; synthetic data; NETWORKS;
DOI
10.1109/TNNLS.2025.3540392
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Recently, there has been increasing interest in developing and deploying deep graph learning algorithms for various tasks, such as fraud detection and recommender systems. However, there is a limited number of publicly available graph-structured datasets, most of which are small compared with production-sized applications or limited in their application domain. In this work, we tackle this shortcoming by proposing a synthetic graph generation tool that enables scaling datasets to production-size graphs with trillions of edges and billions of nodes. The proposed method comprises a series of parametric models that can either be randomly initialized or fit to proprietary datasets. These models can then be released to researchers to study graph methods on the synthetic data, facilitating prototype development and novel applications. We demonstrate the generalizability of the framework across various datasets, mimicking their structural and feature distributions, and show that it can scale them to varying sizes, which makes it useful for benchmarking and model development. Code can be found on GitHub.
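The fit-then-scale workflow described in the abstract can be illustrated with a small sketch. The abstract does not say which parametric models the framework uses, so this example assumes an R-MAT-style recursive edge generator as a stand-in for the structure model; the function names `rmat_edges` and `fit_rmat` and the default quadrant probabilities are hypothetical, and node or edge features are not modeled at all.

```python
import random


def rmat_edges(scale, num_edges, probs=(0.57, 0.19, 0.19, 0.05), seed=0):
    """Sample directed edges over 2**scale nodes with an R-MAT-style recursion.

    Assumption: R-MAT is used here only as an illustrative parametric model;
    it is not confirmed to be the model used by the framework.
    """
    a, b, c, d = probs
    if abs(a + b + c + d - 1.0) > 1e-9:
        raise ValueError("quadrant probabilities must sum to 1")
    rng = random.Random(seed)
    edges = []
    for _ in range(num_edges):
        src = dst = 0
        for _ in range(scale):
            # Pick one quadrant of the adjacency matrix per recursion level.
            r = rng.random()
            if r < a:
                row, col = 0, 0
            elif r < a + b:
                row, col = 0, 1
            elif r < a + b + c:
                row, col = 1, 0
            else:
                row, col = 1, 1
            src = (src << 1) | row
            dst = (dst << 1) | col
        edges.append((src, dst))
    return edges


def fit_rmat(edges, scale):
    """Estimate quadrant probabilities from an edge list by counting,
    at every bit level, which (source bit, destination bit) pair occurs."""
    if not edges or scale < 1:
        raise ValueError("need at least one edge and scale >= 1")
    counts = [0, 0, 0, 0]
    for src, dst in edges:
        for level in range(scale):
            row = (src >> level) & 1
            col = (dst >> level) & 1
            counts[2 * row + col] += 1
    total = sum(counts)
    return tuple(cnt / total for cnt in counts)


# Stand-in for a proprietary graph: 1,024 nodes, 20,000 edges.
observed = rmat_edges(scale=10, num_edges=20000, seed=1)
# Fit the parametric model; only these four numbers would need to be shared.
fitted = fit_rmat(observed, scale=10)
# Generate a larger synthetic graph (4,096 nodes) from the fitted parameters.
scaled = rmat_edges(scale=12, num_edges=80000, probs=fitted, seed=2)
```

In this sketch only the four fitted probabilities leave the data owner, which mirrors the idea of releasing fitted models instead of the underlying graph. A real generator at the sizes quoted in the abstract would need batched, parallel sampling, which is omitted here.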
Pages: 11
Related papers
50 in total
  • [31] A Hybrid Deep Learning Model for Predicting Depression Symptoms From Large-Scale Textual Dataset
    Almutairi, Sulaiman
    Abohashrh, Mohammed
    Razzaq, Hasanain Hayder
    Zulqarnain, Muhammad
    Namoun, Abdallah
    Khan, Faheem
    IEEE ACCESS, 2024, 12 : 168477 - 168499
  • [32] A large-scale container dataset and a baseline method for container hole localization
    Diao, Yunfeng
    Tang, Xin
    Wang, He
    Taylor, Emma Christophine Florence
    Xiao, Shirui
    Xie, Mengtian
    Cheng, Wenming
    JOURNAL OF REAL-TIME IMAGE PROCESSING, 2022, 19 (03) : 577 - 589
  • [33] Introduction and Analysis of a Large-Scale Benchmark Automatic Vehicle Identification Dataset
    He, Zhaocheng
    Chen, Kaiying
    Chen, Xinyu
    Sun, Weiwei
    INTERNATIONAL CONFERENCE ON TRANSPORTATION AND DEVELOPMENT 2018: CONNECTED AND AUTONOMOUS VEHICLES AND TRANSPORTATION SAFETY, 2018, : 35 - 43
  • [34] Dataset Generation Patterns for Evaluating Knowledge Graph Construction
    Schroeder, Markus
    Jilek, Christian
    Dengel, Andreas
    SEMANTIC WEB: ESWC 2021 SATELLITE EVENTS, 2021, 12739 : 27 - 32
  • [35] MANNET: A LARGE-SCALE MANIPULATED IMAGE DETECTION DATASET AND BASELINE EVALUATIONS
    Singh, Aditya
    Chhabra, Saheb
    Majumdar, Puspita
    Singh, Richa
    Vatsa, Mayank
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 1780 - 1784
  • [36] PESTD: a large-scale Persian-English scene text dataset
    Rashtehroudi, Atefeh Ranjkesh
    Akoushideh, Alireza
    Shahbahrami, Asadollah
    MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 82 (22) : 34793 - 34808
  • [37] MSVD-Turkish: A Large-Scale Dataset for Video Captioning in Turkish
    Citamak, Begum
    Kuyu, Menekse
    Erdem, Aykut
    Erdem, Erkut
    2019 27TH SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS CONFERENCE (SIU), 2019,
  • [38] Large-scale multiview 3D hand pose dataset
    Gomez-Donoso, Francisco
    Orts-Escolano, Sergio
    Cazorla, Miguel
    IMAGE AND VISION COMPUTING, 2019, 81 : 25 - 33
  • [39] PESTD: a large-scale Persian-English scene text dataset
    Atefeh Ranjkesh Rashtehroudi
    Alireza Akoushideh
    Asadollah Shahbahrami
    Multimedia Tools and Applications, 2023, 82 : 34793 - 34808
  • [40] A large-scale container dataset and a baseline method for container hole localization
    Yunfeng Diao
    Xin Tang
    He Wang
    Emma Christophine Florence Taylor
    Shirui Xiao
    Mengtian Xie
    Wenming Cheng
    Journal of Real-Time Image Processing, 2022, 19 : 577 - 589