A Framework for Large-Scale Synthetic Graph Dataset Generation

Cited by: 0
Authors
Darabi, Sajad [1]
Bigaj, Piotr [1]
Majchrowski, Dawid [1]
Kasymov, Artur [1, 2]
Morkisz, Pawel [1, 3]
Fit-Florea, Alex [1]
Affiliations
[1] NVIDIA, Santa Clara, CA 95050 USA
[2] Jagiellonian Univ, Fac Math & Comp Sci, Doctoral Sch Exact & Nat Sci, PL-31007 Krakow, Poland
[3] AGH Univ Krakow, Fac Appl Math, PL-30059 Krakow, Poland
Keywords
Generators; Mathematical models; Biological system modeling; Deep learning; Iron; Complexity theory; Synthetic data; Stochastic processes; Social networking (online); Learning systems; Big data applications; graph neural networks (GNNs); machine learning; synthetic data; NETWORKS;
DOI
10.1109/TNNLS.2025.3540392
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Recently, there has been increasing interest in developing and deploying deep graph learning algorithms for various tasks, such as fraud detection and recommender systems. However, there is a limited number of publicly available graph-structured datasets, most of which are small compared with production-sized applications or limited in their application domain. In this work, we tackle this shortcoming by proposing a synthetic graph generation tool that enables scaling datasets to production-size graphs with trillions of edges and billions of nodes. The proposed method comprises a series of parametric models that can either be randomly initialized or fit to proprietary datasets. These models can then be released to researchers to study graph methods on the synthetic data, facilitating prototype development and novel applications. We demonstrate the generalizability of the framework across various datasets, mimicking their structural and feature distributions, as well as the ability to scale them to varying sizes, demonstrating their usefulness for benchmarking and model development. Code can be found on GitHub.
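The fit-then-scale workflow described in the abstract can be illustrated with a small sketch. The abstract does not say which parametric models the framework uses, so the example below swaps in an R-MAT-style recursive generator as an assumed stand-in. The function names `rmat_edges` and `fit_rmat`, the default probabilities, and the chosen sizes are all hypothetical and are not taken from the paper or its code. Features are not modeled here, only structure.

```python
import random
from collections import Counter

# Assumption: an R-MAT-style model stands in for the paper's unspecified
# parametric structure generator. All names here are hypothetical.

def rmat_edges(scale, num_edges, probs=(0.57, 0.19, 0.19, 0.05), seed=0):
    """Sample directed edges on 2**scale nodes.

    For every edge, each of the `scale` levels picks one quadrant of the
    adjacency matrix with probabilities (a, b, c, d), which fixes one bit
    of the source id and one bit of the destination id.
    """
    a, b, c, d = probs
    assert abs(a + b + c + d - 1.0) < 1e-9, "quadrant probabilities must sum to 1"
    rng = random.Random(seed)
    edges = []
    for _edge in range(num_edges):
        src = dst = 0
        for _level in range(scale):
            r = rng.random()
            src <<= 1
            dst <<= 1
            if r < a:
                pass                 # quadrant (0, 0)
            elif r < a + b:
                dst |= 1             # quadrant (0, 1)
            elif r < a + b + c:
                src |= 1             # quadrant (1, 0)
            else:
                src |= 1             # quadrant (1, 1)
                dst |= 1
        edges.append((src, dst))
    return edges

def fit_rmat(edges, scale):
    """Estimate (a, b, c, d) as quadrant frequencies over all bit levels.

    This treats node ids as given (no relabeling), which is a
    simplification that real datasets would not satisfy directly.
    """
    counts = Counter()
    for src, dst in edges:
        for level in range(scale):
            counts[((src >> level) & 1, (dst >> level) & 1)] += 1
    total = sum(counts.values())
    quadrants = ((0, 0), (0, 1), (1, 0), (1, 1))
    return tuple(counts[q] / total for q in quadrants)

# A small synthetic graph plays the role of a private source dataset.
small = rmat_edges(scale=8, num_edges=4000, seed=1)
# Fit the parameters; only these four numbers would need to be shared.
params = fit_rmat(small, scale=8)
# Generate a graph with 16x the nodes and 16x the edges from the fit.
big = rmat_edges(scale=12, num_edges=64000, probs=params, seed=2)
```

The point of the sketch is that the fitted parameters, not the source edges, drive generation, so the output size can be chosen independently of the original graph.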
Pages: 11