HLS-Based Large Scale Self-Organizing Feature Maps

Cited by: 0

Authors:

Porrmann, Florian [1]

Hagemeyer, Jens [1]

Porrmann, Mario [2]

Affiliations:

[1] Bielefeld Univ, Cognitron & Sensor Syst Grp, CITEC, D-33615 Bielefeld, Germany

[2] Osnabruck Univ, Inst Comp Sci, Comp Engn Grp, D-49090 Osnabruck, Germany

Source:

IEEE ACCESS | 2024 / Vol. 12

Keywords:

Neurons; Computer architecture; Clustering algorithms; Training; Graphics processing units; Statistical analysis; Space exploration; Heterogeneous networks; Field programmable gate array; hardware acceleration; machine learning; reconfigurable architectures; reconfigurable computing; heterogeneous computing; heterogeneous architectures; self-organizing feature maps; optimization; design space exploration;

DOI:

10.1109/ACCESS.2024.3471471

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

The Self-Organizing Map (SOM) algorithm is a clustering algorithm used in a wide variety of application domains. Over the last few decades, it has been accelerated using various hardware architectures, including FPGAs, CPUs, and GPUs. This publication presents a High-Level Synthesis-based implementation that utilizes multiple processing elements to realize a high-performance system architecture. An extensive design space exploration was conducted to evaluate the performance range of the architecture, using vector dimensions from 8 to 512 and map sizes from 16x16 to 512x512. The evaluation was performed on two different AMD/Xilinx UltraScale+ FPGA systems: the VCU128 PCIe-based accelerator card and the ZCU106 stand-alone evaluation kit. The results show that, for a given vector dimension, performance scales nearly linearly as the map size is increased. In addition, the energy efficiency of both FPGAs was analyzed, revealing that while the ZCU106 is less powerful in terms of raw compute power, it requires up to 4x less power and, depending on the configuration, can be 2x more energy efficient than the VCU128. One of the main reasons for this is that it does not require a dedicated host system but utilizes its internal ARM cores. Finally, a comparison against state-of-the-art SOM implementations was conducted. The proposed design achieves speed-up factors of up to 458.7, 1,630.4, and 4.9 compared to other CPU, GPU, and FPGA realizations, respectively.
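For readers unfamiliar with the algorithm named in the abstract, the following is a minimal sketch of one online SOM training step in plain Python: a best-matching-unit (BMU) search followed by a neighborhood-weighted update. It illustrates the generic textbook formulation only and is not the HLS architecture described in the paper. The function names (`make_map`, `best_matching_unit`, `train_step`), the Gaussian neighborhood function, and the parameters `lr` and `sigma` are assumptions chosen for illustration.

```python
import math
import random


def make_map(rows, cols, dim, seed=0):
    """Create a rows x cols grid of neurons with random dim-dimensional weights."""
    rng = random.Random(seed)
    return [[[rng.random() for _ in range(dim)] for _ in range(cols)]
            for _ in range(rows)]


def best_matching_unit(weights, x):
    """Return (row, col) of the neuron whose weight vector is closest to x."""
    best, best_dist = (0, 0), math.inf
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            # Squared Euclidean distance; the square root is not needed
            # to find the minimum.
            d = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
            if d < best_dist:
                best, best_dist = (r, c), d
    return best


def train_step(weights, x, lr, sigma):
    """Pull every neuron toward x, weighted by a Gaussian around the BMU."""
    br, bc = best_matching_unit(weights, x)
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            grid_dist_sq = (r - br) ** 2 + (c - bc) ** 2
            # Assumed Gaussian neighborhood: 1.0 at the BMU, decaying with
            # distance on the map grid.
            h = math.exp(-grid_dist_sq / (2.0 * sigma ** 2))
            for i in range(len(w)):
                w[i] += lr * h * (x[i] - w[i])
    return br, bc


# Usage: a 16x16 map with 8-dimensional vectors, the smallest
# configuration mentioned in the abstract.
som = make_map(16, 16, 8)
sample = [0.5] * 8
bmu = train_step(som, sample, lr=0.1, sigma=2.0)
```

Both loops visit every neuron and every vector component, so the work per input vector grows with map size times vector dimension. The distance computations are independent of one another, which is what makes the algorithm a natural fit for parallel processing elements.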


Pages: 142459-142474

Number of pages: 16
