Design Space Exploration of the Turbo Decoding Algorithm on GPUs

被引:12
作者
Lee, Dongwon [1 ]
Wolf, Marilyn [1 ]
Kim, Hyesoon [2 ]
机构
[1] Georgia Inst Technol, Sch ECE, Atlanta, GA 30332 USA
[2] Georgia Inst Technol, Sch Comp Sci, Atlanta, GA 30332 USA
来源
PROCEEDINGS OF THE 2010 INTERNATIONAL CONFERENCE ON COMPILERS, ARCHITECTURES AND SYNTHESIS FOR EMBEDDED SYSTEMS (CASES '10) | 2010年
关键词
GPU; Turbo decoding; Design space exploration; CODES; IMPLEMENTATION;
D O I
10.1145/1878921.1878953
中图分类号
TP3 [计算技术、计算机技术];
学科分类号
0812 ;
摘要
In this paper, we explore the design space of the Turbo decoding algorithm on GPUs and find a performance bottleneck. We consider three axes for the design space exploration: a radix degree, a parallelization method, and the number of sub-frames per thread block. In Turbo decoding, a degree of radix affects computational complexity and memory access patterns in both algorithmic and implementation viewpoints. Second, computations of branch metrics (BMs) and state metrics (SMs) have a different degree of parallelism, which affects the mapping method of computational tasks to GPU threads. Finally, we can easily adjust the number of sub-frames per thread block to balance the occupancy and memory access traffic. Experimental results show that the radix-4 algorithm with the SM-centric mapping method shows the best performance at four sub-frames per thread block. According to our analysis, two factors - the occupancy and shared memory bank conflicts - differentiate the performance of different cases in the design space. We show further performance improvements by optimizing a kernel operation (max*) and applying the MAX-Log-Maximum A Posteriori (MAP) algorithm. A performance bottleneck at the finally optimized case is global memory access latency. Since the most optimized performance is comparable to that of the other programmable platforms, the GPU can be considered as another type of coprocessor for Turbo decoding implementations in mobile devices.
引用
收藏
页码:217 / 226
页数:10
相关论文
共 17 条
[1]  
3rd Generation Partnership Project (3GPP), 1999, 3 GENERATION PARTNER
[2]   OPTIMAL DECODING OF LINEAR CODES FOR MINIMIZING SYMBOL ERROR RATE [J].
BAHL, LR ;
COCKE, J ;
JELINEK, F ;
RAVIV, J .
IEEE TRANSACTIONS ON INFORMATION THEORY, 1974, 20 (02) :284-287
[3]  
Boutillon E., 2007, P IEEE JUN, P1201
[4]  
Ebel W. J., 1999, TECHNICAL REPORT, DOI Alexandria Research Institute-Virginia Polytechnic Institute and State University
[5]   REDUCED COMPLEXITY SYMBOL DETECTORS WITH PARALLEL STRUCTURES FOR ISI CHANNELS [J].
ERFANIAN, J ;
PASUPATHY, S ;
GULAK, G .
IEEE TRANSACTIONS ON COMMUNICATIONS, 1994, 42 (2-4) :1661-1671
[6]   How GPUs Can Outperform ASICs for Fast LDPC Decoding [J].
Falcao, Gabriel ;
Silva, Vitor ;
Sousa, Leonel .
ICS'09: PROCEEDINGS OF THE 2009 ACM SIGARCH INTERNATIONAL CONFERENCE ON SUPERCOMPUTING, 2009, :390-399
[7]  
HAGENAUER J, 1989, DALLAS GLOBECOM 89, VOLS 1-3, P1680, DOI 10.1109/GLOCOM.1989.64230
[8]   New schemes in clustered VLIW processors applied to turbo decoding [J].
Ituero, Pablo ;
Lopez-Vallejo, Marisa .
IEEE 17TH INTERNATIONAL CONFERENCE ON APPLICATION-SPECIFIC SYSTEMS, ARCHITECTURES AND PROCESSORS, PROCEEDINGS, 2006, :291-+
[9]  
Kienle F., 2003, ADV RADIO SCI, V1, P259
[10]   Area-efficient high-throughput MAP decoder architectures [J].
Lee, SJ ;
Shanbhag, NR ;
Singer, AC .
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, 2005, 13 (08) :921-933