Joint matrix quantization of face parameters and LPC coefficients for low bit rate audiovisual speech coding

Cited by: 6

Authors:

Girin, L [1]

Affiliations:

[1] Univ Grenoble 3, CNRS, INPG, Inst Commun Parlee, F-38031 Grenoble, France

Source:

IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING | 2004, Vol. 12, No. 03

Keywords:

audiovisual; lip parameters; low-bit-rate speech coding; LPC parameters; matrix quantization; speech processing;

DOI:

10.1109/TSA.2003.822626

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification codes:

070206 ; 082403 ;

Abstract:

A key problem for videophony, that is, telephony that processes images of the speaker's face in addition to acoustic speech, is signal compression for transmission. In such systems, audio and video are compressed separately, each by its own coder. This paper considers an audio-visual approach to the problem: we claim that coding systems should exploit the fundamental coherence (redundancy) between the two modalities of speech. We work within the framework of parametric analysis, modeling and synthesis of talking faces, which allows an efficient representation of video information. We thus propose to jointly encode several face parameters, namely geometric descriptors of lip shape, together with sets of audio coefficients, namely standard LPC parameters. Defining an audiovisual distance between vectors of concatenated audio and video parameters makes it possible to design single-stage audiovisual vector and matrix quantizers using the generalized Lloyd algorithm. Mean video and audio distortion measures show a significant gain in quantization accuracy and/or resolution compared with separate video and audio quantization. An alternative sub-optimal tree-like structure for joint audiovisual coding is also tested; it yields interesting results while reducing the computational complexity of the quantization process.
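The joint quantizer design described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not give the form of the audiovisual distance, so the weighted sum of squared Euclidean distances below, the weight `w_video`, the split index `n_audio`, and all function names are assumptions made for illustration. A matrix quantizer could be handled the same way by flattening several consecutive frames into one long vector, which is again an assumption.

```python
import random


def av_distance(x, c, n_audio, w_video=1.0):
    """Assumed audiovisual distance on concatenated [audio | video] vectors:
    squared Euclidean audio term plus a weighted squared Euclidean video term."""
    d_audio = sum((a - b) ** 2 for a, b in zip(x[:n_audio], c[:n_audio]))
    d_video = sum((a - b) ** 2 for a, b in zip(x[n_audio:], c[n_audio:]))
    return d_audio + w_video * d_video


def nearest(x, codebook, n_audio, w_video=1.0):
    """Index of the codeword closest to x under the audiovisual distance."""
    return min(range(len(codebook)),
               key=lambda k: av_distance(x, codebook[k], n_audio, w_video))


def train_codebook(train, n_codes, n_audio, w_video=1.0, n_iter=20, seed=0):
    """Generalized Lloyd iterations: nearest-neighbour partition, then
    centroid update. With constant per-component weights, the centroid
    that minimizes the weighted squared error is the arithmetic mean."""
    rng = random.Random(seed)
    codebook = [list(v) for v in rng.sample(train, n_codes)]
    for _ in range(n_iter):
        cells = [[] for _ in codebook]
        for x in train:
            cells[nearest(x, codebook, n_audio, w_video)].append(x)
        for k, cell in enumerate(cells):
            if cell:  # an empty cell keeps its previous codeword
                codebook[k] = [sum(col) / len(cell) for col in zip(*cell)]
    return codebook


def mean_distortion(data, codebook, n_audio, w_video=1.0):
    """Average audiovisual distance between each vector and its codeword."""
    total = sum(
        av_distance(x, codebook[nearest(x, codebook, n_audio, w_video)],
                    n_audio, w_video)
        for x in data)
    return total / len(data)
```

Under these assumptions, `w_video` sets the trade-off between audio and video accuracy within a single shared codebook index, which is where a joint quantizer can save bits relative to two separate codebooks.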

References

Pages: 265-276

Page count: 12

30 items in total

[1] BALLY G, 2001, P EUR TUT RES WORKSH

[2] BEOIT C, 1992, TALKING MACHINES THE, P485

[3] BERSTEIN LE, 1996, P ICSLP, P1477

[4] Chen TH, 2001, IEEE SIGNAL PROC MAG, V18, P9

[5] Doenges, PK; Capin, TK; Lavagetto, F; Ostermann, J; Pandzic, IS; Petajan, ED. Audio/video and synthetic graphics/audio for mixed media [J]. SIGNAL PROCESSING-IMAGE COMMUNICATION, 1997, 9 (04): 433-463

[6] ELISEI F, 2001, P COMPR REPR SIGN AU, P145

[7] ELISEI F, 2001, P AUD VIS SPEECH PRO, P90

[8] Girin, L; Schwartz, JL; Feng, G. Audio-visual enhancement of speech in noise [J]. JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 2001, 109 (06): 3007-3020

[9] GOFF BLB, 1995, P EUR C SPEECH COMM, P291

[10] GRAY RM, 1992, VECTOR QUANTIZATION
