Audio content analysis for online audiovisual data segmentation and classification

Cited by: 238

Authors:

Zhang, T [1]

Kuo, CCJ [1]

Affiliations:

[1] Univ So Calif, Integrated Media Syst Ctr, Los Angeles, CA 90089 USA

Source:

IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING | 2001, Vol. 9, No. 4

Keywords:

audio analysis; audio indexing; audio segmentation; audiovisual content parsing; information filtering and retrieval; multimedia database management

DOI:

10.1109/89.917689

Chinese Library Classification (CLC) number:

O42 [Acoustics]

Discipline classification codes:

070206; 082403

Abstract:

While current approaches to audiovisual data segmentation and classification focus mostly on visual cues, audio signals may actually play a more important role in content parsing for many applications. An approach to automatic segmentation and classification of audiovisual data based on audio content analysis is proposed. The audio signal from movies or TV programs is segmented and classified into basic types such as speech, music, song, environmental sound, speech with music background, environmental sound with music background, silence, etc. Simple audio features, including the energy function, the average zero-crossing rate, the fundamental frequency, and the spectral peak tracks, are extracted to ensure the feasibility of real-time processing. A heuristic rule-based procedure, built upon morphological and statistical analysis of the time-varying functions of these audio features, is proposed to segment and classify audio signals. Experimental results show that the proposed scheme achieves an accuracy rate of more than 90% in audio classification.
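As a rough illustration of two of the features named in the abstract, the sketch below computes a short-time energy function and an average zero-crossing rate over fixed-length frames, then applies a toy threshold rule. It is not the authors' procedure: the function names, frame length, hop size, thresholds, and labels are all hypothetical choices made for the example, and the fundamental frequency and spectral peak track features are not covered.

```python
import math


def frames(signal, frame_len, hop):
    """Split samples into overlapping frames; a trailing partial frame is dropped."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]


def short_time_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ (zero counts as non-negative)."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def label_frame(frame, silence_energy=1e-4, zcr_threshold=0.1):
    """Toy rule with assumed thresholds; the paper's actual rules are not reproduced here."""
    if short_time_energy(frame) < silence_energy:
        return "silence"
    return "high-zcr" if zero_crossing_rate(frame) > zcr_threshold else "low-zcr"


if __name__ == "__main__":
    sample_rate = 8000  # assumed sampling rate in Hz
    tone = [math.sin(2 * math.pi * 100 * n / sample_rate) for n in range(sample_rate)]
    for frame in frames(tone, frame_len=256, hop=128)[:3]:
        print(short_time_energy(frame), zero_crossing_rate(frame), label_frame(frame))
```

Both features need only one pass over each frame with a handful of arithmetic operations per sample, which is consistent with the abstract's emphasis on real-time feasibility.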

Pages: 441-457

Page count: 17
