Statistical Models for Text Segmentation

Cited by: 24
Authors
Doug Beeferman
Adam Berger
John Lafferty
Affiliation
[1] Carnegie Mellon University, School of Computer Science
Source
Machine Learning | 1999 / Vol. 34
Keywords
exponential models; text segmentation; maximum entropy; inductive learning; natural language processing; decision trees; language modeling
DOI
Not available
Abstract
This paper introduces a new statistical approach to automatically partitioning text into coherent segments. The approach is based on a technique that incrementally builds an exponential model to extract features that are correlated with the presence of boundaries in labeled training text. The models use two classes of features: topicality features that use adaptive language models in a novel way to detect broad changes of topic, and cue-word features that detect occurrences of specific words, which may be domain-specific, that tend to be used near segment boundaries. Assessment of our approach on quantitative and qualitative grounds demonstrates its effectiveness in two very different domains, Wall Street Journal news articles and television broadcast news story transcripts. Quantitative results on these domains are presented using a new probabilistically motivated error metric, which combines precision and recall in a natural and flexible way. This metric is used to make a quantitative assessment of the relative contributions of the different feature types, as well as a comparison with decision trees and previously proposed text segmentation algorithms.
Pages: 177-210
Page count: 33