Bayesian Browsing Model: Exact Inference of Document Relevance from Petabyte-Scale Data

被引:16
作者
Liu, Chao [1 ]
Guo, Fan [2 ]
Faloutsos, Christos [2 ]
机构
[1] Microsoft Res, Redmond, WA 98052 USA
[2] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
基金
美国国家科学基金会;
关键词
Algorithms; Experimentation; Performance; Bayesian models; click log analysis; Web search; DATA STREAMS;
D O I
10.1145/1857947.1857951
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
A fundamental challenge in utilizing Web search click data is to infer user-perceived relevance from the search log. Not only is the inference a difficult problem involving statistical reasonings but the bulky size, together with the ever-increasing nature, of the log data imposes extra requirements on scalability. In this paper, we propose the Bayesian Browsing Model (BBM), which performs exact inference of the document relevance, only requires a single pass of the data (i.e., the optimal scalability), and is shown effective. We present two sets of experiments to evaluate the model effectiveness and scalability. On the first set of over 50 million search instances of 1.1 million distinct queries, BBM outperforms the state-of-the-art competitor by 29.2% in log-likelihood while being 57 times faster. On the second click log set, spanning a quarter of petabyte, we showcase the scalability of BBM: we implemented it on a commercial MapReduce cluster, and it took only 3 hours to compute the relevance for 1.15 billion distinct query-URL pairs.
引用
收藏
页数:26
相关论文
共 44 条
  • [1] Aggarwal G, 2008, LECT NOTES COMPUT SC, V5385, P621, DOI 10.1007/978-3-540-92185-1_68
  • [2] Agichtein E., 2006, Proceedings of the Twenty-Ninth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, P3, DOI 10.1145/1148170.1148175
  • [3] Agichtein E., 2006, Proceedings of the Twenty-Ninth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, P19, DOI 10.1145/1148170.1148177
  • [4] [Anonymous], 2008, P 2008 INT C WEB SEA, DOI [10.1145/1341531, DOI 10.1145/1341531.1341545]
  • [5] [Anonymous], 2002, P ACM SIGKDD KDD 200, DOI 10.1145/775047.775067
  • [6] [Anonymous], 2006, Pattern recognition and machine learning
  • [7] [Anonymous], P 31 ANN INT ACM SIG
  • [8] [Anonymous], 2008, P 17 INT C WORLD WID
  • [9] Babcock B., 2002, PODS, P1, DOI [DOI 10.1145/543613.543615, 10.1145/543613.543615]
  • [10] Baeza-Yates R, 2005, LECT NOTES COMPUT SC, V3408, P7