An Intelligent Data-Centric Web Crawler Service for API Corpus Construction at Scale

Cited by: 1
Authors
Assefi, Mehdi [1]
Bahrami, Mehdi [2]
Arora, Sarthak [3]
Taha, Thiab R. [1]
Arabnia, Hamid R. [1]
Rasheed, Khaled M. [1]
Chen, Wei-Peng [2]
Affiliations
[1] Univ Georgia, Athens, GA 30602 USA
[2] Fujitsu Res Amer Inc, Sunnyvale, CA USA
[3] Univ Southern Calif, Los Angeles, CA 90007 USA
Source
2022 IEEE INTERNATIONAL CONFERENCE ON WEB SERVICES (IEEE ICWS 2022) | 2022
Keywords
Web API; Web Crawler; machine-learning
DOI
10.1109/ICWS55610.2022.00064
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
The number of web APIs is growing rapidly. API adoption is increasing across all industries, with executives prioritizing investments in the API economy. Each API provider offers API documentation, which includes complex descriptions. To collect and understand the applications and operations of diverse APIs, software engineers must read lengthy and complicated API documentation. Understanding this variety of API documentation is a labor-intensive and error-prone process. In this paper, we introduce a data-centric web crawler service to collect, analyze, and construct a large corpus of API documentation. The generated API corpus can be used in machine programming (i.e., code generation, code search). The proposed API web crawler intelligently harvests more than 2.8M API documentation pages, using a machine-learning-based approach with an accuracy of 91.32% to select only web (REST) API pages. We also conducted an extensive, end-to-end, real-world evaluation in which the proposed API web crawler not only collects a large number of API pages but also successfully validates 1,222 of 1,521 target APIs, a success rate of 80.34%.
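The success rate reported in the abstract follows directly from the two counts it gives. As a quick sanity check, the minimal Python sketch below recomputes it; the function name `success_rate` is illustrative and does not come from the paper.

```python
def success_rate(validated: int, targets: int) -> float:
    """Return validated/targets as a percentage rounded to two decimals."""
    if targets <= 0:
        raise ValueError("targets must be positive")
    return round(100.0 * validated / targets, 2)


# Counts taken from the abstract: 1,222 validated APIs out of 1,521 targets.
rate = success_rate(1222, 1521)
print(f"{rate}%")  # prints "80.34%"
```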
Pages: 385-390
Number of pages: 6