
Adaptive Web Crawling through Structure-Based Link Classification

Muhammad Faheem (1, 2), Pierre Senellart (1, 2)
(1) DIG - Data, Intelligence and Graphs
LTCI - Laboratoire Traitement et Communication de l'Information
Abstract : Generic web crawling approaches cannot distinguish among various page types and cannot target content-rich areas of a website. We study the problem of efficient unsupervised web crawling of content-rich webpages. We propose ACEBot (Adaptive Crawler Bot for data Extraction), a structure-driven crawler that uses the inner structure of the pages and guides the crawling process based on the importance of their content. ACEBot works in two phases: in the learning phase, it constructs a dynamic site map (limiting the number of URLs retrieved) and learns a traversal strategy based on the importance of navigation patterns (selecting those leading to valuable content); in the intensive crawling phase, ACEBot performs massive downloading following the chosen navigation patterns. Experiments over a large dataset illustrate the effectiveness of our system.
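The two phases described in the abstract can be illustrated with a minimal sketch. This is not ACEBot's implementation: the toy in-memory `SITE`, the use of an anchor's DOM path as the "navigation pattern", the mean text length as the "importance" score, and the names `learn_patterns`, `crawl`, `budget`, and `keep` are all assumptions made for illustration only.

```python
from collections import defaultdict, deque

# Hypothetical toy site (an assumption, not data from the paper):
# URL -> (text length of the page, list of (DOM path of the link, target URL)).
SITE = {
    "/": (50, [("body/div.nav/a", "/about"),
               ("body/ul.posts/li/a", "/post/1"),
               ("body/ul.posts/li/a", "/post/2")]),
    "/about": (40, [("body/div.nav/a", "/")]),
    "/post/1": (900, [("body/div.nav/a", "/"), ("body/div.next/a", "/post/2")]),
    "/post/2": (1100, [("body/div.nav/a", "/"), ("body/div.next/a", "/post/3")]),
    "/post/3": (1000, [("body/div.nav/a", "/")]),
}


def learn_patterns(site, start, budget, keep):
    """Learning phase (sketch): expand at most `budget` pages breadth-first and
    score each link pattern by the mean text length of the pages it points to.
    Returns the set of the `keep` highest-scoring patterns."""
    seen, queue = {start}, deque([start])
    sizes = defaultdict(list)  # pattern -> text lengths of its target pages
    expanded = 0
    while queue and expanded < budget:
        url = queue.popleft()
        expanded += 1
        for pattern, target in site[url][1]:
            if target not in site:
                continue  # ignore links leaving the toy site
            sizes[pattern].append(site[target][0])
            if target not in seen:
                seen.add(target)
                queue.append(target)
    ranked = sorted(sizes, key=lambda p: sum(sizes[p]) / len(sizes[p]), reverse=True)
    return set(ranked[:keep])


def crawl(site, start, patterns):
    """Intensive phase (sketch): breadth-first crawl that follows only links
    whose pattern was selected during learning. Returns the visit order."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        url = queue.popleft()
        order.append(url)
        for pattern, target in site[url][1]:
            if pattern in patterns and target in site and target not in seen:
                seen.add(target)
                queue.append(target)
    return order


patterns = learn_patterns(SITE, "/", budget=3, keep=2)
visited = crawl(SITE, "/", patterns)
```

In this toy run the navigation-bar pattern scores lowest, so the second phase reaches the three post pages and never fetches `/about`. A real crawler would replace `SITE` with HTTP fetching and HTML parsing, and would need a more careful definition of both the patterns and the content score.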
Document type :
Conference papers
Contributor : Admin Télécom Paristech
Submitted on : Tuesday, January 26, 2016 - 9:12:12 AM
Last modification on : Tuesday, October 19, 2021 - 11:14:16 AM


  • HAL Id : hal-01261960, version 1



Muhammad Faheem, Pierre Senellart. Adaptive Web Crawling through Structure-Based Link Classification. ICADL (International Conference on Asian Digital Libraries), Dec 2015, Seoul, South Korea. pp.39-51. ⟨hal-01261960⟩


