Conference papers

Adaptive Web Crawling through Structure-Based Link Classification

Muhammad Faheem 1, 2 ; Pierre Senellart 1, 2
1 DIG - Data, Intelligence and Graphs
LTCI - Laboratoire Traitement et Communication de l'Information
Abstract : Generic web crawling approaches cannot distinguish among various page types and cannot target content-rich areas of a website. We study the problem of efficient unsupervised web crawling of content-rich webpages. We propose ACEBot (Adaptive Crawler Bot for data Extraction), a structure-driven crawler that uses the inner structure of the pages and guides the crawling process based on the importance of their content. ACEBot works in two phases: in the learning phase, it constructs a dynamic site map (limiting the number of URLs retrieved) and learns a traversal strategy based on the importance of navigation patterns (selecting those leading to valuable content); in the intensive crawling phase, ACEBot performs massive downloading following the chosen navigation patterns. Experiments over a large dataset illustrate the effectiveness of our system.
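The two phases described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the in-memory `TOY_SITE`, the `fetch` helper, the use of a link's DOM path as its navigation pattern, and the use of word count as a stand-in for content importance are all assumptions made for illustration.

```python
from collections import defaultdict, deque

# Hypothetical toy site (assumption): URL -> (page text, [(link DOM path, target URL), ...]).
# The DOM path of a link stands in for its "navigation pattern".
TOY_SITE = {
    "/": ("home", [("nav/ul/li/a", "/about"),
                   ("div.posts/h2/a", "/post/1"),
                   ("div.posts/h2/a", "/post/2")]),
    "/about": ("about us", [("nav/ul/li/a", "/")]),
    "/post/1": ("long article text " * 20, [("nav/ul/li/a", "/"),
                                            ("div.pager/a", "/post/2")]),
    "/post/2": ("another long article " * 20, [("nav/ul/li/a", "/"),
                                               ("div.pager/a", "/post/3")]),
    "/post/3": ("third long article " * 20, [("nav/ul/li/a", "/")]),
}


def fetch(url):
    """Stand-in for an HTTP fetch: returns (text, links) from the toy site."""
    return TOY_SITE.get(url, ("", []))


def learn_patterns(seed, max_urls, top_k):
    """Learning phase: sample at most max_urls pages breadth-first, then
    score each pattern by the mean word count of the sampled pages it leads to."""
    words = {}   # sampled URL -> word count (assumed proxy for importance)
    edges = []   # every (pattern, target) link seen while sampling
    seen, queue = {seed}, deque([seed])
    while queue and len(words) < max_urls:
        url = queue.popleft()
        text, links = fetch(url)
        words[url] = len(text.split())
        for pattern, target in links:
            edges.append((pattern, target))
            if target not in seen:
                seen.add(target)
                queue.append(target)
    totals, counts = defaultdict(int), defaultdict(int)
    for pattern, target in edges:
        if target in words:  # only targets that were actually sampled
            totals[pattern] += words[target]
            counts[pattern] += 1
    scores = {p: totals[p] / counts[p] for p in totals}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:top_k])


def intensive_crawl(seed, patterns):
    """Intensive phase: follow only links whose pattern was selected."""
    seen, queue, order = {seed}, deque([seed]), []
    while queue:
        url = queue.popleft()
        order.append(url)
        _, links = fetch(url)
        for pattern, target in links:
            if pattern in patterns and target not in seen:
                seen.add(target)
                queue.append(target)
    return order


chosen = learn_patterns("/", max_urls=4, top_k=2)
crawled = intensive_crawl("/", chosen)
```

In this toy setting the learning phase samples four pages and keeps the two patterns that lead to article pages, so the intensive phase reaches every post while skipping the navigation-only `/about` page.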
Document type :
Conference papers

https://hal-imt.archives-ouvertes.fr/hal-01261960
Contributor : Admin Télécom Paristech
Submitted on : Tuesday, January 26, 2016 - 9:12:12 AM
Last modification on : Tuesday, October 19, 2021 - 11:14:16 AM

Identifiers

  • HAL Id : hal-01261960, version 1

Citation

Muhammad Faheem, Pierre Senellart. Adaptive Web Crawling through Structure-Based Link Classification. ICADL (International Conference on Asian Digital Libraries), Dec 2015, Seoul, South Korea. pp.39-51. ⟨hal-01261960⟩
