Adaptive Web Crawling through Structure-Based Link Classification

Abstract:

Generic web crawling approaches cannot distinguish among various page types and cannot target content-rich areas of a website. We study the problem of efficient unsupervised web crawling of content-rich webpages. We propose ACEBot (Adaptive Crawler Bot for data Extraction), a structure-driven crawler that uses the inner structure of the pages and guides the crawling process based on the importance of their content. ACEBot works in two phases: in the learning phase, it constructs a dynamic site map (limiting the number of URLs retrieved) and learns a traversal strategy based on the importance of navigation patterns (selecting those leading to valuable content); in the intensive crawling phase, ACEBot performs massive downloading following the chosen navigation patterns. Experiments over a large dataset illustrate the effectiveness of our system.
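
The two-phase workflow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the in-memory `SITE`, the string-encoded navigation patterns, the mean-content-length score, and the names `learn_patterns` and `intensive_crawl` are all assumptions made for illustration. The learning phase explores a bounded number of pages and ranks the patterns by the content they led to; the intensive phase then follows only links that match the selected patterns.

```python
from collections import defaultdict, deque

# Hypothetical toy site: url -> (content_length, [(navigation_pattern, target_url), ...]).
# A real crawler would derive the pattern from the link's position in the page structure.
SITE = {
    "/": (10, [("nav/a", "/about"),
               ("main/ul/li/a", "/post/1"),
               ("main/ul/li/a", "/post/2")]),
    "/about": (20, [("nav/a", "/")]),
    "/post/1": (900, [("main/div.next/a", "/post/2"), ("nav/a", "/about")]),
    "/post/2": (800, [("main/div.next/a", "/post/3"), ("nav/a", "/about")]),
    "/post/3": (950, [("main/div.next/a", "/post/4"), ("nav/a", "/about")]),
    "/post/4": (700, [("nav/a", "/about")]),
}


def learn_patterns(site, start, budget, top_k):
    """Learning phase (sketch): fetch at most `budget` pages breadth-first and
    score each pattern by the mean content length of the pages reached through it."""
    scores = defaultdict(list)
    seen = {start}
    queue = deque([(start, None)])  # (url, pattern through which it was discovered)
    fetched = 0
    while queue and fetched < budget:
        url, via = queue.popleft()
        content_length, links = site[url]
        fetched += 1
        if via is not None:
            scores[via].append(content_length)
        for pattern, target in links:
            if target not in seen:
                seen.add(target)
                queue.append((target, pattern))
    ranked = sorted(scores, key=lambda p: sum(scores[p]) / len(scores[p]), reverse=True)
    return set(ranked[:top_k])


def intensive_crawl(site, start, patterns):
    """Intensive phase (sketch): follow only links whose pattern was selected."""
    seen = {start}
    visited = [start]
    queue = deque([start])
    while queue:
        url = queue.popleft()
        for pattern, target in site[url][1]:
            if pattern in patterns and target not in seen:
                seen.add(target)
                visited.append(target)
                queue.append(target)
    return visited


selected = learn_patterns(SITE, "/", budget=5, top_k=2)
pages = intensive_crawl(SITE, "/", selected)
```

In this toy setting the low-content `nav/a` pattern is discarded, so the intensive phase never visits `/about`, yet it still reaches `/post/4`, which the budget-limited learning phase did not fetch.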

Document type:
Conference paper
ICADL (International Conference on Asian Digital Libraries), Dec 2015, Seoul, South Korea. ICADL (International Conference on Asian Digital Libraries), pp.39-51, 2015

https://hal-imt.archives-ouvertes.fr/hal-01261960
Contributor: Admin Télécom Paristech
Submitted on: Tuesday, January 26, 2016 - 09:12:12
Last modified on: Thursday, January 11, 2018 - 06:23:39

Identifiers

  • HAL Id: hal-01261960, version 1

Citation

Muhammad Faheem, Pierre Senellart. Adaptive Web Crawling through Structure-Based Link Classification. ICADL (International Conference on Asian Digital Libraries), Dec 2015, Seoul, South Korea. ICADL (International Conference on Asian Digital Libraries), pp.39-51, 2015. 〈hal-01261960〉
