Adaptive Web Crawling through Structure-Based Link Classification

Muhammad Faheem; Pierre Senellart

Conference paper, Year: 2015

Muhammad Faheem

Role: Author

Data, Intelligence and Graphs

Département Informatique et Réseaux

Pierre Senellart

Role: Author
PersonId : 11778
IdHAL : pierre-senellart
ORCID : 0000-0002-7909-5369
IdRef : 124713769

Data, Intelligence and Graphs

Département Informatique et Réseaux

Image & Pervasive Access Lab

Abstract

Generic web crawling approaches cannot distinguish among various page types and cannot target content-rich areas of a website. We study the problem of efficient unsupervised web crawling of content-rich webpages. We propose ACEBot (Adaptive Crawler Bot for data Extraction), a structure-driven crawler that uses the inner structure of the pages and guides the crawling process based on the importance of their content. ACEBot works in two phases: in the learning phase, it constructs a dynamic site map (limiting the number of URLs retrieved) and learns a traversal strategy based on the importance of navigation patterns (selecting those leading to valuable content); in the intensive crawling phase, ACEBot performs massive downloading following the chosen navigation patterns. Experiments over a large dataset illustrate the effectiveness of our system.
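The two-phase idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' ACEBot implementation: the toy site, the pattern strings, the mean-content scoring rule, and the function names `learn_patterns` and `intensive_crawl` are all assumptions made for illustration. The learning phase fetches a bounded number of pages and ranks navigation patterns by the content of the pages they lead to; the intensive phase then follows only links that match the selected patterns.

```python
from collections import defaultdict, deque

# Hypothetical toy site: URL -> (content score, [(navigation pattern, target URL), ...]).
# Each pattern string stands in for the structural position of a link in its page.
SITE = {
    "/": (0, [("nav/a", "/about"), ("main/li/a", "/post/1"), ("main/li/a", "/post/2")]),
    "/about": (1, [("nav/a", "/")]),
    "/post/1": (10, [("main/li/a", "/post/3"), ("nav/a", "/about")]),
    "/post/2": (8, [("nav/a", "/")]),
    "/post/3": (9, [("nav/a", "/")]),
}


def learn_patterns(site, start, budget, keep):
    """Learning phase (assumed scoring): fetch at most `budget` pages
    breadth-first, credit each page's content score to the pattern through
    which it was first reached, and keep the `keep` best patterns by mean."""
    scores = defaultdict(list)
    seen, queue = {start}, deque([(start, None)])
    fetched = 0
    while queue and fetched < budget:
        url, via = queue.popleft()
        content, links = site[url]
        fetched += 1
        if via is not None:
            scores[via].append(content)
        for pattern, target in links:
            if target not in seen:
                seen.add(target)
                queue.append((target, pattern))
    ranked = sorted(scores, key=lambda p: sum(scores[p]) / len(scores[p]), reverse=True)
    return set(ranked[:keep])


def intensive_crawl(site, start, patterns):
    """Intensive phase: follow only links whose pattern was selected."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        url = queue.popleft()
        order.append(url)
        for pattern, target in site[url][1]:
            if pattern in patterns and target not in seen:
                seen.add(target)
                queue.append(target)
    return order


patterns = learn_patterns(SITE, "/", budget=4, keep=1)
pages = intensive_crawl(SITE, "/", patterns)
```

On this toy site, the learning phase selects the pattern that leads to the post pages, so the intensive phase reaches every post without fetching the low-content "/about" page.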

Subject areas

Web

https://imt.hal.science/hal-01261960

Submitted on: Tuesday, January 26, 2016, 09:12:12

Last modified on: Tuesday, November 7, 2023, 11:06:04

Dates and versions

hal-01261960 , version 1 (26-01-2016)

Identifiers

HAL Id : hal-01261960 , version 1

Cite

Muhammad Faheem, Pierre Senellart. Adaptive Web Crawling through Structure-Based Link Classification. ICADL (International Conference on Asian Digital Libraries), Dec 2015, Seoul, South Korea. pp.39-51. ⟨hal-01261960⟩

Collections

INSTITUT-TELECOM CNRS PARISTECH IPAL SORBONNE-UNIVERSITE LTCI INFRES DIG SU-SCIENCES

88 views

0 downloads
