Intelligent Crawling of Web Applications for Web Archiving

Abstract:

The steady growth of the World Wide Web raises challenges for the preservation of meaningful Web data. The tools currently used by Web archivists blindly crawl and store the Web pages they find, disregarding both the kind of Web site being accessed (which leads to suboptimal crawling strategies) and any structured content the pages contain (which results in page-level archives whose content is hard to exploit). This PhD work focuses on the crawling and archiving of publicly accessible Web applications, especially those of the social Web. A Web application is any application that uses Web standards such as HTML and HTTP to publish information on the Web in a form accessible to Web browsers. Examples include Web forums, social networks, and geolocation services. We claim that the best strategy for crawling these applications is to make the Web crawler aware of the kind of application it is currently processing, which allows it to refine the list of URLs to process and to annotate the archive with information about the structure of the crawled content. We add adaptive characteristics to an archival Web crawler: the ability to identify when a Web page belongs to a given Web application, and to apply the appropriate crawling and content extraction methodology.
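The adaptive behaviour described in the abstract (identify the application behind a page, then refine the URL list and annotate the result) can be illustrated with a minimal Python sketch. It is not the author's implementation: the `AppProfile` structure, the "HypotheticalForum" generator tag, the URL patterns, and the function names are all assumptions made up for illustration.

```python
import re
from dataclasses import dataclass
from html.parser import HTMLParser
from urllib.parse import urljoin


@dataclass
class AppProfile:
    """Hypothetical description of one kind of Web application."""
    name: str
    detect: re.Pattern   # matches the HTML of pages served by the application
    follow: re.Pattern   # matches URLs worth crawling for this application


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


# Assumed profile: a made-up forum engine that advertises itself in a meta tag.
PROFILES = [
    AppProfile(
        name="forum",
        detect=re.compile(r'<meta name="generator" content="HypotheticalForum', re.I),
        follow=re.compile(r"/(thread|forum)/\d+"),
    ),
]


def identify(html):
    """Return the first profile whose detection pattern matches, else None."""
    for profile in PROFILES:
        if profile.detect.search(html):
            return profile
    return None


def plan_crawl(base_url, html):
    """Extract links, then keep only those relevant to the detected application."""
    collector = LinkCollector()
    collector.feed(html)
    urls = [urljoin(base_url, href) for href in collector.links]
    profile = identify(html)
    if profile is None:
        # Unknown site: fall back to generic crawling of every link.
        return {"application": None, "urls": urls}
    kept = [url for url in urls if profile.follow.search(url)]
    return {"application": profile.name, "urls": kept}
```

In this sketch, a page recognised as a forum yields only thread and forum links, labelled with the application name, while an unrecognised page falls back to following every link, as a generic crawler would.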

Document type:
Conference paper
WWW, Apr 2012, Lyon, France. ACM, pp.127-131, 2012

Cited literature: 22 references
Contributor: Admin Télécom Paristech
Submitted on: Wednesday, 26 February 2014 - 09:58:16
Last modified on: Thursday, 11 January 2018 - 06:23:38
Document(s) archived on: Monday, 26 May 2014 - 11:26:13


  • HAL Id: hal-00952017, version 1


Muhammad Faheem. Intelligent Crawling of Web Applications for Web Archiving. WWW, Apr 2012, Lyon, France. ACM, pp.127-131, 2012. 〈hal-00952017〉


