FOREST: Focused Object Retrieval by Exploiting Significant Tag Paths

Abstract:

Content-intensive websites, such as blogs or news sites, present pages that contain Web articles automatically generated by content management systems. Identifying and extracting their main content is critical in many applications, such as indexing or classification. We present a novel unsupervised approach for the extraction of Web articles from dynamically generated Web pages. Our system, called FOREST, combines structural and information-based features to target the main content generated by a Web source and published in its associated Web pages. We extensively evaluate FOREST against various baselines and datasets, and report improved results over state-of-the-art content extraction techniques.

Document type:
Conference paper
WebDB, May 2015, Melbourne, Australia. pp.55-61, 2015

https://hal-imt.archives-ouvertes.fr/hal-01178402
Contributor: Admin Télécom Paristech
Submitted on: Monday, July 20, 2015 - 03:37:34
Last modified on: Saturday, March 3, 2018 - 15:12:01

Identifiers

  • HAL Id: hal-01178402, version 1

Citation

Marilena Oita, Pierre Senellart. FOREST: Focused Object Retrieval by Exploiting Significant Tag Paths. WebDB, May 2015, Melbourne, Australia. pp.55-61, 2015. 〈hal-01178402〉
