FOREST: Focused Object Retrieval by Exploiting Significant Tag Paths

Marilena Oita; Pierre Senellart

Conference paper, Year: 2015

Marilena Oita

Role: Author

Laboratoire Traitement et Communication de l'Information

Pierre Senellart

Role: Author
PersonId : 11778
IdHAL : pierre-senellart
ORCID : 0000-0002-7909-5369
IdRef : 124713769

Data, Intelligence and Graphs

Département Informatique et Réseaux

Image & Pervasive Access Lab

Abstract

Content-intensive websites, such as blogs or news sites, present pages containing Web articles that are automatically generated by content management systems. Identifying and extracting their main content is critical in many applications, such as indexing or classification. We present a novel unsupervised approach for extracting Web articles from dynamically generated Web pages. Our system, called FOREST, combines structural and information-based features to target the main content generated by a Web source and published in its associated Web pages. We extensively evaluate FOREST against various baselines and on various datasets, and report improved results over state-of-the-art content extraction techniques.

Domains

Web

https://imt.hal.science/hal-01178402

Submitted on: Monday, 20 July 2015, 03:37:34

Last modified on: Tuesday, 7 November 2023, 11:06:04

Dates and versions

hal-01178402 , version 1 (20-07-2015)

Identifiers

HAL Id : hal-01178402 , version 1

Cite

Marilena Oita, Pierre Senellart. FOREST: Focused Object Retrieval by Exploiting Significant Tag Paths. WebDB, May 2015, Melbourne, Australia. pp.55-61. ⟨hal-01178402⟩

Collections

INSTITUT-TELECOM CNRS PARISTECH IPAL SORBONNE-UNIVERSITE LTCI INFRES DIG SU-SCIENCES
