DOM Based Content Extraction via Text Density

Abstract

Besides the main content, most web pages also contain navigation panels, advertisements, copyright statements and disclaimer notices. This additional content is typically unrelated to the main subject and may hamper the performance of web data mining, so it is noise that needs to be removed properly. In this paper, we present Content Extraction via Text Density (CETD), a fast, accurate and general method that extracts content from diverse web pages and preserves its original structure using the text density of DOM nodes. For this purpose, we introduce two concepts to measure the importance of nodes: Text Density and Composite Text Density. To extract content intact, we propose a technique called DensitySum to replace Data Smoothing. The approach is evaluated on the CleanEval benchmark and on randomly selected pages from well-known websites, covering various web domains and styles. Compared with several alternative methods, the average F₁-score of our method is 8.79% higher than that of the best alternative.
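The abstract does not spell out how Text Density is computed, so the following is only a minimal sketch of the general idea. It assumes that a node's text density is the number of non-whitespace characters in its subtree divided by the number of tags in that subtree, with a tag count of zero treated as one. The names `Node`, `DensityParser` and `build_tree` are hypothetical and are not taken from the released code; Composite Text Density and DensitySum are not covered here.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so we must not descend into them.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Text inside these elements is not visible content.
SKIP_TEXT = {"script", "style"}


class Node:
    """One DOM element with subtree character and tag counts (hypothetical helper)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.chars = 0  # non-whitespace characters in this subtree
        self.tags = 0   # descendant tags in this subtree

    @property
    def text_density(self):
        # Assumption: density = characters / tags, with zero tags treated as one.
        return self.chars / max(self.tags, 1)


class DensityParser(HTMLParser):
    """Builds a simplified DOM tree and accumulates the counts on the fly."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        # The new tag counts as a descendant of every ancestor.
        ancestor = self.current
        while ancestor is not None:
            ancestor.tags += 1
            ancestor = ancestor.parent
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest matching open element; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if self.current.tag in SKIP_TEXT:
            return
        count = len("".join(data.split()))  # drop all whitespace
        node = self.current
        while node is not None:
            node.chars += count
            node = node.parent


def build_tree(html):
    parser = DensityParser()
    parser.feed(html)
    parser.close()
    return parser.root


if __name__ == "__main__":
    page = ('<html><body>'
            '<div><a href="#">Home</a><a href="#">News</a></div>'
            '<div><p>This is the main article text.</p></div>'
            '</body></html>')
    body = build_tree(page).children[0].children[0]
    for div in body.children:
        print(div.tag, div.chars, div.tags, div.text_density)
```

In this toy page the link-heavy navigation block has many tags per character and therefore a low density, while the article block has few tags and a high density, which is the intuition behind ranking nodes by text density.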

Data

The data for this project consists of two parts:
(1) The development and evaluation data sets from the CleanEval competition, which contain nearly 1000 randomly chosen documents that have been cleaned;
(2) The data set we gathered from several websites. This data set is separated into two non-overlapping sets.
The Big 5: Ars Technica, BBC, Yahoo!, New York Times, Wiki.
The Chaos: pages chosen randomly from Google News and from well-known blog platforms such as WordPress and Blogger.

The content of each of these pages was labeled manually using a web browser and saved to a UTF-8 text file as the gold standard.
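As an illustration of how an extractor's output might be compared against such a gold-standard text file, here is a minimal sketch. It assumes a simple word-overlap definition of precision, recall and F₁; the scoring procedure actually used in the paper may differ, and the function name `score_extraction` is hypothetical.

```python
from collections import Counter


def score_extraction(extracted, gold):
    """Return (precision, recall, f1) based on word overlap.

    Assumption: both texts are split on whitespace and compared as
    multisets of words, ignoring word order.
    """
    extracted_words = Counter(extracted.split())
    gold_words = Counter(gold.split())
    # Multiset intersection: each word counts up to its minimum frequency.
    overlap = sum((extracted_words & gold_words).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(extracted_words.values())
    recall = overlap / sum(gold_words.values())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    gold = "the quick brown fox jumps over"
    extracted = "the quick brown fox menu"
    print(score_extraction(extracted, gold))
```

A gold-standard file would be read with `open(path, encoding="utf-8")` before scoring, since the files are stored in UTF-8.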

The data set is available online, for free but only for research purposes. You can download the dataset here: CETD Dataset

Code

Download the code (licensed under the GNU GPL v3.0)

Paper

Sun, Fei and Song, Dandan and Liao, Lejian. DOM Based Content Extraction via Text Density. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 245-254, Beijing, China, 2011. ACM.

Citation

@inproceedings{Fei:DOM,
author = {Sun, Fei and Song, Dandan and Liao, Lejian},
title = {DOM Based Content Extraction via Text Density},
booktitle = {Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval},
year = {2011},
publisher = {ACM},
pages = {245--254},
location = {Beijing, China}
}

Fei Sun
