Automated information extraction from web pages
Master Thesis
Author
Giannios, Stergios
Γιαννιός, Στέργιος
Date
2022-02
Advisor
Petasis, Georgios
Πετάσης, Γεώργιος
Abstract
The continuous growth of the world wide web (WWW) has resulted in enormous
amounts of information. Specific data, contained in webpages, can be extracted and
leveraged in numerous applications. A semi-automatic/automatic approach to
retrieving data from webpages is needed since manual extraction is very time-consuming and does not scale well. However, because of the heterogeneity and semi-structured nature of webpages, the automatic extraction of data is a non-trivial task.
The task of web information extraction (WIE) is most commonly addressed with
wrapper induction (WI). In WI, the goal is to learn a set of extraction rules by using
manually labelled examples. The primary issue with WI is that the learned rules are
frequently incapable of dealing with even slight variations in a webpage's template, and
cannot generalize to other websites. In this thesis, the WIE problem is reframed as an
object detection task. For this purpose, a dataset of news articles was collected and
annotated. A state-of-the-art detector, YOLOv5, was used to extract
specific attributes such as the news articles’ title, metadata, author, date, main image,
text, and keywords. The model yielded 90% mAP (over all classes) in stratified (based
on website domain) 5-fold cross-validation. One-shot learning capabilities of the
model were also explored by using transfer learning to fine-tune the model to unseen
news websites in English and in another language (Greek), achieving 79% mAP and
90% mAP, respectively. To compare our approach with a state-of-the-art approach that
used an earlier version of YOLO (version 2), a dataset of book product details from
Amazon.in was used, with the books’ title, author, and price as extraction targets.
Our approach yielded 95% mAP, compared to 74% mAP for the state-of-the-art
approach.
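
The abstract mentions 5-fold cross-validation stratified by website domain but does not describe the splitting procedure. The sketch below shows one possible reading, in which the pages of every domain are spread as evenly as possible across the folds. The function name `stratified_folds` and the example URLs are hypothetical and are not taken from the thesis.

```python
from collections import defaultdict
from urllib.parse import urlparse


def stratified_folds(page_urls, k=5):
    """Assign each page URL to one of k folds, spreading every domain evenly.

    Assumption: "stratified by domain" means each fold receives a
    near-equal share of the pages of each website domain.
    """
    # Group the pages by the domain part of their URL.
    by_domain = defaultdict(list)
    for url in page_urls:
        by_domain[urlparse(url).netloc].append(url)

    folds = [[] for _ in range(k)]
    next_fold = 0  # running counter keeps the overall fold sizes balanced
    for domain in sorted(by_domain):
        # Deal the pages of this domain out to consecutive folds.
        for url in by_domain[domain]:
            folds[next_fold % k].append(url)
            next_fold += 1
    return folds


if __name__ == "__main__":
    urls = [f"https://a.example/{i}" for i in range(10)]
    urls += [f"https://b.example/{i}" for i in range(5)]
    for index, fold in enumerate(stratified_folds(urls, k=5)):
        print(index, fold)
```

In each round, one fold would serve as the validation set and the remaining four as the training set. A different reading, in which all pages of a domain are held out together, would need a grouped split instead.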