dc.contributor.advisor: Petasis, Georgios
dc.contributor.advisor: Πετάσης, Γεώργιος
dc.contributor.author: Giannios, Stergios
dc.contributor.author: Γιαννιός, Στέργιος
dc.date.accessioned: 2022-03-14T09:45:10Z
dc.date.available: 2022-03-14T09:45:10Z
dc.date.issued: 2022-02
dc.identifier.uri: https://dione.lib.unipi.gr/xmlui/handle/unipi/14218
dc.identifier.uri: http://dx.doi.org/10.26267/unipi_dione/1641
dc.format.extent: 53 [el]
dc.language.iso: en [el]
dc.publisher: University of Piraeus [el]
dc.title: Automated information extraction from web pages [el]
dc.type: Master Thesis [el]
dc.contributor.department: School of Information and Communication Technologies. Department of Digital Systems [el]
dc.description.abstractEN: The continuous growth of the World Wide Web (WWW) has resulted in enormous amounts of information. Specific data contained in webpages can be extracted and leveraged in numerous applications. A semi-automatic or automatic approach to retrieving data from webpages is needed, since manual extraction is very time-consuming and does not scale well. However, because of the heterogeneity and semi-structured nature of webpages, the automatic extraction of data is a non-trivial task. The task of web information extraction (WIE) is most commonly addressed with wrapper induction (WI). In WI, the goal is to learn a set of extraction rules from manually labelled examples. The primary issue with WI is that the learned rules are frequently incapable of dealing with even slight variations in a webpage's template and cannot generalize to other websites. In this thesis, the WIE problem is reframed as an object detection task. For this purpose, a dataset of news articles was collected and annotated. A state-of-the-art detector, YOLOv5, was used to extract specific attributes such as the news articles' title, metadata, author, date, main image, text, and keywords. The model yielded 90% mAP (over all classes) in 5-fold cross-validation stratified by website domain. The model's one-shot learning capabilities were also explored by using transfer learning to fine-tune it on unseen news websites in English and in another language (Greek), achieving 79% mAP and 90% mAP, respectively. To compare our approach with a state-of-the-art approach that used an earlier version of YOLO (version 2), a dataset of book product details from Amazon.in was used, with the books' title, author, and price as extraction targets. Our approach achieved 95% mAP, compared with 74% mAP for the state-of-the-art approach. [el]
dc.corporate.name: National Centre for Scientific Research "Demokritos" [el]
dc.contributor.master: Artificial Intelligence [el]
dc.subject.keyword: Web information extraction [el]
dc.subject.keyword: Object detection [el]
dc.subject.keyword: Deep neural networks [el]
dc.date.defense: 2022-02
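
Note on the reported metric: the abstract gives its results as mAP, a detection metric that depends on how well each predicted bounding box overlaps an annotated one (intersection over union, IoU). The following is a minimal sketch of that overlap test only, not the thesis's evaluation code. It assumes boxes given as (x_min, y_min, x_max, y_max) in pixels and a 0.5 IoU threshold; neither is stated in this record, and the names `iou` and `is_true_positive` are hypothetical.

```python
from typing import Tuple

# Assumed box layout (not stated in the record): (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle, if there is one.
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred: Box, truth: Box, threshold: float = 0.5) -> bool:
    """Count a prediction as correct when its IoU meets the assumed threshold."""
    return iou(pred, truth) >= threshold


if __name__ == "__main__":
    # Hypothetical boxes for an article title on a rendered page.
    annotated_title = (100.0, 50.0, 900.0, 120.0)
    predicted_title = (110.0, 55.0, 890.0, 125.0)
    print(iou(predicted_title, annotated_title))
    print(is_true_positive(predicted_title, annotated_title))
```

Average precision for a class is then computed from the ranked true and false positives, and mAP is the mean over classes (here, attributes such as title, author, and date).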


University of Piraeus Library
The creation and enrichment of the Institutional Repository "Dione" were carried out within the framework of the Project "Institutional Repository and Digital Library Service" of the action "Open-access digital services of the University of Piraeus Library".