Grid based hybrid search for spatio-textual data

Σασάτη, Ηλίας; Sasati, Ilias

dc.contributor.advisor	Δουλκερίδης, Χρήστος
dc.contributor.advisor	Doulkeridis, Christos
dc.contributor.author	Σασάτη, Ηλίας
dc.contributor.author	Sasati, Ilias
dc.date.accessioned	2025-07-16T09:54:15Z
dc.date.available	2025-07-16T09:54:15Z
dc.date.issued	2025-06
dc.identifier.uri	https://dione.lib.unipi.gr/xmlui/handle/unipi/17962
dc.format.extent	76	el
dc.language.iso	en	el
dc.publisher	University of Piraeus	el
dc.rights	Attribution 3.0 Greece	*
dc.rights.uri	http://creativecommons.org/licenses/by/3.0/gr/	*
dc.title	Grid based hybrid search for spatio-textual data	el
dc.type	Master Thesis	el
dc.contributor.department	School of Information and Communication Technologies. Department of Digital Systems	el
dc.description.abstractEN	In this thesis, we present a new approach to the approximate similarity search problem over spatio-textual data, where queries involve both geographic locations and semantically rich text. Unlike traditional approaches that rely on exact keyword matching, our method leverages semantic vector representations (word embeddings) that capture the underlying meaning of textual content. We have developed an algorithm that efficiently processes spatio-textual queries in a high-dimensional vector space, combining spatial proximity with semantic similarity. Building a combined index is challenging, especially when one needs to prioritize either textual or spatial relevance. Traditionally, such combined indexes rely on keyword search, which limits contextual understanding. But what if we wanted to answer a semantic query such as “Tweets about people getting new jobs in tech” in California? We address this by developing a grid-based similarity search algorithm, and we use geo-tagged data from Twitter to evaluate its performance. Our approach consists of three steps. In the first step, we partition the data spatially into a uniform grid and examine the choice of resolution and its implications. In the second step, we build a graph-based index for the semantic vectors using the FAISS library, based on the Hierarchical Navigable Small Worlds (HNSW) algorithm. Finally, we examine the pruning properties of the algorithm, which make search efficient by avoiding a scan of the entire dataset. Experimental results show that our algorithm maintains high recall even when spatial relevance is weighted more heavily than textual content, despite the fact that the underlying index is primarily optimized for text. We demonstrate that our method can achieve recall of up to 80–85%, with a 20x improvement in execution time compared to a linear scan of the dataset.	el
dc.contributor.master	Information Systems and Services	el
dc.subject.keyword	KNN	el
dc.subject.keyword	FAISS	el
dc.subject.keyword	Vector search	el
dc.subject.keyword	ANN	el
dc.subject.keyword	Hybrid search	el
dc.subject.keyword	Spatio-textual queries	el
dc.subject.keyword	Approximate Nearest Neighbors	el
dc.subject.keyword	Hierarchical Navigable Small Worlds	el
dc.subject.keyword	FAISS Library	el
dc.subject.keyword	Similarity search	el
dc.subject.keyword	k-Nearest Neighbors	el
dc.date.defense	2025-07-07
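
The grid-partitioning and pruning idea described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the thesis indexes the semantic vectors with FAISS/HNSW, whereas this sketch uses a brute-force scan inside the candidate cells as a stand-in. The function names, the weight `alpha`, the `ring` parameter, the Euclidean distance on raw coordinates, and the normalized scoring formula are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def cell_of(lon, lat, bounds, n):
    """Map a point to its (col, row) cell in an n x n uniform grid over bounds."""
    min_lon, min_lat, max_lon, max_lat = bounds
    col = int((lon - min_lon) / (max_lon - min_lon) * n)
    row = int((lat - min_lat) / (max_lat - min_lat) * n)
    # Clamp so that points on the upper boundary fall into the last cell.
    return max(0, min(col, n - 1)), max(0, min(row, n - 1))


def build_grid(points, bounds, n):
    """Group point indices by grid cell; each point is (lon, lat, vector)."""
    grid = defaultdict(list)
    for idx, (lon, lat, _vec) in enumerate(points):
        grid[cell_of(lon, lat, bounds, n)].append(idx)
    return grid


def cosine_distance(a, b):
    """1 - cosine similarity, in [0, 2]; assumes non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def hybrid_score(query, point, alpha, max_dist):
    """Assumed score: alpha * spatial + (1 - alpha) * textual, each in [0, 1]."""
    spatial = math.hypot(query[0] - point[0], query[1] - point[1]) / max_dist
    textual = cosine_distance(query[2], point[2]) / 2.0
    return alpha * spatial + (1.0 - alpha) * textual


def search(points, grid, bounds, n, query, k, alpha, ring=1):
    """Return the k best point indices, scoring only cells near the query cell."""
    col, row = cell_of(query[0], query[1], bounds, n)
    max_dist = math.hypot(bounds[2] - bounds[0], bounds[3] - bounds[1])
    candidates = []
    for d_col in range(-ring, ring + 1):
        for d_row in range(-ring, ring + 1):
            # Cells outside the ring are pruned and never scored.
            candidates.extend(grid.get((col + d_col, row + d_row), []))
    scored = sorted(
        (hybrid_score(query, points[i], alpha, max_dist), i) for i in candidates
    )
    return [i for _score, i in scored[:k]]
```

In this sketch a larger `alpha` favors spatial proximity and a smaller one favors semantic similarity. Restricting the scan to a fixed ring of cells is the simplest possible pruning rule and can miss distant but semantically close points, which is one reason such a search is approximate.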

Files in this item

Name: Sasati_me2327.pdf
Size: 2.994 MB
Type: PDF
Description: Master's thesis

This item appears in the following collections

Department of Digital Systems

Except where otherwise noted, this item is distributed under the following license:
Attribution 3.0 Greece