Grid based hybrid search for spatio-textual data

Master Thesis
Συγγραφέας
Σασάτη, Ηλίας
Sasati, Ilias
Ημερομηνία
2025-06Επιβλέπων
Δουλκερίδης, ΧρήστοςDoulkeridis, Christos
Προβολή/ Άνοιγμα
Λέξεις κλειδιά
KNN ; FAISS ; Vector search ; ANN ; Hybrid search ; Spatio-textual queries ; Approximate Nearest Neighbors ; Hierarchical Navigable Small Worlds ; FAISS Library ; Similarity search ; k-Nearest NeighborsΠερίληψη
In this thesis, we present a new approach to the approximate similarity search problem over spatio-textual data, where queries involve both geographic locations and semantically rich text. Unlike traditional approaches that rely on exact keyword matching, our method leverages semantic vector representation (word embeddings) that captures the underlying meaning of textual content. We have developed an algorithm that efficiently processes spatio-textual queries in high dimensional vector space, combining spatial proximity with semantic similarity. Building a combined index is challenging, especially when one needs to prioritize either textual or spatial relevance. Traditionally, such combined indexes rely on keyword search, which limits contextual understanding. But what if we wanted to answer semantic queries such as: “Tweets about people getting new jobs in tech” in California?
We address this by developing a grid-based similarity search algorithm, and we use geo-tagged data from Twitter to evaluate its performance. Our approach consists of three steps. In the first step, we divide the data spatially into a uniform grid, whose resolution and implications are examined. In the second step, we build a graph-based index for the semantic vectors using the FAISS library, based on the Hierarchical Navigable Small Worlds (HNSW) algorithm. Finally, we address the pruning properties of the algorithm for efficient search by avoiding checking the entire dataset.
Experimental results show that our algorithm maintains high recall even when spatial relevance is weighted more heavily than textual content, despite the fact that the underlying index is primarily optimized for text. We demonstrate that our method can achieve recall of up to 80–85%, with a 20x improvement in execution time compared to a linear scan of the dataset.