Advancing GeoSPARQL Query Generation on YAGO2Geo: Leveraging Large Language Models and Automated URI Injection from Natural Language Questions

Graduate Thesis uoadl:3420253

Unit:
Department of Informatics and Telecommunications
Informatics
Deposit date:
2024-10-19
Year:
2024
Author:
KAKALIS EFSTRATIOS-PASCHALIS
Supervisors info:
Manolis Koubarakis, Professor, Department of Informatics and Telecommunications, National and Kapodistrian University of Athens
Sergios-Anestis Kefalidis, Research Associate, Department of Informatics and Telecommunications, National and Kapodistrian University of Athens
Original Title:
Advancing GeoSPARQL Query Generation on YAGO2Geo: Leveraging Large Language Models and Automated URI Injection from Natural Language Questions
Languages:
English
Summary:
Question Answering (QA) over Knowledge Bases, particularly Knowledge Graphs (KGs),
has become an essential task in Natural Language Processing. This task allows users
to retrieve precise information from structured datasets by asking questions in natural
language. However, querying KGs typically requires the use of complex query languages
like SPARQL, which necessitates a deep understanding of both the KG’s structure and
ontology. For non-expert users, generating accurate queries in these technical formats
can be highly challenging. To make Knowledge Graphs more accessible, it is essential to
develop interfaces that allow users to interact with KGs through simple, natural language
questions, without needing to understand SPARQL. In this study, we build on this concept
by addressing the challenge of geospatial Question Answering, specifically focusing on
generating GeoSPARQL queries that correspond to any given natural language question.
This thesis investigates the development of an end-to-end system for generating GeoSPARQL queries from natural language inputs leveraging LLMs. This study targets the
YAGO2geo Knowledge Graph. The motivation behind this thesis is that traditional methods for query generation struggle with fixed vocabularies and complex KG structures, particularly for geospatial data. To address these challenges, this work focuses on leveraging
open-source LLMs, with a particular emphasis on the Mistral 7B model, and introduces
novel URI-injection techniques to enhance the accuracy and efficiency of SPARQL query
generation.
The study evaluates several state-of-the-art (SOTA) LLMs of various sizes, comparing
open-source models to proprietary models. Through named-entity disambiguation, fine-tuning, and prompt engineering, the thesis demonstrates how injecting relevant URIs during query generation can significantly improve model performance, particularly in cases
where knowledge about specific entities is sparse. We produce a fine-tuned Mistral model,
trained on a carefully processed train set, that shows substantial improvements in query
accuracy, outperforming larger, more resource-intensive models.
The main contributions of this work are:
• A systematic evaluation of existing LLMs for SPARQL query generation: We
conduct three distinct evaluations and present them in detail, drawing conclusions on the current capabilities of various LLMs in this specific task. Our findings
are verified by cross-examining the results from multiple evaluation metrics.
• A novel prompt-engineering framework for geospatial question-answering: “URI-injection,” designed to enhance LLM performance in SPARQL query generation
without expensive fine-tuning, making it versatile and easy to apply across multiple
tasks and ontologies.
• A fine-tuned and quantized Mistral 7B v0.2 model, matching the state-of-the-art
accuracy on the GeoQuestions1089 dataset, while maintaining computational efficiency through 4-bit precision.
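As an illustration only, the URI-injection idea named in the contributions above could be sketched as follows: entities mentioned in the question are first linked to Knowledge Graph URIs, and those URIs are then written into the prompt so that the model does not have to recall them. The function name build_prompt, the prompt wording, and the example URI are assumptions made for this sketch; they are not taken from the thesis, and the URI is a placeholder rather than a real YAGO2geo identifier.

```python
from typing import Dict, List


def build_prompt(question: str, linked_entities: Dict[str, str]) -> str:
    """Build an LLM prompt that lists disambiguated URIs before the question.

    `linked_entities` maps a mention in the question to a KG URI. It stands
    in for the output of a named-entity disambiguation step (hypothetical).
    """
    lines: List[str] = ["Translate the question into a GeoSPARQL query."]
    if linked_entities:
        lines.append("Use these URIs for the entities mentioned:")
        for mention, uri in linked_entities.items():
            lines.append(f"- {mention}: <{uri}>")
    lines.append(f"Question: {question}")
    lines.append("Query:")
    return "\n".join(lines)


# Placeholder URI, not a real YAGO2geo identifier.
prompt = build_prompt(
    "Which rivers cross Athens?",
    {"Athens": "http://example.org/resource/Athens"},
)
print(prompt)
```

The resulting string would be passed to whichever model generates the query. Because the URIs are supplied in the prompt, the same approach can in principle be reused with a different ontology by swapping the entity-linking step.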
Main subject category:
Technology - Computer science
Keywords:
Natural Language Processing, Large Language Models, Knowledge Graphs, Question Answering, Artificial Intelligence
Index:
Yes
Number of index pages:
2
Contains images:
Yes
Number of references:
50
Number of pages:
70
Thesis_UoA.pdf (1 MB)