LMUK-Geo: A Dataset and LLM-Driven Geoparsing Approach for UK Local News

This repository contains the code used in our forthcoming journal article, "Towards Efficient and Accessible Geoparsing of UK Local Media: A Benchmark Dataset and LLM-based Approach", which introduces the LMUK-Geo dataset and a scalable, accessible, and robust LLM-driven geoparsing approach for UK local news. The dataset itself is available on Harvard Dataverse: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/SGVXIU

Introduction

This project addresses the need for improved geoparsing of local media, focusing on the under-explored geographic context of the UK. We introduce two key contributions:

The Local Media UK Geoparsing (LMUK-Geo) dataset: A novel, annotated gold standard corpus of 182 UK local news articles, enabling the development and evaluation of tailored geoparsing models. The dataset is hosted on Harvard Dataverse.
A novel LLM-driven geoparsing approach: This approach tackles the challenges of location disambiguation and contextual understanding in UK local news.

This repository provides the code used for our LLM-driven geoparsing approach, facilitating reproducibility and further research in this area. The LMUK-Geo dataset must be downloaded from Harvard Dataverse before running the code.

Dataset

The LMUK-Geo dataset consists of 182 UK local news articles sourced from the UKTwitNewsCor corpus. It is annotated with toponyms (GPE, LOC, FAC) and their corresponding Local Authority Districts (LADs). The annotation process had three stages: toponym recognition using Prodigy and spaCy, candidate generation using Ordnance Survey Open Names and OpenStreetMap Nominatim, and manual disambiguation using Label Studio. The dataset is designed to capture the fine-grained locations prevalent in local news and excludes larger-scale geographic references.

Location: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/SGVXIU
Format: JSON and CSV.
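
As a starting point for inspecting the downloaded data, the CSV version can be read with a few lines of code. This is a minimal sketch in Python for illustration only (the repository's own scripts are written in R). The function name `load_rows` is hypothetical, UTF-8 encoding is an assumption, and no column names are assumed: the header row of the file determines the keys.

```python
import csv


def load_rows(csv_path):
    """Read a CSV file with a header row into a list of dictionaries.

    Each dictionary maps a column name (taken from the header row) to the
    cell value as a string. UTF-8 encoding is assumed, not confirmed.
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

Calling `load_rows` with the path to the CSV file downloaded from Harvard Dataverse and printing the keys of the first row is a quick way to see which fields the annotation provides before running any analysis.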

Prompt-Based Geoparsing Approach

Our approach utilises open-source LLMs (Gemma2, Llama3.1, Qwen2, Mistral) via the Ollama framework within an R environment. We explored different prompting strategies, including contextual toponym disambiguation from a knowledge base and few-shot LAD classification. We also investigated the impact of varying metadata context (outlet name, outlet coverage district, other toponyms in the article) on LLM geoparsing performance. Majority voting was used to enhance robustness.
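
The majority-voting step can be illustrated with a short sketch. It is written in Python for illustration and is not the repository's R implementation. The function name `majority_vote`, the tie-breaking rule (first-seen label wins), and the example district names are all assumptions. Whether the votes come from different models or from repeated runs of one model is not specified here, so the function simply takes a list of predicted LAD labels for a single toponym.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent label among the predictions.

    Ties are broken in favour of the label that appears first in the
    input list (an assumed rule, chosen only to make the result
    deterministic).
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    top_count = max(counts.values())
    for label in predictions:
        if counts[label] == top_count:
            return label


# Hypothetical predictions for one toponym from three sources.
votes = ["Leeds", "Bradford", "Leeds"]
winner = majority_vote(votes)
```

With the hypothetical votes above, two of the three predictions agree, so `winner` is the label they share. A single outlier prediction therefore does not change the final assignment, which is the sense in which voting adds robustness.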

Evaluation

We evaluated our approach using both classification metrics (Accuracy) and distance metrics (Mean Error Distance, Accuracy@20km, Accuracy@161km). We compared our results against the baseline provided by Hu et al. (2024). We considered different evaluation scenarios, including handling edge cases and using majority voting.
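
The distance metrics named above can be sketched as follows. This is an illustrative Python example, not the repository's R evaluation code. It assumes that each predicted and gold location is represented by a single latitude/longitude point (for example a district centroid, which is an assumption), that distances are great-circle distances computed with the haversine formula, and that Accuracy@k is the share of predictions whose error is at most k km. The function names and the keys of the returned dictionary are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_metrics(predicted, gold, thresholds_km=(20, 161)):
    """Compute mean error distance and Accuracy@k for paired (lat, lon) points."""
    if not predicted or len(predicted) != len(gold):
        raise ValueError("predicted and gold must be non-empty and of equal length")
    errors = [haversine_km(p[0], p[1], g[0], g[1]) for p, g in zip(predicted, gold)]
    metrics = {"mean_error_km": sum(errors) / len(errors)}
    for k in thresholds_km:
        # Share of predictions that fall within k km of the gold location.
        metrics[f"acc@{k}km"] = sum(e <= k for e in errors) / len(errors)
    return metrics
```

Because Accuracy@k only counts whether an error falls under a threshold, it is insensitive to how far off the remaining predictions are, while Mean Error Distance is pulled up by a few large errors. Reporting both gives a fuller picture.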

Repository Structure

The repository is organised as follows:

scripts/: Contains the R scripts used for creating the dataset, geoparsing, analysis, and evaluation.
files/: Contains the data files used in the project, including (but not limited to) raw data, intermediate processed data, and output files.

Citation

If you use this dataset or code in your research, please cite the following:

@article{bisiani_2025_lmuk,
  title={Towards Efficient and Accessible Geoparsing of UK Local Media: A Benchmark Dataset and LLM-based Approach},
  author={Bisiani, S. and Gulyas, A. and Heravi, Bahareh},
  journal={Computational Humanities Research (Submitted)},
  year={2025}, 
}
