Moniepoint was honoured to host Dr. Oladipupo Sennaike, from the Department of Computer Sciences at the University of Lagos (UNILAG), for a recent instalment of the Moniepoint Research Talks. Dr. Sennaike presented his research on automatically discovering connections within vast collections of public data, demonstrating a robust method for improving how we interact with open data globally.
The Context: Why Open Data Portals Matter
Open data portals are online platforms that provide access to collections of open, freely available data intended for reuse. Globally, there are over 2,600 such portals, including major examples like data.gov for the U.S. government and data.gov.uk for the U.K., as well as the Nigeria Data Portal and the World Bank OpenData platform.
These portals are critical for several reasons:
Transparency & Accountability: They enable citizens to access information on government spending, budgets, and public projects.
Innovation & Entrepreneurship: Developers and startups leverage open data to build new fintech, transport, health, and civic-tech products.
Research & Education: Academics and analysts use these datasets for modelling, policy research, and AI training. Furthermore, data collected from these portals provides the evidence needed to ensure that policies are based on real-world evidence.
Better Public Services: Governments can improve planning for sectors like transport, health, and agriculture by harnessing insights derived from this data.
Open Data Portals generally operate on a generic three-layer architecture: the User Interface (containing the Catalogue UI, Search UI, Analytics, and APIs), Services (handling the data catalogue, metadata management, and search), and Storage (managing files, structured data, and indices).
The Core Problem: Current Limitations in Search
Despite their importance, existing open data platforms offer only basic search capabilities and simple filtering. A critical limitation is the lack of recommendations for related datasets. Understanding which datasets are related (or semantically connected) is essential for users who want to integrate or "mash-up" data for comprehensive analysis or to create new data-driven services.
Given the scale of these catalogues (data.gov, for example, hosts over 350,000 datasets, and data.gov.uk has over 55,000), manually specifying the relatedness relationships across them is simply infeasible.
The Research Objective: Automating Semantic Relatedness
Dr. Sennaike’s primary goal was to determine the implicit semantic relatedness of datasets published within a catalogue. The proposed solution was to use a Self-Organising Map (SOM) to generate automated dataset recommendations during search operations.
The work acknowledged other established methods for measuring relatedness, such as lexical resource-based approaches, Explicit Semantic Analysis (ESA), and Latent Dirichlet Allocation (LDA) for topic modelling, but focused on the SOM's unsupervised capabilities.
Leveraging Kohonen’s Self-Organising Maps (SOM)
A Self-Organising Map (SOM) is a type of unsupervised artificial neural network. Unlike supervised learning, where a model is trained on labelled data (and its accuracy is easily measured by comparing its output to the correct label), unsupervised learning discovers connections in unlabelled data. The SOM acts similarly to clustering algorithms by grouping similar concepts, but its clusters do not have hard boundaries.
The core functionality of the SOM algorithm is dimensional reduction: it projects high-dimensional input data (with many attributes or features) onto a low-dimensional space, typically a two-dimensional map, while preserving topological order. This preservation means that related data points are placed close to one another on the resulting map.
The SOM consists of nodes or units arranged in a regular rectangular or hexagonal grid. Each node is associated with an n-dimensional model vector (or weights) that approximates the set of input data, where n is the dimension of the input space (the number of features).
During training, each data item is presented to the network in turn, and all nodes evaluate it simultaneously. The nodes compete, and the node with the best-matching model vector, usually determined by the Euclidean distance metric, emerges as the winner. The model vectors (weights) of this winning unit and its neighbours are then adjusted (or strengthened) to move closer to the input data, which is how learning occurs. This adjustment of the surrounding nodes is critical: it is governed by the neighbourhood size and is what preserves the topological order.
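To make the mechanics concrete, here is a minimal NumPy sketch of that training loop (competition, then a neighbourhood update). It is an illustrative toy rather than Dr. Sennaike's implementation; the grid size, decay schedules, and Gaussian neighbourhood function are assumptions.

```python
import numpy as np

def train_som(data, grid_w=10, grid_h=10, n_iter=2000,
              lr0=0.5, sigma0=3.0, seed=42):
    """Toy SOM training loop: competition, then a neighbourhood update."""
    rng = np.random.default_rng(seed)
    n_features = data.shape[1]
    # One n-dimensional model vector (weights) per grid node.
    weights = rng.random((grid_w, grid_h, n_features))
    # Grid coordinates, used to measure distance between nodes on the map.
    grid = np.stack(np.meshgrid(np.arange(grid_w), np.arange(grid_h),
                                indexing="ij"), axis=-1)

    for t in range(n_iter):
        x = data[rng.integers(len(data))]            # present one data item
        # Competition: the node whose model vector is closest (Euclidean) wins.
        dists = np.linalg.norm(weights - x, axis=-1)
        bmu = np.unravel_index(np.argmin(dists), dists.shape)
        # Learning rate and neighbourhood size shrink as training progresses.
        lr = lr0 * np.exp(-t / n_iter)
        sigma = sigma0 * np.exp(-t / n_iter)
        # Neighbourhood update: the winner and nearby nodes are pulled towards
        # the input, which is what preserves topological order on the map.
        grid_dist = np.linalg.norm(grid - np.array(bmu), axis=-1)
        h = np.exp(-(grid_dist ** 2) / (2 * sigma ** 2))[..., None]
        weights += lr * h * (x - weights)
    return weights
```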
The SOM Model Development Process
The research utilised a five-step methodology to implement the SOM-based recommendation engine:
1. Data Preparation and Extraction
The study focused on 255 open datasets from the Dublin City Open Data Platform (DubLinked). This catalogue includes diverse information ranging from pedestrian footfall indices to commercial lease registers and sculptures in public parks.
Data was extracted from the DubLinked CKAN platform instance using its REST API. The extraction captured essential metadata, including the dataset's Title, Organisation, Theme, and Tags. Furthermore, Named Entity Recognition (NER) was employed to extract specific entities, such as people, organisations, and locations, from the content and metadata, thereby enriching the dataset description.
[TABLE: Extracted Dataset Features, including Title, Organization, Theme, Notes, Tag, Resource Fields, Location, Person, and Organization]
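As a rough illustration of this extraction step, the sketch below pulls dataset metadata from a CKAN portal's Action API and enriches it with spaCy named entities. The base URL, the spaCy model, and the exact field mapping are assumptions for illustration, not the talk's actual pipeline.

```python
import requests
import spacy

BASE = "https://data.smartdublin.ie"   # placeholder portal URL; point at the target CKAN instance
nlp = spacy.load("en_core_web_sm")     # small English model with named entity recognition

def fetch_dataset_features(name):
    """Fetch one dataset's CKAN metadata and enrich it with named entities."""
    meta = requests.get(f"{BASE}/api/3/action/package_show",
                        params={"id": name}).json()["result"]
    text = " ".join(filter(None, [meta.get("title"), meta.get("notes")]))
    ents = nlp(text).ents
    return {
        "title": meta.get("title"),
        "organisation": (meta.get("organization") or {}).get("title"),
        "theme": [g["title"] for g in meta.get("groups", [])],
        "tags": [t["name"] for t in meta.get("tags", [])],
        "locations": [e.text for e in ents if e.label_ in ("GPE", "LOC")],
        "people": [e.text for e in ents if e.label_ == "PERSON"],
        "orgs": [e.text for e in ents if e.label_ == "ORG"],
    }

# List every dataset on the portal, then build a feature record for each.
names = requests.get(f"{BASE}/api/3/action/package_list").json()["result"]
records = [fetch_dataset_features(n) for n in names]
```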
2. Data Transformation
Since the SOM requires numeric vector input, the text-based metadata and features were transformed into a document-term matrix using Term Frequency–Inverse Document Frequency (tf-idf) weighting. This common NLP technique converts variable-length text documents into a fixed-size numeric representation, ensuring regularity across all datasets. The resulting input matrix for the SOM was 255 by 1,241 (the 255 datasets and the 1,241 terms/features extracted).
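Continuing from the extraction sketch above (it reuses the hypothetical records list), a minimal version of this transformation with scikit-learn's TfidfVectorizer could look like this; the preprocessing choices, such as English stop-word removal and simple concatenation of the textual features, are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Each dataset becomes one "document": its textual features concatenated together.
docs = [" ".join([r["title"] or ""] + r["tags"] + r["theme"] +
                 r["locations"] + r["people"] + r["orgs"])
        for r in records]
titles = [r["title"] for r in records]

vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(docs).toarray()   # rows: datasets, columns: tf-idf term weights
# For the 255 DubLinked datasets described above, this matrix was 255 x 1241.
```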
3. Model Selection, Validation, and Results
Since there was no labelled data to rely on, the researchers experimented with different map sizes to find the optimal SOM configuration. Selection was based on two key metrics, illustrated in the sketch after this list:
Topological Error: This measures the proportion of input samples in which the first- and second-best matching units are not neighbours on the map. A low error suggests that topologically close items on the map are genuinely similar in the source data.
Quantisation Error: This is the average distance between the input data and the model vector of its best-matching unit. This indicates how accurately the map represents the overall training data.
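The sketch below shows how these two metrics could be computed for a handful of candidate map sizes using the minisom library (which names the first metric topographic_error). It reuses the tf-idf matrix X from the previous sketch, and the grid sizes, training length, and hyperparameters are assumptions.

```python
from minisom import MiniSom

candidate_sizes = [(6, 6), (8, 8), (10, 10), (12, 12)]   # illustrative grid sizes
for w, h in candidate_sizes:
    som = MiniSom(w, h, X.shape[1], sigma=1.5, learning_rate=0.5, random_seed=42)
    som.random_weights_init(X)
    som.train_random(X, 5000)
    qe = som.quantization_error(X)    # average distance from each sample to its best-matching unit
    te = som.topographic_error(X)     # share of samples whose 1st and 2nd BMUs are not neighbours
    print(f"{w}x{h}: quantisation error={qe:.3f}, topological error={te:.3f}")
```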
For validation, the process relied on domain experts who examined the categories discovered by the map. A category was defined by the data items falling onto a specific node, plus those within a specified radius around that node. Across all testing cases, these experts successfully inferred the underlying concept addressed by each map category, even when the neighbourhood radius was increased.
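Such a category can be read off the trained map directly. The sketch below, reusing the som, X, and titles names from the earlier sketches, collects the datasets whose best-matching unit lies on a given node or within a chosen grid radius of it; the Chebyshev grid distance and the example node are assumptions.

```python
def node_category(som, X, titles, node, radius=1):
    """Datasets whose best-matching unit is `node` or lies within `radius` grid steps of it."""
    members = []
    for title, x in zip(titles, X):
        i, j = som.winner(x)                     # best-matching unit for this dataset
        if max(abs(i - node[0]), abs(j - node[1])) <= radius:
            members.append(title)
    return members

# e.g. the datasets grouped around node (3, 4) and its immediate neighbours
print(node_category(som, X, titles, node=(3, 4), radius=1))
```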
The resulting SOM visualisations revealed clear groupings of semantically related datasets.
Further analysis using Word Clouds based on the metadata and node features helped visualise the concepts captured by clusters of datasets. For example, one node might clearly display keywords such as Culture, Heritage, and Arts, confirming that the topic is being grouped.
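Word clouds of this kind can be produced with the wordcloud library from the metadata of the datasets grouped onto a node; in the sketch below, the member_docs strings are toy placeholders standing in for that metadata.

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# `member_docs` stands in for the metadata text of the datasets grouped onto one node.
member_docs = ["culture heritage arts sculpture", "public art parks heritage events"]
cloud = WordCloud(width=600, height=400, background_color="white").generate(" ".join(member_docs))
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```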
Application and Conclusion
The developed SOM-based model effectively computed semantic relatedness among datasets, providing strong evidence for the efficacy of the unsupervised approach. This method reveals opportunities for innovation implicit in the data catalogue by making hidden connections explicit.
When incorporated into a search interface, the model generated highly relevant recommendations (a small retrieval sketch follows these examples):
A user searching for the "Luas Stops" dataset (Dublin's tram network) received related resources such as "Traffic Volumes," "Real-Time Passenger Information (RTPI) for Dublin Bus," and "Commercial Bus Services in Ireland".
A search for the "Parks" dataset yielded connections to "Art in the Parks," "DLR Landscape Maintenance," and "Trees".
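A minimal sketch of how such recommendations could be surfaced at query time, again reusing som, X, and titles from the earlier sketches: locate the query dataset's best-matching unit, then return other datasets mapped to nearby nodes. The radius, ranking, and top_k cut-off are assumptions.

```python
def recommend(som, X, titles, query_title, radius=1, top_k=5):
    """Suggest datasets whose best-matching unit is near the query dataset's on the map."""
    q = X[titles.index(query_title)]
    qi, qj = som.winner(q)
    scored = []
    for title, x in zip(titles, X):
        if title == query_title:
            continue
        i, j = som.winner(x)
        d = max(abs(i - qi), abs(j - qj))        # grid distance between the two nodes
        if d <= radius:
            scored.append((d, title))
    return [title for _, title in sorted(scored)[:top_k]]

# e.g. recommend(som, X, titles, "Luas Stops") could surface nearby transport-related datasets
```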
Challenges and Future Work
While the SOM approach proved robust, the research identified ongoing challenges. Specifically, poor-quality metadata and data could negatively affect the model's effectiveness, as it relies heavily on textual descriptions and entity recognition extracted from this metadata.
Future work will focus on integrating the model with other tools to enable the seamless integration of compatible datasets and applying the model to much larger-scale data catalogues beyond the initial 255 used in this study.