Separate TC ingestion data structures from RDF model #349

Draft
ggVGc wants to merge 17 commits into master from valter/tc-station-source-types

@ggVGc ggVGc commented Mar 2, 2026

This is an investigation into separating the data structures used in the TC importers (EtcMetaSource, AtcMetaSource, etc.) from the data model used for interacting with RDF data.
The current design makes it hard to know which fields produced by the importers are actually used when triples are later created for insertion, and it makes the importer implementations messy and hard to follow.

My core concern is showcased here, where we construct a TcStation[T] with a lot of dummy data and duplicated fields for funding and networks.
There are several similar instances, all stemming from the same core problem: the same data models are used both during importing and when interacting with stored data, for example in RdfReader.

The two use cases are quite different, and so is the information available to each. Importers do not care about existing data; they should just parse incoming information into a data model for further processing. The insertion work is handled by IcosMetaFlow, where triples are created by diffing instances of TcState, which is currently the data model produced by the importers.
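To make the "diff two states, then emit triples" idea concrete, here is a minimal sketch. It is not how IcosMetaFlow or TcState are implemented: `State`, `Triple`, `Update`, and the `hasName` predicate are all hypothetical stand-ins, and the assumption that a diff can be computed as a set difference over generated triples is mine.

```scala
object StateDiffSketch {
  // Hypothetical, simplified triple: plain strings instead of RDF terms.
  final case class Triple(subj: String, pred: String, obj: String)

  // Hypothetical stand-in for an importer-produced state: station ID -> station name.
  final case class State(stationNames: Map[String, String])

  // Triples to insert and triples to delete in order to move from one state to another.
  final case class Update(toAdd: Set[Triple], toRemove: Set[Triple])

  // Turn a state into the set of triples that would represent it.
  def triples(state: State): Set[Triple] =
    state.stationNames.map { case (id, name) => Triple(id, "hasName", name) }.toSet

  // Assumption: diffing is a plain set difference over the generated triples.
  def diff(current: State, incoming: State): Update = {
    val cur = triples(current)
    val inc = triples(incoming)
    Update(toAdd = inc -- cur, toRemove = cur -- inc)
  }
}
```

The point of the sketch is that the importer only has to produce the incoming `State`; knowledge of what is already stored enters only at the diffing step.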

A clear example of where the data models clash is UriResource, which holds a URI. URIs are created by RdfMaker from cpId and tcId values, with the URI prefix defined by CpVocab, which maps our data types to IRI prefixes. Importers therefore concern themselves with IDs but not with final URIs, since those belong to our data storage model and are unrelated to the data source being consumed. Because the data types are reused, all the URI fields in the importers are set to dummy values.
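A rough sketch of what the separation could look like for a single type, assuming an importer-side type that carries only IDs and a later step that mints the URI. `ImportedStation`, `StoredStation`, `stationPrefix`, and `toStored` are hypothetical names, and the prefix is a placeholder rather than anything taken from CpVocab.

```scala
import java.net.URI

object IdVsUriSketch {
  // Hypothetical importer-side type: only what the data source provides, no URI field.
  final case class ImportedStation(cpId: String, tcId: String, name: String)

  // Hypothetical storage-side type: the URI exists only on this side.
  final case class StoredStation(uri: URI, name: String)

  // Placeholder for a prefix mapping of the kind the description attributes to CpVocab.
  val stationPrefix: String = "http://example.org/resources/stations/"

  // The URI is minted once, after import, from the cpId alone.
  def toStored(s: ImportedStation): StoredStation =
    StoredStation(new URI(stationPrefix + s.cpId), s.name)
}
```

With this shape there is no URI field for an importer to fill with a dummy value, because the importer never sees a type that has one.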

There is another motivating example in EtcMetaSource, where we use CpVocab to build a URI for an organization. That URI is looked up here in order to create a TcInstrument, which gets turned into a triple here, which in the end uses only the cpId field to construct an Organization URI here.

Followup
This PR, as it stands, is more of an initial examination showing that separating these data models makes sense. I have introduced some sloppily named types covering most of the cases in the importers where we previously used unsuitable data types, and I have done my best to keep things working as before.
I am not sure yet what the proper change should look like, but I believe a first step is to define the importer data models in terms of the logic in RdfMaker, since that is where the triples are finally created, and any data that is not used there is pointless to produce during import.
I also believe every usage of CpVocab in the importers is suspicious and should disappear if we fully separate the data models, but I may be wrong about this.

Comment thread src/main/scala/se/lu/nateko/cp/meta/metaflow/icos/AtcMetaSource.scala Outdated
Comment thread src/main/scala/se/lu/nateko/cp/meta/metaflow/icos/EtcMetaSource.scala Outdated