Separate TC ingestion data structures from RDF model #349
Draft
ggVGc wants to merge 17 commits into
ggVGc
commented
Mar 3, 2026
This is an investigation into separating the data structures used in TC importers, such as EtcMetaSource, AtcMetaSource, etc., and the data model for interacting with RDF data.

The current design makes it hard to know which fields produced by importers are actually used when triples are later created for insertion, and overall makes the importer implementations messy and hard to follow.

My core concern is showcased here, where we construct a TcStation[T] with a lot of dummy data, and duplicated fields for funding and networks.

There are several other similar instances stemming from the same core problem, which is that the same data models are used during importing and when interacting with stored data, for example in RdfReader.

The two use-cases are quite different, and the available information is also different. Importers do not care about existing data, and should just parse incoming information into a data model for further processing. The insertion work is handled by IcosMetaFlow, where triples are created by diffing instances of TcState, which is the data model produced by importers currently.

A clear example of where the data models clash is UriResource, which holds a URI. URIs are created by RdfMaker, based on cpId and tcId values, where the prefix of the URI is defined by CpVocab, which maps our data types to IRI prefixes. So, importers concern themselves with IDs, but not with final URIs, since those are part of our data storage model and unrelated to the data source being consumed. Hence, because of the reuse of data types, all the URI fields in importers are set to dummy values.

There is another motivating example in EtcMetaSource where we use CpVocab to build a URI for an organization, which is looked up here in order to create a TcInstrument, which gets turned into a triple here, which finally ends up only using the cpId field to construct an Organization URI here.

Followup
This PR, as it stands, is more of an initial examination showing that the separation of these data models makes sense. I have introduced some sloppily named types covering most of the cases in the importers where we previously used unsuitable data types, and done my best to keep things working as before.
I am not sure yet how the proper change should look, but I believe a first step is to define the importer data models in terms of the logic in RdfMaker, since that is where the triples are finally created, and any data that is not used there is useless to produce during import.

I also believe every usage of CpVocab in the importers is suspicious, and if we fully separate the data models, it should disappear, but I may be wrong about this.
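
To make the idea concrete, here is a minimal sketch of what an ID-only importer model could look like. None of these names come from the codebase: ImportedInstrument, SketchVocab, SketchMaker, Triple and the example.org IRIs are all hypothetical stand-ins, and the real CpVocab and RdfMaker APIs are not reproduced here. The point is only that the importer-side type carries IDs, while URIs are built in one place at triple-creation time.

```scala
import java.net.URI

// Hypothetical importer-side ID wrapper: importers only know IDs, not URIs.
final case class CpId(value: String)

// Hypothetical importer-side model. The owner is referenced by ID only,
// so no dummy URI and no vocabulary lookup is needed during import.
final case class ImportedInstrument(cpId: CpId, model: String, owner: Option[CpId])

// Stand-in for a vocabulary mapping data types to IRI prefixes.
// This is an assumption for illustration, not the real CpVocab API.
final class SketchVocab(base: String) {
  def instrument(id: CpId): URI = new URI(s"$base/instruments/${id.value}")
  def organization(id: CpId): URI = new URI(s"$base/organizations/${id.value}")
}

// Simplified triple: the object is either a resource URI or a plain literal.
final case class Triple(subj: URI, pred: URI, obj: Either[URI, String])

// Stand-in for the triple-creation step. It is the only place where
// IDs are turned into URIs (again, not the real RdfMaker).
final class SketchMaker(vocab: SketchVocab) {
  private val hasModel = new URI("http://example.org/ontology/hasModel")
  private val hasOwner = new URI("http://example.org/ontology/hasOwner")

  def triples(instr: ImportedInstrument): Seq[Triple] = {
    val subj = vocab.instrument(instr.cpId)
    val modelTriple = Triple(subj, hasModel, Right(instr.model))
    // The owner triple exists only if the source provided an owner ID.
    val ownerTriple = instr.owner.map { id =>
      Triple(subj, hasOwner, Left(vocab.organization(id)))
    }
    modelTriple +: ownerTriple.toSeq
  }
}

object Demo {
  def main(args: Array[String]): Unit = {
    val maker = new SketchMaker(new SketchVocab("http://example.org/resources"))
    val instr = ImportedInstrument(CpId("ETC_42"), "example-model", Some(CpId("org_7")))
    maker.triples(instr).foreach(println)
  }
}
```

With a shape like this, every field on the importer-side type is visibly consumed by the maker, and the importer never needs a vocabulary instance. Whether the real types can be cut down this far depends on what RdfMaker actually reads, which is still to be worked out.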