Skip to content

Latest commit

 

History

History
131 lines (101 loc) · 6.02 KB

File metadata and controls

131 lines (101 loc) · 6.02 KB

URI patterns for SILKNOW data

This is the documentation of the URI design pattern used by the SILKNOW Knowledge Graph. See originally the issue #42. The following SPARQL query provides the number of instances of each type (results):

SELECT DISTINCT count(?s) as ?nb ?type
WHERE {
  ?s a ?type .
  FILTER (contains(str(?type), "erlangen") || contains(str(?type), "forth"))
}
GROUP BY ?type
ORDER BY DESC(?nb)

Main entities

Pattern:

http://data.silknow.org/<group>/<uuid>
# e.g. http://data.silknow.org/production/d4ec41ba-a4d3-3ebb-ba07-8567f1add9cb/

The <group> is taken from this table

Class Group
D1_Digital_Object object
E22_Man-Made_Object object
E73_Information_Object object
E89_Propositional_Object object
E38_Image image
E36_Visual_Item image
E12_Production production
E8_Acquisition event
E10_Transfer_of_Custody event
E9_Move event
E11_Modification event
E78_Collection collection
E14_Condition_Assessment assessment
E53_Place place OR http://sws.geonames.org/ {id of place}
E31_Document document
E40_Legal_Body actor
E21_Person actor
E39_Actor actor

For the metadata predictions (either based on text or image) we created the following two groups, whereas the general pattern is the same as above:

Class Group
prov:Activity activity
rdf:Statement statement

Secondary entities

This group includes entities that cover specific information about the main entities. The URI is realized appending a suffix to the parent main entity.

Pattern if only one instance per main entity is expected:

<uri of the main entity>/<suffix>
# i.e. http://data.silknow.org/object/0359045f-21e1-3488-8bfc-5620ab7ea6d7/dimension/measurement

Pattern if multiple instance per main entity are possible:

<uri of the main entity>/<suffix>/<progressive int>
# i.e. <http://data.silknow.org/object/0359045f-21e1-3488-8bfc-5620ab7ea6d7/type/3>

Patterns specifically for the E54_Dimension class:

The <suffix> is taken from this table:

Class Group Suffix
E17_Type_Assignment object /type/{progressive int}
T19_Object_Domain_Assignment object /domain/{progressive int}
T35_Object_Type_Assignment object /type/{progressive int}
E54_Dimension object /dimension/{1 or 2}
T24_Pattern_Unit object /pattern/{1 or 2}
S4_Observation object /observation/{progressive int}
E16_Measurement object /dimension/measurement
E3_Condition_State object /assessment/{progressive int}
E30_Right object /right/
E30_Right object /image/right/
E52_Time-Span production /time/{progressive int}
E7_Activity production /activity/{progressive int}

UUID and seed generation

The UUID is computed deterministically starting from a seed string. A real UUID taken from an example above looks like this: d4ec41ba-a4d3-3ebb-ba07-8567f1add9cb

The seed is usually generated based on:

  • source (e.g. 'unipa', 'met', ...)
  • class (e.g. 'E12_Production', ...)
  • the id of the current object (filename or crawler ID, which is often the same)
  • Hash function: SHA-1

There are some exceptions to this rule, in order to allow automatic cross-source alignment:

  • For D1 Digital Object, the filename (without extension) and the ID of the JSON file are used.
  • For Places and Actors, we use the label ('Rome' e.g.) instead of the id plus the class, but not the source.
  • For Collection, it's the same case as for Places and Actors, but we also use the source addition.
  • For the Images we use a concatenation of id (internal ID of museum record) + "$$$" + imgCount (How many images does the object have) + this.localFilename
  • For E11 Modification the seed is either the value of the original field, or (if it can be parsed from it) the author or the year of the modification.
  • For rdf:Statement and prov:Activity the UUID is always the same per prediction. The seed is either the according object UUID + the image file name and the predicted class (image-based predictions) or the object UUID + the predicted value + the confidence score + the file name of the prediction csv (text-based).

Examples:

  • For most classes: [source]+[class]+[record_internal_id]
  • For Place (E53) and Actors (E39): [class]+[label]
  • For Collections (E78): [source]+[class]+[label]
  • For Images (E38): [record_internal_id] + "$$$" + [img_count_of_record] + [local_filename]

D1 Digital Object and rdfs:label

We use the property "rdfs:label" to store a combination of the actual internal identifier of a record and the former filename.

  • Example:
    ID_T.125-1992_filename_O65533.json (VAM), where "T.125-1992" is the internal identifier and "O65533.json" is the filename. "ID", "filename" and three "_" get added to every rdfs:label to make it easier to read.

  • ID:
    The ID is taken either from the internal museum / dataset ID, if it has a field with such a number. Or a generated ID from the crawling process.

  • Filename:
    It's the filename of the record before it's conversion, which is usually in the common JSON format into which we pre-process most data. In the case of Garin we convert Excel files, that have been provided to us by them (*.xls). Eventual language information from the filename (like "en" or "es") is removed.