Use OAI-PMH endpoint instead of API to retrieve datasets #2

@utsmok

Hi there, I got linked to this interesting project by my colleague Efe!
I see you're using the Pure API to retrieve the datasets by grabbing publications -> linked datasets -> datasets themselves.
Did you know you can use the public OAI-PMH endpoint of your repository to harvest this data directly, without API keys or rate limits?
The endpoint is here for the VU: https://research.vu.nl/ws/oai?verb=ListRecords&metadataPrefix=oai_cerif_openaire&set=datasets:all

I'm using the metadataPrefix oai_cerif_openaire here because it includes the internal Pure UUID in each entry. That UUID could be used to retrieve more detailed or non-public info from the API if needed. If you retrieve the publications as well, you can also use the UUID to match them up via the related_to field.
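To make the harvesting request above concrete, here is a minimal Python sketch of a ListRecords loop that follows OAI-PMH resumption tokens. The function names (`build_url`, `parse_page`, `harvest`) are hypothetical, only the standard library is used, and it assumes the endpoint returns standard OAI-PMH 2.0 responses; it has not been run against the live VU endpoint.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"  # OAI-PMH 2.0 XML namespace
BASE_URL = "https://research.vu.nl/ws/oai"


def build_url(base_url, resumption_token=None,
              metadata_prefix="oai_cerif_openaire", set_spec="datasets:all"):
    """Build a ListRecords URL (hypothetical helper)."""
    if resumption_token:
        # In OAI-PMH, resumptionToken is exclusive: only the verb goes with it.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords",
                  "metadataPrefix": metadata_prefix,
                  "set": set_spec}
    return base_url + "?" + urllib.parse.urlencode(params)


def parse_page(xml_bytes):
    """Return (list of <record> elements, next resumption token or None)."""
    root = ET.fromstring(xml_bytes)
    error = root.find(OAI_NS + "error")
    if error is not None:
        raise RuntimeError(f"OAI-PMH error {error.get('code')}: {error.text}")
    records = root.findall(f"{OAI_NS}ListRecords/{OAI_NS}record")
    token_el = root.find(f"{OAI_NS}ListRecords/{OAI_NS}resumptionToken")
    # An empty or missing resumptionToken element marks the last page.
    has_token = token_el is not None and token_el.text
    return records, (token_el.text.strip() if has_token else None)


def http_get(url):
    """Fetch a URL and return the raw response body."""
    with urllib.request.urlopen(url, timeout=60) as resp:
        return resp.read()


def harvest(base_url=BASE_URL, fetch=http_get):
    """Yield every record, requesting pages until no token is returned."""
    token = None
    while True:
        records, token = parse_page(fetch(build_url(base_url, token)))
        yield from records
        if not token:
            break


if __name__ == "__main__":
    for i, record in enumerate(harvest()):
        header_id = record.find(f"{OAI_NS}header/{OAI_NS}identifier")
        print(header_id.text if header_id is not None else "(no identifier)")
        if i >= 4:  # stop after a few records when trying it out
            break
```

The `fetch` parameter is there only so the loop can be tried against canned XML; the record payload inside `<metadata>` is left unparsed because its exact layout depends on the metadata format.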

Most institutes around the world have their own OAI-PMH endpoints, especially in Europe, where they facilitate OpenAIRE harvesting, but not all support the same functionality. You can check with the basic requests that list the available record sets and metadata formats; here they are for the VU endpoint:
https://research.vu.nl/ws/oai?verb=ListSets
https://research.vu.nl/ws/oai?verb=ListMetadataFormats
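The two checks above can also be scripted. This is a rough Python sketch, assuming standard OAI-PMH 2.0 responses and that each listing fits in a single response (long ListSets results can be paginated with a resumption token, which is ignored here). The helper names are hypothetical.

```python
import urllib.request
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"  # OAI-PMH 2.0 XML namespace


def parse_set_specs(xml_bytes):
    """Extract all setSpec values from a ListSets response."""
    root = ET.fromstring(xml_bytes)
    return [el.text for el in root.iter(OAI + "setSpec")]


def parse_metadata_prefixes(xml_bytes):
    """Extract all metadataPrefix values from a ListMetadataFormats response."""
    root = ET.fromstring(xml_bytes)
    return [el.text for el in root.iter(OAI + "metadataPrefix")]


def supports_dataset_harvest(sets_xml, formats_xml,
                             set_spec="datasets:all",
                             prefix="oai_cerif_openaire"):
    """True if the endpoint advertises both the dataset set and the prefix."""
    return (set_spec in parse_set_specs(sets_xml)
            and prefix in parse_metadata_prefixes(formats_xml))


def fetch(base_url, verb):
    """Issue a bare OAI-PMH request such as ?verb=ListSets."""
    with urllib.request.urlopen(f"{base_url}?verb={verb}", timeout=30) as resp:
        return resp.read()


if __name__ == "__main__":
    base = "https://research.vu.nl/ws/oai"
    ok = supports_dataset_harvest(fetch(base, "ListSets"),
                                  fetch(base, "ListMetadataFormats"))
    print("datasets harvestable via oai_cerif_openaire:", ok)
```

Swapping `base` for another institute's endpoint gives a quick way to see whether it exposes datasets as a separate set.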

Unfortunately, in my experience not many repos supply datasets as a separate item, nor do they always include detailed metadata, but yours (and ours at https://ris.utwente.nl/ws/oai) do!

This all uses the ancient but well-documented OAI-PMH protocol. You can read more about the OpenAIRE specs for institute repos here, the (also ancient) CERIF specifications here, and the standard metadata format (Dublin Core) specs here.

I'm working on a more general harvester/aggregator for research metadata; the source can be found here. I also gave a short talk recently at the OpenAlex community meetup, which you can view here. Feel free to let me know if I can help out somewhere!
