Data Sources

This project uses data from various sources that are openly licensed or in the public domain. Below are the sources and their respective information:

arXiv

Description: arXiv is a free distribution service and an open-access archive for scholarly articles in physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. All arXiv articles are available under various open licenses or are in the public domain.

API documentation link:

API information:

No API key required
Query limit: No hard limit, but arXiv asks clients to make no more than one request every three seconds
Data returned in Atom XML format
Supports search by fields: title (ti), author (au), abstract (abs), comment (co), journal reference (jr), subject category (cat), report number (rn), id, all (searches all fields), and submittedDate (date filter)
Metadata includes licensing information for each paper
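
The field prefixes above can be combined into a single search query. The following is a minimal sketch, not the project's own code: the endpoint URL, the function names, and the hand-written sample feed are assumptions made for illustration. It builds a query URL and extracts the id and title of each Atom entry.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Assumed endpoint; confirm it in the arXiv API documentation.
ARXIV_API_URL = "http://export.arxiv.org/api/query"
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def build_arxiv_url(terms, start=0, max_results=10):
    """Join (field, value) pairs with AND and return a query URL."""
    search_query = " AND ".join(f"{field}:{value}" for field, value in terms)
    params = {
        "search_query": search_query,
        "start": start,
        "max_results": max_results,
    }
    return f"{ARXIV_API_URL}?{urllib.parse.urlencode(params)}"


def parse_entries(atom_xml):
    """Return (id, title) pairs from an Atom feed string."""
    root = ET.fromstring(atom_xml)
    entries = []
    for entry in root.findall("atom:entry", ATOM_NS):
        entry_id = entry.findtext("atom:id", default="", namespaces=ATOM_NS)
        title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        # Titles may wrap across lines, so collapse internal whitespace.
        entries.append((entry_id.strip(), " ".join(title.split())))
    return entries


# Hand-written sample feed for illustration only, not real API output.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <title>An Example
      Title</title>
  </entry>
</feed>"""

if __name__ == "__main__":
    print(build_arxiv_url([("cat", "cs.AI"), ("ti", "license")], max_results=5))
    print(parse_entries(SAMPLE_FEED))
```

A real client would fetch the URL and pass the response body to parse_entries, pausing between consecutive requests.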

CC Legal Tools

Description: A .txt file provided by Timid Robot containing all legal tool paths.

API documentation link:

google_custom_search/legal-tool-paths.txt: a list of all current Creative Commons (CC) legal tool paths
data/prioritized-tool-urls.txt: a prioritized list of all current CC legal tool URLs

API information:

No API key required
No query limits
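
As a rough illustration of how a list of legal tool paths might be turned into full URLs, here is a small sketch. It assumes one path per line, that blank lines can be ignored, and that paths are relative to https://creativecommons.org/; the function name and sample lines are hypothetical and not taken from the project.

```python
# Assumed base URL for resolving legal tool paths.
CC_BASE_URL = "https://creativecommons.org/"


def paths_to_urls(lines):
    """Convert an iterable of path lines into absolute legal tool URLs."""
    urls = []
    for line in lines:
        path = line.strip()
        if not path:
            continue  # skip blank lines
        urls.append(CC_BASE_URL + path.lstrip("/"))
    return urls


# Hypothetical sample of what lines in the .txt file could look like.
sample_lines = ["licenses/by/4.0/\n", "\n", "/publicdomain/zero/1.0/\n"]

if __name__ == "__main__":
    for url in paths_to_urls(sample_lines):
        print(url)
```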

Europeana

Description: The Europeana Search API provides access to digital cultural heritage metadata records aggregated from museums, libraries, and archives across Europe. This project uses the API to fetch aggregated counts of cultural heritage records by data provider, rights statement, and theme.

Official API Documentation:

Search API Documentation
- Themes are listed in the Search API Request Parameter accordion

API information:

API key required
Minimum 0.003 seconds between queries
Query parameters allow:
- Full-text searching (query)
- Retrieving metadata facets (profile=facets)
- Filtering by data provider, rights statement, and theme
Data available through JSON format
Offset-based pagination
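
The parameters above can be combined into one facet request. The sketch below is illustrative only: the endpoint, the wskey, facet, rows, start, and theme parameter names, and the DATA_PROVIDER and RIGHTS facet names are assumptions to be checked against the Search API documentation, and the helper names are hypothetical.

```python
import urllib.parse

# Assumed endpoint; confirm it in the Search API documentation.
EUROPEANA_SEARCH_URL = "https://api.europeana.eu/record/v2/search.json"


def build_facet_url(api_key, facets, theme=None, rows=0, start=1):
    """Build a search URL that requests facet counts for all records."""
    params = [
        ("wskey", api_key),        # API key (assumed parameter name)
        ("query", "*"),            # match every record
        ("profile", "facets"),     # ask for metadata facets
        ("rows", rows),
        ("start", start),
    ]
    params.extend(("facet", name) for name in facets)
    if theme is not None:
        params.append(("theme", theme))
    return f"{EUROPEANA_SEARCH_URL}?{urllib.parse.urlencode(params)}"


def page_starts(total, rows, first=1):
    """Offsets for offset-based pagination, assuming a 1-based first record."""
    return list(range(first, first + total, rows))


if __name__ == "__main__":
    print(build_facet_url("YOUR_API_KEY", ["DATA_PROVIDER", "RIGHTS"], theme="art"))
    print(page_starts(250, 100))
```

Requesting rows=0 is one way to retrieve only facet counts without record bodies, if the API accepts it. A client would also sleep between requests to respect the minimum interval.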

GCS (Google Custom Search) JSON API

Description: The Custom Search JSON API lets users run detailed, user-defined queries against a programmable search engine and retrieve the associated result data.

Admin links:

API documentation links:

Custom Search JSON API Reference | Programmable Search Engine | Google Developers
Google API Python Client Library
- Google API Client Library for Python Docs | google-api-python-client
  - Reference documentation for the core library googleapiclient.
    - See: googleapiclient.discovery > build
  - Library reference documentation by API
    - See Custom Search v1 cse()
Method: cse.list | Custom Search JSON API | Google Developers
XML API reference appendices

API information:

API key required
Query limit: 100 queries per day
Data available through JSON format

Notes:

Because of the data collection quota, the Google Custom Search data covers only the 50+ most significant general categories of CC licenses. In addition, the licenses in the first column of the collected data are sorted by order of precedence, a result of intermediate data analysis.
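
To show how the daily quota shapes data collection, here is a minimal sketch that builds a cse.list request URL and splits a list of queries into per-day batches. The REST endpoint, the key, cx, q, and linkSite parameter usage, and the helper names are assumptions for illustration; the project itself may use the Google API Python client instead.

```python
import urllib.parse

# Assumed REST endpoint for cse.list; key and cx values are placeholders.
GCS_URL = "https://www.googleapis.com/customsearch/v1"
DAILY_QUOTA = 100  # queries per day, from the API information above


def build_cse_url(api_key, cx, query, link_site=None):
    """Build a Custom Search request URL, optionally restricted by linkSite."""
    params = {"key": api_key, "cx": cx, "q": query}
    if link_site:
        params["linkSite"] = link_site
    return f"{GCS_URL}?{urllib.parse.urlencode(params)}"


def daily_batches(items, quota=DAILY_QUOTA):
    """Split items into consecutive batches that each fit in one day's quota."""
    return [items[i:i + quota] for i in range(0, len(items), quota)]


if __name__ == "__main__":
    tool_urls = [f"example-tool-{n}" for n in range(250)]  # hypothetical inputs
    for day, batch in enumerate(daily_batches(tool_urls), start=1):
        print(f"day {day}: {len(batch)} queries")
```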

GitHub

Description: A development platform for hosting and managing code.

API documentation link:

GitHub REST API v3

API information:

API key not required but recommended by GitHub
Query limit: 60 requests per hour if unauthenticated, 5000 requests per hour if authenticated
Data available through JSON format
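
The difference between the two rate limits can be made concrete with a short sketch. It assumes a personal access token sent as a Bearer token and requests spaced evenly across the hour; the helper names are hypothetical.

```python
# Hourly limits from the API information above.
LIMITS_PER_HOUR = {"unauthenticated": 60, "authenticated": 5000}


def request_headers(token=None):
    """Return request headers, adding authentication when a token is given."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers


def min_interval_seconds(token=None):
    """Seconds to wait between requests to stay within the hourly limit."""
    key = "authenticated" if token else "unauthenticated"
    return 3600 / LIMITS_PER_HOUR[key]


if __name__ == "__main__":
    print(min_interval_seconds())         # spacing without a token
    print(min_interval_seconds("TOKEN"))  # spacing with a token
```

Under these assumptions, an unauthenticated client can send roughly one request per minute, while an authenticated client can send more than one per second.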

Openverse

Description: Openverse is a search engine for openly licensed media, including images and audio. It provides access to over 700 million works from more than 20 sources, all of which are under Creative Commons licenses or in the public domain. The API allows querying for media by source, license type, and other parameters. Because anonymous Openverse API access reports at most ~240 results per source-license combination, the openverse_fetch.py script currently provides only approximate counts. The script does not paginate results or break counts down by license_version.

API documentation link:

API information:

No API key required for basic access
Query limit: Rate-limited to prevent abuse (anonymous access provides ~240 results per source-license combination)
Data available through JSON format
Supports filtering by source, license, and media type
Media types: images, audio
Supported licenses: by, by-nc, by-nc-nd, by-nc-sa, by-nd, by-sa, cc0, nc-sampling+, pdm, sampling+
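
The sketch below shows one way to enumerate source-license combinations and flag counts that hit the anonymous cap. It is not the openverse_fetch.py implementation: the base URL, the source, license, and page_size parameter names, the example source names, and the helper names are all assumptions for illustration.

```python
import itertools
import urllib.parse

OPENVERSE_API_URL = "https://api.openverse.org/v1"  # assumed base URL
ANONYMOUS_CAP = 240  # approximate anonymous cap described above


def build_count_url(media_type, source, license_code):
    """Build a query URL for one media type, source, and license."""
    params = {"source": source, "license": license_code, "page_size": 1}
    # urlencode escapes the "+" in codes such as "sampling+".
    return f"{OPENVERSE_API_URL}/{media_type}/?{urllib.parse.urlencode(params)}"


def source_license_pairs(sources, licenses):
    """Every (source, license) combination to query."""
    return list(itertools.product(sources, licenses))


def is_approximate(result_count, cap=ANONYMOUS_CAP):
    """A count at or above the cap is only a lower bound."""
    return result_count >= cap


if __name__ == "__main__":
    # Hypothetical source names, used only as placeholders.
    for source, lic in source_license_pairs(["flickr", "wikimedia"], ["by", "cc0"]):
        print(build_count_url("images", source, lic))
```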

Wikipedia

Description: The Wikipedia API allows users to query statistics for pages, categories, and revisions from a public API endpoint. This project uses two URLs: WIKIPEDIA_BASE_URL and WIKIPEDIA_MATRIX_URL. WIKIPEDIA_BASE_URL provides access to articles, categories, and metadata from English Wikipedia. It runs on the MediaWiki Action API, but this instance serves English Wikipedia data only. WIKIPEDIA_MATRIX_URL provides information about all Wikimedia projects, including the different language editions of Wikipedia. It runs on the Meta-Wiki API.

API documentation links:

WIKIPEDIA_BASE_URL documentation
WIKIPEDIA_BASE_URL reference page
WIKIPEDIA_MATRIX_URL documentation
WIKIPEDIA_MATRIX_URL reference page

API information:

No API key required
Query limit: Rate-limited only to prevent abuse
Data available through XML or JSON format
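
As an illustration of how the two URLs might be queried, the sketch below builds a site statistics request and a site matrix request, then reads the article count from a JSON response. The URL values, the chosen query parameters, and the hand-written sample response are assumptions; the project's actual constants may differ.

```python
import json
import urllib.parse

# Assumed values for the two URLs named above.
WIKIPEDIA_BASE_URL = "https://en.wikipedia.org/w/api.php"
WIKIPEDIA_MATRIX_URL = "https://meta.wikimedia.org/w/api.php"


def statistics_url(base_url=WIKIPEDIA_BASE_URL):
    """URL for site-wide statistics from the MediaWiki Action API."""
    params = {
        "action": "query",
        "meta": "siteinfo",
        "siprop": "statistics",
        "format": "json",
    }
    return f"{base_url}?{urllib.parse.urlencode(params)}"


def sitematrix_url(matrix_url=WIKIPEDIA_MATRIX_URL):
    """URL for the list of Wikimedia projects and language editions."""
    params = {"action": "sitematrix", "format": "json"}
    return f"{matrix_url}?{urllib.parse.urlencode(params)}"


def article_count(response_text):
    """Read the article count from a siteinfo statistics response."""
    data = json.loads(response_text)
    return data["query"]["statistics"]["articles"]


# Hand-written sample response for illustration only, not real data.
SAMPLE_RESPONSE = '{"query": {"statistics": {"pages": 10, "articles": 4, "edits": 25}}}'

if __name__ == "__main__":
    print(statistics_url())
    print(sitematrix_url())
    print(article_count(SAMPLE_RESPONSE))
```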
