Data Sources

This project uses data from various sources that are openly licensed or in the public domain. Below are the sources and their respective information:

arXiv

Description: arXiv is a free distribution service and an open-access archive for scholarly articles in physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. All arXiv articles are available under various open licenses or are in the public domain.

API documentation link:

API information:

  • No API key required
  • Query limit: No official limit, but requests should be made responsibly
  • Data available through Atom XML format
  • Supports search by fields: title (ti), author (au), abstract (abs), comment (co), journal reference (jr), subject category (cat), report number (rn), id, all (searches all fields), and submittedDate (date filter)
  • Metadata includes licensing information for each paper
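The field prefixes listed above can be combined into a query URL, and the Atom response can be read with the standard library. The following is a minimal sketch, not the project's actual fetch code: the endpoint `http://export.arxiv.org/api/query`, the parameter names `search_query`, `start`, and `max_results`, and the helper names `build_arxiv_url` and `parse_titles` are assumptions made for illustration.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Assumed endpoint; confirm against the arXiv API documentation.
ARXIV_API_URL = "http://export.arxiv.org/api/query"
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def build_arxiv_url(field, term, start=0, max_results=10):
    """Build a query URL using a field prefix such as ti, au, abs, or cat."""
    params = {
        "search_query": f"{field}:{term}",
        "start": start,
        "max_results": max_results,
    }
    return f"{ARXIV_API_URL}?{urlencode(params)}"


def parse_titles(atom_xml):
    """Return the stripped title of every <entry> in an Atom feed string."""
    root = ET.fromstring(atom_xml)
    return [
        entry.findtext("atom:title", default="", namespaces=ATOM_NS).strip()
        for entry in root.findall("atom:entry", ATOM_NS)
    ]
```

For example, `build_arxiv_url("cat", "cs.AI")` produces a URL that searches the subject category field. Because there is no official query limit, a caller would still be expected to space requests out responsibly.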

CC Legal Tools

Description: A .txt file provided by Timid Robot containing all legal tool paths.

API documentation link:

API information:

  • No API key required
  • No query limits

Europeana

Description: The Europeana Search API provides access to digital cultural heritage metadata records aggregated from museums, libraries, and archives across Europe. This project uses the API to fetch aggregated counts of cultural heritage records by data provider, rights statement, and theme.

Official API Documentation:

API information:

  • API key required
  • Minimum 0.003 seconds between queries
  • Query parameters allow:
    • Full-text searching (query)
    • Retrieving metadata facets (profile=facets)
    • Filtering by data provider, rights statement, and theme
  • Data available through JSON format
  • Offset-based pagination
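Offset-based pagination combined with the minimum delay between queries could look roughly like the sketch below. The endpoint URL, the `wskey`, `rows`, and `start` parameter names, the 1-based offset, and the helper names `page_params` and `throttled` are assumptions for illustration and should be checked against the official documentation.

```python
import time

# Assumed endpoint; confirm against the Europeana Search API documentation.
EUROPEANA_SEARCH_URL = "https://api.europeana.eu/record/v2/search.json"
MIN_DELAY_SECONDS = 0.003  # minimum spacing between queries, from the list above


def page_params(api_key, query, total, rows=100):
    """Yield one parameter dict per page, advancing the offset by `rows`."""
    # Assumption: the offset is 1-based and passed as `start`.
    for start in range(1, total + 1, rows):
        yield {
            "wskey": api_key,
            "query": query,
            "profile": "facets",
            "rows": rows,
            "start": start,
        }


def throttled(items, delay=MIN_DELAY_SECONDS, sleep=time.sleep):
    """Yield items, sleeping `delay` seconds between consecutive items."""
    first = True
    for item in items:
        if not first:
            sleep(delay)
        first = False
        yield item
```

A caller would iterate `throttled(page_params(key, "*", total))` and send each parameter dict to the search endpoint. The `sleep` argument is injectable so the spacing logic can be tested without waiting.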

GCS (Google Custom Search) JSON API

Description: The Custom Search JSON API lets users run detailed, user-defined queries and retrieve the related result data through a Programmable Search Engine.

Admin links:

API documentation links:

API information:

  • API key required
  • Query limit: 100 queries per day
  • Data available through JSON format

Notes:

  • Because of the data collection quota, the Google Custom Search data covers only the 50+ most general and most significant CC license categories. In addition, the first column of the collected data is sorted by license order of precedence, which is a result of the intermediate data analysis.
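One way to stay within the daily quota is to split the planned queries into batches of at most 100 and run one batch per day. The sketch below illustrates that idea only. The endpoint and the `key`, `cx`, and `q` parameters follow the public Custom Search JSON API, while the helper names and the use of a license path as the query term are hypothetical.

```python
GCS_URL = "https://www.googleapis.com/customsearch/v1"
DAILY_QUOTA = 100  # queries per day, from the list above


def build_params(api_key, engine_id, query):
    """Return the query parameters for a single search request."""
    return {"key": api_key, "cx": engine_id, "q": query}


def daily_batches(queries, quota=DAILY_QUOTA):
    """Split queries into consecutive batches of at most `quota` items."""
    return [queries[i:i + quota] for i in range(0, len(queries), quota)]
```

With 250 planned queries, `daily_batches` returns three batches of 100, 100, and 50 items, so the full collection would take three days.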

GitHub

Description: A development platform for hosting and managing code.

API documentation link:

API information:

  • API key not required but recommended by GitHub
  • Query limit: 60 requests per hour if unauthenticated, 5000 requests per hour if authenticated
  • Data available through JSON format
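The difference between the two rate limits can be expressed as a minimum spacing between requests. The following sketch assumes a personal access token passed as a bearer token. The helper names `build_headers` and `min_interval_seconds` are hypothetical and not part of the project.

```python
UNAUTHENTICATED_LIMIT = 60   # requests per hour without a token
AUTHENTICATED_LIMIT = 5000   # requests per hour with a token


def build_headers(token=None):
    """Return request headers, adding authorization only when a token is given."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers


def min_interval_seconds(token=None):
    """Return the average spacing between requests that stays within the hourly limit."""
    limit = AUTHENTICATED_LIMIT if token else UNAUTHENTICATED_LIMIT
    return 3600 / limit
```

Without a token the spacing works out to one request per minute, while an authenticated client can issue a request roughly every 0.72 seconds, which is why a key is recommended.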

Openverse

Description: Openverse is a search engine for openly licensed media, including images and audio. It provides access to over 700 million works from more than 20 sources, all of which are under Creative Commons licenses or in the public domain. The API allows querying for media by source, license type, and other parameters. Because anonymous Openverse API access returns at most ~240 results per source-license combination, the openverse_fetch.py script currently provides only approximate counts. The script does not paginate and does not break counts down by license_version.

API documentation link:

API information:

  • No API key required for basic access
  • Query limit: Rate-limited to prevent abuse (anonymous access provides ~240 results per source-license combination)
  • Data available through JSON format
  • Supports filtering by source, license, media type (images, audio)
  • Media types: images, audio
  • Supported licenses: by, by-nc, by-nc-nd, by-nc-sa, by-nd, by-sa, cc0, nc-sampling+, pdm, sampling+
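The source-license combinations and the anonymous result cap can be sketched as follows. This is an illustration and not the contents of openverse_fetch.py: the cap is treated as exactly 240 for simplicity, and the helper names and example source names are hypothetical.

```python
from itertools import product

LICENSES = [
    "by", "by-nc", "by-nc-nd", "by-nc-sa", "by-nd",
    "by-sa", "cc0", "nc-sampling+", "pdm", "sampling+",
]
ANONYMOUS_CAP = 240  # assumption: the "~240" cap treated as an exact value


def source_license_pairs(sources, licenses=LICENSES):
    """Return every (source, license) combination to query."""
    return list(product(sources, licenses))


def label_count(result_count, cap=ANONYMOUS_CAP):
    """Return (count, is_approximate); counts at the cap are likely truncated."""
    return result_count, result_count >= cap
```

Two sources produce 20 combinations. A count that reaches the cap should be read as a lower bound and not as the true total.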

Wikipedia

Description: The Wikipedia API lets users query statistics about pages, categories, and revisions from a public API endpoint. This project uses two URLs: WIKIPEDIA_BASE_URL and WIKIPEDIA_MATRIX_URL. WIKIPEDIA_BASE_URL provides access to articles, categories, and metadata from English Wikipedia. It runs on the MediaWiki Action API, but this instance serves only English Wikipedia data. WIKIPEDIA_MATRIX_URL provides access to information about all Wikimedia projects, including the different language editions of Wikipedia. It runs on the Meta-Wiki API.

API documentation links:

  • WIKIPEDIA_BASE_URL documentation
  • WIKIPEDIA_BASE_URL reference page
  • WIKIPEDIA_MATRIX_URL documentation
  • WIKIPEDIA_MATRIX_URL reference page

API information:

  • No API key required
  • Query limit: Rate-limited only to prevent abuse
  • Data available through XML or JSON format
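Requests to the two endpoints could be assembled as in the sketch below. The URL values assigned to WIKIPEDIA_BASE_URL and WIKIPEDIA_MATRIX_URL are assumptions, since the project's actual values are not shown here. The `siteinfo` statistics query and the `sitematrix` action are standard MediaWiki Action API modules, and the helper names are hypothetical.

```python
from urllib.parse import urlencode

# Assumed values; the project's actual constants may differ.
WIKIPEDIA_BASE_URL = "https://en.wikipedia.org/w/api.php"
WIKIPEDIA_MATRIX_URL = "https://meta.wikimedia.org/w/api.php"


def statistics_params():
    """Parameters requesting site-wide statistics from one wiki, as JSON."""
    return {
        "action": "query",
        "meta": "siteinfo",
        "siprop": "statistics",
        "format": "json",
    }


def sitematrix_params():
    """Parameters requesting the list of Wikimedia projects, as JSON."""
    return {"action": "sitematrix", "format": "json"}


def build_url(base_url, params):
    """Join a base URL and a parameter dict into a full request URL."""
    return f"{base_url}?{urlencode(params)}"
```

Setting `format` to `json` selects JSON over XML. The same `build_url` helper works for both endpoints because they share the Action API query-string convention.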