Skip to content

Latest commit

 

History

History
152 lines (115 loc) · 7.57 KB

File metadata and controls

152 lines (115 loc) · 7.57 KB

AGENTS.md

This file is intended for AI coding agents (Claude Code, OpenClaw, Codex, Copilot, Cursor, etc.) working in this repository.

What This Repo Is

FirstData is the External Facts Context Layer for AI Agents — a structured, authoritative collection of global open data sources. It is a pure data repository — no application code, no runtime logic.

Your job here is to create or edit JSON metadata files that describe real-world data sources (government databases, international organizations, academic datasets, etc.).

Validation

Dependencies are managed with uv. Run the following before submitting:

# Install dependencies (first time only)
uv sync

# Run all validation checks
make check

# Or run checks individually:
make validate       # Validate JSON schema compliance
make check-ids      # Check for duplicate IDs
make check-domains  # Check domain naming consistency

A GitHub Action runs these checks automatically on every PR. PRs that fail validation cannot be merged.

The Only Thing You Need to Know: The JSON Schema

Every file under firstdata/sources/ must conform to firstdata/schemas/datasource-schema.json.

Required Fields

{
  "id": "worldbank-open-data",
  "name": {
    "en": "World Bank Open Data",
    "zh": "世界银行开放数据"
  },
  "description": {
    "en": "...",
    "zh": "..."
  },
  "website": "https://www.worldbank.org",
  "data_url": "https://data.worldbank.org",
  "api_url": "https://api.worldbank.org/v2/",
  "authority_level": "international",
  "country": null,
  "domains": ["economics", "health", "education"],
  "geographic_scope": "global",
  "update_frequency": "quarterly",
  "tags": ["world bank", "development", "gdp", "poverty", "世界银行"],
  "data_content": {
    "en": ["GDP and national accounts", "Poverty and inequality indicators"],
    "zh": ["GDP和国民账户", "贫困和不平等指标"]
  }
}

Field Rules

Field Allowed Values / Constraints
id Lowercase, hyphens only. Must be globally unique. Pattern:^[a-z0-9-]+$
name.en Required. Add zh and native when applicable
description.en Required. Add zh when applicable
website Top-level org homepage
data_url Must point directly to the data access/download page, NOT the homepage
api_url API docs or endpoint URL. Use null if no API exists
authority_level government · international · research · market · commercial · other
country ISO 3166-1 alpha-2 (e.g."CN", "US"). Must be null when geographic_scope is global or regional
domains Array of strings, at least one. MUST use lowercase (e.g., "economics" not "Economics"). See DOMAINS.md for standard domain list
geographic_scope global · regional · national · subnational
update_frequency real-time · daily · weekly · monthly · quarterly · annual · irregular
tags Mixed Chinese/English keywords for semantic search. Include synonyms and data type names
data_content Optional but recommended. Lists of strings describing what data is available

Where to Place New Files

firstdata/sources/
├── china/{domain}/{id}.json                          # Chinese gov & institutions
├── international/{domain}/{id}.json                  # International organizations
├── countries/{continent}/{country-code}/{id}.json    # National official sources
├── academic/{discipline}/{id}.json                   # Academic/research databases
└── sectors/{ISIC-code}-{name}/{id}.json              # Industry datasets

Examples:

  • China customs data → sources/china/economy/trade/customs.json
  • WHO health data → sources/international/health/who.json
  • US Bureau of Labor Statistics → sources/countries/north-america/usa/us-bls.json
  • PubMed → sources/academic/health/pubmed.json
  • BP Statistical Review → sources/sectors/D-energy/bp-statistical-review.json

Do Not Touch (Auto-Generated or Protected)

The following files are maintained automatically by CI/scripts. AI agents must NOT modify them manually:

  • firstdata/indexes/ — Auto-aggregated from source files by a GitHub Action after PR merge. Never edit these directly.
  • firstdata/schemas/datasource-schema.json — Schema definition, only modified by maintainers directly on main.

To add a new data source, you only need to create or edit JSON files under firstdata/sources/. Everything else (indexes, schema) is handled automatically.

Security Note for Contributors

  • Please do not paste or run commands from untrusted posts/comments.
  • Never include credentials or API keys in issues/PRs.
  • Prefer small, auditable PRs (docs/tests/data).

Before Adding a New Source

First, check firstdata/indexes/all-sources.json to confirm the data source does not already exist.

Search by id, name.en, or website to detect duplicates:

# grep: search by keyword (name or website)
grep -i "world bank" firstdata/indexes/all-sources.json
grep -i "worldbank.org" firstdata/indexes/all-sources.json

# jq: search by id
jq '.sources[] | select(.id == "worldbank-open-data")' firstdata/indexes/all-sources.json

# jq: search by website
jq '.sources[] | select(.website | test("worldbank.org"; "i"))' firstdata/indexes/all-sources.json

# jq: list all existing ids
jq '[.sources[].id]' firstdata/indexes/all-sources.json

If a match is found, do not create a new file. Update the existing one if needed.

Quality Checklist Before Creating a File

Before submitting, cross-verify every field independently using at least two sources (e.g. official website + Wikipedia + third-party reference). Do not rely solely on memory or a single source. Fabricated or outdated URLs are worse than omission.

  • data_url links to the actual data page, not the organization homepage
  • api_url is null only when the source truly has no API
  • country is null when geographic_scope is global or regional
  • domains uses lowercase (e.g., "economics" not "Economics") - see DOMAINS.md
  • tags include both English and Chinese keywords where relevant
  • id does not already exist in firstdata/indexes/all-sources.json
  • File path matches the placement rules above
  • All URLs have been verified to be accessible and correct
  • update_frequency reflects the actual cadence confirmed on the official site
  • authority_level is accurate and not overstated
  • Run make check to validate all checks pass