Here’s the full summary for a new chat:
Project: CongressWatch
Live site: congresswatch.vercel.app
Repo: github.com/OpenSourcePatents/Congresswatch
Stack: Plain HTML/JS frontend, Python data pipelines, GitHub Actions, Vercel hosting, JSON flat file database (upgrading to Supabase eventually)
What it does:
Pulls public government data on every member of Congress (538 records in the current dataset) and generates an anomaly score (0-100) based on campaign finance, stock trades, voting record, missed votes, PAC money ratio, and bill authorship patterns. Every number links to its original government source.
Repo structure:
∙ .github/workflows/fetch-data.yml — daily workflow, runs Python inline (NOT fetch.py), pulls all members from Congress.gov API, saves to data/members.json
∙ .github/workflows/fetch-finance.yml — daily workflow, runs fetch_finance.py from repo root
∙ fetch_finance.py — enriches members.json with FEC, GovTrack, EDGAR, computes anomaly scores
∙ fetch.py — exists in repo root but is NOT called by any workflow, ignore it
∙ data/members.json — main data file, single source of truth for frontend
∙ data/stats.json — summary stats
∙ index.html — entire frontend, single file
GitHub Secrets set:
∙ CONGRESS_API_KEY — Congress.gov API
∙ FEC_API_KEY — FEC open API
members.json schema:
{
"id": "bioguideId",
"name": "First Last",
"party": "Democratic|Republican|Independent|Unknown",
"state": "StateName",
"district": "number string or empty",
"chamber": "Senate|House",
"photo_url": "https://bioguide.congress.gov/bioguide/photo/X/XXXXXXX.jpg",
"term_start": "YYYY-01-01",
"score": 0,
"flags": [],
"updated": "ISO timestamp",
"total_raised": 0,
"total_raised_display": "$0M",
"pac_contributions": 0,
"individual_contributions": 0,
"cash_on_hand": 0,
"top_donors_list": [],
"top_donors": "",
"edgar_trade_count": 0,
"total_trades": 0,
"missed_votes_pct": 0,
"votes_with_party_pct": 0,
"govtrack_id": "",
"ideology_score": null,
"leadership_score": null,
"fec_candidate_id": "",
"fec_committee_id": ""
}
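Since data/members.json is the single source of truth for the frontend, a small check on each record before writing the file can catch pipeline regressions early. This is a minimal sketch, not code from the repo: validate_member and the choice of which fields to treat as required are assumptions, and only the field names and allowed values come from the schema above.

```python
# Hypothetical validator for one members.json record.
# Field names and allowed values mirror the schema above; treating
# these particular fields as "required" is an assumption.
REQUIRED_FIELDS = {
    "id": str,
    "name": str,
    "party": str,
    "state": str,
    "district": str,
    "chamber": str,
    "score": (int, float),
    "flags": list,
}
ALLOWED_CHAMBERS = {"Senate", "House"}
ALLOWED_PARTIES = {"Democratic", "Republican", "Independent", "Unknown"}


def validate_member(member: dict) -> list[str]:
    """Return a list of human-readable problems; empty means the record passed."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in member:
            problems.append(f"missing field: {field}")
        elif not isinstance(member[field], expected_type):
            problems.append(f"wrong type for {field}")

    chamber = member.get("chamber")
    if isinstance(chamber, str) and chamber not in ALLOWED_CHAMBERS:
        problems.append(f"invalid chamber: {chamber}")

    party = member.get("party")
    if isinstance(party, str) and party not in ALLOWED_PARTIES:
        problems.append(f"invalid party: {party}")

    score = member.get("score")
    if isinstance(score, (int, float)) and not 0 <= score <= 100:
        problems.append(f"score out of range: {score}")

    return problems
```

Running this over every record at the end of fetch_finance.py and failing the workflow on a non-empty result would be one way to keep a bad fetch from reaching the live site.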
Current chamber counts: 106 Senate, 432 House (538 total). The Senate has 100 seats, so the Senate figure is a slight overcount; acceptable for now.
What’s working:
∙ Congress.gov member fetch. Chamber inference is fixed: a member with a district is House; a member with no district who is not from a territory is Senate. The terms loop was removed.
∙ GovTrack voting stats pulling correctly (no key needed)
∙ FEC pipeline running, but with a known bug (see bug 1 below)
∙ Frontend fully built, reads all fields from members.json correctly
∙ Anomaly scoring engine running
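The chamber inference rule in the first bullet can be sketched as below. This is an illustration, not the inline code from fetch-data.yml: infer_chamber and the TERRITORIES set are hypothetical names, and sending territory members with no district to the House is an assumption (delegates and the resident commissioner sit in the House, but the summary does not say how the workflow handles them).

```python
# Hypothetical list of non-state jurisdictions; adjust the spellings to
# whatever the Congress.gov API actually returns in the state field.
TERRITORIES = {
    "American Samoa",
    "District of Columbia",
    "Guam",
    "Northern Mariana Islands",
    "Puerto Rico",
    "Virgin Islands",
}


def infer_chamber(state: str, district) -> str:
    """Infer the chamber from state and district, per the rule above."""
    has_district = district not in (None, "")
    if has_district:
        return "House"
    if state in TERRITORIES:
        # Assumption: a territory member with no district is a House delegate.
        return "House"
    return "Senate"
```

Given the 106 Senate count, one thing worth checking is whether any House members arrive with an empty district value, since this rule would classify them as senators.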
Known bugs to fix:
1. FEC not populating for many members — is_active_candidate: True in fetch_fec_candidate() filters out members who haven’t run recently. Fix: remove that param.
2. EDGAR trade search broken — URL is hardcoded garbage, not actually searching by member name. Needs a real implementation.
3. EDGAR name matching problem (expert advice) — Congress members file under legal name variations that don’t match Congress.gov or FEC profiles. Need to build a name normalization layer first, then use that to query EDGAR full-text search. Skipping this means silently wrong results for many members.
4. Net worth and salary charts are fake — estimated from score, not real data. No real net worth source yet.
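For bug 1, the fix amounts to building the FEC query without the is_active_candidate filter. The body of fetch_fec_candidate() is not shown in this summary, so the sketch below is a stand-in: build_fec_candidate_params is a hypothetical helper, and the q, state, and office parameter names are my assumption about the OpenFEC candidate search endpoint and should be checked against the API docs.

```python
def build_fec_candidate_params(name: str, state_code: str, chamber: str, api_key: str) -> dict:
    """Build query params for an FEC candidate lookup (hypothetical helper).

    Deliberately omits is_active_candidate so that members who have not
    run recently are still returned.
    """
    # Assumption: the FEC uses "S" and "H" as office codes for Senate and House.
    office = "S" if chamber == "Senate" else "H"
    return {
        "api_key": api_key,
        "q": name,            # assumed name-search parameter
        "state": state_code,  # assumed to be a two-letter code, not the full state name
        "office": office,
    }
```

Note that members.json stores the full state name, so a name-to-code mapping would be needed before calling this.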
Pipelines still to build:
∙ SEC EDGAR Form 4 — proper implementation with name normalization layer first, then full-text search query. Check the existing EDGAR full-text search endpoint (efts.sec.gov) before building from scratch.
∙ House/Senate eFD financial disclosures — Senate and House file in different formats. House disclosures are PDFs, meaning scraping + PDF parsing are two separate problems. Open source projects exist that have already done some of this work.
∙ LegiScan — bill text for NLP similarity engine
∙ GovTrack — full voting records (stats are pulling, full vote-by-vote record is not)
∙ OpenSecrets — career finance totals
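The name normalization layer mentioned for EDGAR could start as something like the sketch below. It is a starting point under stated assumptions, not a complete solution: normalize_name, the SUFFIXES set, and the tiny NICKNAMES table are all hypothetical, and a real layer would need a much larger alias table plus manual overrides for members whose filings use an unrelated legal name.

```python
import re
import unicodedata

SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}
# Illustrative only; a real table would be far longer.
NICKNAMES = {"bill": "william", "bob": "robert", "jim": "james", "mike": "michael"}


def normalize_name(name: str) -> str:
    """Reduce a name to a lowercase 'first last' key for cross-source matching."""
    # Strip accents: decompose, then drop combining marks.
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.lower().replace(".", "")

    if "," in text:
        head, _, tail = text.partition(",")
        tail_tokens = tail.split()
        if tail_tokens and all(t in SUFFIXES for t in tail_tokens):
            text = head                # "john smith, jr" -> "john smith"
        else:
            text = f"{tail} {head}"    # "smith, john q" -> "john q smith"

    tokens = [t for t in re.findall(r"[a-z'-]+", text) if t not in SUFFIXES]

    # Drop single-letter middle initials, keeping the first and last tokens.
    if len(tokens) > 2:
        tokens = [tokens[0]] + [t for t in tokens[1:-1] if len(t) > 1] + [tokens[-1]]

    if tokens:
        tokens[0] = NICKNAMES.get(tokens[0], tokens[0])
    return " ".join(tokens)
```

Matching on the normalized key from both sides (Congress.gov name and filing name) makes a mismatch visible as a missing key rather than a silently wrong join.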
NLP Bill Similarity Engine (most important feature, not started):
Expert advice received: do NOT use keyword/word matching. Use sentence-transformers to convert each bill into an embedding (a vector representation of its meaning), then compare bills by cosine similarity and cluster the near-matches. This catches coordinated ghost-writing from lobbying orgs even when the language has been slightly modified between the versions given to different politicians. TF-IDF was the original plan; sentence-transformers is the better approach.
Goal: flag bill pairs whose cosine similarity exceeds a 0.80 (80%) threshold as potentially coordinated authorship.
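The comparison step can be sketched without any model dependency. The code below assumes the bill embeddings already exist (in the real pipeline they would come from a sentence-transformers model's encode call); flag_similar_bills and the toy vectors and bill IDs in the usage note are hypothetical. It does pairwise comparison only, not clustering.

```python
import math
from itertools import combinations


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def flag_similar_bills(embeddings: dict[str, list[float]], threshold: float = 0.80):
    """Return (bill_a, bill_b, similarity) for every pair above the threshold,
    most similar first."""
    flagged = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(embeddings.items(), 2):
        similarity = cosine_similarity(vec_a, vec_b)
        if similarity > threshold:
            flagged.append((id_a, id_b, similarity))
    return sorted(flagged, key=lambda pair: pair[2], reverse=True)
```

For example, with toy 3-dimensional vectors {"hr1": [1, 0, 0], "s2": [0.9, 0.1, 0], "hr3": [0, 1, 0]}, only the hr1/s2 pair is flagged. Pairwise comparison grows quadratically with the number of bills, so a full-session run would likely need a vectorized or approximate nearest-neighbor approach. Full bill texts are also usually longer than an embedding model's input limit, so chunking per section before embedding is probably needed.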
Frontend tabs per member profile:
Overview, Votes, Finance, Stocks, Travel, Patterns, Donors, Compare
All tabs are built in the frontend. Data fields for Finance, Stocks, and Donors are wired up and will populate automatically once the backend pipelines are complete.
Roadmap after pipelines:
Full native app development (iOS/Android), also free.