
Total project overview! Status report alpha1 #5

@OpenSourcePatents

Here’s the full summary for a new chat:

Project: CongressWatch
Live site: congresswatch.vercel.app
Repo: github.com/OpenSourcePatents/Congresswatch
Stack: Plain HTML/JS frontend, Python data pipelines, GitHub Actions, Vercel hosting, JSON flat file database (upgrading to Supabase eventually)

What it does:
Pulls public government data on all 538 members of Congress currently in the dataset and generates an anomaly score (0-100) for each one, based on campaign finance, stock trades, voting record, missed votes, PAC money ratio, and bill authorship patterns. Every number links to its original government source.

Repo structure:
∙ .github/workflows/fetch-data.yml — daily workflow, runs Python inline (NOT fetch.py), pulls all members from Congress.gov API, saves to data/members.json
∙ .github/workflows/fetch-finance.yml — daily workflow, runs fetch_finance.py from repo root
∙ fetch_finance.py — enriches members.json with FEC, GovTrack, EDGAR, computes anomaly scores
∙ fetch.py — exists in repo root but is NOT called by any workflow, ignore it
∙ data/members.json — main data file, single source of truth for frontend
∙ data/stats.json — summary stats
∙ index.html — entire frontend, single file

GitHub Secrets set:
∙ CONGRESS_API_KEY — Congress.gov API
∙ FEC_API_KEY — FEC open API

members.json schema:

{
"id": "bioguideId",
"name": "First Last",
"party": "Democratic|Republican|Independent|Unknown",
"state": "StateName",
"district": "number string or empty",
"chamber": "Senate|House",
"photo_url": "https://bioguide.congress.gov/bioguide/photo/X/XXXXXXX.jpg",
"term_start": "YYYY-01-01",
"score": 0,
"flags": [],
"updated": "ISO timestamp",
"total_raised": 0,
"total_raised_display": "$0M",
"pac_contributions": 0,
"individual_contributions": 0,
"cash_on_hand": 0,
"top_donors_list": [],
"top_donors": "",
"edgar_trade_count": 0,
"total_trades": 0,
"missed_votes_pct": 0,
"votes_with_party_pct": 0,
"govtrack_id": "",
"ideology_score": null,
"leadership_score": null,
"fec_candidate_id": "",
"fec_committee_id": ""
}
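
Since members.json is the single source of truth for the frontend, a small sanity check on each record may be worth running before the file is committed. The following is a minimal sketch, not part of the current repo: the field names and allowed values come from the schema above, while `validate_member`, `EXPECTED_TYPES`, and the choice of which fields to check are hypothetical.

```python
from typing import Any

# Field names and types are taken from the schema above.
# Which fields to enforce is an assumption made for this sketch.
EXPECTED_TYPES: dict[str, tuple[type, ...]] = {
    "id": (str,),
    "name": (str,),
    "party": (str,),
    "state": (str,),
    "district": (str,),
    "chamber": (str,),
    "score": (int, float),
    "flags": (list,),
    "total_raised": (int, float),
    "pac_contributions": (int, float),
    "missed_votes_pct": (int, float),
}
VALID_CHAMBERS = {"Senate", "House"}


def validate_member(record: dict[str, Any]) -> list[str]:
    """Return a list of human-readable problems; an empty list means the record passed."""
    problems = []
    for field, types in EXPECTED_TYPES.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], types):
            problems.append(f"wrong type for {field}: {type(record[field]).__name__}")

    # Value checks only run when the type check above already passed.
    chamber = record.get("chamber")
    if isinstance(chamber, str) and chamber not in VALID_CHAMBERS:
        problems.append(f"unexpected chamber: {chamber}")
    score = record.get("score")
    if isinstance(score, (int, float)) and not 0 <= score <= 100:
        problems.append(f"score out of range: {score}")
    return problems
```

A workflow step could load data/members.json, call `validate_member` on each entry, and fail the run if any list comes back non-empty.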

Current chamber counts: 106 Senate, 432 House (the Senate has 100 seats, so the Senate figure is a slight overcount; acceptable for now)

What’s working:
∙ Congress.gov member fetch. Chamber inference is fixed: a member with a district is House, a member with no district who is not from a territory is Senate, and the terms loop has been removed
∙ GovTrack voting stats pulling correctly (no key needed)
∙ FEC pipeline running but with a known bug
∙ Frontend fully built, reads all fields from members.json correctly
∙ Anomaly scoring engine running
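
The chamber inference rule in the first bullet can be sketched as below. This is an illustration of the stated rule, not the code in fetch-data.yml: `infer_chamber` and the `TERRITORIES` set are hypothetical, and the exact state names Congress.gov returns for territories are an assumption. Treating a district of 0 as a real (at-large) district is also an assumption.

```python
# Assumed spellings; the names actually returned by the API may differ.
TERRITORIES = {
    "District of Columbia",
    "Puerto Rico",
    "Guam",
    "American Samoa",
    "Virgin Islands",
    "Northern Mariana Islands",
}


def infer_chamber(state: str, district) -> str:
    """Apply the rule: district => House; no district and not a territory => Senate."""
    # Compare against None and "" explicitly so that a numeric 0 still counts as a district.
    has_district = district is not None and str(district).strip() != ""
    if has_district:
        return "House"
    if state in TERRITORIES:
        # Territories have no senators, so a district-less territory member is a House delegate.
        return "House"
    return "Senate"
```

If House members from single-district states arrive with an empty district, this rule would label them Senate, which would be one possible explanation for a Senate count above 100. That is a guess to verify against the data, not a confirmed cause.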

Known bugs to fix:
1. FEC not populating for many members — is_active_candidate: True in fetch_fec_candidate() filters out members who haven’t run recently. Fix: remove that param.
2. EDGAR trade search broken — URL is hardcoded garbage, not actually searching by member name. Needs a real implementation.
3. EDGAR name matching problem (expert advice) — Congress members file under legal name variations that don’t match Congress.gov or FEC profiles. Need to build a name normalization layer first, then use that to query EDGAR full-text search. Skipping this means silently wrong results for many members.
4. Net worth and salary charts are fake — estimated from score, not real data. No real net worth source yet.
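
For bug 1, the fix amounts to building the FEC candidate search request without the `is_active_candidate` filter. A minimal sketch follows, assuming the lookup goes through the OpenFEC `/v1/candidates/search/` endpoint with its `q`, `state`, `office`, and `per_page` parameters. The real signature of `fetch_fec_candidate()` is not shown in this issue, so `build_fec_candidate_params` and its arguments are hypothetical.

```python
FEC_CANDIDATE_SEARCH_URL = "https://api.open.fec.gov/v1/candidates/search/"


def build_fec_candidate_params(name: str, state_abbr: str, chamber: str, api_key: str) -> dict:
    """Build query params for an FEC candidate lookup.

    Deliberately omits is_active_candidate so that members who have not
    run recently are still returned.
    """
    office = "S" if chamber == "Senate" else "H"
    return {
        "api_key": api_key,
        "q": name,            # free-text candidate name search
        "state": state_abbr,  # two-letter postal code, not the full StateName in members.json
        "office": office,
        "per_page": 5,
    }
```

The resulting dict would be passed as the query string of a GET request to `FEC_CANDIDATE_SEARCH_URL`. Note that members.json stores the full state name, so a name-to-abbreviation mapping would be needed before calling this.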

Pipelines still to build:
∙ SEC EDGAR Form 4 — proper implementation with name normalization layer first, then full-text search query. Check efts.house.gov before building from scratch.
∙ House/Senate eFD financial disclosures — Senate and House file in different formats. House disclosures are PDFs, meaning scraping + PDF parsing are two separate problems. Open source projects exist that have already done some of this work.
∙ LegiScan — bill text for NLP similarity engine
∙ GovTrack — full voting records (stats are pulling, full vote-by-vote record is not)
∙ OpenSecrets — career finance totals

NLP Bill Similarity Engine (most important feature, not started):
Expert advice received: do NOT use keyword or word matching. Use sentence-transformers to convert each bill into an embedding (a vector representation of its meaning), then compare and cluster bills by cosine similarity. This catches coordinated ghost-writing from lobbying orgs even when the language has been slightly modified between the versions given to different politicians. TF-IDF was the original plan; sentence-transformers is the better approach.
Goal: flag bills whose similarity to another bill exceeds an 80% threshold as potentially coordinated authorship.
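
The comparison step can be sketched in plain Python once embeddings exist. The sketch below does pairwise cosine similarity with a cutoff rather than full clustering, and it reads the 80% goal as a cosine similarity above 0.80, which is an assumption. `flag_similar_bills` and the bill IDs are hypothetical. The embeddings are passed in as plain lists so the example runs without downloading a model.

```python
import math
from itertools import combinations


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def flag_similar_bills(
    embeddings: dict[str, list[float]], threshold: float = 0.80
) -> list[tuple[str, str, float]]:
    """Return (bill_a, bill_b, similarity) for every pair above the threshold, highest first."""
    flagged = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(embeddings.items(), 2):
        similarity = cosine_similarity(vec_a, vec_b)
        if similarity > threshold:
            flagged.append((id_a, id_b, similarity))
    return sorted(flagged, key=lambda item: item[2], reverse=True)
```

In the real pipeline the vectors would come from something like `SentenceTransformer("all-MiniLM-L6-v2").encode(texts)`, where the model name is only an example. Full bill texts are usually longer than what these models accept in one pass, so bills would likely need to be split into sections and embedded per section. Pairwise comparison also grows quadratically with the number of bills, so a vector index may be needed at scale.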

Frontend tabs per member profile:
Overview, Votes, Finance, Stocks, Travel, Patterns, Donors, Compare
All tabs are built in the frontend. Data fields for Finance, Stocks, and Donors are wired up and will populate automatically once the backend pipelines are complete.

Roadmap after pipelines:
Full native app development (iOS/Android), also free.
