PCCA — Personal Content Curation Agent

Core problem:

People cannot read all the content from everyone they follow.
People usually care about specific interests; most followed content is irrelevant to those interests.
With follows spread across many platforms, people miss the few updates that are actually relevant.

PCCA solves this by:

Finding truly relevant content for the user's specific subjects across X, YouTube, LinkedIn, Reddit, Apple Podcasts, Spotify, Substack, Medium, and other sources.
Collecting posts, videos, transcripts, podcasts, and metadata into a local SQLite database.
Delivering key ideas as compact Briefs to the user's Telegram.

scenarios.md is the product source of truth. This README explains how to run the current implementation.

Quick Start (≈ 5 minutes)

You need: macOS or Windows, Python 3.10+, a Telegram bot token from @BotFather, and Google Chrome or another Chromium-family browser that is already logged in to the platforms you want to follow.

# 1. clone, isolate, install
git clone <this repo> pcca && cd pcca
python3 -m venv .venv && source .venv/bin/activate
# on Windows (PowerShell), activate with: .venv\Scripts\Activate.ps1
pip install -U pip setuptools wheel
pip install -e ".[dev]"
playwright install chromium

# 2. seed config
cp .env.example .env
# open .env, paste your Telegram bot token after PCCA_TELEGRAM_BOT_TOKEN=

# 3. launch the wizard
pcca run-desktop

The wizard handles everything else through four tabs:

Config — paste the Telegram bot token, set timezone and Brief time. Leaving the token field blank later preserves the saved token.
Use — start the local agent, send /start to your Telegram bot, then describe your first subject in free-form English. Thin one-liners become drafts; the wizard asks for more detail before saving.
Sources — choose a platform and click Get Sources. PCCA imports follows/subscriptions from your already logged-in normal browser session and asks for inline session repair only if needed.
Sources — prune the list if needed, click Monitor Pending Sources, then click Get Content to collect fresh items.
Use — click Get Brief next to a subject, or send /briefs in Telegram. Get Brief automatically rebuilds when new content or changed preferences require it.

That's it. Briefs arrive as separate Telegram messages with 👍 / 👎 / 🔖 / 🚫 / 📖 More buttons on each.

What you do day to day

In Telegram, with the bot:

You want to…	Do
Refresh all sources and all subjects	Tap Update Briefs or send `/update_briefs`
Get already-scored Briefs for one subject	Tap Get Briefs or send `/briefs`
React to a Brief	Tap 👍 / 👎 / 🔖 / 🚫 on the Brief message
Expand a Brief	Tap 📖 More
Give specific feedback	Reply to the Brief with text — "less hype like this", "no cursor content", etc.
Create another subject	Describe it in free form: "I want a separate stream for Ukrainian Sole Proprietor regulations."
Pause/rename/tune a subject	Tap Edit Subjects
Refine a subject	"Refine Vibe Coding: include release notes; exclude motivation"
List sources	"List sources for Vibe Coding"
See setup checklist	`/setup`

If a session expires (you logged out somewhere), PCCA marks the source as needs_reauth and the wizard surfaces it. As long as you stay logged in to the platform in your normal browser, PCCA auto-refreshes its cookies before each scrape. Use Get Sources again to trigger inline session repair when needed.
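
The PCCA_SESSION_REFRESH_COOLDOWN_SECONDS setting suggests that the cookie refresh before each scrape is rate-limited. A minimal sketch of such a gate follows; `should_refresh` is a hypothetical helper for illustration, not PCCA's actual function, and timestamps are assumed to be Unix seconds.

```python
import time
from typing import Optional

# Mirrors the documented default of PCCA_SESSION_REFRESH_COOLDOWN_SECONDS.
DEFAULT_COOLDOWN_SECONDS = 1800


def should_refresh(last_refresh: Optional[float],
                   now: Optional[float] = None,
                   cooldown: float = DEFAULT_COOLDOWN_SECONDS) -> bool:
    """Return True if no refresh has happened yet or the cooldown has elapsed."""
    if last_refresh is None:
        return True
    if now is None:
        now = time.time()
    return (now - last_refresh) >= cooldown
```

With the default of 1800 seconds, a platform scraped several times within half an hour would re-read browser cookies only once under this scheme.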

When something goes wrong

Bot stopped responding. Check .env for PCCA_TELEGRAM_BOT_TOKEN. If it's empty, paste the token back and restart the agent. When the token is missing, the log at .pcca/logs/pcca.log contains the message "Telegram service will be disabled".
No items collected. Use the wizard's Sources → Get Content action, then check the Debug → Logs tab. Sources flagged needs_reauth need session repair from the Sources tab.
Briefs feel stale after a preference change. Use /briefs; it rebuilds automatically when preferences have changed since the last delivered Brief.
A feature says a package is missing or silently falls back. Run pcca doctor; if anything is missing, run pip install -e ".[dev]" from the repo root and try again.
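
The first check above (is the bot token actually set in .env?) can be scripted. This is a rough sketch that assumes .env uses plain KEY=VALUE lines with optional trailing `# comments`, as in the configuration reference; `read_env_value` is a hypothetical helper and not part of the pcca CLI.

```python
from pathlib import Path


def read_env_value(env_path: Path, key: str) -> str:
    """Return the value for `key` from a simple KEY=VALUE file, or "" if absent."""
    if not env_path.exists():
        return ""
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if name.strip() == key:
            # Drop a trailing inline comment, then surrounding whitespace and quotes.
            value = value.split(" #", 1)[0].strip()
            return value.strip("'\"")
    return ""


if __name__ == "__main__":
    token = read_env_value(Path(".env"), "PCCA_TELEGRAM_BOT_TOKEN")
    print("token is set" if token else "token is missing: paste it into .env and restart the agent")
```

Run it from the repo root so that the relative `.env` path resolves.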

For deeper debugging, run pcca debug-bundle — it writes a redacted zip with logs and DB summaries (no raw cookies).

Reference

CLI commands

The wizard wraps these; you only need them for headless / debug use.

pcca run-desktop              # PyWebView wizard (default entry point)
pcca run-agent                # long-lived agent (Telegram bot + non-nightly scheduler)
pcca nightly-once             # scheduled one-shot collection (launchd entry point)
pcca run-nightly-once         # manual/debug one-shot collection
pcca install-launchd          # macOS: schedule nightly-once with wake support
pcca uninstall-launchd        # macOS: remove nightly launchd schedule
pcca run-briefs-once          # one-shot Brief delivery
pcca rebuild-briefs-once      # force-recompute today's Briefs
pcca capture-session --platform x [--browser auto|chrome|arc|brave|edge]
pcca import-follows --subject "Subject Name" --platform x [--limit 150]
pcca youtube-rebackfill-transcripts --clean-livechat-junk
pcca youtube-rebackfill-published-at # fill missing YouTube dates from RSS
pcca audit-content-quality [--clean] # find/flag JS dumps, link lists, marketing spam
pcca audit-sources --platform linkedin # source crawl health and empty-result streaks
pcca doctor                   # verify installed runtime dependencies
pcca debug-bundle             # redacted local support bundle

pcca capture-session and pcca import-follows accept any of these platforms: x, linkedin, youtube, spotify, substack, medium, apple_podcasts.

Configuration (`.env`)

# Telegram bot — required
PCCA_TELEGRAM_BOT_TOKEN=          # from @BotFather

# Scheduling
PCCA_TIMEZONE=UTC
PCCA_NIGHTLY_CRON=0 1 * * *       # nightly content collection
PCCA_MORNING_CRON=30 8 * * *      # only used when DIGEST_AUTO_SEND=true
PCCA_IN_PROCESS_NIGHTLY=false     # default off; use launchd on macOS instead
PCCA_DIGEST_AUTO_SEND=false       # default off — Briefs are on-demand via /briefs
PCCA_MIN_BRIEF_RELEVANCE=0.55     # send no-Briefs notice below this top score

# Browser
PCCA_BROWSER_CHANNEL=chrome       # or 'bundled' for Playwright Chromium
PCCA_BROWSER_HEADFUL_PLATFORMS=x,linkedin

# Session refresh (auto re-read cookies before scrape)
PCCA_SESSION_REFRESH_ENABLED=true
PCCA_SESSION_REFRESH_COOLDOWN_SECONDS=1800
PCCA_SESSION_REFRESH_BROWSER=     # chrome|arc|brave|edge; empty = auto

# Pass-2 summaries / Brief quality
PCCA_GEMINI_API_KEY=              # from https://aistudio.google.com/apikey
PCCA_LLM_PROVIDER=                # empty = gemini if key exists, else ollama
PCCA_LLM_MODEL=                   # empty = gemini-2.5-flash or llama3.1:8b

# Ollama local fallback
PCCA_OLLAMA_ENABLED=false
PCCA_OLLAMA_MODEL=llama3.1:8b
PCCA_OLLAMA_BASE_URL=http://localhost:11434

# Logging
PCCA_LOG_LEVEL=INFO               # DEBUG for verbose
PCCA_LOG_FILE=                    # default .pcca/logs/pcca.log; "off" to disable
PCCA_STRICT_DEPS=false            # true = fail startup if pyproject deps are missing
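
The comments on PCCA_LLM_PROVIDER and PCCA_LLM_MODEL describe a small defaulting rule. The sketch below restates that rule in Python for clarity; `resolve_llm` is a hypothetical name, and any interaction with PCCA_OLLAMA_ENABLED or PCCA_OLLAMA_MODEL is deliberately not modelled because the comments above do not specify it.

```python
from typing import Mapping, Tuple

# Defaults taken from the PCCA_LLM_MODEL comment in the config reference.
DEFAULT_MODELS = {"gemini": "gemini-2.5-flash", "ollama": "llama3.1:8b"}


def resolve_llm(env: Mapping[str, str]) -> Tuple[str, str]:
    """Return (provider, model) following the documented empty-value rules."""
    provider = env.get("PCCA_LLM_PROVIDER", "").strip().lower()
    if not provider:
        # Empty provider: gemini if an API key exists, else ollama.
        has_key = bool(env.get("PCCA_GEMINI_API_KEY", "").strip())
        provider = "gemini" if has_key else "ollama"
    model = env.get("PCCA_LLM_MODEL", "").strip()
    if not model:
        # Empty model: fall back to the provider's documented default.
        model = DEFAULT_MODELS.get(provider, "")
    return provider, model
```

For example, `resolve_llm({"PCCA_GEMINI_API_KEY": "k"})` yields the Gemini provider with its default model, while an empty mapping yields the Ollama pair.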

Scheduling on macOS

The desktop app uses an in-process scheduler for lightweight/non-nightly jobs, but Python cannot wake a sleeping laptop. For reliable overnight collection on macOS, install the launchd schedule:

pcca install-launchd
launchctl list | grep com.pcca.nightly

This writes ~/Library/LaunchAgents/com.pcca.nightly.plist and schedules pcca nightly-once using your PCCA_NIGHTLY_CRON hour/minute. The plist sets Wake=true, so macOS may wake the machine for the job. In practice, keep the laptop on AC power; closed-lid or battery-only standby can still skip wake events depending on macOS power settings.
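
To make the cron-to-launchd mapping concrete, here is a sketch of how an hour and minute could be pulled out of PCCA_NIGHTLY_CRON and turned into a launchd StartCalendarInterval entry. It is an illustration under assumptions, not the code behind pcca install-launchd: the function names are hypothetical, only literal numeric hour/minute fields are handled, and the wake behaviour described above is not reproduced.

```python
import plistlib


def cron_hour_minute(expr: str) -> tuple[int, int]:
    """Extract (hour, minute) from a five-field cron expression with literal values."""
    fields = expr.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 cron fields, got {len(fields)}")
    minute, hour = int(fields[0]), int(fields[1])  # cron order is minute, then hour
    if not (0 <= minute <= 59 and 0 <= hour <= 23):
        raise ValueError("hour or minute out of range")
    return hour, minute


def nightly_plist(expr: str, program_args: list[str]) -> bytes:
    """Build a minimal launchd job definition as XML plist bytes."""
    hour, minute = cron_hour_minute(expr)
    job = {
        "Label": "com.pcca.nightly",
        "ProgramArguments": program_args,
        "StartCalendarInterval": {"Hour": hour, "Minute": minute},
    }
    return plistlib.dumps(job)
```

With the default `0 1 * * *`, this produces a job that fires at 01:00 local time. The plist that PCCA actually writes may contain additional keys.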

Dedicated launchd run logs are written under .pcca/logs/nightly-YYYY-MM-DD.log.

For best Brief quality, set PCCA_GEMINI_API_KEY from Google AI Studio. When that key is present and PCCA_LLM_PROVIDER is empty, PCCA uses gemini-2.5-flash for Pass-2 summaries and falls back to Ollama if Gemini is unavailable.

To remove the schedule:

pcca uninstall-launchd

Where things live

.pcca/pcca.db                       SQLite database (subjects, items, scores, …)
.pcca/logs/pcca.log                 rotating app log
.pcca/browser_profiles/<platform>/  Playwright session profile per platform
.pcca/debug/browser/                screenshots + JSON breadcrumbs from failed scrapes
.pcca/debug/pcca-debug-*.zip        redacted support bundles from `pcca debug-bundle`
.env                                runtime configuration (NOT committed)
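
Because the database is an ordinary SQLite file, it can be inspected with Python's standard library. The sketch below lists table names without assuming anything about PCCA's schema; it opens the file read-only so a running agent is not disturbed. `list_tables` is a hypothetical helper name.

```python
import sqlite3
from contextlib import closing
from pathlib import Path


def list_tables(db_path: Path) -> list[str]:
    """Return user table names from a SQLite file, opened in read-only mode."""
    uri = f"{db_path.resolve().as_uri()}?mode=ro"
    with closing(sqlite3.connect(uri, uri=True)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
            "ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]
```

Calling `list_tables(Path(".pcca/pcca.db"))` from the repo root prints whatever tables the current schema defines. Read-only mode raises an error if the file does not exist instead of creating an empty database.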

Session capture details

PCCA does not drive logins for X / LinkedIn / Google / etc. Instead it reads your real browser's session cookies and injects them into its own Playwright profile. Cookie lifetimes vary:

Platform	Lifetime
X (`auth_token`)	~30 days, sliding while you keep using X
LinkedIn (`li_at`)	~1 year
Spotify / Substack / Medium	long-lived (months+)
YouTube / Google (SID family)	rotates aggressively; auto-refresh handles it
Apple Podcasts	best-effort (varies by region/account state)
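
The injection step can be pictured with Playwright's cookie format. Playwright's Python API accepts cookies through `BrowserContext.add_cookies`, which takes a list of dicts with keys such as `name`, `value`, `domain`, `path`, `expires`, `httpOnly`, and `secure`. The helper below is a hypothetical sketch of shaping one captured cookie into that format; how PCCA reads cookies from the browser, and the exact fields it sets, are not shown here and are assumptions.

```python
from typing import Any, Optional


def to_playwright_cookie(name: str, value: str, domain: str, path: str = "/",
                         expires: Optional[float] = None,
                         secure: bool = True, http_only: bool = True) -> dict[str, Any]:
    """Shape one cookie into the dict format expected by BrowserContext.add_cookies."""
    cookie: dict[str, Any] = {
        "name": name,
        "value": value,
        "domain": domain,
        "path": path,
        "secure": secure,
        "httpOnly": http_only,
    }
    if expires is not None:
        # Playwright expects a Unix timestamp in seconds.
        cookie["expires"] = float(expires)
    return cookie


# Usage with an existing Playwright context (not executed here):
#   context.add_cookies([to_playwright_cookie("auth_token", token, ".x.com")])
# The ".x.com" domain is an assumption for illustration.
```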

Supported on macOS today: Chrome, Arc, Brave, Microsoft Edge. Safari and Firefox are tracked in tasks.md (T-38). Windows Chromium is tracked in T-37D.

Failed browser scrapes save a screenshot + JSON metadata under .pcca/debug/browser/. Treat these as private debug artifacts — they may contain logged-in page content. pcca debug-bundle redacts them on export.

Multilingual scoring

Heuristic scoring is Cyrillic-aware (English / Ukrainian / Russian). Gemini Pass-2 summaries also handle Ukrainian/Russian well in current testing. If you want a fully local fallback, set PCCA_OLLAMA_ENABLED=true and pull a multilingual model:

ollama pull llama3.1:8b
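
As a rough picture of what "Cyrillic-aware" heuristic scoring can mean, the sketch below tokenizes on Unicode letters and compares case-folded words, so Ukrainian and Russian keywords match the same way English ones do. It is an illustrative stand-in, not PCCA's scoring code; `tokenize` and `keyword_score` are hypothetical names, and there is no stemming, so inflected forms will not match.

```python
import re

# Runs of Unicode letters (Latin and Cyrillic alike); digits and underscores excluded.
_WORD = re.compile(r"[^\W\d_]+")


def tokenize(text: str) -> set[str]:
    """Split text into a set of case-folded words."""
    return set(_WORD.findall(text.casefold()))


def keyword_score(text: str, keywords: list[str]) -> float:
    """Return the fraction of keywords that appear as whole words in the text."""
    if not keywords:
        return 0.0
    tokens = tokenize(text)
    hits = sum(1 for kw in keywords if kw.casefold() in tokens)
    return hits / len(keywords)
```

Using `casefold()` instead of an ASCII-only lowercase is what lets "ФОП" in a post match the keyword "фоп".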

Linux

The PyWebView desktop wizard is intentionally not yet shipped on Linux (tracked in T-35). On Linux, drive PCCA through the CLI commands above — they work identically.

Status / what's not yet done

Phase-1 foundation is in place: collectors for nine platforms, session capture and auto-refresh, conversational subject creation, per-Brief Telegram delivery, PyWebView wizard. Known gaps and follow-up work live in tasks.md. Notable: a pluggable Learning Strategy that reads button reactions and refinement replies (T-17), full rich-rule preference extraction including author-level conditionals (T-59), and Telegram as a source platform (T-60–T-65) are open.
