Merged
7 changes: 6 additions & 1 deletion apps/scraper/README.md
@@ -25,6 +25,9 @@ Then fill in the values. The ones you need for scraping:
| Variable | Required | Where to get it |
|---|---|---|
| `POSTGRES_URL` | ✅ | Your Supabase project settings |
| `GOOGLE_VERTEX_PROJECT` | ✅ | Your Google Cloud project id for Vertex AI |
| `GOOGLE_VERTEX_LOCATION` | ✅ | Your Vertex AI region, e.g. `us-central1` |
| `GOOGLE_VERTEX_API_KEY` | Optional | Vertex AI Express Mode API key |
| `OPENAI_API_KEY` | ✅ | [platform.openai.com](https://platform.openai.com) |
| `CONGRESS_API_KEY` | ✅ | Free at [api.congress.gov/sign-up](https://api.congress.gov/sign-up/) |
| `COURTLISTENER_API_KEY` | Optional | Free at [courtlistener.com](https://www.courtlistener.com/sign-in/) — only needed for higher rate limits |
@@ -82,4 +85,6 @@ All scrapers call into `src/utils/db/operations.ts`. Each time a bill or case is

- If it's **new** → saves it and generates an AI article + thumbnail
- If the **content changed** → regenerates the article
- If **nothing changed** → skips AI generation entirely (saves API costs)
- If **nothing changed** → backfills any missing AI summary/article/thumbnail fields, otherwise skips AI generation

Set `SCRAPER_FORCE_AI_REGEN=1` to force a full AI refresh even when the record already has AI content.
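
The three README rules and the `SCRAPER_FORCE_AI_REGEN` override can be read as a small decision function. The sketch below is illustrative only: `ExistingRecord`, its field names, the `contentHash` comparison, and `decideAiAction` are all assumptions, since the actual logic in `src/utils/db/operations.ts` is not part of this diff.

```typescript
// Hypothetical shape of a stored record; the real columns are not shown in this PR.
interface ExistingRecord {
  contentHash: string;
  aiSummary: string | null;
  aiArticle: string | null;
  thumbnailUrl: string | null;
}

type AiAction = "generate" | "regenerate" | "backfill" | "skip";

function decideAiAction(
  existing: ExistingRecord | null,
  incomingHash: string,
  forceRegen: boolean,
): AiAction {
  // New record: save it and generate all AI content.
  if (existing === null) return "generate";
  // SCRAPER_FORCE_AI_REGEN=1 forces a refresh even when AI content exists.
  if (forceRegen) return "regenerate";
  // Content changed: regenerate the article.
  if (existing.contentHash !== incomingHash) return "regenerate";
  // Nothing changed: only fill in fields that are still missing.
  const hasGaps =
    !existing.aiSummary || !existing.aiArticle || !existing.thumbnailUrl;
  return hasGaps ? "backfill" : "skip";
}
```

Under these assumptions, an unchanged record costs an AI call only when one of its AI fields is empty.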
2 changes: 1 addition & 1 deletion apps/scraper/package.json
@@ -6,11 +6,11 @@
"dependencies": {
"@acme/db": "workspace:*",
"@ai-sdk/google": "^3.0.53",
"@ai-sdk/google-vertex": "^4.0.105",
"ai": "^6.0.141",
"cheerio": "^1.2.0",
"consola": "^3.4.2",
"dotenv": "^17.3.1",
"openai": "^6.33.0",
"p-limit": "^7.3.0",
"sharp": "^0.34.5",
"turndown": "^7.2.2",
3 changes: 3 additions & 0 deletions apps/scraper/run.ts
@@ -28,6 +28,9 @@ printHeader("Environment");
printKeyValue("POSTGRES_URL", check(process.env.POSTGRES_URL));
printKeyValue("PEXELS_API_KEY", check(process.env.PEXELS_API_KEY));
printKeyValue("OPENAI_API_KEY", check(process.env.OPENAI_API_KEY));
printKeyValue("GOOGLE_VERTEX_PROJECT", check(process.env.GOOGLE_VERTEX_PROJECT));
printKeyValue("GOOGLE_VERTEX_LOCATION", check(process.env.GOOGLE_VERTEX_LOCATION));
printKeyValue("GOOGLE_VERTEX_API_KEY", check(process.env.GOOGLE_VERTEX_API_KEY));
printFooter();

// Now import and run main
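
`check` is defined outside this hunk, so its behavior is not visible here. A minimal sketch of such a helper, assuming it only reports whether a variable is present and never prints the secret itself; `checkEnv`, `summarizeEnv`, and the `"set"`/`"missing"` labels are hypothetical names, not taken from `run.ts`.

```typescript
type Env = Record<string, string | undefined>;
type EnvStatus = "set" | "missing";

// Report presence only, so API keys never end up in logs.
function checkEnv(value: string | undefined): EnvStatus {
  return value !== undefined && value.trim() !== "" ? "set" : "missing";
}

// Build the name/status pairs that a printKeyValue-style printer could display.
function summarizeEnv(env: Env, names: readonly string[]): Array<[string, EnvStatus]> {
  return names.map((name): [string, EnvStatus] => [name, checkEnv(env[name])]);
}

// Usage (in Node): summarizeEnv(process.env, ["GOOGLE_VERTEX_PROJECT", "GOOGLE_VERTEX_LOCATION"])
```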
2 changes: 1 addition & 1 deletion apps/scraper/src/scrapers/federalregister.ts
@@ -100,7 +100,7 @@ async function scrape() {
title: doc.title,
type: contentType,
publishedDate,
description: doc.abstract ?? undefined,
description: fullText ? undefined : (doc.abstract ?? undefined),
fullText,
url: doc.html_url,
source: "federalregister.gov",
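
The changed `description` line drops the abstract whenever `fullText` is available. The rule can be isolated as below; `pickDescription` is a hypothetical helper and the parameter types are assumptions, since the declarations of `fullText` and `doc.abstract` are outside this hunk.

```typescript
// Mirrors: fullText ? undefined : (doc.abstract ?? undefined)
function pickDescription(
  fullText: string | undefined,
  abstract: string | null | undefined,
): string | undefined {
  // An empty string is falsy, so an empty fullText still falls back to the abstract.
  return fullText ? undefined : (abstract ?? undefined);
}
```

Note that the check is on truthiness, not on `undefined`: a document whose full text was fetched but came back empty keeps its abstract as the description.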
63 changes: 27 additions & 36 deletions apps/scraper/src/utils/ai/image-generation.ts
@@ -1,17 +1,16 @@
/**
* AI image generation using OpenAI DALL-E
* AI image generation using Google Vertex AI Imagen 3
* Generates images from text prompts and converts them to JPEG format
*/

import OpenAI from 'openai';
import { generateImage as aiGenerateImage } from 'ai';
import { vertexProvider } from './provider.js';
import { createLogger } from '../log.js';
import { trackDalle3Image } from '../costs.js';
import { trackImagenImage } from '../costs.js';
import { AIRateLimitError, setRateLimitHit } from './text-generation.js';

const logger = createLogger("image");

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

export interface GeneratedImage {
data: Buffer;
mimeType: string;
@@ -27,7 +26,7 @@ async function sleep(ms: number): Promise<void> {
}

/**
* Generate an image using DALL-E 3 with retry logic for rate limits
* Generate an image using Vertex AI Imagen 3 with retry logic for rate limits
* @param prompt - Text description of desired image
* @param maxRetries - Maximum number of retry attempts (default: 3)
* @returns Generated image as Buffer with metadata, or null if generation fails
@@ -43,56 +42,48 @@ export async function generateImage(
if (attempt > 0) {
logger.warn(`Retry attempt ${attempt}/${maxRetries} for image generation`);
} else {
logger.start(`Generating image with DALL-E 3: ${prompt.substring(0, 50)}...`);
logger.start(`Generating image with Imagen 3: ${prompt.substring(0, 50)}...`);
}

// DALL-E 3 for quality
const response = await openai.images.generate({
model: 'dall-e-3',
prompt: `Professional news photography: ${prompt}. Photorealistic, high quality, journalistic style.`,
size: '1024x1024',
quality: 'standard',
response_format: 'url',
const result = await aiGenerateImage({
model: vertexProvider.image('imagen-3.0-generate-001'),
prompt: `Premium editorial photography: ${prompt}. Cinematic lighting, vibrant color palette, masterpiece composition, 8k resolution, highly detailed, expressive and dynamic.`,
aspectRatio: '1:1',
providerOptions: {
vertex: { sampleCount: 1 },
},
});

if (!response.data?.[0]?.url) {
logger.error('No image URL returned from DALL-E');
return null;
}

const imageUrl = response.data[0].url;

// Download image to buffer (URLs expire after 1 hour, need to store permanently)
const imageResponse = await fetch(imageUrl);
if (!imageResponse.ok) {
logger.error(`Failed to download image: ${imageResponse.status}`);
return null;
}

const buffer = Buffer.from(await imageResponse.arrayBuffer());
// Imagen returns base64-encoded bytes directly — no URL download needed
const buffer = Buffer.from(result.image.base64, 'base64');

trackDalle3Image();
trackImagenImage();
logger.success(`Image generated: ${buffer.length} bytes`);

return {
data: buffer,
mimeType: 'image/png', // DALL-E returns PNG
mimeType: (result.image as any).mimeType ?? 'image/png',
width: 1024,
height: 1024,
};
} catch (error) {
lastError = error instanceof Error ? error : new Error(String(error));

// Check if error is due to content policy violation (don't retry)
if (lastError.message.includes('content_policy_violation')) {
logger.warn(`Image generation blocked by content filter for prompt: ${prompt.substring(0, 100)}...`);
// Imagen safety filter block (don't retry)
if (
lastError.message.includes('SAFETY') ||
lastError.message.includes('blocked') ||
lastError.message.includes('content_filter')
) {
logger.warn(`Image generation blocked by safety filter for prompt: ${prompt.substring(0, 100)}...`);
return null;
}

// Check for rate limit errors (429 or rate_limit_exceeded)
// Check for rate limit errors (429 or RESOURCE_EXHAUSTED)
const isRateLimitError =
lastError.message.includes('rate_limit_exceeded') ||
lastError.message.includes('RESOURCE_EXHAUSTED') ||
lastError.message.includes('429') ||
lastError.message.includes('rate_limit_exceeded') ||
lastError.message.includes('Rate limit');

if (isRateLimitError && attempt < maxRetries) {
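
The catch block above sorts errors by substring: safety blocks return `null` without retrying, while rate limits are retried. A compact sketch of that classification, using the same marker strings as the diff; `classifyImageError` and `backoffMs` are hypothetical names, and the exponential delay is an assumption because the real wait logic sits below the visible hunk.

```typescript
type ImageErrorKind = "safety_block" | "rate_limit" | "other";

const SAFETY_MARKERS = ["SAFETY", "blocked", "content_filter"];
const RATE_LIMIT_MARKERS = ["RESOURCE_EXHAUSTED", "429", "rate_limit_exceeded", "Rate limit"];

// Safety markers are checked first, matching the order of the checks in the diff.
function classifyImageError(message: string): ImageErrorKind {
  if (SAFETY_MARKERS.some((marker) => message.includes(marker))) return "safety_block";
  if (RATE_LIMIT_MARKERS.some((marker) => message.includes(marker))) return "rate_limit";
  return "other";
}

// Assumed exponential backoff: baseMs, 2x baseMs, 4x baseMs, ...
function backoffMs(attempt: number, baseMs: number = 1000): number {
  return baseMs * Math.pow(2, attempt);
}
```

Substring matching is broad: any message that happens to contain `429` or `blocked` is classified accordingly, so a status-code check (as `isRateLimitError` in `marketing-generation.ts` does with `APICallError.statusCode`) may be more precise where the error type allows it.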
30 changes: 16 additions & 14 deletions apps/scraper/src/utils/ai/image-keywords.ts
@@ -1,14 +1,14 @@
/**
* AI-powered image keyword generation
* Uses OpenAI to extract visual concepts for image search
* Uses Google Vertex AI to extract visual concepts for image search
*/

import { google } from '@ai-sdk/google';
import { generateText, APICallError, RetryError } from 'ai';

import { AIRateLimitError, rateLimitHit, setRateLimitHit } from './text-generation.js';
import { createLogger } from '../log.js';
import { trackGeminiUsage } from '../costs.js';
import { vertexProvider } from './provider.js';

const logger = createLogger("ai");

@@ -44,20 +44,22 @@ export async function generateImageSearchKeywords(
}
try {
const { text, usage } = await generateText({
model: google('gemini-2.5-flash'),
prompt: `Given this ${type} title and content, generate 2-4 search keywords for finding relevant stock photos. Focus on concrete, visual, photographic concepts that would actually appear in news photography or documentary images.
model: vertexProvider('gemini-2.5-flash'),
prompt: `Given this ${type} title and content, generate 2-4 search keywords for finding visually striking, high-end editorial stock photos. Focus on dramatic, cinematic, and photographic concepts that feel professional and modern.

GOOD examples (specific, visual, photographic):
- capitol building washington dc
- hospital doctor medical equipment
- construction workers infrastructure
- classroom students education
- solar panels renewable energy
GOOD examples (specific, dynamic, visual):
- dramatic capitol building sunset
- surgical team intense motion
- worker silhouette infrastructure
- vibrant classroom activity
- cinematic solar farm aerial

BAD examples (too abstract, no clear visual):
- government policy legislation
- economic impact financial
- social justice equality
BAD examples (generic, static):
- capitol building
- doctor
- construction site
- students
- solar panels

Title: ${title}

14 changes: 7 additions & 7 deletions apps/scraper/src/utils/ai/marketing-generation.ts
@@ -1,14 +1,14 @@
/**
* AI marketing content generation using OpenAI
* AI marketing content generation using Google Vertex AI
* Generates compelling social media titles, descriptions, and image prompts
*/

import { google } from "@ai-sdk/google";
import { generateObject, APICallError, RetryError } from "ai";
import { z } from "zod";
import { createLogger } from "../log.js";
import { trackGeminiUsage } from "../costs.js";
import { AIRateLimitError, rateLimitHit, setRateLimitHit } from "./text-generation.js";
import { vertexProvider } from "./provider.js";

function isRateLimitError(error: unknown): boolean {
if (error instanceof APICallError) return error.statusCode === 429;
@@ -21,7 +21,7 @@ function isRateLimitError(error: unknown): boolean {
const logger = createLogger("ai");

const MarketingCopySchema = z.object({
title: z.string().max(100),
title: z.string().max(25), // Must match Video.title varchar(25) DB constraint
description: z.string(),
imagePrompt: z.string(),
});
@@ -47,16 +47,16 @@ export async function generateMarketingCopy(
logger.start(`Generating marketing copy for: ${articleTitle}`);

const { object, usage } = await generateObject({
model: google("gemini-2.5-flash"),
model: vertexProvider("gemini-2.5-flash"),
schema: MarketingCopySchema,
prompt: `You are a professional marketing copywriter creating engaging social media content.

Create compelling marketing copy for this ${contentType} to be displayed in a social media feed.

Requirements:
1. "title": Compelling, attention-grabbing title (MUST be 25 characters or less)
2. "description": Engaging 50-word description that makes people want to learn more. Write in an accessible, conversational tone.
3. "imagePrompt": Detailed prompt for AI image generation (describe a visually striking, photorealistic image that captures the essence of this content)
2. "description": A very short (max 25 words) summary for a mobile feed. Write in simple, plain English (8th-grade level). Focus on the "so what?"—why should a regular person care? No jargon.
3. "imagePrompt": A creative, high-energy, and visually arresting scene description that captures the *essence* of the story. Instead of literal office buildings or meetings, focus on dramatic metaphors, intense human emotion, or dynamic action. Use vivid color descriptions and interesting perspectives (e.g., extreme close-ups, wide cinematic shots, or dramatic low angles). Avoid text, icons, or stereotypical stock photo tropes.

Article Title: ${articleTitle}
Content Preview: ${articleContent.substring(0, 1000)}`,
@@ -75,7 +75,7 @@ Content Preview: ${articleContent.substring(0, 1000)}`,
return {
title: articleTitle.substring(0, 25),
description: articleContent.substring(0, 200) + "...",
imagePrompt: `professional news photography about ${articleTitle}`,
imagePrompt: `A dynamic, cinematic editorial photo about ${articleTitle}. Dramatic lighting, vivid colors.`,
};
}
}
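
The schema change and the fallback both enforce the same 25-character limit: `z.string().max(25)` rejects a longer model output, and the fallback truncates with `substring(0, 25)`. A sketch of a shared helper for that limit follows; `toVideoTitle` and `VIDEO_TITLE_MAX` are hypothetical names, and the trimming of whitespace is an addition that the diff itself does not perform.

```typescript
// Matches the Video.title varchar(25) constraint mentioned in the schema comment.
const VIDEO_TITLE_MAX = 25;

function toVideoTitle(raw: string, max: number = VIDEO_TITLE_MAX): string {
  const trimmed = raw.trim();
  if (trimmed.length <= max) return trimmed;
  // Cut at the limit, then drop a trailing space left by the cut.
  return trimmed.substring(0, max).trimEnd();
}
```

JavaScript string length counts UTF-16 code units while PostgreSQL `varchar(n)` counts characters, so a string of at most 25 code units always fits the column, although the cut can split an emoji or other surrogate pair.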
12 changes: 12 additions & 0 deletions apps/scraper/src/utils/ai/provider.ts
Collaborator

wtf? why a whole separate file just for this?

Collaborator Author

three different files import it (text-generation.ts, image-keywords.ts, marketing-generation.ts), so it keeps the imports manageable,
and it gives separation of concerns

@@ -0,0 +1,12 @@
import { createVertex } from "@ai-sdk/google-vertex";

const project = process.env.GOOGLE_VERTEX_PROJECT;
const location = process.env.GOOGLE_VERTEX_LOCATION;
const apiKey = process.env.GOOGLE_VERTEX_API_KEY;

export const vertexProvider = createVertex({
...(project ? { project } : {}),
...(location ? { location } : {}),
...(apiKey ? { apiKey } : {}),
});
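
The conditional spreads in `provider.ts` omit a key entirely when its environment variable is unset or empty, instead of passing `undefined`. The pattern is shown in isolation below; `buildVertexOptions` and `VertexOptions` are hypothetical names, and the idea that omitted keys let the provider apply its own defaults is an assumption about `createVertex`, not something this diff shows.

```typescript
interface VertexOptions {
  project?: string;
  location?: string;
  apiKey?: string;
}

type Env = Record<string, string | undefined>;

function buildVertexOptions(env: Env): VertexOptions {
  const project = env["GOOGLE_VERTEX_PROJECT"];
  const location = env["GOOGLE_VERTEX_LOCATION"];
  const apiKey = env["GOOGLE_VERTEX_API_KEY"];
  return {
    // Each spread contributes a key only when the value is a non-empty string.
    ...(project ? { project } : {}),
    ...(location ? { location } : {}),
    ...(apiKey ? { apiKey } : {}),
  };
}

// Usage (in Node): createVertex(buildVertexOptions(process.env))
```

Keeping this in one module means each caller imports a ready-made `vertexProvider` rather than repeating the environment handling.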

16 changes: 10 additions & 6 deletions apps/scraper/src/utils/ai/text-generation.ts
@@ -1,12 +1,12 @@
/**
* AI text generation utilities using OpenAI
* AI text generation utilities using Google Vertex AI
* Generates summaries and full articles from government content
*/

import { google } from '@ai-sdk/google';
import { generateText, APICallError, RetryError } from 'ai';
import { createLogger } from '../log.js';
import { trackGeminiUsage } from '../costs.js';
import { vertexProvider } from './provider.js';

const logger = createLogger("ai");

@@ -53,8 +53,12 @@ export async function generateAISummary(
}
try {
const { text, usage } = await generateText({
model: google('gemini-2.5-flash'),
prompt: `Generate a concise, engaging summary (max 100 characters) for this government content. Focus on the key action or impact.
model: vertexProvider('gemini-2.5-flash'),
prompt: `You are an expert at simplifying complex government and legal jargon for a general audience.
Generate a very short, punchy summary (max 100 characters) for this content.

Goal: Tell a regular person "what happened" or "what changed" in one quick sentence.
Style: Use active voice, plain English (8th-grade level), and NO jargon. Focus on the direct impact.

Title: ${title}

@@ -96,13 +100,13 @@ export async function generateAIArticle(
logger.start(`Generating AI article for: ${title}`);

const { text, usage } = await generateText({
model: google('gemini-2.5-flash'),
model: vertexProvider('gemini-2.5-flash'),
prompt: `You are an expert at making government and legal content accessible and engaging for everyday people. Transform the following ${type} into a well-structured, markdown-formatted article.

**Structure your article with these 4 sections:**

## What This Means For You
Write 2-3 concise sentences (max 150 words) that immediately tell everyday people what this means for their lives. Use plain language, avoid jargon, and focus on direct impact. Make it relatable and concrete.
Write 1-2 very short, punchy sentences (max 50 words) that immediately tell a regular person how this affects their life. Use 5th-8th grade reading level. Completely avoid legal or technical terms. Focus on the "so what?"—the direct, practical result for everyday people. Make it feel human and relevant.

## Overview
Provide a balanced, neutral, and informative explanation of what this ${type} is about. Use engaging storytelling elements while remaining objective. Break down complex concepts, define technical terms, and provide context. Make it interesting to read while being thorough. Aim for 200-400 words.