Pugcrawl is a lightweight, easy-to-use crawler that uses AI to extract typed data from any website in seconds.
Run the following command:

```sh
npm i pugcrawl
```

This monorepo includes the following packages/apps:
- `demo`: a small demo app
- `@repo/pugcrawl`: the crawler client
- `@repo/base-engine`: the base client for typing
- `@repo/cheerio-engine`: the Cheerio client for making HTTP requests
- `@repo/puppeteer-engine`: the Puppeteer client for making HTTP requests
- `@repo/eslint-config`: ESLint configurations (includes `eslint-config-next` and `eslint-config-prettier`)
- `@repo/typescript-config`: `tsconfig.json`s used throughout the monorepo
Each package/app is 100% TypeScript.
This Turborepo has some additional tools already set up for you:
- TypeScript for static type checking
- ESLint for code linting
- Biome for code formatting and linting
To use the client, install a LangChain chat model and Zod:

```sh
npm i @langchain/mistralai zod
```

Example:
```ts
import { PugCrawl } from "pugcrawl";
import { z } from "zod";
import { ChatMistralAI } from "@langchain/mistralai";

const app = new PugCrawl({
  model: new ChatMistralAI({
    model: "mistral-medium-latest",
    temperature: 0,
  }),
});

// Define the schema to extract contents into
const schema = z.array(
  z.object({
    name: z.string().describe("The name of the product"),
    price: z.string().describe("The price of the product"),
  }),
);

const scrapeResult = await app.extract(
  ["https://sandbox.oxylabs.io/products/*"],
  {
    prompt:
      "Your role is to extract each product's name and price on this ecommerce website",
    schema: schema,
  },
);

if (!scrapeResult.success) {
  // Serialize the result so the failure details are readable
  throw new Error(`Failed to scrape: ${JSON.stringify(scrapeResult)}`);
}

console.log("scrapeResult:", scrapeResult.data);
```