constantgillet/pugcrawl

Pugcrawl logo

🐛🦿 Pugcrawl 🦿🐛

Pugcrawl is a lightweight and easy-to-use crawler that extracts data from any website in seconds using AI, with typed results.

⚠️🔨 This repository is in an early stage. Do not use it in production; it may consume a large number of LLM tokens. 🔨⚠️

Installation

Run the following command:

npm i pugcrawl

What's inside?

This monorepo includes the following packages/apps:

Apps and Packages

  • demo: a small demo app
  • @repo/pugcrawl: the crawler client
  • @repo/base-engine: the base engine providing shared types
  • @repo/cheerio-engine: the Cheerio engine for making HTTP requests
  • @repo/puppeteer-engine: the Puppeteer engine for making HTTP requests
  • @repo/eslint-config: ESLint configurations (includes eslint-config-next and eslint-config-prettier)
  • @repo/typescript-config: tsconfig.json files used throughout the monorepo

Each package/app is 100% TypeScript.

Utilities

This Turborepo has additional tooling already set up, such as TypeScript for static type checking and ESLint for linting (see @repo/typescript-config and @repo/eslint-config above).

Use

To use the client, install a LangChain chat model and zod:

npm i @langchain/mistralai zod

Example:

import { PugCrawl } from "pugcrawl";
import { z } from "zod";
import { ChatMistralAI } from "@langchain/mistralai";

const app = new PugCrawl({
    model: new ChatMistralAI({
        model: "mistral-medium-latest",
        temperature: 0,
    }),
});

// Define schema to extract contents into
const schema = z.array(
    z.object({
        name: z.string().describe("The name of the product"),
        price: z.string().describe("Price of the product"),
    }),
);

const scrapeResult = await app.extract(
    ["https://sandbox.oxylabs.io/products/*"],
    {
        prompt:
            "Extract each product's name and price from this e-commerce website",
        schema: schema,
    },
);

if (!scrapeResult.success) {
    throw new Error(`Failed to scrape: ${JSON.stringify(scrapeResult)}`);
}

console.log("scrapeResult: ", scrapeResult.data);
