From d4d711adf1d1b7cc9d56e1ce94af60e848a4336a Mon Sep 17 00:00:00 2001
From: bluedotiya
Date: Sun, 22 Feb 2026 19:12:05 +0200
Subject: [PATCH 1/3] feat: improve graph visualization, crawl validation, URL
normalization, and docs
- GraphView: rewrite with responsive SVG, zoom/pan, centered layout, status-colored nodes
- Crawl depth validation (1-5), URL dedup scoped by crawl_id
- Feeder: stale job reclamation for stuck IN-PROGRESS jobs
- NewCrawl: add targeted crawl toggle (domain-scoped crawling)
- URL normalization module with comprehensive tests
- Add project vision doc, update API reference docs
- Add CLAUDE.md project instructions
Co-Authored-By: Claude Opus 4.6
---
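Reviewer note: the commit message lists crawl depth validation (1-5), and the Cypher in this patch marks a child as COMPLETED when its depth equals the requested depth, PENDING otherwise. The sketch below restates both rules in plain Rust. The function names `validate_depth` and `initial_job_status` are hypothetical and do not appear in the patch. The manager-side validation code is not visible in this excerpt, so treat this as an illustration of the rule rather than the actual implementation.

```rust
/// Hypothetical helper (not in the patch): accept only depths in 1..=5,
/// matching the range documented in the API reference and the zod schema.
fn validate_depth(depth: i64) -> Result<i64, String> {
    if (1..=5).contains(&depth) {
        Ok(depth)
    } else {
        Err(format!("depth must be between 1 and 5, got {depth}"))
    }
}

/// Hypothetical helper mirroring the Cypher expression
/// `CASE WHEN $cur_depth = $req_depth THEN 'COMPLETED' ELSE 'PENDING' END`:
/// a node at the requested depth is a leaf and is never fetched.
fn initial_job_status(current_depth: i64, requested_depth: i64) -> &'static str {
    if current_depth == requested_depth {
        "COMPLETED"
    } else {
        "PENDING"
    }
}
```

Under these assumptions, a crawl with depth 2 creates depth-1 children as PENDING and their depth-2 children as COMPLETED, so the feeder never claims nodes past the requested depth.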
.claude/settings.json | 6 +
CLAUDE.md | 85 ++++++++++++++
Cargo.lock | 22 +++-
docs/api-reference.md | 9 +-
docs/project-vision.md | 47 ++++++++
feeder/src/job.rs | 34 +++++-
feeder/src/main.rs | 2 +
frontend/package-lock.json | 23 ++++
frontend/package.json | 2 +
frontend/src/components/GraphView.tsx | 154 ++++++++++++++++++++++++--
frontend/src/lib/api.ts | 5 +-
frontend/src/pages/CrawlDetail.tsx | 14 ++-
frontend/src/pages/CrawlList.tsx | 5 +
frontend/src/pages/NewCrawl.tsx | 31 +++++-
frontend/src/types/api.ts | 2 +
manager/src/models/crawl.rs | 4 +
manager/src/routes/crawl.rs | 35 +++++-
manager/src/services/crawl_service.rs | 23 +++-
shared/Cargo.toml | 1 +
shared/src/url_normalize.rs | 99 +++++++++++++++++
20 files changed, 573 insertions(+), 30 deletions(-)
create mode 100644 .claude/settings.json
create mode 100644 CLAUDE.md
create mode 100644 docs/project-vision.md
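
Reviewer note: the targeted crawl toggle keeps only links whose registered domain (eTLD+1) matches the root URL's. The patch does this through `url_normalize::is_same_registered_domain`, backed by the `psl` crate. The sketch below is a stdlib-only approximation, not that implementation: `naive_registered_domain`, `same_registered_domain`, and `filter_targeted` are hypothetical names, inputs are assumed to be bare hostnames, and the "last two labels" rule is a deliberate simplification that is wrong for multi-label suffixes such as `co.uk`.

```rust
/// ASSUMPTION: naive stand-in for a Public Suffix List lookup.
/// Takes the last two labels of a hostname; the real patch uses `psl`.
fn naive_registered_domain(host: &str) -> Option<String> {
    let host = host.trim_end_matches('.').to_ascii_lowercase();
    let labels: Vec<&str> = host.split('.').collect();
    // A bare suffix ("com") or an empty label ("a..b") has no registered domain.
    if labels.len() < 2 || labels.iter().any(|l| l.is_empty()) {
        return None;
    }
    Some(labels[labels.len() - 2..].join("."))
}

/// True when `host` falls under `target` (compared case-insensitively).
fn same_registered_domain(host: &str, target: &str) -> bool {
    let target = target.to_ascii_lowercase();
    naive_registered_domain(host).map_or(false, |rd| rd == target)
}

/// Keep only the hosts inside the target domain, as the feeder's
/// "Step 2b" filter does for targeted jobs.
fn filter_targeted<'a>(hosts: &'a [String], target: &str) -> Vec<&'a String> {
    hosts
        .iter()
        .filter(|h| same_registered_domain(h.as_str(), target))
        .collect()
}
```

With target `example.com`, this keeps `blog.example.com` and `shop.example.com` and drops `example.org`, which matches the behaviour described in the NewCrawl help text.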
diff --git a/.claude/settings.json b/.claude/settings.json
new file mode 100644
index 0000000..6119f7c
--- /dev/null
+++ b/.claude/settings.json
@@ -0,0 +1,6 @@
+{
+ "enabledPlugins": {
+ "playwright-skill@playwright-skill": true,
+ "skill-creator@claude-plugins-official": true
+ }
+}
diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 0000000..c8483a2
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,85 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Build & Test Commands
+
+### Rust
+```bash
+cargo check --workspace # Fast compilation check
+cargo build --release # Release build (LTO + stripped)
+cargo test --workspace # Run all tests
+cargo test -p shared # Test single crate
+cargo clippy --workspace -- -D warnings # Lint (CI-strict)
+```
+
+### Frontend (from `frontend/`)
+```bash
+npm install # Install deps
+npm run dev # Dev server on :3000 (proxies /api to :8080)
+npm run build # Type-check + production build
+npm run lint # ESLint
+npm run type-check # TypeScript check only
+```
+
+### Docker (from repo root, use minikube docker-env for local k8s)
+```bash
+docker build -t ghcr.io/bluedotiya/web-crawler/manager:latest -f manager/Dockerfile .
+docker build -t ghcr.io/bluedotiya/web-crawler/feeder:latest -f feeder/Dockerfile .
+docker build -t ghcr.io/bluedotiya/web-crawler/frontend:latest -f frontend/Dockerfile .
+```
+
+## Architecture
+
+Three services communicate through a shared Neo4j database (no direct inter-service HTTP):
+
+- **manager** — Axum HTTP server (port 8080). REST API at `/api/v1/crawls/*` + WebSocket for live progress. Creates ROOT nodes and initial URL children when a crawl is submitted.
+- **feeder** — Background workers (8 replicas). Poll Neo4j for PENDING URLs, fetch HTML, extract links, create child nodes. Atomic job claiming prevents worker conflicts.
+- **frontend** — React SPA (Vite/TypeScript/Tailwind). Served by nginx in production, proxied via Vite in dev. Uses React Query for polling and WebSocket for real-time updates.
+- **shared** — Rust library crate used by both manager and feeder. Contains: crawler (HTTP fetch + URL extraction), dns (resolution with iterative domain shortening), neo4j_client, url_normalize, schema (indexes/constraints), error types.
+
+### Data Flow
+1. User submits URL + depth (1-5) via frontend → POST `/api/v1/crawls`
+2. Manager normalizes URL, resolves DNS, creates ROOT + child URL nodes in Neo4j
+3. Feeder workers atomically claim PENDING URLs, fetch HTML, extract/deduplicate links, create children
+4. Frontend polls progress via REST (5s) or WebSocket (2s), displays force-graph visualization
+
+### Neo4j Data Model
+- **ROOT** node (one per crawl, unique on `crawl_id`) — the seed URL
+- **URL** nodes — discovered links with `job_status` (PENDING/IN-PROGRESS/COMPLETED/FAILED/CANCELLED)
+- **Lead** edges — parent → child link relationships
+- All nodes scoped by `crawl_id` for isolation between crawls
+
+## Key Conventions
+
+- **Conventional commits** required on PR titles: `feat:`, `fix:`, `chore:`, etc. (enforced by CI). Breaking changes use `!` suffix (e.g., `feat!:`). Drives automated semver + per-service tagging.
+- **Pre-commit hooks**: `cargo check`, `cargo clippy -D warnings`, `cargo test`, frontend lint+typecheck. Install: `pip install pre-commit && pre-commit install`
+- **Workspace dependency gotcha**: `default-features = false` in `[workspace.dependencies]` is ignored by Cargo. Each member crate must set it explicitly.
+- **TLS in containers**: Use `rustls-tls-webpki-roots` (bundles CAs in binary). Avoid `native-tls` or `native-roots` in slim Docker images.
+- **HTTP clients** in both feeder and manager must set `.user_agent(...)` to avoid 403 responses.
+- **TypeScript**: Strict mode enabled, no unused locals/parameters. Path alias `@/` → `./src/`.
+- **Docker images** must use full GHCR path (`ghcr.io/bluedotiya/web-crawler/{service}:tag`) to match k8s deployment specs.
+
+## API Routes (manager)
+
+| Method | Endpoint | Purpose |
+|--------|----------|---------|
+| POST | `/api/v1/crawls` | Create new crawl |
+| GET | `/api/v1/crawls` | List crawls (filter/pagination) |
+| GET | `/api/v1/crawls/{id}` | Get crawl progress |
+| DELETE | `/api/v1/crawls/{id}` | Cancel crawl |
+| GET | `/api/v1/crawls/{id}/graph` | Graph data (nodes + edges) |
+| GET | `/api/v1/crawls/{id}/stats` | Crawl statistics |
+| GET | `/api/v1/crawls/{id}/ws` | WebSocket for live updates |
+| GET | `/livez`, `/readyz` | Health probes |
+
+## Project Layout
+
+```
+shared/src/ → lib.rs, crawler.rs, dns.rs, neo4j_client.rs, url_normalize.rs, schema.rs, error.rs
+manager/src/ → main.rs, config.rs, routes/{crawl,status,graph,ws}.rs, services/{crawl,graph}_service.rs
+feeder/src/ → main.rs, config.rs, job.rs
+frontend/src/ → App.tsx, pages/{Dashboard,CrawlList,CrawlDetail,NewCrawl}.tsx, components/GraphView.tsx, lib/api.ts, hooks/useWebSocket.ts
+web-crawler/ → Helm parent chart (neo4j, manager, feeder, frontend subcharts)
+docs/ → architecture.md, api-reference.md, neo4j-graph-model.md, deployment.md, development.md
+```
diff --git a/Cargo.lock b/Cargo.lock
index c45af5e..25d8cd8 100644
--- a/Cargo.lock
+++ b/Cargo.lock
@@ -859,7 +859,7 @@ dependencies = [
"libc",
"percent-encoding",
"pin-project-lite",
- "socket2 0.5.10",
+ "socket2 0.6.2",
"tokio",
"tower-service",
"tracing",
@@ -1466,6 +1466,21 @@ dependencies = [
"unicode-ident",
]
+[[package]]
+name = "psl"
+version = "2.1.190"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "66fed3dc7578357ff12137c75eac73413b6aba9a7204916c19f2a0e9e1e920e0"
+dependencies = [
+ "psl-types",
+]
+
+[[package]]
+name = "psl-types"
+version = "2.0.11"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "33cb294fe86a74cbcf50d4445b37da762029549ebeea341421c7c70370f86cac"
+
[[package]]
name = "quote"
version = "1.0.44"
@@ -1654,7 +1669,7 @@ dependencies = [
"errno",
"libc",
"linux-raw-sys",
- "windows-sys 0.52.0",
+ "windows-sys 0.61.2",
]
[[package]]
@@ -1869,6 +1884,7 @@ dependencies = [
"futures",
"hickory-resolver",
"neo4rs",
+ "psl",
"regex",
"reqwest",
"thiserror 2.0.18",
@@ -2006,7 +2022,7 @@ dependencies = [
"getrandom 0.3.4",
"once_cell",
"rustix",
- "windows-sys 0.52.0",
+ "windows-sys 0.61.2",
]
[[package]]
diff --git a/docs/api-reference.md b/docs/api-reference.md
index 41dbfc5..db82200 100644
--- a/docs/api-reference.md
+++ b/docs/api-reference.md
@@ -20,13 +20,14 @@ Start a new crawl from a given URL.
|-------|------|----------|-------------|
| `url` | string | Yes | The URL to crawl (must be http or https) |
| `depth` | integer | Yes | Maximum link depth to follow (1–5, where 1 = root only) |
+| `targeted` | boolean | No | When `true`, only follow links within the same registered domain (eTLD+1) as the root URL. Defaults to `false`. |
**Example:**
```bash
curl -X POST http://localhost:8080/api/v1/crawls \
-H 'Content-Type: application/json' \
- -d '{"url": "https://example.com", "depth": 2}'
+ -d '{"url": "https://example.com", "depth": 2, "targeted": true}'
```
**Response:** `201 Created`
@@ -84,7 +85,8 @@ curl "http://localhost:8080/api/v1/crawls?status=running&limit=10"
"total": 42,
"completed": 40,
"failed": 2,
- "cancelled": 0
+ "cancelled": 0,
+ "targeted": true
}
],
"total": 1,
@@ -128,7 +130,8 @@ curl http://localhost:8080/api/v1/crawls/d262a3e7-19de-437f-b0a4-cf1d689b1caf
"failed": 60,
"cancelled": 0,
"root_url": "https://example.com",
- "requested_depth": 3
+ "requested_depth": 3,
+ "targeted": false
}
```
diff --git a/docs/project-vision.md b/docs/project-vision.md
new file mode 100644
index 0000000..f00127b
--- /dev/null
+++ b/docs/project-vision.md
@@ -0,0 +1,47 @@
+# Web crawler vision
+Create a free, open-source, deployable platform for Red & Blue teams that want to discover the web attack surface of their applications.
+
+## About
+This file should be used as general guidelines for development. When design decisions are made, this doc should define the "spirit" of those decisions.
+
+## My philosophy
+1. Don't reinvent the wheel - There is code written by smarter people than you. Be humble and use well-established code and tools.
+2. Open Source - This platform should be open and transparent for everyone to contribute, share, and use.
+3. Respect others - Use this platform for the betterment of software and products. Make the world better than you found it.
+4. Have fun - The process of creating things should be fun. There will be chores, but enjoy the process.
+
+
+## Design Principles (Derived from above)
+These principles are a collection of coding and design rules I personally came across and found to work. A lot of this is based on other people's design principles.
+
+---
+
+### Don't reinvent the wheel
+
+#### Adopt mainstream tools
+Use well-established tools from other open-source projects. Only create custom tools when it's absolutely necessary.
+
+#### Keep it simple, stupid
+Keep the project as simple as possible. The more moving parts, the less scalable it becomes, and the more things break.
+
+### Open Source
+
+#### All source code is public
+The project's vision is to be an open-source platform for blue and red teams; anyone can contribute.
+
+#### All source code should be free for individuals
+This platform should always be free for individuals, and for the foreseeable future, for anyone. The code license should reflect that.
+
+### Respect others
+
+#### Respectful crawling
+Rate limiting, robots.txt awareness, and polite user-agent strings by default. The tool should be hard to misuse for DoS or abuse.
+
+### Have fun
+
+#### Visualization graph should be fun to use and explore
+The visuals and tools for exploring the graph should be fun for the user, possibly gamified.
+
+#### Project theme should be fun
+The theme of this project should be cartoony, playful, and fun. The main theme is cobweb (as it's a crawler).
+
diff --git a/feeder/src/job.rs b/feeder/src/job.rs
index c13e16b..b5e1f93 100644
--- a/feeder/src/job.rs
+++ b/feeder/src/job.rs
@@ -16,6 +16,8 @@ pub struct UrlJob {
pub current_depth: i64,
pub attempts: Option,
pub crawl_id: String,
+ pub targeted: bool,
+ pub target_domain: String,
}
/// Represents a child node to be created in Neo4j.
@@ -28,6 +30,8 @@ struct ChildNode {
current_depth: i64,
request_time: String,
crawl_id: String,
+ targeted: bool,
+ target_domain: String,
}
/// Atomically fetches and claims a single URL job from Neo4j.
@@ -64,6 +68,8 @@ pub async fn fetch_job(graph: &Graph, stale_timeout: i64) -> Result("attempts").ok(),
crawl_id: node.get("crawl_id").unwrap_or_default(),
+ targeted: node.get::<bool>("targeted").unwrap_or(false),
+ target_domain: node.get::<String>("target_domain").unwrap_or_default(),
}))
}
None => Ok(None),
@@ -181,7 +187,8 @@ async fn batch_create_children(
ON CREATE SET c.ip = $ip, c.domain = $domain, \
c.job_status = CASE WHEN $cur_depth = $req_depth THEN 'COMPLETED' ELSE 'PENDING' END, \
c.requested_depth = $req_depth, \
- c.current_depth = $cur_depth, c.request_time = $req_time \
+ c.current_depth = $cur_depth, c.request_time = $req_time, \
+ c.targeted = $targeted, c.target_domain = $target_domain \
MERGE (p)-[:Lead]->(c)",
)
.param("pname", parent.name.as_str())
@@ -194,7 +201,9 @@ async fn batch_create_children(
.param("http_type", child.http_type.as_str())
.param("req_depth", child.requested_depth)
.param("cur_depth", child.current_depth)
- .param("req_time", child.request_time.as_str()),
+ .param("req_time", child.request_time.as_str())
+ .param("targeted", child.targeted)
+ .param("target_domain", child.target_domain.as_str()),
)
.await?;
}
@@ -282,8 +291,21 @@ pub async fn feeding(
// Step 2: Extract URLs from HTML
let extracted_urls = crawler::extract_urls(&page_data.html);
+ // Step 2b: Filter by target domain when targeted
+ let filtered_urls: Vec<&String> = if job.targeted && !job.target_domain.is_empty() {
+ extracted_urls
+ .iter()
+ .filter(|u| {
+ let (norm_name, _) = url_normalize::normalize_url(u);
+ url_normalize::is_same_registered_domain(&norm_name, &job.target_domain)
+ })
+ .collect()
+ } else {
+ extracted_urls.iter().collect()
+ };
+
// Step 3: Deduplicate against existing DB nodes (server-side)
- let upper_urls: HashSet<String> = extracted_urls.iter().map(|u| u.to_uppercase()).collect();
+ let upper_urls: HashSet<String> = filtered_urls.iter().map(|u| u.to_uppercase()).collect();
let new_urls = filter_new_urls(graph, &upper_urls, &job.crawl_id).await?;
if new_urls.is_empty() {
@@ -303,6 +325,9 @@ pub async fn feeding(
let current_depth = job.current_depth;
let crawl_id = job.crawl_id.clone();
+ let targeted = job.targeted;
+ let target_domain = job.target_domain.clone();
+
let dns_futures: Vec<_> = normalized
.iter()
.map(|(name, http_type)| {
@@ -310,6 +335,7 @@ pub async fn feeding(
let http_type = http_type.clone();
let req_time = request_time.clone();
let cid = crawl_id.clone();
+ let td = target_domain.clone();
async move {
match dns::get_network_stats(resolver, &name, config.max_dns_depth).await {
Ok(stats) => Some(ChildNode {
@@ -321,6 +347,8 @@ pub async fn feeding(
current_depth: current_depth + 1,
request_time: req_time,
crawl_id: cid,
+ targeted,
+ target_domain: td,
}),
Err(e) => {
tracing::error!("URL: {} -- FAILED: {}", name, e);
diff --git a/feeder/src/main.rs b/feeder/src/main.rs
index 7217e96..6ce4988 100644
--- a/feeder/src/main.rs
+++ b/feeder/src/main.rs
@@ -123,6 +123,8 @@ async fn main() -> anyhow::Result<()> {
current_depth: url_job.current_depth,
attempts: url_job.attempts,
crawl_id: url_job.crawl_id.clone(),
+ targeted: url_job.targeted,
+ target_domain: url_job.target_domain.clone(),
});
// Check for shutdown after claiming but before processing.
diff --git a/frontend/package-lock.json b/frontend/package-lock.json
index 213b8a8..6b8917d 100644
--- a/frontend/package-lock.json
+++ b/frontend/package-lock.json
@@ -18,6 +18,7 @@
"@tanstack/react-query": "^5.62.0",
"class-variance-authority": "^0.7.1",
"clsx": "^2.1.1",
+ "d3-force": "^3.0.0",
"lucide-react": "^0.460.0",
"react": "^18.3.1",
"react-dom": "^18.3.1",
@@ -30,6 +31,7 @@
},
"devDependencies": {
"@eslint/js": "^9.15.0",
+ "@types/d3-force": "^3.0.10",
"@types/react": "^18.3.12",
"@types/react-dom": "^18.3.1",
"@vitejs/plugin-react": "^4.3.4",
@@ -2355,6 +2357,13 @@
"integrity": "sha512-NcV1JjO5oDzoK26oMzbILE6HW7uVXOHLQvHshBUW4UMdZGfiY6v5BeQwh9a9tCzv+CeefZQHJt5SRgK154RtiA==",
"license": "MIT"
},
+ "node_modules/@types/d3-force": {
+ "version": "3.0.10",
+ "resolved": "https://registry.npmjs.org/@types/d3-force/-/d3-force-3.0.10.tgz",
+ "integrity": "sha512-ZYeSaCF3p73RdOKcjj+swRlZfnYpK1EbaDiYICEEp5Q6sUiqFaFQ9qgoshp5CzIyyb/yD09kD9o2zEltCexlgw==",
+ "dev": true,
+ "license": "MIT"
+ },
"node_modules/@types/d3-interpolate": {
"version": "3.0.4",
"resolved": "https://registry.npmjs.org/@types/d3-interpolate/-/d3-interpolate-3.0.4.tgz",
@@ -3257,6 +3266,20 @@
"node": ">=12"
}
},
+ "node_modules/d3-force": {
+ "version": "3.0.0",
+ "resolved": "https://registry.npmjs.org/d3-force/-/d3-force-3.0.0.tgz",
+ "integrity": "sha512-zxV/SsA+U4yte8051P4ECydjD/S+qeYtnaIyAs9tgHCqfguma/aAQDjo85A9Z6EKhBirHRJHXIgJUlffT4wdLg==",
+ "license": "ISC",
+ "dependencies": {
+ "d3-dispatch": "1 - 3",
+ "d3-quadtree": "1 - 3",
+ "d3-timer": "1 - 3"
+ },
+ "engines": {
+ "node": ">=12"
+ }
+ },
"node_modules/d3-force-3d": {
"version": "3.0.6",
"resolved": "https://registry.npmjs.org/d3-force-3d/-/d3-force-3d-3.0.6.tgz",
diff --git a/frontend/package.json b/frontend/package.json
index a78fb94..b5d1251 100644
--- a/frontend/package.json
+++ b/frontend/package.json
@@ -21,6 +21,7 @@
"@tanstack/react-query": "^5.62.0",
"class-variance-authority": "^0.7.1",
"clsx": "^2.1.1",
+ "d3-force": "^3.0.0",
"lucide-react": "^0.460.0",
"react": "^18.3.1",
"react-dom": "^18.3.1",
@@ -33,6 +34,7 @@
},
"devDependencies": {
"@eslint/js": "^9.15.0",
+ "@types/d3-force": "^3.0.10",
"@types/react": "^18.3.12",
"@types/react-dom": "^18.3.1",
"@vitejs/plugin-react": "^4.3.4",
diff --git a/frontend/src/components/GraphView.tsx b/frontend/src/components/GraphView.tsx
index eb7ac59..543e513 100644
--- a/frontend/src/components/GraphView.tsx
+++ b/frontend/src/components/GraphView.tsx
@@ -1,8 +1,10 @@
-import { useRef, useCallback, useMemo } from "react";
+import { useState, useRef, useCallback, useMemo, useEffect } from "react";
import ForceGraph2D, {
type ForceGraphMethods,
+ type LinkObject,
type NodeObject,
} from "react-force-graph-2d";
+import { forceRadial } from "d3-force";
import type { GraphData } from "../types/api";
interface GraphViewProps {
@@ -11,6 +13,7 @@ interface GraphViewProps {
interface CrawlNode {
label: string;
+ domain: string;
depth: number;
status: string;
nodeType: string;
@@ -30,15 +33,33 @@ export function GraphView({ data }: GraphViewProps) {
const fgRef = useRef> | undefined>(
undefined
);
+ const [selectedNode, setSelectedNode] = useState<string | null>(null);
+ const containerRef = useRef<HTMLDivElement>(null);
+
+ const needsRecenter = useRef(true);
+ const [containerWidth, setContainerWidth] = useState(0);
+
+ useEffect(() => {
+ const el = containerRef.current;
+ if (!el) return;
+ const observer = new ResizeObserver((entries) => {
+ setContainerWidth(entries[0].contentRect.width);
+ });
+ observer.observe(el);
+ return () => observer.disconnect();
+ }, []);
const graphData = useMemo(() => {
const nodes = data.nodes.map((n) => ({
id: n.id,
label: n.label,
+ domain: n.domain,
depth: n.depth,
status: n.status,
nodeType: n.node_type,
- val: n.node_type === "ROOT" ? 3 : 1,
+ val: n.node_type === "ROOT" ? 4 : n.depth === 1 ? 2.5 : n.depth === 2 ? 1.5 : 1,
+ // Pin root node at origin for stable centering
+ ...(n.node_type === "ROOT" ? { fx: 0, fy: 0 } : {}),
}));
const links = data.edges.map((e) => ({
@@ -49,6 +70,22 @@ export function GraphView({ data }: GraphViewProps) {
return { nodes, links };
}, [data]);
+ const { neighborIds, connectedLinks } = useMemo(() => {
+ if (!selectedNode) return { neighborIds: new Set<string>(), connectedLinks: new Set<string>() };
+ const nIds = new Set<string>();
+ const cLinks = new Set<string>();
+ graphData.links.forEach((link) => {
+ const src = typeof link.source === "object" ? (link.source as NodeObject).id : link.source;
+ const tgt = typeof link.target === "object" ? (link.target as NodeObject).id : link.target;
+ if (src === selectedNode || tgt === selectedNode) {
+ nIds.add(src as string);
+ nIds.add(tgt as string);
+ cLinks.add(`${src}->${tgt}`);
+ }
+ });
+ return { neighborIds: nIds, connectedLinks: cLinks };
+ }, [selectedNode, graphData]);
+
const activeStatuses = useMemo(() => {
const statuses = new Set();
data.nodes.forEach((n) => {
@@ -58,17 +95,60 @@ export function GraphView({ data }: GraphViewProps) {
return Object.entries(STATUS_COLORS).filter(([s]) => statuses.has(s));
}, [data]);
+ useEffect(() => {
+ const fg = fgRef.current;
+ if (!fg) return;
+
+ const ringSpacing = 120;
+
+ // Radial force: push nodes into concentric rings by depth
+ fg.d3Force(
+ "radial",
+ forceRadial(
+ (node: NodeObject) => ((node as CrawlNode).depth ?? 0) * ringSpacing,
+ 0,
+ 0
+ ).strength(0.8)
+ );
+
+ // Link distance based on depth
+ fg.d3Force("link")?.distance(
+ (link: LinkObject) => {
+ const src = link.source as NodeObject;
+ const tgt = link.target as NodeObject;
+ return 30 + Math.abs((tgt.depth ?? 0) - (src.depth ?? 0)) * 60;
+ }
+ );
+
+ // Stronger charge to spread nodes within rings
+ fg.d3Force("charge")?.strength(-80);
+
+ needsRecenter.current = true;
+ fg.d3ReheatSimulation();
+ }, [graphData]);
+
const handleEngineStop = useCallback(() => {
- if (fgRef.current) {
- fgRef.current.zoomToFit(400);
- }
+ const fg = fgRef.current;
+ if (!fg || !needsRecenter.current) return;
+ needsRecenter.current = false;
+
+ // Root is pinned at (0,0). Center on it and zoom to fit all nodes.
+ fg.centerAt(0, 0);
+ fg.zoomToFit(400, 40);
}, []);
const nodeColor = useCallback(
(node: NodeObject) => {
- return STATUS_COLORS[node.status || ""] || "#9ca3af";
+ const base = STATUS_COLORS[node.status || ""] || "#9ca3af";
+ if (!selectedNode) return base;
+ if (node.id === selectedNode || neighborIds.has(node.id as string)) return base;
+ // Dim unrelated nodes: parse hex to rgba with low opacity
+ const r = parseInt(base.slice(1, 3), 16);
+ const g = parseInt(base.slice(3, 5), 16);
+ const b = parseInt(base.slice(5, 7), 16);
+ return `rgba(${r},${g},${b},0.2)`;
},
- []
+ [selectedNode, neighborIds]
);
const nodeLabel = useCallback(
@@ -88,6 +168,7 @@ export function GraphView({ data }: GraphViewProps) {
return (
@@ -97,13 +178,46 @@ export function GraphView({ data }: GraphViewProps) {
nodeColor={nodeColor}
nodeLabel={nodeLabel}
nodeRelSize={6}
- linkColor={() => "rgba(255,255,255,0.15)"}
+ onNodeClick={(node: NodeObject) => {
+ setSelectedNode(node.id === selectedNode ? null : (node.id as string));
+ }}
+ onBackgroundClick={() => setSelectedNode(null)}
+ nodeCanvasObjectMode={() => selectedNode ? ("after" as const) : undefined}
+ nodeCanvasObject={(node: NodeObject, ctx, globalScale) => {
+ if (node.id !== selectedNode) return;
+ const r = Math.sqrt(node.val ?? 1) * 6 + 2;
+ ctx.beginPath();
+ ctx.arc(node.x!, node.y!, r, 0, 2 * Math.PI);
+ ctx.strokeStyle = "#ffffff";
+ ctx.lineWidth = 2 / globalScale;
+ ctx.stroke();
+ }}
+ linkColor={(link: LinkObject) => {
+ if (selectedNode) {
+ const src = typeof link.source === "object" ? (link.source as NodeObject).id : link.source;
+ const tgt = typeof link.target === "object" ? (link.target as NodeObject).id : link.target;
+ const key = `${src}->${tgt}`;
+ return connectedLinks.has(key) ? "rgba(255,255,255,0.6)" : "rgba(255,255,255,0.03)";
+ }
+ const depth = Math.max(
+ (link.source as NodeObject)?.depth ?? 0,
+ (link.target as NodeObject)?.depth ?? 0
+ );
+ const opacity = Math.max(0.05, 0.25 - depth * 0.05);
+ return `rgba(255,255,255,${opacity})`;
+ }}
+ linkWidth={(link: LinkObject) => {
+ if (!selectedNode) return 0.5;
+ const src = typeof link.source === "object" ? (link.source as NodeObject).id : link.source;
+ const tgt = typeof link.target === "object" ? (link.target as NodeObject).id : link.target;
+ return (src === selectedNode || tgt === selectedNode) ? 2 : 0.5;
+ }}
linkDirectionalArrowLength={3}
linkDirectionalArrowRelPos={1}
backgroundColor="#111827"
onEngineStop={handleEngineStop}
cooldownTicks={100}
- width={undefined}
+ width={containerWidth || undefined}
height={600}
/>
@@ -117,6 +231,28 @@ export function GraphView({ data }: GraphViewProps) {
))}
+ {selectedNode && (() => {
+ const node = graphData.nodes.find((n) => n.id === selectedNode);
+ if (!node) return null;
+ return (
+
+
+ {node.label}
+ setSelectedNode(null)}
+ className="text-gray-400 hover:text-white shrink-0 leading-none"
+ >
+ ×
+
+
+
Domain: {node.domain}
+
Depth: {node.depth}
+
Status: {node.status}
+
Type: {node.nodeType}
+
Connections: {neighborIds.size > 0 ? neighborIds.size - 1 : 0}
+
+ );
+ })()}
);
}
diff --git a/frontend/src/lib/api.ts b/frontend/src/lib/api.ts
index fc26efa..ef20276 100644
--- a/frontend/src/lib/api.ts
+++ b/frontend/src/lib/api.ts
@@ -19,12 +19,13 @@ async function fetchJSON(url: string, init?: RequestInit): Promise {
export async function createCrawl(
url: string,
- depth: number
+ depth: number,
+ targeted?: boolean
): Promise {
return fetchJSON(`${BASE}/crawls`, {
method: "POST",
headers: { "Content-Type": "application/json" },
- body: JSON.stringify({ url, depth }),
+ body: JSON.stringify({ url, depth, ...(targeted ? { targeted } : {}) }),
});
}
diff --git a/frontend/src/pages/CrawlDetail.tsx b/frontend/src/pages/CrawlDetail.tsx
index 3a150b3..f02f7da 100644
--- a/frontend/src/pages/CrawlDetail.tsx
+++ b/frontend/src/pages/CrawlDetail.tsx
@@ -103,7 +103,13 @@ export default function CrawlDetail() {
{crawl.root_url.toLowerCase()}
- Depth: {crawl.requested_depth} | ID: {id}
+ Depth: {crawl.requested_depth}
+ {crawl.targeted && (
+
+ Targeted
+
+ )}
+ {" "}| ID: {id}
@@ -227,6 +233,12 @@ export default function CrawlDetail() {
Requested Depth
{crawl.requested_depth}
+
+
Scope
+
+ {crawl.targeted ? "Targeted" : "Unrestricted"}
+
+
Status
diff --git a/frontend/src/pages/CrawlList.tsx b/frontend/src/pages/CrawlList.tsx
index f5938cd..6a91b71 100644
--- a/frontend/src/pages/CrawlList.tsx
+++ b/frontend/src/pages/CrawlList.tsx
@@ -97,6 +97,11 @@ export default function CrawlList() {
depth {crawl.requested_depth}
+ {crawl.targeted && (
+
+ Targeted
+
+ )}
diff --git a/frontend/src/pages/NewCrawl.tsx b/frontend/src/pages/NewCrawl.tsx
index 5098ae5..e49a57f 100644
--- a/frontend/src/pages/NewCrawl.tsx
+++ b/frontend/src/pages/NewCrawl.tsx
@@ -11,6 +11,7 @@ import { Input } from "../components/ui/input";
const schema = z.object({
url: z.string().url("Please enter a valid URL"),
depth: z.number().min(1).max(5),
+ targeted: z.boolean(),
});
type FormData = z.infer<typeof schema>;
@@ -28,7 +29,7 @@ export default function NewCrawl() {
formState: { errors },
} = useForm({
resolver: zodResolver(schema),
- defaultValues: { url: "", depth: 2 },
+ defaultValues: { url: "", depth: 2, targeted: false },
});
const depth = watch("depth");
@@ -37,7 +38,7 @@ export default function NewCrawl() {
setSubmitting(true);
setError("");
try {
- const result = await createCrawl(data.url, data.depth);
+ const result = await createCrawl(data.url, data.depth, data.targeted || undefined);
navigate(`/crawls/${result.crawl_id}`);
} catch (err) {
setError(err instanceof Error ? err.message : "Failed to start crawl");
@@ -101,6 +102,32 @@ export default function NewCrawl() {
)}
+
+
+
+
+ Targeted crawl
+
+
+ Only follow links within the same registered domain as the
+ root URL. For example, crawling{" "}
+
+ blog.example.com
+ {" "}
+ will also crawl{" "}
+
+ shop.example.com
+ {" "}
+ but not external sites.
+
+
+
+
What to expect
diff --git a/frontend/src/types/api.ts b/frontend/src/types/api.ts
index 95f04f0..ea1a25b 100644
--- a/frontend/src/types/api.ts
+++ b/frontend/src/types/api.ts
@@ -13,6 +13,7 @@ export interface CrawlProgress {
failed: number;
root_url: string;
requested_depth: number;
+ targeted: boolean;
}
export interface CrawlListItem {
@@ -23,6 +24,7 @@ export interface CrawlListItem {
total: number;
completed: number;
failed: number;
+ targeted: boolean;
}
export interface CrawlListResponse {
diff --git a/manager/src/models/crawl.rs b/manager/src/models/crawl.rs
index 8ed1e2f..1dc3f84 100644
--- a/manager/src/models/crawl.rs
+++ b/manager/src/models/crawl.rs
@@ -4,6 +4,8 @@ use serde::{Deserialize, Serialize};
pub struct CrawlRequest {
pub url: String,
pub depth: i64,
+ #[serde(default)]
+ pub targeted: Option<bool>,
}
#[derive(Serialize)]
@@ -24,6 +26,7 @@ pub struct CrawlProgress {
pub cancelled: i64,
pub root_url: String,
pub requested_depth: i64,
+ pub targeted: bool,
}
#[derive(Serialize)]
@@ -36,6 +39,7 @@ pub struct CrawlListItem {
pub completed: i64,
pub failed: i64,
pub cancelled: i64,
+ pub targeted: bool,
}
#[derive(Serialize)]
diff --git a/manager/src/routes/crawl.rs b/manager/src/routes/crawl.rs
index b766967..067f044 100644
--- a/manager/src/routes/crawl.rs
+++ b/manager/src/routes/crawl.rs
@@ -43,6 +43,23 @@ pub async fn create_crawl(
// 1. Normalize root URL
let (root_name, http_type) = url_normalize::normalize_url(&req.url);
+ let targeted = req.targeted.unwrap_or(false);
+
+ // 1b. Compute target domain for targeted crawls
+ let target_domain = if targeted {
+ match url_normalize::registered_domain(&root_name) {
+ Some(rd) => rd,
+ None => {
+ return (
+ StatusCode::BAD_REQUEST,
+ Json(json!({"error": "Cannot determine registered domain for targeted crawl (bare public suffix or invalid host)"})),
+ )
+ .into_response();
+ }
+ }
+ } else {
+ String::new()
+ };
// 2. Fetch page HTML
let page_data = match crawler::get_page_data(&state.client, &req.url).await {
@@ -85,10 +102,20 @@ pub async fn create_crawl(
// 6. Resolve DNS for each extracted URL in parallel
let request_time = format!("{:?}", page_data.elapsed);
- let dns_futures: Vec<_> = extracted_urls
+ // 6a. Normalize extracted URLs and filter by target domain if targeted
+ let normalized_urls: Vec<(String, String)> = extracted_urls
+ .iter()
+ .map(|url| url_normalize::normalize_url(url))
+ .filter(|(norm_name, _)| {
+ !targeted || url_normalize::is_same_registered_domain(norm_name, &target_domain)
+ })
+ .collect();
+
+ let dns_futures: Vec<_> = normalized_urls
.iter()
- .map(|url| {
- let (norm_name, child_http_type) = url_normalize::normalize_url(url);
+ .map(|(norm_name, child_http_type)| {
+ let norm_name = norm_name.clone();
+ let child_http_type = child_http_type.clone();
let resolver = &state.resolver;
let max_depth = state.config.max_dns_depth;
async move {
@@ -117,6 +144,8 @@ pub async fn create_crawl(
depth: req.depth,
request_time: &request_time,
children: &children,
+ targeted,
+ target_domain: &target_domain,
};
if let Err(e) = crawl_service::create_crawl_graph(&state.graph, &params).await
{
diff --git a/manager/src/services/crawl_service.rs b/manager/src/services/crawl_service.rs
index 62fbff2..193902a 100644
--- a/manager/src/services/crawl_service.rs
+++ b/manager/src/services/crawl_service.rs
@@ -11,6 +11,8 @@ pub struct CreateCrawlParams<'a> {
pub depth: i64,
pub request_time: &'a str,
pub children: &'a [(String, String, String, String)],
+ pub targeted: bool,
+ pub target_domain: &'a str,
}
/// Create ROOT node and child URL nodes in a single transaction with crawl_id.
@@ -25,7 +27,8 @@ pub async fn create_crawl_graph(
query(
"CREATE (:ROOT {name: $name, ip: $ip, domain: $domain, http_type: $http_type, \
requested_depth: $req_depth, current_depth: 0, request_time: $req_time, \
- crawl_id: $crawl_id, created_at: datetime()})",
+ crawl_id: $crawl_id, created_at: datetime(), \
+ targeted: $targeted, target_domain: $target_domain})",
)
.param("name", params.root_name)
.param("ip", params.root_ip)
@@ -33,7 +36,9 @@ pub async fn create_crawl_graph(
.param("http_type", params.http_type)
.param("req_depth", params.depth)
.param("req_time", params.request_time)
- .param("crawl_id", params.crawl_id),
+ .param("crawl_id", params.crawl_id)
+ .param("targeted", params.targeted)
+ .param("target_domain", params.target_domain),
)
.await?;
@@ -46,7 +51,8 @@ pub async fn create_crawl_graph(
ON CREATE SET c.ip = $ip, c.domain = $domain, \
c.job_status = CASE WHEN 1 = $req_depth THEN 'COMPLETED' ELSE 'PENDING' END, \
c.requested_depth = $req_depth, \
- c.current_depth = 1, c.request_time = $req_time \
+ c.current_depth = 1, c.request_time = $req_time, \
+ c.targeted = $targeted, c.target_domain = $target_domain \
MERGE (root)-[:Lead]->(c)",
)
.param("crawl_id", params.crawl_id)
@@ -55,7 +61,9 @@ pub async fn create_crawl_graph(
.param("ip", child_ip.as_str())
.param("domain", child_domain.as_str())
.param("http_type", child_http_type.as_str())
- .param("req_time", params.request_time),
+ .param("req_time", params.request_time)
+ .param("targeted", params.targeted)
+ .param("target_domain", params.target_domain),
)
.await?;
}
@@ -83,6 +91,7 @@ pub async fn get_crawl_progress(
sum(CASE WHEN u.job_status = 'FAILED' THEN 1 ELSE 0 END) AS failed, \
sum(CASE WHEN u.job_status = 'CANCELLED' THEN 1 ELSE 0 END) AS cancelled \
RETURN r.name AS root_url, r.requested_depth AS depth, r.http_type AS http_type, \
+ r.targeted AS targeted, \
total, completed, pending, in_progress, failed, cancelled",
)
.param("crawl_id", crawl_id),
@@ -113,6 +122,8 @@ pub async fn get_crawl_progress(
"running".to_string()
};
+ let targeted: bool = row.get::<bool>("targeted").unwrap_or(false);
+
Ok(Some(CrawlProgress {
crawl_id: crawl_id.to_string(),
status,
@@ -124,6 +135,7 @@ pub async fn get_crawl_progress(
cancelled,
root_url: format!("{}{}", http_type, url),
requested_depth: depth,
+ targeted,
}))
}
None => Ok(None),
@@ -159,6 +171,7 @@ pub async fn list_crawls(
UNWIND items[$offset..($offset + $limit)] AS item \
RETURN item.r.crawl_id AS crawl_id, item.r.name AS root_url, \
item.r.http_type AS http_type, item.r.requested_depth AS depth, \
+ item.r.targeted AS targeted, \
item.total AS total, item.completed AS completed, item.failed AS failed, item.cancelled AS cancelled, item.status AS status, \
total_count"
} else {
@@ -178,6 +191,7 @@ pub async fn list_crawls(
UNWIND items[$offset..($offset + $limit)] AS item \
RETURN item.r.crawl_id AS crawl_id, item.r.name AS root_url, \
item.r.http_type AS http_type, item.r.requested_depth AS depth, \
+ item.r.targeted AS targeted, \
item.total AS total, item.completed AS completed, item.failed AS failed, item.cancelled AS cancelled, item.status AS status, \
total_count"
};
@@ -208,6 +222,7 @@ pub async fn list_crawls(
completed: row.get("completed")?,
failed: row.get("failed")?,
cancelled: row.get("cancelled")?,
+ targeted: row.get::<bool>("targeted").unwrap_or(false),
});
}
diff --git a/shared/Cargo.toml b/shared/Cargo.toml
index 393e47b..9a0bc99 100644
--- a/shared/Cargo.toml
+++ b/shared/Cargo.toml
@@ -13,6 +13,7 @@ regex = { workspace = true }
thiserror = { workspace = true }
tracing = { workspace = true }
futures = { workspace = true }
+psl = "2"
[dev-dependencies]
tokio = { workspace = true }
diff --git a/shared/src/url_normalize.rs b/shared/src/url_normalize.rs
index fd54467..8e01415 100644
--- a/shared/src/url_normalize.rs
+++ b/shared/src/url_normalize.rs
@@ -21,6 +21,39 @@ pub fn normalize_url(url: &str) -> (String, String) {
(name, proto.to_string())
}
+use psl::Psl;
+
+/// Extracts the registered domain (eTLD+1) from a normalized name.
+///
+/// The input should be an uppercase normalized name (no protocol, no `www.`).
+/// Ports are stripped before lookup. Returns uppercase eTLD+1.
+///
+/// # Examples
+/// - `"EXAMPLE.COM"` -> `Some("EXAMPLE.COM")`
+/// - `"BLOG.EXAMPLE.CO.UK"` -> `Some("EXAMPLE.CO.UK")`
+/// - `"EXAMPLE.COM:8080"` -> `Some("EXAMPLE.COM")`
+/// - `"COM"` (bare TLD) -> `None`
+pub fn registered_domain(normalized_name: &str) -> Option<String> {
+ // Strip port if present
+ let host = normalized_name.split(':').next().unwrap_or(normalized_name);
+ // psl requires lowercase input
+ let lower = host.to_lowercase();
+ let domain = psl::List.domain(lower.as_bytes())?;
+ let domain_str = std::str::from_utf8(domain.as_bytes()).ok()?;
+ Some(domain_str.to_uppercase())
+}
+
+/// Checks if a normalized name belongs to the same registered domain as the target.
+///
+/// Both inputs should be uppercase. The target should already be a registered domain
+/// (output of `registered_domain()`).
+pub fn is_same_registered_domain(normalized_name: &str, target_domain: &str) -> bool {
+ match registered_domain(normalized_name) {
+ Some(rd) => rd == target_domain,
+ None => false,
+ }
+}
+
#[cfg(test)]
mod tests {
use super::*;
@@ -66,4 +99,70 @@ mod tests {
assert_eq!(name, "SUBDOMAIN.WWW.EXAMPLE.COM");
assert_eq!(proto, "HTTPS://");
}
+
+ #[test]
+ fn test_registered_domain_simple() {
+ assert_eq!(registered_domain("EXAMPLE.COM"), Some("EXAMPLE.COM".to_string()));
+ }
+
+ #[test]
+ fn test_registered_domain_subdomain() {
+ assert_eq!(registered_domain("BLOG.EXAMPLE.COM"), Some("EXAMPLE.COM".to_string()));
+ }
+
+ #[test]
+ fn test_registered_domain_deep_subdomain() {
+ assert_eq!(registered_domain("A.B.C.EXAMPLE.COM"), Some("EXAMPLE.COM".to_string()));
+ }
+
+ #[test]
+ fn test_registered_domain_co_uk() {
+ assert_eq!(registered_domain("BLOG.EXAMPLE.CO.UK"), Some("EXAMPLE.CO.UK".to_string()));
+ }
+
+ #[test]
+ fn test_registered_domain_with_port() {
+ assert_eq!(registered_domain("EXAMPLE.COM:8080"), Some("EXAMPLE.COM".to_string()));
+ }
+
+ #[test]
+ fn test_registered_domain_bare_tld() {
+ assert_eq!(registered_domain("COM"), None);
+ }
+
+ #[test]
+ fn test_registered_domain_bare_public_suffix() {
+ assert_eq!(registered_domain("GITHUB.IO"), None);
+ }
+
+ #[test]
+ fn test_registered_domain_localhost() {
+ assert_eq!(registered_domain("LOCALHOST"), None);
+ }
+
+
+ #[test]
+ fn test_is_same_registered_domain_match() {
+ assert!(is_same_registered_domain("BLOG.EXAMPLE.COM", "EXAMPLE.COM"));
+ }
+
+ #[test]
+ fn test_is_same_registered_domain_exact() {
+ assert!(is_same_registered_domain("EXAMPLE.COM", "EXAMPLE.COM"));
+ }
+
+ #[test]
+ fn test_is_same_registered_domain_no_match() {
+ assert!(!is_same_registered_domain("GOOGLE.COM", "EXAMPLE.COM"));
+ }
+
+ #[test]
+ fn test_is_same_registered_domain_with_port() {
+ assert!(is_same_registered_domain("API.EXAMPLE.COM:3000", "EXAMPLE.COM"));
+ }
+
+ #[test]
+ fn test_is_same_registered_domain_co_uk() {
+ assert!(is_same_registered_domain("SHOP.EXAMPLE.CO.UK", "EXAMPLE.CO.UK"));
+ }
}
From be5b1e7fc3501314ee8850dde6b218397ca52528 Mon Sep 17 00:00:00 2001
From: bluedotiya
Date: Sun, 22 Feb 2026 19:31:26 +0200
Subject: [PATCH 2/3] =?UTF-8?q?fix:=20address=20PR=20review=20=E2=80=94=20?=
=?UTF-8?q?4xx=20double-update=20bug,=20normalize-once,=20nits?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
- Fix double status update on 4xx: check permanent failure before
retry-vs-fail branch instead of overwriting PENDING after
- Normalize extracted URLs once into a HashMap, reuse for targeted
filtering and dedup instead of normalizing twice
- Move `use psl::Psl` to top of url_normalize.rs
- Remove extra blank line in url_normalize tests
- Replace nested ternary with lookup object in GraphView
- Add .claude/ to .gitignore and untrack .claude/settings.json
Co-Authored-By: Claude Opus 4.6
---
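Reviewer note: the retry decision this patch settles on can be read as a pure function of the error and the attempt count. The sketch below is only an illustration under stated assumptions. `CrawlerError` here is a cut-down stand-in that models just the `HttpStatus` variant named in the diff (its field types are assumed), and `next_status` is a hypothetical helper, not a function in `feeder/src/job.rs`.

```rust
/// Hypothetical, reduced stand-in for the shared crate's error type.
/// Only `HttpStatus` appears in the patch; the field types are assumed.
#[derive(Debug)]
pub enum CrawlerError {
    HttpStatus { status: u16, url: String },
    Other(String),
}

/// Decide the job status to write after a failed request.
/// A 4xx response is treated as permanent and fails immediately;
/// anything else is retried until `max_attempts` is reached.
pub fn next_status(err: &CrawlerError, attempts: u32, max_attempts: u32) -> &'static str {
    let is_permanent = matches!(
        err,
        CrawlerError::HttpStatus { status, .. } if (400..500).contains(status)
    );

    if is_permanent || attempts >= max_attempts {
        "FAILED"
    } else {
        // Back to PENDING so another feeder can pick the job up again.
        "PENDING"
    }
}
```

Deciding the status once means a single status write per failed request, which appears to be the property the fix is after: a 4xx no longer writes PENDING and then FAILED.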
.claude/settings.json | 6 ---
.gitignore | 1 +
feeder/src/job.rs | 56 +++++++++++++--------------
frontend/src/components/GraphView.tsx | 2 +-
shared/src/url_normalize.rs | 5 +--
5 files changed, 32 insertions(+), 38 deletions(-)
delete mode 100644 .claude/settings.json
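Reviewer note: a minimal, std-only sketch of the "normalize once" flow described in the commit message, for readers who want to see the map-then-retain shape in isolation. Everything below is an assumption made for illustration: `normalize_url` is a simplified stand-in for `url_normalize::normalize_url`, `same_site` is a naive suffix check that replaces the psl-backed `is_same_registered_domain`, and `normalize_once` / `keep_targeted` are hypothetical names that do not exist in the feeder.

```rust
use std::collections::HashMap;

/// Simplified stand-in for `url_normalize::normalize_url` (assumption:
/// the real function may handle paths and edge cases differently).
pub fn normalize_url(url: &str) -> (String, String) {
    let upper = url.to_uppercase();
    let (proto, rest) = if let Some(r) = upper.strip_prefix("HTTPS://") {
        ("HTTPS://", r)
    } else if let Some(r) = upper.strip_prefix("HTTP://") {
        ("HTTP://", r)
    } else {
        ("HTTP://", upper.as_str())
    };
    let name = rest.strip_prefix("WWW.").unwrap_or(rest).to_string();
    (name, proto.to_string())
}

/// Naive stand-in for the psl-backed domain check: a plain suffix match
/// on the host, with no Public Suffix List lookup.
pub fn same_site(name: &str, target: &str) -> bool {
    let host = name
        .split(|c: char| c == ':' || c == '/')
        .next()
        .unwrap_or(name);
    host == target || host.ends_with(&format!(".{}", target))
}

/// Normalize every extracted URL exactly once, keyed by protocol + name.
/// Duplicate keys keep the first entry seen.
pub fn normalize_once(urls: &[String]) -> HashMap<String, (String, String)> {
    let mut map = HashMap::new();
    for url in urls {
        let (name, proto) = normalize_url(url);
        let key = format!("{}{}", proto, name);
        map.entry(key).or_insert((name, proto));
    }
    map
}

/// Drop entries outside the target domain, reusing the stored names.
pub fn keep_targeted(map: &mut HashMap<String, (String, String)>, target: &str) {
    map.retain(|_, (name, _)| same_site(name, target));
}
```

The map keys double as the dedup set, and the stored `(name, protocol)` pairs can be looked up again after the database filter, so no URL needs to be normalized a second time.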
diff --git a/.claude/settings.json b/.claude/settings.json
deleted file mode 100644
index 6119f7c..0000000
--- a/.claude/settings.json
+++ /dev/null
@@ -1,6 +0,0 @@
-{
- "enabledPlugins": {
- "playwright-skill@playwright-skill": true,
- "skill-creator@claude-plugins-official": true
- }
-}
diff --git a/.gitignore b/.gitignore
index 3a786ce..3e56785 100644
--- a/.gitignore
+++ b/.gitignore
@@ -15,3 +15,4 @@ tests/
*.tgz
frontend/node_modules/
frontend/dist/
+.claude/
diff --git a/feeder/src/job.rs b/feeder/src/job.rs
index b5e1f93..5d619af 100644
--- a/feeder/src/job.rs
+++ b/feeder/src/job.rs
@@ -1,4 +1,4 @@
-use std::collections::HashSet;
+use std::collections::{HashMap, HashSet};
use neo4rs::{query, Graph};
@@ -116,23 +116,23 @@ async fn validate_job(
tracing::warn!("Request failed: {} -- Attempts: {} -- Error: {}", full_url, attempts, e);
- if attempts >= config.max_attempts {
- tracing::error!(
- "Failure limit reached! Giving up on {} after {} attempts.",
- full_url,
- attempts
- );
+ // 4xx errors are permanent — fail immediately without retry
+ let is_permanent = matches!(e, CrawlerError::HttpStatus { status, .. } if (400..500).contains(&status));
+
+ if is_permanent || attempts >= config.max_attempts {
+ if !is_permanent {
+ tracing::error!(
+ "Failure limit reached! Giving up on {} after {} attempts.",
+ full_url,
+ attempts
+ );
+ }
update_job_status(graph, job, "FAILED", Some(attempts)).await?;
} else {
- // Fix: reset to PENDING so other feeders can retry
+ // Reset to PENDING so other feeders can retry
update_job_status(graph, job, "PENDING", Some(attempts)).await?;
}
- // Return permanent failures (4xx) as immediate failure
- if matches!(e, CrawlerError::HttpStatus { status, .. } if (400..500).contains(&status)) {
- update_job_status(graph, job, "FAILED", Some(attempts)).await?;
- }
-
Ok(None)
}
}
@@ -288,24 +288,24 @@ pub async fn feeding(
None => return Ok(false),
};
- // Step 2: Extract URLs from HTML
+ // Step 2: Extract URLs from HTML and normalize once
let extracted_urls = crawler::extract_urls(&page_data.html);
+ let mut normalized_map: HashMap<String, (String, String)> = HashMap::new();
+ for url in &extracted_urls {
+ let (norm_name, http_type) = url_normalize::normalize_url(url);
+ let upper_key = format!("{}{}", http_type, norm_name).to_uppercase();
+ normalized_map.entry(upper_key).or_insert((norm_name, http_type));
+ }
// Step 2b: Filter by target domain when targeted
- let filtered_urls: Vec<&String> = if job.targeted && !job.target_domain.is_empty() {
- extracted_urls
- .iter()
- .filter(|u| {
- let (norm_name, _) = url_normalize::normalize_url(u);
- url_normalize::is_same_registered_domain(&norm_name, &job.target_domain)
- })
- .collect()
- } else {
- extracted_urls.iter().collect()
- };
+ if job.targeted && !job.target_domain.is_empty() {
+ normalized_map.retain(|_, (norm_name, _)| {
+ url_normalize::is_same_registered_domain(norm_name, &job.target_domain)
+ });
+ }
// Step 3: Deduplicate against existing DB nodes (server-side)
- let upper_urls: HashSet<String> = filtered_urls.iter().map(|u| u.to_uppercase()).collect();
+ let upper_urls: HashSet<String> = normalized_map.keys().cloned().collect();
let new_urls = filter_new_urls(graph, &upper_urls, &job.crawl_id).await?;
if new_urls.is_empty() {
@@ -314,10 +314,10 @@ pub async fn feeding(
return Ok(true);
}
- // Step 4: Normalize, DNS resolve in parallel, build child list
+ // Step 4: DNS resolve in parallel, build child list
let normalized: HashSet<(String, String)> = new_urls
.iter()
- .map(|u| url_normalize::normalize_url(u))
+ .filter_map(|key| normalized_map.get(key).cloned())
.collect();
let request_time = format!("{:?}", page_data.elapsed);
diff --git a/frontend/src/components/GraphView.tsx b/frontend/src/components/GraphView.tsx
index 543e513..0d6463c 100644
--- a/frontend/src/components/GraphView.tsx
+++ b/frontend/src/components/GraphView.tsx
@@ -57,7 +57,7 @@ export function GraphView({ data }: GraphViewProps) {
depth: n.depth,
status: n.status,
nodeType: n.node_type,
- val: n.node_type === "ROOT" ? 4 : n.depth === 1 ? 2.5 : n.depth === 2 ? 1.5 : 1,
+ val: { ROOT: 4, 1: 2.5, 2: 1.5 }[n.node_type === "ROOT" ? "ROOT" : n.depth] ?? 1,
// Pin root node at origin for stable centering
...(n.node_type === "ROOT" ? { fx: 0, fy: 0 } : {}),
}));
diff --git a/shared/src/url_normalize.rs b/shared/src/url_normalize.rs
index 8e01415..dfff237 100644
--- a/shared/src/url_normalize.rs
+++ b/shared/src/url_normalize.rs
@@ -1,3 +1,5 @@
+use psl::Psl;
+
/// Normalizes a URL by uppercasing, removing protocol and www prefix.
///
/// Returns (normalized_name, protocol).
@@ -21,8 +23,6 @@ pub fn normalize_url(url: &str) -> (String, String) {
(name, proto.to_string())
}
-use psl::Psl;
-
/// Extracts the registered domain (eTLD+1) from a normalized name.
///
/// The input should be an uppercase normalized name (no protocol, no `www.`).
@@ -140,7 +140,6 @@ mod tests {
assert_eq!(registered_domain("LOCALHOST"), None);
}
-
#[test]
fn test_is_same_registered_domain_match() {
assert!(is_same_registered_domain("BLOG.EXAMPLE.COM", "EXAMPLE.COM"));
From c94b460dff03f406ff2bdd43a4d8bdf15df4ea11 Mon Sep 17 00:00:00 2001
From: bluedotiya
Date: Sun, 22 Feb 2026 19:53:10 +0200
Subject: [PATCH 3/3] chore: remove CLAUDE.md file and its associated
documentation
---
CLAUDE.md | 85 -------------------------------------------------------
1 file changed, 85 deletions(-)
delete mode 100644 CLAUDE.md
diff --git a/CLAUDE.md b/CLAUDE.md
deleted file mode 100644
index c8483a2..0000000
--- a/CLAUDE.md
+++ /dev/null
@@ -1,85 +0,0 @@
-# CLAUDE.md
-
-This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
-
-## Build & Test Commands
-
-### Rust
-```bash
-cargo check --workspace # Fast compilation check
-cargo build --release # Release build (LTO + stripped)
-cargo test --workspace # Run all tests
-cargo test -p shared # Test single crate
-cargo clippy --workspace -- -D warnings # Lint (CI-strict)
-```
-
-### Frontend (from `frontend/`)
-```bash
-npm install # Install deps
-npm run dev # Dev server on :3000 (proxies /api to :8080)
-npm run build # Type-check + production build
-npm run lint # ESLint
-npm run type-check # TypeScript check only
-```
-
-### Docker (from repo root, use minikube docker-env for local k8s)
-```bash
-docker build -t ghcr.io/bluedotiya/web-crawler/manager:latest -f manager/Dockerfile .
-docker build -t ghcr.io/bluedotiya/web-crawler/feeder:latest -f feeder/Dockerfile .
-docker build -t ghcr.io/bluedotiya/web-crawler/frontend:latest -f frontend/Dockerfile .
-```
-
-## Architecture
-
-Three services communicate through a shared Neo4j database (no direct inter-service HTTP):
-
-- **manager** — Axum HTTP server (port 8080). REST API at `/api/v1/crawls/*` + WebSocket for live progress. Creates ROOT nodes and initial URL children when a crawl is submitted.
-- **feeder** — Background workers (8 replicas). Poll Neo4j for PENDING URLs, fetch HTML, extract links, create child nodes. Atomic job claiming prevents worker conflicts.
-- **frontend** — React SPA (Vite/TypeScript/Tailwind). Served by nginx in production, proxied via Vite in dev. Uses React Query for polling and WebSocket for real-time updates.
-- **shared** — Rust library crate used by both manager and feeder. Contains: crawler (HTTP fetch + URL extraction), dns (resolution with iterative domain shortening), neo4j_client, url_normalize, schema (indexes/constraints), error types.
-
-### Data Flow
-1. User submits URL + depth (1-5) via frontend → POST `/api/v1/crawls`
-2. Manager normalizes URL, resolves DNS, creates ROOT + child URL nodes in Neo4j
-3. Feeder workers atomically claim PENDING URLs, fetch HTML, extract/deduplicate links, create children
-4. Frontend polls progress via REST (5s) or WebSocket (2s), displays force-graph visualization
-
-### Neo4j Data Model
-- **ROOT** node (one per crawl, unique on `crawl_id`) — the seed URL
-- **URL** nodes — discovered links with `job_status` (PENDING/IN-PROGRESS/COMPLETED/FAILED/CANCELLED)
-- **Lead** edges — parent → child link relationships
-- All nodes scoped by `crawl_id` for isolation between crawls
-
-## Key Conventions
-
-- **Conventional commits** required on PR titles: `feat:`, `fix:`, `chore:`, etc. (enforced by CI). Breaking changes use `!` suffix (e.g., `feat!:`). Drives automated semver + per-service tagging.
-- **Pre-commit hooks**: `cargo check`, `cargo clippy -D warnings`, `cargo test`, frontend lint+typecheck. Install: `pip install pre-commit && pre-commit install`
-- **Workspace dependency gotcha**: `default-features = false` in `[workspace.dependencies]` is ignored by Cargo. Each member crate must set it explicitly.
-- **TLS in containers**: Use `rustls-tls-webpki-roots` (bundles CAs in binary). Avoid `native-tls` or `native-roots` in slim Docker images.
-- **HTTP clients** in both feeder and manager must set `.user_agent(...)` to avoid 403 responses.
-- **TypeScript**: Strict mode enabled, no unused locals/parameters. Path alias `@/` → `./src/`.
-- **Docker images** must use full GHCR path (`ghcr.io/bluedotiya/web-crawler/{service}:tag`) to match k8s deployment specs.
-
-## API Routes (manager)
-
-| Method | Endpoint | Purpose |
-|--------|----------|---------|
-| POST | `/api/v1/crawls` | Create new crawl |
-| GET | `/api/v1/crawls` | List crawls (filter/pagination) |
-| GET | `/api/v1/crawls/{id}` | Get crawl progress |
-| DELETE | `/api/v1/crawls/{id}` | Cancel crawl |
-| GET | `/api/v1/crawls/{id}/graph` | Graph data (nodes + edges) |
-| GET | `/api/v1/crawls/{id}/stats` | Crawl statistics |
-| GET | `/api/v1/crawls/{id}/ws` | WebSocket for live updates |
-| GET | `/livez`, `/readyz` | Health probes |
-
-## Project Layout
-
-```
-shared/src/ → lib.rs, crawler.rs, dns.rs, neo4j_client.rs, url_normalize.rs, schema.rs, error.rs
-manager/src/ → main.rs, config.rs, routes/{crawl,status,graph,ws}.rs, services/{crawl,graph}_service.rs
-feeder/src/ → main.rs, config.rs, job.rs
-frontend/src/ → App.tsx, pages/{Dashboard,CrawlList,CrawlDetail,NewCrawl}.tsx, components/GraphView.tsx, lib/api.ts, hooks/useWebSocket.ts
-web-crawler/ → Helm parent chart (neo4j, manager, feeder, frontend subcharts)
-docs/ → architecture.md, api-reference.md, neo4j-graph-model.md, deployment.md, development.md
-```