This docs/ directory contains structured documentation for the crawler project. The documentation mirrors the source tree under crawler/, excluding the top-level crawler directory itself.
At a high level, the repository layout is:
Concurrent-Web-Crawler-with-Pluggable-Pipelines/
  crawler/ # Go module (CLI + Web UI + internal packages)
    cmd/
      crawler/ # CLI entrypoint
      webui/ # Web UI entrypoint (also hosts JSON API)
      api/ # Standalone JSON API entrypoint
    internal/
      crawler/ # Core crawler orchestration
      pipeline/ # Pluggable pipeline stages and rate limiting
      shared/ # Shared types such as Item, UseCase, CrawlStats, ModeSummary
      service/ # CrawlService abstraction over the core crawler
      httpapi/ # HTTP handlers exposing the JSON API
      store/ # File-backed persistence for crawl summaries
  docs/ # This documentation tree
  manual/ # How to build and run the project
  WHY_the_PROJECT/ # High-level motivation and use-case explanations
  Problems-and-Solutions/
  QnA/
  code_fixing/
  .github/workflows/ # CI configuration for build + tests
  docker-compose.yml # Compose file to run Web UI and API in containers
- ARCHITECTURE – overall system structure and module interactions.
- DATA_FLOW – conceptual end-to-end data flow and pipeline stages.
- internal/service – describes the CrawlService used by HTTP handlers and other integrations.
- internal/httpapi – documents the JSON API endpoints (for example POST /api/crawls, GET /api/crawls/history).
- internal/store – explains how crawl summaries are written to and read from data/crawls.jsonl.
- WHY_the_PROJECT/README.md – motivation, problems solved, and beginner/advanced perspectives.
- Developer's_Manual/README.md – how to use the crawler as a service from stacks like Next.js + Node.js + Postgres/MySQL/MongoDB.
- go.mod
- cmd/crawler/main.go
- internal/crawler/crawler.go
- internal/crawler/item.go
- internal/crawler/scheduler.go
- internal/crawler/work.go
- internal/pipeline/fetch.go
- internal/pipeline/filter.go
- internal/pipeline/parse.go
- internal/pipeline/store.go
- internal/pipeline/discover.go
- internal/pipeline/interfaces.go
- internal/pipeline/limiter.go
- Use the links above to jump to documentation for a specific source file.
- Each file-level document follows a consistent structure:
  - Overview
  - File Location
  - Key Components
  - Execution Flow
  - Data Flow
  - Mermaid Diagrams
  - Error Handling & Edge Cases
  - Example Usage
- Where a source file is currently empty, the documentation explicitly notes that and only describes the intended role implied by its name and placement.
- Some files may remain empty/legacy after refactors (for example internal/crawler/item.go); in those cases the docs point you at the new canonical type/location.