Crawl Orchestration With Queues

How CragData queues discovery and crawl jobs, handles retries, and delivers graph updates, so you never have to run workers yourself.

  • distributed-systems
  • crawling

Production crawls are not a simple loop like for url in urls: fetch(url). They are queued systems with backpressure, retries, and delivery guarantees.

Enqueue

POST /v1/crawl
POST /v1/discover

Both endpoints return a job_id immediately. The heavy work happens asynchronously.
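
As a rough illustration of the enqueue step, here is a minimal Python sketch using only the standard library. The base URL, the Bearer auth header, the request payload, and the shape of the JSON response are all assumptions made for the example; only the endpoint paths and the job_id name come from the text above.

```python
import json
import urllib.request

BASE_URL = "https://api.example.com"  # placeholder host, not the real CragData host


def build_enqueue_request(endpoint: str, payload: dict, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/crawl or /v1/discover."""
    return urllib.request.Request(
        url=f"{BASE_URL}{endpoint}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
        method="POST",
    )


def enqueue(endpoint: str, payload: dict, api_key: str) -> str:
    """Send the request and return the job_id from the JSON response."""
    req = build_enqueue_request(endpoint, payload, api_key)
    with urllib.request.urlopen(req, timeout=10) as resp:
        body = json.load(resp)
    return body["job_id"]  # assumed response shape: {"job_id": "..."}
```

The caller would store the returned job_id and move on; nothing in this sketch waits for the crawl itself to finish.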

Workers

  • Concurrency caps per account
  • robots.txt respect
  • Anti-bot backoff on 429/503
  • scrapable flags in graph responses
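
To make "backoff on 429/503" concrete, the sketch below shows one common approach: exponential backoff with full jitter. It is not CragData's actual worker code. The function names, the base and cap values, and the fetch callable returning a (status, body) pair are hypothetical choices for the example.

```python
import random
import time

RETRYABLE = {429, 503}  # the status codes named in the list above


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0, rng=random.random) -> float:
    """Full-jitter delay: a random value in [0, min(cap, base * 2**attempt)]."""
    return rng() * min(cap, base * (2 ** attempt))


def fetch_with_backoff(fetch, url: str, max_attempts: int = 5, sleep=time.sleep):
    """Call fetch(url) -> (status, body), backing off while the status is retryable."""
    for attempt in range(max_attempts):
        status, body = fetch(url)
        # Return on success, on a non-retryable error, or when attempts run out.
        if status not in RETRYABLE or attempt == max_attempts - 1:
            return status, body
        sleep(backoff_delay(attempt))
    raise ValueError("max_attempts must be at least 1")
```

The jitter spreads retries out so that many workers hitting the same rate-limited host do not all retry at the same moment.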

Observe

  • GET /crawl/{job_id} for progress
  • GET /jobs for history
  • Webhooks: crawl.completed, page.extracted
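
On the receiving side, a webhook consumer can be as small as a dispatcher keyed on the event name. The sketch below assumes the webhook body is JSON with a "type" field holding the event name; that payload shape and the handler functions are hypothetical. Only the event names crawl.completed and page.extracted come from the list above.

```python
import json


def handle_webhook(raw_body: bytes, handlers: dict) -> bool:
    """Route a webhook body to the handler registered for its event type.

    Returns True if a handler ran, False if the event type was not registered.
    """
    event = json.loads(raw_body)
    handler = handlers.get(event.get("type"))  # "type" is an assumed field name
    if handler is None:
        return False
    handler(event)
    return True


# Example wiring: collect events in lists instead of doing real work.
completed, extracted = [], []
HANDLERS = {
    "crawl.completed": completed.append,
    "page.extracted": extracted.append,
}
```

Returning False for unknown event types lets the HTTP layer acknowledge the delivery without failing when new event types appear.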

When the queue is full

The API returns HTTP 409. Finish or cancel the current job before starting another, unless your plan allows parallel jobs.
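
One way a client might react to the 409 is to wait and resubmit. In the sketch below, submit is a hypothetical callable that performs the enqueue request and returns a (status, job_id) pair. The polling interval and the maximum wait are arbitrary example values, not documented limits.

```python
import time


def enqueue_when_free(submit, poll_interval: float = 5.0, max_wait: float = 300.0, sleep=time.sleep):
    """Call submit() until it stops returning 409 or max_wait seconds have passed.

    submit() is expected to return (status_code, job_id_or_None).
    """
    waited = 0.0
    while True:
        status, job_id = submit()
        if status != 409:
            return status, job_id  # accepted, or failed for some other reason
        if waited >= max_wait:
            raise TimeoutError("a job is still running; gave up waiting to enqueue")
        sleep(poll_interval)
        waited += poll_interval
```

In practice you would usually cancel the running job or wait for its crawl.completed webhook rather than poll blindly; the loop is only the simplest fallback.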

See Queues & retries in the docs.