API docs

Fresh structured web data for AI systems

This is the primary reference for CragData—how to plan sources with link graphs, run distributed crawls, extract AI-ready JSON, and deliver live data to agents and RAG. All routes are under /v1. Get an API key from your dashboard.

CragData is not a scraper script. It is the living data layer between the open web and your agents, RAG stack, and automations.

Datasets go stale

Training snapshots and one-off exports miss today's pricing, partners, and policy pages. Production agents need crawl-time freshness.

LLMs hallucinate without grounding

Structured JSON + link graphs + context_for_ai summaries give models citeable sources—not vibes from outdated corpora.

RAG breaks without source planning

Embedding the whole web is expensive. Niche graphs rank which domains and URLs to read before you scrape and embed.

Quickstart

  1. Create a free account — no credit card required.
  2. Open API keys and generate a key. Copy it once; we only store a hash.
  3. Call the API with the header below. The Free plan includes 500 calls/month, plus read endpoints (/graph, /jobs, etc.) with smaller caps; see plan limits or GET /me. Export (POST /export) requires Developer or higher.

1 — Scrape a page

curl -X POST https://api.cragdata.com/v1/scrape \
  -H "Authorization: Bearer ck_live_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

2 — Start a crawl

curl -X POST https://api.cragdata.com/v1/crawl \
  -H "Authorization: Bearer ck_live_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "depth": 2, "max_pages": 50}'

3 — Discover domains

curl -X POST https://api.cragdata.com/v1/discover \
  -H "Authorization: Bearer ck_live_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "max_domains": 50}'

Poll any job with GET /crawl/{job_id}. Export graph or scrapes with POST /export on Developer+ — see endpoints.
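
The polling step could be sketched as below. This is a minimal illustration, not an official client: `wait_for_job` and `fetch_status` are hypothetical names, the interval and timeout values are arbitrary, and the terminal statuses are assumed to be the three listed in the Webhooks section (completed, failed, canceled).

```python
import time
from typing import Callable

# Assumption: these are the only terminal states; anything else means "still running".
TERMINAL = {"completed", "failed", "canceled"}

def wait_for_job(fetch_status: Callable[[], dict], interval: float = 2.0,
                 timeout: float = 300.0,
                 sleep: Callable[[float], None] = time.sleep) -> dict:
    """Call fetch_status() until the job is terminal or the timeout elapses."""
    waited = 0.0
    while True:
        job = fetch_status()
        if job.get("status") in TERMINAL:
            return job
        if waited >= timeout:
            raise TimeoutError(f"job still {job.get('status')!r} after {waited:.0f}s")
        sleep(interval)
        waited += interval
```

With httpx, `fetch_status` might be `lambda: httpx.get(f"{API}/crawl/{job_id}", headers=H).json()`. Injecting it keeps the loop independent of the HTTP client and easy to test.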

Check availability: https://api.cragdata.com/v1/health. See also our terms of service and privacy policy.

API playground

Explore response shapes for graph, queue, scrape, and entity extraction—no API key required. Use the same flows with your key for live data.

Sample responses for exploration; use your API key for live data.

{
  "seed_domain": "stripe.com",
  "context_for_ai": "Niche graph for stripe.com (depth 2). 8 strong referrers, 12 outbound destinations, 5 related domains. Use top_inbound for authority sources; top_outbound for integrations.",
  "top_inbound_domains": [
    {
      "domain": "ycombinator.com",
      "link_count": 14,
      "niche_score": 1,
      "scrapable": true
    },
    {
      "domain": "techcrunch.com",
      "link_count": 9,
      "niche_score": 0.82,
      "scrapable": true
    }
  ],
  "top_outbound_domains": [
    {
      "domain": "docs.stripe.com",
      "link_count": 42,
      "niche_score": 1,
      "scrapable": true
    },
    {
      "domain": "github.com",
      "link_count": 11,
      "niche_score": 0.71,
      "scrapable": true
    }
  ],
  "cached": false
}

Agents playbook

Recommended order for LLM agents, copilots, and RAG pipelines. Copy patterns into system prompts or tool definitions.

  1. Plan: Get context_for_ai + ranked domains before any search tool.
    GET https://api.cragdata.com/v1/graph/domain-context?seed=YOUR_DOMAIN
  2. Select URLs: Pick scrapable pages by in-graph inlinks, not sitemap guesses.
    GET https://api.cragdata.com/v1/graph/top-pages?domain=PARTNER_DOMAIN&limit=10
  3. Extract: Structured JSON for chunking; same schema every page.
    POST https://api.cragdata.com/v1/scrape
  4. Embed & answer: Cite url + scraped_at in the final response for trust.
    Your vector DB + LLM

Tooling integrations

  • LangChain / CrewAI — wrap domain-context as a tool (see /llms.txt)
  • OpenAI function calling — return context_for_ai string only
  • Cron — Dashboard → Schedules for freshness without manual runs
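
One way to wrap domain-context as a function-calling tool is sketched below. The schema follows the OpenAI Chat Completions "tools" layout. The tool name, its description, and the `handle_tool_call` / `fetch_context` helpers are illustrative assumptions, not part of the CragData API.

```python
import json
from typing import Callable

# Tool definition in the OpenAI "tools" format. Name and description are
# hypothetical choices made for this sketch.
DOMAIN_CONTEXT_TOOL = {
    "type": "function",
    "function": {
        "name": "cragdata_domain_context",
        "description": "Return a source plan (context_for_ai) for a seed domain.",
        "parameters": {
            "type": "object",
            "properties": {
                "seed": {"type": "string", "description": "Seed domain, e.g. stripe.com"},
            },
            "required": ["seed"],
        },
    },
}

def handle_tool_call(arguments_json: str, fetch_context: Callable[[str], dict]) -> str:
    """Run the tool and hand back only the context_for_ai string to the model."""
    seed = json.loads(arguments_json)["seed"]
    return fetch_context(seed)["context_for_ai"]
```

Here `fetch_context` stands in for a call to GET /graph/domain-context. Returning only the string keeps the tool result short, in line with the function-calling bullet above.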

AI context graph

Live niche graph so your AI knows where to look on the web — before RAG.

  1. GET /graph/domain-context?seed=indiehackers.com: if the seed is not indexed yet, its homepage is fetched automatically (the homepage must return HTTP 200).
  2. Inbound/outbound domains + cluster scores in one response.
  3. GET /graph/top-pages per domain — scrapable pages ranked by in-degree.
  4. POST /scrape on top URLs, then embed for your mini-RAG.

Example — niche context for a seed

curl "https://api.cragdata.com/v1/graph/domain-context?seed=indiehackers.com&depth_hops=2" \
  -H "Authorization: Bearer ck_live_YOUR_KEY"

Response shape (abbreviated)

{
  "seed_domain": "indiehackers.com",
  "context_for_ai": "Niche graph for indiehackers.com ...",
  "top_inbound_domains": [
    { "domain": "news.ycombinator.com", "link_count": 12, "niche_score": 1.0, "scrapable": true, "summary": "..." }
  ],
  "top_outbound_domains": [...],
  "related_domains": [...]
}

Each domain includes summary and scrapable so agents can skip dead or blocked pages. Pair with GET /graph/top-pages and POST /scrape for a mini-RAG pipeline. See endpoints for full parameters.
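
Skipping dead or blocked domains could look like the sketch below, which works on the response shape shown above. `pick_domains` is a hypothetical helper, and the 0.5 score threshold and the limit of 5 are arbitrary assumptions.

```python
def pick_domains(ctx: dict, min_score: float = 0.5, limit: int = 5) -> list[str]:
    """Return scrapable inbound/outbound domains, highest niche_score first."""
    candidates = ctx.get("top_inbound_domains", []) + ctx.get("top_outbound_domains", [])
    best: dict[str, float] = {}
    for d in candidates:
        # Drop anything flagged unscrapable or below the (assumed) score threshold.
        if d.get("scrapable") and d.get("niche_score", 0) >= min_score:
            best[d["domain"]] = max(best.get(d["domain"], 0), d["niche_score"])
    ranked = sorted(best, key=lambda name: best[name], reverse=True)
    return ranked[:limit]
```

The resulting list can feed GET /graph/top-pages, one domain at a time.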

Crawl lifecycle

Every discover, crawl, and scrape job moves through the same managed pipeline—queues, workers, graph writes, and delivery.

  1. Enqueue

    POST /crawl or /discover returns job_id immediately. Job enters the account queue (409 if queue full).

  2. Fetch

    Workers pull URLs with concurrency limits, respect robots.txt, and apply anti-bot retries with backoff.

  3. Parse & graph

    HTML becomes nodes (pages) and edges (links). Domains aggregate for /graph endpoints.

  4. Extract

    POST /scrape strips boilerplate into content[] blocks, metadata, and optional entities.

  5. Deliver

    Poll GET /crawl/{job_id}, subscribe via webhooks, or export JSONL/Parquet to your warehouse.

Queues & retries

Queue behavior

  • Per-account crawl queue — one active heavy job unless plan allows more
  • Discover jobs run with max_domains caps per plan
  • GET /jobs lists history with status filters

Retries & anti-bot

  • Transient HTTP failures retry with exponential backoff
  • 429/503 from targets respect Retry-After when present
  • Non-retryable 4xx marked scrapable: false in graph responses

Rate limits

  • Plan-based req/s — see X-RateLimit-Limit header
  • Monthly credits — X-Credits-Remaining on every billable call
  • Read endpoints (/graph, /jobs) have separate caps on Free
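
A client can pace itself from these headers. The sketch below assumes both headers carry plain numeric strings, and the helper names, the fallback rate, and the credit threshold are made up for illustration.

```python
def min_interval(headers: dict, default_rps: float = 1.0) -> float:
    """Seconds to wait between requests, derived from X-RateLimit-Limit (req/s)."""
    try:
        rps = float(headers.get("X-RateLimit-Limit", default_rps))
    except ValueError:
        rps = default_rps
    return 1.0 / rps if rps > 0 else 1.0 / default_rps

def credits_low(headers: dict, threshold: int = 100) -> bool:
    """True when X-Credits-Remaining is present and below the threshold."""
    raw = headers.get("X-Credits-Remaining")
    return raw is not None and int(raw) < threshold
```

With httpx, `r.headers` can be passed directly because its lookups are case-insensitive. A plain dict needs the exact header casing used here.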

Examples

Copy-paste flows for common production setups.

Agent RAG — niche graph first

Full flow for a research agent with one seed domain.

import os, httpx

API = "https://api.cragdata.com/v1"
H = {"Authorization": f"Bearer {os.environ['CRAGDATA_API_KEY']}"}

seed = "stripe.com"
ctx = httpx.get(f"{API}/graph/domain-context", params={"seed": seed}, headers=H, timeout=30).json()
system = f"Sources plan: {ctx['context_for_ai']}"

domain = ctx["top_inbound_domains"][0]["domain"]
pages = httpx.get(f"{API}/graph/top-pages", params={"domain": domain, "limit": 5}, headers=H).json()

docs = []
for p in pages["pages"][:3]:
    docs.append(httpx.post(f"{API}/scrape", json={"url": p["url"]}, headers=H).json())

# → chunk docs[*]["content"] → embed → answer

Website monitoring with webhooks

Schedule crawls and react on page.extracted.

# Dashboard → Webhooks: https://your.app/hooks/crag
# Events: crawl.completed, page.extracted

# Schedule: Dashboard → Schedules (cron)
# Or POST /crawl on demand

# Verify signature:
# X-Crag-Signature = HMAC-SHA256(body, webhook_secret)

Export graph to warehouse

Developer+ JSONL export for dbt / Snowflake / BigQuery.

curl -X POST https://api.cragdata.com/v1/export \
  -H "Authorization: Bearer ck_live_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"format": "jsonl", "scope": "graph"}'

Market map from seeds

Discover linked domains then crawl the cluster.

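import os, httpx

API = "https://api.cragdata.com/v1"
H = {"Authorization": f"Bearer {os.environ['CRAGDATA_API_KEY']}"}

# Queue a discover job from the seed URL
discover = httpx.post(
    f"{API}/discover",
    headers=H,
    json={"url": "https://example.com", "max_domains": 100},
).json()

# Check the job status once (repeat until it finishes), then read the domain-level graph
job = httpx.get(f"{API}/crawl/{discover['job_id']}", headers=H).json()
graph = httpx.get(f"{API}/graph", params={"format": "domains"}, headers=H).json()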

Authentication

Send your API key on every request using the Authorization header:

Authorization: Bearer ck_live_xxxxxxxx

Keys are tied to your account and plan. Revoke compromised keys in the dashboard; usage is logged per key.
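
A small helper can keep the key out of source code. This sketch uses only the standard library and builds the request without sending it. `authed_request` is a hypothetical name, and the `CRAGDATA_API_KEY` environment variable follows the Examples section rather than any documented requirement.

```python
import os
import urllib.request
from typing import Optional

API = "https://api.cragdata.com/v1"

def authed_request(path: str, api_key: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a request carrying the Bearer header."""
    key = api_key or os.environ.get("CRAGDATA_API_KEY")  # env var name is an assumption
    if not key:
        raise RuntimeError("set CRAGDATA_API_KEY or pass api_key")
    return urllib.request.Request(f"{API}{path}", headers={"Authorization": f"Bearer {key}"})
```

Passing the result to `urllib.request.urlopen` would send it. The same header dict works unchanged with httpx.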

Endpoints

Base URL: https://api.cragdata.com/v1

GET /health

Health check

Verify API availability and version. No authentication required.

Example response

{ "ok": true, "version": "1.0.0", "service": "cragcrawler" }

POST /crawl

Start crawl job

Queue a crawl from a seed URL. Debits credits on your plan.

Request body

{
  "url": "https://example.com",
  "depth": 2,
  "max_pages": 100
}

curl

curl -X POST https://api.cragdata.com/v1/crawl \
  -H "Authorization: Bearer ck_live_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "depth": 2, "max_pages": 50}'

Python (httpx)

import httpx

r = httpx.post(
    "https://api.cragdata.com/v1/crawl",
    headers={"Authorization": "Bearer ck_live_YOUR_KEY"},
    json={"url": "https://example.com", "depth": 2, "max_pages": 50},
)
print(r.json())

JavaScript (fetch)

const res = await fetch("https://api.cragdata.com/v1/crawl", {
  method: "POST",
  headers: {
    Authorization: "Bearer ck_live_YOUR_KEY",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    url: "https://example.com",
    depth: 2,
    max_pages: 50,
  }),
});
console.log(await res.json());

Example response

{ "job_id": "job_abc123", "status": "queued" }

POST /discover

Discover domains

Queue a discover job from seed URL(s). Maps new root domains linked from seeds (plan limits apply). Poll with GET /crawl/{job_id}.

Request body

{
  "url": "https://vercel.com",
  "urls": ["https://vercel.com", "https://nextjs.org"],
  "max_domains": 200
}

curl

curl -X POST https://api.cragdata.com/v1/discover \
  -H "Authorization: Bearer ck_live_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://vercel.com", "max_domains": 50}'

Python (httpx)

import httpx

r = httpx.post(
    "https://api.cragdata.com/v1/discover",
    headers={"Authorization": "Bearer ck_live_YOUR_KEY"},
    json={"url": "https://vercel.com", "max_domains": 50},
)
print(r.json())

JavaScript (fetch)

const res = await fetch("https://api.cragdata.com/v1/discover", {
  method: "POST",
  headers: {
    Authorization: "Bearer ck_live_YOUR_KEY",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({ url: "https://vercel.com", max_domains: 50 }),
});
console.log(await res.json());

Example response

{ "job_id": "job_abc123", "status": "queued", "max_domains": 200 }

POST /export

Export data

Export your graph or scrapes as JSONL or Parquet. Requires Developer plan or higher. Returns a server path (and optional s3_url if the API host has S3 configured).

Request body

{ "format": "jsonl", "scope": "graph" }

curl

curl -X POST https://api.cragdata.com/v1/export \
  -H "Authorization: Bearer ck_live_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"format": "jsonl", "scope": "scrapes"}'

Python (httpx)

import httpx

r = httpx.post(
    "https://api.cragdata.com/v1/export",
    headers={"Authorization": "Bearer ck_live_YOUR_KEY"},
    json={"format": "jsonl", "scope": "graph"},
)
print(r.json())

JavaScript (fetch)

const res = await fetch("https://api.cragdata.com/v1/export", {
  method: "POST",
  headers: {
    Authorization: "Bearer ck_live_YOUR_KEY",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({ format: "jsonl", scope: "graph" }),
});
console.log(await res.json());

Example response

{ "export_id": "exp_...", "format": "jsonl", "path": "/tmp/...", "s3_url": null }

GET /crawl/{job_id}

Crawl job status

Poll job status, pages crawled, and graph nodes/edges.

curl

curl "https://api.cragdata.com/v1/crawl/job_abc123" \
  -H "Authorization: Bearer ck_live_YOUR_KEY"

Python (httpx)

import httpx

r = httpx.get(
    "https://api.cragdata.com/v1/crawl/job_abc123",
    headers={"Authorization": "Bearer ck_live_YOUR_KEY"},
)
print(r.json())

JavaScript (fetch)

const res = await fetch("https://api.cragdata.com/v1/crawl/job_abc123", {
  headers: { Authorization: "Bearer ck_live_YOUR_KEY" },
});
console.log(await res.json());

Example response

{
  "status": "completed",
  "pages_crawled": 42,
  "nodes": [...],
  "edges": [...]
}

POST /scrape

Scrape page

Extract title, content, and optional structured JSON from one URL.

Request body

{
  "url": "https://example.com/about",
  "extract_json": true
}

curl

curl -X POST https://api.cragdata.com/v1/scrape \
  -H "Authorization: Bearer ck_live_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

Python (httpx)

import httpx

r = httpx.post(
    "https://api.cragdata.com/v1/scrape",
    headers={"Authorization": "Bearer ck_live_YOUR_KEY"},
    json={"url": "https://example.com", "extract_json": True},
)
print(r.json())

JavaScript (fetch)

const res = await fetch("https://api.cragdata.com/v1/scrape", {
  method: "POST",
  headers: {
    Authorization: "Bearer ck_live_YOUR_KEY",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({ url: "https://example.com", extract_json: true }),
});
console.log(await res.json());

Example response

{
  "url": "https://example.com/about",
  "title": "About",
  "content": "...",
  "json_data": { ... },
  "scraped_at": "2026-05-22T12:00:00Z"
}

GET /scrapes

List scrapes

Paginated scrapes filtered by domain.

Query

?domain=example.com&limit=50&offset=0

Example response

{ "items": [...], "total": 128 }
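
Walking all pages of a list endpoint might look like the sketch below. It assumes `total` counts every matching item and that `offset` counts items rather than pages. `iter_items` and `fetch_page` are hypothetical names.

```python
from typing import Callable, Iterator

def iter_items(fetch_page: Callable[[int, int], dict], limit: int = 50) -> Iterator[dict]:
    """Yield every item, advancing offset until total is reached or a page is empty."""
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        items = page.get("items", [])
        yield from items
        offset += len(items)
        # Stop on an empty page too, so a wrong or missing total cannot loop forever.
        if not items or offset >= page.get("total", 0):
            break
```

`fetch_page(limit, offset)` would wrap a GET /scrapes call with those query parameters. The same loop fits GET /jobs, which shares the items/total shape.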

GET /me

Account & limits

Your plan, monthly quota, read limits, rate limit, and enabled endpoints.

Example response

{
  "plan": "developer",
  "monthly_quota": 10000,
  "credits_remaining": 8420,
  "rate_limit_per_sec": 5,
  "api_keys_limit": 3,
  "discover_domains_per_run": 200,
  "read_limits": { "graph": 300, "scrapes": 50, "jobs": 30, "nodes": 200 },
  "endpoints": {
    "crawl": true,
    "discover": true,
    "scrape": true,
    "export": true,
    "jobs": true,
    "graph_pages": true,
    "graph_domains": true,
    "graph_stats": true,
    "graph_domain_context": true,
    "graph_top_pages": true,
    "graph_hops": true,
    "scrapes": true,
    "nodes": true
  }
}
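
Checking the endpoints map before calling a gated route avoids a 403 round trip. The sketch below assumes disabled endpoints appear as false or are absent. `missing_endpoints` is a hypothetical helper.

```python
def missing_endpoints(me: dict, needed: list[str]) -> list[str]:
    """Return the names from `needed` that the /me response does not mark as enabled."""
    enabled = me.get("endpoints", {})
    return [name for name in needed if not enabled.get(name, False)]
```

For example, an export job could be skipped when `missing_endpoints(me, ["export"])` is non-empty.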

GET /graph/domain-context

AI niche context (domain)

Primary endpoint for agents: given a seed domain, returns inbound/outbound/related domains with niche_score, scrapable flags, and LLM-ready summaries. Uses your indexed graph when present; otherwise fetches the homepage, parses links in HTML (fast — not a full BFS crawl), and saves nodes/edges + a response snapshot. Set auto_acquire=false to only read the DB.

Query

?seed=indiehackers.com&depth_hops=2&cluster_limit=20

curl

curl "https://api.cragdata.com/v1/graph/domain-context?seed=indiehackers.com&depth_hops=2" \
  -H "Authorization: Bearer ck_live_YOUR_KEY"

Python (httpx)

import httpx

r = httpx.get(
    "https://api.cragdata.com/v1/graph/domain-context",
    headers={"Authorization": "Bearer ck_live_YOUR_KEY"},
    params={"seed": "indiehackers.com", "depth_hops": 2},
)
print(r.json()["context_for_ai"])

JavaScript (fetch)

const res = await fetch(
  "https://api.cragdata.com/v1/graph/domain-context?seed=indiehackers.com&depth_hops=2",
  { headers: { Authorization: "Bearer ck_live_YOUR_KEY" } },
);
const ctx = await res.json();
console.log(ctx.context_for_ai, ctx.top_inbound_domains);

Example response

{
  "seed_domain": "indiehackers.com",
  "context_for_ai": "Niche graph for indiehackers.com ...",
  "top_inbound_domains": [
    { "domain": "news.ycombinator.com", "link_count": 12, "niche_score": 1.0, "scrapable": true, "summary": "..." }
  ],
  "top_outbound_domains": [...],
  "related_domains": [...]
}

POST /graph/domain-context

AI niche context (JSON body)

Same as GET; use POST when passing seed_domain in the body from agent tool calls.

Request body

{ "seed_domain": "indiehackers.com", "depth_hops": 2, "cluster_limit": 20 }

Example response

{
  "seed_domain": "indiehackers.com",
  "context_for_ai": "Niche graph for indiehackers.com ...",
  "top_inbound_domains": [
    { "domain": "news.ycombinator.com", "link_count": 12, "niche_score": 1.0, "scrapable": true, "summary": "..." }
  ],
  "top_outbound_domains": [...],
  "related_domains": [...]
}

GET /graph/top-pages

Top pages for RAG

Pages inside a domain ranked by in-degree and depth. By default scrapable_only=true, so only pages that returned HTTP 200 or were already scraped are listed.

Query

?domain=indiehackers.com&limit=20&scrapable_only=true

Example response

{ "domain": "indiehackers.com", "pages": [{ "url": "...", "inlinks": 8, "scrapable": true, "priority_score": 4.2, "summary": "..." }] }

GET /graph/hops

Domain path (hops)

Shortest link path between two domains in your indexed graph — useful to explain how niche players connect.

Query

?from_domain=indiehackers.com&to_domain=stripe.com&max_hops=6

Example response

{ "found": true, "hops": 2, "path": ["indiehackers.com", "..."], "context_for_ai": "..." }

GET /stats

Public stats

Marketing aggregates (pages crawled, domains discovered). No authentication.

Example response

{ "pages_crawled": 1200000, "domains_discovered": 120000, "edges": 1400000, "scrapes": 95000, "pages_200": 980000, "source": "live" }

GET /jobs

List crawl jobs

Paginated crawl jobs for your account (newest first).

Query

?status=completed&limit=20&offset=0

Example response

{ "items": [{ "job_id": "...", "status": "completed" }], "total": 5 }

GET /scrape/data

Stored scrape

Full content for a URL already scraped via POST /scrape.

Query

?url=https://example.com/about

Example response

{ "url": "...", "content": [...], "links": [...] }

GET /nodes

Discovered pages

Pages indexed by crawls or scrapes, filterable by domain.

Query

?domain=example.com&limit=100

Example response

{ "items": [...], "total": 240 }

GET /graph

Link graph

Page-level graph (default) or domain-level aggregation.

Query

?domain=example.com&format=pages|domains&limit=300

Example response

{ "format": "pages", "nodes": [...], "edges": [...] }

GET /graph/stats

Graph stats

Counts of pages, scrapes, edges, and domains in your workspace.

Query

?domain=example.com

Example response

{ "pages": 120, "edges": 890, "domains": 4 }

Webhooks

Configure HTTPS endpoints in Dashboard → Webhooks. We POST JSON when events fire; each delivery includes a signature you can verify with your webhook secret.

Events

  • crawl.completed: Fired when a POST /crawl job finishes (completed, failed, or canceled).
  • discover.completed: Fired when a POST /discover job finishes.
  • page.extracted: Fired after a successful POST /scrape (HTTP 200).

Request headers

  • X-Crag-Event — event name (same as JSON event)
  • X-Crag-Signature — HMAC-SHA256 of the raw body using your webhook secret
  • X-Crag-Webhook-Id — webhook configuration id

Payload

{
  "event": "crawl.completed",
  "created_at": 1716400000.123,
  "data": {
    "job_id": "job_abc123",
    "status": "completed",
    "pages_crawled": 42
  }
}

Only https:// URLs are accepted. Failed deliveries are logged server-side; retry policy may evolve — treat webhooks as at-least-once.
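
A receiver could verify and de-duplicate deliveries roughly as follows. The docs do not state how the signature is encoded, so the sketch assumes a hex digest. The de-duplication key (event, created_at, job_id) is also an assumption, since no delivery id is documented. The function names are hypothetical.

```python
import hashlib
import hmac
import json

def verify_signature(raw_body: bytes, signature: str, secret: str) -> bool:
    """Compare X-Crag-Signature with HMAC-SHA256 of the raw body (hex assumed)."""
    expected = hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)

_seen: set = set()  # in-memory only; a real service would persist this

def handle_delivery(raw_body: bytes, signature: str, secret: str) -> str:
    """Reject bad signatures and skip repeats, since delivery is at-least-once."""
    if not verify_signature(raw_body, signature, secret):
        return "rejected"
    payload = json.loads(raw_body)
    key = (payload["event"], payload["created_at"], payload.get("data", {}).get("job_id"))
    if key in _seen:
        return "duplicate"
    _seen.add(key)
    return "processed"
```

The signature must be computed over the raw request bytes, before any JSON parsing or re-serialization, or the digests will not match.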

SDKs

Official clients live in the CragData repo under packages/cragdata-python and packages/cragdata-js. They wrap auth, discover, crawl, scrape, export, and job polling.

Python

pip install httpx  # or use the repo client: packages/cragdata-python

from cragdata import CragDataClient

client = CragDataClient("ck_live_YOUR_KEY", base_url="https://api.cragdata.com/v1")
ctx = client.domain_context("indiehackers.com", depth_hops=2)
print(ctx["context_for_ai"])
pages = client.top_pages("indiehackers.com", limit=10)
print(pages["pages"][0]["url"])

JavaScript

// packages/cragdata-js in this repo
import { CragDataClient } from "./packages/cragdata-js/index.js";

const client = new CragDataClient("ck_live_YOUR_KEY", {
  baseUrl: "https://api.cragdata.com/v1",
});
const ctx = await client.domainContext("indiehackers.com", { depth_hops: 2 });
console.log(ctx.context_for_ai);
const pages = await client.topPages("indiehackers.com", { limit: 10 });
console.log(pages.pages[0].url);

Errors & limits

Code | Cause | How to fix
401 | Invalid or revoked API key | Create a new key in Dashboard → API keys
402 | Monthly quota exhausted | Upgrade to Developer ($10/mo) or Startup
403 | Endpoint not available on your plan (e.g. POST /export on Free) | Upgrade to Developer or Startup, or call GET /me to see enabled endpoints
409 | Crawl queue is full | Wait for the current job to finish or cancel it from the dashboard
429 | Rate limit exceeded | Slow down requests (see X-RateLimit-Limit response header)
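
Client code might translate these codes into one exception type, as in the sketch below. `CragDataError` and `check_status` are hypothetical names, and treating 409 and 429 as retryable after a wait is an assumption drawn from the fixes listed above.

```python
# Hints paraphrase the table above.
HINTS = {
    401: "invalid or revoked API key; create a new key",
    402: "monthly quota exhausted; upgrade the plan",
    403: "endpoint not available on this plan; check GET /me",
    409: "crawl queue is full; wait for the current job",
    429: "rate limit exceeded; slow down",
}
RETRYABLE = {409, 429}  # assumption: waiting and retrying can succeed

class CragDataError(Exception):
    def __init__(self, status: int, hint: str):
        super().__init__(f"HTTP {status}: {hint}")
        self.status = status
        self.retryable = status in RETRYABLE

def check_status(status: int) -> None:
    """Raise CragDataError for any 4xx/5xx status; return None otherwise."""
    if status >= 400:
        raise CragDataError(status, HINTS.get(status, "unexpected error"))
```

A caller could invoke `check_status(r.status_code)` after each request and retry only when `err.retryable` is true.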

Plan limits

All plans include crawl, scrape, and graph APIs. Use GET /me for your live quota, rate limit, and read_limits.

Plan | Calls / month | Rate limit | API keys | Discover | Export | Graph / list caps
Free | 500 | 2 req/s | 1 | 25 domains/run | Not available | 50 nodes, 10 scrapes
Developer | 10,000 | 5 req/s | 3 | 200 domains/run | JSONL / Parquet | 300 nodes, 50 scrapes
Startup | 50,000 | 20 req/s | 5 | 500 domains/run | JSONL / Parquet | 1000 nodes, 200 scrapes
Enterprise | Custom | Custom | Unlimited | Custom | Custom | 5000+ nodes

Response headers include X-Credits-Remaining and X-RateLimit-Limit (requests per second for your plan).

Dashboard

After signing in, use the app dashboard to manage keys, monitor usage, and handle billing:

  • Overview — usage summary and plan status
  • API keys — create and revoke keys
  • Usage — calls, credits, and per-endpoint breakdown
  • Billing — upgrade to Developer ($10/mo), Startup, or manage subscription
  • Webhooks — register HTTPS URLs and secrets for event delivery
  • Schedules — recurring crawl or discover jobs (cron via platform scheduler)
  • Team — invite members to your organization