Fresh structured web data for AI systems
This is the primary reference for CragData—how to plan sources with link graphs, run distributed crawls, extract AI-ready JSON, and deliver live data to agents and RAG. All routes are under /v1. Get an API key from your dashboard.
CragData is not a scraper script. It is the living data layer between the open web and your agents, RAG stack, and automations.
Datasets go stale
Training snapshots and one-off exports miss today's pricing, partners, and policy pages. Production agents need crawl-time freshness.
LLMs hallucinate without grounding
Structured JSON + link graphs + context_for_ai summaries give models citeable sources—not vibes from outdated corpora.
RAG breaks without source planning
Embedding the whole web is expensive. Niche graphs rank which domains and URLs to read before you scrape and embed.
Quickstart
- Create a free account — no credit card required.
- Open API keys and generate a key. Copy it once; we only store a hash.
- Call the API with the header below. Free includes 500 calls/month plus read endpoints (/graph, /jobs, etc.) with smaller caps; see plan limits or GET /me. Export (POST /export) requires Developer or higher.
1 — Scrape a page
curl -X POST https://api.cragdata.com/v1/scrape \
-H "Authorization: Bearer ck_live_YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com"}'
2 — Start a crawl
curl -X POST https://api.cragdata.com/v1/crawl \
-H "Authorization: Bearer ck_live_YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com", "depth": 2, "max_pages": 50}'
3 — Discover domains
curl -X POST https://api.cragdata.com/v1/discover \
-H "Authorization: Bearer ck_live_YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com", "max_domains": 50}'
Poll any job with GET /crawl/{job_id}. Export graph or scrapes with POST /export on Developer+; see endpoints.
Check availability: https://api.cragdata.com/v1/health. See also our terms of service and privacy policy.
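The "poll any job" step can be sketched as a small helper. This is a minimal sketch, not an official client: `wait_for_job`, its `interval` and `timeout` defaults, and the `fetch_status` hook are illustrative names, and the terminal statuses are the ones listed for the crawl.completed webhook event (completed, failed, canceled).

```python
import time
from typing import Callable

# Terminal statuses as listed for the crawl.completed webhook event.
TERMINAL_STATUSES = {"completed", "failed", "canceled"}


def wait_for_job(
    fetch_status: Callable[[], dict],
    interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_status() until the job reaches a terminal status.

    fetch_status is any zero-argument callable that returns the decoded JSON
    of GET /crawl/{job_id}. interval and timeout are illustrative defaults.
    """
    waited = 0.0
    while True:
        job = fetch_status()
        if job.get("status") in TERMINAL_STATUSES:
            return job
        if waited >= timeout:
            raise TimeoutError(f"job still {job.get('status')!r} after {waited:.0f}s")
        sleep(interval)
        waited += interval


# Offline demo: a stand-in for the HTTP call that finishes on the third poll.
_responses = iter([
    {"status": "queued"},
    {"status": "queued"},
    {"status": "completed", "pages_crawled": 42},
])
result = wait_for_job(lambda: next(_responses), sleep=lambda _seconds: None)
print(result["status"], result["pages_crawled"])
```

Against the live API, `fetch_status` could be something like `lambda: httpx.get(f"{API}/crawl/{job_id}", headers=H).json()`, with `API`, `H`, and `job_id` defined by the caller.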
API playground
Explore response shapes for graph, queue, scrape, and entity extraction—no API key required. Use the same flows with your key for live data.
{
"seed_domain": "stripe.com",
"context_for_ai": "Niche graph for stripe.com (depth 2). 8 strong referrers, 12 outbound destinations, 5 related domains. Use top_inbound for authority sources; top_outbound for integrations.",
"top_inbound_domains": [
{
"domain": "ycombinator.com",
"link_count": 14,
"niche_score": 1,
"scrapable": true
},
{
"domain": "techcrunch.com",
"link_count": 9,
"niche_score": 0.82,
"scrapable": true
}
],
"top_outbound_domains": [
{
"domain": "docs.stripe.com",
"link_count": 42,
"niche_score": 1,
"scrapable": true
},
{
"domain": "github.com",
"link_count": 11,
"niche_score": 0.71,
"scrapable": true
}
],
"cached": false
}
Agents playbook
Recommended order for LLM agents, copilots, and RAG pipelines. Copy patterns into system prompts or tool definitions.
- Plan: Get context_for_ai + ranked domains before any search tool.
  GET https://api.cragdata.com/v1/graph/domain-context?seed=YOUR_DOMAIN
- Select URLs: Pick scrapable pages by in-graph inlinks, not sitemap guesses.
  GET https://api.cragdata.com/v1/graph/top-pages?domain=PARTNER_DOMAIN&limit=10
- Extract: Structured JSON for chunking; same schema every page.
  POST https://api.cragdata.com/v1/scrape
- Embed & answer: Cite url + scraped_at in the final response for trust.
  Your vector DB + LLM
Tooling integrations
- LangChain / CrewAI — wrap domain-context as a tool (see /llms.txt)
- OpenAI function calling — return context_for_ai string only
- Cron — Dashboard → Schedules for freshness without manual runs
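One way to wrap domain-context as an agent tool that returns only the context_for_ai string is sketched below. The names `make_domain_context_tool`, `get_json`, and `DOMAIN_CONTEXT_TOOL_SPEC` are hypothetical, not part of any CragData SDK, and the tool description is a generic JSON-schema style definition that may need adjusting for a given framework.

```python
from typing import Callable

API = "https://api.cragdata.com/v1"


def make_domain_context_tool(get_json: Callable[[str, dict], dict]) -> Callable[[str], str]:
    """Build a tool function that returns only the context_for_ai string.

    get_json(url, params) is a hypothetical transport hook: it should perform
    an authenticated GET and return the decoded JSON body.
    """
    def domain_context(seed: str) -> str:
        body = get_json(f"{API}/graph/domain-context", {"seed": seed})
        # Return the summary string only, so the model receives a compact plan.
        return body.get("context_for_ai", "")

    return domain_context


# Generic JSON-schema tool description (an assumption about the caller's framework).
DOMAIN_CONTEXT_TOOL_SPEC = {
    "name": "domain_context",
    "description": "Plan which domains to read for a seed domain before searching.",
    "parameters": {
        "type": "object",
        "properties": {
            "seed": {"type": "string", "description": "Seed domain, e.g. stripe.com"},
        },
        "required": ["seed"],
    },
}
```

With httpx, the hook could be `lambda url, params: httpx.get(url, params=params, headers=H, timeout=30).json()`, where `H` carries the Authorization header.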
AI context graph
Live niche graph so your AI knows where to look on the web — before RAG.
- GET /graph/domain-context?seed=indiehackers.com: acquires the seed automatically if it is not yet indexed (the seed homepage must return HTTP 200).
- Inbound/outbound domains + cluster scores in one response.
- GET /graph/top-pages per domain — scrapable pages ranked by in-degree.
- POST /scrape on top URLs, then embed for your mini-RAG.
Example — niche context for a seed
curl "https://api.cragdata.com/v1/graph/domain-context?seed=indiehackers.com&depth_hops=2" \
-H "Authorization: Bearer ck_live_YOUR_KEY"
Response shape (abbreviated)
{
"seed_domain": "indiehackers.com",
"context_for_ai": "Niche graph for indiehackers.com ...",
"top_inbound_domains": [
{ "domain": "news.ycombinator.com", "link_count": 12, "niche_score": 1.0, "scrapable": true, "summary": "..." }
],
"top_outbound_domains": [...],
"related_domains": [...]
}
Each domain includes summary and scrapable so agents can skip dead or blocked pages. Pair with GET /graph/top-pages and POST /scrape for a mini-RAG pipeline. See endpoints for full parameters.
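Skipping dead or blocked domains can be done with a small filter over the response. The sketch below assumes the response fields shown above (domain, link_count, niche_score, scrapable); `plan_sources`, `min_score`, and `limit` are illustrative names, and the non-scrapable entry in the sample is invented for the demo.

```python
def plan_sources(ctx: dict, min_score: float = 0.5, limit: int = 5) -> list[str]:
    """Pick scrapable domains from a domain-context response, best first."""
    candidates = ctx.get("top_inbound_domains", []) + ctx.get("top_outbound_domains", [])
    usable = [
        d for d in candidates
        if d.get("scrapable") and d.get("niche_score", 0) >= min_score
    ]
    # Rank by niche_score, then by link_count as a tie-breaker.
    usable.sort(key=lambda d: (d.get("niche_score", 0), d.get("link_count", 0)), reverse=True)
    ordered: list[str] = []
    for d in usable:
        if d["domain"] not in ordered:  # a domain may appear in both lists
            ordered.append(d["domain"])
    return ordered[:limit]


sample = {
    "top_inbound_domains": [
        {"domain": "ycombinator.com", "link_count": 14, "niche_score": 1, "scrapable": True},
        {"domain": "techcrunch.com", "link_count": 9, "niche_score": 0.82, "scrapable": True},
        # Hypothetical blocked domain, added only to show the filter.
        {"domain": "blocked.example", "link_count": 30, "niche_score": 0.9, "scrapable": False},
    ],
    "top_outbound_domains": [
        {"domain": "docs.stripe.com", "link_count": 42, "niche_score": 1, "scrapable": True},
        {"domain": "github.com", "link_count": 11, "niche_score": 0.71, "scrapable": True},
    ],
}
print(plan_sources(sample))
```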
Crawl lifecycle
Every discover, crawl, and scrape job moves through the same managed pipeline—queues, workers, graph writes, and delivery.
1. Enqueue
POST /crawl or /discover returns job_id immediately. Job enters the account queue (409 if queue full).
2. Fetch
Workers pull URLs with concurrency limits, respect robots.txt, and apply anti-bot retries with backoff.
3. Parse & graph
HTML becomes nodes (pages) and edges (links). Domains aggregate for /graph endpoints.
4. Extract
POST /scrape strips boilerplate and returns content[] blocks, metadata, and optional entities.
5. Deliver
Poll GET /crawl/{job_id}, subscribe via webhooks, or export JSONL/Parquet to your warehouse.
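To make step 3 concrete, here is a rough illustration of how page-level edges can be collapsed into domain-level counts. It is not CragData's implementation: `domain_edges` is a hypothetical helper, and it groups by hostname, which is an assumption about how domains are keyed.

```python
from collections import Counter
from urllib.parse import urlsplit


def domain_edges(page_edges):
    """Collapse (source_url, target_url) page links into domain-level counts."""
    counts = Counter()
    for src, dst in page_edges:
        a, b = urlsplit(src).hostname, urlsplit(dst).hostname
        if a and b and a != b:  # skip malformed URLs and same-host links
            counts[(a, b)] += 1
    return counts


edges = [
    ("https://stripe.com/", "https://docs.stripe.com/api"),
    ("https://stripe.com/pricing", "https://docs.stripe.com/billing"),
    ("https://stripe.com/", "https://stripe.com/pricing"),
    ("https://stripe.com/", "https://github.com/stripe"),
]
for (src, dst), n in domain_edges(edges).most_common():
    print(f"{src} -> {dst}: {n}")
```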
Queues & retries
Queue behavior
- Per-account crawl queue — one active heavy job unless plan allows more
- Discover jobs run with max_domains caps per plan
- GET /jobs lists history with status filters
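Walking the job history page by page could look like the sketch below. It assumes the limit/offset query parameters and the items/total response fields shown in the /jobs reference; `iter_items` and `fetch_page` are hypothetical names, and the transport is injected so the example runs offline.

```python
from typing import Callable, Iterator


def iter_items(fetch_page: Callable[[int, int], dict], limit: int = 20) -> Iterator[dict]:
    """Yield every item from a limit/offset paginated list endpoint."""
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        items = page.get("items", [])
        yield from items
        offset += len(items)
        # Stop on an empty page or once the reported total has been reached.
        if not items or offset >= page.get("total", 0):
            break


# Offline stand-in for GET /jobs?status=completed&limit=..&offset=..
_jobs = [{"job_id": f"job_{i}", "status": "completed"} for i in range(5)]


def fake_fetch(limit: int, offset: int) -> dict:
    return {"items": _jobs[offset:offset + limit], "total": len(_jobs)}


print([job["job_id"] for job in iter_items(fake_fetch, limit=2)])
```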
Retries & anti-bot
- Transient HTTP failures retry with exponential backoff
- 429/503 from targets respect Retry-After when present
- Targets that return a non-retryable 4xx are marked scrapable: false in graph responses
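A client calling the API can apply a similar policy to its own requests. The sketch below shows exponential backoff that prefers a numeric Retry-After value when one is present. The `base`, `cap`, and jitter choices are illustrative assumptions, not CragData's worker settings.

```python
import random


def retry_delay(attempt, retry_after=None, base=1.0, cap=60.0, rng=random.random):
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))  # Retry-After given in seconds
        except ValueError:
            pass  # HTTP-date form of Retry-After: fall back to backoff
    backoff = min(cap, base * (2 ** attempt))
    # Jitter: wait between 50% and 100% of the backoff to avoid retry bursts.
    return backoff * (0.5 + 0.5 * rng())


for attempt in range(4):
    print(attempt, round(retry_delay(attempt), 2))
```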
Rate limits
- Plan-based req/s — see X-RateLimit-Limit header
- Monthly credits — X-Credits-Remaining on every billable call
- Read endpoints (/graph, /jobs) have separate caps on Free
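The two headers can drive simple client-side pacing. This sketch assumes both header values are plain numbers (requests per second and remaining credits); `parse_limit_headers` is a hypothetical helper.

```python
def parse_limit_headers(headers):
    """Extract pacing hints from X-RateLimit-Limit and X-Credits-Remaining."""
    lower = {k.lower(): v for k, v in headers.items()}  # header names are case-insensitive
    rate = float(lower.get("x-ratelimit-limit", 0) or 0)
    credits = lower.get("x-credits-remaining")
    return {
        # Minimum spacing between requests to stay within the plan's req/s.
        "min_interval_s": 1.0 / rate if rate > 0 else None,
        "credits_remaining": int(credits) if credits is not None else None,
    }


print(parse_limit_headers({"X-RateLimit-Limit": "5", "X-Credits-Remaining": "8420"}))
```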
Examples
Copy-paste flows for common production setups.
Agent RAG — niche graph first
Full flow for a research agent with one seed domain.
import os, httpx
API = "https://api.cragdata.com/v1"
H = {"Authorization": f"Bearer {os.environ['CRAGDATA_API_KEY']}"}
seed = "stripe.com"
# 1. Plan: fetch the niche graph summary and ranked domains for the seed.
ctx = httpx.get(f"{API}/graph/domain-context", params={"seed": seed}, headers=H, timeout=30).json()
system = f"Sources plan: {ctx['context_for_ai']}"
# 2. Select: take the strongest inbound domain and list its top pages.
domain = ctx["top_inbound_domains"][0]["domain"]
pages = httpx.get(f"{API}/graph/top-pages", params={"domain": domain, "limit": 5}, headers=H, timeout=30).json()
# 3. Extract: scrape the first three pages.
docs = []
for p in pages["pages"][:3]:
    docs.append(httpx.post(f"{API}/scrape", json={"url": p["url"]}, headers=H, timeout=30).json())
# → chunk docs[*]["content"] → embed → answer
Website monitoring with webhooks
Schedule crawls and react on page.extracted.
# Dashboard → Webhooks: https://your.app/hooks/crag
# Events: crawl.completed, page.extracted
# Schedule: Dashboard → Schedules (cron)
# Or POST /crawl on demand
# Verify signature:
# X-Crag-Signature = HMAC-SHA256(body, webhook_secret)
Export graph to warehouse
Developer+ JSONL export for dbt / Snowflake / BigQuery.
curl -X POST https://api.cragdata.com/v1/export \
-H "Authorization: Bearer ck_live_YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{"format": "jsonl", "scope": "graph"}'
Market map from seeds
Discover linked domains then crawl the cluster.
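import os, time, httpx
H = {"Authorization": f"Bearer {os.environ['CRAGDATA_API_KEY']}"}
discover = httpx.post(
    "https://api.cragdata.com/v1/discover",
    headers=H,
    json={"url": "https://example.com", "max_domains": 100},
).json()
# Poll until the discover job reaches a terminal status.
while True:
    job = httpx.get(f"https://api.cragdata.com/v1/crawl/{discover['job_id']}", headers=H).json()
    if job["status"] in ("completed", "failed", "canceled"):
        break
    time.sleep(5)
graph = httpx.get("https://api.cragdata.com/v1/graph", params={"format": "domains"}, headers=H).json()
Authentication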
Send your API key on every request using the Authorization header:
Authorization: Bearer ck_live_xxxxxxxx
Keys are tied to your account and plan. Revoke compromised keys in the dashboard; usage is logged per key.
Endpoints
Base URL: https://api.cragdata.com/v1
/health: Health check
Verify API availability and version. No authentication required.
Example response
{ "ok": true, "version": "1.0.0", "service": "cragcrawler" }
/crawl: Start crawl job
Queue a crawl from a seed URL. Debits credits on your plan.
Request body
{
"url": "https://example.com",
"depth": 2,
"max_pages": 100
}
curl
curl -X POST https://api.cragdata.com/v1/crawl \
-H "Authorization: Bearer ck_live_YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com", "depth": 2, "max_pages": 50}'
Python (httpx)
import httpx
r = httpx.post(
"https://api.cragdata.com/v1/crawl",
headers={"Authorization": "Bearer ck_live_YOUR_KEY"},
json={"url": "https://example.com", "depth": 2, "max_pages": 50},
)
print(r.json())
JavaScript (fetch)
const res = await fetch("https://api.cragdata.com/v1/crawl", {
method: "POST",
headers: {
Authorization: "Bearer ck_live_YOUR_KEY",
"Content-Type": "application/json",
},
body: JSON.stringify({
url: "https://example.com",
depth: 2,
max_pages: 50,
}),
});
console.log(await res.json());
Example response
{ "job_id": "job_abc123", "status": "queued" }
/discover: Discover domains
Queue a discover job from seed URL(s). Maps new root domains linked from seeds (plan limits apply). Poll with GET /crawl/{job_id}.
Request body
{
"url": "https://vercel.com",
"urls": ["https://vercel.com", "https://nextjs.org"],
"max_domains": 200
}
curl
curl -X POST https://api.cragdata.com/v1/discover \
-H "Authorization: Bearer ck_live_YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{"url": "https://vercel.com", "max_domains": 50}'
Python (httpx)
import httpx
r = httpx.post(
"https://api.cragdata.com/v1/discover",
headers={"Authorization": "Bearer ck_live_YOUR_KEY"},
json={"url": "https://vercel.com", "max_domains": 50},
)
print(r.json())
JavaScript (fetch)
const res = await fetch("https://api.cragdata.com/v1/discover", {
method: "POST",
headers: {
Authorization: "Bearer ck_live_YOUR_KEY",
"Content-Type": "application/json",
},
body: JSON.stringify({ url: "https://vercel.com", max_domains: 50 }),
});
console.log(await res.json());
Example response
{ "job_id": "job_abc123", "status": "queued", "max_domains": 200 }
/export: Export data
Export your graph or scrapes as JSONL or Parquet. Requires Developer plan or higher. Returns a server path (and optional s3_url if the API host has S3 configured).
Request body
{ "format": "jsonl", "scope": "graph" }
curl
curl -X POST https://api.cragdata.com/v1/export \
-H "Authorization: Bearer ck_live_YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{"format": "jsonl", "scope": "scrapes"}'
Python (httpx)
import httpx
r = httpx.post(
"https://api.cragdata.com/v1/export",
headers={"Authorization": "Bearer ck_live_YOUR_KEY"},
json={"format": "jsonl", "scope": "graph"},
)
print(r.json())
JavaScript (fetch)
const res = await fetch("https://api.cragdata.com/v1/export", {
method: "POST",
headers: {
Authorization: "Bearer ck_live_YOUR_KEY",
"Content-Type": "application/json",
},
body: JSON.stringify({ format: "jsonl", scope: "graph" }),
});
console.log(await res.json());
Example response
{ "export_id": "exp_...", "format": "jsonl", "path": "/tmp/...", "s3_url": null }
/crawl/{job_id}: Crawl job status
Poll job status, pages crawled, and graph nodes/edges.
curl
curl "https://api.cragdata.com/v1/crawl/job_abc123" \
-H "Authorization: Bearer ck_live_YOUR_KEY"
Python (httpx)
import httpx
r = httpx.get(
"https://api.cragdata.com/v1/crawl/job_abc123",
headers={"Authorization": "Bearer ck_live_YOUR_KEY"},
)
print(r.json())
JavaScript (fetch)
const res = await fetch("https://api.cragdata.com/v1/crawl/job_abc123", {
headers: { Authorization: "Bearer ck_live_YOUR_KEY" },
});
console.log(await res.json());
Example response
{
"status": "completed",
"pages_crawled": 42,
"nodes": [...],
"edges": [...]
}
/scrape: Scrape page
Extract title, content, and optional structured JSON from one URL.
Request body
{
"url": "https://example.com/about",
"extract_json": true
}
curl
curl -X POST https://api.cragdata.com/v1/scrape \
-H "Authorization: Bearer ck_live_YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com"}'
Python (httpx)
import httpx
r = httpx.post(
"https://api.cragdata.com/v1/scrape",
headers={"Authorization": "Bearer ck_live_YOUR_KEY"},
json={"url": "https://example.com", "extract_json": True},
)
print(r.json())
JavaScript (fetch)
const res = await fetch("https://api.cragdata.com/v1/scrape", {
method: "POST",
headers: {
Authorization: "Bearer ck_live_YOUR_KEY",
"Content-Type": "application/json",
},
body: JSON.stringify({ url: "https://example.com", extract_json: true }),
});
console.log(await res.json());
Example response
{
"url": "https://example.com/about",
"title": "About",
"content": "...",
"json_data": { ... },
"scraped_at": "2026-05-22T12:00:00Z"
}
/scrapes: List scrapes
Paginated scrapes filtered by domain.
Query
?domain=example.com&limit=50&offset=0
Example response
{ "items": [...], "total": 128 }
/me: Account & limits
Your plan, monthly quota, read limits, rate limit, and enabled endpoints.
Example response
{
"plan": "developer",
"monthly_quota": 10000,
"credits_remaining": 8420,
"rate_limit_per_sec": 5,
"api_keys_limit": 3,
"discover_domains_per_run": 200,
"read_limits": { "graph": 300, "scrapes": 50, "jobs": 30, "nodes": 200 },
"endpoints": {
"crawl": true,
"discover": true,
"scrape": true,
"export": true,
"jobs": true,
"graph_pages": true,
"graph_domains": true,
"graph_stats": true,
"graph_domain_context": true,
"graph_top_pages": true,
"graph_hops": true,
"scrapes": true,
"nodes": true
}
}
/graph/domain-context: AI niche context (domain)
Primary endpoint for agents: given a seed domain, returns inbound/outbound/related domains with niche_score, scrapable flags, and LLM-ready summaries. Uses your indexed graph when present; otherwise fetches the homepage, parses links in HTML (fast — not a full BFS crawl), and saves nodes/edges + a response snapshot. Set auto_acquire=false to only read the DB.
Query
?seed=indiehackers.com&depth_hops=2&cluster_limit=20
curl
curl "https://api.cragdata.com/v1/graph/domain-context?seed=indiehackers.com&depth_hops=2" \
-H "Authorization: Bearer ck_live_YOUR_KEY"
Python (httpx)
import httpx
r = httpx.get(
"https://api.cragdata.com/v1/graph/domain-context",
headers={"Authorization": "Bearer ck_live_YOUR_KEY"},
params={"seed": "indiehackers.com", "depth_hops": 2},
)
print(r.json()["context_for_ai"])
JavaScript (fetch)
const res = await fetch(
"https://api.cragdata.com/v1/graph/domain-context?seed=indiehackers.com&depth_hops=2",
{ headers: { Authorization: "Bearer ck_live_YOUR_KEY" } },
);
const ctx = await res.json();
console.log(ctx.context_for_ai, ctx.top_inbound_domains);
Example response
{
"seed_domain": "indiehackers.com",
"context_for_ai": "Niche graph for indiehackers.com ...",
"top_inbound_domains": [
{ "domain": "news.ycombinator.com", "link_count": 12, "niche_score": 1.0, "scrapable": true, "summary": "..." }
],
"top_outbound_domains": [...],
"related_domains": [...]
}
/graph/domain-context: AI niche context (JSON body)
Same as GET; use POST when passing seed_domain in the body from agent tool calls.
Request body
{ "seed_domain": "indiehackers.com", "depth_hops": 2, "cluster_limit": 20 }
Example response
{
"seed_domain": "indiehackers.com",
"context_for_ai": "Niche graph for indiehackers.com ...",
"top_inbound_domains": [
{ "domain": "news.ycombinator.com", "link_count": 12, "niche_score": 1.0, "scrapable": true, "summary": "..." }
],
"top_outbound_domains": [...],
"related_domains": [...]
}
/graph/top-pages: Top pages for RAG
Pages inside a domain ranked by in-degree and depth. Defaults to scrapable_only (HTTP 200 or already scraped).
Query
?domain=indiehackers.com&limit=20&scrapable_only=true
Example response
{ "domain": "indiehackers.com", "pages": [{ "url": "...", "inlinks": 8, "scrapable": true, "priority_score": 4.2, "summary": "..." }] }
/graph/hops: Domain path (hops)
Shortest link path between two domains in your indexed graph — useful to explain how niche players connect.
Query
?from_domain=indiehackers.com&to_domain=stripe.com&max_hops=6
Example response
{ "found": true, "hops": 2, "path": ["indiehackers.com", "..."], "context_for_ai": "..." }
/stats: Public stats
Marketing aggregates (pages crawled, domains discovered). No authentication.
Example response
{ "pages_crawled": 1200000, "domains_discovered": 120000, "edges": 1400000, "scrapes": 95000, "pages_200": 980000, "source": "live" }
/jobs: List crawl jobs
Paginated crawl jobs for your account (newest first).
Query
?status=completed&limit=20&offset=0
Example response
{ "items": [{ "job_id": "...", "status": "completed" }], "total": 5 }
/scrape/data: Stored scrape
Full content for a URL already scraped via POST /scrape.
Query
?url=https://example.com/about
Example response
{ "url": "...", "content": [...], "links": [...] }
/nodes: Discovered pages
Pages indexed by crawls or scrapes, filterable by domain.
Query
?domain=example.com&limit=100
Example response
{ "items": [...], "total": 240 }
/graph: Link graph
Page-level graph (default) or domain-level aggregation.
Query
?domain=example.com&format=pages|domains&limit=300
Example response
{ "format": "pages", "nodes": [...], "edges": [...] }
/graph/stats: Graph stats
Counts of pages, scrapes, edges, and domains in your workspace.
Query
?domain=example.com
Example response
{ "pages": 120, "edges": 890, "domains": 4 }
Webhooks
Configure HTTPS endpoints in Dashboard → Webhooks. We POST JSON when events fire; each delivery includes a signature you can verify with your webhook secret.
Events
- crawl.completed: Fired when a POST /crawl job finishes (completed, failed, or canceled).
- discover.completed: Fired when a POST /discover job finishes.
- page.extracted: Fired after a successful POST /scrape (HTTP 200).
Request headers
- X-Crag-Event: event name (same as the event field in the JSON payload)
- X-Crag-Signature: HMAC-SHA256 of the raw body using your webhook secret
- X-Crag-Webhook-Id: webhook configuration id
Payload
{
"event": "crawl.completed",
"created_at": 1716400000.123,
"data": {
"job_id": "job_abc123",
"status": "completed",
"pages_crawled": 42
}
}
Only https:// URLs are accepted. Failed deliveries are logged server-side. The retry policy may evolve, so treat webhooks as at-least-once.
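Signature verification on the receiving side can be sketched with the standard library. The header is described as an HMAC-SHA256 of the raw body keyed with the webhook secret; the hex encoding used here is an assumption, so check a real delivery before relying on it. `verify_signature` and the `whsec_test` secret are hypothetical.

```python
import hashlib
import hmac


def verify_signature(raw_body: bytes, header_value: str, secret: str) -> bool:
    """Check X-Crag-Signature against an HMAC-SHA256 of the raw request body.

    Assumes the header carries a hex digest; adjust if deliveries use another encoding.
    """
    expected = hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    received = header_value.strip().lower()
    # Constant-time comparison to avoid leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), received.encode())


body = b'{"event": "crawl.completed", "data": {"job_id": "job_abc123"}}'
signature = hmac.new(b"whsec_test", body, hashlib.sha256).hexdigest()
print(verify_signature(body, signature, "whsec_test"))
```

Verify against the raw bytes as received, before any JSON parsing or re-serialization, since re-encoding the payload can change the bytes and break the comparison.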
SDKs
Official clients live in the CragData repo under packages/cragdata-python and packages/cragdata-js. They wrap auth, discover, crawl, scrape, export, and job polling.
Python
pip install httpx # or use the repo client: packages/cragdata-python
from cragdata import CragDataClient
client = CragDataClient("ck_live_YOUR_KEY", base_url="https://api.cragdata.com/v1")
ctx = client.domain_context("indiehackers.com", depth_hops=2)
print(ctx["context_for_ai"])
pages = client.top_pages("indiehackers.com", limit=10)
print(pages["pages"][0]["url"])
JavaScript
// packages/cragdata-js in this repo
import { CragDataClient } from "./packages/cragdata-js/index.js";
const client = new CragDataClient("ck_live_YOUR_KEY", {
baseUrl: "https://api.cragdata.com/v1",
});
const ctx = await client.domainContext("indiehackers.com", { depth_hops: 2 });
console.log(ctx.context_for_ai);
const pages = await client.topPages("indiehackers.com", { limit: 10 });
console.log(pages.pages[0].url);
Errors & limits
| Code | Cause | How to fix |
|---|---|---|
| 401 | Invalid or revoked API key | Create a new key in Dashboard → API keys |
| 402 | Monthly quota exhausted | Upgrade to Developer ($10/mo) or Startup |
| 403 | Endpoint not available on your plan (e.g. POST /export on Free) | Upgrade to Developer or Startup, or call GET /me to see enabled endpoints |
| 409 | Crawl queue is full | Wait for the current job to finish or cancel it from the dashboard |
| 429 | Rate limit exceeded | Slow down requests (see X-RateLimit-Limit response header) |
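In client code, the table above can be turned into a small dispatch. The action names below are hypothetical labels for the caller's own handlers; only the status codes and their meanings come from the table.

```python
# Status codes from the errors table, mapped to illustrative action labels.
ACTIONS = {
    401: "rotate_key",          # invalid or revoked API key
    402: "upgrade_plan",        # monthly quota exhausted
    403: "check_me_endpoints",  # endpoint not enabled on this plan
    409: "wait_for_queue",      # crawl queue is full
    429: "slow_down",           # rate limit exceeded
}

# Codes where waiting and retrying the same request is reasonable.
RETRYABLE = {409, 429}


def next_action(status_code: int) -> str:
    """Map an HTTP status code to a coarse client-side action."""
    if 200 <= status_code < 300:
        return "ok"
    return ACTIONS.get(status_code, "raise")


for code in (200, 402, 429, 500):
    print(code, next_action(code), code in RETRYABLE)
```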
Plan limits
All plans include crawl, scrape, and graph APIs. Use GET /me for your live quota, rate limit, and read_limits.
| Plan | Calls / month | Rate limit | API keys | Discover | Export | Graph / list caps |
|---|---|---|---|---|---|---|
| Free | 500 | 2 req/s | 1 | 25 domains/run | — | 50 nodes, 10 scrapes |
| Developer | 10,000 | 5 req/s | 3 | 200 domains/run | JSONL / Parquet | 300 nodes, 50 scrapes |
| Startup | 50,000 | 20 req/s | 5 | 500 domains/run | JSONL / Parquet | 1000 nodes, 200 scrapes |
| Enterprise | Custom | Custom | Unlimited | Custom | Custom | 5000+ nodes |
Response headers include X-Credits-Remaining and X-RateLimit-Limit (requests per second for your plan).
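A pre-flight check against GET /me can avoid 402 and 403 responses. The sketch below assumes the /me response shape shown in the endpoint reference; `can_use` and its `cost` parameter are hypothetical, and the Free-plan dictionary is an invented example rather than a documented response.

```python
def can_use(me: dict, endpoint: str, cost: int = 1) -> bool:
    """True if the endpoint is enabled on the plan and enough credits remain."""
    enabled = me.get("endpoints", {}).get(endpoint, False)
    return bool(enabled) and me.get("credits_remaining", 0) >= cost


developer_me = {
    "plan": "developer",
    "credits_remaining": 8420,
    "endpoints": {"crawl": True, "scrape": True, "export": True},
}
# Hypothetical Free-plan response, assuming export is reported as disabled.
free_me = {
    "plan": "free",
    "credits_remaining": 12,
    "endpoints": {"crawl": True, "scrape": True, "export": False},
}
print(can_use(developer_me, "export"), can_use(free_me, "export"))
```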
Dashboard
After signing in, use the app dashboard to manage keys, monitor usage, and billing:
- Overview — usage summary and plan status
- API keys — create and revoke keys
- Usage — calls, credits, and per-endpoint breakdown
- Billing — upgrade to Developer ($10/mo), Startup, or manage subscription
- Webhooks — register HTTPS URLs and secrets for event delivery
- Schedules — recurring crawl or discover jobs (cron via platform scheduler)
- Team — invite members to your organization