CragData is web intelligence infrastructure: discover domains, crawl pages, extract structured JSON, and explore link graphs—built for AI agents, RAG, and data products.

What is web intelligence?

Web intelligence is the practice of turning the live web into structured, queryable data—graphs, entities, and documents—rather than one-off HTML snapshots.

What is AI-ready data?

JSON designed for models: clean content blocks, metadata, link graphs, and plain-English summaries like context_for_ai that drop into system prompts.

How should AI agents use CragData?

Call GET /graph/domain-context before any broad search, pick pages inside the ranked domains with GET /graph/top-pages, scrape the top URLs, then embed. Patterns are in /docs and /llms.txt.

What is live retrieval for AI agents?

Fetching fresh web data at query time instead of relying on frozen training corpora—so answers reflect current pages, pricing, and ecosystem links.

What is RAG web crawling?

Crawling only the URLs that matter for a user question, after planning sources with a graph—so embeddings stay relevant and token spend stays low.

How does distributed crawling work?

Jobs are queued, workers run concurrent fetches with retries and rate limits, and results land in graph tables and JSON exports you pull via API.

How do anti-bot systems work?

Sites use fingerprints, rate limits, and challenges. We rotate strategies, respect robots.txt, back off on failures, and surface scrapable flags in graph responses.

Why do AI agents need live web crawling?

Models trained on static snapshots cannot see today's pricing, partners, or news. Live crawl + extract keeps agent answers grounded in current pages.

Are static datasets dying for RAG?

For production agents, yes—freshness wins. Snapshots are fine for eval; customer-facing RAG needs scheduled or on-demand live ingestion.

How is CragData different from a scraper library?

We run discovery, queues, crawling, extraction, storage, and APIs as managed infrastructure—not a single-URL fetch you host yourself.

Do you crawl the entire internet?

No. You provide seeds; we discover and crawl within your plan limits and configuration.

What format is the data?

JSON per page (title, content[], links[], metadata) plus graph endpoints for domains, top pages, and hops.

Can I use my own database?

Yes. Pull via REST, export JSONL or Parquet on Developer+, or push events with webhooks into your warehouse.

Do you offer webhooks?

Yes — crawl.completed, discover.completed, page.extracted. Configure HTTPS URLs in Dashboard → Webhooks; payloads are HMAC-signed.

Is there an official SDK?

Python and Node clients live under packages/cragdata-python and packages/cragdata-js in our GitHub repo.

What is a niche graph?

A ranked view of who links to a seed domain, where the seed links out, and related clusters—with niche_score and scrapable flags for agents.

How fast can I integrate?

Most teams send a first /scrape or /graph/domain-context call within 15 minutes using the quickstart at /docs.

What are your rate limits?

Plan-based requests per second; see /docs#errors or GET /me for live caps. Headers include X-Credits-Remaining and X-RateLimit-Limit.

Do you respect robots.txt?

Yes. You are responsible for lawful use; we encourage robots compliance and reasonable crawl rates.

You are responsible for how you use data. Enterprise plans include compliance discussions for sensitive use cases.

Can I monitor competitors?

Yes—seed competitor domains, schedule crawls, and diff structured JSON over time. See /use-cases/competitor-monitoring.
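
One way to diff structured JSON between two crawls is a field-by-field comparison. This is a minimal sketch that assumes each snapshot is a dict shaped like the POST /scrape response; the helper name diff_snapshots and the sample data are hypothetical, not part of the CragData API.

```python
from typing import Any


def diff_snapshots(old: dict[str, Any], new: dict[str, Any],
                   ignore: tuple[str, ...] = ("scraped_at",)) -> dict[str, dict[str, Any]]:
    """Return {field: {"old": ..., "new": ...}} for every field that changed."""
    changes: dict[str, dict[str, Any]] = {}
    for key in sorted(set(old) | set(new)):
        if key in ignore:
            continue  # timestamps differ on every crawl, so skip them
        if old.get(key) != new.get(key):
            changes[key] = {"old": old.get(key), "new": new.get(key)}
    return changes


# Illustrative snapshots only (not real crawl output).
before = {"url": "https://example.com/pricing", "title": "Pricing",
          "scraped_at": "2026-05-01T00:00:00Z"}
after = {"url": "https://example.com/pricing", "title": "New pricing",
         "scraped_at": "2026-05-22T00:00:00Z"}
print(diff_snapshots(before, after))
```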

How does crawl orchestration with queues work?

POST /crawl or /discover returns a job_id; workers process the queue; you poll GET /crawl/{job_id} or listen on webhooks.

What is structured extraction?

HTML is normalized to JSON blocks with nav/scripts stripped—ready for search indexes, analytics, or embedding pipelines.

CragData vs Firecrawl?

Firecrawl is strong for page-to-markdown. CragData adds niche graphs and agent-first source planning. See /compare/cragdata-vs-firecrawl.

CragData vs Apify?

Apify is an actor marketplace. CragData is an opinionated intelligence API with graphs and managed pipelines. See /compare/cragdata-vs-apify.

What uptime do you target?

Production API at api.cragdata.com with health at /v1/health. Enterprise SLAs available; status practices documented for ops teams.

Do you support always-on crawls?

Yes—Dashboard → Schedules runs recurring discover/crawl jobs for monitoring use cases.

What is context_for_ai?

A plain-English summary in graph responses describing the niche topology—designed to paste into an agent system prompt before retrieval.

Who is CragData built for?

Teams shipping AI agents, RAG products, GTM enrichment, SEO research, and market intelligence who need live structured web data at scale.

API docs

Fresh structured web data for AI systems

This is the primary reference for CragData—how to plan sources with link graphs, run distributed crawls, extract AI-ready JSON, and deliver live data to agents and RAG. All routes are under /v1. Get an API key from your dashboard.


Fresh structured web data for AI systems

CragData is not a scraper script. It is the living data layer between the open web and your agents, RAG stack, and automations.

Datasets go stale

Training snapshots and one-off exports miss today's pricing, partners, and policy pages. Production agents need crawl-time freshness.

LLMs hallucinate without grounding

Structured JSON + link graphs + context_for_ai summaries give models citeable sources—not vibes from outdated corpora.

RAG breaks without source planning

Embedding the whole web is expensive. Niche graphs rank which domains and URLs to read before you scrape and embed.

Quickstart

Create a free account — no credit card required.
Open API keys and generate a key. Copy it once; we only store a hash.
Call the API with the header below. Free includes 500 calls/month plus read endpoints (/graph, /jobs, etc.) with smaller caps — see plan limits or GET /me. Export (POST /export) requires Developer or higher.

1 — Scrape a page

curl -X POST https://api.cragdata.com/v1/scrape \
  -H "Authorization: Bearer ck_live_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

2 — Start a crawl

curl -X POST https://api.cragdata.com/v1/crawl \
  -H "Authorization: Bearer ck_live_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "depth": 2, "max_pages": 50}'

3 — Discover domains

curl -X POST https://api.cragdata.com/v1/discover \
  -H "Authorization: Bearer ck_live_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "max_domains": 50}'

Poll any job with GET /crawl/{job_id}. Export graph or scrapes with POST /export on Developer+ — see endpoints.

Check availability: https://api.cragdata.com/v1/health. See also our terms of service and privacy policy.

API playground

Explore response shapes for graph, queue, scrape, and entity extraction—no API key required. Use the same flows with your key for live data.

Sample response for a seed domain, for exploration only. Use your API key for live data.

{
  "seed_domain": "stripe.com",
  "context_for_ai": "Niche graph for stripe.com (depth 2). 8 strong referrers, 12 outbound destinations, 5 related domains. Use top_inbound for authority sources; top_outbound for integrations.",
  "top_inbound_domains": [
    {
      "domain": "ycombinator.com",
      "link_count": 14,
      "niche_score": 1,
      "scrapable": true
    },
    {
      "domain": "techcrunch.com",
      "link_count": 9,
      "niche_score": 0.82,
      "scrapable": true
    }
  ],
  "top_outbound_domains": [
    {
      "domain": "docs.stripe.com",
      "link_count": 42,
      "niche_score": 1,
      "scrapable": true
    },
    {
      "domain": "github.com",
      "link_count": 11,
      "niche_score": 0.71,
      "scrapable": true
    }
  ],
  "cached": false
}

Agents playbook

Recommended order for LLM agents, copilots, and RAG pipelines. Copy patterns into system prompts or tool definitions.

1. Plan: get context_for_ai and ranked domains before calling any search tool.

GET https://api.cragdata.com/v1/graph/domain-context?seed=YOUR_DOMAIN

2. Select URLs: pick scrapable pages by in-graph inlinks, not sitemap guesses.

GET https://api.cragdata.com/v1/graph/top-pages?domain=PARTNER_DOMAIN&limit=10

3. Extract: structured JSON for chunking, with the same schema on every page.

POST https://api.cragdata.com/v1/scrape

4. Embed & answer: cite url + scraped_at in the final response for trust.

Your vector DB + LLM
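
The last step, citing url and scraped_at, is easier if every chunk carries its source from the start. The sketch below is one possible approach, not a CragData utility: the helper names are hypothetical, and it assumes content is either a single string or a list of text blocks.

```python
def to_cited_chunks(scrape: dict, max_chars: int = 500) -> list[dict]:
    """Split scraped content into chunks that each keep url and scraped_at."""
    content = scrape.get("content", [])
    # Assumption: content is one string or a list of text blocks.
    blocks = [content] if isinstance(content, str) else [str(b) for b in content]
    chunks = []
    for block in blocks:
        for start in range(0, len(block), max_chars):
            chunks.append({
                "text": block[start:start + max_chars],
                "url": scrape["url"],
                "scraped_at": scrape["scraped_at"],
            })
    return chunks


def cite(chunk: dict) -> str:
    """Format a short citation string for the final answer."""
    return f"{chunk['url']} (scraped {chunk['scraped_at']})"


# Illustrative scrape result (not real output).
sample = {"url": "https://example.com/about",
          "content": ["About us.", "Contact."],
          "scraped_at": "2026-05-22T12:00:00Z"}
chunks = to_cited_chunks(sample)
print(cite(chunks[0]))
```

Store the url and scraped_at fields as metadata next to each embedding so the answer step can quote them without another lookup.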

Tooling integrations

LangChain / CrewAI — wrap domain-context as a tool (see /llms.txt)
OpenAI function calling — return context_for_ai string only
Cron — Dashboard → Schedules for freshness without manual runs
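
For the function-calling item, a tool definition in the OpenAI tools format might look like the sketch below. The tool name and description are hypothetical; only the seed parameter and the context_for_ai field come from this reference.

```python
import json

# Hypothetical tool name and description; `seed` mirrors the documented query parameter.
DOMAIN_CONTEXT_TOOL = {
    "type": "function",
    "function": {
        "name": "cragdata_domain_context",
        "description": "Return a plain-English source plan for a seed domain.",
        "parameters": {
            "type": "object",
            "properties": {
                "seed": {"type": "string", "description": "Seed domain, e.g. stripe.com"},
            },
            "required": ["seed"],
        },
    },
}


def tool_result(graph_response: dict) -> str:
    """Hand the model only the context_for_ai string, not the full graph JSON."""
    return graph_response.get("context_for_ai", "")


# Illustrative response fragment.
sample = {"seed_domain": "stripe.com",
          "context_for_ai": "Niche graph for stripe.com (depth 2)."}
print(json.dumps(DOMAIN_CONTEXT_TOOL["function"]["parameters"]["required"]))
print(tool_result(sample))
```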

AI context graph

Live niche graph so your AI knows where to look on the web — before RAG.

GET /graph/domain-context?seed=indiehackers.com: auto-crawls if the seed is not indexed yet (the seed must return HTTP 200).
Inbound/outbound domains + cluster scores in one response.
GET /graph/top-pages per domain — scrapable pages ranked by in-degree.
POST /scrape on top URLs, then embed for your mini-RAG.

Example — niche context for a seed

curl "https://api.cragdata.com/v1/graph/domain-context?seed=indiehackers.com&depth_hops=2" \
  -H "Authorization: Bearer ck_live_YOUR_KEY"

Response shape (abbreviated)

{
  "seed_domain": "indiehackers.com",
  "context_for_ai": "Niche graph for indiehackers.com ...",
  "top_inbound_domains": [
    { "domain": "news.ycombinator.com", "link_count": 12, "niche_score": 1.0, "scrapable": true, "summary": "..." }
  ],
  "top_outbound_domains": [...],
  "related_domains": [...]
}

Each domain includes summary and scrapable so agents can skip dead or blocked pages. Pair with GET /graph/top-pages and POST /scrape for a mini-RAG pipeline. See endpoints for full parameters.
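
As a rough illustration of using scrapable and niche_score to plan sources, the sketch below filters and ranks domains from a domain-context response. The function name, the 0.5 threshold, and the blocked.example entry are assumptions for the example; the other sample values are taken from the playground response.

```python
def pick_domains(ctx: dict, min_score: float = 0.5, limit: int = 5) -> list[str]:
    """Rank inbound and outbound domains by niche_score, keeping only scrapable ones."""
    candidates = ctx.get("top_inbound_domains", []) + ctx.get("top_outbound_domains", [])
    usable = [d for d in candidates
              if d.get("scrapable") and d.get("niche_score", 0) >= min_score]
    # sort() is stable, so ties keep their original (inbound first) order
    usable.sort(key=lambda d: d.get("niche_score", 0), reverse=True)
    seen, ordered = set(), []
    for d in usable:
        if d["domain"] not in seen:  # a domain may appear in both lists
            seen.add(d["domain"])
            ordered.append(d["domain"])
    return ordered[:limit]


ctx = {
    "top_inbound_domains": [
        {"domain": "ycombinator.com", "link_count": 14, "niche_score": 1, "scrapable": True},
        {"domain": "techcrunch.com", "link_count": 9, "niche_score": 0.82, "scrapable": True},
        # Hypothetical blocked domain, added to show the filter.
        {"domain": "blocked.example", "link_count": 3, "niche_score": 0.9, "scrapable": False},
    ],
    "top_outbound_domains": [
        {"domain": "docs.stripe.com", "link_count": 42, "niche_score": 1, "scrapable": True},
        {"domain": "github.com", "link_count": 11, "niche_score": 0.71, "scrapable": True},
    ],
}
print(pick_domains(ctx))
```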

Crawl lifecycle

Every discover, crawl, and scrape job moves through the same managed pipeline—queues, workers, graph writes, and delivery.

1. Enqueue
POST /crawl or /discover returns job_id immediately. Job enters the account queue (409 if queue full).
2. Fetch
Workers pull URLs with concurrency limits, respect robots.txt, and apply anti-bot retries with backoff.
3. Parse & graph
HTML becomes nodes (pages) and edges (links). Domains aggregate for /graph endpoints.
4. Extract
POST /scrape strips boilerplate into content[] blocks, metadata, and optional entities.
5. Deliver
Poll GET /crawl/{job_id}, subscribe via webhooks, or export JSONL/Parquet to your warehouse.
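
The polling option in step 5 can be sketched as a small loop. This is an illustration under assumptions: the terminal statuses are taken from the webhook section (completed, failed, canceled), the "running" status in the stand-in data is a guess, and fetch_status is a hypothetical callable you would back with a real GET /crawl/{job_id} request.

```python
import time
from typing import Callable

# Assumed terminal statuses, based on the crawl.completed event description.
TERMINAL = {"completed", "failed", "canceled"}


def wait_for_job(fetch_status: Callable[[str], dict], job_id: str,
                 interval: float = 2.0, max_polls: int = 30,
                 sleep: Callable[[float], None] = time.sleep) -> dict:
    """Poll until the job reaches a terminal status or max_polls is used up."""
    for _ in range(max_polls):
        job = fetch_status(job_id)
        if job.get("status") in TERMINAL:
            return job
        sleep(interval)
    raise TimeoutError(f"{job_id} still running after {max_polls} polls")


# Stand-in for GET /crawl/{job_id}; swap in a real HTTP call.
_fake = iter([{"status": "queued"}, {"status": "running"},
              {"status": "completed", "pages_crawled": 42}])
result = wait_for_job(lambda job_id: next(_fake), "job_abc123", sleep=lambda s: None)
print(result["status"])
```

Passing sleep as a parameter keeps the loop testable without real waiting; webhooks remain the better fit when jobs run for a long time.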

Queues & retries

Queue behavior

Per-account crawl queue — one active heavy job unless plan allows more
Discover jobs run with max_domains caps per plan
GET /jobs lists history with status filters

Retries & anti-bot

Transient HTTP failures retry with exponential backoff
429/503 from targets respect Retry-After when present
Non-retryable 4xx marked scrapable: false in graph responses
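
The same policy is useful on the client side when calling the API. The sketch below is not CragData's worker code; it only illustrates exponential backoff that defers to a numeric Retry-After value. The base delay, the cap, and the exact set of retryable statuses are assumptions.

```python
from typing import Optional

# Assumption: which statuses count as transient; only 429 and 503 are named in the text.
RETRYABLE = {429, 500, 502, 503, 504}


def should_retry(status: int) -> bool:
    """Transient statuses are retried; other 4xx responses are not."""
    return status in RETRYABLE


def retry_delay(attempt: int, retry_after: Optional[str] = None,
                base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        try:
            return min(float(retry_after), cap)  # a numeric server hint wins
        except ValueError:
            pass  # the HTTP-date form of Retry-After is not handled in this sketch
    return min(base * (2 ** attempt), cap)


for attempt in range(4):
    print(attempt, retry_delay(attempt))
```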

Rate limits

Plan-based req/s — see X-RateLimit-Limit header
Monthly credits — X-Credits-Remaining on every billable call
Read endpoints (/graph, /jobs) have separate caps on Free
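
A client can pace itself from these two headers. The helper below is a hypothetical sketch that assumes both header values are plain numbers and that the headers arrive in a dict with exactly this casing (an httpx response's headers object is case-insensitive, a plain dict is not).

```python
def pacing_from_headers(headers: dict[str, str]) -> dict:
    """Derive a minimum request spacing and remaining credits from response headers."""
    limit = float(headers.get("X-RateLimit-Limit", "1"))  # requests per second
    credits = headers.get("X-Credits-Remaining")
    return {
        "min_interval_s": 1.0 / limit if limit > 0 else None,
        "credits_remaining": int(credits) if credits is not None else None,
    }


# Illustrative header values matching the Developer plan example.
print(pacing_from_headers({"X-RateLimit-Limit": "5", "X-Credits-Remaining": "8420"}))
```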

Examples

Copy-paste flows for common production setups.

Agent RAG — niche graph first

Full flow for a research agent with one seed domain.

import os, httpx

API = "https://api.cragdata.com/v1"
H = {"Authorization": f"Bearer {os.environ['CRAGDATA_API_KEY']}"}

seed = "stripe.com"
ctx = httpx.get(f"{API}/graph/domain-context", params={"seed": seed}, headers=H, timeout=30).json()
system = f"Sources plan: {ctx['context_for_ai']}"

domain = ctx["top_inbound_domains"][0]["domain"]
pages = httpx.get(f"{API}/graph/top-pages", params={"domain": domain, "limit": 5}, headers=H).json()

docs = []
for p in pages["pages"][:3]:
    docs.append(httpx.post(f"{API}/scrape", json={"url": p["url"]}, headers=H).json())

# → chunk docs[*]["content"] → embed → answer

Website monitoring with webhooks

Schedule crawls and react on page.extracted.

# Dashboard → Webhooks: https://your.app/hooks/crag
# Events: crawl.completed, page.extracted

# Schedule: Dashboard → Schedules (cron)
# Or POST /crawl on demand

# Verify signature:
# X-Crag-Signature = HMAC-SHA256(body, webhook_secret)

Export graph to warehouse

Developer+ JSONL export for dbt / Snowflake / BigQuery.

curl -X POST https://api.cragdata.com/v1/export \
  -H "Authorization: Bearer ck_live_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"format": "jsonl", "scope": "graph"}'

Market map from seeds

Discover linked domains then crawl the cluster.

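import os
import httpx

API = "https://api.cragdata.com/v1"
H = {"Authorization": f"Bearer {os.environ['CRAGDATA_API_KEY']}"}

# Queue a discover job from the seed URL.
discover = httpx.post(
    f"{API}/discover",
    headers=H,
    json={"url": "https://example.com", "max_domains": 100},
).json()

# Discover jobs are polled through GET /crawl/{job_id}; repeat until the job completes.
job = httpx.get(f"{API}/crawl/{discover['job_id']}", headers=H).json()

# Read the domain-level graph once the job has finished.
graph = httpx.get(f"{API}/graph", params={"format": "domains"}, headers=H).json()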

Authentication

Send your API key on every request using the Authorization header:

Authorization: Bearer ck_live_xxxxxxxx

Keys are tied to your account and plan. Revoke compromised keys in the dashboard; usage is logged per key.

Endpoints

Base URL: https://api.cragdata.com/v1

GET/health

Health check

Verify API availability and version. No authentication required.

Example response

{ "ok": true, "version": "1.0.0", "service": "cragcrawler" }

POST/crawl

Start crawl job

Queue a crawl from a seed URL. Debits credits on your plan.

Request body

{
  "url": "https://example.com",
  "depth": 2,
  "max_pages": 100
}

curl

curl -X POST https://api.cragdata.com/v1/crawl \
  -H "Authorization: Bearer ck_live_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "depth": 2, "max_pages": 50}'

Python (httpx)

import httpx

r = httpx.post(
    "https://api.cragdata.com/v1/crawl",
    headers={"Authorization": "Bearer ck_live_YOUR_KEY"},
    json={"url": "https://example.com", "depth": 2, "max_pages": 50},
)
print(r.json())

JavaScript (fetch)

const res = await fetch("https://api.cragdata.com/v1/crawl", {
  method: "POST",
  headers: {
    Authorization: "Bearer ck_live_YOUR_KEY",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    url: "https://example.com",
    depth: 2,
    max_pages: 50,
  }),
});
console.log(await res.json());

Example response

{ "job_id": "job_abc123", "status": "queued" }

POST/discover

Discover domains

Queue a discover job from seed URL(s). Maps new root domains linked from seeds (plan limits apply). Poll with GET /crawl/{job_id}.

Request body

{
  "url": "https://vercel.com",
  "urls": ["https://vercel.com", "https://nextjs.org"],
  "max_domains": 200
}

curl

curl -X POST https://api.cragdata.com/v1/discover \
  -H "Authorization: Bearer ck_live_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://vercel.com", "max_domains": 50}'

Python (httpx)

import httpx

r = httpx.post(
    "https://api.cragdata.com/v1/discover",
    headers={"Authorization": "Bearer ck_live_YOUR_KEY"},
    json={"url": "https://vercel.com", "max_domains": 50},
)
print(r.json())

JavaScript (fetch)

const res = await fetch("https://api.cragdata.com/v1/discover", {
  method: "POST",
  headers: {
    Authorization: "Bearer ck_live_YOUR_KEY",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({ url: "https://vercel.com", max_domains: 50 }),
});
console.log(await res.json());

Example response

{ "job_id": "job_abc123", "status": "queued", "max_domains": 200 }

POST/export

Export data

Export your graph or scrapes as JSONL or Parquet. Requires Developer plan or higher. Returns a server path (and optional s3_url if the API host has S3 configured).

Request body

{ "format": "jsonl", "scope": "graph" }

curl

curl -X POST https://api.cragdata.com/v1/export \
  -H "Authorization: Bearer ck_live_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"format": "jsonl", "scope": "scrapes"}'

Python (httpx)

import httpx

r = httpx.post(
    "https://api.cragdata.com/v1/export",
    headers={"Authorization": "Bearer ck_live_YOUR_KEY"},
    json={"format": "jsonl", "scope": "graph"},
)
print(r.json())

JavaScript (fetch)

const res = await fetch("https://api.cragdata.com/v1/export", {
  method: "POST",
  headers: {
    Authorization: "Bearer ck_live_YOUR_KEY",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({ format: "jsonl", scope: "graph" }),
});
console.log(await res.json());

Example response

{ "export_id": "exp_...", "format": "jsonl", "path": "/tmp/...", "s3_url": null }

GET/crawl/{job_id}

Crawl job status

Poll job status, pages crawled, and graph nodes/edges.

curl

curl "https://api.cragdata.com/v1/crawl/job_abc123" \
  -H "Authorization: Bearer ck_live_YOUR_KEY"

Python (httpx)

import httpx

r = httpx.get(
    "https://api.cragdata.com/v1/crawl/job_abc123",
    headers={"Authorization": "Bearer ck_live_YOUR_KEY"},
)
print(r.json())

JavaScript (fetch)

const res = await fetch("https://api.cragdata.com/v1/crawl/job_abc123", {
  headers: { Authorization: "Bearer ck_live_YOUR_KEY" },
});
console.log(await res.json());

Example response

{
  "status": "completed",
  "pages_crawled": 42,
  "nodes": [...],
  "edges": [...]
}

POST/scrape

Scrape page

Extract title, content, and optional structured JSON from one URL.

Request body

{
  "url": "https://example.com/about",
  "extract_json": true
}

curl

curl -X POST https://api.cragdata.com/v1/scrape \
  -H "Authorization: Bearer ck_live_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

Python (httpx)

import httpx

r = httpx.post(
    "https://api.cragdata.com/v1/scrape",
    headers={"Authorization": "Bearer ck_live_YOUR_KEY"},
    json={"url": "https://example.com", "extract_json": True},
)
print(r.json())

JavaScript (fetch)

const res = await fetch("https://api.cragdata.com/v1/scrape", {
  method: "POST",
  headers: {
    Authorization: "Bearer ck_live_YOUR_KEY",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({ url: "https://example.com", extract_json: true }),
});
console.log(await res.json());

Example response

{
  "url": "https://example.com/about",
  "title": "About",
  "content": [...],
  "json_data": { ... },
  "scraped_at": "2026-05-22T12:00:00Z"
}

GET/scrapes

List scrapes

Paginated scrapes filtered by domain.

Query

?domain=example.com&limit=50&offset=0

Example response

{ "items": [...], "total": 128 }
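
Walking all pages with limit and offset could look like the sketch below. It assumes the items/total shape shown in the example response; fetch_page is a hypothetical callable standing in for GET /scrapes, and the in-memory data is made up for the demonstration.

```python
from typing import Callable, Iterator


def iter_items(fetch_page: Callable[[int, int], dict], limit: int = 50) -> Iterator[dict]:
    """Yield every item by advancing offset until total is reached."""
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        items = page.get("items", [])
        yield from items
        offset += len(items)
        if not items or offset >= page.get("total", 0):
            break


# In-memory stand-in for GET /scrapes?limit=...&offset=...
_DATA = [{"url": f"https://example.com/{i}"} for i in range(128)]


def fake_fetch(limit: int, offset: int) -> dict:
    return {"items": _DATA[offset:offset + limit], "total": len(_DATA)}


print(sum(1 for _ in iter_items(fake_fetch)))
```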

GET/me

Account & limits

Your plan, monthly quota, read limits, rate limit, and enabled endpoints.

Example response

{
  "plan": "developer",
  "monthly_quota": 10000,
  "credits_remaining": 8420,
  "rate_limit_per_sec": 5,
  "api_keys_limit": 3,
  "discover_domains_per_run": 200,
  "read_limits": { "graph": 300, "scrapes": 50, "jobs": 30, "nodes": 200 },
  "endpoints": {
    "crawl": true,
    "discover": true,
    "scrape": true,
    "export": true,
    "jobs": true,
    "graph_pages": true,
    "graph_domains": true,
    "graph_stats": true,
    "graph_domain_context": true,
    "graph_top_pages": true,
    "graph_hops": true,
    "scrapes": true,
    "nodes": true
  }
}

GET/graph/domain-context

AI niche context (domain)

Primary endpoint for agents: given a seed domain, returns inbound/outbound/related domains with niche_score, scrapable flags, and LLM-ready summaries. Uses your indexed graph when present; otherwise fetches the homepage, parses links in HTML (fast — not a full BFS crawl), and saves nodes/edges + a response snapshot. Set auto_acquire=false to only read the DB.

Query

?seed=indiehackers.com&depth_hops=2&cluster_limit=20

curl

curl "https://api.cragdata.com/v1/graph/domain-context?seed=indiehackers.com&depth_hops=2" \
  -H "Authorization: Bearer ck_live_YOUR_KEY"

Python (httpx)

import httpx

r = httpx.get(
    "https://api.cragdata.com/v1/graph/domain-context",
    headers={"Authorization": "Bearer ck_live_YOUR_KEY"},
    params={"seed": "indiehackers.com", "depth_hops": 2},
)
print(r.json()["context_for_ai"])

JavaScript (fetch)

const res = await fetch(
  "https://api.cragdata.com/v1/graph/domain-context?seed=indiehackers.com&depth_hops=2",
  { headers: { Authorization: "Bearer ck_live_YOUR_KEY" } },
);
const ctx = await res.json();
console.log(ctx.context_for_ai, ctx.top_inbound_domains);

Example response

{
  "seed_domain": "indiehackers.com",
  "context_for_ai": "Niche graph for indiehackers.com ...",
  "top_inbound_domains": [
    { "domain": "news.ycombinator.com", "link_count": 12, "niche_score": 1.0, "scrapable": true, "summary": "..." }
  ],
  "top_outbound_domains": [...],
  "related_domains": [...]
}

POST/graph/domain-context

AI niche context (JSON body)

Same as GET; use POST when passing seed_domain in the body from agent tool calls.

Request body

{ "seed_domain": "indiehackers.com", "depth_hops": 2, "cluster_limit": 20 }

Example response

{
  "seed_domain": "indiehackers.com",
  "context_for_ai": "Niche graph for indiehackers.com ...",
  "top_inbound_domains": [
    { "domain": "news.ycombinator.com", "link_count": 12, "niche_score": 1.0, "scrapable": true, "summary": "..." }
  ],
  "top_outbound_domains": [...],
  "related_domains": [...]
}

GET/graph/top-pages

Top pages for RAG

Pages inside a domain ranked by in-degree and depth. Defaults to scrapable_only=true, which keeps only pages that returned HTTP 200 or were already scraped.

Query

?domain=indiehackers.com&limit=20&scrapable_only=true

Example response

{ "domain": "indiehackers.com", "pages": [{ "url": "...", "inlinks": 8, "scrapable": true, "priority_score": 4.2, "summary": "..." }] }

GET/graph/hops

Domain path (hops)

Shortest link path between two domains in your indexed graph — useful to explain how niche players connect.

Query

?from_domain=indiehackers.com&to_domain=stripe.com&max_hops=6

Example response

{ "found": true, "hops": 2, "path": ["indiehackers.com", "..."], "context_for_ai": "..." }

GET/stats

Public stats

Marketing aggregates (pages crawled, domains discovered). No authentication.

Example response

{ "pages_crawled": 1200000, "domains_discovered": 120000, "edges": 1400000, "scrapes": 95000, "pages_200": 980000, "source": "live" }

GET/jobs

List crawl jobs

Paginated crawl jobs for your account (newest first).

Query

?status=completed&limit=20&offset=0

Example response

{ "items": [{ "job_id": "...", "status": "completed" }], "total": 5 }

GET/scrape/data

Stored scrape

Full content for a URL already scraped via POST /scrape.

Query

?url=https://example.com/about

Example response

{ "url": "...", "content": [...], "links": [...] }

GET/nodes

Discovered pages

Pages indexed by crawls or scrapes, filterable by domain.

Query

?domain=example.com&limit=100

Example response

{ "items": [...], "total": 240 }

GET/graph

Link graph

Page-level graph (default) or domain-level aggregation.

Query

?domain=example.com&format=pages&limit=300 (format accepts pages or domains)

Example response

{ "format": "pages", "nodes": [...], "edges": [...] }

GET/graph/stats

Graph stats

Counts of pages, scrapes, edges, and domains in your workspace.

Query

?domain=example.com

Example response

{ "pages": 120, "edges": 890, "domains": 4 }

Webhooks

Configure HTTPS endpoints in Dashboard → Webhooks. We POST JSON when events fire; each delivery includes a signature you can verify with your webhook secret.

Events

crawl.completed — Fired when a POST /crawl job finishes (completed, failed, or canceled).
discover.completed — Fired when a POST /discover job finishes.
page.extracted — Fired after a successful POST /scrape (HTTP 200).

Request headers

X-Crag-Event — event name (same as JSON event)
X-Crag-Signature — HMAC-SHA256 of the raw body using your webhook secret
X-Crag-Webhook-Id — webhook configuration id
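
Verifying X-Crag-Signature with the Python standard library might look like this. The reference does not state how the digest is encoded, so the sketch assumes a lowercase hex string; the function name and the secret value are placeholders.

```python
import hashlib
import hmac


def verify_signature(raw_body: bytes, header_value: str, secret: str) -> bool:
    """Compare X-Crag-Signature with an HMAC-SHA256 of the raw request body."""
    expected = hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    # Assumption: the header carries a lowercase hex digest.
    return hmac.compare_digest(expected, header_value)


# Self-check with a placeholder secret and body.
body = b'{"event": "crawl.completed"}'
good = hmac.new(b"whsec_example", body, hashlib.sha256).hexdigest()
print(verify_signature(body, good, "whsec_example"))
```

Sign the bytes exactly as received, before any JSON parsing, because re-serializing the payload can change whitespace and break the comparison.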

Payload

{
  "event": "crawl.completed",
  "created_at": 1716400000.123,
  "data": {
    "job_id": "job_abc123",
    "status": "completed",
    "pages_crawled": 42
  }
}

Only https:// URLs are accepted. Failed deliveries are logged server-side; retry policy may evolve — treat webhooks as at-least-once.
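
Because a delivery may arrive more than once, handlers should be idempotent. The sketch below deduplicates on a hash of the raw body, which assumes a repeated delivery carries an identical body; the class name is hypothetical, and a production handler would keep the seen set in a shared store rather than in memory.

```python
import hashlib
import json


class WebhookDeduper:
    """Remember delivery fingerprints so repeated deliveries are handled once."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def handle(self, raw_body: bytes) -> bool:
        """Return True if the event was processed, False if it was a duplicate."""
        key = hashlib.sha256(raw_body).hexdigest()
        if key in self._seen:
            return False
        self._seen.add(key)
        event = json.loads(raw_body)
        # Replace this print with real work, such as enqueueing a follow-up job.
        print("processing", event["event"], event["data"].get("job_id"))
        return True


deduper = WebhookDeduper()
payload = (b'{"event": "crawl.completed", "created_at": 1716400000.123, '
           b'"data": {"job_id": "job_abc123", "status": "completed"}}')
print(deduper.handle(payload), deduper.handle(payload))
```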

SDKs

Official clients live in the CragData repo under packages/cragdata-python and packages/cragdata-js. They wrap auth, discover, crawl, scrape, export, and job polling.

Python

pip install httpx  # for raw HTTP calls; the client below comes from the repo: packages/cragdata-python

from cragdata import CragDataClient

client = CragDataClient("ck_live_YOUR_KEY", base_url="https://api.cragdata.com/v1")
ctx = client.domain_context("indiehackers.com", depth_hops=2)
print(ctx["context_for_ai"])
pages = client.top_pages("indiehackers.com", limit=10)
print(pages["pages"][0]["url"])

JavaScript

// packages/cragdata-js in this repo
import { CragDataClient } from "./packages/cragdata-js/index.js";

const client = new CragDataClient("ck_live_YOUR_KEY", {
  baseUrl: "https://api.cragdata.com/v1",
});
const ctx = await client.domainContext("indiehackers.com", { depth_hops: 2 });
console.log(ctx.context_for_ai);
const pages = await client.topPages("indiehackers.com", { limit: 10 });
console.log(pages.pages[0].url);

Errors & limits

Code	Cause	How to fix
`401`	Invalid or revoked API key	Create a new key in Dashboard → API keys
`402`	Monthly quota exhausted	Upgrade to Developer ($10/mo) or Startup
`403`	Endpoint not available on your plan (e.g. POST /export on Free)	Upgrade to Developer or Startup, or call GET /me to see enabled endpoints
`409`	Crawl queue is full	Wait for the current job to finish or cancel it from the dashboard
`429`	Rate limit exceeded	Slow down requests (see X-RateLimit-Limit response header)
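
In client code, the table above can be turned into a small lookup. This is only a sketch: the retry_later flag is an interpretation of the "How to fix" column, and the function name is hypothetical.

```python
# (retry_later, advice) per status code, paraphrasing the table above.
ERROR_ACTIONS = {
    401: (False, "Create a new key in Dashboard -> API keys"),
    402: (False, "Upgrade to Developer or Startup"),
    403: (False, "Upgrade, or call GET /me to see enabled endpoints"),
    409: (True, "Wait for the current job to finish or cancel it"),
    429: (True, "Slow down requests (see X-RateLimit-Limit)"),
}


def next_step(status: int) -> tuple[bool, str]:
    """Return whether the same request may succeed later, plus a hint."""
    return ERROR_ACTIONS.get(status, (False, "Unlisted status; inspect the response body"))


for code in (401, 409, 429):
    print(code, next_step(code))
```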

Plan limits

All plans include crawl, scrape, and graph APIs. Use GET /me for your live quota, rate limit, and read_limits.

Plan	Calls / month	Rate limit	API keys	Discover	Export	Graph / list caps
Free	500	2 req/s	1	25 domains/run	—	50 nodes, 10 scrapes
Developer	10,000	5 req/s	3	200 domains/run	JSONL / Parquet	300 nodes, 50 scrapes
Startup	50,000	20 req/s	5	500 domains/run	JSONL / Parquet	1000 nodes, 200 scrapes
Enterprise	Custom	Custom	Unlimited	Custom	Custom	5000+ nodes
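
To pick a plan from expected load, the first two numeric columns of the table can be checked in order. The helper below is a hypothetical sketch that considers only calls per month and requests per second; other columns such as export access are ignored.

```python
# (name, calls per month, requests per second), copied from the plan table.
PLANS = [
    ("Free", 500, 2),
    ("Developer", 10_000, 5),
    ("Startup", 50_000, 20),
]


def smallest_plan(calls_per_month: int, req_per_sec: float) -> str:
    """Return the first plan whose monthly calls and rate limit both fit."""
    for name, calls, rps in PLANS:
        if calls_per_month <= calls and req_per_sec <= rps:
            return name
    return "Enterprise"  # custom limits


print(smallest_plan(8_000, 3))
```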

Response headers include X-Credits-Remaining and X-RateLimit-Limit (requests per second for your plan).

Dashboard

After signing in, use the app dashboard to manage keys, monitor usage, and billing:

Overview — usage summary and plan status
API keys — create and revoke keys
Usage — calls, credits, and per-endpoint breakdown
Billing — upgrade to Developer ($10/mo), Startup, or manage subscription
Webhooks — register HTTPS URLs and secrets for event delivery
Schedules — recurring crawl or discover jobs (cron via platform scheduler)
Team — invite members to your organization
