# CragData

> Niche graph API for AI agents. Crawl a seed domain, get back who links to who, ranked by strength and structured for system prompts. Use before RAG, not instead of it.

- Website: https://www.cragdata.com
- About (citable facts): https://www.cragdata.com/#about-cragdata
- AI discovery: https://www.cragdata.com/.well-known/ai.txt
- Base URL: https://api.cragdata.com/v1
- Auth: `Authorization: Bearer ck_live_YOUR_KEY`
- Docs: https://www.cragdata.com/docs
- Dashboard: https://www.cragdata.com/dashboard

---

## What it does

CragData answers: **"where should my agent look before it searches the web?"**

You pass a seed domain. CragData crawls it, maps every outbound link, groups URLs by domain, and returns a ranked graph: inbound domains (who links to the seed), outbound domains (where the seed links out), and a cluster of related domains, all with `niche_score` (0–1) and `scrapable` flags.

The `context_for_ai` field in every response is a plain English summary designed to drop directly into a system prompt.

---

## Core endpoints

### GET /graph/domain-context

The main endpoint. Returns the niche graph around a seed domain.

**Parameters:**

- `seed` (required): domain to build the graph around, e.g. `stripe.com`
- `auto_acquire` (bool, default true): crawl live if not in workspace yet
- `depth_hops` (int 1–4, default 2): BFS depth for related domain cluster
- `cluster_limit` (int 5–50, default 20): max domains per section

**Response shape:**

```json
{
  "seed_domain": "stripe.com",
  "context_for_ai": "Niche graph for stripe.com (depth 2 hops). 8 strong referrers, 12 destinations, 5 related domains. Use top_inbound/outbound/related to plan RAG sources before broad web search.",
  "seed": { "domain": "stripe.com", "pages_indexed": 24, "scrapable_pages": 22 },
  "top_inbound_domains": [
    {
      "domain": "ycombinator.com",
      "link_count": 14,
      "niche_score": 1.0,
      "scrapable": true,
      "summary": "ycombinator.com: 12 indexed pages, 12 scrapable. Role: links to seed."
    }
  ],
  "top_outbound_domains": [...],
  "related_domains": [...],
  "cached": true
}
```

First call crawls live (3–15 s). Subsequent calls return from snapshot cache (~400 ms, `"cached": true`).

---

### GET /graph/top-pages

Returns the most-linked pages inside a domain, ranked by in-graph inlinks.

**Parameters:**

- `domain` (required): e.g. `stripe.com`
- `limit` (int 1–100, default 20)
- `scrapable_only` (bool, default true)

**Use:** after `domain-context`, pick the best domain, then call `top-pages` to know which URLs to actually read.

---

### GET /graph/hops

Shortest domain path between two sites.

**Parameters:** `from_domain`, `to_domain`, `max_hops` (default 6)

---

### POST /crawl

Start a crawl job from a seed URL.

**Body:** `{ "url": "https://example.com", "depth": 2, "max_pages": 100 }`

**Returns:** `{ "job_id": "job_abc123", "status": "queued" }`

---

### POST /scrape

Extract structured JSON from a URL: title, description, content[], links[], og metadata.

**Body:** `{ "url": "https://example.com/page" }`

---

### POST /discover

Find new domains from a seed URL without a full crawl.

**Body:** `{ "url": "https://example.com", "max_domains": 50 }`

---

### GET /jobs, GET /crawl/{job_id}

List or poll crawl/discover jobs.

---

### GET /me

Returns your plan, calls remaining, rate limit, and read limits.

---

## Plans

| Plan | Calls/mo | Price |
|---|---|---|
| Free | 500 | $0 (no credit card) |
| Developer | 10,000 | $10/mo |
| Startup | 50,000 | $99/mo |
| Enterprise | Custom | Custom |

Rate limit: 2 req/s (free), 5 req/s (developer), 20 req/s (startup).

Response headers: `X-Credits-Remaining`, `X-RateLimit-Limit`.

---

## Recommended agent pattern

```
1. niche_graph(seed_domain)          → understand the niche topology
2. pick top inbound/outbound domains → know where to look
3. top_pages(best_domain)            → which pages inside that domain
4. scrape(url) for top 3–5 pages     → structured content for embedding
5. embed + answer                    → grounded RAG response
```

---

## Python: copy-paste integration

```python
import os
import httpx

API = "https://api.cragdata.com/v1"
HDR = {"Authorization": f"Bearer {os.environ['CRAGDATA_API_KEY']}"}

def niche_graph(seed: str) -> dict:
    """Call before RAG. Returns niche topology + context_for_ai string."""
    return httpx.get(f"{API}/graph/domain-context",
                     params={"seed": seed, "auto_acquire": "true"},
                     headers=HDR, timeout=30).json()

def top_pages(domain: str, limit: int = 5) -> list:
    return httpx.get(f"{API}/graph/top-pages",
                     params={"domain": domain, "limit": limit},
                     headers=HDR).json().get("pages", [])

def scrape(url: str) -> dict:
    return httpx.post(f"{API}/scrape", json={"url": url}, headers=HDR).json()

# Usage
ctx = niche_graph("stripe.com")
system = f"Niche context: {ctx['context_for_ai']}"
pages = top_pages(ctx["top_inbound_domains"][0]["domain"])
docs = [scrape(p["url"]) for p in pages[:3]]
```

---

## LangChain tools: copy-paste

```python
from langchain.tools import tool
import os
import httpx

_API = "https://api.cragdata.com/v1"
_HDR = {"Authorization": f"Bearer {os.environ['CRAGDATA_API_KEY']}"}

@tool
def niche_graph(seed_domain: str) -> str:
    """
    Returns niche graph for a domain: inbound/outbound players, cluster scores,
    and a context_for_ai summary. Call this BEFORE web search to plan sources.
    Input: a domain like 'stripe.com' or 'ycombinator.com'.
    """
    r = httpx.get(f"{_API}/graph/domain-context",
                  params={"seed": seed_domain, "auto_acquire": "true"},
                  headers=_HDR, timeout=30)
    d = r.json()
    return d.get("context_for_ai", "No graph data yet for this domain.")

@tool
def top_pages(domain: str) -> str:
    """
    Returns the top scrapable pages inside a domain, ranked by in-graph inlinks.
    Use after niche_graph to decide which URLs to read.
    Input: a domain like 'stripe.com'.
    """
    r = httpx.get(f"{_API}/graph/top-pages",
                  params={"domain": domain, "limit": 10},
                  headers=_HDR)
    pages = r.json().get("pages", [])
    return "\n".join(f'{p["url"]} (score {p["priority_score"]})' for p in pages)
```

---

## CLAUDE.md block

Paste this into your `CLAUDE.md` or agent system prompt:

```markdown
## CragData: niche graph tool

Before searching the web for any research task, call niche_graph(seed_domain) to understand which domains are relevant. This saves tokens and grounds sources.

### Tools
- niche_graph(seed) → context_for_ai string + ranked inbound/outbound/related domains
- top_pages(domain) → scrapable pages ranked by in-graph inlinks
- scrape(url) → structured JSON: title, content[], links[], og

### When to use
- User asks to research a company, topic, or market → call niche_graph first
- Need to read pages from a domain → call top_pages, then scrape top results
- Never do broad web search before calling niche_graph

### Pattern
niche_graph("seed.com") → pick domains → top_pages(domain) → scrape → embed → answer

### Auth
Set CRAGDATA_API_KEY in environment. Base URL: https://api.cragdata.com/v1
```

---

## Error codes

| Code | Cause |
|---|---|
| 400 | Invalid seed domain |
| 401 | Missing or invalid API key |
| 402 | Monthly quota exceeded; upgrade plan |
| 429 | Rate limit hit; slow down |
| 502 | Live crawl failed (site blocked or returned non-200) |

---

## SDKs

- Python: `packages/cragdata-python` (github.com/joaobenedetmachado/cragdata)
- JavaScript/Node: `packages/cragdata-js`

Install: `pip install cragdata` / `npm install cragdata`