# CragData

> Niche graph API for AI agents. Crawl a seed domain and get back who links to whom, ranked by strength and structured for system prompts. Use before RAG, not instead of it.

Website: https://www.cragdata.com
About (citable facts): https://www.cragdata.com/#about-cragdata
AI discovery: https://www.cragdata.com/.well-known/ai.txt
Base URL: https://api.cragdata.com/v1
Auth: `Authorization: Bearer ck_live_YOUR_KEY`
Docs: https://www.cragdata.com/docs
Dashboard: https://www.cragdata.com/dashboard

---

## What it does

CragData answers: **"where should my agent look before it searches the web?"**

You pass a seed domain. CragData crawls it, maps every outbound link, groups URLs by domain, and returns a ranked graph with three sections: inbound domains (who links to the seed), outbound domains (where the seed links out), and a cluster of related domains. Every entry carries a `niche_score` (0–1) and a `scrapable` flag.

The `context_for_ai` field in every response is a plain-English summary designed to drop directly into a system prompt.

---

## Core endpoints

### GET /graph/domain-context
The main endpoint. Returns the niche graph around a seed domain.

**Parameters:**
- `seed` (required) — domain to build graph around, e.g. `stripe.com`
- `auto_acquire` (bool, default true) — crawl live if not in workspace yet
- `depth_hops` (int 1–4, default 2) — BFS depth for related domain cluster
- `cluster_limit` (int 5–50, default 20) — max domains per section

**Response shape:**
```json
{
  "seed_domain": "stripe.com",
  "context_for_ai": "Niche graph for stripe.com (depth 2 hops). 8 strong referrers, 12 destinations, 5 related domains. Use top_inbound/outbound/related to plan RAG sources before broad web search.",
  "seed": {
    "domain": "stripe.com",
    "pages_indexed": 24,
    "scrapable_pages": 22
  },
  "top_inbound_domains": [
    {
      "domain": "ycombinator.com",
      "link_count": 14,
      "niche_score": 1.0,
      "scrapable": true,
      "summary": "ycombinator.com: 12 indexed pages, 12 scrapable. Role: links to seed."
    }
  ],
  "top_outbound_domains": [...],
  "related_domains": [...],
  "cached": true
}
```

First call crawls live (3–15 s). Subsequent calls return from snapshot cache (~400 ms, `"cached": true`).
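
Once a response is in hand, the `niche_score` and `scrapable` fields can be used to shortlist domains. The sketch below is one way to do that, not part of the API: `pick_sources`, the 0.5 threshold, and the sample data are illustrative, and it assumes outbound and related entries share the shape shown for `top_inbound_domains`.

```python
def pick_sources(graph: dict, min_score: float = 0.5, limit: int = 5) -> list[str]:
    """Return scrapable domains from a domain-context response, strongest first."""
    sections = ("top_inbound_domains", "top_outbound_domains", "related_domains")
    best: dict[str, float] = {}
    for section in sections:
        for entry in graph.get(section, []):
            if not entry.get("scrapable"):
                continue  # skip domains the API marks as not scrapable
            score = entry.get("niche_score", 0.0)
            if score >= min_score:
                # Keep the highest score if a domain appears in several sections.
                best[entry["domain"]] = max(score, best.get(entry["domain"], 0.0))
    ranked = sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))
    return [domain for domain, _ in ranked[:limit]]


# Hand-written sample in the documented response shape (not real API output).
sample = {
    "top_inbound_domains": [
        {"domain": "ycombinator.com", "niche_score": 1.0, "scrapable": True},
        {"domain": "example.org", "niche_score": 0.9, "scrapable": False},
    ],
    "top_outbound_domains": [
        {"domain": "example.net", "niche_score": 0.6, "scrapable": True},
        {"domain": "example.com", "niche_score": 0.2, "scrapable": True},
    ],
    "related_domains": [],
}
print(pick_sources(sample))  # ['ycombinator.com', 'example.net']
```

The threshold is a judgment call; lowering `min_score` widens the shortlist at the cost of weaker matches.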

---

### GET /graph/top-pages
Returns the most-linked pages inside a domain, ranked by in-graph inlinks.

**Parameters:**
- `domain` (required) — e.g. `stripe.com`
- `limit` (int 1–100, default 20)
- `scrapable_only` (bool, default true)

**Use:** after `domain-context`, pick the best domain, then call `top-pages` to find which URLs are worth reading.

---

### GET /graph/hops
Shortest domain path between two sites.

**Parameters:** `from_domain`, `to_domain`, `max_hops` (default 6)
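
A minimal sketch of a `hops` call, using only the Python standard library so it runs without extra dependencies. The helper names `hops_url` and `hops` are illustrative, and the response shape is not documented in this file, so the sketch returns the decoded JSON as-is.

```python
import json
import os
import urllib.parse
import urllib.request

API = "https://api.cragdata.com/v1"


def hops_url(from_domain: str, to_domain: str, max_hops: int = 6) -> str:
    """Build the GET /graph/hops URL from the three documented parameters."""
    query = urllib.parse.urlencode(
        {"from_domain": from_domain, "to_domain": to_domain, "max_hops": max_hops}
    )
    return f"{API}/graph/hops?{query}"


def hops(from_domain: str, to_domain: str, max_hops: int = 6) -> dict:
    """Send the request and return the decoded JSON body."""
    req = urllib.request.Request(
        hops_url(from_domain, to_domain, max_hops),
        headers={"Authorization": f"Bearer {os.environ['CRAGDATA_API_KEY']}"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


# Building the URL needs no network access or API key.
print(hops_url("stripe.com", "ycombinator.com", max_hops=4))
```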

---

### POST /crawl
Start a crawl job from a seed URL.

**Body:** `{ "url": "https://example.com", "depth": 2, "max_pages": 100 }`

**Returns:** `{ "job_id": "job_abc123", "status": "queued" }`

---

### POST /scrape
Extract structured JSON from a URL: title, description, content[], links[], og metadata.

**Body:** `{ "url": "https://example.com/page" }`
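
Before embedding, a scrape result usually needs flattening into one text block. The sketch below shows one possible approach; `to_embedding_text` and `max_chars` are illustrative names, and it assumes `content[]` holds plain strings, which this file does not confirm.

```python
def to_embedding_text(page: dict, max_chars: int = 2000) -> str:
    """Flatten a /scrape result into one text block for an embedding model."""
    parts = [page.get("title", ""), page.get("description", "")]
    # Assumption: content[] holds plain strings; anything else is stringified.
    parts.extend(str(block) for block in page.get("content", []))
    # Drop empty or whitespace-only pieces, then join as paragraphs.
    text = "\n\n".join(p.strip() for p in parts if p and p.strip())
    return text[:max_chars]


# Hand-written sample (not real API output).
sample = {"title": "Pricing", "description": "", "content": ["Plans start free.", "  "]}
print(to_embedding_text(sample))
```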

---

### POST /discover
Find new domains from a seed URL without a full crawl.

**Body:** `{ "url": "https://example.com", "max_domains": 50 }`

---

### GET /jobs, GET /crawl/{job_id}
List or poll crawl/discover jobs.
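
Polling can be sketched as a small loop around `GET /crawl/{job_id}`. Only the `queued` status appears in this file, so the terminal status names below are placeholders to adjust against real responses. `wait_for_job` takes the fetch function as an argument, which keeps the loop independent of any HTTP client.

```python
import time
from typing import Callable

# Assumption: placeholder terminal statuses; only "queued" is documented here.
TERMINAL = {"completed", "failed"}


def wait_for_job(
    fetch_job: Callable[[str], dict],
    job_id: str,
    interval: float = 2.0,
    max_polls: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_job(job_id) until the job reaches a terminal status."""
    for _ in range(max_polls):
        job = fetch_job(job_id)
        if job.get("status") in TERMINAL:
            return job
        sleep(interval)
    raise TimeoutError(f"job {job_id} still running after {max_polls} polls")


# Stand-in for GET /crawl/{job_id}, so the sketch runs offline.
_replies = iter([{"status": "queued"}, {"status": "running"}, {"status": "completed"}])
final = wait_for_job(lambda _id: next(_replies), "job_abc123", sleep=lambda s: None)
print(final["status"])  # completed
```

In real use, `fetch_job` would wrap an authenticated request to `/crawl/{job_id}` and return its JSON body.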

---

### GET /me
Returns your plan, calls remaining, rate limit, and read limits.

---

## Plans

| Plan | Calls/mo | Price |
|---|---|---|
| Free | 500 | $0 — no CC |
| Developer | 10,000 | $10/mo |
| Startup | 50,000 | $99/mo |
| Enterprise | Custom | Custom |

Rate limits: 2 req/s (Free), 5 req/s (Developer), 20 req/s (Startup).

Response headers: `X-Credits-Remaining`, `X-RateLimit-Limit`.
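
To stay under these limits on the client side, calls can be spaced out before they are sent. The class below is a rough sketch of that idea, not part of any CragData SDK; `Throttle` and `PLAN_RPS` are illustrative names, and the clock and sleep functions are injectable so the logic can be tested without waiting.

```python
import time

# Requests per second by plan, as listed in the Plans section.
PLAN_RPS = {"free": 2, "developer": 5, "startup": 20}


class Throttle:
    """Space out calls so they stay under a plan's requests-per-second limit."""

    def __init__(self, plan: str, clock=time.monotonic, sleep=time.sleep):
        self.min_gap = 1.0 / PLAN_RPS[plan]
        self._clock = clock
        self._sleep = sleep
        self._next_ok = 0.0  # earliest time the next call may go out

    def wait(self) -> float:
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        delay = max(0.0, self._next_ok - now)
        if delay:
            self._sleep(delay)
        self._next_ok = max(now, self._next_ok) + self.min_gap
        return delay


throttle = Throttle("free")
throttle.wait()  # call before each request; the first call never sleeps
print(throttle.min_gap)  # 0.5
```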

---

## Recommended agent pattern

```
1. niche_graph(seed_domain)          → understand the niche topology
2. pick top inbound/outbound domains → know where to look
3. top_pages(best_domain)            → which pages inside that domain
4. scrape(url) for top 3–5 pages     → structured content for embedding
5. embed + answer                    → grounded RAG response
```

---

## Python — copy-paste integration

```python
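import os

import httpx

API = "https://api.cragdata.com/v1"
HDR = {"Authorization": f"Bearer {os.environ['CRAGDATA_API_KEY']}"}

def niche_graph(seed: str) -> dict:
    """Call before RAG. Returns niche topology + context_for_ai string."""
    r = httpx.get(f"{API}/graph/domain-context",
                  params={"seed": seed, "auto_acquire": "true"},
                  headers=HDR, timeout=30)  # a first, live crawl can take 3-15 s
    r.raise_for_status()  # surface 4xx/5xx instead of parsing an error body
    return r.json()

def top_pages(domain: str, limit: int = 5) -> list:
    """Most-linked pages inside a domain (scrapable_only defaults to true)."""
    r = httpx.get(f"{API}/graph/top-pages",
                  params={"domain": domain, "limit": limit},
                  headers=HDR, timeout=30)
    r.raise_for_status()
    return r.json().get("pages", [])

def scrape(url: str) -> dict:
    """Structured JSON for one URL: title, description, content[], links[], og."""
    r = httpx.post(f"{API}/scrape", json={"url": url}, headers=HDR, timeout=30)
    r.raise_for_status()
    return r.json()

# Usage
ctx = niche_graph("stripe.com")
system = f"Niche context: {ctx['context_for_ai']}"
inbound = ctx.get("top_inbound_domains", [])
pages = top_pages(inbound[0]["domain"]) if inbound else []  # seed may have no referrers
docs = [scrape(p["url"]) for p in pages[:3]]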
```

---

## LangChain tools — copy-paste

```python
import os

import httpx
from langchain.tools import tool

_API = "https://api.cragdata.com/v1"
_HDR = {"Authorization": f"Bearer {os.environ['CRAGDATA_API_KEY']}"}

@tool
def niche_graph(seed_domain: str) -> str:
    """
    Returns niche graph for a domain: inbound/outbound players, cluster scores,
    and a context_for_ai summary. Call this BEFORE web search to plan sources.
    Input: a domain like 'stripe.com' or 'ycombinator.com'.
    """
    r = httpx.get(f"{_API}/graph/domain-context",
                  params={"seed": seed_domain, "auto_acquire": "true"},
                  headers=_HDR, timeout=30)
    if r.status_code != 200:
        # Report API errors (401, 402, 429, ...) to the agent instead of masking them.
        return f"CragData error {r.status_code} for {seed_domain}."
    return r.json().get("context_for_ai", "No graph data yet for this domain.")

@tool
def top_pages(domain: str) -> str:
    """
    Returns the top scrapable pages inside a domain, ranked by in-graph inlinks.
    Use after niche_graph to decide which URLs to read.
    Input: a domain like 'stripe.com'.
    """
    r = httpx.get(f"{_API}/graph/top-pages",
                  params={"domain": domain, "limit": 10},
                  headers=_HDR, timeout=30)
    if r.status_code != 200:
        return f"CragData error {r.status_code} for {domain}."
    pages = r.json().get("pages", [])
    if not pages:
        return "No pages found for this domain."
    return "\n".join(f'{p["url"]} (score {p["priority_score"]})' for p in pages)
```

---

## CLAUDE.md block

Paste this into your `CLAUDE.md` or agent system prompt:

```markdown
## CragData — niche graph tool

Before searching the web for any research task, call niche_graph(seed_domain)
to understand which domains are relevant. This saves tokens and grounds sources.

### Tools
- niche_graph(seed) → context_for_ai string + ranked inbound/outbound/related domains
- top_pages(domain) → scrapable pages ranked by in-graph inlinks
- scrape(url) → structured JSON: title, content[], links[], og

### When to use
- User asks to research a company, topic, or market → call niche_graph first
- Need to read pages from a domain → call top_pages, then scrape top results
- Never do broad web search before calling niche_graph

### Pattern
niche_graph("seed.com") → pick domains → top_pages(domain) → scrape → embed → answer

### Auth
Set CRAGDATA_API_KEY in environment. Base URL: https://api.cragdata.com/v1
```

---

## Error codes

| Code | Cause |
|---|---|
| 400 | Invalid seed domain |
| 401 | Missing or invalid API key |
| 402 | Monthly quota exceeded (upgrade plan) |
| 429 | Rate limit hit (slow down) |
| 502 | Live crawl failed (site blocked or returned non-200) |
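
In client code, the table above can be turned into a small lookup that tells an agent what to do next. This is a sketch under stated assumptions: the status codes and causes come from the table, while the action names, `classify`, and the backoff schedule for 429 responses are illustrative choices rather than documented behavior.

```python
# Codes and causes from the table above; the action names are illustrative.
ERROR_ACTIONS = {
    400: ("invalid seed domain", "fix_request"),
    401: ("missing or invalid API key", "fix_credentials"),
    402: ("monthly quota exceeded", "stop"),
    429: ("rate limit hit", "retry_later"),
    502: ("live crawl failed", "skip_domain"),
}


def classify(status_code: int) -> tuple[str, str]:
    """Map an HTTP status code to a (cause, suggested action) pair."""
    if 200 <= status_code < 300:
        return ("ok", "continue")
    return ERROR_ACTIONS.get(status_code, ("unexpected status", "raise"))


def backoff_delays(retries: int = 4, base: float = 0.5, cap: float = 8.0) -> list[float]:
    """Exponential backoff schedule for 429 responses: base, 2*base, 4*base, ..."""
    return [min(cap, base * 2 ** i) for i in range(retries)]


print(classify(429))     # ('rate limit hit', 'retry_later')
print(backoff_delays())  # [0.5, 1.0, 2.0, 4.0]
```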

---

## SDKs

- Python: `packages/cragdata-python` (github.com/joaobenedetmachado/cragdata)
- JavaScript/Node: `packages/cragdata-js`

Install: `pip install cragdata` / `npm install cragdata`