CragData is web intelligence infrastructure: discover domains, crawl pages, extract structured JSON, and explore link graphs—built for AI agents, RAG, and data products.

What is web intelligence?

Web intelligence is the practice of turning the live web into structured, queryable data—graphs, entities, and documents—rather than one-off HTML snapshots.

What is AI-ready data?

JSON designed for models: clean content blocks, metadata, link graphs, and plain-English summaries like context_for_ai that drop into system prompts.

How should AI agents use CragData?

Call GET /graph/domain-context before broad search, choose domains from the result, rank their pages with GET /graph/top-pages, scrape the top URLs, then embed. Patterns are in /docs and /llms.txt.

What is live retrieval for AI agents?

Fetching fresh web data at query time instead of relying on frozen training corpora—so answers reflect current pages, pricing, and ecosystem links.

What is RAG web crawling?

Crawling only the URLs that matter for a user question, after planning sources with a graph—so embeddings stay relevant and token spend stays low.

How does distributed crawling work?

Jobs are queued, workers run concurrent fetches with retries and rate limits, and results land in graph tables and JSON exports you pull via API.
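
To make the retry step concrete, here is a minimal sketch of a fetch wrapped in exponential backoff. The function names, delay values, and retry count are assumptions for illustration; this is not CragData's worker code.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def backoff_delays(retries: int, base: float = 0.5, cap: float = 8.0) -> list[float]:
    """Exponential delays in seconds: base, 2*base, 4*base, ... capped at `cap`."""
    return [min(cap, base * (2 ** i)) for i in range(retries)]

def fetch_with_retries(
    fetch: Callable[[], T],
    retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fetch`; on an exception, wait and retry up to `retries` more times."""
    delays = backoff_delays(retries)
    for attempt in range(retries + 1):
        try:
            return fetch()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            sleep(delays[attempt])
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting.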

How do anti-bot systems work?

Sites use fingerprints, rate limits, and challenges. We rotate strategies, respect robots.txt, back off on failures, and surface scrapable flags in graph responses.

Why do AI agents need live web crawling?

Models trained on static snapshots cannot see today's pricing, partners, or news. Live crawl + extract keeps agent answers grounded in current pages.

Are static datasets dying for RAG?

For production agents, yes—freshness wins. Snapshots are fine for eval; customer-facing RAG needs scheduled or on-demand live ingestion.

How is CragData different from a scraper library?

We run discovery, queues, crawling, extraction, storage, and APIs as managed infrastructure—not a single-URL fetch you host yourself.

Do you crawl the entire internet?

No. You provide seeds; we discover and crawl within your plan limits and configuration.

What format is the data?

JSON per page (title, content[], links[], metadata) plus graph endpoints for domains, top pages, and hops.

Can I use my own database?

Yes. Pull via REST, export JSONL or Parquet on Developer+, or push events with webhooks into your warehouse.

Do you offer webhooks?

Yes — crawl.completed, discover.completed, page.extracted. Configure HTTPS URLs in Dashboard → Webhooks; payloads are HMAC-signed.
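
A receiver should check the signature before trusting a payload. The sketch below assumes an HMAC-SHA256 digest, hex-encoded, computed over the raw request body with your webhook secret. The hash algorithm, the encoding, and the header that carries the signature are assumptions here; confirm them in /docs.

```python
import hashlib
import hmac

def verify_signature(secret: str, body: bytes, signature_hex: str) -> bool:
    """Return True if `signature_hex` matches the HMAC-SHA256 of the raw body."""
    expected = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected, signature_hex)
```

Verify against the raw bytes you received, not a re-serialized JSON object, since re-serialization can change whitespace and key order.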

Is there an official SDK?

Python and Node clients live under packages/cragdata-python and packages/cragdata-js in our GitHub repo.

What is a niche graph?

A ranked view of who links to a seed domain, where the seed links out, and related clusters—with niche_score and scrapable flags for agents.
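
As an illustration of how an agent might use those two fields, this sketch keeps only scrapable domains and orders them by niche_score. The entry shape (a list of dicts with "domain", "niche_score", and "scrapable" keys) is an assumption; check the actual response schema in /docs.

```python
def pick_domains(entries: list[dict], limit: int = 5) -> list[str]:
    """Return up to `limit` scrapable domains, highest niche_score first."""
    usable = [e for e in entries if e.get("scrapable")]
    usable.sort(key=lambda e: e.get("niche_score", 0.0), reverse=True)
    return [e["domain"] for e in usable[:limit]]
```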

How fast can I integrate?

Most teams send a first /scrape or /graph/domain-context call within 15 minutes using the quickstart at /docs.

What are your rate limits?

Plan-based requests per second; see /docs#errors or GET /me for live caps. Headers include X-Credits-Remaining and X-RateLimit-Limit.
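
A client can read those two headers off any response to decide when to slow down. This is a small sketch that assumes both header values are plain integers; the helper name and the returned keys are ours, not part of an SDK.

```python
from typing import Mapping, Optional

def read_limits(headers: Mapping[str, str]) -> dict[str, Optional[int]]:
    """Pull the two documented headers out of a response, case-insensitively."""
    lowered = {name.lower(): value for name, value in headers.items()}

    def as_int(name: str) -> Optional[int]:
        raw = lowered.get(name.lower())
        try:
            return int(raw) if raw is not None else None
        except ValueError:
            return None  # header present but not numeric

    return {
        "credits_remaining": as_int("X-Credits-Remaining"),
        "rate_limit": as_int("X-RateLimit-Limit"),
    }
```

With httpx, for example, you could call `read_limits(response.headers)` after each request.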

Do you respect robots.txt?

Yes. You are responsible for lawful use; we encourage robots.txt compliance and reasonable crawl rates.

You are responsible for how you use data. Enterprise plans include compliance discussions for sensitive use cases.

Can I monitor competitors?

Yes—seed competitor domains, schedule crawls, and diff structured JSON over time. See /use-cases/competitor-monitoring.
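
The diff step can be as simple as comparing two stored snapshots of the same page. A minimal sketch, assuming each snapshot is the page JSON loaded as a dict and that a top-level field comparison is enough for your use case:

```python
def diff_pages(old: dict, new: dict) -> dict:
    """Report which top-level fields were added, removed, or changed."""
    return {
        "added": sorted(k for k in new if k not in old),
        "removed": sorted(k for k in old if k not in new),
        "changed": sorted(k for k in new if k in old and old[k] != new[k]),
    }
```

For pricing pages you would likely diff the content blocks themselves rather than whole fields; this only shows the shape of the comparison.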

How does crawl orchestration with queues work?

POST /crawl or /discover returns a job_id; workers process the queue; you poll GET /crawl/{job_id} or listen on webhooks.
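
If you poll rather than use webhooks, a bounded loop is usually enough. In this sketch `get_status` stands in for your own call to GET /crawl/{job_id}; the "status" field and its "completed" and "failed" values are assumptions, so check the real job schema in /docs.

```python
import time
from typing import Callable

def wait_for_job(
    get_status: Callable[[], dict],
    interval: float = 2.0,
    max_polls: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the job reports a terminal status or the poll budget runs out."""
    for _ in range(max_polls):
        job = get_status()
        if job.get("status") in ("completed", "failed"):  # assumed terminal values
            return job
        sleep(interval)
    raise TimeoutError(f"job still running after {max_polls} polls")
```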

What is structured extraction?

HTML is normalized to JSON blocks with nav/scripts stripped—ready for search indexes, analytics, or embedding pipelines.
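
To show the idea (not CragData's actual extractor), here is a small standard-library sketch that drops nav, script, and style content and emits heading, paragraph, and list-item text as blocks. The block keys "type" and "text" are assumptions, and nested block tags are handled only roughly.

```python
from html.parser import HTMLParser

SKIP = {"nav", "script", "style"}       # content inside these is discarded
BLOCKS = {"h1", "h2", "h3", "p", "li"}  # tags that become one block each

class BlockExtractor(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.blocks: list[dict] = []
        self._skip_depth = 0
        self._tag = None
        self._buf: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self._skip_depth += 1
        elif tag in BLOCKS and self._skip_depth == 0:
            self._tag, self._buf = tag, []

    def handle_endtag(self, tag):
        if tag in SKIP:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag == self._tag:
            # Collapse runs of whitespace before storing the block.
            text = " ".join("".join(self._buf).split())
            if text:
                self.blocks.append({"type": tag, "text": text})
            self._tag = None

    def handle_data(self, data):
        if self._tag and self._skip_depth == 0:
            self._buf.append(data)

def extract_blocks(html: str) -> list[dict]:
    parser = BlockExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks
```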

CragData vs Firecrawl?

Firecrawl is strong for page-to-markdown. CragData adds niche graphs and agent-first source planning. See /compare/cragdata-vs-firecrawl.

CragData vs Apify?

Apify is an actor marketplace. CragData is an opinionated intelligence API with graphs and managed pipelines. See /compare/cragdata-vs-apify.

What uptime do you target?

Production API at api.cragdata.com with health at /v1/health. Enterprise SLAs available; status practices documented for ops teams.

Do you support always-on crawls?

Yes—Dashboard → Schedules runs recurring discover/crawl jobs for monitoring use cases.

What is context_for_ai?

A plain-English summary in graph responses describing the niche topology—designed to paste into an agent system prompt before retrieval.

Who is CragData built for?

Teams shipping AI agents, RAG products, GTM enrichment, SEO research, and market intelligence who need live structured web data at scale.

May 26, 2026

Technical Validation for AI Research Teams

Benchmarks and A/B evaluation showing how CragData improves RAG ingestion and agent research—numbers, honest coverage limits, and reproduction steps.

validation
benchmarks
rag
ai-agents

Technical validation for AI research & search teams

This article summarizes a hands-on technical validation of CragData as a building block for AI research, RAG ingestion, and domain-focused search pipelines.

Goal: show evidence (numbers + A/B evaluation) that CragData can improve research outcomes for an AI team, without claiming to “index the whole web”.

Positioning — what CragData is / isn’t

What it is

CragData is useful as a domain + niche grounding layer:

1. Build a niche/domain graph from a seed (GET /graph/domain-context).

2. Prioritize what to read (GET /graph/top-pages).

3. Extract structured text for RAG (POST /scrape).

What it isn’t

It is not a global web search engine. Some domains are:

behind login (302),
blocked (403),
heavily JS-rendered (low extracted text),

…and any scrape-based system needs fallbacks or alternate sources for those.
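
A pipeline can sort scrape results into these cases before deciding on a fallback. This sketch assumes you have the upstream HTTP status and an extracted word count for each URL; the label names are ours, and the 150-word default mirrors the usefulness threshold used in the bench below.

```python
def classify_scrape(status_code: int, word_count: int, min_words: int = 150) -> str:
    """Label one scrape result so the caller can route it to a fallback."""
    if 300 <= status_code < 400:
        return "redirect"   # e.g. 302 to a login page
    if status_code == 403:
        return "blocked"    # anti-bot or access denied
    if status_code >= 400:
        return "error"
    if word_count < min_words:
        return "low_text"   # often a heavily JS-rendered page
    return "useful"
```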

Integration pattern (Python)

The core loop we recommend:

1. domain_context(seed) → context_for_ai + inbound/outbound domain lists

2. top_pages(domain) → most central internal pages (when available)

3. scrape(url) → structured JSON (title, content[], links[], og, word_count, …)

Example with httpx:

import httpx

API = "https://api.cragdata.com/v1"

def crag_headers() -> dict:
    """Auth header sent with every request; replace the placeholder with your key."""
    return {"Authorization": "Bearer ck_live_YOUR_KEY"}

def domain_context(client: httpx.Client, seed: str, auto_acquire: bool = True) -> dict:
    """Build the niche/domain graph for a seed domain."""
    r = client.get(
        f"{API}/graph/domain-context",
        # Send the flag as a lowercase "true"/"false" query value.
        params={"seed": seed, "auto_acquire": str(auto_acquire).lower()},
        headers=crag_headers(),
    )
    r.raise_for_status()
    return r.json()

def top_pages(client: httpx.Client, domain: str, limit: int = 10) -> dict:
    """List the most central internal pages for a domain (may be empty)."""
    r = client.get(
        f"{API}/graph/top-pages",
        params={"domain": domain, "limit": limit},
        headers=crag_headers(),
    )
    r.raise_for_status()
    return r.json()

def scrape_url(client: httpx.Client, url: str) -> dict:
    """Extract structured JSON for a single URL."""
    r = client.post(
        f"{API}/scrape",
        json={"url": url},
        headers=crag_headers(),
    )
    r.raise_for_status()
    return r.json()

Use context_for_ai in your agent system prompt, then scrape the top URLs returned by top_pages.
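
One way to assemble that prompt is sketched below. It assumes the domain-context response carries a "context_for_ai" string and that each scraped page has a "title" and a "content" list whose items can be rendered as text; the exact block shape is an assumption, so adapt the join to the real schema.

```python
def build_system_prompt(
    context: dict,
    pages: list[dict],
    max_snippets: int = 5,
    max_words: int = 120,
) -> str:
    """Combine context_for_ai with short snippets from scraped pages."""
    parts = ["Domain context:", context.get("context_for_ai", "").strip()]
    for page in pages[:max_snippets]:
        # Flatten content blocks to text and keep only the first max_words words.
        text = " ".join(str(block) for block in page.get("content", []))
        snippet = " ".join(text.split()[:max_words])
        parts.append(f"Source: {page.get('title', 'untitled')}\n{snippet}")
    return "\n\n".join(p for p in parts if p)
```

Capping words per snippet keeps the prompt size predictable when pages are long.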

Bench testing — throughput + usefulness

We ran a bench mode that:

executes many seeds/rounds,
calls domain-context and top-pages,
scrapes top candidates,
stores raw JSON artifacts,
computes latency percentiles, rate-limit events, useful scrape rate (text-size proxy), per-seed breakdown, and worst URLs (empty/blocked/redirect).

Bench A — “happy path” RAG ingestion

Controlled run on scrape-friendly domains (startup plan).

Headline results:

Requests: 95
API status: 95/95 HTTP 200
Latency: p50 301 ms, p90 918 ms, p99 1482 ms, max 2227 ms
Scrapes attempted: 55 — rate limit events: 0
Useful scrape threshold: ≥ 150 words
Useful scrapes: 55/55 (100%)
Extracted text: average ~917.5 words/scrape (min 543, max 1594)

Interpretation: when a site is scrape-friendly and top-pages returns good candidates, CragData delivers high-density text fast enough for RAG pipelines and research agents.
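
If you want to compute the same kind of metrics on your own runs, a sketch like the following is enough. It uses the nearest-rank percentile method, which is an assumption (the bench's exact method is not stated here), and expects non-empty lists of latencies and word counts.

```python
import math

def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def useful_rate(word_counts: list[int], min_words: int = 150) -> float:
    """Share of scrapes whose extracted text meets the word threshold."""
    useful = sum(1 for words in word_counts if words >= min_words)
    return useful / len(word_counts)
```

With small samples, p99 is effectively the maximum under nearest-rank, so read tail numbers from short runs with care.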

Bench B — coverage across harder seeds

Same bench with a fallback URL strategy when top-pages is empty (/docs, /blog, /pricing, /solutions, /customers, /sitemap.xml, …).
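
The fallback logic can be sketched as follows. Only the paths named above are included, since the full list is not given here; the function name, the https scheme, and the limit of 15 (matching the 15 scrapes per seed in the results) are assumptions.

```python
# Paths named in the article; the bench's full list is longer and not shown here.
FALLBACK_PATHS = ["/docs", "/blog", "/pricing", "/solutions", "/customers", "/sitemap.xml"]

def candidate_urls(domain: str, top_page_urls: list[str], limit: int = 15) -> list[str]:
    """Prefer top-pages results; fall back to common paths when they are empty."""
    if top_page_urls:
        return top_page_urls[:limit]
    return [f"https://{domain}{path}" for path in FALLBACK_PATHS][:limit]
```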

Headline results:

Useful scrapes: 27/60 (45%)
Useful seed coverage: 3/4 (75%)

Per-seed highlights:

cragsoftware.com — 15/15 useful scrapes
stripe.com — 9/15 useful (some URLs redirect to login, e.g. dashboard)
anthropic.com — 3/15 useful (mix of 302/404 + some content-rich pages)
openai.com — 0/15 useful (403 blocking / anti-bot)

Interpretation: CragData is operationally stable, but coverage depends on the target domain. For blocked sites, the right product behavior is to detect and classify the failure, then route to alternate strategies (other domains, cached sources, sanctioned APIs, or rendering).

A/B evaluation — does it improve agent research?

We ran an A/B eval:

A (baseline): answer the research question with no CragData context.
B (with CragData): answer with context_for_ai + inbound/outbound lists + scraped snippets.
A judge model scores both answers (0–10) and picks a winner.

Configuration: seed cragsoftware.com, answer model gpt-4o-mini, judge model gpt-4o-mini, 3 questions.

Results:

Winners: B won 3/3
Average judge score: A 6.67 vs B 9.00

Interpretation: when the agent receives domain-grounded context plus relevant scraped pages, answers become more specific, more actionable, and better grounded—less generic filler.
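
For your own runs, the win count and averages can be tallied with a few lines. In this sketch each item is an (A score, B score) pair from the judge, and the winner is derived from the scores rather than from a separate judge verdict, which is a simplifying assumption.

```python
from statistics import mean

def summarize_ab(scores: list[tuple[float, float]]) -> dict:
    """Tally wins and average judge scores from (a_score, b_score) pairs."""
    wins_a = sum(1 for a, b in scores if a > b)
    wins_b = sum(1 for a, b in scores if b > a)
    return {
        "wins_a": wins_a,
        "wins_b": wins_b,
        "ties": len(scores) - wins_a - wins_b,
        "avg_a": round(mean(a for a, _ in scores), 2),
        "avg_b": round(mean(b for _, b in scores), 2),
    }
```

A sample of three questions is small, so treat the averages as directional and rerun with more questions on your own domain.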

Example (question 1)

Question: “What are the top 3 capabilities offered, and how would I evaluate quality/risk?”

Baseline (A) stayed generic (“innovation / delivery / customer support”).
With CragData (B) the agent listed concrete capabilities from the site: machine learning solutions, data analytics & dashboards, web scraping.

What to say in a sales conversation

1. RAG ingestion quality: “~918 words/page on average; 55/55 useful pages in a controlled bench.”

2. Operational stability: “95/95 API calls returned 200; p90 latency under 1s.”

3. Research quality: “A/B eval: CragData-grounded answers won 3/3 (9.0 vs 6.7 average score).”

4. Honest boundary: “Some sites return 403—we detect that. CragData is domain grounding, not crawl-the-entire-web.”

Reproduce on your stack

Use the same three endpoints from the API docs or playground:

curl "https://api.cragdata.com/v1/graph/domain-context?seed=YOUR_DOMAIN" \
  -H "Authorization: Bearer $CRAGDATA_API_KEY"

curl "https://api.cragdata.com/v1/graph/top-pages?domain=YOUR_DOMAIN&limit=8" \
  -H "Authorization: Bearer $CRAGDATA_API_KEY"

curl -X POST https://api.cragdata.com/v1/scrape \
  -H "Authorization: Bearer $CRAGDATA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://YOUR_DOMAIN/"}'

Run your own A/B eval by injecting context_for_ai + 3–5 scraped snippets into the system prompt, then score answers with your judge model.

Next steps

Start free — 500 calls/month, graph + crawl + scrape
Docs playground — try responses without wiring Python first
Building RAG with live web data — recommended pipeline