Technical Validation for AI Research Teams
Benchmarks and A/B evaluation showing how CragData improves RAG ingestion and agent research—numbers, honest coverage limits, and reproduction steps.
Technical validation for AI research & search teams
This article summarizes a hands-on technical validation of CragData as a building block for AI research, RAG ingestion, and domain-focused search pipelines.
Goal: show evidence (numbers + A/B evaluation) that CragData can improve research outcomes for an AI team—without claiming “index the whole web”.
Positioning — what CragData is / isn’t
What it is
CragData is useful as a domain + niche grounding layer:
1. Build a niche/domain graph from a seed (GET /graph/domain-context).
2. Prioritize what to read (GET /graph/top-pages).
3. Extract structured text for RAG (POST /scrape).
What it isn’t
It is not a global web search engine. Some domains are:
- behind login (302),
- blocked (403),
- heavily JS-rendered (low extracted text),
…and any scrape-based system needs fallbacks or alternate sources for those.
Integration pattern (Python)
The core loop we recommend:
1. domain_context(seed) → context_for_ai + inbound/outbound domain lists
2. top_pages(domain) → most central internal pages (when available)
3. scrape(url) → structured JSON (title, content[], links[], og, word_count, …)
Example with httpx:
import httpx
API = "https://api.cragdata.com/v1"
def crag_headers():
return {"Authorization": "Bearer ck_live_YOUR_KEY"}
def domain_context(client: httpx.Client, seed: str, auto_acquire: bool = True) -> dict:
r = client.get(
f"{API}/graph/domain-context",
params={"seed": seed, "auto_acquire": str(auto_acquire).lower()},
headers=crag_headers(),
)
r.raise_for_status()
return r.json()
def top_pages(client: httpx.Client, domain: str, limit: int = 10) -> dict:
r = client.get(
f"{API}/graph/top-pages",
params={"domain": domain, "limit": limit},
headers=crag_headers(),
)
r.raise_for_status()
return r.json()
def scrape_url(client: httpx.Client, url: str) -> dict:
r = client.post(
f"{API}/scrape",
json={"url": url},
headers=crag_headers(),
)
r.raise_for_status()
return r.json()
Use context_for_ai in your agent system prompt, then scrape the top URLs returned by top_pages.
Bench testing — throughput + usefulness
We ran a bench mode that:
- executes many seeds/rounds,
- calls
domain-contextandtop-pages, - scrapes top candidates,
- stores raw JSON artifacts,
- computes latency percentiles, rate limits, useful scrape rate (text-size proxy), per-seed breakdown, and worst URLs (empty/blocked/redirect).
Bench A — “happy path” RAG ingestion
Controlled run on scrape-friendly domains (startup plan).
Headline results:
- Requests: 95
- API status: 95/95 HTTP 200
- Latency: p50 301 ms, p90 918 ms, p99 1482 ms, max 2227 ms
- Scrapes attempted: 55 — rate limit events: 0
- Useful scrape threshold: ≥ 150 words
- Useful scrapes: 55/55 (100%)
- Extracted text: average ~917.5 words/scrape (min 543, max 1594)
Interpretation: when a site is scrape-friendly and top-pages returns good candidates, CragData delivers high-density text fast enough for RAG pipelines and research agents.
Bench B — coverage across harder seeds
Same bench with a fallback URL strategy when top-pages is empty (/docs, /blog, /pricing, /solutions, /customers, /sitemap.xml, …).
Headline results:
- Useful scrapes: 27/60 (45%)
- Useful seed coverage: 3/4 (75%)
Per-seed highlights:
- cragsoftware.com — 15/15 useful scrapes
- stripe.com — 9/15 useful (some URLs redirect to login, e.g. dashboard)
- anthropic.com — 3/15 useful (mix of 302/404 + some content-rich pages)
- openai.com — 0/15 useful (403 blocking / anti-bot)
Interpretation: CragData is operationally stable, but coverage depends on the target domain. For blocked sites the right product behavior is detect + classify and route to alternate strategies (other domains, cached sources, sanctioned APIs, or rendering).
A/B evaluation — does it improve agent research?
We ran an A/B eval:
- A (baseline): answer the research question with no CragData context.
- B (with CragData): answer with
context_for_ai+ inbound/outbound lists + scraped snippets. - A judge model scores both answers (0–10) and picks a winner.
Configuration: seed cragsoftware.com, answer model gpt-4o-mini, judge model gpt-4o-mini, 3 questions.
Results:
- Winners: B won 3/3
- Average judge score: A 6.67 vs B 9.00
Interpretation: when the agent receives domain-grounded context plus relevant scraped pages, answers become more specific, more actionable, and better grounded—less generic filler.
Example (question 1)
Question: “What are the top 3 capabilities offered, and how would I evaluate quality/risk?”
- Baseline (A) stayed generic (“innovation / delivery / customer support”).
- With CragData (B) the agent listed concrete capabilities from the site: machine learning solutions, data analytics & dashboards, web scraping.
What to say in a sales conversation
1. RAG ingestion quality: “~918 words/page on average; 55/55 useful pages in a controlled bench.”
2. Operational stability: “95/95 API calls returned 200; p90 latency under 1s.”
3. Research quality: “A/B eval: CragData-grounded answers won 3/3 (9.0 vs 6.7 average score).”
4. Honest boundary: “Some sites return 403—we detect that. CragData is domain grounding, not crawl-the-entire-web.”
Reproduce on your stack
Use the same three endpoints from the API docs or playground:
curl "https://api.cragdata.com/v1/graph/domain-context?seed=YOUR_DOMAIN" \
-H "Authorization: Bearer $CRAGDATA_API_KEY"
curl "https://api.cragdata.com/v1/graph/top-pages?domain=YOUR_DOMAIN&limit=8" \
-H "Authorization: Bearer $CRAGDATA_API_KEY"
curl -X POST https://api.cragdata.com/v1/scrape \
-H "Authorization: Bearer $CRAGDATA_API_KEY" \
-H "Content-Type: application/json" \
-d '{"url": "https://YOUR_DOMAIN/"}'
Run your own A/B eval by injecting context_for_ai + 3–5 scraped snippets into the system prompt, then score answers with your judge model.
Next steps
- Start free — 500 calls/month, graph + crawl + scrape
- Docs playground — try responses without wiring Python first
- Building RAG with live web data — recommended pipeline