What CragData is
- Niche/domain graph from a seed (`/graph/domain-context`)
- Prioritized reading list (`/graph/top-pages`)
- Structured text for RAG (`/scrape`)
Give your agents access to knowledge they didn't know existed. Discover hidden documents, subdomains, APIs, and relationships across the web — before you crawl.
1.2M+ pages crawled · 120k+ domains discovered
LLMs hallucinate on stale corpora. RAG breaks when embeddings are weeks old. You need a live structured web layer—not another scraper.
You export a dataset once, embed it, and ship. Two weeks later pricing, partners, and docs changed—but your agent still cites the old world.
Plan sources with a niche graph, crawl on demand or on a schedule, extract AI-ready JSON, and deliver via API or webhooks—fresh structured web data for every answer.
We ran controlled API benches and an A/B study (same LLM, with vs without CragData context). Full methodology, numbers, and honest coverage limits are published for your team to verify.
Seed: cragsoftware.com · 3 research questions · gpt-4o-mini (answer + judge)
Useful scrapes 27/60 (45%) · useful seeds 3/4 (75%)
- cragsoftware.com: 15/15 useful scrapes
- stripe.com: 9/15 (some login redirects)
- anthropic.com: 3/15 (302/404 mix)
- openai.com: 0/15 (403 anti-bot)
Full write-up with integration code, bench design, and reproduction steps →
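The headline rates can be re-derived from the per-seed counts. A minimal check, assuming a seed counts as "useful" when at least one of its scrapes was useful (the full write-up may define it differently):

```python
# Per-seed useful scrapes out of 15 attempts each, as reported above.
useful_by_seed = {
    "cragsoftware.com": 15,
    "stripe.com": 9,
    "anthropic.com": 3,
    "openai.com": 0,
}
ATTEMPTS_PER_SEED = 15

total_useful = sum(useful_by_seed.values())
total_attempts = ATTEMPTS_PER_SEED * len(useful_by_seed)

# Assumption: a seed is "useful" if it produced at least one useful scrape.
useful_seeds = sum(1 for n in useful_by_seed.values() if n > 0)

scrape_rate = total_useful / total_attempts
seed_rate = useful_seeds / len(useful_by_seed)

print(f"Useful scrapes {total_useful}/{total_attempts} ({scrape_rate:.0%})")
print(f"Useful seeds {useful_seeds}/{len(useful_by_seed)} ({seed_rate:.0%})")
```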
Scraping is a commodity. The layer that matters is live, structured, citeable web intelligence—so agents and RAG stop hallucinating on stale corpora.
Managed web intelligence pipeline—not a single-URL scraper script. Built for agents, RAG, and production data products.
Expand domains and URLs from seeds.
Concurrent fetch with retries and rate limits.
AI-ready JSON—content blocks, links, metadata.
Graph tables, niche scores, top pages.
REST API, webhooks, JSONL/Parquet export.
Try a seed domain below—interactive demo graph (no API key). Live crawl, export, and webhooks on your workspace require signup.
Add seeds, run the pipeline, consume structured data where your team already works.
URLs or domains you care about.
We map the web slice you defined.
Call /graph/domain-context and /graph/top-pages so agents know which domains and URLs to read first.
Discover, crawl, extract, query, export, and monitor—bundled into plans that scale with your volume.
GET /graph/domain-context returns your niche subgraph: who links to the seed, who the seed links to, related domains with scores, and summaries agents can consume directly. Pair with /graph/top-pages and /graph/hops.
Expand your universe of sites from a handful of seeds. Discover mode surfaces new registrable domains and queues them for crawl—ideal for market maps and competitive landscapes.
Crawl at scale with guardrails. Every page becomes a node; every internal link becomes an edge—ready for graph analysis or downstream extraction.
No raw HTML dumps. Get normalized JSON designed for search, LLM pipelines, and analytics—with noise stripped (nav, scripts, footers).
Raw page/domain graph plus stats for dashboards. Use AI Context Graph when you need ranked, agent-ready context—not just nodes and edges.
Monitor pipeline health and coverage from one overview—see what's discovered, crawled, and extracted.
Download your graph or scrapes as JSONL or Parquet (Developer+). Wire exports into your warehouse or notebooks.
Set seeds once; we keep discovering, deepening, and extracting on a schedule—built for monitoring, not one-off exports.
Watch crawls live: node events, logs, and progress over a secure WebSocket—ideal for dashboards and internal tools.
Web intelligence infrastructure—not a hobby scraping tool.
Retries, rate limits, and scrapable flags so agents skip dead pages.
Queued jobs scale across workers—not one browser on a laptop.
On-demand and scheduled crawls keep RAG off stale snapshots.
Structured JSON and context_for_ai strings for system prompts.
From 500 free calls/month to enterprise volume and SLAs.
Inbound, outbound, and related domains ranked before you embed.
Specific outcomes—not abstract “web data” promises.
Seed your top 5 competitors. CragData discovers their ecosystem, extracts messaging, and builds the market map your sales team has been asking for.
Extract titles, headings, word counts, and internal links across 500 domains—without writing a single scraper or managing proxies.
Structured JSON with provenance (URL, status, scraped_at). Feed embeddings and RAG pipelines directly. No HTML parsing, no noise.
Every page tracked with status code and timestamp. Crawls are reproducible. Pull datasets via API or export—cite it, share it, version it.
Continuous loops: discover → crawl → extract on schedule. Get alerted when new sites appear in your vertical.
CragData is the discovery layer that sits upstream of everything else. Add it before you search, before you crawl, before you extract — and every downstream tool gets dramatically richer input.
Exa searches what's indexed. CragData discovers what isn't.
Firecrawl extracts pages you give it. CragData finds the pages first.
Diffbot structures known entities. CragData surfaces the hidden ones.
Bright Data collects at scale. CragData maps what to collect.
One call returns the niche topology around your seed: inbound/outbound domains ranked by link strength, scores, and a context_for_ai string ready for your system prompt.
// GET /v1/graph/domain-context?seed=ycombinator.com
{
"seed_domain": "ycombinator.com",
"context_for_ai": "Niche graph for ycombinator.com (depth 2 hops). 15 destinations, use top_outbound to plan RAG sources.",
"seed": { "domain": "ycombinator.com", "pages_indexed": 12, "scrapable_pages": 12 },
"top_outbound_domains": [
{ "domain": "startupschool.org", "link_count": 16, "niche_score": 1.0, "scrapable": true },
{ "domain": "news.ycombinator.com", "link_count": 9, "niche_score": 0.56, "scrapable": true },
{ "domain": "paulgraham.com", "link_count": 6, "niche_score": 0.375, "scrapable": true }
],
"top_inbound_domains": [...],
"related_domains": [...],
"cached": true
}
Free for PoCs, Developer from $10/mo, Startup for production pilots, Enterprise for volume and compliance.
Try the full pipeline on a small slice of the web. Perfect for evaluating data quality before you wire up production.
For serious solo devs and small projects that need real volume—not a $99 commitment.
Production volume, webhooks, schedules, and higher limits. For teams shipping a feature, not a science project.
Dedicated infrastructure, custom limits, security review, SLAs, and solution design.
CragData is web intelligence infrastructure: discover domains, crawl pages, extract structured JSON, and explore link graphs—built for AI agents, RAG, and data products.
Web intelligence is the practice of turning the live web into structured, queryable data—graphs, entities, and documents—rather than one-off HTML snapshots.
JSON designed for models: clean content blocks, metadata, link graphs, and plain-English summaries like context_for_ai that drop into system prompts.
Call GET /graph/domain-context before broad search, pick domains with GET /graph/top-pages, scrape top URLs, then embed. Patterns are in /docs and /llms.txt.
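The planning pattern above might look roughly like this in Python. This is a sketch, not the official client: the host and `/v1` prefix come from this page, but the `domain` query parameter and the `{"pages": [{"url": ...}]}` response shape for `/graph/top-pages` are assumptions, and authentication is left out because it is not described here. Check /docs for the real contract.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://api.cragdata.com/v1"  # host and /v1 prefix as shown on this page


def http_get_json(path, params):
    """Plain GET helper. Auth headers are omitted (not specified on this page)."""
    with urlopen(f"{BASE_URL}{path}?{urlencode(params)}") as resp:
        return json.load(resp)


def plan_sources(seed, get_json=http_get_json, max_domains=3):
    """Return (context_for_ai, urls_to_scrape) planned from the niche graph."""
    context = get_json("/graph/domain-context", {"seed": seed})

    # Keep only domains flagged scrapable, strongest niche_score first.
    domains = [d for d in context.get("top_outbound_domains", []) if d.get("scrapable")]
    domains.sort(key=lambda d: d.get("niche_score", 0.0), reverse=True)

    urls = []
    for d in domains[:max_domains]:
        # Assumption: /graph/top-pages accepts `domain` and returns
        # {"pages": [{"url": ...}, ...]}.
        pages = get_json("/graph/top-pages", {"domain": d["domain"]})
        urls.extend(p["url"] for p in pages.get("pages", []))

    return context.get("context_for_ai", ""), urls
```

Passing `get_json` as a parameter keeps the planning logic testable without network access; the returned URLs are what you would then send to `/scrape` before embedding.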
Fetching fresh web data at query time instead of relying on frozen training corpora—so answers reflect current pages, pricing, and ecosystem links.
Crawling only the URLs that matter for a user question, after planning sources with a graph—so embeddings stay relevant and token spend stays low.
Jobs are queued, workers run concurrent fetches with retries and rate limits, and results land in graph tables and JSON exports you pull via API.
Sites use fingerprints, rate limits, and challenges. We rotate strategies, respect robots.txt, back off on failures, and surface scrapable flags in graph responses.
Models trained on static snapshots cannot see today's pricing, partners, or news. Live crawl + extract keeps agent answers grounded in current pages.
For production agents, yes—freshness wins. Snapshots are fine for eval; customer-facing RAG needs scheduled or on-demand live ingestion.
We run discovery, queues, crawling, extraction, storage, and APIs as managed infrastructure—not a single-URL fetch you host yourself.
No. You provide seeds; we discover and crawl within your plan limits and configuration.
JSON per page (title, content[], links[], metadata) plus graph endpoints for domains, top pages, and hops.
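To feed that per-page JSON into an embedding pipeline, one option is to pack content blocks into size-bounded chunks that keep their provenance. The sketch below assumes each `content[]` item is either a string or an object with a `text` field, and that `url` and `scraped_at` live under `metadata`; the actual schema may differ.

```python
def page_to_chunks(page, max_chars=1000):
    """Turn one page's JSON into text chunks that carry provenance."""
    meta = page.get("metadata", {})

    texts = []
    for block in page.get("content", []):
        # Assumption: a block is a string or an object with a "text" field.
        text = block if isinstance(block, str) else block.get("text", "")
        if text.strip():
            texts.append(text.strip())

    chunks, current = [], ""
    for text in texts:
        # Start a new chunk when adding this block would exceed the budget.
        if current and len(current) + len(text) + 1 > max_chars:
            chunks.append(current)
            current = text
        else:
            current = f"{current}\n{text}" if current else text
    if current:
        chunks.append(current)

    return [
        {
            "title": page.get("title", ""),
            "text": chunk,
            "url": meta.get("url"),                # hypothetical location
            "scraped_at": meta.get("scraped_at"),  # hypothetical location
        }
        for chunk in chunks
    ]
```

A single block longer than `max_chars` is kept whole as its own chunk rather than split mid-sentence.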
Yes. Pull via REST, export JSONL or Parquet on Developer+, or push events with webhooks into your warehouse.
Yes — crawl.completed, discover.completed, page.extracted. Configure HTTPS URLs in Dashboard → Webhooks; payloads are HMAC-signed.
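On the receiving side, an HMAC-signed payload is typically verified by recomputing the signature over the raw request body. The sketch below assumes HMAC-SHA256 with a hex-encoded signature; the algorithm, encoding, and the header that carries the signature are not stated here, so confirm them in the webhook docs.

```python
import hashlib
import hmac


def verify_webhook(raw_body: bytes, signature_hex: str, secret: str) -> bool:
    """Check an HMAC signature over the exact bytes received.

    Assumptions: HMAC-SHA256, hex-encoded signature.
    """
    expected = hmac.new(secret.encode("utf-8"), raw_body, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time, so it does not leak
    # how many leading characters matched.
    return hmac.compare_digest(expected.encode("utf-8"), signature_hex.encode("utf-8"))
```

Verify against the raw bytes before parsing the JSON: re-serializing the payload can change whitespace or key order and break the comparison.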
Python and Node clients live under packages/cragdata-python and packages/cragdata-js in our GitHub repo.
A ranked view of who links to a seed domain, where the seed links out, and related clusters—with niche_score and scrapable flags for agents.
Most teams send a first /scrape or /graph/domain-context call within 15 minutes using the quickstart at /docs.
Plan-based requests per second; see /docs#errors or GET /me for live caps. Headers include X-Credits-Remaining and X-RateLimit-Limit.
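A client can read those two headers off any response to decide when to slow down. A minimal sketch, assuming both header values are plain integers:

```python
def read_limits(headers):
    """Extract credit and rate-limit values from response headers."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {k.lower(): v for k, v in headers.items()}

    def as_int(name):
        try:
            return int(lowered.get(name.lower()))
        except (TypeError, ValueError):
            return None  # header missing or not an integer

    return {
        "credits_remaining": as_int("X-Credits-Remaining"),
        "rate_limit": as_int("X-RateLimit-Limit"),
    }
```

For example, `read_limits({"x-credits-remaining": "480", "X-RateLimit-Limit": "5"})` yields `{"credits_remaining": 480, "rate_limit": 5}`.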
Yes. You are responsible for lawful use; we encourage robots.txt compliance and reasonable crawl rates.
You are responsible for how you use data. Enterprise plans include compliance discussions for sensitive use cases.
Yes—seed competitor domains, schedule crawls, and diff structured JSON over time. See /use-cases/competitor-monitoring.
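Diffing two crawl runs can be as simple as fingerprinting each page and comparing by URL. The sketch below assumes you have already loaded each run into a `{url: page_json}` dictionary (a hypothetical layout, not an export format) and treats a change in `title` or `content` as a changed page.

```python
import hashlib
import json


def fingerprint(page):
    """Hash the parts of a page that matter for change detection."""
    payload = json.dumps(
        {"title": page.get("title"), "content": page.get("content")},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def diff_snapshots(old_pages, new_pages):
    """Compare two {url: page_json} snapshots from different crawl runs."""
    old = {url: fingerprint(p) for url, p in old_pages.items()}
    new = {url: fingerprint(p) for url, p in new_pages.items()}
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(u for u in old.keys() & new.keys() if old[u] != new[u]),
    }
```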
POST /crawl or /discover returns a job_id; workers process the queue; you poll GET /crawl/{job_id} or listen on webhooks.
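If you poll rather than use webhooks, a bounded loop keeps a stuck job from hanging your caller. In this sketch `get_status` is a callable you supply that wraps `GET /crawl/{job_id}`; the `status` field and its terminal values `"completed"` and `"failed"` are assumptions, so adjust them to the real response.

```python
import time


def wait_for_job(job_id, get_status, interval_s=2.0, timeout_s=300.0, sleep=time.sleep):
    """Poll until the job reaches a terminal state or the timeout passes.

    Assumption: the response has a "status" field whose terminal values
    are "completed" and "failed".
    """
    waited = 0.0
    while True:
        job = get_status(job_id)
        if job.get("status") in ("completed", "failed"):
            return job
        if waited >= timeout_s:
            raise TimeoutError(f"job {job_id} still running after {timeout_s}s")
        sleep(interval_s)
        waited += interval_s
```

The injectable `sleep` makes the loop easy to test; in production the default `time.sleep` is fine.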
HTML is normalized to JSON blocks with nav/scripts stripped—ready for search indexes, analytics, or embedding pipelines.
Firecrawl is strong for page-to-markdown. CragData adds niche graphs and agent-first source planning. See /compare/cragdata-vs-firecrawl.
Apify is an actor marketplace. CragData is an opinionated intelligence API with graphs and managed pipelines. See /compare/cragdata-vs-apify.
Production API at api.cragdata.com with health at /v1/health. Enterprise SLAs available; status practices documented for ops teams.
Yes—Dashboard → Schedules runs recurring discover/crawl jobs for monitoring use cases.
A plain-English summary in graph responses describing the niche topology—designed to paste into an agent system prompt before retrieval.
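One way to use it: append `context_for_ai` and the scrapable outbound domains to your base prompt. The field names below follow the sample response on this page; the prompt wording and the `build_system_prompt` helper are illustrative, not part of the API.

```python
def build_system_prompt(base_prompt, graph_response, max_domains=5):
    """Combine a base prompt with graph context from /graph/domain-context."""
    context = graph_response.get("context_for_ai", "").strip()
    domains = [
        d["domain"]
        for d in graph_response.get("top_outbound_domains", [])
        if d.get("scrapable")
    ][:max_domains]

    parts = [base_prompt.strip()]
    if context:
        parts.append("Web context: " + context)
    if domains:
        parts.append("Preferred sources: " + ", ".join(domains))
    return "\n\n".join(parts)
```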
Teams shipping AI agents, RAG products, GTM enrichment, SEO research, and market intelligence who need live structured web data at scale.
Start free, upgrade to Developer for $10/mo, or talk to us about Enterprise volume and compliance.
Questions about plans, volume, or Enterprise? We respond within 24 hours.
CragData is a web crawl API and link graph service for AI agents and RAG pipelines. It crawls a seed domain, maps inbound/outbound links, and returns a niche topology graph with a context_for_ai string ready for system prompts—enabling agents to plan research sources before broad web search. Outputs: structured JSON, domain graphs, and scrapable page lists. Free tier: 500 calls/month, no credit card.