Building RAG With Live Web Data

A concrete pipeline for RAG with niche graphs, top pages, structured scrape JSON, and embeddings.

  • rag
  • tutorial


This is the pipeline we recommend for teams shipping retrieval that has to stay within token budgets and keep its sources fresh.

Step 1 — Plan sources with a graph

curl "https://api.cragdata.com/v1/graph/domain-context?seed=indiehackers.com" \
  -H "Authorization: Bearer $CRAGDATA_API_KEY"

In the response, read context_for_ai along with the ranked top_inbound_domains and top_outbound_domains, then decide which domains are worth retrieving from.
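
As a rough illustration of that selection step, here is a minimal Python sketch that merges the two ranked fields into one shortlist. The field names come from the step above. The response shape is an assumption: we treat each field as a best-first list whose entries are either bare domain strings or objects with a "domain" key, so check a real response before relying on it. The helper names are hypothetical.

```python
from itertools import zip_longest
from typing import Any, Optional


def _domain_of(entry: Any) -> Optional[str]:
    # Assumption: an entry is a bare string or an object with a "domain" key.
    if isinstance(entry, str):
        return entry
    if isinstance(entry, dict):
        return entry.get("domain")
    return None


def pick_candidate_domains(graph: dict, limit: int = 5) -> list:
    """Interleave inbound and outbound domains by rank, dropping repeats."""
    inbound = graph.get("top_inbound_domains") or []
    outbound = graph.get("top_outbound_domains") or []
    seen = set()
    picked = []
    # zip_longest pads the shorter list with None, which _domain_of skips.
    for pair in zip_longest(inbound, outbound):
        for entry in pair:
            domain = _domain_of(entry)
            if not domain:
                continue
            domain = domain.strip().lower()
            if domain and domain not in seen:
                seen.add(domain)
                picked.append(domain)
    return picked[:limit]
```

Interleaving keeps the top of both lists in the shortlist. If one direction matters more for your niche, iterating that list first is just as reasonable.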

Step 2 — Pick pages inside the best domain

curl "https://api.cragdata.com/v1/graph/top-pages?domain=indiehackers.com&limit=10" \
  -H "Authorization: Bearer $CRAGDATA_API_KEY"

Step 3 — Scrape structured JSON

curl -X POST https://api.cragdata.com/v1/scrape \
  -H "Authorization: Bearer $CRAGDATA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://indiehackers.com/post/example"}'
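
The same call can be made from application code. The sketch below mirrors the curl command above using only the Python standard library. The endpoint, headers, and request body are taken from that command. The function names are ours, and the assumption that the response body parses as a single JSON object is worth verifying against the reference.

```python
import json
import urllib.request

SCRAPE_ENDPOINT = "https://api.cragdata.com/v1/scrape"


def build_scrape_request(page_url: str, api_key: str) -> urllib.request.Request:
    """Build the POST request shown in the curl example, without sending it."""
    body = json.dumps({"url": page_url}).encode("utf-8")
    return urllib.request.Request(
        SCRAPE_ENDPOINT,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def scrape(page_url: str, api_key: str, timeout: float = 30.0) -> dict:
    """Send the request and parse the reply (assumed to be one JSON object)."""
    request = build_scrape_request(page_url, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping request construction separate from sending makes the payload easy to unit test without network access. In practice you would read the key from the CRAGDATA_API_KEY environment variable, as the curl examples do.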

Step 4 — Chunk and embed

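Use the content[] blocks from the scrape response as chunks. Store each chunk's source URL, page title, and fetch time alongside its embedding so the final answer can cite where the text came from and when it was fetched.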
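
A minimal sketch of that step, under stated assumptions: the scrape response is taken to expose "url", "title", and "content" at the top level, and each content block is taken to be either a string or an object with a "text" key. Only content[] is named in this post; the rest is hypothetical. The embed argument stands in for whatever embedding client you use and is not part of any API described here.

```python
from datetime import datetime, timezone
from typing import Callable, Optional


def build_chunks(scrape_result: dict, fetched_at: Optional[datetime] = None) -> list:
    """Turn content[] blocks into chunk records that carry citation metadata."""
    fetched_at = fetched_at or datetime.now(timezone.utc)
    url = scrape_result.get("url", "")      # assumed field name
    title = scrape_result.get("title", "")  # assumed field name
    chunks = []
    for index, block in enumerate(scrape_result.get("content") or []):
        # Assumption: a block is a string or an object with a "text" key.
        text = block.get("text", "") if isinstance(block, dict) else str(block)
        text = text.strip()
        if not text:
            continue  # skip empty blocks rather than embedding whitespace
        chunks.append({
            "id": f"{url}#{index}",
            "text": text,
            "url": url,
            "title": title,
            "fetched_at": fetched_at.isoformat(),
        })
    return chunks


def embed_chunks(chunks: list, embed: Callable[[list], list]) -> list:
    """Attach one vector per chunk using a caller-supplied embedding function."""
    vectors = embed([chunk["text"] for chunk in chunks])
    return [{**chunk, "embedding": vector} for chunk, vector in zip(chunks, vectors)]
```

Because every record carries url, title, and fetched_at, the answer step can format citations straight from the retrieved chunks without a second lookup.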

Operations

  • Schedule recurring crawls for monitoring
  • Use webhooks instead of polling when possible
  • Export JSONL to your warehouse on Developer+
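
If you also track freshness on your side, the fetch time stored in Step 4 is enough to decide which pages are due for a refresh. The sketch below is one way to do that locally. It assumes each stored record has "url" and an ISO 8601 "fetched_at" string (our own record layout, not a platform format), and it does not call any scheduling or webhook API.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def stale_urls(records: list, max_age: timedelta, now: Optional[datetime] = None) -> list:
    """Return URLs whose most recent fetch is older than max_age."""
    now = now or datetime.now(timezone.utc)
    newest = {}
    for record in records:
        fetched = datetime.fromisoformat(record["fetched_at"])
        url = record["url"]
        # Keep only the latest fetch time seen for each URL.
        if url not in newest or fetched > newest[url]:
            newest[url] = fetched
    return sorted(url for url, fetched in newest.items() if now - fetched > max_age)
```

The returned list can feed whatever re-scrape mechanism you already run. How you pick max_age depends on how quickly your sources change.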

Full reference: documentation and llms.txt.