DataBlue API Documentation

DataBlue is a self-hosted web scraping platform that provides a simple, powerful API for extracting content from any website. It handles JavaScript rendering, anti-bot bypass, proxy rotation, and content extraction so you can focus on building your application.

Scrape

Extract content from any URL with JS rendering.

Crawl

Follow links and scrape entire websites.

Extract

LLM-powered structured data extraction.

The API is fully compatible with the Firecrawl API spec. Base URL: https://api.datablue.dev/v1

Authentication

All API requests require authentication via the Authorization header. DataBlue supports two authentication methods:

API Key (Recommended)

API keys have the prefix wh_ and do not expire. Create them from the dashboard or via the API.

bash
# Using an API key
curl -H "Authorization: Bearer wh_your_api_key" \
  https://api.datablue.dev/v1/scrape

JWT Token

JWT tokens are obtained from POST /v1/auth/login and expire after 7 days. They are primarily used by the dashboard frontend.

bash
# Obtain a JWT
curl -X POST https://api.datablue.dev/v1/auth/login \
  -H "Content-Type: application/json" \
  -d '{"email": "user@example.com", "password": "your_password"}'

Create an API Key

Programmatically create a new API key via the API (requires JWT authentication):

bash
curl -X POST https://api.datablue.dev/v1/auth/api-keys \
  -H "Authorization: Bearer eyJ..." \
  -H "Content-Type: application/json" \
  -d '{"name": "my-scraper"}'

Quick Start

Get up and running in under a minute. Scrape a page and get clean markdown back.

1. Get your API key

Create an API key from your DataBlue dashboard at Settings → API Keys.

2. Make your first request

bash
curl -X POST https://api.datablue.dev/v1/scrape \
  -H "Authorization: Bearer wh_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown", "links"]
  }'

3. Get your results

json
{
  "success": true,
  "data": {
    "markdown": "# Example Domain\n\nThis domain is for use in illustrative examples...",
    "links": [
      { "url": "https://www.iana.org/domains/example", "text": "More information..." }
    ],
    "metadata": {
      "title": "Example Domain",
      "description": "Example Domain",
      "status_code": 200,
      "word_count": 58
    }
  }
}
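The same request can be made from Python using only the standard library (a minimal sketch, not an official SDK; replace the placeholder key with your own):

```python
import json
import urllib.request

API_KEY = "wh_your_api_key"  # replace with your DataBlue API key
BASE_URL = "https://api.datablue.dev/v1"

def scrape(url, formats=("markdown",)):
    """POST /v1/scrape and return the parsed JSON response body."""
    payload = json.dumps({"url": url, "formats": list(formats)}).encode()
    req = urllib.request.Request(
        f"{BASE_URL}/scrape",
        data=payload,
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)
```

Calling `scrape("https://example.com", ["markdown", "links"])` returns the same JSON body shown above.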
POST /v1/scrape

Scrape

Extract content from any URL. DataBlue handles JavaScript rendering, anti-bot bypass, proxy rotation, and content extraction automatically. Uses a staggered race strategy — HTTP at t=0, browser at t=3s — and the first valid result wins.
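The staggered race happens server-side and needs no client configuration, but the pattern can be sketched in Python with asyncio (illustrative only; the fetcher functions here are hypothetical stand-ins for DataBlue's internal HTTP and browser scrapers):

```python
import asyncio

async def staggered_race(http_fetch, browser_fetch, stagger=3.0):
    """Start http_fetch immediately and browser_fetch after `stagger`
    seconds; the first result that is not None (a "valid" result) wins."""
    async def delayed_browser():
        await asyncio.sleep(stagger)
        return await browser_fetch()

    tasks = [asyncio.ensure_future(http_fetch()),
             asyncio.ensure_future(delayed_browser())]
    try:
        for fut in asyncio.as_completed(tasks):
            result = await fut
            if result is not None:  # first valid result wins
                return result
        return None  # neither strategy produced a valid result
    finally:
        # Cancel whichever fetcher is still running and reap the tasks.
        for t in tasks:
            t.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)
```

In practice this means fast static pages return in well under a second, while JavaScript-heavy pages fall through to the browser automatically.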

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| url (required) | string | — | The URL to scrape. |
| formats | string[] | ["markdown"] | Output formats to return. Options: "markdown", "html", "raw_html", "links", "screenshot", "structured_data", "headings", "images". |
| only_main_content | bool | true | Extract only the main content, removing navs, footers, sidebars. |
| mobile | bool | false | Emulate a mobile device viewport. |
| timeout | number | 30000 | Request timeout in milliseconds. |
| wait_for | number | 0 | Wait this many milliseconds after page load before extracting. |
| css_selector | string | — | Only extract content matching this CSS selector. |
| include_tags | string[] | — | Only include these HTML tags in extraction. |
| exclude_tags | string[] | — | Exclude these HTML tags from extraction. |
| use_proxy | bool | false | Route the request through a proxy for anti-bot bypass. |

Example Request

bash
curl -X POST https://api.datablue.dev/v1/scrape \
  -H "Authorization: Bearer wh_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown", "links"]
  }'

Example Response

json
{
  "success": true,
  "data": {
    "markdown": "# Example Domain\n\nThis domain is for use in illustrative examples...",
    "links": [
      { "url": "https://www.iana.org/domains/example", "text": "More information..." }
    ],
    "metadata": {
      "title": "Example Domain",
      "description": "Example Domain",
      "status_code": 200,
      "word_count": 58
    }
  }
}
POST /v1/crawl

Crawl

Crawl a website by following internal links. Supports BFS (breadth-first), DFS (depth-first), and best-first crawl strategies. Returns progressive results as pages are scraped. The job runs asynchronously — poll the job endpoint for status and results.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| url (required) | string | — | The starting URL to crawl. |
| max_pages | number | 100 | Maximum number of pages to crawl (max 1000). |
| max_depth | number | 3 | Maximum link depth from the starting URL (max 10). |
| concurrency | number | 3 | Number of parallel scrape workers (max 10). |
| crawl_strategy | string | "bfs" | Crawl strategy: "bfs" (breadth-first), "dfs" (depth-first), or "bff" (best-first). |
| allow_external_links | bool | false | Follow links to external domains. |
| respect_robots_txt | bool | true | Obey the target site's robots.txt rules. |
| include_paths | string[] | — | Only crawl URLs matching these path patterns. |
| exclude_paths | string[] | — | Skip URLs matching these path patterns. |
| scrape_options | object | — | Options passed to each page scrape: { formats, only_main_content }. |

Example Request

bash
curl -X POST https://api.datablue.dev/v1/crawl \
  -H "Authorization: Bearer wh_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "max_pages": 50,
    "scrape_options": {
      "formats": ["markdown"],
      "only_main_content": true
    }
  }'

Example Response

json
{
  "success": true,
  "job_id": "crawl_a1b2c3d4",
  "status": "started"
}

// Poll GET /v1/crawl/crawl_a1b2c3d4 for results:
{
  "success": true,
  "status": "completed",
  "total_pages": 47,
  "data": [
    {
      "url": "https://example.com",
      "markdown": "# Example Domain...",
      "metadata": { "title": "Example Domain", "status_code": 200 }
    },
    ...
  ]
}
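The polling loop above can be sketched in Python (a minimal example using the standard library; the intermediate status names beyond "started" and "completed" are assumptions, so treat anything that is not "started" or "running" as terminal and inspect the body):

```python
import json
import time
import urllib.request

BASE_URL = "https://api.datablue.dev/v1"

def wait_for_crawl(job_id, api_key, poll_interval=5, fetch=None):
    """Poll GET /v1/crawl/{job_id} until the job leaves its running states.

    `fetch` is injectable for testing; by default it issues the real request.
    """
    def default_fetch(url):
        req = urllib.request.Request(
            url, headers={"Authorization": f"Bearer {api_key}"})
        with urllib.request.urlopen(req, timeout=30) as resp:
            return json.load(resp)

    fetch = fetch or default_fetch
    while True:
        job = fetch(f"{BASE_URL}/crawl/{job_id}")
        # "started" is documented; "running" is an assumed in-progress state.
        if job["status"] not in ("started", "running"):
            return job
        time.sleep(poll_interval)
```

For long crawls, prefer a webhook (see Webhooks below) over tight polling.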
POST /v1/map

Map

Discover all URLs on a website without scraping content. Combines sitemap.xml parsing with link extraction for comprehensive URL discovery. Returns URLs with metadata including title, description, last modified date, and priority.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| url (required) | string | — | The website URL to map. |
| limit | number | 100 | Maximum number of URLs to return. |
| include_subdomains | bool | true | Include URLs from subdomains. |
| use_sitemap | bool | true | Use the site's sitemap.xml for discovery. |
| search | string | — | Optional filter to match URLs or titles. |

Example Request

bash
curl -X POST https://api.datablue.dev/v1/map \
  -H "Authorization: Bearer wh_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "limit": 100,
    "use_sitemap": true
  }'

Example Response

json
{
  "success": true,
  "total": 47,
  "links": [
    {
      "url": "https://example.com/about",
      "title": "About Us",
      "description": "Learn more about our company",
      "lastmod": "2025-12-01",
      "priority": 0.8
    },
    {
      "url": "https://example.com/blog",
      "title": "Blog",
      "description": "Latest articles and news",
      "lastmod": "2025-12-15",
      "priority": 0.7
    },
    ...
  ]
}
POST /v1/extract

Extract

LLM-powered structured data extraction. Provide a natural language prompt describing what data you need and an optional JSON Schema for the output format. DataBlue scrapes the page, sends the content to an LLM, and returns structured data matching your schema.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| url (required) | string | — | The URL to extract data from. |
| prompt (required) | string | — | Natural language instruction for what data to extract. |
| schema | object | — | JSON Schema defining the expected output structure. |
| model | string | — | LLM model to use for extraction. |
| formats | string[] | ["markdown"] | Additional output formats to include alongside extracted data. |

Example Request

bash
curl -X POST https://api.datablue.dev/v1/extract \
  -H "Authorization: Bearer wh_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/products",
    "prompt": "Extract all product names and prices",
    "schema": {
      "type": "object",
      "properties": {
        "products": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "name": { "type": "string" },
              "price": { "type": "number" }
            }
          }
        }
      }
    }
  }'

Example Response

json
{
  "success": true,
  "data": {
    "extract": {
      "products": [
        { "name": "Wireless Headphones", "price": 79.99 },
        { "name": "Bluetooth Speaker", "price": 49.99 },
        { "name": "USB-C Hub", "price": 34.99 }
      ]
    },
    "markdown": "# Products\n\n- Wireless Headphones — $79.99...",
    "metadata": {
      "title": "Products - Example Store",
      "status_code": 200
    }
  }
}

Output Formats

Control what data you receive by passing one or more format strings in the formats array. Each format returns a corresponding field in the response.

markdown

Clean markdown conversion of the page content. Preserves headings, lists, links, tables, and inline formatting. Scripts, styles, and navigation are removed.

json
"markdown": "# Page Title\n\nIntroduction paragraph...\n\n## Section\n\n- Item one\n- Item two"

html / raw_html

html returns cleaned HTML with scripts, styles, and tracking elements removed. raw_html returns the original, unmodified HTML source.

json
"html": "<h1>Page Title</h1><p>Content paragraph...</p>",
"raw_html": "<!DOCTYPE html><html>...full source...</html>"

screenshot

Base64-encoded PNG screenshot of the fully rendered page. Requires a browser-based scrape.

json
"screenshot": "data:image/png;base64,iVBORw0KGgoAAAANSUhEU..."

structured_data

Extracted structured data from the page including JSON-LD, OpenGraph tags, Twitter Cards, and Schema.org markup.

json
"structured_data": {
  "json_ld": [{ "@type": "Article", "headline": "..." }],
  "opengraph": { "og:title": "...", "og:description": "..." },
  "twitter": { "twitter:card": "summary_large_image" },
  "schema_org": [{ "@type": "Organization", "name": "..." }]
}

headings

Heading hierarchy extracted from the page, useful for understanding document structure.

json
"headings": [
  { "level": 1, "text": "Main Title" },
  { "level": 2, "text": "First Section" },
  { "level": 3, "text": "Subsection" },
  { "level": 2, "text": "Second Section" }
]

images

All images found on the page with source URLs, alt text, and dimensions when available.

json
"images": [
  { "src": "https://example.com/hero.jpg", "alt": "Hero image", "width": 1200, "height": 630 },
  { "src": "https://example.com/logo.png", "alt": "Company logo", "width": 200, "height": 50 }
]
POST /v1/data/google/maps

Google Maps

Extract business listings from Google Maps including name, address, rating, reviews, phone number, website, hours, and coordinates. Supports grid search for comprehensive coverage.

bash
curl -X POST https://api.datablue.dev/v1/data/google/maps \
  -H "Authorization: Bearer wh_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{"query": "coffee shops", "location": "San Francisco, CA", "limit": 50}'
POST /v1/data/google/news

Google News

Scrape Google News articles for any topic. Returns headline, source, publication date, snippet, and link to the full article.

bash
curl -X POST https://api.datablue.dev/v1/data/google/news \
  -H "Authorization: Bearer wh_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{"query": "artificial intelligence", "num_results": 20, "time_range": "week"}'
POST /v1/data/google/jobs

Google Jobs

Extract job listings from Google Jobs. Returns title, company, location, salary range, description, posted date, and application link.

bash
curl -X POST https://api.datablue.dev/v1/data/google/jobs \
  -H "Authorization: Bearer wh_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{"query": "software engineer", "location": "New York", "num_results": 30}'
POST /v1/data/google/images

Google Images

Scrape Google Image search results. Returns image URL, thumbnail, source page, title, and dimensions.

bash
curl -X POST https://api.datablue.dev/v1/data/google/images \
  -H "Authorization: Bearer wh_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{"query": "modern architecture", "num_results": 30, "size": "large"}'
POST /v1/data/google/flights

Google Flights

Search Google Flights for pricing and schedule data. Returns flight options with airline, departure/arrival times, duration, stops, and price. HTTP-only with protobuf encoding.

bash
curl -X POST https://api.datablue.dev/v1/data/google/flights \
  -H "Authorization: Bearer wh_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "origin": "SFO",
    "destination": "JFK",
    "departure_date": "2026-04-15",
    "return_date": "2026-04-22",
    "passengers": 1
  }'
POST /v1/data/google/finance

Google Finance

Get stock and market data from Google Finance. Returns current price, change, market cap, P/E ratio, and historical data. Supports market overview and individual quote pages.

bash
curl -X POST https://api.datablue.dev/v1/data/google/finance \
  -H "Authorization: Bearer wh_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{"symbol": "AAPL", "exchange": "NASDAQ"}'
POST /v1/data/google/shopping

Google Shopping

Scrape Google Shopping results. Returns product name, price, seller, rating, image, and link for comparison shopping data.

bash
curl -X POST https://api.datablue.dev/v1/data/google/shopping \
  -H "Authorization: Bearer wh_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{"query": "wireless headphones", "num_results": 20, "min_price": 50, "max_price": 200}'
POST /v1/data/amazon

Amazon

Extract product data from Amazon. Returns product title, price, rating, review count, images, description, features, and availability. Supports search and individual product URLs.

bash
curl -X POST https://api.datablue.dev/v1/data/amazon \
  -H "Authorization: Bearer wh_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{"query": "mechanical keyboard", "num_results": 20, "marketplace": "us"}'

Rate Limits

Rate limits are enforced per API key on a per-minute basis. When you exceed the limit, you'll receive a 429 status code, and the response includes a Retry-After header indicating how long to wait before retrying.

| Plan | Scrape | Crawl | Search | Map |
| --- | --- | --- | --- | --- |
| Free | 10/min | 2/min | 5/min | 5/min |
| Starter | 50/min | 10/min | 20/min | 20/min |
| Pro | 300/min | 50/min | 100/min | 100/min |
| Growth | 300/min | 50/min | 100/min | 100/min |
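A 429 response should be handled by sleeping for the Retry-After interval and retrying. A minimal Python sketch using the standard library (it assumes Retry-After carries a seconds value, the common case, rather than an HTTP-date):

```python
import time
import urllib.error
import urllib.request

def retry_after_delay(headers, default=2.0):
    """Parse a Retry-After header value in seconds; fall back to `default`
    when the header is absent or not numeric."""
    try:
        return float(headers.get("Retry-After", default))
    except (TypeError, ValueError):
        return default

def request_with_retry(req, max_retries=3, sleep=time.sleep):
    """urlopen with automatic backoff on 429 responses; other HTTP
    errors are raised unchanged."""
    for attempt in range(max_retries + 1):
        try:
            return urllib.request.urlopen(req, timeout=60)
        except urllib.error.HTTPError as err:
            if err.code != 429 or attempt == max_retries:
                raise
            sleep(retry_after_delay(err.headers))
```

`err.headers` on an HTTPError supports `.get()`, so the same parser works for both live responses and plain dicts in tests.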

Error Codes

All errors return a JSON body with { "success": false, "error": "message" }.

| Code | Status | Description |
| --- | --- | --- |
| 400 | Bad Request | Missing or invalid parameters. Check the request body. |
| 401 | Unauthorized | Missing or invalid API key / JWT token. |
| 403 | Forbidden | Your plan does not allow this action or resource. |
| 404 | Not Found | The requested resource or job ID does not exist. |
| 429 | Rate Limited | Too many requests. Wait and retry after the Retry-After period. |
| 500 | Server Error | Internal server error. Retry the request or contact support. |

Webhooks

Instead of polling for job results, you can provide a webhook URL that DataBlue will call when the job completes. Pass webhook_url and optionally webhook_secret with any async job request.

Registering a webhook

bash
curl -X POST https://api.datablue.dev/v1/crawl \
  -H "Authorization: Bearer wh_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "max_pages": 50,
    "webhook_url": "https://your-server.com/webhook",
    "webhook_secret": "your_secret_key"
  }'

Webhook payload

DataBlue sends a POST request to your webhook URL with the job results in the body. If you provided a webhook_secret, the request includes an X-Webhook-Signature header containing an HMAC-SHA256 signature of the request body.

text
// Headers
X-Webhook-Signature: sha256=a1b2c3d4e5f6...
Content-Type: application/json

// Body
{
  "success": true,
  "job_id": "crawl_a1b2c3d4",
  "status": "completed",
  "total_pages": 47,
  "data": [...]
}

Verifying signatures

The signature is the HMAC-SHA256 of the raw request body, keyed with your webhook_secret. Recompute it on your server and compare it against the X-Webhook-Signature header using a constant-time comparison.
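A minimal verification sketch in Python (hmac and hashlib are from the standard library; the function name is illustrative):

```python
import hashlib
import hmac

def verify_signature(raw_body: bytes, secret: str, header_value: str) -> bool:
    """Check the X-Webhook-Signature header ("sha256=<hex>") against an
    HMAC-SHA256 of the raw request body keyed with your webhook_secret."""
    expected = hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    # compare_digest prevents timing attacks on the comparison.
    return hmac.compare_digest(f"sha256={expected}", header_value)
```

Always verify against the raw bytes of the request body, before any JSON parsing or re-serialization, since re-encoding can change whitespace and break the signature.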

Plans

DataBlue offers tiered plans to match your usage. All plans include access to every endpoint. Higher plans unlock higher rate limits, more concurrent crawls, and priority support.

Free

$0/mo

  • 500 scrapes/month
  • 10 crawl jobs/month
  • Community support
  • 1 API key

Starter

$29/mo

  • 5,000 scrapes/month
  • 100 crawl jobs/month
  • Email support
  • 5 API keys

Pro

$99/mo

  • 50,000 scrapes/month
  • 1,000 crawl jobs/month
  • Priority support
  • Unlimited API keys
  • Webhooks

Growth

$299/mo

  • Unlimited scrapes
  • Unlimited crawl jobs
  • Dedicated support
  • Unlimited API keys
  • Webhooks
  • Custom rate limits