DataBlue API Documentation
DataBlue is a self-hosted web scraping platform that provides a simple, powerful API for extracting content from any website. It handles JavaScript rendering, anti-bot bypass, proxy rotation, and content extraction so you can focus on building your application.
Scrape
Extract content from any URL with JS rendering.
Crawl
Follow links and scrape entire websites.
Extract
LLM-powered structured data extraction.
The API is fully compatible with the Firecrawl API spec. Base URL: https://api.datablue.dev/v1
Authentication
All API requests require authentication via the Authorization header. DataBlue supports two authentication methods:
API Key (Recommended)
API keys have the prefix wh_ and do not expire. Create them from the dashboard or via the API.
# Using an API key
curl -H "Authorization: Bearer wh_your_api_key" \
https://api.datablue.dev/v1/scrape
JWT Token
JWT tokens are obtained from POST /v1/auth/login and expire after 7 days. They are primarily used by the dashboard frontend.
# Obtain a JWT
curl -X POST https://api.datablue.dev/v1/auth/login \
-H "Content-Type: application/json" \
-d '{"email": "user@example.com", "password": "your_password"}'
Create an API Key
Programmatically create a new API key via the API (requires JWT authentication):
curl -X POST https://api.datablue.dev/v1/auth/api-keys \
-H "Authorization: Bearer eyJ..." \
-H "Content-Type: application/json" \
-d '{"name": "my-scraper"}'
Quick Start
Get up and running in under a minute. Scrape a page and get clean markdown back.
1. Get your API key
Create an API key from your DataBlue dashboard at Settings → API Keys.
2. Make your first request
curl -X POST https://api.datablue.dev/v1/scrape \
-H "Authorization: Bearer wh_your_api_key" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"formats": ["markdown", "links"]
}'
3. Get your results
{
"success": true,
"data": {
"markdown": "# Example Domain\n\nThis domain is for use in illustrative examples...",
"links": [
{ "url": "https://www.iana.org/domains/example", "text": "More information..." }
],
"metadata": {
"title": "Example Domain",
"description": "Example Domain",
"status_code": 200,
"word_count": 58
}
}
}
/v1/scrape
Scrape
Extract content from any URL. DataBlue handles JavaScript rendering, anti-bot bypass, proxy rotation, and content extraction automatically. Uses a staggered race strategy — HTTP at t=0, browser at t=3s — and the first valid result wins.
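The same request can be issued from Python. A sketch using only the standard library — the helper name build_scrape_request is illustrative, not part of any official SDK; the endpoint, auth header, and parameters are as documented here:

```python
import json
import urllib.request

API_KEY = "wh_your_api_key"  # your DataBlue API key

def build_scrape_request(url, formats=("markdown",), **options):
    """Assemble a POST /v1/scrape request; extra keyword arguments
    (only_main_content, timeout, use_proxy, ...) pass through as-is."""
    payload = {"url": url, "formats": list(formats), **options}
    return urllib.request.Request(
        "https://api.datablue.dev/v1/scrape",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# To actually send it:
# with urllib.request.urlopen(build_scrape_request("https://example.com")) as resp:
#     result = json.loads(resp.read())
```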
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
urlrequired | string | — | The URL to scrape. |
formats | string[] | ["markdown"] | Output formats to return. Options: "markdown", "html", "raw_html", "links", "screenshot", "structured_data", "headings", "images". |
only_main_content | bool | true | Extract only the main content, removing navs, footers, sidebars. |
mobile | bool | false | Emulate a mobile device viewport. |
timeout | number | 30000 | Request timeout in milliseconds. |
wait_for | number | 0 | Wait for this many milliseconds after page load before extracting. |
css_selector | string | — | Only extract content matching this CSS selector. |
include_tags | string[] | — | Only include these HTML tags in extraction. |
exclude_tags | string[] | — | Exclude these HTML tags from extraction. |
use_proxy | bool | false | Route request through a proxy for anti-bot bypass. |
Example Request
curl -X POST https://api.datablue.dev/v1/scrape \
-H "Authorization: Bearer wh_your_api_key" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"formats": ["markdown", "links"]
}'
Example Response
{
"success": true,
"data": {
"markdown": "# Example Domain\n\nThis domain is for use in illustrative examples...",
"links": [
{ "url": "https://www.iana.org/domains/example", "text": "More information..." }
],
"metadata": {
"title": "Example Domain",
"description": "Example Domain",
"status_code": 200,
"word_count": 58
}
}
}
/v1/crawl
Crawl
Crawl a website by following internal links. Supports BFS (breadth-first), DFS (depth-first), and best-first crawl strategies. Returns progressive results as pages are scraped. The job runs asynchronously — poll the job endpoint for status and results.
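A client typically polls GET /v1/crawl/{job_id} until the status settles. A minimal polling helper, sketched in Python — the fetch_status callable is whatever HTTP call you wire in, and "failed" as a terminal status is an assumption ("started" and "completed" appear in the examples):

```python
import time

def wait_for_crawl(fetch_status, interval=2.0, max_wait=300.0):
    """Poll until the crawl job reaches a terminal status.

    fetch_status: a callable that performs GET /v1/crawl/{job_id}
    and returns the parsed JSON body.
    """
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        job = fetch_status()
        if job.get("status") in ("completed", "failed"):
            return job
        time.sleep(interval)
    raise TimeoutError("crawl job did not finish within max_wait")
```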
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
urlrequired | string | — | The starting URL to crawl. |
max_pages | number | 100 | Maximum number of pages to crawl (max 1000). |
max_depth | number | 3 | Maximum link depth from the starting URL (max 10). |
concurrency | number | 3 | Number of parallel scrape workers (max 10). |
crawl_strategy | string | "bfs" | Crawl strategy: "bfs" (breadth-first), "dfs" (depth-first), or "bff" (best-first). |
allow_external_links | bool | false | Follow links to external domains. |
respect_robots_txt | bool | true | Obey the target site's robots.txt rules. |
include_paths | string[] | — | Only crawl URLs matching these path patterns. |
exclude_paths | string[] | — | Skip URLs matching these path patterns. |
scrape_options | object | — | Options passed to each page scrape: { formats, only_main_content }. |
Example Request
curl -X POST https://api.datablue.dev/v1/crawl \
-H "Authorization: Bearer wh_your_api_key" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"max_pages": 50,
"scrape_options": {
"formats": ["markdown"],
"only_main_content": true
}
}'
Example Response
{
"success": true,
"job_id": "crawl_a1b2c3d4",
"status": "started"
}
// Poll GET /v1/crawl/crawl_a1b2c3d4 for results:
{
"success": true,
"status": "completed",
"total_pages": 47,
"data": [
{
"url": "https://example.com",
"markdown": "# Example Domain...",
"metadata": { "title": "Example Domain", "status_code": 200 }
},
...
]
}
/v1/search
Search
Perform a web search via SERP and scrape each result page for clean content. Supports multiple search engines including Google, DuckDuckGo, and Brave. The job runs asynchronously — poll for results, each containing the page URL, extracted markdown, and metadata.
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
queryrequired | string | — | The search query. |
num_results | number | 10 | Number of results to return (max 100). |
engine | string | "google" | Search engine: "google", "duckduckgo", or "brave". |
formats | string[] | ["markdown"] | Output formats for each scraped result page. |
only_main_content | bool | true | Extract only the main content from each result. |
mobile | bool | false | Emulate mobile device for result pages. |
Example Request
curl -X POST https://api.datablue.dev/v1/search \
-H "Authorization: Bearer wh_your_api_key" \
-H "Content-Type: application/json" \
-d '{
"query": "web scraping best practices",
"num_results": 10,
"formats": ["markdown"]
}'
Example Response
{
"success": true,
"job_id": "search_e5f6g7h8",
"status": "started"
}
// Poll for results:
{
"success": true,
"status": "completed",
"data": [
{
"url": "https://example.com/scraping-guide",
"markdown": "# Web Scraping Best Practices...",
"metadata": {
"title": "Web Scraping Best Practices",
"description": "A comprehensive guide...",
"status_code": 200
}
},
...
]
}
/v1/map
Map
Discover all URLs on a website without scraping content. Combines sitemap.xml parsing with link extraction for comprehensive URL discovery. Returns URLs with metadata including title, description, last modified date, and priority.
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
urlrequired | string | — | The website URL to map. |
limit | number | 100 | Maximum number of URLs to return. |
include_subdomains | bool | true | Include URLs from subdomains. |
use_sitemap | bool | true | Use the site's sitemap.xml for discovery. |
search | string | — | Optional filter to match URLs or titles. |
Example Request
curl -X POST https://api.datablue.dev/v1/map \
-H "Authorization: Bearer wh_your_api_key" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"limit": 100,
"use_sitemap": true
}'
Example Response
{
"success": true,
"total": 47,
"links": [
{
"url": "https://example.com/about",
"title": "About Us",
"description": "Learn more about our company",
"lastmod": "2025-12-01",
"priority": 0.8
},
{
"url": "https://example.com/blog",
"title": "Blog",
"description": "Latest articles and news",
"lastmod": "2025-12-15",
"priority": 0.7
},
...
]
}
/v1/extract
Extract
LLM-powered structured data extraction. Provide a natural language prompt describing what data you need and an optional JSON Schema for the output format. DataBlue scrapes the page, sends the content to an LLM, and returns structured data matching your schema.
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
urlrequired | string | — | The URL to extract data from. |
promptrequired | string | — | Natural language instruction for what data to extract. |
schema | object | — | JSON Schema defining the expected output structure. |
model | string | — | LLM model to use for extraction. |
formats | string[] | ["markdown"] | Additional output formats to include alongside extracted data. |
Example Request
curl -X POST https://api.datablue.dev/v1/extract \
-H "Authorization: Bearer wh_your_api_key" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/products",
"prompt": "Extract all product names and prices",
"schema": {
"type": "object",
"properties": {
"products": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": { "type": "string" },
"price": { "type": "number" }
}
}
}
}
}
}'
Example Response
{
"success": true,
"data": {
"extract": {
"products": [
{ "name": "Wireless Headphones", "price": 79.99 },
{ "name": "Bluetooth Speaker", "price": 49.99 },
{ "name": "USB-C Hub", "price": 34.99 }
]
},
"markdown": "# Products\n\n- Wireless Headphones — $79.99...",
"metadata": {
"title": "Products - Example Store",
"status_code": 200
}
}
}
Output Formats
Control what data you receive by passing one or more format strings in the formats array. Each format returns a corresponding field in the response.
markdown
Clean markdown conversion of the page content. Preserves headings, lists, links, tables, and inline formatting. Scripts, styles, and navigation are removed.
"markdown": "# Page Title\n\nIntroduction paragraph...\n\n## Section\n\n- Item one\n- Item two"
html / raw_html
html returns cleaned HTML with scripts, styles, and tracking elements removed. raw_html returns the original, unmodified HTML source.
"html": "<h1>Page Title</h1><p>Content paragraph...</p>",
"raw_html": "<!DOCTYPE html><html>...full source...</html>"
links
Array of all URLs found on the page, including anchor text.
"links": [
{ "url": "https://example.com/about", "text": "About Us" },
{ "url": "https://example.com/contact", "text": "Contact" }
]
screenshot
Base64-encoded PNG screenshot of the fully rendered page. Requires a browser-based scrape.
"screenshot": "data:image/png;base64,iVBORw0KGgoAAAANSUhEU..."
structured_data
Extracted structured data from the page including JSON-LD, OpenGraph tags, Twitter Cards, and Schema.org markup.
"structured_data": {
"json_ld": [{ "@type": "Article", "headline": "..." }],
"opengraph": { "og:title": "...", "og:description": "..." },
"twitter": { "twitter:card": "summary_large_image" },
"schema_org": [{ "@type": "Organization", "name": "..." }]
}
headings
Heading hierarchy extracted from the page, useful for understanding document structure.
"headings": [
{ "level": 1, "text": "Main Title" },
{ "level": 2, "text": "First Section" },
{ "level": 3, "text": "Subsection" },
{ "level": 2, "text": "Second Section" }
]
images
All images found on the page with source URLs, alt text, and dimensions when available.
"images": [
{ "src": "https://example.com/hero.jpg", "alt": "Hero image", "width": 1200, "height": 630 },
{ "src": "https://example.com/logo.png", "alt": "Company logo", "width": 200, "height": 50 }
]
Data APIs
Pre-built data extraction endpoints for popular platforms. These endpoints handle the complexity of scraping specific data sources and return clean, structured JSON.
/v1/data/google/search
Google Search
Scrape Google search results including organic results, featured snippets, knowledge panels, and "People also ask" boxes.
curl -X POST https://api.datablue.dev/v1/data/google/search \
-H "Authorization: Bearer wh_your_api_key" \
-H "Content-Type: application/json" \
-d '{"query": "web scraping api", "num_results": 20, "gl": "us", "hl": "en"}'
/v1/data/google/maps
Google Maps
Extract business listings from Google Maps including name, address, rating, reviews, phone number, website, hours, and coordinates. Supports grid search for comprehensive coverage.
curl -X POST https://api.datablue.dev/v1/data/google/maps \
-H "Authorization: Bearer wh_your_api_key" \
-H "Content-Type: application/json" \
-d '{"query": "coffee shops", "location": "San Francisco, CA", "limit": 50}'
/v1/data/google/news
Google News
Scrape Google News articles for any topic. Returns headline, source, publication date, snippet, and link to the full article.
curl -X POST https://api.datablue.dev/v1/data/google/news \
-H "Authorization: Bearer wh_your_api_key" \
-H "Content-Type: application/json" \
-d '{"query": "artificial intelligence", "num_results": 20, "time_range": "week"}'
/v1/data/google/jobs
Google Jobs
Extract job listings from Google Jobs. Returns title, company, location, salary range, description, posted date, and application link.
curl -X POST https://api.datablue.dev/v1/data/google/jobs \
-H "Authorization: Bearer wh_your_api_key" \
-H "Content-Type: application/json" \
-d '{"query": "software engineer", "location": "New York", "num_results": 30}'
/v1/data/google/images
Google Images
Scrape Google Image search results. Returns image URL, thumbnail, source page, title, and dimensions.
curl -X POST https://api.datablue.dev/v1/data/google/images \
-H "Authorization: Bearer wh_your_api_key" \
-H "Content-Type: application/json" \
-d '{"query": "modern architecture", "num_results": 30, "size": "large"}'
/v1/data/google/flights
Google Flights
Search Google Flights for pricing and schedule data. Returns flight options with airline, departure/arrival times, duration, stops, and price. HTTP-only with protobuf encoding.
curl -X POST https://api.datablue.dev/v1/data/google/flights \
-H "Authorization: Bearer wh_your_api_key" \
-H "Content-Type: application/json" \
-d '{
"origin": "SFO",
"destination": "JFK",
"departure_date": "2026-04-15",
"return_date": "2026-04-22",
"passengers": 1
}'
/v1/data/google/finance
Google Finance
Get stock and market data from Google Finance. Returns current price, change, market cap, P/E ratio, and historical data. Supports market overview and individual quote pages.
curl -X POST https://api.datablue.dev/v1/data/google/finance \
-H "Authorization: Bearer wh_your_api_key" \
-H "Content-Type: application/json" \
-d '{"symbol": "AAPL", "exchange": "NASDAQ"}'
/v1/data/google/shopping
Google Shopping
Scrape Google Shopping results. Returns product name, price, seller, rating, image, and link for comparison shopping data.
curl -X POST https://api.datablue.dev/v1/data/google/shopping \
-H "Authorization: Bearer wh_your_api_key" \
-H "Content-Type: application/json" \
-d '{"query": "wireless headphones", "num_results": 20, "min_price": 50, "max_price": 200}'
/v1/data/amazon
Amazon
Extract product data from Amazon. Returns product title, price, rating, review count, images, description, features, and availability. Supports search and individual product URLs.
curl -X POST https://api.datablue.dev/v1/data/amazon \
-H "Authorization: Bearer wh_your_api_key" \
-H "Content-Type: application/json" \
-d '{"query": "mechanical keyboard", "num_results": 20, "marketplace": "us"}'
Rate Limits
Rate limits are enforced per API key on a per-minute basis. When you exceed the limit, you'll receive a 429 status code, and the response includes a Retry-After header indicating how long to wait before retrying.
| Plan | Scrape | Crawl | Search | Map |
|---|---|---|---|---|
| Free | 10/min | 2/min | 5/min | 5/min |
| Starter | 50/min | 10/min | 20/min | 20/min |
| Pro | 300/min | 50/min | 100/min | 100/min |
| Growth | 300/min | 50/min | 100/min | 100/min |
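A 429 can be handled generically by sleeping for the Retry-After period and retrying. A sketch — the send callable is a stand-in for whatever HTTP client you use; it should return the status code, response headers, and parsed body:

```python
import time

def request_with_retry(send, max_retries=3):
    """Retry a request on 429, honoring the Retry-After header.

    send: callable performing the HTTP request, returning
    (status_code, headers_dict, parsed_body).
    """
    status, headers, body = send()
    for _ in range(max_retries):
        if status != 429:
            break
        time.sleep(float(headers.get("Retry-After", "1")))
        status, headers, body = send()
    return status, body
```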
Error Codes
All errors return a JSON body with { "success": false, "error": "message" }.
| Code | Status | Description |
|---|---|---|
400 | Bad Request | Missing or invalid parameters. Check the request body. |
401 | Unauthorized | Missing or invalid API key / JWT token. |
403 | Forbidden | Your plan does not allow this action or resource. |
404 | Not Found | The requested resource or job ID does not exist. |
429 | Rate Limited | Too many requests. Wait and retry after the Retry-After period. |
500 | Server Error | Internal server error. Retry the request or contact support. |
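Because every error carries the same { "success": false, "error": "message" } shape, a single check covers all endpoints. A sketch (the helper and exception names are illustrative):

```python
class DataBlueError(Exception):
    """Raised when the API reports a failure."""

def check_response(status_code, body):
    """Validate a parsed JSON response: return it on success,
    raise DataBlueError with the API's message otherwise."""
    if status_code >= 400 or not body.get("success", False):
        raise DataBlueError(f"{status_code}: {body.get('error', 'unknown error')}")
    return body
```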
Webhooks
Instead of polling for job results, you can provide a webhook URL that DataBlue will call when the job completes. Pass webhook_url and optionally webhook_secret with any async job request.
Sending a webhook
curl -X POST https://api.datablue.dev/v1/crawl \
-H "Authorization: Bearer wh_your_api_key" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"max_pages": 50,
"webhook_url": "https://your-server.com/webhook",
"webhook_secret": "your_secret_key"
}'
Webhook payload
DataBlue sends a POST request to your webhook URL with the job results in the body. If you provided a webhook_secret, the request includes an X-Webhook-Signature header containing an HMAC-SHA256 signature of the request body.
// Headers
X-Webhook-Signature: sha256=a1b2c3d4e5f6...
Content-Type: application/json
// Body
{
"success": true,
"job_id": "crawl_a1b2c3d4",
"status": "completed",
"total_pages": 47,
"data": [...]
}
Verifying signatures
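On your server, recompute the HMAC over the raw (unparsed) request body and compare it to the header with a constant-time comparison. A Python sketch — the sha256= prefix matches the header format shown above:

```python
import hashlib
import hmac

def verify_webhook(raw_body: bytes, signature_header: str, secret: str) -> bool:
    """Check an X-Webhook-Signature header ("sha256=<hex>") against
    the HMAC-SHA256 of the raw request body, keyed by webhook_secret."""
    expected = hmac.new(secret.encode("utf-8"), raw_body, hashlib.sha256).hexdigest()
    received = signature_header.removeprefix("sha256=")
    return hmac.compare_digest(expected, received)
```

Always verify against the raw bytes as received; re-serializing the parsed JSON can change key order or whitespace and break the signature.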
# Signature is HMAC-SHA256 of the raw request body
# using your webhook_secret as the key
Plans
DataBlue offers tiered plans to match your usage. All plans include access to every endpoint. Higher plans unlock higher rate limits, more concurrent crawls, and priority support.
Free
$0/mo
- 500 scrapes/month
- 10 crawl jobs/month
- Community support
- 1 API key
Starter
$29/mo
- 5,000 scrapes/month
- 100 crawl jobs/month
- Email support
- 5 API keys
Pro
$99/mo
- 50,000 scrapes/month
- 1,000 crawl jobs/month
- Priority support
- Unlimited API keys
- Webhooks
Growth
$299/mo
- Unlimited scrapes
- Unlimited crawl jobs
- Dedicated support
- Unlimited API keys
- Webhooks
- Custom rate limits