DataBlue API Documentation

DataBlue is a self-hosted web scraping platform that provides a simple, powerful API for extracting content from any website. It handles JavaScript rendering, anti-bot bypass, proxy rotation, and content extraction so you can focus on building your application.

Scrape

Extract content from any URL with JS rendering.

Crawl

Follow links and scrape entire websites.

Extract

LLM-powered structured data extraction.

The API is fully compatible with the Firecrawl API spec. Base URL: https://api.datablue.dev/v1

Authentication

All API requests require authentication via the Authorization header. DataBlue supports two authentication methods:

API Key (Recommended)

API keys have the prefix wh_ and do not expire. Create them from the dashboard or via the API.

bash
# Using an API key
curl -H "Authorization: Bearer wh_your_api_key" \
  https://api.datablue.dev/v1/scrape

JWT Token

JWT tokens are obtained from POST /v1/auth/login and expire after 7 days. They are primarily used by the dashboard frontend.

bash
# Obtain a JWT
curl -X POST https://api.datablue.dev/v1/auth/login \
  -H "Content-Type: application/json" \
  -d '{"email": "user@example.com", "password": "your_password"}'

Create an API Key

Programmatically create a new API key via the API (requires JWT authentication):

bash
curl -X POST https://api.datablue.dev/v1/auth/api-keys \
  -H "Authorization: Bearer eyJ..." \
  -H "Content-Type: application/json" \
  -d '{"name": "my-scraper"}'

Quick Start

Get up and running in under a minute. Scrape a page and get clean markdown back.

1. Get your API key

Create an API key from your DataBlue dashboard at Settings → API Keys.

2. Make your first request

bash
curl -X POST https://api.datablue.dev/v1/scrape \
  -H "Authorization: Bearer wh_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown", "links"]
  }'

3. Get your results

json
{
  "success": true,
  "data": {
    "markdown": "# Example Domain\n\nThis domain is for use in illustrative examples...",
    "links": [
      { "url": "https://www.iana.org/domains/example", "text": "More information..." }
    ],
    "metadata": {
      "title": "Example Domain",
      "description": "Example Domain",
      "status_code": 200,
      "word_count": 58
    }
  }
}
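The same request can be made from Python using only the standard library (a minimal sketch, not an official SDK; replace the placeholder key with your own):

```python
import json
import urllib.request

API_KEY = "wh_your_api_key"  # replace with your DataBlue API key
BASE_URL = "https://api.datablue.dev/v1"

def scrape(url, formats=("markdown",)):
    """POST /v1/scrape and return the parsed JSON response body."""
    payload = json.dumps({"url": url, "formats": list(formats)}).encode()
    req = urllib.request.Request(
        f"{BASE_URL}/scrape",
        data=payload,
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)
```

Calling `scrape("https://example.com", ["markdown", "links"])` returns the same JSON body shown above.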
POST /v1/scrape

Scrape

Extract content from any URL. DataBlue handles JavaScript rendering, anti-bot bypass, proxy rotation, and content extraction automatically. Uses a staggered race strategy — HTTP at t=0, browser at t=3s — and the first valid result wins.
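The staggered race happens server-side and needs no client configuration, but the pattern can be sketched in Python with asyncio (illustrative only; the fetcher functions here are hypothetical stand-ins for DataBlue's internal HTTP and browser scrapers):

```python
import asyncio

async def staggered_race(http_fetch, browser_fetch, stagger=3.0):
    """Start http_fetch immediately and browser_fetch after `stagger`
    seconds; the first result that is not None (a "valid" result) wins."""
    async def delayed_browser():
        await asyncio.sleep(stagger)
        return await browser_fetch()

    tasks = [asyncio.ensure_future(http_fetch()),
             asyncio.ensure_future(delayed_browser())]
    try:
        for fut in asyncio.as_completed(tasks):
            result = await fut
            if result is not None:  # first valid result wins
                return result
        return None  # neither strategy produced a valid result
    finally:
        # Cancel whichever fetcher is still running and reap the tasks.
        for t in tasks:
            t.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)
```

In practice this means fast static pages return in well under a second, while JavaScript-heavy pages fall through to the browser automatically.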

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| url (required) | string | — | The URL to scrape. |
| formats | string[] | ["markdown"] | Output formats to return. Options: "markdown", "html", "raw_html", "links", "screenshot", "structured_data", "headings", "images". |
| only_main_content | bool | true | Extract only the main content, removing navs, footers, sidebars. |
| mobile | bool | false | Emulate a mobile device viewport. |
| timeout | number | 30000 | Request timeout in milliseconds. |
| wait_for | number | 0 | Wait this many milliseconds after page load before extracting. |
| css_selector | string | — | Only extract content matching this CSS selector. |
| include_tags | string[] | — | Only include these HTML tags in extraction. |
| exclude_tags | string[] | — | Exclude these HTML tags from extraction. |
| use_proxy | bool | false | Route the request through a proxy for anti-bot bypass. |

Example Request

bash
curl -X POST https://api.datablue.dev/v1/scrape \
  -H "Authorization: Bearer wh_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown", "links"]
  }'

Example Response

json
{
  "success": true,
  "data": {
    "markdown": "# Example Domain\n\nThis domain is for use in illustrative examples...",
    "links": [
      { "url": "https://www.iana.org/domains/example", "text": "More information..." }
    ],
    "metadata": {
      "title": "Example Domain",
      "description": "Example Domain",
      "status_code": 200,
      "word_count": 58
    }
  }
}
POST /v1/crawl

Crawl

Crawl a website by following internal links. Supports BFS (breadth-first), DFS (depth-first), and best-first crawl strategies. Returns progressive results as pages are scraped. The job runs asynchronously — poll the job endpoint for status and results.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| url (required) | string | — | The starting URL to crawl. |
| max_pages | number | 100 | Maximum number of pages to crawl (max 1000). |
| max_depth | number | 3 | Maximum link depth from the starting URL (max 10). |
| concurrency | number | 3 | Number of parallel scrape workers (max 10). |
| crawl_strategy | string | "bfs" | Crawl strategy: "bfs" (breadth-first), "dfs" (depth-first), or "bff" (best-first). |
| allow_external_links | bool | false | Follow links to external domains. |
| respect_robots_txt | bool | true | Obey the target site's robots.txt rules. |
| include_paths | string[] | — | Only crawl URLs matching these path patterns. |
| exclude_paths | string[] | — | Skip URLs matching these path patterns. |
| scrape_options | object | — | Options passed to each page scrape: { formats, only_main_content }. |

Example Request

bash
curl -X POST https://api.datablue.dev/v1/crawl \
  -H "Authorization: Bearer wh_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "max_pages": 50,
    "scrape_options": {
      "formats": ["markdown"],
      "only_main_content": true
    }
  }'

Example Response

json
{
  "success": true,
  "job_id": "crawl_a1b2c3d4",
  "status": "started"
}

// Poll GET /v1/crawl/crawl_a1b2c3d4 for results:
{
  "success": true,
  "status": "completed",
  "total_pages": 47,
  "data": [
    {
      "url": "https://example.com",
      "markdown": "# Example Domain...",
      "metadata": { "title": "Example Domain", "status_code": 200 }
    },
    ...
  ]
}
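The polling loop above can be sketched in Python (a minimal example using the standard library; the intermediate status names beyond "started" and "completed" are assumptions, so treat anything that is not "started" or "running" as terminal and inspect the body):

```python
import json
import time
import urllib.request

BASE_URL = "https://api.datablue.dev/v1"

def wait_for_crawl(job_id, api_key, poll_interval=5, fetch=None):
    """Poll GET /v1/crawl/{job_id} until the job leaves its running states.

    `fetch` is injectable for testing; by default it issues the real request.
    """
    def default_fetch(url):
        req = urllib.request.Request(
            url, headers={"Authorization": f"Bearer {api_key}"})
        with urllib.request.urlopen(req, timeout=30) as resp:
            return json.load(resp)

    fetch = fetch or default_fetch
    while True:
        job = fetch(f"{BASE_URL}/crawl/{job_id}")
        # "started" is documented; "running" is an assumed in-progress state.
        if job["status"] not in ("started", "running"):
            return job
        time.sleep(poll_interval)
```

For long crawls, prefer a webhook (see Webhooks below) over tight polling.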
POST /v1/map

Map

Discover all URLs on a website without scraping content. Combines sitemap.xml parsing with link extraction for comprehensive URL discovery. Returns URLs with metadata including title, description, last modified date, and priority.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| url (required) | string | — | The website URL to map. |
| limit | number | 100 | Maximum number of URLs to return. |
| include_subdomains | bool | true | Include URLs from subdomains. |
| use_sitemap | bool | true | Use the site's sitemap.xml for discovery. |
| search | string | — | Optional filter to match URLs or titles. |

Example Request

bash
curl -X POST https://api.datablue.dev/v1/map \
  -H "Authorization: Bearer wh_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "limit": 100,
    "use_sitemap": true
  }'

Example Response

json
{
  "success": true,
  "total": 47,
  "links": [
    {
      "url": "https://example.com/about",
      "title": "About Us",
      "description": "Learn more about our company",
      "lastmod": "2025-12-01",
      "priority": 0.8
    },
    {
      "url": "https://example.com/blog",
      "title": "Blog",
      "description": "Latest articles and news",
      "lastmod": "2025-12-15",
      "priority": 0.7
    },
    ...
  ]
}
POST /v1/extract

Extract

LLM-powered structured data extraction. Provide a natural language prompt describing what data you need and an optional JSON Schema for the output format. DataBlue scrapes the page, sends the content to an LLM, and returns structured data matching your schema.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| url (required) | string | — | The URL to extract data from. |
| prompt (required) | string | — | Natural language instruction for what data to extract. |
| schema | object | — | JSON Schema defining the expected output structure. |
| model | string | — | LLM model to use for extraction. |
| formats | string[] | ["markdown"] | Additional output formats to include alongside extracted data. |

Example Request

bash
curl -X POST https://api.datablue.dev/v1/extract \
  -H "Authorization: Bearer wh_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/products",
    "prompt": "Extract all product names and prices",
    "schema": {
      "type": "object",
      "properties": {
        "products": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "name": { "type": "string" },
              "price": { "type": "number" }
            }
          }
        }
      }
    }
  }'

Example Response

json
{
  "success": true,
  "data": {
    "extract": {
      "products": [
        { "name": "Wireless Headphones", "price": 79.99 },
        { "name": "Bluetooth Speaker", "price": 49.99 },
        { "name": "USB-C Hub", "price": 34.99 }
      ]
    },
    "markdown": "# Products\n\n- Wireless Headphones — $79.99...",
    "metadata": {
      "title": "Products - Example Store",
      "status_code": 200
    }
  }
}

Output Formats

Control what data you receive by passing one or more format strings in the formats array. Each format returns a corresponding field in the response.

markdown

Clean markdown conversion of the page content. Preserves headings, lists, links, tables, and inline formatting. Scripts, styles, and navigation are removed.

json
"markdown": "# Page Title\n\nIntroduction paragraph...\n\n## Section\n\n- Item one\n- Item two"

html / raw_html

html returns cleaned HTML with scripts, styles, and tracking elements removed. raw_html returns the original, unmodified HTML source.

json
"html": "<h1>Page Title</h1><p>Content paragraph...</p>",
"raw_html": "<!DOCTYPE html><html>...full source...</html>"

screenshot

Base64-encoded PNG screenshot of the fully rendered page. Requires a browser-based scrape.

json
"screenshot": "data:image/png;base64,iVBORw0KGgoAAAANSUhEU..."

structured_data

Extracted structured data from the page including JSON-LD, OpenGraph tags, Twitter Cards, and Schema.org markup.

json
"structured_data": {
  "json_ld": [{ "@type": "Article", "headline": "..." }],
  "opengraph": { "og:title": "...", "og:description": "..." },
  "twitter": { "twitter:card": "summary_large_image" },
  "schema_org": [{ "@type": "Organization", "name": "..." }]
}

headings

Heading hierarchy extracted from the page, useful for understanding document structure.

json
"headings": [
  { "level": 1, "text": "Main Title" },
  { "level": 2, "text": "First Section" },
  { "level": 3, "text": "Subsection" },
  { "level": 2, "text": "Second Section" }
]

images

All images found on the page with source URLs, alt text, and dimensions when available.

json
"images": [
  { "src": "https://example.com/hero.jpg", "alt": "Hero image", "width": 1200, "height": 630 },
  { "src": "https://example.com/logo.png", "alt": "Company logo", "width": 200, "height": 50 }
]
POST /v1/data/google/maps

Google Maps

Extract business listings from Google Maps including name, address, rating, reviews, phone number, website, hours, and coordinates. Supports grid search for comprehensive coverage.

bash
curl -X POST https://api.datablue.dev/v1/data/google/maps \
  -H "Authorization: Bearer wh_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{"query": "coffee shops", "location": "San Francisco, CA", "limit": 50}'
POST /v1/data/google/news

Google News

Scrape Google News articles for any topic. Returns headline, source, publication date, snippet, and link to the full article.

bash
curl -X POST https://api.datablue.dev/v1/data/google/news \
  -H "Authorization: Bearer wh_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{"query": "artificial intelligence", "num_results": 20, "time_range": "week"}'
POST /v1/data/google/jobs

Google Jobs

Extract job listings from Google Jobs. Returns title, company, location, salary range, description, posted date, and application link.

bash
curl -X POST https://api.datablue.dev/v1/data/google/jobs \
  -H "Authorization: Bearer wh_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{"query": "software engineer", "location": "New York", "num_results": 30}'
POST /v1/data/google/images

Google Images

Scrape Google Image search results. Returns image URL, thumbnail, source page, title, and dimensions.

bash
curl -X POST https://api.datablue.dev/v1/data/google/images \
  -H "Authorization: Bearer wh_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{"query": "modern architecture", "num_results": 30, "size": "large"}'
POST /v1/data/google/flights

Google Flights

Search Google Flights for pricing and schedule data. Returns flight options with airline, departure/arrival times, duration, stops, and price. HTTP-only with protobuf encoding.

bash
curl -X POST https://api.datablue.dev/v1/data/google/flights \
  -H "Authorization: Bearer wh_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "origin": "SFO",
    "destination": "JFK",
    "departure_date": "2026-04-15",
    "return_date": "2026-04-22",
    "passengers": 1
  }'
POST /v1/data/google/finance

Google Finance

Get stock and market data from Google Finance. Returns current price, change, market cap, P/E ratio, and historical data. Supports market overview and individual quote pages.

bash
curl -X POST https://api.datablue.dev/v1/data/google/finance \
  -H "Authorization: Bearer wh_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{"symbol": "AAPL", "exchange": "NASDAQ"}'
POST /v1/data/google/shopping

Google Shopping

Scrape Google Shopping results. Returns product name, price, seller, rating, image, and link for comparison shopping data.

bash
curl -X POST https://api.datablue.dev/v1/data/google/shopping \
  -H "Authorization: Bearer wh_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{"query": "wireless headphones", "num_results": 20, "min_price": 50, "max_price": 200}'
POST /v1/data/amazon

Amazon

Extract product data from Amazon. Returns product title, price, rating, review count, images, description, features, and availability. Supports search and individual product URLs.

bash
curl -X POST https://api.datablue.dev/v1/data/amazon \
  -H "Authorization: Bearer wh_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{"query": "mechanical keyboard", "num_results": 20, "marketplace": "us"}'

Rate Limits

Rate limits are enforced per API key on a per-minute basis. When you exceed the limit, you'll receive a 429 status code, and the response includes a Retry-After header indicating how long to wait before retrying.

| Plan | Scrape | Crawl | Search | Map |
| --- | --- | --- | --- | --- |
| Free | 10/min | 2/min | 5/min | 5/min |
| Starter | 50/min | 10/min | 20/min | 20/min |
| Pro | 300/min | 50/min | 100/min | 100/min |
| Growth | 300/min | 50/min | 100/min | 100/min |
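A 429 response should be handled by sleeping for the Retry-After interval and retrying. A minimal Python sketch using the standard library (it assumes Retry-After carries a seconds value, the common case, rather than an HTTP-date):

```python
import time
import urllib.error
import urllib.request

def retry_after_delay(headers, default=2.0):
    """Parse a Retry-After header value in seconds; fall back to `default`
    when the header is absent or not numeric."""
    try:
        return float(headers.get("Retry-After", default))
    except (TypeError, ValueError):
        return default

def request_with_retry(req, max_retries=3, sleep=time.sleep):
    """urlopen with automatic backoff on 429 responses; other HTTP
    errors are raised unchanged."""
    for attempt in range(max_retries + 1):
        try:
            return urllib.request.urlopen(req, timeout=60)
        except urllib.error.HTTPError as err:
            if err.code != 429 or attempt == max_retries:
                raise
            sleep(retry_after_delay(err.headers))
```

`err.headers` on an HTTPError supports `.get()`, so the same parser works for both live responses and plain dicts in tests.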

Error Codes

All errors return a JSON body with { "success": false, "error": "message" }.

| Code | Status | Description |
| --- | --- | --- |
| 400 | Bad Request | Missing or invalid parameters. Check the request body. |
| 401 | Unauthorized | Missing or invalid API key / JWT token. |
| 403 | Forbidden | Your plan does not allow this action or resource. |
| 404 | Not Found | The requested resource or job ID does not exist. |
| 429 | Rate Limited | Too many requests. Wait and retry after the Retry-After period. |
| 500 | Server Error | Internal server error. Retry the request or contact support. |

Webhooks

Instead of polling for job results, you can provide a webhook URL that DataBlue will call when the job completes. Pass webhook_url and optionally webhook_secret with any async job request.

Registering a webhook

bash
curl -X POST https://api.datablue.dev/v1/crawl \
  -H "Authorization: Bearer wh_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "max_pages": 50,
    "webhook_url": "https://your-server.com/webhook",
    "webhook_secret": "your_secret_key"
  }'

Webhook payload

DataBlue sends a POST request to your webhook URL with the job results in the body. If you provided a webhook_secret, the request includes an X-Webhook-Signature header containing an HMAC-SHA256 signature of the request body.

text
// Headers
X-Webhook-Signature: sha256=a1b2c3d4e5f6...
Content-Type: application/json

// Body
{
  "success": true,
  "job_id": "crawl_a1b2c3d4",
  "status": "completed",
  "total_pages": 47,
  "data": [...]
}

Verifying signatures

The signature is the HMAC-SHA256 of the raw request body, keyed with your webhook_secret. Recompute it on your server and compare it against the X-Webhook-Signature header using a constant-time comparison.
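A minimal verification sketch in Python (hmac and hashlib are from the standard library; the function name is illustrative):

```python
import hashlib
import hmac

def verify_signature(raw_body: bytes, secret: str, header_value: str) -> bool:
    """Check the X-Webhook-Signature header ("sha256=<hex>") against an
    HMAC-SHA256 of the raw request body keyed with your webhook_secret."""
    expected = hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    # compare_digest prevents timing attacks on the comparison.
    return hmac.compare_digest(f"sha256={expected}", header_value)
```

Always verify against the raw bytes of the request body, before any JSON parsing or re-serialization, since re-encoding can change whitespace and break the signature.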

Plans

DataBlue offers tiered plans to match your usage. All plans include access to every endpoint. Higher plans unlock higher rate limits, more concurrent crawls, and priority support.

Free

$0/mo

  • 500 scrapes/month
  • 10 crawl jobs/month
  • Community support
  • 1 API key

Starter

$29/mo

  • 5,000 scrapes/month
  • 100 crawl jobs/month
  • Email support
  • 5 API keys

Pro

$99/mo

  • 50,000 scrapes/month
  • 1,000 crawl jobs/month
  • Priority support
  • Unlimited API keys
  • Webhooks

Growth

$299/mo

  • Unlimited scrapes
  • Unlimited crawl jobs
  • Dedicated support
  • Unlimited API keys
  • Webhooks
  • Custom rate limits