Web Scraping - Geekflare

The Web Scraping endpoint scrapes any URL and returns clean page content in your preferred format. It handles JavaScript-heavy sites, blocks ads, rotates proxies, and can extract structured data using CSS or XPath selectors.

Endpoint: POST https://api.geekflare.com/webscraping

Basic Scrape

Scrape a URL and get back LLM-ready markdown.

import requests

response = requests.post(
    "https://api.geekflare.com/webscraping",
    headers={"x-api-key": "YOUR_API_KEY"},
    json={"url": "https://example.com"}
)
print(response.json())

Response

{
  "timestamp": 1778737930991,
  "apiStatus": "success",
  "apiCode": 200,
  "meta": {
    "url": "https://example.com",
    "device": "desktop",
    "format": ["html-llm"],
    "fileOutput": false,
    "blockAds": true,
    "renderJS": true,
    "stealth": false,
    "waitTime": 0,
    "extractionMode": "default",
    "test": { "id": "abc123" }
  },
  "data": "# Example Domain\n\nThis domain is for use in illustrative examples..."
}
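Before using `data`, it can help to check the envelope fields shown above. The following is a minimal sketch, not part of the API: `ScrapeError` and `unwrap` are hypothetical names, and the assumption that a failed call carries an `apiStatus` other than `"success"` is inferred from the success example only.

```python
from typing import Any, Dict


class ScrapeError(RuntimeError):
    """Raised when the response envelope does not report success."""


def unwrap(envelope: Dict[str, Any]) -> Any:
    """Return the `data` field, or raise if the call did not succeed."""
    status = envelope.get("apiStatus")
    code = envelope.get("apiCode")
    # Assumption: only "success" / 200 indicates a usable payload; the
    # shape of an error envelope is not documented in this section.
    if status != "success" or code != 200:
        raise ScrapeError(f"scrape failed: status={status!r} code={code!r}")
    return envelope["data"]
```

With the request above, this would be called as `content = unwrap(response.json())`.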

Output Formats

Choose one or more output formats. You can request up to 3 formats in a single call.

| Format | Description |
| --- | --- |
| `html` | Raw HTML |
| `markdown` | Clean Markdown |
| `json` | Structured JSON |
| `html-llm` | HTML stripped for LLM consumption |
| `markdown-llm` | Markdown stripped for LLM consumption |
| `text` | Plain text |
| `text-llm` | Plain text stripped for LLM consumption |

response = requests.post(
    "https://api.geekflare.com/webscraping",
    headers={"x-api-key": "YOUR_API_KEY"},
    json={
        "url": "https://example.com",
        "format": ["markdown", "html", "text"]
    }
)

Response

{
  "timestamp": 1778737930991,
  "apiStatus": "success",
  "apiCode": 200,
  "meta": {
    "url": "https://example.com",
    "format": ["markdown", "html", "text"],
    "test": { "id": "abc123" }
  },
  "data": {
    "markdown": "# Example Domain\n\nThis domain is for use in illustrative examples...",
    "html": "<!DOCTYPE html><html><head><title>Example Domain</title>...",
    "text": "Example Domain\n\nThis domain is for use in illustrative examples..."
  }
}
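In the two examples so far, `data` is a plain string when one format is requested and an object keyed by format name when several are. A small helper can hide that difference from calling code. This is a sketch based only on those two sample responses; `content_by_format` is a hypothetical name, and other shapes (for instance the list returned by schema extraction) are passed through unchanged under the single format key.

```python
from typing import Any, Dict


def content_by_format(envelope: Dict[str, Any]) -> Dict[str, Any]:
    """Return a {format: content} mapping regardless of how many formats were requested."""
    formats = envelope["meta"]["format"]
    data = envelope["data"]
    if len(formats) == 1:
        # Single format: `data` is the content itself (per the Basic Scrape sample).
        return {formats[0]: data}
    # Several formats: `data` is already keyed by format name.
    return {fmt: data[fmt] for fmt in formats}
```

Calling code can then always read `content_by_format(response.json())["markdown"]`, whether or not other formats were requested alongside it.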

File Output

Get a CDN URL instead of inline content. Useful for large pages or when you need to store the result.

response = requests.post(
    "https://api.geekflare.com/webscraping",
    headers={"x-api-key": "YOUR_API_KEY"},
    json={
        "url": "https://example.com",
        "format": ["markdown"],
        "fileOutput": True
    }
)

Response

{
  "timestamp": 1778737930991,
  "apiStatus": "success",
  "apiCode": 200,
  "meta": {
    "url": "https://example.com",
    "format": ["markdown"],
    "fileOutput": true,
    "test": { "id": "abc123" }
  },
  "data": "https://cdn.geekflare.com/tests/webscraping/ZuyhINuAZPQQabbN.md"
}
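To store the result locally, the returned URL can be fetched like any other file. The sketch below uses only the Python standard library; `cdn_filename` and `download` are hypothetical helpers, and it assumes the CDN URL can be fetched without the API key, which this section does not confirm.

```python
import shutil
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse
from urllib.request import urlopen


def cdn_filename(cdn_url: str) -> str:
    """Return the last path segment of the CDN URL, e.g. 'abc.md'."""
    return PurePosixPath(urlparse(cdn_url).path).name


def download(cdn_url: str, dest_dir: str = ".") -> Path:
    """Stream the file at `cdn_url` into `dest_dir` and return the local path."""
    # Assumption: the CDN serves the file to unauthenticated GET requests.
    target = Path(dest_dir) / cdn_filename(cdn_url)
    with urlopen(cdn_url) as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```

For the response above, `download(response.json()["data"])` would write `ZuyhINuAZPQQabbN.md` to the current directory.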

JavaScript Rendering

JavaScript rendering is enabled by default. Disable it for faster scrapes on static pages.

response = requests.post(
    "https://api.geekflare.com/webscraping",
    headers={"x-api-key": "YOUR_API_KEY"},
    json={
        "url": "https://example.com",
        "renderJS": False
    }
)

Stealth Mode

Bypass bot detection on protected pages. Slower but more reliable on heavily guarded sites.

response = requests.post(
    "https://api.geekflare.com/webscraping",
    headers={"x-api-key": "YOUR_API_KEY"},
    json={
        "url": "https://example.com",
        "stealth": True
    }
)

Wait Time

Add a delay, in seconds, after page load to capture lazy-loaded content or bypass bot checks.

response = requests.post(
    "https://api.geekflare.com/webscraping",
    headers={"x-api-key": "YOUR_API_KEY"},
    json={
        "url": "https://example.com",
        "waitTime": 2.5
    }
)

Proxy Routing

Route the request through a specific country’s IP address to bypass geo-blocks or scrape region-specific content.

response = requests.post(
    "https://api.geekflare.com/webscraping",
    headers={"x-api-key": "YOUR_API_KEY"},
    json={
        "url": "https://example.com",
        "proxyCountry": "gb"
    }
)

Device Emulation

Emulate a mobile device to scrape mobile-specific content.

response = requests.post(
    "https://api.geekflare.com/webscraping",
    headers={"x-api-key": "YOUR_API_KEY"},
    json={
        "url": "https://example.com",
        "device": "mobile"
    }
)

Structured Extraction — CSS Schema

Extract specific fields from a page using CSS selectors. Returns structured JSON.

response = requests.post(
    "https://api.geekflare.com/webscraping",
    headers={"x-api-key": "YOUR_API_KEY"},
    json={
        "url": "https://example.com/products",
        "format": ["json"],
        "extractionMode": "cssSchema",
        "extractionSchema": {
            "name": "Product Schema",
            "baseSelector": ".product",
            "fields": [
                {"name": "title", "selector": "h1.product-title", "type": "text"},
                {"name": "price", "selector": ".price", "type": "text"},
                {"name": "link", "selector": "a.product-link", "type": "attr", "attribute": "href"}
            ]
        }
    }
)

Response

{
  "timestamp": 1778737930991,
  "apiStatus": "success",
  "apiCode": 200,
  "meta": {
    "url": "https://example.com/products",
    "format": ["json"],
    "extractionMode": "cssSchema",
    "test": { "id": "abc123" }
  },
  "data": [
    {
      "title": "Running Shoe Pro",
      "price": "$129.99",
      "link": "/products/running-shoe-pro"
    },
    {
      "title": "Trail Runner X",
      "price": "$89.99",
      "link": "/products/trail-runner-x"
    }
  ]
}
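The `link` values in this sample are site-relative paths. If absolute URLs are needed downstream, they can be resolved against the scraped page URL with the standard library. This is a client-side sketch; `absolutize` is a hypothetical helper and not something the API provides.

```python
from typing import Any, Dict, List
from urllib.parse import urljoin


def absolutize(rows: List[Dict[str, Any]], page_url: str, key: str = "link") -> List[Dict[str, Any]]:
    """Return copies of `rows` with `row[key]` resolved against `page_url`."""
    resolved = []
    for row in rows:
        copy = dict(row)
        if copy.get(key):
            # urljoin leaves already-absolute URLs untouched.
            copy[key] = urljoin(page_url, copy[key])
        resolved.append(copy)
    return resolved
```

Typical use would be `absolutize(body["data"], body["meta"]["url"])`, where `body = response.json()`.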

Structured Extraction — XPath Schema

Use XPath expressions for more precise extraction.

response = requests.post(
    "https://api.geekflare.com/webscraping",
    headers={"x-api-key": "YOUR_API_KEY"},
    json={
        "url": "https://example.com/articles",
        "format": ["json"],
        "extractionMode": "xpathSchema",
        "extractionSchema": {
            "name": "Article Schema",
            "baseSelector": "//div[@class='article']",
            "fields": [
                {"name": "title", "selector": ".//h1/text()", "type": "text"},
                {"name": "author", "selector": ".//span[@class='author']/text()", "type": "text"}
            ]
        }
    }
)

Response

{
  "timestamp": 1778737930991,
  "apiStatus": "success",
  "apiCode": 200,
  "meta": {
    "url": "https://example.com/articles",
    "format": ["json"],
    "extractionMode": "xpathSchema",
    "test": { "id": "abc123" }
  },
  "data": [
    { "title": "How to Run Faster", "author": "Jane Smith" },
    { "title": "Best Trails in 2025", "author": "John Doe" }
  ]
}
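To illustrate how a base selector and relative field selectors fit together, the sketch below runs the same idea locally with Python's `xml.etree.ElementTree`. It is an illustration only: the sample markup is made up, ElementTree supports just a subset of XPath (no `text()` step, so `findtext` is used instead), and how the service evaluates selectors internally is an assumption.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional

# Hypothetical, well-formed sample markup mirroring the response above.
SAMPLE = """
<html><body>
  <div class="article"><h1>How to Run Faster</h1><span class="author">Jane Smith</span></div>
  <div class="article"><h1>Best Trails in 2025</h1><span class="author">John Doe</span></div>
</body></html>
""".strip()


def extract(markup: str, base: str, fields: Dict[str, str]) -> List[Dict[str, Optional[str]]]:
    """Select every node matching `base`, then evaluate each field path relative to it."""
    root = ET.fromstring(markup)
    return [
        {name: node.findtext(path) for name, path in fields.items()}
        for node in root.findall(base)
    ]


rows = extract(
    SAMPLE,
    ".//div[@class='article']",
    {"title": ".//h1", "author": ".//span[@class='author']"},
)
```

Each node matched by the base selector yields one row, which is why the field selectors in the request start with `.//` rather than `//`.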

Default Extraction — Static Fields

Inject static metadata fields alongside scraped content.

response = requests.post(
    "https://api.geekflare.com/webscraping",
    headers={"x-api-key": "YOUR_API_KEY"},
    json={
        "url": "https://example.com",
        "format": ["json"],
        "extractionMode": "default",
        "extractionSchema": {
            "name": "Quick Fields",
            "fields": [
                {"title": "Category", "value": "Electronics"},
                {"title": "Country", "value": "India"}
            ]
        }
    }
)

All Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `url` | string | required | Target URL |
| `device` | `desktop` \| `mobile` | `desktop` | Device to emulate |
| `format` | array | `["html-llm"]` | Output format(s). Up to 3. |
| `renderJS` | boolean | `true` | Execute JavaScript before extracting |
| `blockAds` | boolean | `true` | Block ads during scrape |
| `stealth` | boolean | `false` | Bypass bot detection |
| `waitTime` | number | `0` | Seconds to wait after page load |
| `fileOutput` | boolean | `false` | Return CDN URL instead of inline data |
| `proxyCountry` | string | — | Route through country ISO code (e.g. `us`, `gb`) |
| `extractionMode` | `default` \| `cssSchema` \| `xpathSchema` | `default` | Extraction mode (used when `format` includes `json`) |
| `extractionSchema` | object | — | Schema for structured extraction |
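The constraints in this table can be checked on the client before a request is sent. The following is a sketch under stated assumptions: `build_payload` is a hypothetical helper, the allowed values are copied from the tables in this document, and the checks for an empty `format` list or a negative `waitTime` are guesses at sensible limits rather than documented server behaviour.

```python
from typing import Any, Dict

FORMATS = {"html", "markdown", "json", "html-llm", "markdown-llm", "text", "text-llm"}
DEVICES = {"desktop", "mobile"}
MODES = {"default", "cssSchema", "xpathSchema"}
OPTIONS = {
    "device", "format", "renderJS", "blockAds", "stealth", "waitTime",
    "fileOutput", "proxyCountry", "extractionMode", "extractionSchema",
}


def build_payload(url: str, **options: Any) -> Dict[str, Any]:
    """Validate options against the parameter table and return the JSON body."""
    unknown = set(options) - OPTIONS
    if unknown:
        raise ValueError(f"unknown parameter(s): {sorted(unknown)}")
    formats = options.get("format", [])
    # Assumption: an explicit `format` must name between 1 and 3 formats.
    if "format" in options and not 1 <= len(formats) <= 3:
        raise ValueError("format must list between 1 and 3 entries")
    if set(formats) - FORMATS:
        raise ValueError(f"unsupported format(s): {sorted(set(formats) - FORMATS)}")
    if options.get("device", "desktop") not in DEVICES:
        raise ValueError("device must be 'desktop' or 'mobile'")
    if options.get("extractionMode", "default") not in MODES:
        raise ValueError("extractionMode must be default, cssSchema or xpathSchema")
    if options.get("waitTime", 0) < 0:
        raise ValueError("waitTime must not be negative")
    return {"url": url, **options}
```

The result can be passed straight to the request, for example `requests.post(..., json=build_payload("https://example.com", format=["markdown"], stealth=True))`. Omitted options are left out of the body so that the server-side defaults apply.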
