The Web Scraping endpoint scrapes any URL and returns clean page content in your preferred format. It handles JavaScript-heavy sites, blocks ads, rotates proxies, and can extract structured data using CSS or XPath selectors.

Endpoint: POST https://api.geekflare.com/webscraping

Basic Scrape

Scrape a URL and get back LLM-ready markdown.
import requests

response = requests.post(
    "https://api.geekflare.com/webscraping",
    headers={"x-api-key": "YOUR_API_KEY"},
    json={"url": "https://example.com"}
)
print(response.json())

{
  "timestamp": 1778737930991,
  "apiStatus": "success",
  "apiCode": 200,
  "meta": {
    "url": "https://example.com",
    "device": "desktop",
    "format": ["html-llm"],
    "fileOutput": false,
    "blockAds": true,
    "renderJS": true,
    "stealth": false,
    "waitTime": 0,
    "extractionMode": "default",
    "test": { "id": "abc123" }
  },
  "data": "# Example Domain\n\nThis domain is for use in illustrative examples..."
}
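
Before using `data`, it can help to check the envelope fields shown in the response above. The following is a minimal sketch, not part of the API: `ScrapeError` and `unwrap` are hypothetical names, and the code assumes that failed calls report `apiStatus` and `apiCode` in the same envelope, which the examples on this page do not confirm.

```python
class ScrapeError(Exception):
    """Raised when the envelope does not report success (hypothetical helper)."""


def unwrap(body):
    """Return the `data` field of a parsed response, or raise ScrapeError."""
    # Assumption: failures use the same apiStatus/apiCode fields as the success example.
    if body.get("apiStatus") != "success" or body.get("apiCode") != 200:
        raise ScrapeError(
            f"scrape failed: status={body.get('apiStatus')!r}, code={body.get('apiCode')!r}"
        )
    return body["data"]


# Trimmed copy of the sample response above.
sample = {
    "apiStatus": "success",
    "apiCode": 200,
    "meta": {"url": "https://example.com", "format": ["html-llm"]},
    "data": "# Example Domain\n\nThis domain is for use in illustrative examples...",
}
print(unwrap(sample))
```

With a live call, the same helper would be applied to `response.json()`.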

Output Formats

Choose one or more output formats. You can request up to 3 formats in a single call.
| Format | Description |
| --- | --- |
| html | Raw HTML |
| markdown | Clean Markdown |
| json | Structured JSON |
| html-llm | HTML stripped for LLM consumption |
| markdown-llm | Markdown stripped for LLM consumption |
| text | Plain text |
| text-llm | Plain text stripped for LLM consumption |

response = requests.post(
    "https://api.geekflare.com/webscraping",
    headers={"x-api-key": "YOUR_API_KEY"},
    json={
        "url": "https://example.com",
        "format": ["markdown", "html", "text"]
    }
)
{
  "timestamp": 1778737930991,
  "apiStatus": "success",
  "apiCode": 200,
  "meta": {
    "url": "https://example.com",
    "format": ["markdown", "html", "text"],
    "test": { "id": "abc123" }
  },
  "data": {
    "markdown": "# Example Domain\n\nThis domain is for use in illustrative examples...",
    "html": "<!DOCTYPE html><html><head><title>Example Domain</title>...",
    "text": "Example Domain\n\nThis domain is for use in illustrative examples..."
  }
}
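
In the examples on this page, `data` is a single value when one format is requested and an object keyed by format name when several are requested. A small normalizing helper can hide that difference from calling code. This is a sketch under that assumption only; `contents_by_format` is a hypothetical name.

```python
def contents_by_format(body):
    """Return a dict mapping each requested format name to its content."""
    data = body["data"]
    if isinstance(data, dict):
        # Multi-format response: already keyed by format name.
        return dict(data)
    # Single-format response: pair the lone value with the format echoed in meta.
    return {body["meta"]["format"][0]: data}


single = {"meta": {"format": ["html-llm"]}, "data": "# Example Domain"}
multi = {
    "meta": {"format": ["markdown", "html", "text"]},
    "data": {
        "markdown": "# Example Domain",
        "html": "<!DOCTYPE html><html>...</html>",
        "text": "Example Domain",
    },
}
print(sorted(contents_by_format(single)))
print(sorted(contents_by_format(multi)))
```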

File Output

Get a CDN URL instead of inline content. Useful for large pages or when you need to store the result.
response = requests.post(
    "https://api.geekflare.com/webscraping",
    headers={"x-api-key": "YOUR_API_KEY"},
    json={
        "url": "https://example.com",
        "format": ["markdown"],
        "fileOutput": True
    }
)
{
  "timestamp": 1778737930991,
  "apiStatus": "success",
  "apiCode": 200,
  "meta": {
    "url": "https://example.com",
    "format": ["markdown"],
    "fileOutput": true,
    "test": { "id": "abc123" }
  },
  "data": "https://cdn.geekflare.com/tests/webscraping/ZuyhINuAZPQQabbN.md"
}
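
When `fileOutput` is enabled, `data` holds a URL rather than the content, so the file still has to be fetched. The sketch below uses only the Python standard library; `cdn_filename` and `download` are hypothetical helpers, and it assumes the CDN URL can be fetched with a plain GET and no API key, which this page does not state.

```python
import os
import posixpath
import urllib.request
from urllib.parse import urlparse


def cdn_filename(cdn_url):
    """Extract the file name (including extension) from a CDN URL."""
    return posixpath.basename(urlparse(cdn_url).path)


def download(cdn_url, directory="."):
    """Save the file under its CDN name and return the local path."""
    # Assumption: the CDN URL is readable without authentication.
    target = os.path.join(directory, cdn_filename(cdn_url))
    with urllib.request.urlopen(cdn_url) as resp, open(target, "wb") as out:
        out.write(resp.read())
    return target


url = "https://cdn.geekflare.com/tests/webscraping/ZuyhINuAZPQQabbN.md"
print(cdn_filename(url))
```

`download` is defined but not called here, so the block runs without network access.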

JavaScript Rendering

JavaScript rendering is enabled by default. Disable it for faster scrapes on static pages.
response = requests.post(
    "https://api.geekflare.com/webscraping",
    headers={"x-api-key": "YOUR_API_KEY"},
    json={
        "url": "https://example.com",
        "renderJS": False
    }
)

Stealth Mode

Bypass bot detection on protected pages. Stealth mode is slower but more reliable on heavily guarded sites.
response = requests.post(
    "https://api.geekflare.com/webscraping",
    headers={"x-api-key": "YOUR_API_KEY"},
    json={
        "url": "https://example.com",
        "stealth": True
    }
)

Wait Time

Add a delay, in seconds, after page load to capture lazy-loaded content or bypass bot checks.
response = requests.post(
    "https://api.geekflare.com/webscraping",
    headers={"x-api-key": "YOUR_API_KEY"},
    json={
        "url": "https://example.com",
        "waitTime": 2.5
    }
)
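
Because `renderJS`, `stealth`, and `waitTime` each trade speed for reliability, one possible approach is to try the cheapest settings first and escalate only when a scrape comes back empty. The sketch below is one such design, not a documented feature: `escalation_payloads` and `first_success` are hypothetical names, `send` is any function that posts a payload and returns the parsed JSON body, and treating an empty `data` field as a failed attempt is an assumption.

```python
def escalation_payloads(url, wait_seconds=2.5):
    """Request bodies ordered from fastest to most robust."""
    return [
        {"url": url, "renderJS": False},
        {"url": url, "renderJS": True},
        {"url": url, "renderJS": True, "waitTime": wait_seconds},
        {"url": url, "renderJS": True, "stealth": True, "waitTime": wait_seconds},
    ]


def first_success(payloads, send):
    """Try each payload in order; return the first (payload, body) with content."""
    for payload in payloads:
        body = send(payload)
        # Assumption: a usable result reports success and has non-empty data.
        if body.get("apiStatus") == "success" and body.get("data"):
            return payload, body
    return None, None


for step in escalation_payloads("https://example.com"):
    print(step)
```

In real use, `send` could wrap the `requests.post(...)` call shown in the examples on this page and return `response.json()`.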

Proxy Routing

Route the request through a specific country’s IP address to bypass geo-blocks or scrape region-specific content.
response = requests.post(
    "https://api.geekflare.com/webscraping",
    headers={"x-api-key": "YOUR_API_KEY"},
    json={
        "url": "https://example.com",
        "proxyCountry": "gb"
    }
)
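
To compare region-specific versions of a page, the same URL can be requested once per country. The sketch below only builds the request bodies. `normalize_country` and `payloads_by_country` are hypothetical helpers, and the lowercase two-letter form is an assumption based on the `us` and `gb` examples on this page.

```python
def normalize_country(code):
    """Lowercase and validate a two-letter country code."""
    code = code.strip().lower()
    # Assumption: the API expects lowercase two-letter codes such as "us" or "gb".
    if len(code) != 2 or not (code.isascii() and code.isalpha()):
        raise ValueError(f"not a two-letter country code: {code!r}")
    return code


def payloads_by_country(url, countries):
    """Map each normalized country code to a request body routed through it."""
    result = {}
    for raw in countries:
        code = normalize_country(raw)
        result[code] = {"url": url, "proxyCountry": code}
    return result


print(payloads_by_country("https://example.com", ["GB", "us"]))
```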

Device Emulation

Emulate a mobile device to scrape mobile-specific content.
response = requests.post(
    "https://api.geekflare.com/webscraping",
    headers={"x-api-key": "YOUR_API_KEY"},
    json={
        "url": "https://example.com",
        "device": "mobile"
    }
)

Structured Extraction — CSS Schema

Extract specific fields from a page using CSS selectors. Returns structured JSON.
response = requests.post(
    "https://api.geekflare.com/webscraping",
    headers={"x-api-key": "YOUR_API_KEY"},
    json={
        "url": "https://example.com/products",
        "format": ["json"],
        "extractionMode": "cssSchema",
        "extractionSchema": {
            "name": "Product Schema",
            "baseSelector": ".product",
            "fields": [
                {"name": "title", "selector": "h1.product-title", "type": "text"},
                {"name": "price", "selector": ".price", "type": "text"},
                {"name": "link", "selector": "a.product-link", "type": "attr", "attribute": "href"}
            ]
        }
    }
)
{
  "timestamp": 1778737930991,
  "apiStatus": "success",
  "apiCode": 200,
  "meta": {
    "url": "https://example.com/products",
    "format": ["json"],
    "extractionMode": "cssSchema",
    "test": { "id": "abc123" }
  },
  "data": [
    {
      "title": "Running Shoe Pro",
      "price": "$129.99",
      "link": "/products/running-shoe-pro"
    },
    {
      "title": "Trail Runner X",
      "price": "$89.99",
      "link": "/products/trail-runner-x"
    }
  ]
}
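
The `link` values in the sample response are relative paths, so they usually need to be resolved against the scraped page before they can be followed. A minimal sketch using the standard library's `urljoin` and the `meta.url` field from the response; `absolutize_links` is a hypothetical helper name.

```python
from urllib.parse import urljoin


def absolutize_links(body, field="link"):
    """Return the extracted rows with `field` resolved against meta.url."""
    base = body["meta"]["url"]
    rows = []
    for row in body["data"]:
        fixed = dict(row)  # copy so the original response is left untouched
        if fixed.get(field):
            fixed[field] = urljoin(base, fixed[field])
        rows.append(fixed)
    return rows


# Trimmed copy of the sample response above.
body = {
    "meta": {"url": "https://example.com/products"},
    "data": [
        {"title": "Running Shoe Pro", "price": "$129.99", "link": "/products/running-shoe-pro"},
        {"title": "Trail Runner X", "price": "$89.99", "link": "/products/trail-runner-x"},
    ],
}
for row in absolutize_links(body):
    print(row["link"])
```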

Structured Extraction — XPath Schema

Use XPath expressions for more precise extraction.
response = requests.post(
    "https://api.geekflare.com/webscraping",
    headers={"x-api-key": "YOUR_API_KEY"},
    json={
        "url": "https://example.com/articles",
        "format": ["json"],
        "extractionMode": "xpathSchema",
        "extractionSchema": {
            "name": "Article Schema",
            "baseSelector": "//div[@class='article']",
            "fields": [
                {"name": "title", "selector": ".//h1/text()", "type": "text"},
                {"name": "author", "selector": ".//span[@class='author']/text()", "type": "text"}
            ]
        }
    }
)
{
  "timestamp": 1778737930991,
  "apiStatus": "success",
  "apiCode": 200,
  "meta": {
    "url": "https://example.com/articles",
    "format": ["json"],
    "extractionMode": "xpathSchema",
    "test": { "id": "abc123" }
  },
  "data": [
    { "title": "How to Run Faster", "author": "Jane Smith" },
    { "title": "Best Trails in 2025", "author": "John Doe" }
  ]
}
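
To see how `baseSelector` scopes the relative field selectors, the same idea can be reproduced locally. The sketch below is only an analogy using Python's `xml.etree.ElementTree`, not the service's extraction engine. ElementTree supports a subset of XPath, so the base selector is written as `.//div[...]` and the trailing `/text()` step is replaced by reading the element text. The sample markup is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample markup (ElementTree parses XML, not arbitrary HTML).
SAMPLE_PAGE = """
<html><body>
  <div class="article"><h1>How to Run Faster</h1><span class="author">Jane Smith</span></div>
  <div class="article"><h1>Best Trails in 2025</h1><span class="author">John Doe</span></div>
  <div class="sidebar"><h1>Ignored</h1></div>
</body></html>
"""


def extract_articles(markup):
    """Select each base node, then evaluate field paths relative to it."""
    root = ET.fromstring(markup.strip())
    rows = []
    for node in root.findall(".//div[@class='article']"):  # plays the role of baseSelector
        rows.append({
            "title": node.findtext(".//h1"),                    # relative to the base node
            "author": node.findtext(".//span[@class='author']"),
        })
    return rows


print(extract_articles(SAMPLE_PAGE))
```

Each base node yields one row, and the sidebar heading is skipped because it sits outside every node matched by the base selector.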

Default Extraction — Static Fields

Inject static metadata fields alongside scraped content.
response = requests.post(
    "https://api.geekflare.com/webscraping",
    headers={"x-api-key": "YOUR_API_KEY"},
    json={
        "url": "https://example.com",
        "format": ["json"],
        "extractionMode": "default",
        "extractionSchema": {
            "name": "Quick Fields",
            "fields": [
                {"title": "Category", "value": "Electronics"},
                {"title": "Country", "value": "India"}
            ]
        }
    }
)
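
If the static values already live in a dictionary, the `extractionSchema` object shown above can be generated instead of written by hand. A small sketch; `static_fields` is a hypothetical helper that mirrors the `title`/`value` shape of the request above.

```python
def static_fields(name, values):
    """Build a default-mode extractionSchema from a dict of static values."""
    return {
        "name": name,
        "fields": [{"title": key, "value": value} for key, value in values.items()],
    }


schema = static_fields("Quick Fields", {"Category": "Electronics", "Country": "India"})
print(schema)
```

The result can be passed as the `extractionSchema` value in the request body.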

All Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| url | string | required | Target URL |
| device | desktop or mobile | desktop | Device to emulate |
| format | array | ["html-llm"] | Output format(s). Up to 3. |
| renderJS | boolean | true | Execute JavaScript before extracting |
| blockAds | boolean | true | Block ads during scrape |
| stealth | boolean | false | Bypass bot detection |
| waitTime | number | 0 | Seconds to wait after page load |
| fileOutput | boolean | false | Return CDN URL instead of inline data |
| proxyCountry | string | | Route through country ISO code (e.g. us, gb) |
| extractionMode | default, cssSchema, or xpathSchema | default | Extraction mode (used when format includes json) |
| extractionSchema | object | | Schema for structured extraction |
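
The constraints listed above can be checked on the client before a request is sent. The sketch below is a hypothetical wrapper, not an official client: `build_payload` and `build_request` are invented names, the allowed values are copied from this page, and the request is built with the standard library's `urllib.request` so the block runs without sending anything.

```python
import json
import urllib.request

ENDPOINT = "https://api.geekflare.com/webscraping"
FORMATS = {"html", "markdown", "json", "html-llm", "markdown-llm", "text", "text-llm"}
DEVICES = {"desktop", "mobile"}
EXTRACTION_MODES = {"default", "cssSchema", "xpathSchema"}


def build_payload(url, **options):
    """Validate options against the documented values and return the JSON body."""
    formats = options.get("format", ["html-llm"])
    if not 1 <= len(formats) <= 3:
        raise ValueError("format must list between 1 and 3 entries")
    unknown = set(formats) - FORMATS
    if unknown:
        raise ValueError(f"unknown format(s): {sorted(unknown)}")
    if options.get("device", "desktop") not in DEVICES:
        raise ValueError("device must be 'desktop' or 'mobile'")
    if options.get("extractionMode", "default") not in EXTRACTION_MODES:
        raise ValueError("unknown extractionMode")
    return {"url": url, **options}


def build_request(payload, api_key):
    """Create (but do not send) the POST request for a validated payload."""
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


payload = build_payload("https://example.com", format=["markdown", "text"], device="mobile")
request = build_request(payload, "YOUR_API_KEY")
print(request.get_method(), request.full_url)
```

Passing `request` to `urllib.request.urlopen` would send it. The `requests.post(...)` calls used elsewhere on this page accept the same payload through the `json` argument.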