Geekflare Web Scraping API supports 6 output formats via the format parameter. Whether you are archiving full web pages, extracting structured data, or feeding Retrieval-Augmented Generation (RAG) pipelines, you can request the exact data structure you need.

The -llm Optimized Formats

For AI developers, we offer specialized -llm formats. These formats run our semantic cleaning engine before returning the response. They automatically strip out boilerplate, navigation bars, footers, sidebars, advertisements, and hidden elements, returning only the primary content of the page. Using -llm formats dramatically reduces noise, improves LLM inference accuracy, and saves massive amounts of tokens.
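As a rough illustration of the kind of cleanup involved (a simplified sketch, not the actual semantic cleaning engine, which does far more than tag removal), stripping boilerplate containers from raw HTML might look like this; the tag list and function name here are illustrative assumptions:

```python
import re

# Illustrative only: NOT the Geekflare cleaning engine, which is
# semantic and also handles ads, sidebars, and hidden elements.
BOILERPLATE_TAGS = ("script", "style", "nav", "footer", "aside")

def strip_boilerplate(html: str) -> str:
    """Remove common boilerplate containers and their contents."""
    for tag in BOILERPLATE_TAGS:
        # (?is): case-insensitive, dot matches newlines; non-greedy
        # so each tag pair is removed individually.
        html = re.sub(rf"(?is)<{tag}\b.*?</{tag}>", "", html)
    return html

page = "<nav>Menu</nav><article><p>Main content.</p></article><footer>Legal</footer>"
clean = strip_boilerplate(page)
# clean keeps only the <article> element
```

Everything outside the primary content disappears, which is where the token savings in the table below come from.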

Format Reference

You can pass one of the following string values into the format parameter of your API request.
| Format | Description | Best For | Avg. Token Savings* |
| --- | --- | --- | --- |
| markdown-llm | (Recommended for AI) The primary content of the page, converted to clean Markdown. | RAG pipelines, AI agents, LLMs. | ~75% vs raw HTML |
| text-llm | The primary content of the page as raw text. Strips all HTML tags, Markdown formatting, and structural data. | Vector embeddings, traditional NLP, maximum token efficiency. | ~85% vs raw HTML |
| html-llm | The primary content of the page in HTML format. Strips out all `<script>`, `<style>`, `<nav>`, and `<footer>` tags. | AI applications requiring DOM structure, semantic HTML parsing. | ~60% vs raw HTML |
| markdown | The entire rendered page converted to Markdown, including navigation links, sidebars, and footer text. | Full-page archiving, layout analysis. | ~60% vs raw HTML |
| text | The entire rendered page as raw text. Contains no HTML tags, but includes all boilerplate text. | Keyword density checks, regex matching. | ~70% vs raw HTML |
| html | The raw, unmodified HTML DOM of the page as it appears in the browser. | Traditional web scraping, republishing. | - |
| json | Structured JSON output alongside the DOM. | Data analysts, structured databases. | - |

*Token savings are estimates based on average news articles and blog posts, compared to processing the raw HTML DOM.

Example Usage

To request an LLM-optimized Markdown response, set the format parameter in your JSON payload:
POST https://api.geekflare.com/webscraping

{
  "url": "https://example.com/long-article",
  "format": "markdown-llm"
}
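The same request can be made from Python using only the standard library. This is a minimal sketch: the endpoint and payload come from the docs above, but the `x-api-key` header name and the function names are assumptions; check your API dashboard for the actual authentication scheme:

```python
import json
import urllib.request

API_ENDPOINT = "https://api.geekflare.com/webscraping"

def build_payload(url: str, fmt: str = "markdown-llm") -> dict:
    """Build the JSON body for a scrape request."""
    return {"url": url, "format": fmt}

def scrape(url: str, api_key: str, fmt: str = "markdown-llm") -> str:
    """POST a scrape request and return the response body as text."""
    req = urllib.request.Request(
        API_ENDPOINT,
        data=json.dumps(build_payload(url, fmt)).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "x-api-key": api_key,  # header name is an assumption
        },
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read().decode("utf-8")
```

Swap the `fmt` argument for any value from the format table to change the output shape.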

Choosing the Right Format for AI

If you are building an AI application, choosing between the -llm formats depends on your specific pipeline:
  1. Use markdown-llm when you need to chunk data for a Vector Database. Markdown preserves ## Headings, which chunking algorithms use to keep contextual ideas together. It also preserves data tables.
  2. Use text-llm when you are doing massive batch processing and need the absolute lowest token count, or when generating vector embeddings, where structural tags add no value.
  3. Use html-llm when you are passing data to an LLM that has been specifically fine-tuned to read DOM structures and CSS classes.
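To illustrate point 1 above, here is a minimal heading-based chunker for markdown-llm output. It splits on ## headings so each chunk keeps one topical section together; production chunkers would also enforce size limits and overlap, and the sample article text is invented:

```python
import re

def chunk_markdown(md: str) -> list[str]:
    """Split markdown on level-2 headings, keeping each heading
    with the body text that follows it."""
    # (?m)^ anchors at line starts; the lookahead splits *before*
    # each '## ' heading without consuming it.
    parts = re.split(r"(?m)^(?=## )", md)
    return [p.strip() for p in parts if p.strip()]

article = "Intro text.\n\n## Setup\nSteps here.\n\n## Usage\nCall the API."
chunks = chunk_markdown(article)
# Three chunks: the intro, the Setup section, and the Usage section
```

Because the heading travels with its body, each embedded chunk carries its own context, which is exactly what markdown-llm preserves and text-llm discards.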