Web Crawler

Configure website crawling as a Outpost source

In meinGPT (UI)

For most teams, setup is done directly in meinGPT without editing local config files.

Open admin settings in meinGPT
Go to Data Pools / Data Sources
Click Add Source and choose this source type
Configure credentials and scope in the UI
Save and trigger the first sync

If you do not run your own Outpost runtime, this is usually all you need.

On-Prem Runtime Configuration (Advanced)

data_pools:
  - id: website-docs
    type: webcrawler
    url: "https://example.com"
    scraping_method: basic
    max_depth: 2
    max_pages: 500
    output_format: markdown
    only_main_content: true

Configuration Options

Field	Type	Default	Required	Description
`id`	string	-	✅	Unique identifier for the data pool
`type`	string	-	✅	Must be `webcrawler`
`url`	string	-	✅	Crawl start URL
`scraping_method`	string	`basic`	❌	`basic` or `browser` (JS rendering)
`max_depth`	integer	`3`	❌	Maximum crawl depth
`max_pages`	integer	`500`	❌	Maximum number of pages
`include_paths`	array	null	❌	URL path patterns to include
`exclude_paths`	array	null	❌	URL path patterns to exclude
`wait_for_selector`	string	null	❌	CSS selector wait condition for browser mode
`page_timeout`	integer	`30`	❌	Page timeout in seconds
`delay_between_requests`	number	`1.0`	❌	Delay between requests
`concurrent_requests`	integer	`5`	❌	Parallel crawl requests
`retry_attempts`	integer	`2`	❌	Retry attempts
`proxy_server`	string	null	❌	Proxy endpoint
`proxy_username`	string	null	❌	Proxy username
`proxy_password`	string	null	❌	Proxy password
`user_agent`	string	null	❌	Custom User-Agent
`headers`	object	null	❌	Custom headers
`output_format`	string	`markdown`	❌	`markdown`, `html`, or `text`
`only_main_content`	boolean	`true`	❌	Keep only main content blocks
`max_age_hours`	integer	`24`	❌	Re-crawl freshness window

Synchronization

Outpost crawls from url and stores extracted content per discovered page.
Subsequent runs are incremental and respect freshness settings (max_age_hours).
Use include/exclude path filters to keep crawl scope deterministic.

Setup

Start with a small scope (max_depth, max_pages)
Add include/exclude rules for relevant sections
Use browser mode only when pages require JavaScript rendering
Re-run sync and review indexed output

Achtung

On-prem only: this source page is relevant when you operate your own Outpost runtime and configure data_pools yourself.

Was this page helpful?