Web Crawler

Configure website crawling as a DataVault source

In meinGPT (UI)

For most teams, setup is done directly in meinGPT without editing local config files.

  1. Open admin settings in meinGPT
  2. Go to Data Pools / Data Sources
  3. Click Add Source and choose the Web Crawler source type
  4. Configure credentials and scope in the UI
  5. Save and trigger the first sync

If you do not run your own DataVault runtime, this is usually all you need.

On-Prem Runtime Configuration (Advanced)

data_pools:
  - id: website-docs
    type: webcrawler
    url: "https://example.com"
    scraping_method: basic
    max_depth: 2
    max_pages: 500
    output_format: markdown
    only_main_content: true

Configuration Options

| Field | Type | Default | Required | Description |
|---|---|---|---|---|
| id | string | - | Yes | Unique identifier for the data pool |
| type | string | - | Yes | Must be webcrawler |
| url | string | - | Yes | Crawl start URL |
| scraping_method | string | basic | No | basic or browser (JS rendering) |
| max_depth | integer | 3 | No | Maximum crawl depth |
| max_pages | integer | 500 | No | Maximum number of pages |
| include_paths | array | null | No | URL path patterns to include |
| exclude_paths | array | null | No | URL path patterns to exclude |
| wait_for_selector | string | null | No | CSS selector wait condition for browser mode |
| page_timeout | integer | 30 | No | Page timeout in seconds |
| delay_between_requests | number | 1.0 | No | Delay between requests |
| concurrent_requests | integer | 5 | No | Parallel crawl requests |
| retry_attempts | integer | 2 | No | Retry attempts |
| proxy_server | string | null | No | Proxy endpoint |
| proxy_username | string | null | No | Proxy username |
| proxy_password | string | null | No | Proxy password |
| user_agent | string | null | No | Custom User-Agent |
| headers | object | null | No | Custom headers |
| output_format | string | markdown | No | markdown, html, or text |
| only_main_content | boolean | true | No | Keep only main content blocks |
| max_age_hours | integer | 24 | No | Re-crawl freshness window |
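
As an illustration of how the browser-related and request-tuning options might be combined, the sketch below configures a JavaScript-heavy site. The pool id, URL, selector, User-Agent string, and header values are hypothetical placeholders, and the numeric values are arbitrary starting points rather than recommendations.

```yaml
data_pools:
  - id: website-app-docs              # hypothetical pool id
    type: webcrawler
    url: "https://example.com"        # placeholder start URL
    scraping_method: browser          # render pages with JavaScript
    wait_for_selector: "main"         # hypothetical CSS selector to wait for
    page_timeout: 60                  # seconds; default is 30
    delay_between_requests: 2.0       # default is 1.0
    concurrent_requests: 2            # default is 5
    retry_attempts: 3                 # default is 2
    user_agent: "ExampleDocsCrawler"  # hypothetical User-Agent string
    headers:                          # hypothetical custom header
      Accept-Language: "en"
```

Lowering concurrent_requests and raising delay_between_requests trades crawl speed for a lighter load on the target site, which tends to matter more in browser mode.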

Synchronization

  • DataVault starts crawling at url and stores the extracted content of each discovered page.
  • Subsequent runs are incremental and respect freshness settings (max_age_hours).
  • Use include/exclude path filters to keep crawl scope deterministic.
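
The scope and freshness settings mentioned above could look roughly like the following. This page does not specify the pattern syntax for include_paths and exclude_paths, so the values shown are assumptions written in a glob-like style; check the syntax your runtime actually accepts before relying on them.

```yaml
data_pools:
  - id: website-docs
    type: webcrawler
    url: "https://example.com"
    include_paths:            # assumed pattern style, hypothetical paths
      - "/docs/*"
      - "/guides/*"
    exclude_paths:            # assumed pattern style, hypothetical paths
      - "/docs/archive/*"
    max_age_hours: 12         # re-crawl freshness window; default is 24
```

Keeping the include list narrow makes it easier to predict which pages each sync will touch.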

Setup

  1. Start with a small scope (max_depth, max_pages)
  2. Add include/exclude rules for relevant sections
  3. Use browser mode only when pages require JavaScript rendering
  4. Re-run sync and review indexed output
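
A first test run along the lines of these steps might start from a deliberately small configuration such as the one below. The pool id and the specific limits are illustrative assumptions; raise them once the indexed output looks right.

```yaml
data_pools:
  - id: website-docs-trial    # hypothetical pool id for a test run
    type: webcrawler
    url: "https://example.com"
    scraping_method: basic    # switch to browser only if pages need JS rendering
    max_depth: 1              # small scope for the first sync
    max_pages: 50             # small scope for the first sync
    output_format: markdown
    only_main_content: true
```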

Caution

On-prem only: this source page is relevant when you operate your own DataVault runtime and configure data_pools yourself.
