Architecture

Crawler chain & vision OCR

Every URL-based source (url, sitemap, feed, auto) flows through the same multi-tier crawler chain. Each tier tries a different way to get readable text out of the page. The first tier whose extracted text clears the 200-character threshold wins; failures cascade to the next tier.
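
The cascade described above can be sketched in a few lines. This is an illustrative Python sketch, not the project's code: `run_chain`, the `Tier` alias, and the `extract` callable are hypothetical names, and only the tier names and the 200-character threshold come from this document.

```python
from typing import Callable, Optional

MIN_TEXT_CHARS = 200  # threshold stated in the text

# Hypothetical shape: (tier name, function that fetches HTML for a URL or raises).
Tier = tuple[str, Callable[[str], str]]


def run_chain(
    url: str,
    tiers: list[Tier],
    extract: Callable[[str], str],
) -> Optional[tuple[str, str]]:
    """Return (tier_name, html) for the first tier whose extracted text is long enough."""
    for name, fetch in tiers:
        try:
            html = fetch(url)
        except Exception:
            continue  # a failed tier cascades to the next one
        if len(extract(html).strip()) >= MIN_TEXT_CHARS:
            return name, html
    return None  # every tier failed or came up short
```

Tiers are passed in priority order, so the cheapest fetch that yields enough readable text is the one that wins.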

The chain (highest priority → last resort)

| # | Tier | What it does | When it wins |
|---|------|--------------|--------------|
| 1 | `plain_http` | Plain Guzzle `GET` with a Mozilla user-agent. Runs through `UrlSafetyGuard` so private IPs and cloud metadata services are blocked. | Server-rendered pages (Blade apps, classic blogs, marketing sites with SSR or static export). |
| 2 | `cloudflare_browser_markdown` | `POST /browser-rendering/markdown`. Cloudflare spins up headless Chrome, waits for JS to settle, and emits clean markdown. We wrap it in a minimal HTML envelope so the extractor and chunker stay unchanged. | JS-heavy SPAs whose final DOM is text-shaped (most React / Vue / Svelte sites). |
| 3 | `cloudflare_browser` | `POST /browser-rendering/content`. Same headless Chrome, but returns raw rendered HTML. Our `ReadabilityExtractor` takes over. | Atypical layouts where the markdown extractor's heuristics drop content but the regex extractor catches it (tables that don't map cleanly to GFM, deep nested article structures). |
| 4 | `cloudflare_vision` | Two-stage: (1) a full-page screenshot via `/browser-rendering/screenshot` (PNG); (2) a Workers AI multimodal call to `@cf/meta/llama-3.2-11b-vision-instruct` with an OCR-only prompt, which returns the extracted text. The text is wrapped in `<p>` elements so the chunker treats each paragraph as its own unit. | Canvas-rendered slide decks, all-image landing pages, embedded PDF viewers, or any site whose meaningful content lives only in pixels. |
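
The last step of the vision tier, wrapping OCR output in `<p>` elements, might look roughly like the following Python sketch. It assumes paragraphs in the OCR text are separated by blank lines and that the text should be HTML-escaped before wrapping; both are assumptions, and `wrap_paragraphs` is a hypothetical name.

```python
import html
import re


def wrap_paragraphs(ocr_text: str) -> str:
    """Wrap each blank-line-separated paragraph of OCR output in a <p> element."""
    # Assumption: a blank line marks a paragraph boundary in the OCR output.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", ocr_text) if p.strip()]
    # Escape the text so stray "<" or "&" in OCR output cannot break the markup.
    return "\n".join(f"<p>{html.escape(p)}</p>" for p in paragraphs)
```

Because each paragraph becomes its own element, a downstream chunker that splits on block elements sees one unit per paragraph.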

How a tier "wins"

The chain measures extracted text length, not raw HTML length. Each tier's output is passed through ReadabilityExtractor (with the same chrome-stripping fallback used downstream). If the extracted text is at least 200 characters, that tier wins and its HTML is cached for 5 minutes against the source URL. If not, the chain falls forward.

This used to be a raw-length check, which let JS-only Inertia shells "win" with 50 KB of empty `<div>` markup and then silently fail extraction downstream. The extraction-aware check falls forward, ultimately to the vision tier, exactly when the extractor would have failed anyway: no wasted retries, no silent zeros.
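
To make the difference concrete, here is a small Python sketch comparing raw length with extracted-text length. The `_TextOnly` parser is a deliberately crude stand-in for `ReadabilityExtractor` (an assumption, not its real behavior), and `tier_wins` is a hypothetical name.

```python
from html.parser import HTMLParser

MIN_TEXT_CHARS = 200  # threshold stated in the text


class _TextOnly(HTMLParser):
    """Collects visible text, skipping <script> and <style> bodies."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def extracted_text(html_doc: str) -> str:
    """Return whitespace-normalized visible text from an HTML document."""
    parser = _TextOnly()
    parser.feed(html_doc)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def tier_wins(html_doc: str) -> bool:
    """Extraction-aware check: measure readable text, not markup size."""
    return len(extracted_text(html_doc)) >= MIN_TEXT_CHARS
```

A shell made of thousands of empty `<div>` elements plus an inline script passes any raw-length check yet yields no extracted text, so `tier_wins` rejects it and the chain moves on.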

Cost & safeguards

- `CLOUDFLARE_BROWSER_DAILY_LIMIT`: shared counter for `/content`, `/markdown`, and `/screenshot` invocations. Default 0 (unlimited). Operators on a fixed Cloudflare bundle should set this to a per-day cap that matches their plan.
- `CLOUDFLARE_VISION_DAILY_LIMIT`: separate counter for Workers AI vision calls (these consume Neurons, billed separately from Browser Rendering). Default 0.
- `CLOUDFLARE_VISION_MODEL`: defaults to `@cf/meta/llama-3.2-11b-vision-instruct`. Override only if Cloudflare retires the model or you've negotiated a different one.

Both daily-cap guards throw a recoverable exception when the cap is hit, so the chain falls forward to the next tier instead of failing the whole crawl. Vision being capped just means the page that needed OCR doesn't get indexed; every other URL still flows.
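
A daily-cap guard of this kind could be sketched as below. This is an in-memory Python illustration only: `DailyCounter` and `DailyCapReached` are hypothetical names, and a real deployment would presumably keep the count in a shared store rather than in process memory.

```python
import datetime as dt
from typing import Optional


class DailyCapReached(Exception):
    """Recoverable: the chain can treat it like any other tier failure."""


class DailyCounter:
    def __init__(self, limit: int) -> None:
        self.limit = limit  # 0 disables the cap, as the text states for the browser limit
        self._day: Optional[dt.date] = None
        self._count = 0

    def hit(self, today: Optional[dt.date] = None) -> None:
        """Record one invocation, or raise if today's cap is already used up."""
        today = today or dt.date.today()
        if today != self._day:  # first call of a new day resets the count
            self._day, self._count = today, 0
        if self.limit and self._count >= self.limit:
            raise DailyCapReached(f"daily limit of {self.limit} reached")
        self._count += 1
```

Calling `hit()` before each remote request means a capped tier fails fast with an exception the chain can catch, instead of aborting the crawl.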

What happens when every tier fails

The chain throws a `RuntimeException` whose message lists each tier and why it failed (HTTP code, exception message, or "extraction produced N chars under threshold"). `CrawlPageJob::failed()` receives it, runs the message through `SourceErrorPresenter`, and stamps a customer-readable line onto the source's error column (rendered in the admin's Sources dialog).
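
As a rough illustration of the aggregated message, the Python sketch below joins per-tier failure reasons into one string. The overall message layout and both function names are hypothetical; only the "extraction produced N chars under threshold" wording is taken from the text.

```python
def short_text_reason(char_count: int) -> str:
    """Reason string for a tier whose extracted text fell under the threshold."""
    return f"extraction produced {char_count} chars under threshold"


def chain_failure_message(url: str, failures: list[tuple[str, str]]) -> str:
    """Build one message listing every tier and why it failed, in chain order."""
    details = "; ".join(f"{tier}: {reason}" for tier, reason in failures)
    return f"All crawler tiers failed for {url}: {details}"
```

Keeping every tier's reason in a single message gives the presenter layer enough context to pick the most useful customer-facing line.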

The `www` → apex certificate fallback

A very common site misconfiguration is a TLS certificate that covers the apex domain (example.com) but not its www host. Fetching https://www.example.com/… then fails every tier with cURL error 60 (SSL: no alternative certificate subject name matches target host name 'www.example.com') — and because that error is classed as a permanent failure, the page would never re-index. Left unhandled, the assistant answers from stale or missing content (a live case: a pricing page's "3,000 conversations / month" going unindexed).

So before giving up, the chain retries the apex host once: it strips the leading www. (preserving scheme, port, path and query) and re-runs the whole chain. The retry is gated strictly on the cert-mismatch signature — a genuinely broken www-only host (500, DNS failure, 404) is not handed a pointless second fetch — and it is self-terminating, since the apex URL has no www. prefix to fall back from. The recovery is logged as crawler.chain.www_apex_fallback.
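
The URL rewrite and its gating can be sketched in Python as follows. Matching on a substring of the cURL error text is an assumption about how the cert-mismatch signature is detected, and `apex_fallback_url` is a hypothetical name.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit

# Assumption: the cert-mismatch signature is recognized by this fragment of the cURL message.
CERT_MISMATCH_MARKER = "no alternative certificate subject name matches"


def apex_fallback_url(url: str, error_message: str) -> Optional[str]:
    """Return the apex-host URL to retry, or None when no retry is warranted."""
    if CERT_MISMATCH_MARKER not in error_message:
        return None  # 500s, DNS failures, 404s and the like get no second fetch
    parts = urlsplit(url)
    host = parts.hostname or ""
    if not host.startswith("www."):
        return None  # already the apex host, so the retry cannot loop
    apex = host[len("www."):]
    # Rebuild the authority with the original port; any userinfo is dropped in this sketch.
    netloc = apex if parts.port is None else f"{apex}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```

Since the returned URL never starts with `www.`, feeding it back through the same function always yields `None`, which is what makes the retry self-terminating.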

Browserless deprecation

The `BrowserlessClient` tier was removed from the default chain in June 2026, once the vision OCR tier landed. Cloudflare `/markdown` + `/content` + vision now cover the same ground with no third-party billing line. The `BROWSERLESS_URL` / `BROWSERLESS_TOKEN` env vars stay readable for backwards compatibility, but they no longer wire anything by default. A custom service provider can still bind `BrowserlessClient` manually if a self-hosted Browserless is preferred.