Architecture
Crawler chain & vision OCR
Every URL-based source (url, sitemap,
feed, auto) flows through the same multi-tier
crawler chain. Each tier tries a different way to get readable text out
of the page. The first tier whose extracted text clears the
200-character threshold wins; failures cascade to the next tier.
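The cascade described above can be sketched in a few lines. This is a minimal, language-neutral illustration in Python, not the project's actual code: `run_chain`, the `tiers` list of `(name, fetch)` pairs, and the `extract_text` callable are all hypothetical names introduced here. Only the 200-character threshold and the "first tier that clears it wins" rule come from the text.

```python
from typing import Callable

MIN_EXTRACTED_CHARS = 200  # threshold stated in the text


def run_chain(
    url: str,
    tiers: list[tuple[str, Callable[[str], str]]],
    extract_text: Callable[[str], str],
) -> tuple[str, str]:
    """Return (tier_name, html) for the first tier whose extracted text is long enough.

    `tiers` is ordered from highest priority to last resort. Both `tiers`
    and `extract_text` are stand-ins for the real tier clients and extractor.
    """
    for name, fetch in tiers:
        try:
            html = fetch(url)
        except Exception:
            continue  # a failed tier cascades to the next one
        if len(extract_text(html)) >= MIN_EXTRACTED_CHARS:
            return name, html  # first tier to clear the threshold wins
    raise RuntimeError(f"all tiers failed for {url}")
```

Passing the extractor in as a parameter keeps the loop independent of how text is actually pulled out of the HTML, which is the part the later sections refine.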
The chain (highest priority → last resort)
| # | Tier | What it does | When it wins |
|---|---|---|---|
| 1 | plain_http | Plain Guzzle GET with a Mozilla user-agent. Runs through UrlSafetyGuard so private IPs and cloud metadata services are blocked. | Server-rendered pages (Blade apps, classic blogs, marketing sites with SSR or static export). |
| 2 | cloudflare_browser_markdown | POST /browser-rendering/markdown. Cloudflare spins up headless Chrome, waits for JS to settle, and emits clean markdown. We wrap it in a minimal HTML envelope so the extractor and chunker stay unchanged. | JS-heavy SPAs whose final DOM is text-shaped (most React / Vue / Svelte sites). |
| 3 | cloudflare_browser | POST /browser-rendering/content. Same headless Chrome, but returns raw rendered HTML. Our ReadabilityExtractor takes over. | Atypical layouts where the markdown extractor's heuristics drop content but the regex extractor catches it (tables that don't map cleanly to GFM, deeply nested article structures). |
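| 4 | cloudflare_vision | Two-stage: (1) capture a screenshot of the rendered page via /screenshot; (2) pass the image to a Workers AI vision model to transcribe the visible text. The output is emitted as `<p>` elements so the chunker treats each paragraph as its own unit. | Canvas-rendered slide decks, all-image landing pages, embedded PDF viewers, or any site whose meaningful content lives only in pixels. |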
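The table says the plain_http tier runs through UrlSafetyGuard so that private IPs and cloud metadata services are blocked. The real guard's rules are not shown here, so the following is only a sketch of that kind of check, using Python's standard `ipaddress` module. The function names `is_blocked_ip` and `is_url_allowed`, and the injectable `resolve` parameter, are assumptions made for illustration.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_blocked_ip(address: str) -> bool:
    """True for private, loopback, link-local, and other non-public addresses."""
    try:
        ip = ipaddress.ip_address(address)
    except ValueError:
        return True  # unparsable addresses are treated as unsafe
    if ip.version == 6 and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped  # judge ::ffff:a.b.c.d by its IPv4 form
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local  # covers 169.254.169.254 metadata endpoints
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified
    )


def _resolve(host: str) -> list[str]:
    """Resolve a hostname to the set of addresses it currently points at."""
    infos = socket.getaddrinfo(host, None)
    return sorted({info[4][0] for info in infos})


def is_url_allowed(url: str, resolve=None) -> bool:
    """Allow only http(s) URLs whose host resolves exclusively to public IPs."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    try:
        addresses = (resolve or _resolve)(parts.hostname)
    except OSError:
        return False  # DNS failure: nothing safe to fetch
    return bool(addresses) and not any(is_blocked_ip(a) for a in addresses)
```

A check like this only validates the addresses seen at resolution time. A production guard would also need to pin the connection to the validated address and re-check redirects, which this sketch does not attempt.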
How a tier "wins"
The chain measures extracted text length, not raw
HTML length. Each tier's output is passed through
ReadabilityExtractor (with the same chrome-stripping
fallback used downstream). If the extracted text is at least 200
characters, that tier wins and its HTML is cached for 5 minutes
against the source URL. If not, the chain falls forward.
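The 5-minute caching of the winning HTML could look roughly like the sketch below. It is an in-memory stand-in, assuming nothing about the real cache backend; `WinnerCache` and its injectable `clock` are hypothetical names used only to make the expiry rule concrete and testable.

```python
import time

CACHE_TTL_SECONDS = 5 * 60  # "cached for 5 minutes" per the text


class WinnerCache:
    """Keeps the winning tier's HTML per source URL until the TTL elapses."""

    def __init__(self, ttl: float = CACHE_TTL_SECONDS, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: dict[str, tuple[float, str]] = {}

    def put(self, url: str, html: str) -> None:
        self._entries[url] = (self._clock() + self._ttl, html)

    def get(self, url: str):
        entry = self._entries.get(url)
        if entry is None:
            return None
        expires_at, html = entry
        if self._clock() >= expires_at:
            del self._entries[url]  # expired: the next crawl reruns the chain
            return None
        return html
```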
This used to be a raw-length check, which let JS-only Inertia
shells "win" with 50KB of empty <div> markup and
then silently fail extraction downstream. The extraction-aware
check fires the vision tier exactly when the extractor would have
failed anyway: no wasted retries, no silent zeros.
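To make the difference between the two checks concrete, here is a small sketch that compares them on a synthetic JS-only shell. The tag-stripping `extracted_text` helper is a deliberately crude stand-in for ReadabilityExtractor (whose real behavior is not shown here), and both `passes_*` functions are hypothetical names.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects visible text, ignoring script/style/noscript bodies."""

    SKIPPED = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def extracted_text(html: str) -> str:
    """Crude stand-in for the real extractor: strip tags, collapse whitespace."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def passes_raw_check(html: str, minimum: int = 200) -> bool:
    return len(html) >= minimum  # the old behavior: markup counts as content


def passes_extraction_check(html: str, minimum: int = 200) -> bool:
    return len(extracted_text(html)) >= minimum  # only readable text counts


# A JS-only shell: tens of kilobytes of markup, no readable text.
shell = (
    '<html><body><div id="app">'
    + '<div class="x"></div>' * 2000
    + '</div><script>var boot = "' + "z" * 500 + '";</script></body></html>'
)
```

With this input, the raw-length check accepts the shell while the extraction-aware check rejects it, so the chain would move on to the next tier instead of handing an empty page downstream.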
Cost & safeguards
- CLOUDFLARE_BROWSER_DAILY_LIMIT: shared counter for /content, /markdown, and /screenshot invocations. Default 0 (unlimited). Operators on a fixed Cloudflare bundle should set this to a per-day cap that matches their plan.
- CLOUDFLARE_VISION_DAILY_LIMIT: separate counter for Workers AI vision calls (these consume Neurons, billed separately from Browser Rendering). Default 0.
- CLOUDFLARE_VISION_MODEL: defaults to @cf/meta/llama-3.2-11b-vision-instruct. Override only if Cloudflare retires the model or you've negotiated a different one.
- Both daily-cap guards throw a recoverable exception when the cap is hit, so the chain falls forward to the next tier instead of failing the whole crawl. Vision being capped just means the page that needed OCR doesn't get indexed; every other URL still flows.
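A daily-cap guard with these semantics might be sketched as follows. The environment variable names and the "0 means unlimited" default come from the list above; everything else (`DailyCounter`, `DailyCapReached`, `limit_from_env`, and the in-memory counter) is an assumption for illustration, since the real storage and exception types are not described here.

```python
import datetime as dt
import os


class DailyCapReached(Exception):
    """Recoverable: the caller skips this tier and falls forward."""


def limit_from_env(name: str) -> int:
    """Read a cap such as CLOUDFLARE_VISION_DAILY_LIMIT; unset means 0 (unlimited)."""
    return int(os.environ.get(name, "0"))


class DailyCounter:
    """Counts invocations per calendar day and refuses once the cap is hit."""

    def __init__(self, limit: int, today=dt.date.today):
        self.limit = limit  # 0 disables the cap
        self._today = today  # injectable so day rollover can be tested
        self._day = today()
        self._used = 0

    def charge(self) -> None:
        day = self._today()
        if day != self._day:
            self._day, self._used = day, 0  # new day: reset the counter
        if self.limit and self._used >= self.limit:
            raise DailyCapReached(f"daily cap of {self.limit} reached")
        self._used += 1
```

Raising a dedicated exception type, rather than a generic failure, is what lets the chain treat a capped tier like any other failed tier and continue with the rest of the crawl.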
What happens when every tier fails
The chain throws a RuntimeException whose message lists
each tier and why it failed (HTTP code, exception message, or
"extraction produced N chars under threshold").
CrawlPageJob::failed() catches this, runs the message
through SourceErrorPresenter, and stamps a
customer-readable line onto the source's error column
(rendered in the admin's Sources dialog).
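As an illustration of the aggregated failure message, the sketch below collects one reason per tier and joins them into a single string. The exact wording and separators of the real exception message are not given in this document, so the format here is assumed; `TierFailure`, `short_extraction_reason`, and `build_failure_message` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class TierFailure:
    tier: str    # e.g. "plain_http"
    reason: str  # HTTP code, exception message, or extraction shortfall


def short_extraction_reason(char_count: int) -> str:
    """Reason text for a tier whose output extracted to too little text."""
    return f"extraction produced {char_count} chars under threshold"


def build_failure_message(url: str, failures: list[TierFailure]) -> str:
    """One line naming every tier and why it failed (format is assumed)."""
    details = "; ".join(f"{f.tier}: {f.reason}" for f in failures)
    return f"All crawler tiers failed for {url} ({details})"
```

Keeping the per-tier reasons structured until the last moment makes it straightforward for a presenter layer to pick the most useful one when it writes the customer-readable line.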
Browserless deprecation
The BrowserlessClient tier was removed from the default
chain in 2026-06 once the vision OCR tier landed. CF
/markdown + /content + vision now cover
the same ground free (no third-party billing line). The
BROWSERLESS_URL / BROWSERLESS_TOKEN env
vars stay readable for backwards compatibility, but they no longer wire
anything by default. A custom service provider can still bind
BrowserlessClient manually if a self-hosted Browserless
is preferred.