Blengi docs

Architecture

Crawler chain & vision OCR

Every URL-based source (url, sitemap, feed, auto) flows through the same multi-tier crawler chain. Each tier tries a different way to get readable text out of the page. The first tier whose extracted text clears the 200-character threshold wins; failures cascade to the next tier.

The chain (highest priority → last resort)

| # | Tier | What it does | When it wins |
|---|------|--------------|--------------|
| 1 | plain_http | Plain Guzzle GET with a Mozilla user-agent. Runs through UrlSafetyGuard so private IPs and cloud metadata services are blocked. | Server-rendered pages (Blade apps, classic blogs, marketing sites with SSR or static export). |
| 2 | cloudflare_browser_markdown | POST /browser-rendering/markdown. Cloudflare spins up headless Chrome, waits for JS to settle, and emits clean markdown. We wrap it in a minimal HTML envelope so the extractor and chunker stay unchanged. | JS-heavy SPAs whose final DOM is text-shaped (most React / Vue / Svelte sites). |
| 3 | cloudflare_browser | POST /browser-rendering/content. Same headless Chrome, but returns raw rendered HTML. Our ReadabilityExtractor takes over. | Atypical layouts where the markdown extractor's heuristics drop content but the regex extractor catches it (tables that don't map cleanly to GFM, deeply nested article structures). |
| 4 | cloudflare_vision | Two-stage: (1) full-page screenshot via /browser-rendering/screenshot (PNG); (2) Workers AI multimodal call to @cf/meta/llama-3.2-11b-vision-instruct with an OCR-only prompt, which returns the extracted text. The text is wrapped in <p> elements so the chunker treats each paragraph as its own unit. | Canvas-rendered slide decks, all-image landing pages, embedded PDF viewers, or any site whose meaningful content lives only in pixels. |
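
The vision tier's last step, wrapping OCR text in <p> elements, can be sketched roughly as follows. This is an illustrative Python sketch, not the project's code: the function name wrap_paragraphs and the rule that paragraphs are separated by blank lines are assumptions made for the example.

```python
import html
import re


def wrap_paragraphs(ocr_text: str) -> str:
    """Wrap OCR output in <p> elements, one per paragraph.

    Assumption: paragraphs in the OCR text are separated by blank lines.
    """
    blocks = re.split(r"\n\s*\n", ocr_text.strip())
    paragraphs = []
    for block in blocks:
        # Collapse line breaks and runs of whitespace inside a paragraph.
        text = " ".join(block.split())
        if text:
            # Escape so a stray "<" or "&" in the OCR text cannot break the markup.
            paragraphs.append(f"<p>{html.escape(text)}</p>")
    return "\n".join(paragraphs)
```

Because every paragraph becomes its own element, a downstream chunker that splits on block-level tags sees the same shape it would get from an ordinary article page.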

How a tier "wins"

The chain measures extracted text length, not raw HTML length. Each tier's output is passed through ReadabilityExtractor (with the same chrome-stripping fallback used downstream). If the extracted text is at least 200 characters, that tier wins and its HTML is cached for 5 minutes against the source URL. If not, the chain falls forward.
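
A minimal sketch of the win check and the 5-minute cache, in Python for illustration only. The text collector below is a naive stand-in for ReadabilityExtractor, and the names extracted_text, tier_wins, and TtlCache are hypothetical; only the 200-character threshold and the 5-minute lifetime come from the description above.

```python
import time
from html.parser import HTMLParser

MIN_EXTRACTED_CHARS = 200   # threshold described above
CACHE_TTL_SECONDS = 5 * 60  # winning HTML is cached for 5 minutes


class _TextCollector(HTMLParser):
    """Naive stand-in for ReadabilityExtractor: keeps visible text only."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def extracted_text(html_doc: str) -> str:
    parser = _TextCollector()
    parser.feed(html_doc)
    parser.close()
    # Normalize whitespace so markup indentation does not inflate the count.
    return " ".join(" ".join(parser.parts).split())


def tier_wins(html_doc: str) -> bool:
    # Measure extracted text, not raw HTML length.
    return len(extracted_text(html_doc)) >= MIN_EXTRACTED_CHARS


class TtlCache:
    """In-memory cache keyed by source URL (assumption: the real store may differ)."""

    def __init__(self, ttl=CACHE_TTL_SECONDS, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock
        self._entries = {}

    def put(self, url, html_doc):
        self._entries[url] = (self._clock() + self._ttl, html_doc)

    def get(self, url):
        entry = self._entries.get(url)
        if entry is None or entry[0] <= self._clock():
            self._entries.pop(url, None)  # drop expired entries lazily
            return None
        return entry[1]
```

With this check, a page made of tens of kilobytes of empty <div> markup yields zero extracted characters and does not win, while a short article with a few sentences of body text does.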

This used to be a raw-length check, which let JS-only Inertia shells "win" with 50 KB of empty <div> markup and then silently fail extraction downstream. The extraction-aware check fires the vision tier exactly when the extractor would have failed anyway: no wasted retries, no silent zeros.

Cost & safeguards

  • CLOUDFLARE_BROWSER_DAILY_LIMIT: shared counter for /content, /markdown, and /screenshot invocations. Default 0 (unlimited). Operators on a fixed Cloudflare bundle should set this to a per-day cap that matches their plan.
  • CLOUDFLARE_VISION_DAILY_LIMIT: separate counter for Workers AI vision calls (these consume Neurons, billed separately from Browser Rendering). Default 0.
  • CLOUDFLARE_VISION_MODEL: defaults to @cf/meta/llama-3.2-11b-vision-instruct. Override only if Cloudflare retires the model or you've negotiated a different one.
  • Both daily-cap guards throw a recoverable exception when the cap is hit, so the chain falls forward to the next tier instead of failing the whole crawl. Vision being capped just means the page that needed OCR doesn't get indexed; every other URL still flows.
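
The daily-cap behaviour can be sketched as a counter that resets each day and raises a recoverable exception once the cap is reached. This Python sketch is illustrative: DailyCounter, DailyLimitReached, and limit_from_env are hypothetical names, and the in-memory state is an assumption (a real deployment would presumably keep the counter in a shared store).

```python
import datetime
import os


class DailyLimitReached(Exception):
    """Recoverable: the chain treats this like any other tier failure."""


def limit_from_env(name: str) -> int:
    # An unset variable falls back to 0, which means "unlimited".
    return int(os.environ.get(name, "0"))


class DailyCounter:
    def __init__(self, limit: int, today=datetime.date.today):
        self._limit = limit   # 0 means unlimited
        self._today = today   # injectable clock, useful in tests
        self._day = None
        self._count = 0

    def hit(self) -> None:
        """Record one invocation, or raise if today's cap is already used up."""
        day = self._today()
        if day != self._day:
            # First call of a new day: start counting from zero again.
            self._day, self._count = day, 0
        if self._limit > 0 and self._count >= self._limit:
            raise DailyLimitReached(f"daily limit of {self._limit} reached")
        self._count += 1
```

A tier would call hit() before each Cloudflare request, for example with DailyCounter(limit_from_env("CLOUDFLARE_VISION_DAILY_LIMIT")) for the vision tier. Since the exception is an ordinary failure from the chain's point of view, a capped tier is skipped and the crawl continues.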

What happens when every tier fails

The chain throws a RuntimeException whose message lists each tier and why it failed (HTTP code, exception message, or "extraction produced N chars under threshold"). CrawlPageJob::failed() catches this, runs the message through SourceErrorPresenter, and stamps a customer-readable line onto the source's error column (rendered in the admin's Sources dialog).
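
The fall-forward loop and the aggregated failure message might look roughly like this. It is a Python sketch for illustration, not the project's PHP code: run_chain, the (name, fetch) tier shape, and the exact wording of the message are assumptions.

```python
from typing import Callable, List, Tuple

MIN_EXTRACTED_CHARS = 200  # threshold a tier's extracted text must clear

# (tier name, fetch function that returns HTML or raises)
Tier = Tuple[str, Callable[[str], str]]


def run_chain(url: str, tiers: List[Tier], extract: Callable[[str], str]) -> Tuple[str, str]:
    """Try each tier in order and return (tier name, html) for the first winner.

    Raises RuntimeError listing every tier and why it failed.
    """
    failures = []
    for name, fetch in tiers:
        try:
            html_doc = fetch(url)
        except Exception as exc:
            # HTTP errors, daily caps, and timeouts all fall forward.
            failures.append(f"{name}: {exc}")
            continue
        chars = len(extract(html_doc))
        if chars >= MIN_EXTRACTED_CHARS:
            return name, html_doc
        failures.append(f"{name}: extraction produced {chars} chars under threshold")
    raise RuntimeError(f"All crawler tiers failed for {url}: " + "; ".join(failures))
```

Collecting one reason per tier keeps the final message self-explanatory, which is what lets a presenter layer turn it into a single customer-readable line.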

Browserless deprecation

The BrowserlessClient tier was removed from the default chain in 2026-06 once the vision OCR tier landed. CF /markdown + /content + vision now cover the same ground with no third-party billing line. The BROWSERLESS_URL / BROWSERLESS_TOKEN env vars stay readable for backwards compatibility, but they no longer wire anything by default. A custom service provider can still bind BrowserlessClient manually if a self-hosted Browserless is preferred.