Crawl Configuration and Execution
How to set up your crawler so it sees what Google sees.
What It Is
Running a technical audit crawler is not plug-and-play. The crawl configuration (user agent, JavaScript rendering, authentication, crawl speed, and robots.txt handling) determines whether the crawler sees the site the same way Googlebot does. Misconfigured crawls produce misleading data: missing pages when JavaScript isn't rendered, false errors when the crawler can't get past authentication, or incomplete coverage when crawl limits are set too low for the site's size.
Why It Matters
A crawl is only as useful as its configuration. Agencies running default-setting crawls on large or JavaScript-heavy sites produce audit data that doesn't reflect what Googlebot actually experiences. A React site crawled without JavaScript rendering returns a completely different page structure than what Google indexes. Knowing how to configure a crawl correctly for different site types is a core technical audit competency — without it, every finding is potentially invalid.
What Goes Wrong
Understanding where audits fail — and why — is the first step to executing them correctly.
No JavaScript Rendering on JS-Heavy Sites
Running a crawl without JavaScript rendering on React, Vue, Angular, or Next.js sites that rely on client-side rendering can miss half or more of the rendered content: the crawler sees near-empty HTML shells where pages should be.
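A quick way to sanity-check whether a page needs rendering is to look at how much visible text its raw, unrendered HTML contains. The sketch below is a rough heuristic, not a Screaming Frog feature: the function name `looks_like_js_shell` and the 200-character threshold are assumptions chosen for illustration.

```python
from html.parser import HTMLParser


class VisibleTextCounter(HTMLParser):
    """Counts text characters that sit outside <script>, <style>, and <noscript>."""

    SKIPPED = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.visible_chars = 0
        self.script_tags = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_tags += 1
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only count text a non-rendering crawler could actually read.
        if self._skip_depth == 0:
            self.visible_chars += len(data.strip())


def looks_like_js_shell(raw_html, min_visible_chars=200):
    """Heuristic: scripts are present but almost no readable text is.

    The threshold is an arbitrary assumption; tune it per site.
    """
    counter = VisibleTextCounter()
    counter.feed(raw_html)
    return counter.script_tags > 0 and counter.visible_chars < min_visible_chars


# Hypothetical raw HTML for a client-side-rendered page.
shell = (
    '<html><head><title>App</title></head><body>'
    '<div id="root"></div><script src="/bundle.js"></script>'
    '</body></html>'
)
print(looks_like_js_shell(shell))
```

If a sample of key templates comes back as shells, that is a strong hint the crawl needs JavaScript rendering switched on before any findings can be trusted.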
Wrong User Agent Configured
Crawling with the default Screaming Frog user agent rather than a Googlebot user agent means robots.txt rules written for Googlebot, and any server responses that vary by user agent, aren't applied. The crawl doesn't see what Google sees.
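To see why the user agent changes the outcome, the sketch below runs the same URLs through Python's standard-library robots.txt parser under two different user agents. The robots.txt content and the example.com URLs are invented for illustration, and the parser only approximates how any given crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with a Googlebot-specific group.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /search/
"""


def allowed_for(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for agent in ("Googlebot", "Screaming Frog SEO Spider"):
    for path in ("/search/results", "/private/report"):
        url = "https://example.com" + path
        print(agent, path, allowed_for(agent, url))
```

In this example the Googlebot group applies only to Googlebot, so a crawl under the default user agent would follow the wildcard group and report a different set of blocked URLs.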
Crawl Speed Too High for the Host
Setting crawl speed too aggressively on shared hosting triggers server-side rate limiting — the crawler receives 429 errors or soft-blocked responses that appear as page errors in the audit data.
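The underlying idea, slowing down when the server answers 429, can be sketched in a few lines. This is an illustrative backoff loop, not how Screaming Frog implements its speed settings. The `fetch` callable, its `(status, retry_after)` return shape, and the default delays are all assumptions.

```python
import time


def fetch_politely(url, fetch, sleep=time.sleep, max_retries=3, base_delay=1.0):
    """Call fetch(url) and back off whenever the server answers 429.

    fetch is a hypothetical callable returning (status_code, retry_after),
    where retry_after is a number of seconds or None.
    Returns the final status code.
    """
    delay = base_delay
    for attempt in range(max_retries + 1):
        status, retry_after = fetch(url)
        if status != 429:
            return status
        if attempt == max_retries:
            break
        # Prefer the server's Retry-After hint; otherwise use our own delay.
        wait = retry_after if retry_after is not None else delay
        sleep(wait)
        delay *= 2  # double the fallback delay for the next attempt
    return 429
```

A URL that still returns 429 after the retries is a rate-limit artifact of the crawl, not a genuine page error, and should be re-crawled at a lower speed before it is reported.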
XML Sitemap Not Imported
Spidering from the root URL alone misses orphaned pages that exist in the sitemap but have no inbound links — the most important pages to find are precisely those that don't surface in a standard crawl.
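The orphan check itself is a set difference: URLs listed in the sitemap minus URLs the spider reached by following links. A minimal sketch, assuming a standard sitemaps.org `urlset` file and a plain list of crawled URLs (the sample data is invented):

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(sitemap_xml):
    """Extract every <loc> value from a urlset sitemap string."""
    root = ET.fromstring(sitemap_xml)
    return {
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    }


def orphan_candidates(sitemap_xml, crawled_urls):
    """URLs present in the sitemap that the link-following crawl never reached."""
    return sorted(sitemap_urls(sitemap_xml) - set(crawled_urls))


# Hypothetical sitemap and crawl result.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/pricing</loc></url>
  <url><loc>https://example.com/old-landing-page</loc></url>
</urlset>"""

crawled = ["https://example.com/", "https://example.com/pricing"]
print(orphan_candidates(SAMPLE_SITEMAP, crawled))
```

This comparison is exact-match only. In practice the URLs would need normalizing first (trailing slashes, protocol, case), which the sketch leaves out.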
The Audit Playbook (Interactive SOP)
Tools
- Screaming Frog | Paid/Free tier | The primary crawl tool. The configuration settings covered in this episode are all within Screaming Frog's Configuration menu.
- Sitebulb | Paid | Better visual representation of crawl data and site structure; a strong complement to Screaming Frog for client-facing architecture work.
- Botify or Lumar (DeepCrawl) | Enterprise | For sites with millions of pages requiring scheduled crawls, log file integration, and enterprise-scale reporting capabilities.
Time Investment
Pro Tip
Always run two crawls — one respecting robots.txt and one ignoring it.
The comparison between the two crawls reveals what's blocked from Googlebot — including staging Disallow directives that weren't removed when the site went live, misconfigured security rules that block entire directories, and accidentally disallowed content sections. This two-crawl comparison is the fastest way to find critical crawl access issues that would otherwise require manually reading every robots.txt rule.
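Once both crawls are exported as URL lists, the comparison can be sketched as a set difference grouped by top-level directory. The function names and the example.com URLs below are illustrative assumptions. Reading the URLs out of the crawl exports is left out.

```python
from collections import Counter
from urllib.parse import urlsplit


def blocked_urls(respecting_robots, ignoring_robots):
    """URLs found only when robots.txt is ignored, i.e. blocked from the crawler."""
    return sorted(set(ignoring_robots) - set(respecting_robots))


def blocked_sections(urls):
    """Count blocked URLs per top-level directory to surface whole blocked sections."""
    sections = Counter()
    for url in urls:
        segments = [part for part in urlsplit(url).path.split("/") if part]
        section = "/" + segments[0] + "/" if segments else "/"
        sections[section] += 1
    return dict(sections)


# Hypothetical URL lists from the two crawls.
respecting = ["https://example.com/", "https://example.com/blog/a"]
ignoring = respecting + [
    "https://example.com/staging/x",
    "https://example.com/staging/y",
    "https://example.com/blog/drafts",
]

only_when_ignoring = blocked_urls(respecting, ignoring)
print(blocked_sections(only_when_ignoring))
```

A directory with a high blocked count is the first place to look for a leftover staging Disallow or an overly broad rule.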