Crawl Setup

Crawl Configuration and Execution

How to set up your crawler so it sees what Google sees.

Where to find it: Screaming Frog > Configuration settings

What It Is

Running a technical audit crawler is not plug-and-play. The crawl configuration (user agent, JavaScript rendering, authentication, crawl speed, and robots.txt handling) determines whether the crawler sees the site the same way Googlebot does. Misconfigured crawls produce misleading data: missing pages when JavaScript isn't rendered, false errors when the crawler isn't given credentials for password-protected areas such as staging sites, or incomplete coverage when crawl limits are set too low for the site's size.

Why It Matters

A crawl is only as useful as its configuration. Agencies running default-setting crawls on large or JavaScript-heavy sites produce audit data that doesn't reflect what Googlebot actually experiences. A React site crawled without JavaScript rendering returns a completely different page structure than what Google indexes. Knowing how to configure a crawl correctly for different site types is a core technical audit competency — without it, every finding is potentially invalid.

Common Audit Failure Points

What Goes Wrong

Understanding where audits fail — and why — is the first step to executing them correctly.

01

No JavaScript Rendering on JS-Heavy Sites

Running a crawl without JavaScript rendering on client-side-rendered sites (React, Vue, Angular, or Next.js pages that render in the browser) can miss half or more of the rendered content: the crawler sees near-empty HTML shells where pages should be.
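One quick way to spot this before committing to a full crawl is to measure how much visible text the unrendered HTML contains. The sketch below is a rough heuristic using only the Python standard library; the function names and the 200-character threshold are assumptions for illustration, not a Screaming Frog feature.

```python
from html.parser import HTMLParser


class VisibleTextCounter(HTMLParser):
    """Counts characters of text outside <script>, <style>, and <noscript>."""

    SKIPPED = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.visible_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Only count text that is not inside a skipped element.
        if self.skip_depth == 0:
            self.visible_chars += len(data.strip())


def visible_text_length(html):
    """Return the number of visible text characters in unrendered HTML."""
    parser = VisibleTextCounter()
    parser.feed(html)
    parser.close()
    return parser.visible_chars


def looks_like_js_shell(raw_html, min_chars=200):
    """Assumed heuristic: very little visible text suggests a JS-rendered shell."""
    return visible_text_length(raw_html) < min_chars
```

If the raw HTML of a key template trips this check, it is a signal to enable JavaScript rendering in the crawler before trusting any content or link data from the crawl.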

02

Wrong User Agent Configured

Crawling with the default Screaming Frog user agent rather than Googlebot means robots.txt rules that target Googlebot, and any server responses that vary by user agent, aren't applied to the crawl, so the crawl doesn't see what Google sees.
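To see why the user agent changes the outcome, the sketch below evaluates one robots.txt file for two different user-agent tokens with Python's standard `urllib.robotparser`. The robots.txt content and the example.com URLs are made up for illustration, and Python's matching is only an approximation of how Google interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with a Googlebot-specific group and a catch-all group.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /search/

User-agent: *
Disallow: /private/
"""


def can_fetch_as(robots_txt, user_agent, url):
    """Return True if the given user-agent token may fetch the URL.

    Pass the product token (for example "Googlebot"), not the full
    browser-style user-agent string.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("Googlebot", "Screaming Frog SEO Spider"):
        for path in ("/search/shoes", "/private/report"):
            url = "https://example.com" + path
            print(agent, path, can_fetch_as(ROBOTS_TXT, agent, url))
```

In this example the two agents get opposite answers for both paths, because a crawler that matches a named group follows only that group and ignores the catch-all rules.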

03

Crawl Speed Too High for the Host

Crawling too aggressively, especially on shared hosting, can trigger server-side rate limiting: the crawler receives 429 (Too Many Requests) errors or soft-blocked responses that show up as page errors in the audit data.
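The throttling logic behind a polite crawl can be sketched as a small function that adjusts the pause between requests based on the last response. This is an illustrative model only, not how Screaming Frog sets its speed; the function name, the default limits, and the assumption that `Retry-After` is given in seconds are all choices made for this example.

```python
def next_delay(current_delay, status_code, retry_after=None,
               min_delay=0.2, max_delay=30.0):
    """Return the pause in seconds before the next request.

    - 429 or 503: back off, honoring Retry-After (assumed to be seconds)
      when the server sends it, otherwise doubling the current delay.
    - 2xx or 3xx: speed up gently, never going below min_delay.
    - Anything else: keep the current delay.
    """
    if status_code in (429, 503):
        if retry_after is not None:
            return min(max(float(retry_after), current_delay), max_delay)
        return min(current_delay * 2, max_delay)
    if 200 <= status_code < 400:
        return max(current_delay * 0.9, min_delay)
    return current_delay
```

In a desktop crawler the equivalent action is manual: lower the thread count or URL-per-second limit as soon as 429 responses start appearing, then recrawl the affected URLs.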

04

XML Sitemap Not Imported

Spidering from the root URL alone misses orphaned pages that exist in the sitemap but have no internal links pointing to them. The most important pages to find are precisely those that don't surface in a standard crawl.
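The orphan check itself is a set difference: URLs listed in the sitemap minus URLs the spider reached by following links. A minimal sketch with the Python standard library follows; the function names are hypothetical, and it assumes a plain `urlset` sitemap rather than a sitemap index.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(sitemap_xml):
    """Return the set of <loc> URLs from a urlset sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return {
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    }


def orphan_candidates(sitemap_xml, crawled_urls):
    """Return sitemap URLs that the link-following crawl never reached."""
    return sorted(sitemap_urls(sitemap_xml) - set(crawled_urls))
```

URLs returned here are only candidates: differences in trailing slashes, protocol, or URL parameters between the sitemap and the crawl export can also produce mismatches, so normalize both lists before drawing conclusions.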

Interactive Standard Operating Procedure

The Audit Playbook (Interactive SOP)

Tools

  • Screaming Frog
    Paid/Free tier | The primary crawl tool — configuration settings covered in this episode are all within Screaming Frog's Configuration menu
  • Sitebulb
    Paid | Better visual representation of crawl data and site structure — a strong complement to Screaming Frog for client-facing architecture work
  • Botify or Lumar (formerly DeepCrawl)
    Enterprise | For sites with millions of pages requiring scheduled crawls, log file integration, and enterprise-scale reporting capabilities

Time Investment

Crawl Setup: 30 minutes
Crawl Execution: 30 minutes to 8 hours, depending on site size

Pro Tip

Always run two crawls — one respecting robots.txt and one ignoring it.

The comparison between the two crawls reveals what's blocked from Googlebot — including staging Disallow directives that weren't removed when the site went live, misconfigured security rules that block entire directories, and accidentally disallowed content sections. This two-crawl comparison is the fastest way to find critical crawl access issues that would otherwise require manually reading every robots.txt rule.
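Once both crawls are exported as URL lists, the comparison can be sketched in a few lines. The example below assumes each crawl is available as a plain list of URLs; the function names and the grouping by top-level directory are illustrative choices, not part of any crawler's export format.

```python
from collections import Counter
from urllib.parse import urlsplit


def blocked_by_robots(respecting_crawl, ignoring_crawl):
    """URLs found only when robots.txt is ignored, i.e. blocked for the crawler."""
    return sorted(set(ignoring_crawl) - set(respecting_crawl))


def top_directory(url):
    """Return the first path segment as '/segment/', or '/' for root-level pages."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    return "/" + segments[0] + "/" if len(segments) > 1 else "/"


def summarize_blocked(respecting_crawl, ignoring_crawl):
    """Count blocked URLs per top-level directory to highlight blocked sections."""
    blocked = blocked_by_robots(respecting_crawl, ignoring_crawl)
    return Counter(top_directory(url) for url in blocked)
```

A directory with a large blocked count is the place to start reading robots.txt: it usually maps to a single Disallow rule covering a whole section.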
