Crawl Setup

Crawl Configuration and Execution

How to set up your crawler so it sees what Google sees.

Where to find it: Screaming Frog > Configuration settings

What It Is

Running a technical audit crawler is not plug-and-play. The crawl configuration (user agent, JavaScript rendering, authentication, crawl speed, and robots.txt handling) determines whether the crawler sees the site the same way Googlebot does. Misconfigured crawls produce misleading data: missing pages when JavaScript isn't rendered, false errors when authentication isn't configured for gated areas, or incomplete coverage when crawl limits are set too low for the site's size.

Why It Matters

A crawl is only as useful as its configuration. Agencies running default-setting crawls on large or JavaScript-heavy sites produce audit data that doesn't reflect what Googlebot actually experiences. A React site crawled without JavaScript rendering returns a completely different page structure than what Google indexes. Knowing how to configure a crawl correctly for different site types is a core technical audit competency — without it, every finding is potentially invalid.

Common Audit Failure Points

What Goes Wrong

Understanding where audits fail — and why — is the first step to executing them correctly.

No JavaScript Rendering on JS-Heavy Sites

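Running a crawl without JavaScript rendering on client-side-rendered React, Vue, Angular, or Next.js sites can miss most of the rendered content: the crawler sees near-empty HTML shells where pages should be. How much is missed depends on how much of each page is rendered server-side.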

Wrong User Agent Configured

Crawling with the default Screaming Frog user agent rather than Googlebot means robots.txt rules written for Googlebot and server responses that vary by user agent aren't applied, so the crawl doesn't see what Google sees.

Crawl Speed Too High for the Host

Setting crawl speed too aggressively on shared hosting triggers server-side rate limiting — the crawler receives 429 errors or soft-blocked responses that appear as page errors in the audit data.

XML Sitemap Not Imported

Spidering from the root URL alone misses orphaned pages that exist in the sitemap but have no inbound links — the most important pages to find are precisely those that don't surface in a standard crawl.

Interactive Standard Operating Procedure

The Audit Playbook (Interactive SOP)

1. Set the User Agent to Googlebot

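In Screaming Frog > Configuration > User-Agent, select Googlebot. This ensures the crawl is subject to the same robots.txt rules and user-agent-based server responses as Google, including user-agent-based blocks or delivery differences. It cannot reproduce IP-based behavior: the requests still come from your own IP address, so a server that verifies Googlebot by IP may respond differently.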
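To see why the user agent changes which robots.txt rules apply, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the URLs, and the `allowed` helper are hypothetical, and Python's parser does not replicate every detail of Google's rule matching, so treat it as an illustration rather than a Googlebot emulator.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with a Googlebot-specific group (not from a real site).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /search/
"""

def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# The same two URLs, checked under two different user agents.
for agent in ("Googlebot", "Screaming Frog SEO Spider"):
    for url in ("https://example.com/search/q", "https://example.com/private/x"):
        print(agent, url, allowed(agent, url))
```

With this file, Googlebot is bound only by its own group, so `/search/` is blocked for it while `/private/` is not. Any other user agent falls back to the `*` group and gets the opposite result.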

2. Enable JavaScript Rendering for JS Framework Sites

For any site using React, Vue, Angular, or Next.js: enable JavaScript rendering in Configuration > Spider > Rendering. This switches the crawl to a headless browser mode that executes JavaScript before parsing the DOM — matching how Googlebot renders these pages.
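If it is unclear whether a site needs rendering, one rough check is to look at the raw (unrendered) HTML and see whether it is mostly scripts with very little visible text. The sketch below assumes a 200-character threshold, which is an arbitrary illustrative value, and the `looks_like_js_shell` helper is hypothetical. It is a heuristic only and does not replace comparing a rendered crawl against a non-rendered one.

```python
from html.parser import HTMLParser

class _TextCounter(HTMLParser):
    """Counts visible text characters and <script> tags in raw HTML."""

    def __init__(self):
        super().__init__()
        self.text_chars = 0
        self.scripts = 0
        self._skip = 0  # depth inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.scripts += 1
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Ignore script and style bodies; count everything else.
        if not self._skip:
            self.text_chars += len(data.strip())

def looks_like_js_shell(raw_html: str, min_text_chars: int = 200) -> bool:
    """Heuristic: the page has scripts but almost no text before rendering."""
    counter = _TextCounter()
    counter.feed(raw_html)
    counter.close()
    return counter.scripts > 0 and counter.text_chars < min_text_chars

shell = ('<html><head><title>App</title></head><body>'
         '<div id="root"></div><script src="/bundle.js"></script></body></html>')
print(looks_like_js_shell(shell))
```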

3. Configure Crawl Speed Appropriately for the Host

Set crawl speed at Configuration > Speed: 2–5 requests per second for shared hosting, 10+ for dedicated servers or CDN-fronted sites. Check with the client's developer if unsure — crawling too fast can trigger downtime on low-resource hosting.
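The speed setting also drives how long the crawl takes, which is worth estimating before committing to a rate. A back-of-the-envelope sketch follows; the `estimated_crawl_hours` helper and the 100,000-URL site are hypothetical. It assumes the crawler sustains the configured rate throughout, so treat the result as a lower bound, since JavaScript rendering and slow responses add time.

```python
def estimated_crawl_hours(url_count: int, urls_per_second: float) -> float:
    """Lower-bound crawl duration in hours at a constant request rate."""
    if urls_per_second <= 0:
        raise ValueError("urls_per_second must be positive")
    return url_count / urls_per_second / 3600

# Hypothetical 100,000-URL site at the rates discussed above.
for rate in (2, 5, 10):
    print(f"{rate} URLs/s: {estimated_crawl_hours(100_000, rate):.1f} hours")
```

At 5 requests per second, 100,000 URLs take 20,000 seconds, or roughly 5.6 hours. At 2 per second the same crawl takes nearly 14 hours, so a conservative rate on a large site may call for an overnight run.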

4. Import the XML Sitemap in Addition to Spidering

In Screaming Frog > Mode > List, import the XML sitemap URL in addition to running the standard spider crawl. This catches orphaned pages that Googlebot would find via sitemap but that never appear in the link-following crawl.
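The underlying idea is a set difference: URLs listed in the sitemap minus URLs reached by following links. A minimal sketch using the standard sitemap namespace is shown below; the sample sitemap, the URLs, and the helper names are hypothetical, and it handles a single urlset file rather than a sitemap index.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(sitemap_xml: str) -> set:
    """Extract every <loc> value from a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return {loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text}

def orphan_candidates(sitemap_xml: str, crawled_urls) -> set:
    """URLs present in the sitemap but never reached by the link-following crawl."""
    return sitemap_urls(sitemap_xml) - set(crawled_urls)

# Hypothetical sitemap and spider results.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/pricing</loc></url>
  <url><loc>https://example.com/old-landing-page</loc></url>
</urlset>"""
spidered = ["https://example.com/", "https://example.com/pricing"]
print(orphan_candidates(SAMPLE, spidered))
```

Anything this returns is only a candidate: the URL may be linked from a page the spider never reached, so confirm with the crawler's inlink data before reporting it as orphaned.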

5. Run Two Crawls: One Respecting and One Ignoring robots.txt

First crawl: respect robots.txt (see what Googlebot sees). Second crawl: ignore robots.txt (see everything blocked). Compare results to identify accidentally blocked sections, staging directives still active on production, and misconfigured security rules.
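Once both crawls are exported, the comparison can be sketched as a set difference grouped by top-level directory, which makes whole blocked sections stand out. The sketch assumes each input is a plain list of URLs that were actually fetched in that crawl. The helper names and URLs are hypothetical, and reading them out of a real export file is left out.

```python
from collections import Counter
from urllib.parse import urlsplit

def section_of(url: str) -> str:
    """Return the top-level directory of a URL, or "/" for root-level pages."""
    path = urlsplit(url).path
    segments = [s for s in path.split("/") if s]
    if segments and (len(segments) > 1 or path.endswith("/")):
        return "/" + segments[0] + "/"
    return "/"

def blocked_by_section(respecting_urls, ignoring_urls) -> Counter:
    """Count URLs found only when robots.txt is ignored, per top-level section."""
    blocked = set(ignoring_urls) - set(respecting_urls)
    return Counter(section_of(url) for url in blocked)

# Hypothetical results from the two crawls.
respecting = ["https://example.com/", "https://example.com/blog/post-1"]
ignoring = respecting + [
    "https://example.com/staging/a",
    "https://example.com/staging/b",
    "https://example.com/admin/login",
]
print(blocked_by_section(respecting, ignoring).most_common())
```

Sections with high counts are the ones to check against robots.txt first. A blocked `/admin/` is usually intentional, while a blocked content directory usually is not.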

6. Enable HTML Storage for Content Analysis

In Configuration > Spider > Extraction, enable 'Store HTML'. This allows Screaming Frog to run the near-duplicate content analysis and full-page content comparisons that would otherwise require re-crawling. Required for a thorough content quality audit.
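To make "near-duplicate" concrete, here is a minimal sketch of one common way to score page similarity: Jaccard similarity over word shingles. This is an illustrative technique, not a description of Screaming Frog's own algorithm, and the function names and three-word shingle size are assumptions.

```python
def shingles(text: str, size: int = 3) -> set:
    """Split text into overlapping word n-grams (shingles)."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def jaccard(a: str, b: str, size: int = 3) -> float:
    """Share of shingles two texts have in common (0.0 to 1.0)."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

# Two hypothetical page texts that differ by one word.
print(jaccard("the quick brown fox jumps", "the quick brown fox sleeps"))
```

The two sample sentences share two of four distinct shingles, so they score 0.5. Real page bodies are much longer, which is why a single changed word barely moves the score there.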

7. Configure Authentication for Restricted Areas

If the site has a staging password, admin area, or login-gated content that should be audited: configure authentication in Configuration > Authentication, or pass credentials through Configuration > HTTP Header or Configuration > Cookies. This ensures the crawl can access all relevant page types.
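When a staging site uses HTTP Basic authentication, the credential ends up as an `Authorization` request header. The sketch below shows how that header value is built, in case it needs to be supplied as a custom header. The `basic_auth_header` helper and the credentials are placeholders, and the resulting value is reversible encoding rather than encryption, so it should only travel over HTTPS.

```python
import base64

def basic_auth_header(username: str, password: str) -> dict:
    """Build the Authorization header for HTTP Basic authentication."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}

# Placeholder credentials for illustration only.
print(basic_auth_header("user", "pass"))
```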

Tools

Screaming Frog | Paid/Free tier | The primary crawl tool: configuration settings covered in this episode are all within Screaming Frog's Configuration menu
Sitebulb | Paid | Better visual representation of crawl data and site structure: a strong complement to Screaming Frog for client-facing architecture work
Botify or Lumar (DeepCrawl) | Enterprise | For sites with millions of pages requiring scheduled crawls, log file integration, and enterprise-scale reporting capabilities

Time Investment

Crawl Setup: 30 minutes
Crawl Execution: 30 min to 8 hours by site size

Pro Tip

Always run two crawls — one respecting robots.txt and one ignoring it.

The comparison between the two crawls reveals what's blocked from Googlebot — including staging Disallow directives that weren't removed when the site went live, misconfigured security rules that block entire directories, and accidentally disallowed content sections. This two-crawl comparison is the fastest way to find critical crawl access issues that would otherwise require manually reading every robots.txt rule.
