On Google Search Central, Gary Illyes tells us more about the inner workings of Googlebot and announces a change of location for its IP range files. Concrete information that can directly impact the way your site is crawled and indexed.
What to remember:
- Googlebot is not a single robot: it relies on a centralized infrastructure shared by dozens of Google services (Shopping, AdSense, etc.).
- Googlebot only downloads the first 2 MB of an HTML page (PDFs have a separate, higher limit): anything beyond this threshold is not fetched, not rendered, and not indexed.
- The IP range files of Google crawlers change location: you must migrate to /crawling/ipranges/ in the next 6 months.
- The order of elements in your HTML really matters: critical tags should appear as high as possible in the code.
Googlebot was never just one robot
This is one of the most stubborn SEO myths. In the 2000s, Google had only one product and therefore only one crawler, and the name “Googlebot” stuck. Today, however, “Googlebot” designates just one client among many within a centralized crawl infrastructure.
When you see “Googlebot” in your server logs, you are only observing Google Search traffic. Dozens of other services, such as Google Shopping or AdSense, use this same infrastructure under separate crawler names. The list of main crawlers is documented on Google's Crawling infrastructure page.
The 2 MB limit: understanding what Google actually downloads
Already mentioned by Google a few weeks ago, this is the most technical point, and probably the most important for webmasters. Googlebot only downloads the first 2 MB of each HTML URL, HTTP headers included. For PDFs, the limit is set at 64 MB. For crawlers that do not specify a limit, the default is 15 MB.
What is actually happening:
- The fetch stops dead at 2 MB. Googlebot does not reject the page; it simply cuts off the download at the 2 MB threshold. The retrieved portion is then passed to the indexing systems and the Web Rendering Service (WRS) as if it were the full file.
- Anything beyond that is invisible. Bytes located beyond this threshold are not fetched, not rendered, not indexed. For Googlebot, they simply don't exist.
- Related resources are fetched separately. Each resource referenced in the HTML (excluding media, fonts and some exotic files) is downloaded by the WRS with its own byte counter, independent of the parent page.
For the vast majority of sites, 2 MB of HTML is a considerable volume. But certain practices can pose a problem: base64-encoded images embedded directly in the HTML, large blocks of inline CSS or JavaScript, or oversized menus placed at the start of the code. If these elements push your textual content or structured data beyond the threshold, Googlebot will never see them.
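The cutoff described above is easy to simulate. Here is a minimal sketch, assuming only that the limit is 2 MB of bytes (the exact accounting of HTTP headers is simplified away): truncate the page to the first 2 MB and check whether a given tag survives. The page construction with a base64 blob is a hypothetical example.

```python
# Hypothetical sketch: simulate Googlebot's 2 MB fetch cutoff.
# For simplicity we count only body bytes, not HTTP headers.

FETCH_LIMIT = 2 * 1024 * 1024  # 2 MB, as stated by Google

def visible_to_googlebot(html_bytes: bytes, limit: int = FETCH_LIMIT) -> bytes:
    """Return only the portion of the page Googlebot would actually fetch."""
    return html_bytes[:limit]

def survives_cutoff(html_bytes: bytes, marker: bytes) -> bool:
    """Check whether a given tag/snippet falls inside the fetched portion."""
    return marker in visible_to_googlebot(html_bytes)

# Example: a base64 image blob pushes the JSON-LD block past the limit.
bloat = b'<img src="data:image/png;base64,' + b"A" * (3 * 1024 * 1024) + b'">'
page = (b"<html><head><title>ok</title></head><body>" + bloat
        + b'<script type="application/ld+json">{}</script></body></html>')

print(survives_cutoff(page, b"<title>ok</title>"))    # True
print(survives_cutoff(page, b"application/ld+json"))  # False
```

Running this makes the point concrete: the title near the top of the document survives, while structured data placed after the inline blob is simply invisible to Googlebot.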
Rendering: what the Web Rendering Service does with these bytes
Once the bytes are retrieved, the WRS takes over. It runs client-side JavaScript and CSS, much like a modern browser, to understand the final state of the page. It also issues XHR requests to better understand the textual content and structure of the page, but does not load images or videos.
Two important points to keep in mind: the WRS can only execute code actually downloaded by the fetcher, and it operates statelessly. It clears local storage and session data between requests, which can affect the interpretation of dynamic, JavaScript-dependent elements.
Best practices for optimizing the crawling of your pages
Google makes several directly actionable recommendations:
- Keep your HTML lean. Move CSS and JavaScript into separate files. These resources are fetched independently, each with its own 2 MB quota.
- Place your critical elements at the top of the document. Meta tags, title, canonicals, links and essential structured data must appear as early as possible in the HTML code so they don't risk falling beyond the threshold.
- Monitor your server logs. High response times encourage Google fetchers to automatically reduce the crawl frequency so as not to overload your infrastructure.
Google specifies that this 2 MB limit is not fixed and may evolve as the web transforms.
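The "critical elements first" recommendation can be audited with a short script that reports the byte offset of each essential tag in the raw HTML. This is a sketch: the tag patterns below are illustrative rather than exhaustive, and the sample page is hypothetical.

```python
# Report where each critical tag first appears in the raw HTML, so you can
# confirm it sits well below the 2 MB fetch limit. Patterns are illustrative.
import re

FETCH_LIMIT = 2 * 1024 * 1024  # 2 MB

CRITICAL_PATTERNS = {
    "title": rb"<title[\s>]",
    "canonical": rb'rel=["\']canonical["\']',
    "meta robots": rb'name=["\']robots["\']',
    "json-ld": rb'type=["\']application/ld\+json["\']',
}

def audit_tag_offsets(html: bytes) -> dict:
    """Map each critical tag to its first byte offset, or None if absent."""
    offsets = {}
    for name, pattern in CRITICAL_PATTERNS.items():
        m = re.search(pattern, html)
        offsets[name] = m.start() if m else None
    return offsets

html = b'<html><head><title>Demo</title><link rel="canonical" href="/"></head></html>'
for name, off in audit_tag_offsets(html).items():
    status = "missing" if off is None else ("OK" if off < FETCH_LIMIT else "BEYOND LIMIT")
    print(f"{name}: {off} ({status})")
```

On a real site you would run this against the full HTML of your heaviest pages; any critical tag whose offset approaches the limit is a candidate for moving up in the document.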
Change location for crawler IP range files
At the same time, Google announced that it is moving the JSON files listing the IP ranges of its crawlers. These files, previously available under /search/apis/ipranges/ on developers.google.com, are migrating to a more generic location: developers.google.com/crawling/ipranges/.
This change reflects a reality already mentioned: these IP ranges concern much more than just Googlebot Search. The old path will remain accessible during the transition period, but Google plans to remove it and redirect it within 6 months. The official documentation has already been updated to point to the new location.
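In practice, these files let you verify that traffic claiming to be Googlebot really comes from Google. Here is a minimal sketch, assuming the JSON format Google publishes (a "prefixes" list of objects with "ipv4Prefix" or "ipv6Prefix" keys); in production you would fetch the live file from its new home under developers.google.com/crawling/ipranges/, while the sample below is inlined to keep the snippet self-contained, with example prefixes that should be refreshed from the live file.

```python
# Check whether a visitor IP falls inside any published Googlebot prefix.
import ipaddress
import json

# Inlined sample mirroring the published JSON structure; refresh from the
# live file in production rather than hardcoding prefixes.
SAMPLE_RANGES = json.loads("""
{
  "prefixes": [
    {"ipv4Prefix": "66.249.64.0/27"},
    {"ipv6Prefix": "2001:4860:4801:10::/64"}
  ]
}
""")

def is_googlebot_ip(ip: str, ranges: dict) -> bool:
    """Return True if the IP belongs to one of the listed crawler prefixes."""
    addr = ipaddress.ip_address(ip)
    for entry in ranges["prefixes"]:
        prefix = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        # Membership is False when address and network versions differ.
        if addr in ipaddress.ip_network(prefix):
            return True
    return False

print(is_googlebot_ip("66.249.64.5", SAMPLE_RANGES))  # True
print(is_googlebot_ip("203.0.113.9", SAMPLE_RANGES))  # False
```

If your log-analysis or firewall tooling still points at the old /search/apis/ipranges/ path, this migration window is the moment to update it.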