Finding Information by Crawling
- We use software known as “web crawlers” to discover publicly available webpages. The most well-known crawler is called “Googlebot.”
- Crawlers look at webpages and follow links on those pages, much like you would if you were browsing content on the web. They go from link to link and bring data about those webpages back to Google’s servers.
- The crawl process begins with a list of web addresses from past crawls and sitemaps provided by website owners. As our crawlers visit these websites, they look for links to other pages to visit. The software pays special attention to new sites, changes to existing sites, and dead links.
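- To make that crawl loop concrete, here is a minimal sketch of a link-following crawler in Python. This is not Googlebot's implementation; the seed list is a placeholder, and real crawlers add politeness delays, robots.txt checks, and scheduling.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect href values from anchor tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=50):
    """Breadth-first crawl: start from known URLs, follow links, report what is found."""
    queue = deque(seed_urls)               # list of addresses from past crawls / sitemaps
    seen = set(seed_urls)
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            with urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except OSError as exc:             # dead links, timeouts, HTTP errors
            print(f"could not fetch {url}: {exc}")
            continue
        fetched += 1
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links against the page URL
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
        print(f"crawled {url}: {len(parser.links)} links found")

# Example usage with a placeholder seed list:
# crawl(["https://example.com/"])
```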
- Computer programs determine which sites to crawl, how often, and how many pages to fetch from each site.
- Google doesn't accept payment to crawl a site more frequently for our web search results. We care more about having the best possible results because in the long run that’s what’s best for users and, therefore, our business.
- To learn about the tools and resources available to site owners, visit Webmaster Central.
- Check out a graphic illustrating the various phases of the search process, from before you search, to ranking, to serving results.
- Seeing 12,000 crawl errors staring back at you in Webmaster Tools can make eradicating them feel like an insurmountable task that will never be accomplished.
- The key is to know which errors are the most crippling to your site, and which ones are simply informational and can be brushed aside so you can deal with the real meaty problems.
- The reason it’s important to religiously keep an eye on your errors is the impact they have on your users and Google’s crawler.
- Thousands of 404 errors, especially for URLs that are indexed or linked to by other pages, create a poor experience for your users. If they land on multiple 404 pages in one session, their trust in your site decreases, which of course leads to frustration and bounces.
- You also don’t want to miss out on the link juice from other sites that point to a dead URL on your site; if you fix that crawl error and redirect the dead URL to a good one, you capture that link and help your rankings.
- Additionally, Google does have a set crawl budget allotted to your site, and if a lot of the robot’s time is spent crawling your error pages, it doesn’t have time to get to your deeper, more valuable pages that are actually working.
- This section usually lists pages that returned errors such as 403s; these are generally not the biggest problems in Webmaster Tools.
- For documentation listing all the HTTP status codes, check out Google’s own help pages. Also check out SEO Gadget’s amazing Server Headers 101 infographic on Six Revisions.
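- If you want to inspect the status code and server headers a URL returns before digging into those references, a small script like the following can help; the URL in the usage example is a placeholder.

```python
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

def show_headers(url):
    """Print the HTTP status code and response headers for a URL."""
    request = Request(url, method="HEAD")   # HEAD keeps it light; some servers only support GET
    try:
        with urlopen(request, timeout=10) as response:
            print(url, response.status)
            for name, value in response.getheaders():
                print(f"  {name}: {value}")
    except HTTPError as exc:                # 4xx / 5xx responses land here
        print(url, exc.code)
    except URLError as exc:                 # DNS failures, refused connections, etc.
        print(url, "unreachable:", exc.reason)

# show_headers("https://example.com/some-page")
```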
- Errors in sitemaps are often caused by old sitemaps that have since 404’d, or pages listed in the current sitemap that return a 404 error. Make sure that all the links in your sitemap are quality working links that you want Google to crawl.
- One frustrating thing Google does is continually recrawl old sitemaps that you have since deleted, to check that the sitemap and its URLs are in fact dead.
- If you have an old sitemap that you have removed from Webmaster Tools and you don’t want it to be crawled, make sure you let that sitemap 404 and that you are not redirecting it to your current sitemap.
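- A quick way to confirm that every URL in your current sitemap is a working link is to parse the sitemap and check each URL’s status code. A minimal sketch, assuming a standard sitemap.xml at a placeholder address:

```python
import xml.etree.ElementTree as ET
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def check_sitemap(sitemap_url):
    """Fetch a sitemap and report any listed URLs that do not return 200."""
    with urlopen(sitemap_url, timeout=10) as response:
        tree = ET.parse(response)
    for loc in tree.iter(f"{SITEMAP_NS}loc"):
        url = (loc.text or "").strip()
        try:
            with urlopen(Request(url, method="HEAD"), timeout=10) as page:
                status = page.status
        except HTTPError as exc:
            status = exc.code
        except URLError:
            status = "unreachable"
        if status != 200:
            print(f"{url} -> {status}")

# check_sitemap("https://example.com/sitemap.xml")
```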
- Not found errors are by and large 404 errors on your site. 404 errors can occur in a few ways (a quick way to check suspect URLs is sketched after this list):
- You delete a page on your site and do not 301 redirect it
- You change the name of a page on your site and don’t 301 redirect it
- You have a typo in an internal link on your site, which links to a page that doesn’t exist
- Someone else from another site links to you but has a typo in their link
- You migrate a site to a new domain and the subfolders do not match up exactly
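- One way to triage a list of suspect URLs is to request each one without following redirects, so you can see which return a 404 and which already 301 somewhere useful. A minimal sketch using only the standard library; the URLs in the usage example are placeholders.

```python
import http.client
from urllib.parse import urlsplit

def check_status(url):
    """Return the raw status code and Location header for a URL, without following redirects."""
    parts = urlsplit(url)
    conn_cls = http.client.HTTPSConnection if parts.scheme == "https" else http.client.HTTPConnection
    conn = conn_cls(parts.netloc, timeout=10)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    conn.request("HEAD", path)
    response = conn.getresponse()
    location = response.getheader("Location")
    conn.close()
    return response.status, location

# for url in ["https://example.com/old-page", "https://example.com/typo-link"]:
#     status, location = check_status(url)
#     if status == 404:
#         print(f"{url} is a 404 -- consider a 301 redirect if it has inbound links")
#     elif status in (301, 302):
#         print(f"{url} redirects ({status}) to {location}")
#     else:
#         print(f"{url} returned {status}")
```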
- There is an excellent Webmaster Central Blog post on how Google views 404 pages and handles them in Webmaster Tools. Everyone should read it, as it dispels the common “all 404s are bad and should be redirected” myth.
- These errors are more informational: they show that some of your URLs are being blocked by your robots.txt file. The first step is to check your robots.txt file and make sure you really do want to block the URLs listed.
- Sometimes there will be URLs listed in here that are not explicitly blocked by the robots.txt file.
- These should be looked at on an individual basis as some of them may have strange reasons for being in there.
- A good method to investigate is to run the questionable URLs through URI Valet and see what response code they return. Also check your .htaccess file to see if there is a rule redirecting the URL.
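- Alongside URI Valet, you can also check programmatically whether a given URL really is blocked by your robots.txt file. A small sketch using Python’s built-in robots.txt parser; the URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

def blocked_by_robots(robots_url, page_url, user_agent="Googlebot"):
    """Return True if the robots.txt at robots_url disallows user_agent from fetching page_url."""
    parser = RobotFileParser(robots_url)
    parser.read()                          # fetches and parses the robots.txt file
    return not parser.can_fetch(user_agent, page_url)

# if blocked_by_robots("https://example.com/robots.txt", "https://example.com/private/report"):
#     print("robots.txt blocks this URL -- confirm that is intentional")
# else:
#     print("robots.txt does not block this URL -- check .htaccess redirect rules instead")
```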
- Pages with very thin content, or pages that look like landing pages, may be categorized as soft 404s. This classification is not ideal: if you want a page to 404, make sure it returns a hard 404; and if one of your main content pages is listed as a soft 404, fix the page so it no longer gets this error.
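- A rough way to spot soft-404 candidates ahead of time is to flag pages that return a 200 status but contain very little visible text. This is only a heuristic sketch; the word-count threshold and the URL are assumptions, not anything Google publishes.

```python
import re
from urllib.request import urlopen
from urllib.error import HTTPError

def soft_404_candidate(url, min_words=150):
    """Flag pages that return 200 but have suspiciously little visible text."""
    try:
        with urlopen(url, timeout=10) as response:
            status = response.status
            html = response.read().decode("utf-8", errors="replace")
    except HTTPError as exc:
        return False, exc.code             # a real (hard) 404/410 is not a soft 404
    # Strip scripts, styles, and tags, then count the remaining words.
    text = re.sub(r"<script.*?</script>|<style.*?</style>|<[^>]+>", " ", html, flags=re.S)
    word_count = len(text.split())
    return status == 200 and word_count < min_words, status

# flagged, status = soft_404_candidate("https://example.com/thin-page")
# print("possible soft 404" if flagged else f"looks fine (status {status})")
```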
- If a page takes too long to load, Googlebot will eventually stop trying to fetch it. Check your server logs for issues and check the page load speed of the pages that are timing out.
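- To get a rough sense of which pages respond slowly, you can time requests against a hard cutoff. A simple sketch; the 10-second timeout is an arbitrary assumption, not Googlebot’s actual limit.

```python
import time
from urllib.request import urlopen

def time_page(url, timeout=10):
    """Report how long a page takes to respond, or that it timed out / failed."""
    start = time.monotonic()
    try:
        with urlopen(url, timeout=timeout) as response:
            response.read()
        print(f"{url} responded in {time.monotonic() - start:.2f}s")
    except OSError as exc:                 # URLError, socket timeouts, connection resets
        print(f"{url} failed after {time.monotonic() - start:.2f}s: {exc}")

# time_page("https://example.com/slow-report")
```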
- Unreachable errors can occur from internal server errors or DNS issues. A page can also be labeled as unreachable if the robots.txt file blocks the crawler from visiting it. Errors that fall under the unreachable heading include “No response”, “500 error”, and “DNS issue”.
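- For unreachable errors, it helps to separate DNS failures from server errors and dropped connections. A minimal sketch that resolves the hostname first and then fetches the page; the URL is a placeholder.

```python
import socket
from urllib.parse import urlsplit
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

def diagnose(url):
    """Classify a URL as a DNS issue, a server error, no response, or OK."""
    host = urlsplit(url).hostname
    try:
        socket.getaddrinfo(host, None)     # DNS lookup only
    except socket.gaierror:
        return "DNS issue"
    try:
        with urlopen(url, timeout=10) as response:
            return f"OK ({response.status})"
    except HTTPError as exc:
        return f"server error ({exc.code})" if exc.code >= 500 else f"HTTP {exc.code}"
    except URLError:
        return "no response"

# print(diagnose("https://example.com/"))
```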