One of the most important things that a lot of WordPress bloggers overlook is the necessity of setting up a robots.txt file. WordPress, by its very nature, creates a substantial risk of duplicate content. If you aren't prepared to handle it appropriately, it can hurt your search engine rankings and cost you traffic.
When a search engine crawls your blog it looks at every link it can find and indexes the content so it knows what your blog is all about. This is a good thing, because it allows you to show up in that search engine’s results.
The problem is that while the search engine's spider is crawling your site, it's going to follow the links each post lists to its categories and tags. The spider can just as easily find your various archive pages. Add the home page of your blog and the actual post itself, and you could have one piece of content popping up a dozen times or more. The search engine then has to decide which version is the most relevant, and it may not always choose the one you want.
A good way to reduce duplicate content is to make sure your blog displays only excerpts everywhere possible (everywhere but the post page itself). But that doesn't completely solve the problem: the search engines will still be finding those same excerpts over and over again. So what do you do?
What is a Robots.txt File?
A robots.txt file is simply a list of "do's and don'ts" you provide for the search engines that crawl your site. By specifying these rules, you tell the bots which pages you don't want them to crawl. Why do this? Because you don't want the engines crawling multiple pages that all display the same duplicate content.
A robots.txt file is a plain text file (created with Notepad, Wordpad, or a similar program) that is placed in the root of your website (so it's reachable at www.example.com/robots.txt) for visiting spiders to look at.
A good robots.txt file helps prevent duplicate content penalties by telling Google (and other search engines) what they should and should not bother looking at. You can tell the search engines to ignore category archives or tag pages, for example. By eliminating the options the search engine has to crawl, you increase the likelihood that the only place your content is indexed is the actual post page itself (which is ideal).
Below you'll see an example of the content of a robots.txt file.
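A minimal file for a WordPress blog might look something like this. The disallowed paths are illustrative and assume the default WordPress permalink structure; adjust them to match the URLs your own blog actually generates:

```
# Rules that apply to all crawlers
User-agent: *
Disallow: /wp-admin/
Disallow: /category/
Disallow: /tag/
Disallow: /author/
```

Lines starting with # are comments and are ignored by the spiders.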
“User-agent” is where you specify which spiders should follow the rules you are about to lay out. In this case an asterisk means “everyone.” You could declare different rules for different spiders, but that’s usually only necessary on more elaborate websites.
The lines that begin with "Disallow:" specify paths you don't want crawled, so "Disallow: /category/" tells the spider not to crawl any of your category pages (note the leading slash; paths in robots.txt are relative to the root of your site). You can also set "Allow:" rules for specific items you do want crawled.
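"Allow:" is handy when you want to block a whole directory but keep one path inside it crawlable. The paths below are purely hypothetical examples; also note that major crawlers such as Googlebot honor "Allow:", but it wasn't part of the original robots.txt standard, so very old or obscure bots may ignore it:

```
User-agent: *
Disallow: /category/
Allow: /category/featured/
```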
Look for all the different routes that can be used to find a certain piece of content on your blog and disallow the extras in your robots.txt file.
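For instance, on a typical WordPress blog a single post (and its excerpt) might be reachable through all of these routes; the URLs here are hypothetical examples:

```
example.com/my-post/           the post page itself (keep this crawlable)
example.com/                   home page excerpt
example.com/category/news/     category archive
example.com/tag/wordpress/     tag archive
example.com/2010/05/           date-based archive
```

Everything except the post page itself is a candidate for a Disallow rule.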
Why Do Some Pages I’ve Disallowed Still Get Indexed?
One of the common misconceptions about robots.txt files is that they prevent Google and other search engines from indexing the pages you specify. That is not actually the case. Robots.txt lets you specify which pages should not be crawled; indexing is a separate process. A page that is never crawled can still rank for keywords and show up in search results. In a nutshell, if enough people link to a page that isn't crawlable, it can still rank for the keywords used in the text of those links. There are ways to prevent Google from indexing a page completely, but we won't go into all of that here. Google's Matt Cutts posted a great video on his blog that explains all of this in more detail.
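One such method is the robots meta tag. Placing this in a page's head section tells search engines not to index that page:

```
<meta name="robots" content="noindex">
```

Note that for this tag to work, the spider has to be allowed to crawl the page so it can actually see the tag, which is exactly why robots.txt alone can't prevent indexing.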
Bottom line: a robots.txt file is an easy-to-set-up yet incredibly useful tool for protecting yourself against some duplicate content issues. There are still ways for those pages to be indexed, but generally only if someone links to them. Since most WordPress users will use robots.txt to block crawling of things like category and tag pages, the risk of that happening is pretty minimal; most of your readers are much more likely to link directly to the post itself.
Hopefully this post helps you get a robots.txt file set up for your blog… post any feedback or questions in the comments!