How to Avoid Duplicate Content


We humans often find it frustrating to listen to people repeat themselves. Likewise, search engines are frustrated by websites that do the same. This problem is called duplicate content, defined as web content that is exactly duplicated or substantially similar, yet located at different URLs.

Indexing such content wastes a search engine's storage, processing capacity, and computation time. As a result, if a website contains an excess of duplicate content, at least some of its pages may be penalized. And while the jury is still out in the SEO community on whether the various search engines apply an explicit penalty, everyone agrees that duplicate content can be detrimental to website rankings.

It is not necessary to eliminate all duplicate content to make a website search-engine-friendly, but it is desirable to eliminate as much of it as possible. There are many sources of duplicate content, but some are common to almost any website and thus, merit discussion.

Considering the ways in which duplicate content winds up on a website, it's clear that some are an unavoidable by-product of providing a richer online experience. One of the most common in this category is the "printer friendly" page. Many websites provide two versions of every web page: one for the screen and one for printing. It is not incorrect to do so (unless you are a CSS zealot), but all printer-friendly URLs should be excluded from the view of search engine spiders so that they do not appear to be duplicate content. So, let's examine ways to exclude URLs from the view of search engine spiders.

Exclusion of URLs

To effect exclusion, the web designer has two excellent tools at his disposal: robots.txt and meta tag exclusion. The preferable method is robots.txt, as it provides a centralized list of all URLs that should not be spidered and does not require that a spider download a page just to be told that it should not be indexed in the first place. Requiring that download gives the spider more to do, wastes bandwidth, and slows it down.

Using Robots.txt

Using robots.txt is quite simple. A website can have only one robots.txt file: a plain text file located in the root directory of the site, so it is always fetched from, for example, http://www.example.com/robots.txt. Check whether the file already exists, and create it if necessary.

The following robots.txt file disallows all spiders to access any URLs on a website:

# Forbid all spiders from browsing your site

User-agent: *
Disallow: /

Let’s examine how this file works. The line “User-agent: *” indicates that all of the “Disallow:” rules listed next apply to all spiders, denoted by the “*.”  The “Disallow: /” says that any URL starting with “/” is excluded. Since every URL listed in robots.txt must start from the root directory of the site “/,” every file is therefore excluded.
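
Conversely, a robots.txt file that excludes nothing, and therefore lets every spider crawl the entire site, simply leaves the "Disallow:" value empty:

# Allow all spiders to browse your site

User-agent: *
Disallow: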

Note: There is no "Allow:" directive to complement "Disallow:" in the original robots.txt standard. Certain search engines (Google and Yahoo! included) permit its use, but their interpretations of "proper" use may vary, and it is not universally supported.
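
As an illustration only, here is a hypothetical Google-specific group using this non-standard extension; "/seo/overview.html" is an invented page being carved out of an otherwise excluded directory, and under Google's interpretation the more specific "Allow:" rule wins:

User-agent: googlebot
Disallow: /seo/
Allow: /seo/overview.html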

Of course, the exclusion of “/” would not be terribly productive since, as SEO professionals, our goal is to get a site indexed – not entirely delisted! To eliminate only duplicate content, gather all URLs that are duplicates and include them in the robots.txt file.  Suppose that we wish to exclude the following files:

http://www.example.com/search-engine-optimization-is-fun.html

http://www.example.com/seo/ (directory)

To do so, we would use the following robots.txt file:

User-agent: *
Disallow: /search-engine-optimization-is-fun.html
Disallow: /seo/

Notice that the domain name is not a part of the "Disallow:" clause, and all lines begin with a forward slash, "/". Also notice that the exclusion for the directory "seo" ends in a "/". Otherwise, a hypothetical file, "seo.html", would be excluded as well, since "seo.html" starts with "seo", and any URL that starts with the string listed in a "Disallow:" clause is considered a match. The ending "/" prevents such a match.
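
To summarize this matching behavior, compare the two alternative rules below ("/seo-tips.html" is another hypothetical file, used only for illustration):

# Matches /seo/, /seo.html, and /seo-tips.html
Disallow: /seo

# Matches only URLs inside the /seo/ directory
Disallow: /seo/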

To disallow content for a specific spider only, add a "User-agent:" line naming that spider, such as "googlebot," and then list the exclusion rules specific to it.

For example, to exclude the URL http://www.example.com/no-googles-allowed.html for Google in addition to the exclusions for all search spiders, we use the following robots.txt file:

User-agent: *
Disallow: /search-engine-optimization-is-fun.html
Disallow: /seo/
User-agent: googlebot
Disallow: /search-engine-optimization-is-fun.html
Disallow: /seo/
Disallow: /no-googles-allowed.html

Please note that the rules for any specific user agent completely override the "User-agent: *" rules. Therefore, any rule under "User-agent: *" that should also apply to googlebot must be repeated under "User-agent: googlebot." Had we not repeated them, Google would have been free to spider and index the directory "/seo/" and the file "/search-engine-optimization-is-fun.html."

Below is a chart of the four most popular search engines and the names of their spiders:

Google - Googlebot
Yahoo! - Slurp
MSN Search - Msnbot
Ask - Teoma
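
These are the names that belong on the "User-agent:" line. For example, to keep only Yahoo!'s spider out of a hypothetical "/beta/" directory while leaving the other engines unrestricted:

User-agent: Slurp
Disallow: /beta/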

So that is basically how robots.txt works with regard to URL exclusions. What about the other method?

Using Meta Tag Exclusion

The second method, meta tag exclusion, involves placing the following HTML tag within the “<head>…</head>” section of your document:

<meta name="robots" content="noindex, nofollow" />

To exclude a specific spider, change “robots” to the name of the spider – for example “googlebot,” “msnbot,” or “slurp.” Use multiple meta tags to exclude multiple spiders; for example, to exclude “googlebot” and “msnbot:”

<meta name="googlebot" content="noindex, nofollow" />

<meta name="msnbot" content="noindex, nofollow" />

Although meta tag exclusion is equivalent to robots.txt in most ways, it is not ideal, because it requires that a spider actually fetch the page in order to learn that the page should not be indexed. It should be used whenever robots.txt will not work effectively. For example, consider the following URLs of our "printer friendly" pages:

http://www.example.com/products.php?product_id=1&print=1

http://www.example.com/products.php?product_id=99&print=1

http://www.example.com/products.php?product_id=1000&print=1

Here, robots.txt would be unwieldy, since it would require one thousand "Disallow:" entries to exclude all of the printer-friendly pages. To accomplish the same result with meta tag exclusion, one would simply add the above meta tag to the pages, or, more likely, to the program logic that generates them.
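
For instance, assuming the hypothetical products.php above is a PHP template, the tag could be emitted only when the "print" parameter is present. A minimal sketch:

<?php
// Sketch: hide the printer-friendly view (?print=1) from spiders.
// "products.php" and the "print" parameter come from the example URLs above.
$isPrintView = isset($_GET['print']) && $_GET['print'] == '1';
?>
<html>
<head>
<title>Product Details</title>
<?php if ($isPrintView): ?>
<meta name="robots" content="noindex, nofollow" />
<?php endif; ?>
</head>
<body>
<!-- page content -->
</body>
</html>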

Meta tag exclusion can only be used to exclude HTML documents. One must use robots.txt for any non-HTML files that need to be excluded, regardless of difficulty or effectiveness.
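
For example, if the printer-friendly versions were instead generated as PDF files under a hypothetical "/print/" directory, only robots.txt could exclude them:

User-agent: *
Disallow: /print/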

Incidentally, if the URLs had been constructed the opposite way – as follows:

http://www.example.com/products.php?print=1&product_id=49

then exclusion would have been easily accomplished via robots.txt:

User-agent: *
Disallow: /products.php?print=1

This example highlights the importance of thinking about exclusion tactics when constructing new URLs and parameter strings.

With URL exclusion taking care of this type of duplicate content, let's move on to subtler, but just as insidious, instances where duplicate content can be problematic. One of these is what is termed a "canonicalization issue." Thankfully, this class of problems has relatively easy solutions.

Canonicalization Issues

This class of issues, which can lead to duplicate content problems, stems from a concept called "canonicalization." Briefly, canonicalization, as it pertains to duplicate content, is the process of helping search engines choose the best URL when the same content appears at several trivially different URLs.

1.  www.example.com vs. example.com

A problem arises when both www.example.com and example.com point to the same website IP address. A search engine considers URLs on these two hostnames to be different pages, so "example.com/some-page.html" is a different page than "www.example.com/some-page.html." Strictly speaking, that is duplicate content: the same pages exist on two different hostnames. To rectify this, we use a few very simple mod_rewrite rules.

First, if it does not already exist, we must create a file named “.htaccess” in the root directory of our website. The first character, “.” indicates it is a hidden file and must be included in the file name. Add the following lines at the very beginning of the file:

# Permanently redirect all requests for example.com to www.example.com
RewriteEngine On
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]

These mod_rewrite rules state that if a request comes in for a page on example.com, it should be permanently redirected, via a 301 redirect, to the same page on www.example.com, with all other parts of the URL remaining the same.

Google Sitemaps, now renamed "Webmaster Central," has recently added a way to resolve this issue on Google's end. The solution is accessed under Tools > Preferred Domain. Since not all search engines offer this solution, the permanent redirection method should be employed regardless.

2. “index.php” versus “/”

Another problem can arise when a webmaster links to two different URLs that reference the home or default page of a particular directory. A web server allows a user-agent to reference it as either "index.php" or simply "/". To avoid this, be consistent in how you reference the file. Ideally, since users typically link to a home page without the "index.php," that is the URL that should be chosen. In addition, if the old index.php URLs are referenced externally, they should be redirected to "/". This is done with the following mod_rewrite rules:

# Permanently redirect requests for index.php or index.html to the directory URL
RewriteEngine On
RewriteCond %{THE_REQUEST} ^GET\ /.*index\.(php|html)\ HTTP
RewriteRule ^(.*)index\.(php|html)$ /$1 [R=301,L]

Following these steps will eliminate a substantial amount of duplicate content resulting from architecture problems. Of course, if your actual website copy is not original in the first place, that is a more fundamental problem beyond the scope of this article. The bottom line is that, like humans, search engines like to learn from interesting resources with original information. If your site contains nothing new, or repeats itself ad nauseam, they may just skip right over you and look elsewhere for sites to index.
