Defusing The Duplicate Content Situation


Synopsis – The Google Panda/Farmer update has affected far more websites than those that can strictly be called content farms. A variety of nuances around the issue of what is and isn’t duplicate content — and who (if anyone) should be penalized for having it on their site — have arisen in the discussions and analyses following the implementation of the new algorithm. As sites begin to see the effects of the update, many are realizing that they are finally being pushed to deal with duplicate or near-duplicate content on their sites, whether that content is solely their own or material pulled from others.

If you run an ecommerce site, you may face the problem of duplicate content even when all of your content is wholly original. For example, you may sell products in several categories, which leaves your site with individual product pages that are substantially the same. This type of duplicate content has always been something diligent webmasters should work to eliminate, or at least identify for search engine robots so they know what they should and shouldn’t index.

In his article, “Defusing The Duplicate Content Situation,” Jaimie Sirovich revisits the basics of dealing with duplicate content and of indicating to search engines which version is the original. By identifying three basic situations that cover the majority of legitimate dealings with duplicate content, he provides a primer on canonicalization, exclusion for faceted navigation, and parameter exclusion.

The complete article follows …

Defusing The Duplicate Content Situation

“Last night somebody broke into my apartment and replaced everything with exact duplicates… When I pointed it out to my roommate, he said, ‘Do I know you?’” – comic Steven Wright.

Imagine that both the original and the exact duplicate of Steven Wright remained in the apartment. How would his roommate decide which was the original? It turns out that, faced with this, humans and robots are remarkably similar — albeit for entirely different reasons. Google and Microsoft — Goliaths in computer science — admit they cannot really attack the duplicate content problem alone. As Microsoft said: “Our first and foremost advice is the same as it always has been: 301 redirects and good site design should be the primary focus of webmasters, with canonical tags picking up the slack when technical limitations impede other solutions.”

One could argue that rel=”nofollow” is a tacit admission that they cannot comprehensively solve the problem of identifying paid links either. Still, Microsoft is correct. The best solution is to eliminate duplicate content in the first place. Only if it cannot be easily and permanently eliminated via the almighty 301 redirect should one resort to other solutions.
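As a rough sketch of what elimination looks like in practice, on an Apache server (one common setup; other servers have equivalent directives) a permanent redirect from a retired duplicate URL to the preferred one is a single line in .htaccess. The paths below are illustrative:

# .htaccess: permanently redirect the retired duplicate URL to the preferred one
Redirect 301 /old-duplicate-page.html http://example.com/preferred-page.html

So what might those other solutions be?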

1. MERGE THROUGH CANONICALIZATION

Most search marketers chant a “rel=canonical” mantra as if it will bring a new messiah, but the tag is really only suited to small-to-medium quantities of duplicate content. Imagine a friend telling you almost exactly the same story several days in a row, politely warning you each time that he is about to do so. Irritating? Yes, and it turns out it is no less irritating to robots.

“Hey, Google, see those tens of thousands of pages you’ve been spidering for hours, wasting megabytes of bandwidth? It’s really 500 pages with a bunch of small changes, each just different enough to make the duplication difficult to spot!”

Canonicalization is on-page and therefore cannot solve the duplication problems faced by large ecommerce sites in the context of faceted navigation. A good rule of thumb: if a caveman couldn’t count each set of offending pages to be canonicalized with simple tally marks, it’s probably a candidate for another method.

Note that Bing does not currently officially support the canonical [preferred] tag, but plans full support by Q1 2011. Also note that Google supports a cross-domain version of the canonical tag, as well as a new metatag to indicate the original source of syndicated news articles when only the domain or partial URL is known (otherwise, if you know the full URL, rel=canonical is preferred).
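For reference, the tag itself is a single link element placed in the head of every duplicate page, pointing at the preferred URL (the address below is illustrative); the cross-domain variant is the same element with an href on another domain:

<link rel="canonical" href="http://example.com/preferred-page.html" />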

2. ELIMINATE THROUGH PARAMETER HANDLING AND EXCLUSION

Although often ignored, this is at times the most effective and efficient method of dealing with duplicate content. In fact, search engines have been using heuristics to identify semantically useless parameters since the days of URL-embedded session handling. Yahoo! was first to let webmasters identify such parameters explicitly; Google followed, and the Bing-Yahoo! conglomerate has redundant capabilities for parameter exclusion. Google Webmaster Tools offers an easy-to-use interface for indicating how up to 15 parameters in a URL should be handled, although Google makes no guarantee that it will honor those preferences.

Parameter exclusion can be seen as more similar to robots.txt exclusion, since it is pattern-based rather than on-page. One could propose a non-proprietary solution using robots.txt syntax along these lines:

Disallow-parameter: tracking
Disallow-parameter: utm_campaign

No such standard exists today, however. Since parameter exclusion is supported in some form by every major search engine, it remains viable so long as equivalent rules are carefully set up at each one.
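In the meantime, one pattern-based measure the major engines do document is wildcard matching in robots.txt, which keeps parameterized URLs out of the crawl entirely (the parameter names below are illustrative):

User-agent: *
# Block any URL containing these semantically meaningless parameters
Disallow: /*tracking=
Disallow: /*utm_campaign=

Unlike true parameter handling, this blocks the affected URLs outright rather than collapsing them into their parameter-free equivalents, so it trades link equity for crawl efficiency.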

3. ELIMINATE THROUGH PAGE-EXCLUSION

This method is sometimes the only workable method, either because the content is similar but not duplicated (e.g., sorted product catalogs) or as a result of faceted navigation (e.g., thousands of combinations of unique but undesirable pages). It may also be the only option left as a result of unfortunate web application design decisions.

The two available methods of exclusion are not equivalent — robots.txt is pattern-based, while meta exclusion is on-page like canonicalization. Meta exclusion is almost always the wrong decision, because it shares all the downsides of exclusion and none of the benefits of canonicalization. Link equity is not preserved as in canonicalization. Furthermore, since it is on-page, a bot must crawl all of the offending pages to know what to exclude. Some search marketers believe that the “noindex,follow” variation of this tag presents a benefit, but this is not widely accepted. Meta exclusion is only suited to small-to-medium quantities of undesirable content for reasons similar to those regarding canonicalization.

With either method, the content is actually eliminated, not canonicalized. Therefore, they are best suited for content that is potentially similar and undesirable, rather than duplicate. If it is truly duplicate content, the canonical tag is the more correct solution.
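For reference, the two mechanisms look like this. Meta exclusion is a tag placed in the head of each offending page:

<meta name="robots" content="noindex">

Robots.txt exclusion, by contrast, is a site-wide pattern (the path is illustrative):

User-agent: *
Disallow: /undesirable-section/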

To clarify the three different solutions, let’s look at some examples.

1. Product in multiple categories

http://example.com/animals/poodle.html and http://example.com/dogs/poodle.html — In this case, the number of duplicate pages is only as large as the number of categories the product appears in, so a canonical tag is the best solution. A savvy reader might note that the same product under a different category URL is only similar — not a true duplicate page.
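Concretely, assuming the /dogs/ URL is chosen as the preferred version, the head of http://example.com/animals/poodle.html would carry:

<link rel="canonical" href="http://example.com/dogs/poodle.html" />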

2. Sorts and filters in faceted navigation

http://example.com/animals/?sort_by=price, etc. — Here the content cannot be canonicalized, as it is not actually duplicate content. It is, however, undesirable and should be excluded. With faceting, the number of facet-based page combinations on which a product appears becomes extremely large, and therefore all such URLs must be excluded.
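A pattern-based exclusion covering every such combination might look like this in robots.txt (wildcard syntax is supported by the major engines; the parameter name comes from the example URL above):

User-agent: *
# Keep sorted and filtered variants of catalog pages out of the index
Disallow: /*sort_by=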

3. Tracking and session parameters

The best choice is parameter exclusion, which works well because the offending parameter is semantically meaningless. Canonicalization cannot eliminate the potentially infinite amount of duplicate content presented by session parameters, and it would be suboptimal for the potentially large amount generated by tracking parameters.
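To make this concrete with hypothetical URLs, each of the following is the same underlying page, and excluding the parameters collapses them all back to http://example.com/animals/:

http://example.com/animals/?utm_campaign=spring
http://example.com/animals/?sessionid=a1b2c3
http://example.com/animals/?utm_campaign=spring&sessionid=a1b2c3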

Conclusion

Duplicate content creates a set of difficult problems, not only for web developers but for the search engines as well. Remember that there is only one best solution — eliminating the offending content in the first place. When that fails, the three options explored above are your best bets.

In January 2011, Google reiterated its focus on eliminating duplicate and otherwise meaningless content. It is incumbent upon us to clean house as well. Google and Bing have both given us many tools for indicating duplicate and undesirable content, but not all tools are created equal. Knowing which tool to use (and when) is essential to improving your site’s effective crawl priority and avoiding a false positive from any of the new algorithms aimed at weeding out such content.
