What is Duplicate/Thin Content & Why Does it Matter?
Google began taking duplicate, scraped and thin content very seriously on February 24th, 2011, when they launched their first Panda algorithm update. According to their Content Guidelines, Google defines duplicate content as:
Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar. Mostly, this is not deceptive in origin. Examples of non-malicious duplicate content could include:
- Discussion forums that can generate both regular and stripped-down pages targeted at mobile devices
- Store items shown or linked via multiple distinct URLs
- Printer-only versions of web pages
According to their Affiliate Programs page, Google offers insight as to what they consider “thin content” within the context of affiliate websites, which can also be applied to eCommerce websites:
Google believes that pure, or “thin,” affiliate websites do not provide additional value for web users, especially if they are part of a program that distributes its content to several hundred affiliates. These sites generally appear to be cookie-cutter sites or templates with no original content. Because a search results page could return several of these sites, all with the same content, thin affiliates create a frustrating user experience.
Some examples of thin affiliates include:
- Pages with product affiliate links on which the product descriptions and reviews are copied directly from the original merchant without any original content or added value.
“Wait a second,” you might ask. “Didn’t you just create duplicate content by copying and pasting this text from Google’s own web pages?”
Not so fast. Let me explain. Duplicating portions of content is a natural part of the web. Whether a journalist is block-quoting text from another article (like I did above), or an eCommerce site is using the same product name as hundreds of other eCommerce websites, a small bit of duplicate content is inevitable.
What we should be worried about is having a large number of web pages on our websites that are mostly duplicate content, or product pages with such short product descriptions that the content can be deemed thin and thus of little value to either Google or the reader.
Our job as website publishers and content managers is to ensure that we are providing the most robust information possible to our readers. When we take this approach, we are rewarded by Google since this meets their quality guidelines.
But not all duplicate content is editorially created. There is a wide range of technical situations that can lead to duplicate content issues for which Google is likely to penalize your website. We’ll dive into many of these situations within this chapter so that you’re fully prepared to avoid duplicate content across your entire website.
Internal Duplicate Content (On-Site)
Duplicate content can exist internally on an eCommerce site in a plethora of ways, due to both technical and editorial causes. We’ll dive into some of the more common instances where internal duplicate content can rear its ugly head.
Internal “Technical” Duplicate & Low Quality Content
Canonical URLs help search engines understand that there is only a single version of a page’s URL that should be indexed, no matter what other URL versions are rendered in the browser, linked to from external websites, etc. Canonical URLs are extremely important in the case of tracking URLs, where tracking code (e.g., affiliate tracking or social media source tracking) is appended to the end of a URL on the site (e.g., ?a_aid=, ?utm_source). They are also very helpful in fine-tuning indexation of category page URLs on eCommerce websites in instances where sorting, functional and filtering parameters are added to the end of the base category URLs to produce different ordering of products on a category page (e.g., ?dir=asc, ?price=10-). Ensuring that the canonical URL (declared in the <head> of the source code) is the same as the base category URL helps prevent search engines from indexing these duplicate URLs.
| URL/Page Type | Visible URL | Canonical URL |
| --- | --- | --- |
| Base Category URL | http://www.domain.com/page-slug | http://www.domain.com/page-slug |
| Social Tracking URL | http://www.domain.com/page-slug?utm_source=twitter | http://www.domain.com/page-slug |
| Affiliate Tracking URL | http://www.domain.com/page-slug?a_aid=123456 | http://www.domain.com/page-slug |
| Sorted Category URL | http://www.domain.com/page-slug?dir=asc&order=price | http://www.domain.com/page-slug |
| Filtered Category URL | http://www.domain.com/page-slug?price=-10 | http://www.domain.com/page-slug |
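The mapping from visible URL to canonical URL shown above can be sketched in a few lines of Python. This is a minimal illustration, not a CMS implementation: the function names and the NON_CANONICAL_PARAMS set are hypothetical, and the parameter names are simply the ones used in this chapter's examples. Your own site will likely need a different list.

```python
import html
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: these are the only parameters that create duplicate URLs.
# They are taken from the examples in this chapter, not from any standard.
NON_CANONICAL_PARAMS = {"utm_source", "a_aid", "dir", "order", "price"}


def canonical_url(url: str) -> str:
    """Strip tracking, sorting and filtering parameters from a URL."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in NON_CANONICAL_PARAMS
    ]
    # Drop the fragment as well; it never reaches the server.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def canonical_link_tag(url: str) -> str:
    """Build the <link rel="canonical"> element for the page's <head>."""
    return f'<link rel="canonical" href="{html.escape(canonical_url(url), quote=True)}">'
```

Parameters that are not in the set (a hypothetical ?page=2, for instance) are left in place, so whether paginated or otherwise distinct URLs get their own canonical remains a deliberate choice rather than a side effect.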
It might also be beneficial to disallow crawling of the commonly used URL parameters via the /robots.txt file, in order to maximize crawl budget. Example:
User-agent: *
Disallow: *?dir=*
Disallow: *&order=*
Disallow: *?price=*
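Before deploying wildcard rules like the ones in the robots.txt example above, it can help to check which URLs they would actually catch. The sketch below assumes the wildcard semantics Google documents for robots.txt, where * matches any run of characters and a trailing $ anchors the end of the URL. The helper names are hypothetical, and this is a rough approximation rather than a full robots.txt parser (Python's built-in urllib.robotparser does plain prefix matching, so it is not used here).

```python
import re

# The three rules from the example above.
DISALLOW_RULES = ["*?dir=*", "*&order=*", "*?price=*"]


def rule_to_regex(rule: str) -> "re.Pattern[str]":
    """Translate one Disallow pattern into a regular expression."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape literal text and let each * match any run of characters.
    body = ".*".join(re.escape(piece) for piece in rule.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_disallowed(path_and_query: str, rules=DISALLOW_RULES) -> bool:
    """Return True if any rule matches the path plus query string."""
    return any(rule_to_regex(rule).match(path_and_query) for rule in rules)
```

Running the sorted category URL with its parameters in the opposite order (/page-slug?order=price&dir=asc) through is_disallowed returns False, because the rules tie each parameter to a specific ? or & position. If your CMS can emit parameters in any order, each parameter probably needs both a ? and an & variant.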
Shopping Cart Pages
When users add products to their cart on your eCommerce website and view their cart, most CMS systems implement URL structures that are specific to the shopping cart experience. They might have “cart,” “basket,” or some other word as the unique identifier within these shopping cart URLs. These are not the types of pages that search engines wish to index, so identifying them and then setting them to “noindex,nofollow” via a meta robots tag or X-robots tag will help prevent search engines from indexing this low-quality content. Crawling of these URLs can also be disallowed via the /robots.txt file, but keep in mind that a crawler blocked by robots.txt cannot read the noindex directive, so add the disallow rule only after any already-indexed cart URLs have dropped out of the index.
Internal Search Results
Internal search result pages are produced when someone conducts a search using an eCommerce website’s internal search feature. They have no unique content, only repurposed snippets of content from other pages on your eCommerce website. Google’s own Matt Cutts has clearly stated that they do not want to send users from their search results to your search results (source). Instead, they want to send users to true content pages (product pages, category pages, static site pages, blog posts and articles). This is an extremely common issue with eCommerce websites. Many CMS systems do not set internal search result pages to “noindex,follow” by default, so a developer will need to apply this rule in order to fix this problem. It’s also recommended to disallow search bots from crawling internal search result pages within the /robots.txt file. It’s an easy fix, yet an important one since it can lead to ranking penalties under Google’s Panda algorithm if there are too many internal search results in Google’s index.
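As a rough illustration of the rules described for shopping cart and internal search pages, the sketch below picks a robots directive from the URL path and returns it as an X-Robots-Tag response header. The function names are hypothetical, “cart” and “basket” come from the text above, and “search” is an assumed path segment. Check how your own CMS names these URLs before relying on anything like this.

```python
from typing import Dict, Optional
from urllib.parse import urlsplit

CART_SEGMENTS = {"cart", "basket"}  # identifiers mentioned in the text
SEARCH_SEGMENTS = {"search"}        # assumption: internal search lives under /search


def robots_directive(url: str) -> Optional[str]:
    """Return the robots directive for a URL, or None for normal pages."""
    segments = {part.lower() for part in urlsplit(url).path.split("/") if part}
    if segments & CART_SEGMENTS:
        return "noindex,nofollow"
    if segments & SEARCH_SEGMENTS:
        return "noindex,follow"
    return None


def robots_headers(url: str) -> Dict[str, str]:
    """Build the extra HTTP response headers for a URL."""
    directive = robots_directive(url)
    return {"X-Robots-Tag": directive} if directive else {}
```

The same directive string could just as well be written into a meta robots tag in the page template. The header form is shown only because it keeps the logic in one place.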
Duplicate URL Paths
How CMS systems handle URL structures where products are placed in multiple categories of a taxonomy can get tricky. For example, if a product is placed in both category A and category B, and if category directories are used within the URL structure of product pages, then the CMS could potentially create two different URLs for the same product.
As one can imagine, this can lead to devastating duplicate content problems for product pages, which are typically the highest converting pages on an eCommerce website. Common approaches to fix this are:
- Use root-level product page URLs (unfortunately this removes keyword-rich, category-level URL structure benefits and also limits trackability in Analytics software).
- Use /product/ URL directories for all products (which at least offers grouped trackability of all products in Analytics software).
- Use product URLs built upon category URL structures, but ensure that each product page URL has a single, designated canonical URL.
In some instances, this situation can also arise with sub-Category URLs where the products displayed might be exactly the same, or close to it. For example, a “Flashlights” sub-category might be placed under both /tools/flashlights/ and /emergency/flashlights/ on an Emergency Preparedness eCommerce website, and have mostly the same products. Taxonomy opinions aside, the same approach can be applied in these situations as with product pages. Also, ensuring that robust intro descriptions exist atop the category pages would help ensure that each similar sub-category page has unique content.
Product Review Pages
Many CMS systems come with built-in review functionality. Oftentimes, separate “review pages” are created to host all reviews for particular products, yet some (if not all) of the reviews are also placed on the product pages themselves. This can create duplicate content between the product pages and the corresponding product review pages. These “review pages” should either be canonicalized to the main product page or set to “noindex,follow” via a meta robots or X-robots tag. The canonicalization method is preferred because, if an external website links to a “review page,” the link equity will be passed to the product page.
It’s also critical to ensure that review content is not duplicated on external sites when using 3rd party product review venders. For a deep dive into this topic, please read Product Review Venders—Solutions to Fit Your eCommerce SEO Needs.
WWW vs. Non-WWW URLs & Uppercase vs. Lowercase URLs
Just as the Post Office would consider 123 Race Avenue and 123 Race Street different home addresses, search engines consider http://www.domain.com and http://domain.com different web addresses. Therefore, it’s critical that one version of URLs is chosen for every page on the eCommerce website. 301 redirecting the non-preferred version to the preferred version is the recommended solution to avoid these technically created duplicate URLs, per Google.
Tip: Google also allows webmasters to set up both the www and non-www version of domains within Webmaster Tools, and to set the preferred domain.
Uppercase and lowercase URLs need to be handled in the same manner. If both render separately, then search engines can consider them different. It’s important to choose one format and 301 Redirect one version to the other. We have a helpful article that offers instruction on how to do this: How to Redirect Uppercase URLs to Lowercase URLs Using Htaccess.
Trailing Slashes on URLs
Similar to www and non-www URLs, search engines consider URLs that render both with a trailing slash, and without, to be different URLs. As an example, duplicate URLs are created when URLs such as /page/ and /page/index.html, or /page and /page.html, render the same content. It is especially problematic when /page and /page/ show the same content since, technically speaking, these two pages aren’t even in the same directory. Common approaches to fixing this problem are to either canonicalize both to a single version or 301 redirect one version to the other.
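The www, letter-case and trailing-slash rules all come down to choosing one preferred form per URL and 301 redirecting everything else to it. Here is a minimal sketch of that decision, assuming the www host, lowercase paths and no trailing slash are the preferred forms. Those three choices and the function names are illustrative assumptions only, and lowercasing is only safe if no page on the site depends on a case-sensitive path.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit, urlunsplit


def preferred_url(url: str) -> str:
    """Return the single preferred form of a URL."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if not host.startswith("www."):
        host = "www." + host  # assumption: the www version is preferred
    # Lowercase the path and drop any trailing slash, keeping "/" for the root.
    path = parts.path.lower().rstrip("/") or "/"
    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))


def redirect_for(url: str) -> Optional[Tuple[int, str]]:
    """Return (301, target) if the URL is not already in preferred form."""
    target = preferred_url(url)
    return None if target == url else (301, target)
```

In practice this logic usually lives in the web server configuration rather than in application code. The point of the sketch is that all three rules are applied in one pass, so a request never has to follow a chain of redirects.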
HTTPS URLs: Relative vs. Absolute Path
HTTPS (secure) URLs are typically created after a user has logged into an eCommerce website. Most times, search engines have no way of finding these URLs. However, there are instances where this is possible, such as when a logged in Administrator is updating content and navigational links. In this scenario, it’s common for the Administrator not to realize that embedded URLs include HTTPS instead of HTTP in the URLs. When relative path URLs (excluding the “http://www.domain.com” portion) are also used on the site (either in content or navigational links), it makes it all too easy for search engines to quickly crawl hundreds, if not thousands of HTTPS URLs, which are technically duplicates of the HTTP versions. The most common solutions to fix this consist of using absolute path URLs (including the “http://www.domain.com” portion) coupled with ensuring that canonical URLs always use the HTTP version. Using 301 redirects in these cases could easily break the user-login functionality, as the HTTPS URLs would not be able to be rendered.
Internal “Editorial” Duplicate Content
Shared Content Between Products
It’s easy to take shortcuts with product descriptions on eCommerce websites, especially with similar products. However, consider that Google judges the content of eCommerce websites much as it judges regular content sites. That alone should be enough to make a professional SEO realize that product page descriptions should be unique, compelling and robust–especially for mid-tier eCommerce websites that don’t have enough Domain Authority to compete with bigger competitors. Every little bit counts. Sharing short paragraphs, specifications and other content between product pages increases the likelihood that search engines will lower their assessment of a product page’s content quality and, subsequently, its ranking position.
Category pages on eCommerce websites typically include only a title and a product grid, which means that there is no unique content on these pages. The common solution is to add unique descriptions at the top of category pages (not the bottom, where content is given less weight by search engines) that describe what types of products are featured within the category. There is no magic number of words or characters to use; however, the more robust the content is, the better the chance that the page will maximize traffic from organic search results (due to long-tail keyword traffic). A benchmark of 100-300 words is common. It’s important to understand the screen resolutions of your visitors and ensure that the intro description does not push the product grid below the fold on their browsers, since that could limit user discoverability of the product grid upon visiting the category page.
Tip: Intro descriptions on category pages offer a great opportunity to build deep links to related sub-category pages, related article content that may exist on the site, and popular products that deserve attention and link equity.
Home Page Duplicate Content
Every SEO should know that home pages typically have the most incoming link equity, and thus serve as highly rankable pages in search engines. What many SEOs forget is that a home page should be treated like any other page on an eCommerce website, content-wise. Always ensure that unique content fills the majority of home page body content, as a home page consisting merely of duplicated product blurbs gives search engines little contextual value with which to rank it as highly as possible for target keywords.
Tip: Online marketers also commonly use the homepage’s descriptive content in directory submissions and other business listings on external websites. Ensure that unique content is provided to these external websites instead. If this has already been done to a large extent, rewriting the home page descriptive content is the easiest way to fix the preexisting issue.
External Duplicate Content (Off-Site)
Duplicate content that exists between an eCommerce website and other eCommerce websites (and potentially even content websites) has become a real pain point in recent years. As Google clearly moves towards ranking websites more based on inbound link metrics (such as Domain Authority), websites with less inbound link equity are finding it extremely difficult to rank well in search engines when external duplicate content exists. Let’s dive into some of the most common forms of external (off-site) duplicate content that prevent eCommerce websites from ranking as well as they could in organic search.
Manufacturer Product Descriptions
When eCommerce websites copy product descriptions supplied by the product manufacturer and place them on their own product pages, they are put at an immediate disadvantage. In the search engines’ algorithmic analysis, these websites aren’t offering any unique value to users, so search engines choose instead to rank the big brand websites higher, since those sites may be using the same product descriptions but have more robust, higher-quality inbound link profiles. The only way to fix this is to embark upon the extensive task of rewriting existing product descriptions, in addition to ensuring that any new products are launched with completely unique descriptions. In our experience, we’ve seen lower-tier eCommerce websites increase organic search traffic by as much as 50-100% by simply rewriting product descriptions for half of the website’s product pages–with no manual link building efforts.
For eCommerce websites whose products are very time-sensitive, meaning they come in and out of stock as newer models are released, a better approach can be to simply ensure that new product pages are only launched with completely unique descriptions. This ensures that internal staff time is used most wisely, and the highest ROI is received from these efforts. Rewriting the description for a product, which is going to be removed from the website in the near future, typically provides less return on investment than ensuring new products have unique descriptions for their full lifespan on the website. These are important considerations to take into account when planning out a product rewrite project.
Other ways of filling product pages with unique content include multiple photos (preferably unique photos, if possible), enhanced descriptions that offer more detailed insight into product benefits, product demonstration videos (users love videos), schema markup (to enhance SERP listings) and user-generated reviews.
Staging, Development or Sandbox Websites
Time and time again, Development teams forget, give little consideration to, or simply don’t realize that testing sites can be discovered and indexed by search engines, oftentimes creating exact duplicates of a live eCommerce website. Luckily, these situations can be easily fixed through different approaches:
- Adding a “noindex,nofollow” meta robots or X-robots tag to every page on the test site.
- Blocking search engine crawlers from crawling the sites via a “Disallow: /” command in the /robots.txt file on the test site.
- Password-protecting the test site, to prevent search engines from crawling it.
- Setting up these test sites separately within Webmaster Tools and using the “Remove URLs” tool in Google Webmaster Tools, or the “Block URLs” tool in Bing Webmaster Tools, to quickly get the entire test site out of Google and Bing’s index.
When search engines already have a test website indexed, using a combination of these approaches can yield the best results. One approach is to add the “noindex,nofollow” meta robots or X-robots tag, remove the entire site from search engines’ indexes via Webmaster Tools, and then add a “Disallow: /” command in the /robots.txt file.
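Because a forgotten robots.txt file is one of the easier things to miss on a test site, a small automated check can be worthwhile. The sketch below uses Python's standard urllib.robotparser to confirm that a robots.txt body blocks every path for a given crawler. The function name and the sample paths are hypothetical, and the check says nothing about pages that are already indexed, which still need the noindex and removal steps described above.

```python
from urllib.robotparser import RobotFileParser

# The blanket rule recommended above for test sites.
STAGING_ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

# Assumption: a few representative paths are enough for a smoke test.
SAMPLE_PATHS = ("/", "/category/flashlights/", "/product/example")


def blocks_all_crawling(robots_txt: str, user_agent: str = "Googlebot") -> bool:
    """Return True if none of the sample paths may be fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not any(parser.can_fetch(user_agent, path) for path in SAMPLE_PATHS)
```

A check like this could run as part of a deployment script for the test environment, with the opposite assertion on the live site so that the blanket disallow never ships to production by accident.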
For good reason, eCommerce websites see value in extending their products onto 3rd party shopping websites in order to extend their potential sales reach. What many eCommerce website marketing managers don’t realize is that this is creating duplicate content across these external domains. Oftentimes, an eCommerce website’s own products on 3rd party websites will end up outranking its own product pages when products are fed onto 3rd party websites with more authoritative inbound link profiles.
Consider the popular scenario where a product manufacturer, with its own eCommerce website (to sell its own products direct to consumers), feeds its products to Amazon to greatly increase sales. This scenario is highly plausible for revenue reasons. From an SEO perspective, serious problems have just been created, as Amazon is one of the most authoritative websites in the world and the product pages on Amazon are almost guaranteed to outrank the product pages on the manufacturer’s eCommerce website. Some may view this as revenue displacement, but it is clearly going to put an in-house SEO’s job, or an SEO agency’s contract, in jeopardy when organic search traffic (and resulting revenue) plummets for the eCommerce website.
The solution to this problem is exactly what you would expect: ensure that product descriptions fed to 3rd party sites are different than what is placed on your eCommerce website. It’s recommended to give the manufacturer description to the 3rd party shopping websites, and write a more robust, unique description for your own eCommerce website. Always give your own website the edge when it comes to content. In cases where an eCommerce website is selling its own products, webmasters and marketers will need to decide whether to rewrite the 3rd party shopping feed description or the on-site description. Whichever is decided upon, just ensure that the most authoritative and robust description exists on-site.
Some eCommerce websites will also have blogs in order to provide more marketable content on their website, and some of them will even syndicate that content out to other websites (again, to extend their marketing reach). While this may seem like a great idea at first, it’s critical to realize that without proper SEO handling, this can also create external duplicate content. If the syndication partner is a more authoritative website (according to its inbound link profile), then it’s possible that the content on the syndication partner’s website will outrank (in search engines) the original content on the eCommerce website. There are a few different solutions to prevent this:
- Ensure that the syndication partner canonicalizes the content to the URL on the eCommerce site that it originated from. This is the best solution, as any inbound links to the content on the syndication partner’s website will be applied to the content on the eCommerce website (hint, hint: link building!).
- Ensure that the syndication partner applies a “noindex,follow” meta robots or X-robots tag to the syndicated content on their site.
- Don’t partake in content syndication, and focus on other channels of traffic growth and brand development.
Oftentimes, low-quality scraper sites can steal content from eCommerce websites in order to generate traffic and drive sales through ads. While search engines have gotten much better at identifying these spammy sites, and filtering them out of their search results, they can still pose a problem.
The best way to handle this is to file a DMCA complaint with Google, or Intellectual Property Infringement with Bing, in order to alert these two search engines to the problem, and ultimately get these sites removed from search results.
Caveat: The content must be your own. If you’re using manufacturer product descriptions, you might have difficulty in convincing the search engines that the scraper site is truly violating your copyright. This might be a little easier if the scraper site is displaying your entire web page on their site, with clear branding of your website.
Duplicate content is not the only thing to be concerned with when it comes to search engines viewing your website as a quality website. Thin content (a page with little or no content) is not only terrible for user experiences, but it can get your eCommerce website penalized if the problem grows above the unknown threshold of what Google deems acceptable. Here are some examples of scenarios where thin content could occur.
Thin/Empty Product Descriptions
For large eCommerce websites, it can be easy to take shortcuts on product descriptions. Taking this approach, however, can severely limit both organic search traffic and conversion potential. Search engines are attempting to rank the best content for their users, and users (typically) want clear explanations of products to help them with their purchasing decisions. When product pages only include one or two sentences, this helps no one. The solution is to ensure that product descriptions are as thorough and detailed as possible. Even when you think it might not be possible to write more (or much at all) about a product, there’s usually a way.
Tip: One way to expand product descriptions is to jot down 5-10 questions that a customer might ask about the product, write down the answers, and then work them into the product description.
Nearly every website has outlying pages that were published as test pages, forgotten about, and now orphaned on the site. Guess who is still finding them? That’s right, search engines. Sometimes these pages can be duplicates of others, sometimes they can have partially written content, and sometimes they can simply be empty. Ensure that all published and indexable content on your website is strong and provides value to a user who might view it.
Thin Category Pages
During the taxonomy development phase, content managers can sometimes get carried away with category creation. If a category is only going to contain a few products, or potentially none in the future, then don’t create it. Thinking in terms of the user, a category with only 1-3 products usually doesn’t provide the greatest browsing experience. Thinking in terms of the search engine–who thinks in terms of the user–too many of these thin category pages (coupled with other forms of duplicate and thin content) can lead a site to be penalized. The bottom line is to ensure that category pages are robust with both unique intro descriptions and sufficient product listings.
Thin content on category pages can also arise when drilling down into faceted category navigation until a page is reached with no products. These are called “stub pages,” and can lower search engines’ qualitative assessment of an eCommerce website when too many exist. A helpful solution is to apply a conditional “noindex,follow” meta robots or X-robots tag to these pages whenever common verbiage (e.g., “No products exist”) is used on the page by the CMS. For a deeper dive on this subject, we highly recommend reading this article, which offers nifty recommendations using AJAX navigation or a selective combination of meta robots tags and /robots.txt disallow commands to maximize crawl budget.
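The conditional tag for stub pages can be illustrated with a very small helper. This sketch keys off the number of products on the page rather than the “No products exist” wording, which is an assumption about what the CMS template has access to. The function name is hypothetical.

```python
def category_robots_meta(product_count: int) -> str:
    """Return the meta robots tag for a category or faceted page."""
    # Empty stub pages stay crawlable (follow) but are kept out of the index.
    content = "noindex,follow" if product_count == 0 else "index,follow"
    return f'<meta name="robots" content="{content}">'
```

Because the tag is computed on every request, a stub page becomes indexable again as soon as products are added to it, with no manual cleanup.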
Tools for Finding & Diagnosing Duplicate Content
Discovering duplicate content can be one of the most difficult and time-intensive tasks in a technical audit of an eCommerce website. This section will cover some quick tips on how to speed up the process of uncovering duplicate and thin content in order to “know what to fix.”
Google Webmaster Tools
Many duplicate content issues (and even thin content issues) can be discovered through Google Webmaster Tools, which is free to set up on your website. Bing does not offer anywhere near the same level of investigative tools for the use of duplicate content analysis, so this section will focus solely on Google Webmaster Tools. Here are some of the top ways to use Google Webmaster Tools for the purpose of identifying duplicate and thin content:
- HTML Improvements – In this section, Google will point out specific URLs that have duplicate title tags and duplicate meta descriptions. Look for patterns, such as “Duplicate title tags” and “Duplicate meta descriptions” caused by category pages with URL parameters, orphaned pages with “Missing title tags,” etc.
- Index Status – In this section, Google will show a historical graph of the number of pages from your eCommerce site in its index. If the graph spikes upward at any point in time, and there was no corresponding increase in content creation, it could be an indication that duplicate or low-quality URLs have somehow made their way into Google’s index en masse.
- URL Parameters – In this section, Google will tell you whether it’s having difficulty crawling and indexing your site. This section is nothing short of fantastic for identifying URL parameters (particularly for category pages) that could be leading to technically-created duplicate URLs. Use Google operators (we’ll get to this soon) to identify if Google has URLs from your eCommerce site with these parameters in its index, and determine whether it is duplicate/thin content or not.
- Crawl Errors – In this section, if your eCommerce website’s soft 404 errors have spiked, it could be an indication that many low-quality pages have been indexed due to improper 404 error pages being produced (pages that lack a 404 status code in the HTTP response header). Oftentimes these pages will all have an error message as the only body content, and sometimes they have different URLs, which can cause technical duplicate content.
Search Query Operators (site:, inurl:, etc.)
Using search query operators in Google is one of the most effective ways of identifying duplicate and thin content, especially after potential problems have been identified from Webmaster Tools. The following operators are particularly helpful:
site: – This operator will show most URLs from your site indexed by Google, but not necessarily all of them. This is a quick way to gauge whether Google has an excessive number of URLs indexed for your site when compared to the number of URLs included in your sitemap (which should accurately reflect the number of true content pages on your site, assuming that your sitemap is correctly populated with all of your true content URLs).
- Example – site:www.domain.com
inurl: – This operator is ideal to use in conjunction with the site: operator in order to discover if URLs with particular parameters are indexed by Google. As mentioned earlier, potentially harmful URL parameters (if they are creating duplicate content, and indexed by Google) can be identified in the URL Parameters section of Google Webmaster Tools. Use this operator to discover if Google has them indexed.
- Example – site:www.domain.com inurl:?price=
This operator can also be used in “negative” fashion to identify if non-www URLs are indexed by Google (assuming that the www version of URLs is preferred).
- Example – site:domain.com -inurl:www
intitle: – This operator will show URLs indexed by Google that have specific words in the title tag. This can be particularly helpful when attempting to identify duplicates of a particular page, such as a product page that may also have a “review page” indexed by Google. Wrap a multi-word product name in quotes so that the whole phrase must appear in the title.
- Example – site:www.domain.com intitle:"Maglite LED XL200"
Plagiarism, Crawler & Duplicate Content Tools
There are a number of very helpful 3rd party tools to help additionally identify duplicate and low-quality content that search engines could easily index. The following are some of the more popular tools to use for these purposes:
- Copyscape – This tool is particularly useful at identifying external “editorial” duplicate content. Copyscape can crawl a website’s sitemap and compare all URLs within it to the rest of Google’s index, looking for instances of plagiarism. For the specific needs of eCommerce websites, this is particularly helpful at identifying the worst-offending product pages when it comes to copied and pasted manufacturer product descriptions. Exporting the data as a CSV file, and sorting by risk score allows for quick prioritization of the pages with the most duplicate content. Try this tool at www.copyscape.com.
- Screaming Frog – This tool is very popular with advanced SEO professionals, as it crawls a website and helps to identify potential technical issues that could exist with duplicate content, improper redirects, error messages, etc. Exporting the crawl and segmenting the duplicate content issues can provide a lot of additional insight not provided by Google Webmaster Tools. Download this tool at http://www.screamingfrog.co.uk.
- Siteliner – This tool offers a quick way to identify pages on your eCommerce site with the most internal duplicate content. The percent of duplicate content returned by this tool crawling your website pages is determined by how much unique content exists on each particular page in comparison to the repeated elements of each web page (header, sidebar, footer, etc.). This tool is particularly helpful at finding thin content pages. Try this tool at www.siteliner.com.
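For a rough first pass before reaching for the tools above, a crawl export of URLs and body text can be screened with a few lines of Python. The sketch below flags thin pages by word count and near-duplicate pairs by comparing overlapping three-word sequences (a Jaccard similarity over word shingles). It does not reproduce how Copyscape, Screaming Frog or Siteliner work. The function names are hypothetical, and the 50-word and 0.8 thresholds are arbitrary assumptions to tune against your own content.

```python
import re
from itertools import combinations
from typing import Dict, List, Set, Tuple


def words(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def shingles(text: str, size: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of overlapping word sequences of the given size."""
    tokens = words(text)
    if len(tokens) < size:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)}


def similarity(a: str, b: str) -> float:
    """Jaccard similarity of two texts, from 0.0 (disjoint) to 1.0 (identical)."""
    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def thin_pages(pages: Dict[str, str], min_words: int = 50) -> List[str]:
    """Return URLs whose body text has fewer than min_words words."""
    return [url for url, text in pages.items() if len(words(text)) < min_words]


def near_duplicates(pages: Dict[str, str], threshold: float = 0.8) -> List[Tuple[str, str]]:
    """Return URL pairs whose texts are at least `threshold` similar."""
    return [
        (url_a, url_b)
        for (url_a, text_a), (url_b, text_b) in combinations(pages.items(), 2)
        if similarity(text_a, text_b) >= threshold
    ]
```

Comparing every pair of pages grows quadratically with the size of the site, so this approach suits a sample of a few hundred product pages rather than a full catalog. Strip shared header, sidebar and footer text before comparing, or every page will look similar to every other.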
Experience, Intuition & CMS Knowledge
While the various tools and technical tips recommended above are extremely helpful at identifying duplicate, thin, and low-quality content, nothing compares to years of experience in identifying, diagnosing and fixing duplicate content problems. As you work through identifying these specific issues on your website, you’ll be developing a wealth of knowledge that can be used and re-used in the future to continue cleaning up these issues and preventing them from recurring. There’s only one way to get to that point–get started!