+ Introduction – Duplicate Content Defined
+ How Google Penalizes Duplicate Content
+ Syndication: Duplicate Content Across Domains
+ Duplicate Content on the Same Domain
+ How to Find & Fix Duplicate Content
Duplicate content is bad. Using the same content on your website, in whole or in part, leads to a poor user experience and raises a red flag in Google’s search algorithm. In the old days of SEO, duplicate content was often used as a cheap trick to get more keywords and more content on your website, so Google evolved a system to weed out the spammers who violated best practices by doing this. Today, if you’re caught using duplicate content, your domain authority could suffer and your keyword rankings could drop.
Fortunately, Google is pretty fair about the issue. The company understands that the majority of duplicate content issues don’t come about as a malicious attempt to cheaply increase rank. In actuality, most instances of duplicate content are accidents or are overlooked by webmasters. Still, having too much repeated content on your site can be damaging, and it’s in your best interest to run a test to see if there is any duplication on your site.
Ever since I started getting my feet wet in SEO, this question has swirled around forums and blogs. Somewhere, someone out there perpetuated the idea that having the same content on page A of your website as page B of your website would cause your site to be penalized in search engine rankings. This idea began to percolate in the internet marketing community because a bunch of spammers realized that when they had a piece of content (i.e., an article) that was getting a lot of search traffic, they could fill up every page of their website with the same content in order to pull even more traffic from the search engines. Obviously, the same article blatantly duplicated across hundreds of pages within a single domain is a malicious attempt to gain search engine traffic without actually adding any value. Google caught on pretty quickly to this method and fixed its algorithms to detect duplicate content and display only one version of it in the search rankings. Websites that engaged in this blatant activity were de-indexed, and their owners cried a river across forums and blogs throughout the internet marketing community. Thus was born the fear of the “duplicate content penalty.”
However, in the vast majority of cases, duplicate content is non-malicious and simply a product of whichever CMS (content management system) the website happens to be running on. For example, WordPress (the industry-standard CMS) automatically creates “Category” and “tag” pages which list all blog posts within certain categories or tags. This creates multiple URLs within the domain that contain the same content.
1) Google may decide to let me off with a “warning” and simply choose not to index 99 of my 100 duplicate posts, but keep one of them indexed. NOTE: This doesn’t mean my website’s search rankings would be affected in any way.
2) Google may decide it’s such a blatant attempt at gaming the system that it completely de-indexes my entire website from all search results. This means that, even if you searched directly for “Example.com” Google would find no results.
So, one of those two scenarios is guaranteed to happen. Which one it is depends on how egregious Google determines your blunder to be. In Google’s own words:
Duplicate content on a site is not grounds for action on that site unless it appears that the intent of the duplicate content is to be deceptive and manipulate search engine results. If your site suffers from duplicate content issues, and you don’t follow the advice listed above, we do a good job of choosing a version of the content to show in our search results.
This type of non-malicious duplication is fairly common, especially since many CMSs don’t handle this well by default. So when people say that having this type of duplicate content can affect your site, it’s not because you’re likely to be penalized; it’s simply due to the way that web sites and search engines work.
Most search engines strive for a certain level of variety; they want to show you ten different results on a search results page, not ten different URLs that all have the same content. To this end, Google tries to filter out duplicate documents so that users experience less redundancy.
So, what happens when a search engine crawler detects duplicate content? (from https://searchengineland.com/search-illustrated-how-a-search-engine-determines-duplicate-content-13980)
Google is fairly open about its duplicate content policies. According to Google, if it encounters two different versions of the same web page, or content that is appreciably similar to onsite content elsewhere, it will choose one version on its own to treat as the “canonical” version and index that one. The example they give is this: imagine you have a standard web page and a printer-friendly version of that same web page, complete with identical content. Google would pick one of these pages to index and filter the other out of its results, and the one it picks may not be the one you would have chosen. This doesn’t imply anything about suffering a penalty, but it’s in your best interest to make sure Google is properly indexing and organizing your site.
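To make the idea of a duplicate cluster with one indexed representative more concrete, here is a minimal Python sketch. It is not how Google works internally: the exact-match fingerprint, the `pick_representative` rule (shortest URL wins), and the example.com URLs are all assumptions made for illustration.

```python
import hashlib
import re
from collections import defaultdict


def fingerprint(text: str) -> str:
    """Hash the text after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", text.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def cluster_duplicates(pages: dict[str, str]) -> list[list[str]]:
    """Group URLs whose body text is identical after normalization."""
    clusters = defaultdict(list)
    for url, text in pages.items():
        clusters[fingerprint(text)].append(url)
    return list(clusters.values())


def pick_representative(cluster: list[str]) -> str:
    """Hypothetical rule: prefer the shortest URL, then alphabetical order."""
    return min(cluster, key=lambda url: (len(url), url))


# Made-up pages: a standard article and its printer-friendly twin.
pages = {
    "https://example.com/article": "Duplicate content explained.",
    "https://example.com/article?print=1": "Duplicate   content explained.\n",
    "https://example.com/about": "About this site.",
}
for cluster in cluster_duplicates(pages):
    print(pick_representative(cluster), "represents", len(cluster), "URL(s)")
```

Running something like this over your own crawl shows which URLs would collapse into a single cluster, which is a useful prompt for deciding which version you would actually want indexed.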
The real trouble comes in when Google suspects your content of being maliciously or manipulatively duplicated. Basically, if Google thinks your duplicated content was an effort to fool their ranking algorithm, you’ll face punitive action. It’s in your best interest to clear up any errors well in advance to prevent such a fate for your site.
Sometimes, the same piece of content can appear word-for-word across different URLs. Some examples of this include:
All these examples result from content syndication. The Web is full of syndicated content. One press release can create duplicate content across thousands of unique domains. But search engines strive to deliver a good user experience to searchers, and delivering a results page consisting of the same pieces of content would not make very many people happy. So what is a search engine supposed to do? Somehow, it has to decide which location of the content is the most relevant to show the searcher. So how does it do that? Straight from the big G:
When encountering such duplicate content on different sites, we look at various signals to determine which site is the original one, which usually works very well. This also means that you shouldn’t be very concerned about seeing negative effects on your site’s presence on Google if you notice someone scraping your content.
Well, Google, I beg to differ. Unfortunately, I don’t think you’re very good at deciding which site is the originator of the content. Neither does Michael Gray, who laments in his blog post “When Google Gets Duplicate Content Wrong” that Google often attributes his original content to other sites to which he syndicates his content. According to Michael:
However the problem is with Google, their ranking algo IMHO places too much of a bias on domain trust and authority.
And I agree with Michael. For much of my internet marketing career, I have syndicated full articles to various article directories in order to expand the reach of my content while also using it as “SEO fuel” to get white hat backlinks to my websites. According to Google, as long as your syndicated versions contain a backlink to your original, this will help your case when Google decides which piece is the original. Here’s proof:
First, a video featuring Matt Cutts, a well-known blogger and former search engine algorithm engineer for Google:
The discussion on syndication starts at about 2:25. At 2:54 he says you can tell people that you’re the “master of the content” by including a link from the syndicated piece back to your original piece.
In cases when you are syndicating your content but also want to make sure your site is identified as the original source, it’s useful to ask your syndication partners to include a link back to your original content.
Syndicate carefully: If you syndicate your content on other sites, Google will always show the version we think is most appropriate for users in each given search, which may or may not be the version you’d prefer. However, it is helpful to ensure that each site on which your content is syndicated includes a link back to your original article. You can also ask those who use your syndicated material to use the noindex meta tag to prevent search engines from indexing their version of the content.
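Both pieces of advice in that quote, the link back to the original and the noindex meta tag, can be checked mechanically on a partner's page. The sketch below uses only Python's standard `html.parser`; the class name, function name, sample HTML, and example.com URL are hypothetical, and it assumes you already have the partner page's HTML as a string.

```python
from html.parser import HTMLParser


class SyndicationCheck(HTMLParser):
    """Collects link targets and robots meta directives from one HTML page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: set[str] = set()
        self.robots: set[str] = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            # Ignore a trailing slash so /post and /post/ compare equal.
            self.links.add(attributes["href"].rstrip("/"))
        elif tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.robots.update(part.strip().lower() for part in content.split(","))


def check_syndicated_copy(html: str, original_url: str) -> dict[str, bool]:
    """Report whether the copy links to the original and carries noindex."""
    parser = SyndicationCheck()
    parser.feed(html)
    return {
        "links_back": original_url.rstrip("/") in parser.links,
        "noindex": "noindex" in parser.robots,
    }


# Made-up partner page used only to exercise the check.
partner_page = """
<html><head><meta name="robots" content="noindex, follow"></head>
<body><p>Article text.</p>
<a href="https://example.com/original-article/">Originally published here</a>
</body></html>
"""
print(check_syndicated_copy(partner_page, "https://example.com/original-article"))
```

The check only compares exact link targets, so a partner linking to your homepage instead of the article itself would be reported as not linking back.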
Now, what I find interesting about this last quote from Google is that they actually admit the version they choose may not be the right one. In my experience, Google is very likely to pick the wrong one if the site that originated the content is relatively young or has a low PageRank. So this raises the next big issue:
In a past life, I syndicated tons of my articles to EzineArticles only to see Google credit them with higher search results for my content, even when I made fully sure that Google had indexed my content at its original location prior to submitting it to Ezine. Vanessa Fox, who previously worked at Google and built Webmaster Central, attempts to tackle this question in her blog post, “Ranking as the Original Source for the Content you Syndicate.”
Unfortunately, she concludes that, basically, there’s nothing you can do to guarantee that you rank as the original source. She suggests:
Create a different version of the content to syndicate than what you write for your own site. This method works best for things like product affiliate feeds. I don’t think it works as well for things like blog posts or other types of articles. Instead, you could do something like write a high level summary article for syndication and a blog post with details about that topic for your own site.
Rewriting a piece of content is not my definition of syndication. That’s just rewriting an article in different words and distributing it. Almost all information circulating on the web has already been posted elsewhere anyway; even this blog post is composed of a ton of information that I found elsewhere on the internet. So to me, writing a new article that says the same thing in different words and distributing that to syndication partners isn’t really syndication of the original article. It’s syndication of a different article. So we’re still left with the question of the results of syndicating the exact same content that already appears on your website: what are the effects of doing so? Can it harm my rankings in any way?
To me, this is the most important question surrounding duplicate content. Before I jump into that analysis, let’s consider an important foundational question.
The internet really operates on a simple economy of give-and-take. The two commodities that are exchanged are unique content and backlinks. Unique content is defined as content which Google does not identify as duplicate. There are various theories about where exactly Google draws the line when deciding whether content should be considered duplicate, but one figure I’ve heard tossed around a lot is 30%. Basically, according to the 30% theory, if Google identifies that more than 30% of a particular piece of content appears elsewhere on the internet, it’ll be categorized as duplicate. Now, I can’t attest to the accuracy of this figure, so take it for what it’s worth. There is also duplicate content-detection software, such as Copyscape, which is designed to help webmasters check whether their content has been stolen and duplicated across other domains. This is also a good tool for gauging whether your content is likely to be considered duplicate by Google. And that’s what really matters.
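If you want to put a rough number on how much of one piece appears in another, one common text-comparison technique is to break both texts into overlapping word sequences ("shingles") and count the share they have in common. The sketch below does that in Python. It is an illustration only: nobody outside Google knows how Google measures overlap, the 0.30 threshold is just the unverified figure mentioned above, and the function names and sample sentences are made up.

```python
import re


def shingles(text: str, size: int = 5) -> set[tuple[str, ...]]:
    """Return the set of overlapping word sequences of the given size."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def shared_fraction(candidate: str, other: str, size: int = 5) -> float:
    """Fraction of the candidate's shingles that also appear in the other text."""
    own = shingles(candidate, size)
    if not own:
        return 0.0
    return len(own & shingles(other, size)) / len(own)


# The unverified "30%" figure from the text, not a known Google value.
THRESHOLD = 0.30

original = "the quick brown fox jumps over the lazy dog near the river bank"
copy = "the quick brown fox jumps over the lazy dog in the quiet park"
ratio = shared_fraction(copy, original, size=3)
print(f"{ratio:.0%} shared, flagged: {ratio > THRESHOLD}")
```

Smaller shingle sizes make the comparison more forgiving of light rewording, while larger sizes only catch longer verbatim runs, so the size is as much a judgment call as the threshold.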
But I’ve gotten a bit off track, so let’s get back to the discussion of why you’d want to syndicate content. I mentioned the internet economy of backlinks and unique content. Unique content is desirable because it will be indexed by Google, giving that particular Website another instance of its “name in the hat” so to speak. Basically, the more content a website has indexed, the more chances it has of being returned in Google’s search results for relevant queries.
But what about backlinks? Backlinks are simply links from any other website to your own. Search engines consider it a “vote” when one website links to another. This vote is used to determine authority and relevance in Google’s search results. In fact, it’s thought that backlinks are the single most important factor in determining how your website should rank for a given query. There are a ton of factors that play into backlinks and how much their “vote” counts for, but I’ll get into that in a future blog post. For now, what you need to know is that backlinks are valuable because they improve your rankings in the search engines, and that means more traffic to your website.
OK, so now we’ve covered the basic commodities of the micro-economy of the Web. This is important because when you syndicate your content, assuming you have included a backlink in it linking back to your original source, you get a backlink from each and every website to which your content was syndicated. Awesome, right?
Maybe not. The first question is how highly Google values a backlink from a piece of content that is known to be duplicate content. Frankly, I don’t know. On the one hand, it’s easy to syndicate content to a bunch of auto-accept blogs if your sole goal is to get backlinks, and this says nothing about the quality of your content or how much the originator of the content should be rewarded. On the other hand, syndication can also be a great indicator of the quality of a particular piece of content. After all, why would it be syndicated so much if it weren’t really great?
In the end, Google probably has signals for how it answers these two questions, but the real answers are probably only known by the software engineers that coded the algorithm. Many folks try to boost the value of their syndicated content by engaging in content “spinning” which is perfectly legitimate as long as it’s not the garbage that’s often spouted out by automated software. I’ll go into more depth about content spinning in a later post. For now, we’re still trying to answer the question of whether syndicating content exactly as it appears on your own website is a good idea or a bad idea. After careful testing I’ve come to the following conclusion:
I know, I know. That’s not the answer you wanted. Allow me to explain.
I own over 50 domains, and I like to do a lot of testing across them. I spent a couple hours last night performing searches for my content that I had syndicated to various other blogs and directories. And what I found was both disappointing and encouraging.
The disappointing part was that, in many cases, my syndicated content outranked my own original content. Even when a site that outranked mine for my own content linked back to my site as the originator, it was as if Google completely ignored that backlink and still gave more credit to the other site. In some cases, my own site’s version of the content was nowhere to be found, having evidently fallen into Google’s duplicate URL cluster and been filtered out of the search results. This means that by syndicating my content, I had, in effect, gotten my own content de-indexed.
This is pretty much the worst possible scenario, but it happened. Sometimes, at least. And that’s the weird part; sometimes, my content was recognized as the original content and received the highest ranking. With other sites and pieces of content, it ranked second behind a high-authority site, usually EzineArticles. So I have to conclude the following:
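When you syndicate your content, it might be recognized as the original and rank highest, rank second behind a high-authority syndication partner, be outranked by the syndicated copies, or be filtered out of the search results entirely.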
Well, that pretty much covers all the bases, doesn’t it? These are all the results I observed when looking at my own sites and the results of syndicating articles that originated on those sites. Basically, I can conclude that Google just doesn’t always get it right. And, Google doesn’t like to do anything with any sort of consistency. The last thing they want is for us SEOs to completely figure out their algorithm, because once that happens, the integrity of their search results will be destroyed as folks manipulate them all to hell.
The encouraging part was when I discovered that the backlinks from the syndicated content definitely helped my sites’ rankings for my target keywords. So there is definitely at least some value in backlinks originating from content which Google has labeled as “duplicate.”
So, the question remains: Should I syndicate my content?
Let’s look at the benefits of doing so:
So, syndicating your content is risky. You can definitely get the best of both worlds if Google decides your site is the originator of the content, thereby rewarding your content with the top position in the search results and also getting all the juicy backlinks that play into your overall rankings for specific keywords. But if Google gets it wrong (and it does, quite often, contrary to what they might think), you risk having your content never rank for relevant search engine queries.
And this really worries me, because I’ve always held the opinion that there’s nothing someone else can do to harm the rankings of a particular website. After analyzing these results, I fear I’ve found a loophole in my own argument: if someone else visits my website, copies all my content, and syndicates it around the Web, it’s possible that the sites to which my content was syndicated will actually rank higher for it than my own site. Google tries to address this problem here as well as in the Matt Cutts video:
In most cases, a webmaster has no influence on third parties that scrape and redistribute content without the webmaster’s consent. We realize that this is not the fault of the affected webmaster, which in turn means that identical content showing up on several sites in itself is not inherently regarded as a violation of our webmaster guidelines. This simply leads to further processes with the intent of determining the original source of the content—something Google is quite good at, as in most cases the original content can be correctly identified, resulting in no negative effects for the site that originated the content.
Again, unfortunately I have to point out that in my own experience, repeatedly, I’ve seen my own content rank worse than the sites to which it was syndicated. So even though Google thinks it’s good at identifying the original source of the content, my data suggest otherwise. In time, we can only hope that Google improves this aspect of its algorithm; there’s certainly nothing more we can do as Webmasters. Instead, you just have to understand the benefits and drawbacks of syndication and decide whether you’re comfortable with taking on the risks of having Google wrongly identify ownership of your content.
Here are a couple tips to minimize the risk of Google getting it wrong (in theory):
What about taking Vanessa’s suggestion and re-writing your content before syndicating it?
This would definitely solve the problem of possibly getting your own content essentially de-indexed when Google wrongly attributes content ownership, but there are some major problems with it too:
The final word is that, unless you are really blatantly duplicating your content across tons of URLs within the same domain, there’s nothing to worry about. One of your URLs on which the duplicated content resides will be indexed and chosen as the “representative” of that URL cluster. When users perform search queries in the search engines, that particular piece of content will display as a result for relevant queries, and the other URLs in the dupe cluster will not. Simple as that.
However, the other side of the coin is duplicate content across different domains. And that’s a whole different monster. Ready to tackle it? Here we go.
Traditional duplicate content is the type of content that comes to mind intuitively when you hear the phrase. It is content identical to, or highly similar to, content that exists elsewhere on the web (usually on your own site). There are a handful of reasons a site would intentionally duplicate this content:
All of these situations are deceitful, sometimes to users and sometimes to Google, and for the most part, webmasters know to stay far away from these practices. If you engage in them, you probably deserve a penalty.
I call it “sneaky” duplicate content because of how easily it can sneak up on you. You have no intention of creating duplicate pages, but they can happen anyway. Usually, this is due to a technical hiccup or an unwitting reproduction; for example:
Unfortunately, most of these instances can arise naturally as you build and modify your website, unless you’ve specifically taken preventative action to stop it.
Your first reaction to this evaluation may be one of dismissal. You don’t copy your content from one page to another. You take meticulous care to make sure every page of your site is originally written, with no duplicated phrases or sections.
Unfortunately, there’s still a risk for you. What Google registers as “duplicate content” isn’t always what a user sees as duplicate content. A user browsing through your pages may never encounter a repeated phrase, but Google may crawl your site and find dozens of repetitions in your title tags, or you may have multiple non-canonicalized URLs hosting the same on-page content. Even if you feel confident that you haven’t directly influenced some form of duplicate content, it’s worth checking your site just to be sure.
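Multiple URLs hosting the same page usually differ only in small ways: http versus https, a www prefix, a trailing slash, or tracking parameters. A rough way to spot them is to normalize every crawled URL to one form and see which ones collide. The Python sketch below does this with the standard `urllib.parse` module. The normalization policy (force https, drop www, strip `index.html` and the parameters in `IGNORED_PARAMS`) is an assumption that will not suit every site, and the URLs are made up.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of parameters that do not change page content.
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid", "ref"}


def normalize(url: str) -> str:
    """Reduce a URL to one comparable form (an assumed policy, adjust per site)."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/") or "/"
    if path.endswith("/index.html"):
        path = path[: -len("index.html")].rstrip("/") or "/"
    # Keep only parameters that matter, in a stable order.
    query = urlencode(sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    ))
    return urlunsplit(("https", host, path, query, ""))


def group_variants(urls: list[str]) -> dict[str, list[str]]:
    """Map each normalized URL to the crawled variants that collapse into it."""
    groups = defaultdict(list)
    for url in urls:
        groups[normalize(url)].append(url)
    return {key: found for key, found in groups.items() if len(found) > 1}


crawled = [
    "http://www.example.com/blog/",
    "https://example.com/blog?utm_source=newsletter",
    "https://example.com/blog",
    "https://example.com/contact",
]
for canonical, variants in group_variants(crawled).items():
    print(canonical, "<-", variants)
```

Any group with more than one variant is a candidate for a redirect or a canonical tag pointing at the version you prefer.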
Fixing duplicate content is relatively easy. Finding it is the hard part. As I mentioned above, duplicate content can be tricky to detect: just because you don’t have any repeated content from a user experience perspective doesn’t mean you don’t have repeated content from a search algorithm’s perspective.
Your first step is a manual one: go through your site and see if there are any obvious repetitions of content. As an example, do you have an identical paragraph concluding each of your services pages? Rewrite it. Did you re-use a section of a past blog post in a new post? Make a distinction. Once you’ve completed this initial manual scan, there are two main tools you can use to find more, better-hidden instances of duplicated content.
Perform Your Own Search
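First, you can perform a search to see through Google’s eyes. Use the site: operator to restrict your search to your site only, and follow it with the intitle: operator to search for a specific phrase in page titles. It should look a little something like this, with example.com standing in for your own domain: site:example.com intitle:"your phrase here"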
This search will generate all the results on your given site that correlate to your chosen phrase. If you see multiple identical results, you know you have a duplicate content problem.
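If you have a lot of phrases to check, it can help to generate these queries rather than typing each one. Here is a small Python sketch that builds the query string and a search URL you can open in a browser. The function names, example.com, and the sample phrase are placeholders, and the sketch only builds strings; it does not fetch or parse any results.

```python
from urllib.parse import urlencode


def duplicate_check_query(domain: str, phrase: str) -> str:
    """Combine the site: and intitle: operators into one query string."""
    return f'site:{domain} intitle:"{phrase}"'


def search_url(query: str) -> str:
    """Build a Google search URL for manual use in a browser."""
    return "https://www.google.com/search?" + urlencode({"q": query})


query = duplicate_check_query("example.com", "duplicate content")
print(query)
print(search_url(query))
```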
Check Webmaster Tools
A simpler way to check for duplicate content is to use Google Webmaster Tools to crawl your site and report back on any errors. Once you’ve created and verified your Webmaster Tools account, head to the Search Appearance tab and click on “HTML Improvements.” Here, you’ll be able to see and download a list of duplicate meta descriptions and title tags. These are common and easily fixable issues that just require a bit of time to rewrite.
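You can approximate the same kind of report yourself if you have the HTML of your pages on hand. The Python sketch below extracts each page's title tag and meta description with the standard `html.parser` module and lists any that are shared by more than one URL. The class and function names and the sample pages are hypothetical, and it assumes you have already collected the HTML into a dictionary keyed by URL.

```python
from collections import defaultdict
from html.parser import HTMLParser


class HeadExtractor(HTMLParser):
    """Pulls the <title> text and the meta description out of one page."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attributes.get("name") or "").lower() == "description":
            self.description = (attributes.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def find_duplicates(pages: dict[str, str]) -> dict[str, dict[str, list[str]]]:
    """Return titles and descriptions that are shared by more than one URL."""
    by_title, by_description = defaultdict(list), defaultdict(list)
    for url, html in pages.items():
        extractor = HeadExtractor()
        extractor.feed(html)
        if extractor.title.strip():
            by_title[extractor.title.strip().lower()].append(url)
        if extractor.description:
            by_description[extractor.description.lower()].append(url)
    return {
        "titles": {t: u for t, u in by_title.items() if len(u) > 1},
        "descriptions": {d: u for d, u in by_description.items() if len(u) > 1},
    }


# Made-up pages: two service pages accidentally share a title.
site_pages = {
    "/services/seo": "<html><head><title>Our Services</title>"
                     "<meta name='description' content='We help you rank.'></head></html>",
    "/services/ppc": "<html><head><title>Our Services</title>"
                     "<meta name='description' content='Paid search management.'></head></html>",
    "/about": "<html><head><title>About Us</title></head></html>",
}
print(find_duplicates(site_pages))
```

The comparison is case-insensitive but otherwise exact, so near-identical titles that differ by a word or two will not be flagged.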
To determine whether a sample of duplicate content is going to pull down your rankings, first you have to determine why you are going to publish such content in the first place.
It all boils down to your purpose.
If your goal is to try to punk the system by using a piece of content that has been published elsewhere, you’re bound to get penalized. The purpose is clearly deceptive and intended to manipulate search results.
This is what Google has to say about this sort of behavior:
Duplicate content on a site is not grounds for action on that site unless it appears that the intent of the duplicate content is to be deceptive and manipulate search engine results.
For 5 cents per search, you can have Copyscape vet an entire piece for you. But if your budget won’t allow that kind of expenditure, you can still use Copyscape for free. The catch with free Copyscape is that you’ll have to publish the content online first to retrieve its URL.
Copy and paste the URL of your newly published content in Copyscape’s search box. What Copyscape does is scan the entire interwebs for any copies of the content you’ve just published.
Copyscape is a reliable tool that many publishers depend on heavily to check for quality and originality. There are other tools similar to Copyscape that you can use for the same purpose, such as Plagiarism Detect.
Checking for duplicate content is fairly easy and simple. It’s an indispensable SEO task for beginners, but no one should take it for granted. With the right set of tools, you can comfortably ensure that your content is unique well before you publish it online.
And by providing your readers with high-quality and unique content, you will have furnished great value.
Once you’ve identified the critical areas of duplication on your site, you can start taking action to correct them. The sooner you take corrective action, the sooner you’ll start rebounding from the negative effects. Fortunately, Google also makes it easy for you to find and correct duplicate content on your site. When you log into Google Webmaster Tools, head to “Search Appearance,” and then “HTML Improvements.” This will allow you to generate a list of any pages that Google detects as being duplicated. Once you have this list, you can begin eliminating the duplicate errors one by one with any of the following methods:
Let’s do a brief recap. “Duplicate content” can refer to plagiarized material or to content copied for the purposes of site inflation but, more importantly for the average user, it also covers pages that Google indexes twice. These duplicate forms of content are easy to track down with Google Webmaster Tools and fix with canonicalization adjustments or redirects, but if they go unnoticed, they can cumulatively bring your rankings down. Be proactive and scout for duplicate content at least once every few months. Unless your site management process is flawless, it’s probable that duplicate content will surface when you least expect it.
In the end, it all comes down to testing on a massive scale, getting solid data and making decisions based on that data. So here’s what I’m going to do. I’m going to run a huge test and then update this post with my results. At the beginning of the post I mentioned that I am soon launching a massive Website with tons of unique content. I’m going to syndicate it all, completely unedited, as far and wide as I possibly can. As I do so, I’ll monitor traffic sources to see what keywords people are using to find my content. Then, I’ll replicate those keyword queries in Google and see where my site ranks in the search results. This should be the definitive test for the merits of syndication.
Thanks for sticking with me through this crazy post!