An image showing a toy robot

Solving SERP-Outranking of Stolen Content

The Problem

Over the years I had to deal with several websites suffering from the following SEO problem: old domains with a latent backlink issue which becomes painfully visible out of the blue. Hacked domains mirror the content of well-ranking documents and outrank the originals, or even swap places with them in the SERPs. This results in a steep drop in traffic for the affected documents. Unfortunately, the plundered domain itself is partly to blame for this issue, usually because of low-quality backlinks or other domain-related problems. The only good thing about the situation is that it is a “wild SEO experiment” showing that the content is of good quality and generally working as intended to boost SEO.

After dealing with this issue for some time, I one day found a domain where the attacker had uploaded PHP code to a Windows server not running PHP at all. Thanks to this mistake I had the opportunity to download and analyze the source code of such a malicious content mirror. Later, some of the hacked domains also shared code found on their servers. It is a simple but effective hack: the attacker used known loopholes in outdated software libraries to automatically upload an obfuscated PHP script which mirrors stolen content. I learned a few interesting things after decoding it:

  1. The whole software was copied and pasted from Chinese PHP forums (including bugs and comments), pointing to Chinese script-kiddies who built it to do the “task” with as little effort as possible.
  2. Its function was to steal content from websites ranking for a list of keywords in the Google SERPs; the keywords were configured in a variable.
  3. Early versions didn’t even have any form of caching and performed a “live content theft” whenever a search-engine crawler visited the affected URL.
  4. After Google’s efforts around “page speed”, the hackers implemented a slightly more advanced version with caching to speed up content delivery to search-engine bots.
  5. Every visitor that was not a spider was redirected by the script to a shop offering fake brand products like sunglasses or handbags.
  6. The scraped keywords (and so the content) had nothing to do with the offered products, so the conversion rate of this attack was probably really, really bad.
  7. Internal links, invisible to ordinary visitors, were injected into the content of the hacked domain so the parasitic keyword documents would be found by the search engines and rank better.
  8. The important lesson was: the content was stolen in a technically simple, automated way without any advanced form of obfuscation beyond user-agent spoofing. If the request from the spam script is identified before it gets the content, the whole chain can be interrupted.

The Conventional Solution

The conventional solution to this issue, and to getting the positions in the SERPs back, is a combination of activities: they are partly directed against the hacked spam domain, but you also need to tackle the issues on the original domain that make it vulnerable to this type of attack:

The Spam-Domain

In most cases these domains are hacked and their owners, often small businesses, are unaware of the issue. Filing Google spam reports did not solve a single case of this problem I ever had to deal with in my career: I never heard back, and the spam domains outrank the originals until they might get kicked months later with a major index update. When you manage to contact the owner and make them aware of the issue, 90% of the domains remove the content within days. It often helps to build up some pressure by pointing out the copyright issues involved. But as the internal links to the web spam are also removed, search-engine spiders stop crawling the URLs for months and the parasitic documents stay in the index for a long time, continuing to cause issues for the originals. Educating the owners on how to remove these documents with instruments like the GSC (Google Search Console) is often fruitless because they are technically completely overwhelmed by the issue.

The Victim-Domain

The victim domain with the affected content and positions is vulnerable in the first place due to issues with its backlink profile. Fixing this is only easy if it is caused by leftover bad links from historic link buying. Combining link sources from many tool vendors with GSC data to find bad links is expensive, labour-intensive and error-prone, and it often does not deliver the desired results: either too many positive links are removed together with the bad ones, or too many bad links are left. Cleaning link profiles often does not work as intended and can be the road to SEO hell.

Additionally, it can help to build up a positive link profile for the affected documents, but link building is neither an easy nor a fast SEO task.

The New Method

Naturally, the other methods still have to be applied in a case of content outranking. A problematic link profile in particular is a severe SEO issue in its own right and needs to be cleaned carefully. But the best way to tackle content outranking by spam domains is to prevent them from getting access to the content in the first place.

How can this be done, given that the content is accessible to everybody with a web browser? Search engines also require website owners to serve users and crawlers the same content, even though the spammers/hackers prove how limited the search engines’ ability to detect malicious behavior really is. What is needed is a way to decide, on every request to a website, whether the access comes from a real user with a browser, from a valid search-engine bot, or from a possibly malicious bot. Luckily there is an established method to test whether a crawler access comes from a valid search-engine bot: the “forward and reverse DNS lookup”. Such a test can be added as a pre-filter with only little overhead on the server side or directly in a CMS. The hard part is distinguishing between bots in general and humans. If this is solved, one can deliver the content only to humans and valid crawlers. The rest of the “visitors” still get a working website, but without the valuable content.
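The “forward and reverse DNS lookup” fits in a few lines. A minimal Python sketch for verifying Googlebot (hostname suffixes as published by Google) — not the production pre-filter, and without the caching a real deployment would need:

```python
import socket

def is_valid_search_engine_bot(ip: str,
                               allowed_suffixes=(".googlebot.com", ".google.com")) -> bool:
    """Forward and reverse DNS lookup: verify that a crawler IP really
    belongs to the search engine it claims to represent."""
    try:
        # Step 1, reverse lookup: the IP must resolve to a crawler hostname.
        hostname, _, _ = socket.gethostbyaddr(ip)
        if not hostname.endswith(allowed_suffixes):
            return False
        # Step 2, forward lookup: the hostname must resolve back to the
        # same IP, otherwise the PTR record could simply be forged.
        return ip in socket.gethostbyname_ex(hostname)[2]
    except OSError:
        # No PTR record, lookup failure or timeout: treat as not verified.
        return False
```

Because the check only trusts a hostname that also resolves back to the requesting IP, a bot cannot pass by forging its PTR record alone.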

This recurring problem inspired me to take a deeper look into which features requests from users and search engines share and how they differ. I started to track raw HTTP headers on a test domain and noticed huge differences which, with some research and training, were easy to spot as a human. As the HTTP protocol is text based and the problem was solvable by a human, it turned out to be really similar to spam classification of e-mails. After some tests I had a simple prototype of a perceptron-like WINNOW classifier working with more than 99.0% accuracy on a test corpus of manually labeled samples. I chose this algorithm because it has a really good track record in e-mail spam classification.
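Winnow keeps one weight per feature and updates multiplicatively, and only when it makes a mistake. A minimal sketch in Python — the header features below are invented for illustration, not the ones the prototype actually used:

```python
class Winnow:
    """Mistake-driven linear classifier with multiplicative weight updates."""

    def __init__(self, threshold=2.0, promotion=2.0, demotion=0.5):
        self.w = {}                  # one weight per feature, lazily initialised to 1.0
        self.threshold = threshold
        self.promotion = promotion
        self.demotion = demotion

    def predict(self, features):
        score = sum(self.w.setdefault(f, 1.0) for f in features)
        return score > self.threshold        # True = "bot"

    def train(self, features, is_bot):
        if self.predict(features) == is_bot:
            return                           # update weights only on mistakes
        factor = self.promotion if is_bot else self.demotion
        for f in features:
            self.w[f] *= factor

# Toy training data: two sets of binary header features (names invented).
clf = Winnow()
bot = {"ua:curl", "no-accept-language", "http/1.0"}
human = {"ua:firefox", "accept-language:de", "http/1.1"}
for _ in range(10):
    clf.train(bot, True)
    clf.train(human, False)
```

The multiplicative updates are what makes Winnow attractive here: with thousands of possible header features but only a handful of relevant ones per request, irrelevant weights decay quickly and classification stays cheap.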

After these encouraging results I decided to turn the experiment into a working Apache module for live bot-vs-human classification. I had to solve several problems to be successful:

  1. Speed issues: the classification must not noticeably slow down the delivery of websites on a host. As a generic high-speed solution, an Apache module was developed. The Apache web server is among the most common server software, it is modular and allows the development of plugins (called “modules”), and it is written in the lightning-fast C programming language.
  2. Data-protection and privacy issues: I had to collect data samples for manual labeling of training data in a form compatible with the then upcoming GDPR. The solution was to extract the raw features from the whole headers automatically and store them hashed with a one-way cryptographic hash function. In parallel I logged these features together with obfuscated raw HTTP headers. The human-readable part of the headers was stripped of any information that originally made it unique or assignable to a real user. Additionally, data was only collected when the classification result was undecided within some margin.
  3. Updating the module: as the hashes of the features were still text, it turned out that the classification worked on the hashed features exactly as well as on the raw text features. For updating the classification model, a backend was developed containing a small web interface, a database and a PHP version of the classification module, where the anonymized raw headers of possible false positives could be examined and the hashed features reclassified. The resulting updated classification model was then uploaded to the web server.
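The privacy approach in points 2 and 3 works because Winnow only ever compares feature tokens for identity, never inspects their content. Any deterministic one-way hash therefore leaves the classification unchanged. A sketch, assuming SHA-256 and an illustrative salt value:

```python
import hashlib

# Illustrative salt; a real deployment would use its own secret value.
SALT = b"per-deployment-secret"

def anonymize(feature: str) -> str:
    """One-way hash of a raw header feature: irreversible, but identical
    inputs always yield the same token, so a token-matching classifier
    behaves exactly the same on hashed features as on raw ones."""
    return hashlib.sha256(SALT + feature.encode("utf-8")).hexdigest()

# Invented raw features for illustration.
raw_features = ["ua:curl/7.64", "accept-language:de-DE", "connection:keep-alive"]
hashed_features = [anonymize(f) for f in raw_features]
```

The model can then be trained, stored and updated entirely on the hashed tokens, so no personal data from the headers ever needs to leave the collection stage in readable form.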

The results were astonishing:

  1. The live classification on the web server was done in less than 200 milliseconds on average.
  2. The classification model was very robust, and updates were only rarely needed.
  3. Misclassifications were rare: only some very suspect access attempts via proxy servers could not be automatically classified as bot or human. But these were indistinguishable for a human operator, too.
  4. After some simple tweaks, the custom CMS suppressed the actual content of every document if the access was neither from a valid search-engine crawler nor clearly from a browser. With no content left to steal, the problem was solved!
  5. After this method was implemented, content was only rarely copied manually by users and pasted into forums or blogs, which did not cause any SEO issues.