Black Hole SEO: Desert Scraping
In my introduction post to Black Hole SEO I hinted that I was going to talk about how to get “unique authoritative content.” I realize that sounds like an oxymoron. If content is authoritative, then it must have proven itself in the search engines. Yet if the content is unique, then it can’t currently exist in the search engines. Kind of a nasty catch-22. So how is unique authoritative content even possible? Well, to put it simply, content can be dropped from the search engines’ index.
That struck a chord, didn’t it? So if content can be in the search engines one day, performing very well, and months or years down the road no longer be listed, then all we have to do is find it and snag it up. That makes it both authoritative and, as of the current moment, unique as well. This is called Desert Scraping because you find deserted and abandoned content and claim it as your own. There are quite a few ways of doing it, of course. Most of them are not only easy to do but can be done manually by hand, so they don’t even require any special scripting. Let’s run through a few of my favorites.
Archive.org
Alexa’s Archive.org is one of the absolute best spots to find abandoned content. You can look up any old authoritative article site and literally find thousands of articles that once performed in the top class yet no longer exist in the engines now. Let’s take as an example one of the great classic authority sites, Looksmart.
1. Go to Archive.org and search for the authority site you’re wanting to scrape.
2. Select an old date, so the articles will have plenty of time to disappear from the engines.
3. Browse through a few subpages till you find an article on your subject that you would like to have on your site.
4. Find an article that fits your subject perfectly.
5. Do a SITE: command in the search engines to see if the article still exists there.
6. If it no longer exists, just copy the article and stake your claim. (A scripted version of these steps is sketched below.)
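If you’d rather script that lookup than click through by hand, here’s a minimal sketch against the public Wayback CDX API. The domain, date range, and limit are placeholder assumptions, and step 5 (the SITE: check) is left as a manual search, since that depends on which engine and tooling you have access to.

```python
# Minimal sketch, assuming the public Wayback CDX API at web.archive.org.
# The domain, date range, and limit below are placeholders -- adjust to taste.
import requests

def archived_urls(domain, year_from="2002", year_to="2004", limit=200):
    """List URLs the Wayback Machine captured for a domain during an old date range."""
    params = {
        "url": f"{domain}/*",        # everything under the domain
        "from": year_from,
        "to": year_to,
        "output": "json",
        "fl": "original,timestamp",
        "filter": "statuscode:200",  # only snapshots that were real pages
        "collapse": "urlkey",        # one row per unique URL
        "limit": str(limit),
    }
    rows = requests.get("https://web.archive.org/cdx/search/cdx",
                        params=params, timeout=30).json()
    return rows[1:]  # the first row is just the field names

if __name__ == "__main__":
    for original, timestamp in archived_urls("looksmart.com"):
        # Step 5 stays manual: run a SITE: search for each URL and keep only
        # the ones the engine no longer returns.
        print(f"https://web.archive.org/web/{timestamp}/{original}")
```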
See how easy it is? This can be done for just about any old authority site. As you can imagine, there’s quite a bit of content out there that is open for hunting. Just remember to focus on articles on sites that performed very well in the past; that ensures a much higher possibility of it performing well now. However, let’s say we wanted to do this on a mass scale without Archive.org. We already know that the search engines don’t index each and every page, no matter how big the site is. So all we have to do is find a sitemap.
Sitemaps
If you can locate a sitemap, then you can easily make a list of all the pages on a domain. If you can get all the pages on the domain and compare them against the SITE: command in the search engines, then you can return a list of all the pages/articles that aren’t indexed.
- Locate the sitemap on the domain and parse it into a flat file with just the URLs.
- Make a quick script to go through the list and do a SITE: command for each URL in the search engines.
- Anytime the search engine returns a result total of greater than 0, delete the URL from the list.
- Verify the list by making sure that each URL actually does exist and consists of articles you would like to use. (A sketch of this whole workflow follows below.)
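Here’s a rough sketch of that workflow in Python. The sitemap location and the output filename are assumptions, and is_indexed() is deliberately left as a stub, since the SITE: lookup depends on whichever search API (or manual checking) you use.

```python
# Rough sketch of the sitemap workflow. The sitemap URL, output file, and the
# is_indexed() stub are assumptions -- wire them up to your own setup.
import xml.etree.ElementTree as ET
import requests

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(sitemap_url):
    """Step 1: parse the sitemap into a flat list of URLs."""
    xml = requests.get(sitemap_url, timeout=30).content
    root = ET.fromstring(xml)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc")]

def is_indexed(url):
    """Step 2: placeholder for the SITE: lookup. Should return True when the
    engine reports one or more results for the URL."""
    raise NotImplementedError("hook this up to a search API, or check by hand")

def still_live(url):
    """Step 4: make sure the page itself actually exists before you use it."""
    try:
        return requests.head(url, allow_redirects=True, timeout=15).status_code == 200
    except requests.RequestException:
        return False

if __name__ == "__main__":
    urls = sitemap_urls("http://www.example.com/sitemap.xml")
    # Step 3: drop anything the engine already knows about, keep what's live.
    deserted = [u for u in urls if not is_indexed(u) and still_live(u)]
    with open("deserted_urls.txt", "w") as out:
        out.write("\n".join(deserted))
```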
There is one inherent problem with the automatic way. Since it’s grabbing the entire site through its sitemap, you are going to get a ton of junk results, like search queries and other stuff the site wants indexed but you want no part of. So it’s best to target a particular subdirectory or subdomain within the main domain that fits your targeted subject matter. For instance, if you were after articles on automotive topics, then only use the portion of the sitemap that contains domain.com/autos or autos.domain.com.
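Filtering the sitemap down to one topical slice is only a few lines on top of the sketch above; “autos” here is just the example topic from the paragraph.

```python
# Keep only URLs under domain.com/autos/ or autos.domain.com ("autos" is just
# the example topic from the paragraph above).
from urllib.parse import urlparse

def topical(urls, topic="autos"):
    keep = []
    for u in urls:
        parsed = urlparse(u)
        host = parsed.hostname or ""
        if parsed.path.startswith(f"/{topic}") or host.startswith(f"{topic}."):
            keep.append(u)
    return keep
```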
There are quite a few other methods of finding deserted content. For instance, many big sites use custom 404 error pages. A nice exploit is to do site:domain.com “Sorry this page cannot be found” and then look up the cached copy in another search engine that may not have updated the page yet. There is certainly no shortage of them. Can you think of any others?
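One way to script the custom-404 angle, purely as a hedged sketch: given candidate URLs from a sitemap or anywhere else, flag the “soft 404” pages that return a 200 status but only show the error message, then go hunt for a cached copy of each one by hand. The error phrase below is just the example from the paragraph; swap in whatever the target site actually uses.

```python
# Hypothetical helper: flag "soft 404" pages -- URLs that answer 200 but only
# show the site's custom error message. The phrase is the example from above.
import requests

ERROR_PHRASE = "sorry this page cannot be found"

def is_soft_404(url):
    try:
        resp = requests.get(url, timeout=15)
    except requests.RequestException:
        return False
    return resp.status_code == 200 and ERROR_PHRASE in resp.text.lower()
```

Anything this flags is content the site itself has dropped, but another engine’s cache may still be holding a copy of it.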
Cheers
Comments (157)
These comments were imported from the original blog. New comments are closed.
Another favorite of mine: Wikipedia… Just copy their content, store it for two to three months, and you can be sure it will have changed on Wikipedia by then.
My current “magic” recipe is the following: I take Wikipedia articles I scraped a few months ago and mix them with a few sentences from the current search engine results… I never produce supplemental pages that way!!!!
From experience, I can tell you that a good search engine ranking doesn’t depend on content but on inbound links! You could scrape almost anything; as long as you have lots of links, you’ll rank well.
This is more to have easily unique content!
hi eli, are you producing a tool for squirt members or should we code something like this ourselves?
regards,
RRF
@guerilla
Then if you’re worried the content is still being used…
Simple.
“blah blah this is the content from the page you’re worried about still being in the index blah blah”
Plop that into Google and bam… that will show you whether it’s still there or not.
Brilliant idea and premise Eli.
Thanks Brian. I kinda knew that. I was just pointing out that unless you are grabbing content from a defunct site/company, it’s important to be more cautious with how you check the content’s availability.
The usefulness of this idea is based upon being the only person with the forgotten content. Otherwise, we can just scrape current authoritative sources.
Very good idea Eli.
Dude. Awesome idea. I have some nice sites that I’m going to scrape for content… you have just filled up my calendar for the next week or so. Thanks!
Eli, you rock.
nothing,
he just wants to collect backlinks via comments, which has nothing to do with the topic, and he put some keywords in the opening sentences because he wants to monetize these “brand new tips” hrhr via adbride.
One thing to keep in mind though is that the abandoned content could have been scraped itself.
I quickly found the site you used in your post, Eli, ran a sentence through Google with “” around it, and sure enough it was taken from another site word for word.
Hey Eli
Great idea. Question:
You said “Just remember to focus on articles on sites that performed very well in the past, that ensures a much higher possibility of it performing well now.”
What criteria are you using to judge for the content’s performance?
It can’t be ranking in the SE, since it’s not there, so are you talking about views on the site or something?
@Dropout: you could always rewrite it, or “spin” it in your preferred content rewriter or pay someone $5 to rewrite it.
Obviously, the feasibility of the last method depends on whether you’re intending to build massive or smaller content sites.
Hi Eli,
Well, you only showed the site: command to check whether that page is still on the old site’s domain. But that in itself doesn’t mean the content isn’t still live somewhere (unless it’s a totally defunct site). For example, they could have moved the content to a different page, or even “sold” their content to be used on other sites. In that case it’s not so much scraping as stealing, since the content may be actively and rightfully in use on a valid site at the moment. As someone mentioned in the comments above, I’d at least recommend running a search on a phrase or two to make sure it’s really gone (at least for people who care). Sites can easily do their own searches too, and if you’re too greedy they’ll find you stole their content in a heartbeat.
Enjoying your posts, recent reader Eli. Is that pronounced “El-ee” or “e-Lie” in your case?
Eli told me that his name is pronounced like a Jihadist Warcry: E-li-li-li-li-li-li…
…..but I digress….
There probably should have been a “dummies beware” kind of disclaimer for this post. Why? Because you are correct in saying that the content you’ve selected could be content that has been resold or shuffled to another site. But that alone does not invalidate this technique. It just means that you are responsible for doing your own due diligence.
great article. what’s the take on the wikipedia comment? is it legit to grab a wiki article and rewrite it as your own?
do a lot of sites currently do this with wikipedia content? the only big site i am aware of doing this practice is answers.com
Hi Eli, that’s a wonderful idea. I had this in the back of my mind too. Taking and reproducing data from the invisible web. But never thought it would be this easy.
There are so many websites hosting on free domains out there that form part of the invisible web. If only there was a method to find them out. You think that is possible?
ooo, about.com, that’s a toughy. They NEVER lose content. They have every static page they’ve ever created still there and kickin’. So I definitely wouldn’t waste too much time on them.
How about sites that list detailed product information and are always changing inventory. E-com sites?
Hmm thanks for that one, I’ll look around for some of those.
The stuff I’m trying to scrape is in the dating niche… so hahaha.
Question for you guys… I thought that one of the main reasons that pages on sites like Wikipedia rank so well is because they are on the Wikipedia domain (an authority site), not because their content is the greatest (SEO-wise). Am I incorrect in my thinking?
On a side note, I created a wiki article last week for one of my sites’ search terms. Anyone know how long it takes to get a wiki article indexed?
Thanks, MarkJ
Here’s another play on this idea:
===
P.S. Eli - I just came across this site the other day. This isn’t my business model, but it’s fantastic reading material - gets my brain humming. I’d love to see a “Blue Hat SEO Guide to Web Programming” recommended book list.
What about comparing the content with what can be found on Copyscape.com? As an additional simple first check.
Just put the retrieved content into one big HTML page (or several), upload it, and enter the URL into Copyscape (or another content checker). If duplicate results are found, then you know for sure the content is not unique.
Maybe I am late to the party, but based on the comment by “Flow Chart Dude” from 2007-06-21 I made this simple little tool for scraping dropped DMOZ sites:
Live demo at dev.mediaworks.cz/dmoz_dropped.php, source at dev.mediaworks.cz/dmoz_dropped.phps
Great info on getting new content. But I have one question.
I didn’t see a response to this previously asked question. How do you know that the content you choose was ranked well and was authoritative?
thanks
Marc
Another great article !
Thanks for posting