
How does Google index/crawl restricted pages? + Bonus

Posted: Thu Jan 05, 2012 5:51 am
by Keio
Have you ever looked for something on Google only to click on some result which leads to a restricted page or forum?
How does Google index that page if normally one would need an account in order to see its content?

Bonus question: you know how the Google results page is presented. Let's work with the following example and assume I searched for "x"; then I get:

Result 1 Title
Relevant text for result 1.

Result 2 Title
Relevant text for result 2.

etc.

Well, have you ever noticed that in some cases, by changing your query a little (say, "~x"), you can get a different "Relevant text" for the same pages:

Result 1 Title
Different Relevant text for result 1.

Result 2 Title
Relevant text for result 2.

So, the question is: how can I see all the available "Relevant text" that Google has stored for one particular site/page?
Some of you will immediately say to click the "Cache" link. Well... that works most of the time. But other times the cached text just isn't the same as the one that was shown as the "Relevant text", and other times the page simply doesn't exist anymore and absolutely nothing matches. (Which really makes me wonder how their cache actually works.)


If these questions are not clear enough, please say so and I'll try to find some examples.

Re: How does Google index/crawl restricted pages? + Bonus

Posted: Thu Jan 05, 2012 9:41 am
by Montyphy
Keio wrote:Have you ever looked for something on Google only to click on some result which leads to a restricted page or forum?
How does Google index that page if normally one would need an account in order to see its content?


A few possibilities:

1. The site's content changed since Google crawled the site.

2. The site's restrictions changed since Google crawled the site.

3. The site grants account-free access to content based on the provided user agent. Typically these will be the user agents used by the crawlers of popular web indexers (Google's list can be found here). As for how to exploit this, just look into ways of spoofing your user agent; see the sketch after this list. It shouldn't be too difficult, as the trick has been popular for a long time, although the slight problem is if the site also performs an IP check.

4. The site has 'signed up' to a program with Google, although this tends to be pretty much point 3 with the addition of checks on referrals, cookies, and possibly uniqueness.
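
To illustrate point 3, here's a minimal sketch of fetching a page while spoofing Googlebot's user agent (Python, standard library only; the URL is hypothetical, and this only works if the site skips the IP check):

```python
import urllib.request

# Hypothetical members-only URL, purely for illustration.
URL = "http://example.com/members-only/article"

# One of the user-agent strings Google publishes for Googlebot.
GOOGLEBOT_UA = ("Mozilla/5.0 (compatible; Googlebot/2.1; "
                "+http://www.google.com/bot.html)")

request = urllib.request.Request(URL, headers={"User-Agent": GOOGLEBOT_UA})
with urllib.request.urlopen(request) as response:
    html = response.read().decode("utf-8", errors="replace")

print(html[:500])  # start of whatever the site decided to serve us
```

If the site also checks that the requesting IP really belongs to Google's crawlers, spoofing the user agent alone gets you nothing.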

Keio wrote:So, the question is: how can I see all the available "Relevant text" that Google has stored for one particular site/page?
Some of you will immediately say to click the "Cache" link. Well... that works most of the time. But other times the cached text just isn't the same as the one that was shown as the "Relevant text", and other times the page simply doesn't exist anymore and absolutely nothing matches. (Which really makes me wonder how their cache actually works.)


Unless you create something to carefully tweak your search queries and then merge the results, I doubt it will be possible to do what you want. You could try services like The Wayback Machine to see if they have the content you want, but if it's members-only or behind a paywall you may be out of luck.
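
If you do go down the query-tweaking route, the merging half might look something like this sketch. search() here is a hypothetical stand-in for however you actually fetch results (an official API, a scraper, whatever), since Google doesn't expose its snippet store directly:

```python
from collections import defaultdict

def search(query):
    """Hypothetical stand-in: should return (url, snippet) pairs for a
    query, via whatever search API or scraper you have available."""
    return []  # placeholder so the sketch runs as-is

# Several variations of the same basic query.
variations = ["x", "~x", '"x"', "x -irrelevant"]

# Collect every distinct snippet seen for each URL across all variations.
snippets_by_url = defaultdict(set)
for query in variations:
    for url, snippet in search(query):
        snippets_by_url[url].add(snippet)

for url, snippets in snippets_by_url.items():
    print(url)
    for text in snippets:
        print("  -", text)
```

Each variation surfaces whichever snippet best matches it, so over enough variations you'd accumulate the different "Relevant text" strings Google holds for a page, though with no guarantee of ever seeing all of them.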

Posted: Thu Jan 05, 2012 8:14 pm
by Keio
Even though you are right about all four possibilities, I was implicitly omitting the first two.
The 3rd and 4th are the answers I was looking for. Thank you.

I honestly didn't think they relied only on the user agent. Also checking the IP seems more reasonable, as you say.
As for the 4th possibility, I don't think I've ever seen a website that used the "First click free for web search" method, but it could also be a possible explanation.
I'm more inclined to think that most websites just check whether the request comes from Googlebot or not, and allow access accordingly.
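
If a site wanted to do that IP check properly, I imagine it would follow what Google itself documents: reverse-DNS the requesting IP, check the hostname belongs to Google, then forward-resolve it to make sure the PTR record isn't forged. A rough sketch of the server-side check (the test IP is just an illustration from a range Googlebot has historically crawled from):

```python
import socket

def is_real_googlebot(ip):
    """Verify a request claiming to be Googlebot: reverse-DNS the IP,
    check the hostname is under googlebot.com or google.com, then
    forward-resolve that hostname and confirm it maps back to the IP."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
        if not hostname.endswith((".googlebot.com", ".google.com")):
            return False
        return ip in socket.gethostbyname_ex(hostname)[2]
    except (socket.herror, socket.gaierror):
        return False

print(is_real_googlebot("66.249.66.1"))
```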

Your link about uniqueness is also very interesting and something I was not aware of. :)

For the Google results, it seems strange that there's no practical way to do it. The Wayback Machine doesn't crawl very often, even if it's worth checking in some cases.

If I find a way to do this I'll post again.

Posted: Fri Jan 06, 2012 11:18 am
by MrBunsy
Keio wrote:I honestly didn't think they relied only on the user agent. Also checking the IP seems more reasonable, as you say.


Devs are lazy, I betcha the simplest solution is the one used :P