How does Google index/crawl restricted pages? + Bonus

Keio
level0
Posts: 8
Joined: Mon Jan 10, 2011 11:29 am

How does Google index/crawl restricted pages? + Bonus

Postby Keio » Thu Jan 05, 2012 5:51 am

Have you ever searched for something on Google, only to click a result that leads to a restricted page or forum?
How does Google index that page if you would normally need an account to see its content?

Bonus question: you know how the Google results page is presented. Let's work with the following example and assume I searched for "x". I get:

Result 1 Title
Relevant text for result 1.

Result 2 Title
Relevant text for result 2.

etc.

Well, have you ever noticed that in some cases, by changing your query a little (say, to "~x"), you can get different "Relevant text" for the same pages:

Result 1 Title
Different Relevant text for result 1.

Result 2 Title
Relevant text for result 2.

So, the question is: how can I see all the "Relevant text" snippets Google has stored for one particular site/page?
Some of you will immediately say to click the "Cached" link. Well... that works most of the time. But other times the cached text just isn't the same as what was shown as the "Relevant text", and other times the page simply doesn't exist anymore and absolutely nothing matches (which really makes me wonder how their cache actually works).


If these questions are not clear enough, please say so and I'll try to find some examples.
Montyphy
level5
Posts: 6747
Joined: Tue Apr 19, 2005 2:28 pm
Location: Bristol, England

Re: How does Google index/crawl restricted pages? + Bonus

Postby Montyphy » Thu Jan 05, 2012 9:41 am

Keio wrote:Have you ever searched for something on Google, only to click a result that leads to a restricted page or forum?
How does Google index that page if you would normally need an account to see its content?


A few possibilities:

1. The site's content changed since Google crawled the site.

2. The site's restrictions changed since Google crawled the site.

3. The site grants access without membership based on the provided user agent. Typically this will be the user agents used by the crawlers of popular web indexers (Google's list can be found here). As for how to exploit this, just look into ways of spoofing your user agent. It shouldn't be too difficult, as it's been popular for a long time, although the slight problem is if the site also performs an IP check.

4. The site has 'signed up' to a program with Google, although this tends to be pretty much point 3 with the addition of checks on referrals, cookies, and possibly uniqueness.
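Point 3 can be sketched with Python's standard library. The URL below is hypothetical, and the user-agent string is the one Google publishes for its main crawler; the network call itself is left commented out:

```python
import urllib.request

# Hypothetical restricted page -- swap in a real URL to experiment.
URL = "https://example.com/restricted-thread"

# Googlebot's published user-agent string; a site that whitelists
# crawlers typically compares the request's User-Agent against this.
GOOGLEBOT_UA = ("Mozilla/5.0 (compatible; Googlebot/2.1; "
                "+http://www.google.com/bot.html)")

# Build a request that claims to be Googlebot.
req = urllib.request.Request(URL, headers={"User-Agent": GOOGLEBOT_UA})
print(req.get_header("User-agent"))

# urllib.request.urlopen(req) would then fetch the page with the
# spoofed user agent (omitted here to keep the sketch offline).
```

This only defeats sites that trust the header alone; as noted above, a site that also checks the requesting IP won't be fooled.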

Keio wrote:So, the question is: how can I see all the "Relevant text" snippets Google has stored for one particular site/page?
Some of you will immediately say to click the "Cached" link. Well... that works most of the time. But other times the cached text just isn't the same as what was shown as the "Relevant text", and other times the page simply doesn't exist anymore and absolutely nothing matches (which really makes me wonder how their cache actually works).


Unless you create something to carefully tweak your search queries and then merge the results, I doubt it will be possible to do what you want. You could try services like the Wayback Machine to see if they have the content you want, but if it's members-only or behind a paywall you may be out of luck.
Keio
level0
Posts: 8
Joined: Mon Jan 10, 2011 11:29 am

Postby Keio » Thu Jan 05, 2012 8:14 pm

Even though you are right about all four possibilities, I was implicitly excluding the first two.
The 3rd and 4th are the answers I was looking for. Thank you.

I honestly didn't think they relied only on the user agent. Checking the IP as well seems more reasonable, as you say.
As for the 4th possibility, I don't think I've ever seen a website that used the "First click free for web search" method, but it could also be a possible explanation.
I'm more inclined to think that most websites will simply check whether the request comes from Googlebot or not, and allow it accordingly.
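A site that really wants to admit only Googlebot can't trust the user agent alone; Google's documented verification is a reverse-DNS lookup on the requesting IP followed by a forward-confirming lookup. A minimal sketch, with the resolver functions injectable so the logic can be exercised without live DNS:

```python
import socket
from typing import Callable, List

def is_real_googlebot(
    ip: str,
    reverse_dns: Callable[[str], str] = lambda ip: socket.gethostbyaddr(ip)[0],
    forward_dns: Callable[[str], List[str]] = lambda h: socket.gethostbyname_ex(h)[2],
) -> bool:
    """Verify a claimed Googlebot request the way Google documents it:
    reverse-DNS the IP, check the hostname is under googlebot.com or
    google.com, then forward-resolve that hostname back to the same IP."""
    try:
        host = reverse_dns(ip)
    except OSError:           # no PTR record at all
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        return ip in forward_dns(host)   # forward-confirm the PTR name
    except OSError:
        return False
```

The forward-confirm step matters: a spoofer can't control the PTR record for Google's IP ranges, and a hostname like `crawl.googlebot.com.evil.net` fails the suffix check.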

Your link about uniqueness is also very interesting and something I was not aware of. :)

For the Google results, it seems impractical that there is no way to do it. The Wayback Machine doesn't crawl very often, even if it's worth checking in some cases.

If I find a way to do this I'll post again.
MrBunsy
level5
Posts: 1081
Joined: Mon Apr 24, 2006 4:40 pm
Location: Southampton
Contact:

Postby MrBunsy » Fri Jan 06, 2012 11:18 am

Keio wrote:I honestly didn't think they relied only on the user agent. Checking the IP as well seems more reasonable, as you say.


Devs are lazy, I betcha the simplest solution is the one used :P
