Recently I have been working on a web page analyzer that divides a
whole page into blocks according to properties of its elements, such
as position, size, and background color.
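To make the idea concrete, here is a minimal sketch of that kind of block splitting. It is not the analyzer's actual code: the Element record, the split_into_blocks function, and the 20-pixel max_gap threshold are all made up for illustration. It only assumes that a renderer has already given each element a position, a height, and a background color.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical record of what a renderer reports for one element."""
    name: str
    top: int          # y coordinate of the top edge, in pixels
    height: int       # rendered height, in pixels
    background: str   # background color as a CSS-style string


def split_into_blocks(elements, max_gap=20):
    """Group elements, top to bottom, into visual blocks.

    A new block starts when the background color changes or when the
    vertical gap to the previous element exceeds max_gap pixels.
    Both rules are illustrative assumptions.
    """
    blocks = []
    previous = None
    for element in sorted(elements, key=lambda e: e.top):
        if previous is None:
            same_block = False
        else:
            gap = element.top - (previous.top + previous.height)
            same_block = (element.background == previous.background
                          and gap <= max_gap)
        if same_block:
            blocks[-1].append(element)
        else:
            blocks.append([element])
        previous = element
    return blocks


elements = [
    Element("logo", 0, 50, "#003366"),
    Element("nav", 50, 30, "#003366"),
    Element("article", 100, 400, "#ffffff"),
    Element("comments", 510, 200, "#ffffff"),
    Element("footer", 800, 40, "#eeeeee"),
]
names = [[e.name for e in block] for block in split_into_blocks(elements)]
print(names)
```

With this sample data the header elements, the content elements, and the footer end up in three separate blocks. A real analyzer would of course use many more properties than these two.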
My partners will use this page analyzer so that our experimental search
engine can crawl and index the web by semantic block rather than by
page.
My problem is that the HTML renderer I am using, MSHTML, cannot
handle scripts such as VBScript and JavaScript. Those scripts run in
response to certain user actions and operate on the HTML tree, but my
page analyzer can only analyze the static HTML tree that MSHTML gives
it.
In short, the page analyzer doesn't understand scripts.
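A small sketch of what "static tree only" means in practice. It uses Python's standard html.parser rather than MSHTML, and the page and the element ids are invented for the example. The script in the page would add a second div if it ran, but a parser that does not execute scripts sees the script body as plain text and never finds that div.

```python
from html.parser import HTMLParser

# Hypothetical page: "dynamic-block" exists only after the script runs.
PAGE = """
<html><body>
<div id="static-block">Visible without scripts</div>
<script>
  var d = document.createElement("div");
  d.id = "dynamic-block";
  document.body.appendChild(d);
</script>
</body></html>
"""


class IdCollector(HTMLParser):
    """Records the id attribute of every element in the static markup."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for this start tag.
        element_id = dict(attrs).get("id")
        if element_id:
            self.ids.append(element_id)


def static_ids(html):
    """Return the element ids found without executing any script."""
    collector = IdCollector()
    collector.feed(html)
    collector.close()
    return collector.ids


print(static_ids(PAGE))
```

Only the id from the markup itself is reported. The block that the script would create is invisible to this kind of analysis, which is the situation my analyzer is in.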
I don't know whether Google's crawler understands the script elements
in a page, but I guess not, because Google's AdSense cannot handle Ajax
pages.
The truth is that, as more and more pages on the Internet become
interaction-rich, the communication between server and browser has
changed.
In the days of static pages, the server sent a source file to the
client, and the crawler could pretend to be a normal visitor.
But with Web 2.0, the server sends replies that cannot be understood
unless the browser is waiting for that specific kind of reply. The
crawler cannot pretend any more.
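As a rough illustration of such a reply, here is a sketch with invented data: SHELL stands for the HTML a crawler gets from the page URL, AJAX_REPLY for what the server later sends to the page's script, and render_like_the_page_script for the client-side logic that turns the reply into content. None of these names come from a real site or library, and JSON is just one possible reply format.

```python
import json

# What a crawler receives when it fetches the page URL (hypothetical).
SHELL = '<div id="results"></div>'

# What the server sends later, in answer to the page's own script (hypothetical).
AJAX_REPLY = '{"items": [{"title": "First post"}, {"title": "Second post"}]}'


def render_like_the_page_script(reply_text):
    """Do what the page's script would do: turn the reply into markup."""
    data = json.loads(reply_text)
    items = "".join(f"<li>{item['title']}</li>" for item in data["items"])
    return f'<div id="results"><ul>{items}</ul></div>'


rendered = render_like_the_page_script(AJAX_REPLY)
print("First post" in SHELL)     # the shell alone carries no content
print("First post" in rendered)  # content appears only after client-side work
```

The reply is meaningful only to a client that already knows what to do with it. A crawler that just downloads the shell never sees the titles.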
Either new kinds of crawlers will come, or search engines will disappear.
Am I right?