Should be reading more and writing less, but well...

Thursday, February 17, 2005


Anchor Text and Focused Crawling

It's been a while since I have blogged anything technical.

These days, I am working on the open source search engine, Nutch. Before I get into what I am doing, let me explain why, in the last sentence, I made the phrase "open source search engine" part of the link's anchor text. Search engines use anchor text extensively to figure out what a page is about. For example, the home page of Tejaswi doesn't have the phrase "home page" anywhere in it. So, by looking at the anchor text of all the in-links to a page, the search engine can figure out what the content of the page might be about. This is an indirect way of identifying the content of a page: by looking at what its in-links call it. Now, when I say "the open source search engine Nutch" in the anchor text and link to nutch.org, that phrase gets associated with the site, and helps someone who is searching for an open source search engine but has never heard of Nutch.
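
To make the anchor text idea concrete, here is a minimal sketch in Python. It is not how Nutch or any real search engine does it; the class `AnchorCollector`, the function `anchor_index` and the two sample pages are all made up for illustration. It just collects the words that in-links use when they point at a URL, so that a page can be described by words that never appear on the page itself.

```python
from collections import Counter, defaultdict
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects (href, anchor text) pairs from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, anchor text) pairs
        self._href = None   # href of the <a> tag we are currently inside
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace inside the anchor text.
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def anchor_index(pages):
    """Map each target URL to a count of the words used in links to it."""
    index = defaultdict(Counter)
    for html in pages:
        collector = AnchorCollector()
        collector.feed(html)
        for href, text in collector.links:
            index[href].update(text.lower().split())
    return index


# Two hypothetical pages that both link to nutch.org.
pages = [
    '<p>I work on the <a href="http://nutch.org">open source search engine</a>.</p>',
    '<p>Try <a href="http://nutch.org">Nutch, an open source crawler</a>.</p>',
]
index = anchor_index(pages)
print(index["http://nutch.org"].most_common(3))
```

A real engine would also weight each in-link (by the linking page's importance, for instance) and strip punctuation and stop words, but the basic association between anchor words and the target URL is the same.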

Currently, I am working on the crawler part of the search engine. The crawler (or spider) is an offline process that goes all over the web and gets pages for the search engine to index. The idea is to start the crawler with a set of seed pages. The crawler then indexes the textual content of each page, and recursively crawls each page's out-links. This goes on ad infinitum. This part is pretty standard, and is already implemented. My job is to ensure that the crawl is not ad-hoc, i.e. that not all out-links are crawled. I am trying to "focus" the crawl so that only pages pertinent to certain topics, like "cycling", "art cinema", "photography", "BDSM" etc., get crawled and subsequently indexed. Why do we need to focus a crawl?
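
Here is a rough sketch of what "focusing" a crawl could look like. To be clear, this is a toy and not what I am building in Nutch: the in-memory `WEB` dictionary stands in for real HTTP fetching, and the word-overlap `relevance` score and the `threshold` value are assumptions I made up for the example. A real focused crawler would use a trained topic classifier instead.

```python
from collections import deque

# Hypothetical in-memory "web": url -> (page text, out-links).
WEB = {
    "seed": ("cycling news and photography tips", ["bike", "cam", "stocks"]),
    "bike": ("road cycling training and cycling gear", ["tour"]),
    "cam": ("photography basics for beginners", []),
    "stocks": ("stock market prices today", ["bonds"]),
    "bonds": ("bond yields and interest rates", []),
    "tour": ("cycling tour routes", []),
}


def relevance(text, topic_words):
    """Fraction of the words in the text that are topic words."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in topic_words for w in words) / len(words)


def focused_crawl(seeds, topic_words, threshold=0.2, fetch=WEB.get):
    """Breadth-first crawl that only indexes and expands on-topic pages."""
    indexed = []
    seen = set(seeds)
    frontier = deque(seeds)
    while frontier:
        url = frontier.popleft()
        page = fetch(url)
        if page is None:
            continue
        text, out_links = page
        if relevance(text, topic_words) < threshold:
            # Off-topic: the page was fetched, but it is neither
            # indexed nor are its out-links followed.
            continue
        indexed.append(url)
        for link in out_links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return indexed


crawled = focused_crawl(["seed"], {"cycling", "photography"})
print(crawled)
```

In this toy web, the "stocks" page is fetched and found to be off-topic, so the "bonds" page behind it is never visited at all. That pruning is the whole point: an ad-hoc crawl would have followed every out-link.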

Google currently claims that it indexes 8 billion webpages. According to recent estimates, un-indexed pages outnumber indexed pages by a factor of 4-5. This means that there are roughly 32 to 40 billion pages out there that Google can index, but is not indexing. Why not? Well, for one, more pages doesn't necessarily mean better search results. A good number of pages representing a broad range of topics means better search results. This is where a focused crawl might be preferred over an ad-hoc crawl. If you are really interested, take a look at my advisor's Focused Crawling page for more information.

In other news, read Jeremy Zawodny's post on Mark Jen to learn about the Google employee who got fired for blogging some company internals. All corporate bloggers out there... are you reading this?

