Thursday, October 13, 2011

10/6/2011

Mercator maintains its URL frontier as follows:
(1) Extracted URLs enter the front queues.
(2) Each URL goes into a front queue based on its priority (assigned based on page importance and change rate).
(3) URLs are moved from the front queues to the back queues. Each back queue corresponds to a single host and has a time te at which that host can be hit again.
(4) A URL is removed from a back queue when the crawler wants a page to crawl.
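
The steps above can be sketched in a few lines of Python. This is only a rough illustration, not Mercator's actual code: the class name Frontier, the parameters num_priorities, delay and clock, and the next_ok table are all made up for the example. It also simplifies by moving every waiting URL out of the front queues at once (highest priority first) and by creating one back queue per host on demand, whereas a real crawler would keep a fixed number of back queues and refill them gradually.

```python
import heapq
import time
from collections import deque
from urllib.parse import urlparse


class Frontier:
    """Toy front-queue / back-queue frontier (hypothetical names throughout)."""

    def __init__(self, num_priorities=3, delay=2.0, clock=time.monotonic):
        # Front queues: index 0 holds the highest-priority URLs.
        self.front = [deque() for _ in range(num_priorities)]
        self.back = {}       # host -> deque of URLs (one back queue per host)
        self.heap = []       # (te, host): earliest time each host may be hit
        self.next_ok = {}    # host -> te remembered after its queue empties
        self.delay = delay   # assumed fixed politeness gap in seconds
        self.clock = clock

    def add(self, url, priority):
        # Steps (1) and (2): an extracted URL enters the front queue
        # that matches its priority.
        self.front[priority].append(url)

    def _refill(self):
        # Step (3): move URLs from front queues to per-host back queues,
        # draining higher-priority front queues first.
        for queue in self.front:
            while queue:
                url = queue.popleft()
                host = urlparse(url).netloc
                if host not in self.back:
                    self.back[host] = deque()
                    te = self.next_ok.get(host, 0.0)
                    heapq.heappush(self.heap, (te, host))
                self.back[host].append(url)

    def next_url(self):
        # Step (4): hand out a URL from the back queue whose te is earliest,
        # or None if no host may be hit yet.
        self._refill()
        if not self.heap or self.heap[0][0] > self.clock():
            return None
        _, host = heapq.heappop(self.heap)
        url = self.back[host].popleft()
        self.next_ok[host] = self.clock() + self.delay
        if self.back[host]:
            heapq.heappush(self.heap, (self.next_ok[host], host))
        else:
            del self.back[host]
        return url
```

For example, after adding two URLs for the same host, the second call to next_url() returns None until delay seconds have passed, which is the effect of te in step (3).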

 
-Shu