Monday, October 10, 2011

10/06/2011

Web crawling can run into serious issues if not handled properly.
1. All legitimate crawlers are expected to follow the Robots Exclusion Protocol: before crawling a site, they should first fetch robots.txt at the server's root and skip the paths/folders disallowed there (see the robots.txt sketch after this list).
2. If a website is crawled too frequently, it may slow down or go down altogether, so a polite crawler throttles its requests to each host (a small throttling sketch follows).
3. Building a personal crawler for a specific purpose, such as collecting only basketball-related web pages, is essentially a classification problem: every fetched page must be classified as relevant or not (a toy relevance check is sketched below).
4. Given a URL, finding its back-links, i.e. the web pages that point to it, is challenging; it calls for something like an inverted index over links (the last sketch below builds one).
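
For point 1, here is a minimal sketch of a robots.txt check using Python's standard urllib.robotparser module (called simply robotparser in Python 2). The site URL and the user-agent name "MyCrawler" are just illustrative assumptions.

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://example.com/robots.txt")  # robots.txt at the server's root
    rp.read()  # fetch and parse the file

    # Ask whether our crawler may fetch a given path before crawling it.
    if rp.can_fetch("MyCrawler", "http://example.com/private/page.html"):
        print("Allowed to crawl")
    else:
        print("Disallowed by robots.txt")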
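For point 2, a sketch of per-host politeness: wait a fixed delay between requests to the same host so the site is not overloaded. The 2-second delay is an assumed value; real crawlers often read a Crawl-delay directive from robots.txt instead.

    import time
    import urllib.request
    from urllib.parse import urlparse

    CRAWL_DELAY = 2.0   # assumed minimum gap (seconds) between hits to one host
    last_hit = {}       # host -> time of our last request to it

    def polite_fetch(url):
        host = urlparse(url).netloc
        wait = CRAWL_DELAY - (time.time() - last_hit.get(host, 0))
        if wait > 0:
            time.sleep(wait)  # we hit this host too recently; back off
        last_hit[host] = time.time()
        return urllib.request.urlopen(url).read()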
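For point 3, a toy version of the classification step in a focused crawler: score a fetched page by basketball-related keywords and keep only pages above a threshold. The keyword list and threshold are assumptions; a real focused crawler would train a proper text classifier (e.g. Naive Bayes) on labeled pages.

    KEYWORDS = {"basketball", "nba", "dunk", "rebound", "playoffs"}
    THRESHOLD = 3  # assumed minimum keyword hits to call a page relevant

    def is_relevant(page_text):
        words = page_text.lower().split()
        hits = sum(1 for w in words if w.strip(".,!?") in KEYWORDS)
        return hits >= THRESHOLD

    # The focused crawler keeps only relevant pages and follows
    # only their out-links further.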
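For point 4, a minimal back-link index: while crawling, record each page-to-link edge in reverse, so we can later look up all pages pointing at a given URL. The sample pages below are made up.

    from collections import defaultdict

    backlinks = defaultdict(set)  # target URL -> set of pages linking to it

    def record_links(page_url, out_links):
        for target in out_links:
            backlinks[target].add(page_url)

    # Example with made-up pages:
    record_links("http://a.example/", ["http://c.example/", "http://d.example/"])
    record_links("http://b.example/", ["http://c.example/"])
    print(backlinks["http://c.example/"])  # pages that point to c.example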
 
-Rashmi Dubey