Tweet Notes (CSE 494/598 F11): 11/22/2011

Wednesday, November 23, 2011

The web is not random at all. Structure exists; it just isn't well defined.

Figuring out this structure and extracting information from it is
highly useful for relating items, such as documents that refer to
each other without containing an actual hyperlink.
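
As a rough illustration of relating documents that mention each other
without a hyperlink, here is a minimal sketch that treats a mention of
another document's title as an implicit reference. The documents, their
titles, and the function name implicit_links are made up for this
example; a real system would need much fuzzier matching than an exact
title lookup.

```python
# Hypothetical mini-collection: none of these documents link to each other.
DOCS = {
    "a": {"title": "Link Analysis Basics",
          "text": "Hubs and authorities are computed iteratively."},
    "b": {"title": "Crawling Notes",
          "text": "As covered in Link Analysis Basics, authority matters."},
    "c": {"title": "Indexing Notes",
          "text": "Inverted indexes map terms to documents."},
}

def implicit_links(docs):
    """Return (source, target) pairs where source's text mentions target's title."""
    links = set()
    for src, src_doc in docs.items():
        text = src_doc["text"].lower()
        for dst, dst_doc in docs.items():
            # Skip self-references; match titles case-insensitively.
            if src != dst and dst_doc["title"].lower() in text:
                links.add((src, dst))
    return links

found = implicit_links(DOCS)
```

Here only document "b" names another document's title in its text, so
the one implicit reference found is from "b" to "a".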

Again, we are not trying to fully understand what the documents mean;
we only want to obtain a limited amount of information.

An extraction model could perform tasks such as segmentation,
classification, clustering, and association.
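
To make the four tasks concrete, here is a toy sketch, not anything
from the lecture. It assumes a two-sentence example text, a hand-written
organization list standing in for a learned classifier, and very crude
rules (capitalized word runs as entity spans, a shared last word as the
clustering key, same-sentence co-occurrence as association). All names
below are hypothetical.

```python
import re
from collections import defaultdict

TEXT = "Bill Gates founded Microsoft. Gates later left Microsoft."

# Hypothetical hand-written lexicon standing in for a learned classifier.
ORGANIZATIONS = {"Microsoft"}

def segment(sentence):
    """Segmentation: find candidate entity spans (runs of capitalized words)."""
    return re.findall(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*", sentence)

def classify(span):
    """Classification: give each span a coarse type."""
    return "ORG" if span in ORGANIZATIONS else "PERSON"

def cluster(spans):
    """Clustering: group mentions that share a last word ('Bill Gates', 'Gates')."""
    groups = defaultdict(set)
    for span in spans:
        groups[span.split()[-1]].add(span)
    return dict(groups)

def associate(sentences):
    """Association: relate a PERSON and an ORG found in the same sentence."""
    pairs = set()
    for sentence in sentences:
        spans = segment(sentence)
        people = [s for s in spans if classify(s) == "PERSON"]
        orgs = [s for s in spans if classify(s) == "ORG"]
        # Use the person's last word so both mentions map to one entity.
        pairs.update((p.split()[-1], o) for p in people for o in orgs)
    return pairs

# Split the text into sentences after each period.
sentences = [s for s in re.split(r"(?<=\.)\s+", TEXT) if s]
mentions = [span for s in sentences for span in segment(s)]
clusters = cluster(mentions)
relations = associate(sentences)
```

Each function covers one task, so any of them could be swapped for a
statistical model without touching the others.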

Information extraction (IE) falls somewhere between information
retrieval (IR) and natural language processing (NLP). In a sense,
IE is a limited, subject-specific version of NLP.

-Thomas Hayden