web scraper
Kurt Wall
kwall
Tue Nov 21 19:43:11 PST 2006
On Fri, Nov 17, 2006 at 08:16:13AM +0100, Roger Oberholtzer wrote:
>
> I have been toying with the idea of setting up a web scraper. Not for
> anything untoward. Just to track current information and activities
> related to parameters we measure. Perhaps peek a bit at the competition.
> IBM has a short paper on the concept with a few Ruby examples, but they
> are very limited. Mainly, it was how to read web documents and find HTML
> tags. That is the easy part. The hard part is finding the docs in the
> first place. I know Google gets one very far. It is just that I want to
> automate this for a number of interesting items. Perhaps I really need a
> meta search engine. Early days here.
>
> Anyone been there, done that? Or know where it is being done?
Google has an API you can use, but I have no idea if it will do what you want vis-a-vis
finding pages of interest.
A poor man's scraper might involve running "w3m -dump" on each URL of interest if all
you want is the rendered text, or "w3m -dump_source" on each interesting URL if you want
the entire page. Perl has modules for scraping HTML pages, too.
Conceptually, you need (at least) two components, one to find "interesting URLs" and
one to suck up those URLs.
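The two components could be sketched in Ruby (the language the IBM paper used). This is only an illustration, not the paper's code; the regex-based link extraction and the example URL are my own assumptions:

```ruby
require 'net/http'
require 'uri'

# Component 1: find "interesting URLs" -- here, crudely, by pulling
# href targets out of a page's HTML with a regular expression.
# (A real scraper would use a proper HTML parser.)
def find_urls(html)
  html.scan(/href=["']([^"']+)["']/i).flatten.uniq
end

# Component 2: suck up the content of a given URL.
def fetch(url)
  Net::HTTP.get(URI.parse(url))
end

# Example usage (needs network access; example.com is a placeholder):
#   seed = fetch('http://example.com/')
#   find_urls(seed).each { |u| puts u }
```

From there you would feed the discovered URLs back into fetch, filtering for whatever "interesting" means to you.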
Kurt
--
How come financial advisors never seem to be as wealthy as they
claim they'll make you?
More information about the Linux-users
mailing list