web scraper
Roger Oberholtzer
roger
Tue Nov 21 23:46:34 PST 2006
On Tue, 2006-11-21 at 23:50 -0500, Ric Moore wrote:
> On Tue, 2006-11-21 at 22:43 -0500, Kurt Wall wrote:
> > On Fri, Nov 17, 2006 at 08:16:13AM +0100, Roger Oberholtzer wrote:
> > >
> > > I have been toying with the idea of setting up a web scraper. Not for
> > > anything untoward. Just to track current information and activities
> > > related to parameters we measure. Perhaps peek a bit at the competition.
> > > IBM has a short paper on the concept with a few ruby examples. But they
> > > are very limited. Mainly, it was how to read web documents and find HTML
> > > tags. That is the easy part. The hard part is finding the docs in the
> > > first place. I know google gets one very far. It is just that I want to
> > > automate this for a number of interesting items. Perhaps I really need a
> > > meta search engine. Early days here.
> > >
> > > Anyone been there, done that? Or know where it is being done?
> >
> > Google has an API you can use, but I have no idea if it will do what you want vis-a-vis
> > finding pages of interest.
> >
> > A poor man's scraper might involve running "w3m -dump" on each URL of interest if all
> > you want is content or "w3m -dump_source" on each interesting URL if you want the entire
> > page. Perl has modules for scraping HTML screens, too.
> >
> > Conceptually, you need (at least) two components, one to find "interesting URLs" and
> > one to suck up those URLs.
> It's been done and is ready for transport, Captain! Just click on your
> FireFox /tools / addons and look for "Stumble"... it'll add a bar to
> firefox and when you click on stumble it'll take you to places you want

I have been a StumbleUpon user. Currently it is not working on my
system. But the topics I am interested in are rather obscure. Things
like research related to the World Bank's International Roughness Index
(one of many dozens of topics of this sort). I would like to somehow
follow new equipment available to measure this (read 'competitors'), or
requests for offers that require this as a provided item (read
'opportunities'). This sort of thing. I can and do google. But that
takes time. Especially separating new from old. As an aside, I must
check if Google has an option to sort hits by date.

Hence an automated method.
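
One way the "separating new from old" part might be automated is to
remember which hits have already been seen between runs. What follows is
only a minimal sketch under assumptions: the function name filter_new and
the JSON file used as a store are hypothetical, and the list of URLs is
assumed to come from some search step that is not shown here.

```python
import json
from pathlib import Path


def filter_new(urls, store_path):
    """Return the URLs not yet recorded in store_path, then record them.

    store_path is a hypothetical JSON file holding a list of URLs that
    were seen on earlier runs.
    """
    store = Path(store_path)
    seen = set(json.loads(store.read_text())) if store.exists() else set()
    fresh = []
    for url in urls:
        if url not in seen:
            seen.add(url)      # also suppresses duplicates within one run
            fresh.append(url)
    # Persist the updated set so the next run only reports newer hits.
    store.write_text(json.dumps(sorted(seen)))
    return fresh
```

Run periodically (from cron, say), this would report each hit only the
first time it turns up. It compares URLs only, so a page whose content
changes at the same address would not be noticed.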

I am guessing it will eventually involve using search engines like
Google to search for topics, and then going through the hits looking for
whatever is wanted.
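
The "going through the hits" step could look roughly like the sketch
below, which uses only the Python standard library. It is an illustration
under assumptions, not a finished tool: the names page_text, fetch_page
and scan_hits are made up for this example, the hit URLs are assumed to
come from a search step that is not shown, and the matching is a plain
case-insensitive substring test.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class _TextExtractor(HTMLParser):
    """Collect the visible text of a page, skipping script and style."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def page_text(html):
    """Return the text of an HTML document with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def fetch_page(url):
    """Download one page and decode it using the charset it declares."""
    with urlopen(url, timeout=30) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def scan_hits(urls, terms, fetch=fetch_page):
    """Map each URL to the wanted terms (lowercased) found in its text."""
    wanted = [t.lower() for t in terms]
    results = {}
    for url in urls:
        try:
            text = page_text(fetch(url)).lower()
        except OSError:
            continue  # unreachable page (URLError is an OSError); skip it
        found = [t for t in wanted if t in text]
        if found:
            results[url] = found
    return results
```

Something like scan_hits(hit_urls, ["International Roughness Index"])
would then return only the hits whose text mentions the phrase. The fetch
argument is there so the download step can be swapped out, for example
for a cached copy or a test stub.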
--
Roger Oberholtzer
OPQ Systems AB
Ramböll Sverige AB
Kapellgränd 7
P.O. Box 4205
SE-102 65 Stockholm, Sweden
Tel: Int +46 8-615 60 20
Fax: Int +46 8-31 42 23