Using wget to snapshot a web site

Kurt Wall kwall
Wed Dec 8 19:10:22 PST 2004


On Wed, Dec 08, 2004 at 09:35:37PM -0600, Alan Jackson took 38 lines to write:
> On Wed, 07 Dec 2005 15:00:49 -0600
> Michael Hipp <Michael at Hipp.com> wrote:
> 
> > I'm trying to use wget to grab an offline copy of this website so I can 
> > refer to it when doing development without Internet access.
> > 
> >     http://wiki.wxpython.org/index.cgi/FrontPage
> > 
> > But all the links in that page look like this:
> > 
> >     <a href="/index.cgi/ObstacleCourse">ObstacleCourse</a>
> > 
> > I can't find any combination of options for wget which will cause it to 
> > follow these links. I presume it's because the link is written like an 
> > absolute link when it is actually more of a relative link.
> > 
> > Anyone know how to get wget to grab these or another tool which might do 
> > the job?

How about wget's -E option?

       -E
       --html-extension
           If a file of type application/xhtml+xml or text/html
           is downloaded and the URL does not end with the regexp
           \.[Hh][Tt][Mm][Ll]?, this option will cause the suffix
           .html to be appended to the local filename.  This is
           useful, for instance, when you're mirroring a remote
           site that uses .asp pages, but you want the mirrored
           pages to be viewable on your stock Apache server.
           Another good use for this is when you're downloading
           CGI-generated materials.  A URL like
           http://site.com/article.cgi?25 will be saved as
           article.cgi?25.html.
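
For an offline copy you'd probably want to combine -E with recursion
and link conversion. Untested, but something along these lines ought
to work (adjust the recursion depth to taste):

       wget --recursive --level=5 --no-parent \
            --page-requisites --convert-links \
            --html-extension \
            http://wiki.wxpython.org/index.cgi/FrontPage

-r/--recursive follows the links, -np/--no-parent keeps wget from
wandering above the starting directory, -p/--page-requisites pulls in
images and stylesheets, -k/--convert-links rewrites the links
(including those absolute /index.cgi/... ones) to point at the local
copies, and -E tacks .html onto the CGI-generated pages so a browser
will render them offline.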

Kurt
-- 
I am not now, nor have I ever been, a member of the demigodic party.
	-- Dennis Ritchie

