Using wget to snapshot a web site

Michael Hipp Michael at Hipp.com
Wed Dec 8 19:28:49 PST 2004


Kurt Wall wrote:
> On Wed, Dec 08, 2004 at 09:35:37PM -0600, Alan Jackson took 38 lines to write:
> 
>>On Wed, 07 Dec 2005 15:00:49 -0600
>>Michael Hipp <Michael at Hipp.com> wrote:
>>
>>
>>>I'm trying to use wget to grab an offline copy of this website so I can 
>>>refer to it when doing development without Internet access.
>>>
>>>    http://wiki.wxpython.org/index.cgi/FrontPage
>>>
>>>But all the links in that page all look like this:
>>>
>>>    <a href="/index.cgi/ObstacleCourse">ObstacleCourse</a>
>>>
>>>I can't find any combination of options for wget which will cause it to 
>>>follow these links. I presume it's because the link is written like an 
>>>absolute link when it is actually more of a relative link.
>>>
>>>Anyone know how to get wget to grab these or another tool which might do 
>>>the job?
> 
> 
> How about wget's -E option?
> 
>        -E
>        --html-extension
>            If a file of type application/xhtml+xml or text/html
>            is downloaded and the URL does not end with the regexp
>            \.[Hh][Tt][Mm][Ll]?, this option will cause the suffix
>            .html to be appended to the local filename.  This is
>            useful, for instance, when you're mirroring a remote
>            site that uses .asp pages, but you want the mirrored
>            pages to be viewable on your stock Apache server.
>            Another good use for this is when you're downloading
>            CGI-generated materials.  A URL like
>            http://site.com/article.cgi?25 will be saved as
>            article.cgi?25.html.

I've been using that, but all it does is change the file extension once the 
file has been downloaded. I need something that tells wget to follow a link 
that looks like /index.cgi/WhatEver. For reference, here is what my wget 
command most recently looked like (all one line):

wget -p -E -k -np -m -nH -N -U googlebot -PwxPython 
http://wiki.wxpython.org/index.cgi/FrontPage

It wgets the first page (FrontPage.html) and then happily quits. Phooey.
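One thing I haven't tried yet: wget honors robots.txt during recursive 
retrieval even when you change the user-agent string, so if the wiki's 
robots.txt disallows /index.cgi/, wget will stop after the first page no 
matter what other options are given. Something like this (untested) would 
turn that check off (all one line):

wget -p -E -k -np -m -nH -N -U googlebot -PwxPython -e robots=off 
http://wiki.wxpython.org/index.cgi/FrontPage

If that isn't it, adding -I /index.cgi (--include-directories) might also be 
worth a shot, to make sure the CGI path isn't being filtered out of the 
recursion.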

Michael

