Using wget to snapshot a web site
Michael Hipp
Michael at Hipp.com
Wed Dec 8 19:28:49 PST 2004
Kurt Wall wrote:
> On Wed, Dec 08, 2004 at 09:35:37PM -0600, Alan Jackson took 38 lines to write:
>
>>On Wed, 07 Dec 2005 15:00:49 -0600
>>Michael Hipp <Michael at Hipp.com> wrote:
>>
>>
>>>I'm trying to use wget to grab an offline copy of this website so I can
>>>refer to it when doing development without Internet access.
>>>
>>> http://wiki.wxpython.org/index.cgi/FrontPage
>>>
>>>But all the links in that page look like this:
>>>
>>> <a href="/index.cgi/ObstacleCourse">ObstacleCourse</a>
>>>
>>>I can't find any combination of wget options that will make it follow
>>>these links. I presume it's because the link is written as an absolute
>>>path from the site root rather than as a relative link.
>>>
>>>Anyone know how to get wget to grab these or another tool which might do
>>>the job?
>
>
> How about wget's -E option?
>
> -E
> --html-extension
> If a file of type application/xhtml+xml or text/html
> is downloaded and the URL does not end with the regexp
> \.[Hh][Tt][Mm][Ll]?, this option will cause the suffix
> .html to be appended to the local filename. This is
> useful, for instance, when you're mirroring a remote
> site that uses .asp pages, but you want the mirrored
> pages to be viewable on your stock Apache server.
> Another good use for this is when you're downloading
> CGI-generated materials. A URL like
> http://site.com/article.cgi?25 will be saved as
> article.cgi?25.html.
I've been using that, but all it does is change the file extension once it has
been downloaded. I need something to tell wget to follow a link that looks
like /index.cgi/WhatEver. For ref, here is what my wget command most recently
looked like (all one line):
wget -p -E -k -np -m -nH -N -U googlebot -PwxPython
http://wiki.wxpython.org/index.cgi/FrontPage
It wgets the first page (FrontPage.html) and then happily quits. Phooey.
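
For the archives, one guess I haven't tested yet: wget honors robots.txt
and the "nofollow" robots meta tag during recursive retrievals, and a
wiki like this one may well serve those, so the recursion could be
stopping there rather than at the link format. Telling wget to ignore
robot exclusion should rule that out (all one line, untested):

wget -p -E -k -np -m -nH -N -U googlebot -e robots=off -PwxPython
http://wiki.wxpython.org/index.cgi/FrontPage

If that doesn't do it, restricting recursion to the script's path with
-I /index.cgi (--include-directories) would be my next experiment.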
Michael