Archiving websites with wget

Sometimes there is a great article, a useful piece of information, or anything else on a website that you want to preserve. Saving the page in the browser via something like Save as ... doesn't always work properly. But don't worry, there is a solution for this.

How archiving works

Simply put, archiving means nothing more than downloading the page.
But what exactly does this mean? It means archiving the following components:

  • the HTML page itself
  • images
  • JavaScript
  • media
  • and so on …

Today numerous sites consist of content delivered from various domains. That could be a content delivery network like Akamai, a service provider for user tracking, or a social media integration from e.g. Google or Facebook. Exactly this mashup makes it extremely difficult to download a website correctly.

To solve this challenge there is a tool called wget that runs on a Unix command line. Of course there are other tools that can do the same thing, but no other tool is as basic and widely available. Furthermore, wget can be used for many, many other things besides downloading a website. If you are interested in all the power that wget provides, simply have a look into the manual.
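On most Unix-like systems the manual is also available locally, assuming wget is installed:

man wget

A shorter summary of all options is printed by:

wget --help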

Introduction to wget

The following section describes some basic options supported by wget. Please have a look at them before running a website download with wget, so you can choose the options that are appropriate for your case.

The examples further below show how these options can be combined to archive a website. It's even possible to run a kind of batch download that archives a full tree of pages starting from a single website.

The table below lists some really useful options for wget:

Option                 Description
-k, --convert-links    Convert links so that they point to the local files
-p, --page-requisites  Download all files necessary to properly display the page
-H, --span-hosts       Also load files from other domains
-r, --recursive        Download recursively (per default --level=5)
-l, --level=depth      Maximum recursion depth

For offline usage it is necessary to use at least -k and -p. The first option converts all links into local links: if a file is available within the download folder, all links to that file point to the downloaded copy instead. The second option fetches all the content the website needs, so for example JavaScript, images, and media are downloaded.
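Put together, a minimal archiving call has the following general form, where <URL> is just a placeholder for the page you want to preserve:

wget -k -p <URL>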

Running wget

After running wget, all the downloaded data is stored in one or several folders. The following section describes downloading the wget manual, which is located at the following URL: https://www.gnu.org/software/wget/manual/wget.html.

To start the download, simply run

wget -k -p https://www.gnu.org/software/wget/manual/wget.html

After that there will be a folder named www.gnu.org. You can find the manual in the subfolder www.gnu.org/software/wget/manual, in a file called wget.html.
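The resulting layout on disk therefore looks roughly like this (any additional assets referenced by the page end up in the same hierarchy):

www.gnu.org/
└── software/
    └── wget/
        └── manual/
            └── wget.html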

The requested website is stored within that file. Furthermore, all the assets that are required for rendering the website are stored within that folder structure. Such assets could be a .css file containing stylesheets, JavaScript within a .js file, or simply images that are used within the page. If you want to open that page later ( of course you want this 😉 ) simply open the www.gnu.org/software/wget/manual/wget.html file with your browser and you get a working copy of the desired content. In this case it's an HTML representation of the wget manual.
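If you prefer to open the archived copy straight from the command line, the usual platform launchers work too, for example on Linux and macOS respectively:

xdg-open www.gnu.org/software/wget/manual/wget.html
open www.gnu.org/software/wget/manual/wget.html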

Examples

This section describes a number of typical scenarios.

Archiving a website

If you want to download a single page only, run

wget -k -p http://www.example.com/page.html

This downloads only the page itself and all the content that's necessary for a proper offline display of the site. If the website requires data located on other domains, the archived copy might not work properly.

Archiving a multi-domain website

If some content is stored on other domains too, this option is necessary: simply add the -H parameter in order to also download content that is located on other domains. Modern websites often consist of multi-domain content, so this is the safest solution for archiving a website. Very often content like fonts, JavaScript, or video is provided by completely different domains: videos may be served by YouTube, fonts by the Google APIs, and JavaScript by the website of the framework itself.

wget -H -k -p http://www.example.com/page.html

The disadvantage of using this option is the amount of data that may be downloaded, which results in a longer download time as well as more disk space used by the archive.

Downloading numerous websites

This scenario is about batch downloading a page together with all the pages it links to.

This downloads the website just like in the example above, but it also does the same for all directly linked content, one level deep. If there is, for example, a link to http://www.example.com/second.html, that page is archived too. When you aim to archive a complete manual that consists of numerous pages, this might be the preferred solution.

wget -r --level=1 -H -k -p http://www.example.com/page.html

Conclusion

It's quite simple to archive a website. The mashups used for composing modern websites distribute their content over numerous domains, but there is a solution for this too. wget is a powerful tool that lets you fully download a single website as well as a whole tree of websites, which is convenient for archiving many web pages connected through one site. Furthermore, there is no need for a special tool to read the website archives: a normal browser is enough.
