Sometimes there is a great article, some useful information, or anything else on a website that you want to preserve. Saving the page in the browser with something like Save as ... doesn’t always work properly. But don’t worry, there is a solution for this.
How archiving works
Simply put, it’s only necessary to download the page.
But what exactly does this mean? It means archiving the following components:
- and so on …
Today numerous sites consist of content delivered by various domains. That could be a content delivery network like Akamai, a service provider for user tracking, or social media integration, e.g. from Google or Facebook. Exactly this mashup makes it extremely difficult to download a website correctly.
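To get a feeling for how many domains a page pulls content from, the hosts referenced in its HTML can be listed. The following is only a rough sketch and not a real HTML parser: list_hosts is a hypothetical helper name, page.html is an assumed, already saved file, and only absolute http/https URLs in double-quoted src and href attributes are found.

```shell
# list_hosts: hypothetical helper, reads HTML on stdin and prints each
# referenced host once. Rough heuristic: only absolute URLs in
# double-quoted src="..." or href="..." attributes are detected.
list_hosts() {
    grep -Eo '(src|href)="https?://[^/"]+' |
        sed -E 's#^.*://##' |
        sort -u
}

# Usage (page.html is an assumed file name):
#   list_hosts < page.html
```

Every host in that list besides the site's own is a candidate for content that a plain download would miss.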
To solve this challenge there is a tool called wget that runs on a Unix command line. Of course there are other tools that would solve the same problem, but few other tools are as basic and widely available. Furthermore, wget can be used for many, many other things besides downloading a website. In case you are interested in all the power that wget provides, simply have a look at the manual.
The following section describes some basic operations supported by wget. To choose which operation is reasonable, please have a look at it before running a website download powered by wget.
The section below describes some examples that might be helpful when archiving a website. Furthermore, it’s possible to run a kind of batch download that archives full trees starting from one website.
The table below lists some really useful options for wget:
|-k|Convert links to the local files|
|-p|Download all files necessary for proper site display|
|-H|Load files from other domains, too|
|-r|Recursive, default --level=5|
|--level=N|Maximum recursion depth|
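As a minimal sketch of how these options combine, the following shell function assembles -k and -p and optionally adds -H. The function name archive_page and the WGET override variable are assumptions made for illustration and are not part of wget itself.

```shell
# archive_page: hypothetical helper that only assembles wget options.
# WGET can be overridden (e.g. WGET=echo) to preview the command
# instead of starting a download.
WGET="${WGET:-wget}"

archive_page() {
    url="$1"
    span_hosts="${2:-no}"   # pass "yes" to also load files from other domains

    # -k converts links to local files, -p fetches the files needed for display
    set -- -k -p
    if [ "$span_hosts" = "yes" ]; then
        set -- "$@" -H
    fi
    "$WGET" "$@" "$url"
}

# Usage:
#   archive_page http://www.example.com/page.html        # same domain only
#   archive_page http://www.example.com/page.html yes    # other domains too
```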
For offline usage it is necessary to use at least the -k and -p options. With wget, all the data that was downloaded will be stored in one or more folders. The following section describes downloading the wget manual, which is located at the following URL:
https://www.gnu.org/software/wget/manual/wget.html
To start the download, simply run
wget -kp https://www.gnu.org/software/wget/manual/wget.html
After that there will be a folder named www.gnu.org. You can find the manual in the subfolder www.gnu.org/software/wget/manual. There will be a file called wget.html.
The requested website is stored within that file. Furthermore, all the assets that are required for rendering the website are stored within that folder structure. Such assets could be a .js file or simply images that are used within the page. In case you want to open that page later (of course you want this 😉), simply open the www.gnu.org/software/wget/manual/wget.html file with your browser and get a working copy of the desired content. In this case it’s an HTML representation of the wget manual.
This section describes a number of possible scenarios.
Archiving a website
In case you want to download one website only, run:
wget -k -p http://www.example.com/page.html
This only downloads the page itself and all the content that’s necessary for a proper offline display of the site. In case a website requires data located on other domains, the archived website might not work.
Archiving a multi domain website
If some content is stored on other domains too, an additional option is necessary. For that situation simply add the -H option:
wget -H -k -p http://www.example.com/page.html
The disadvantage of using this option might be the amount of data that will be downloaded. This results in a longer download time as well as more data to transfer and store.
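One way to keep the amount of data in check is to restrict which other domains are followed. wget offers the -D (--domains) option for this, which takes a comma-separated list of domains. The sketch below is an addition that does not appear in the examples above: the function name archive_with_domains, the WGET override variable, and the domain cdn.example.net are assumptions made for illustration.

```shell
# archive_with_domains: hypothetical helper around wget.
# WGET can be overridden (e.g. WGET=echo) to preview the command.
WGET="${WGET:-wget}"

archive_with_domains() {
    url="$1"
    domains="$2"   # comma-separated list, e.g. "www.example.com,cdn.example.net"

    # -H allows other hosts, -D limits them to the listed domains
    "$WGET" -H -D "$domains" -k -p "$url"
}

# Usage (cdn.example.net is an assumed asset domain):
#   archive_with_domains http://www.example.com/page.html www.example.com,cdn.example.net
```

Content from domains that are not listed, such as tracking or social media services, should then be skipped.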
Downloading numerous websites
Batch downloading a website and all linked pages.
The command below downloads the website like in the example above, but it also does the same for all directly linked content (first order). For example, if there is a link to http://www.example.com/second.html, this page will be archived too. When aiming to archive a complete manual that consists of numerous pages, this might be the preferred solution.
wget -r --level=1 -H -k -p http://www.example.com/page.html
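When several start pages should be archived this way, the command can be wrapped in a small loop. This is a minimal sketch under a few assumptions: archive_tree and archive_list are hypothetical helper names, the list file with one URL per line is an assumed input, and the --wait=1 option (a pause of one second between requests) is an addition that the command above does not use.

```shell
# Hypothetical helpers around wget for batch archiving.
# WGET can be overridden (e.g. WGET=echo) to preview the commands.
WGET="${WGET:-wget}"

# archive_tree URL [DEPTH]: archive a page and its linked pages
archive_tree() {
    "$WGET" -r --level="${2:-1}" -H -k -p --wait=1 "$1"
}

# archive_list FILE: run archive_tree for every non-empty line in FILE
archive_list() {
    while IFS= read -r url; do
        [ -n "$url" ] || continue   # skip blank lines
        archive_tree "$url" 1
    done < "$1"
}

# Usage (urls.txt is an assumed file with one URL per line):
#   archive_list urls.txt
```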
It’s quite simple to archive a website. The mashup used for composing a website distributes its content over numerous domains, but there is a solution for this too.
wget is a powerful tool that makes it possible to fully download a single website, or a whole tree of websites when you want to conveniently download a lot of web pages that are linked from one site. Furthermore, there is no need for a special tool to read the website archives; a browser is enough.