Downloading a website as HTML files

Posted in Development and Tools

This is another one of those notes-to-self blog posts, for something I do every now and then, but not enough that I know how to do it off the top of my head.

I’ve been moving the handful of excellent clients I still do freelance work for from Perch, my previous go-to tool for building a site, to new platforms. Most of them have moved to a statically built website where I act as Webmaster:

  • writing only the code that needs to be written
  • not putting in extra work to design the content management system (CMS) experience
  • ensuring content is marked up correctly
  • questioning updates where necessary
  • keeping them accountable for content updates

The place I usually start when rebuilding a site statically is to download the entire outgoing website as rendered HTML pages. I then host the site on a new platform (almost always Netlify) and systematically refactor it in a new framework (normally Eleventy). This allows me to work steadily over the course of a few weeks, which is important now that I’m not freelancing full time, so I can spread the work over evenings and weekends.

But how do you go about downloading a whole website? Turns out it’s quite easy using Wget.

Getting set up

First up, installing Wget. There are lots of ways to install it, but I used Homebrew:

brew install wget

I then opened Terminal and navigated to the directory I wanted to download the site into.

It’s worth noting that Wget bundles the download in its own folder, named after the host (www.example.com here), so the site’s root files and folders won’t be dumped into your current directory and mixed with whatever is already there. That also makes the download easy to throw away if you need to tweak the options and start again.
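A minimal sketch of that setup step, assuming a POSIX shell. The folder name site-archive is only a placeholder, and the check simply confirms that Wget is on the PATH before a long download starts:

```shell
# "site-archive" is a placeholder folder name; use whatever suits you.
mkdir -p site-archive
cd site-archive || exit 1

# Check that wget is on the PATH before starting a long download.
if command -v wget >/dev/null 2>&1; then
  wget --version | head -n 1
else
  echo "wget not found; try: brew install wget" >&2
fi
```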

Running Wget

Then it’s just a case of running Wget with a handful of options:

wget --recursive --domains=www.example.com --page-requisites --adjust-extension www.example.com

Here’s what’s going on with that command:

  • Download every page of the website (--recursive)
  • Don’t follow any links outside of the website (--domains=www.example.com)
  • Download all of the assets, like images, CSS, JavaScript, etc. (--page-requisites)
  • Add the .html extension to all HTML files (--adjust-extension), even if the website files don’t have an extension or use something else like .php
  • Finish with the URL to download (www.example.com)

It’s worth mentioning that there are shorthand versions for all of these; here’s how it would look:

wget -r -D www.example.com -p -E www.example.com

I don’t use Wget enough to commit those to memory, so I prefer the more descriptive method so that I know what’s going on without referencing the documentation.
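One possible way to keep the descriptive flags without retyping the host twice is a small shell function. This is only a sketch: archive_cmd is a hypothetical helper name, www.example.com is a placeholder, and the function prints the command instead of running it so that it can be checked first.

```shell
# Hypothetical helper: print the long-form wget command for a given host.
archive_cmd() {
  site="$1"
  echo wget --recursive "--domains=$site" --page-requisites --adjust-extension "$site"
}

# Dry run: shows the command without downloading anything.
archive_cmd www.example.com
```

Once the printed command looks right, paste it into the terminal to start the download.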

Run locally without a server

If you’d rather view the website on your own machine before uploading it to a server, you probably don’t want the extra hassle of running a server locally. There’s an option (--convert-links or -k) to rewrite all internal URLs so that they’re relative, rather than absolute or root relative, which allows you to simply open a downloaded file in your browser and navigate around:

wget --recursive --domains=www.example.com --page-requisites --convert-links --adjust-extension www.example.com
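A hedged sketch of opening the result, assuming the download ran from the current directory and that the home page was saved as index.html inside the folder named after the host (Wget’s default for a URL ending in a slash). The open command is macOS-specific:

```shell
# Assumed location of the downloaded home page; adjust if yours differs.
home_page="www.example.com/index.html"

if [ -f "$home_page" ]; then
  open "$home_page"   # macOS; on most Linux desktops xdg-open does the same job
else
  echo "No $home_page yet; run the download first" >&2
fi
```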

Downloading a specific directory

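To download a specific area of the website, just the blog, for example, add its path to the URL and add --no-parent (-np) so that Wget never climbs above that directory. Keep the trailing slash on the URL; without it, Wget treats blog as a file name rather than a directory, and --no-parent won’t limit the download to the blog:

wget --recursive --domains=www.example.com --page-requisites --convert-links --adjust-extension --no-parent www.example.com/blog/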
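Wget’s manual notes that, for HTTP, the trailing slash matters to --no-parent: www.example.com/blog/ is read as a directory, while www.example.com/blog is read as a file sitting in the site root. A small sketch of a guard for that, where with_trailing_slash is a hypothetical helper name:

```shell
# Hypothetical helper: make sure a directory URL ends in a slash.
with_trailing_slash() {
  case "$1" in
    */) printf '%s\n' "$1" ;;   # already ends in a slash, leave it alone
    *)  printf '%s/\n' "$1" ;;  # append the missing slash
  esac
}

start_url=$(with_trailing_slash www.example.com/blog)

# Dry run: print the command for review instead of running it.
echo wget --recursive --no-parent --page-requisites --convert-links --adjust-extension "$start_url"
```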

Downloading from multiple locations

If any assets are served from a different server/domain (maybe your images are on a CDN), add that domain to your --domains list and also pass --span-hosts (-H), because --domains on its own doesn’t let Wget leave the starting host:

wget --recursive --span-hosts --domains=www.example.com,exampleimages.cdn.com --page-requisites --convert-links --adjust-extension www.example.com
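When the list of hosts grows, building the comma-separated value by hand gets error-prone. Here is a rough sketch of one way to assemble it, where join_domains is a hypothetical helper name. It also includes --span-hosts, on the understanding from Wget’s manual that --domains does not enable host spanning by itself:

```shell
# Hypothetical helper: join all arguments with commas.
# The body runs in a subshell, so changing IFS here does not leak out.
join_domains() (
  IFS=","
  printf '%s\n' "$*"
)

domains=$(join_domains www.example.com exampleimages.cdn.com)

# Dry run: print the command for review instead of running it.
echo wget --recursive --span-hosts "--domains=$domains" --page-requisites --convert-links --adjust-extension www.example.com
```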

Windows URL compatibility

To make sure the downloaded file names work on Windows, add --restrict-file-names=windows, which escapes characters that Windows doesn’t allow in file names.

Loads more options

There are a load of options I haven’t mentioned, but I find the first command on this page usually gets me what I want.
