Stem_Cell
Tue, Nov 27 '12, 18:01
How to download a website easily and thoroughly
I think this is worth a post, because in this day and age you can't trust your beloved sites to still be there tomorrow. So it's good to know how to make a backup; if someone had made one of hypnochan, we'd all be a bit less sad about it.

So, there are several ways to make a comprehensive website backup. If you are on Windows, as most people are, you *can* give httrack a shot:
http://www.httrack.com/
...but honestly, I had issues with it. I consider myself a fairly tech-savvy person and still couldn't get it to work quite right, which is a shame, because it offers a GUI and is relatively simple.
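
If you do want to give its command-line mode a try anyway, something along these lines should produce a basic mirror (I haven't verified this exact invocation, and example.com is just a placeholder):

httrack "http://www.example.com/" -O "./example-mirror" "+*.example.com/*" -v

-O sets the output folder, the +filter keeps it on that domain, and -v just means verbose output.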

If you have a Linux, BSD or similar machine nearby (or even a virtual machine), things are much simpler and much better.
Just open a terminal/console, and run this:

wget -mk -w 1 -e robots=off -np http://yay.sleepimay.com/

Substitute your URL of choice. The URL given isn't a real one, but the site it stands in for took some 10 minutes to mirror.
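
In case the switches look cryptic, here's the same command again with each one spelled out (the lines starting with # are just comments, not part of the command):

# -m             mirror mode: recursive download with timestamping
# -k             convert links so the local copy browses fine offline
# -w 1           wait 1 second between requests (go easy on the server)
# -e robots=off  ignore the site's robots.txt
# -np            "no parent": never climb above the starting directory
wget -mk -w 1 -e robots=off -np http://yay.sleepimay.com/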

The result is the whole site contained in a folder you can compress and share, re-host, whatever. Of course it doesn't work for dynamic content (the content wouldn't be dynamic anymore), but it's fair as a mirror.

If you are running Windoze and still want that goodness, there are wget binaries compiled for Win32 which you can use; the command should be the same, I'm pretty sure.
http://users.ugent.be/~bpuype/wget/

Also, one upside of wget, and the reason I use it to mirror even simple websites (which httrack handles fine), is that you can take the cookies from your browser and make wget look EXACTLY like that browser, so you can be 100% certain that what-you-browse-is-what-you-get.

For example, this is it pretending to be Firefox:
wget -mk -w 1 --header='User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:16.0) Gecko/20100101 Firefox/16.0' --header='Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' --header='Accept-Language: en-US,en;q=0.5' --header='Accept-Encoding: gzip, deflate' --header='Proxy-Connection: keep-alive' --header='Cookie: key=value' -e robots=off -np http://example.com/dir/

(That is all one command line. The --header parameters tell it to send the same headers Firefox does, but probably just the user-agent one is enough, so try that first if it looks too long for you.)

Note that you'd have to change the "Cookie:" part to fit your particular case, by copying the cookies from your browser. It's common for a site to have some cookie that says you're over 18, for example, and without it you might just get a warning page and nothing else. Also, both of these wget examples have a switch (-np) that tells them not to crawl outside the starting directory. That means that if it's a chan and you only want to mirror /base/, that's all it will touch.
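
Just as a made-up illustration (the cookie names below are invented, yours will be different; copy the real ones from your browser's cookie manager or developer tools):

wget -mk -w 1 --header='User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:16.0) Gecko/20100101 Firefox/16.0' --header='Cookie: age_verified=1; PHPSESSID=d41d8cd98f00b204' -e robots=off -np http://example.com/dir/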

For the list of options, check the wget manual (type "man wget" in a console). There are other more complex tools (such as curl and pavuk), but wget is enough for most if not all of our use cases.

One more word: keep in mind this strains the server. Sharing is caring. If you share a mirror (use something like 7-Zip to archive it), someone else doesn't have to crawl the site again, which keeps the server and its owner happy.
(Of course there are cases like this one, where you can't share the fictional link's result here, but luckily it's just 85 fictional megabytes.)
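
For what it's worth, archiving the mirror folder is a one-liner with 7-Zip's command-line version (p7zip on Linux); the folder name here is just whatever wget created for the fictional site above:

7z a -mx=9 sitemirror.7z yay.sleepimay.com/

(a means "add to archive" and -mx=9 is maximum compression.)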

If something wasn't explained well, just ask.
Vanndril
Wed, Nov 28 '12, 06:05
That's....interesting. o.O
Good to know.
Stem_Cell
Wed, Nov 28 '12, 13:02
By the way, one site that's a complete bitch to crawl is the hypnopics-collective gallery. God, why they had to set up such horrendous gallery software is beyond me (a booru is so much better to browse!).

However, I once managed to crawl all the picture URLs, which means I had a list of links to ALL the pics (at the time) and could have automated a download for them, but it was so effing big. Gosh, just the HTML alone was what, hundreds of MBs? I gave up.
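
(If anyone feels braver than me: once you have such a list in a plain text file, say piclist.txt, something like

wget -w 1 -i piclist.txt -P pics/

should chew through it; -i reads the URLs from the file and -P dumps everything into pics/. The file and folder names are made up, obviously.)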
Vanndril
Wed, Nov 28 '12, 20:40
I don't blame you. XD

That gallery software feels so...incomplete. It's a pain to navigate, even through the UI. Even this incomplete booru version feels cleaner. XD
Anno1404
Thu, Nov 29 '12, 23:23
That's why they're changing the gallery software to a new one in January ;)
And I know one idiot who tried to download google.com! He had fun....once (XDXDXD)
Stem_Cell
Fri, Nov 30 '12, 10:01
Vanndril said:
That gallery software feels so...incomplete. It's a pain to navigate, even through the UI. Even this incomplete booru version feels cleaner. XD

Much cleaner! I mean, you open an image there, and it uses a goddamn popup, I hate that.

Anno1404 said:
That's why they're changing the gallery software to a new one in January ;)

Really? I'm looking forward to it, then!

Anno1404 said:
And I know one idiot who tried to download google.com! He had fun....once (XDXDXD)

Lol. The nice thing about that command I mentioned is that it won't cross domains or even go up in the hierarchy, so it's pretty safe to run :)