Advanced Web Comic Leeching

I’d pretend that I have a more productive use for these ideas, but I don’t. I wanted to nab all of Order of the Stick and xkcd because I hate clicking through pages and then scrolling down for long images only to scroll back up for the next one. I figured that if I had them all, then I could just open up a batch, set Preview’s window once, and then thumb through them offline. To balance the karma from such a leech event I’ll just get some things from each comic’s shop (regexp/science) and the universe will still be happy and sunny and annoying and stuff.

I started with OoTS and went to the first comic. Looking at the URL for the comic image itself, I saw that it was a sequential number, one for each comic. Well, that’s remarkably easy as I then used curl to download them with a move like: curl -O http://that.com/ic/images/comic[000-999].jpg. That tells curl to grab a thousand images in sequence and requires no actual thought.
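For the curious, here is a rough Ruby sketch of what that bracketed range expands to. The helper name `sequential_urls` is made up for illustration, and the URL is the same placeholder used above.

```ruby
# Hypothetical helper: builds zero-padded, sequentially numbered URLs,
# roughly what curl's [000-999] range expands to.
def sequential_urls(prefix, suffix, first, last, width)
  (first..last).map { |n| prefix + n.to_s.rjust(width, "0") + suffix }
end

urls = sequential_urls("http://that.com/ic/images/comic", ".jpg", 0, 999, 3)
puts urls.length
puts urls.first
puts urls.last
```

The zero padding matters: without it the list would start at comic0.jpg rather than comic000.jpg.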

However, the last hundred or so didn’t come through like that, so I went to the page and pondered why. When I checked the URL of the image to see if it had moved, I realized that someone was on to people like me. It’s in the same place, but it’s been named with a random hash for a filename. Good job, you just made this a challenge.

Bring Out the Holy Hand Grenade!

While the author changed up the filename, he was still doing some things that are predictable: uploading to generally the same place and then putting that link on a page dedicated to that one graphic, linked from a master page. So, all I needed was a tool that could take a URL, get all the links off of it, check all the linked pages for images, filter out the images I cared about, and then download them.

Curiously, Automator does that well. Using only Safari’s innate web-foo, I was able to draft up a set of actions to pull this off. I’ll show you how I made it so you can see a little of how to troubleshoot this kind of procedure.

Get a List of Pages

Each image is on its own page, so we need a list of them. To do that, we resort to any and all indexes the comic has. In this case, it’s just one URL, but larger/older ones may have several index pages you’d add in this step. Take the URL and use the “Get Specified URLs” action.

Get a List of Links

Now add in the “Get Link URLs from Webpages” action from Safari and make sure its “Only return URLs in the same domain as the starting page” checkbox is checked. This gives you a list of all the same-domain links found on the URLs you gave it.
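If you would rather see this step as code, the following is a minimal Ruby sketch of the same idea, not what Automator actually does internally. The helper names and the example.com URLs are invented, the regex is a crude stand-in for a real HTML parser, and the `same_domain_only` flag is my guess at what the action's checkbox does.

```ruby
require "uri"

# Hypothetical helper: resolve an href against the page it appeared on.
def absolute_url(page_url, href)
  URI.join(page_url, href).to_s
rescue URI::Error
  nil # skip hrefs that are not valid URLs
end

# Hypothetical helper: every link on a page, as absolute URLs.
def link_urls(html, page_url, same_domain_only: true)
  hrefs = html.scan(/<a\s[^>]*?href\s*=\s*["']([^"']+)["']/i).flatten
  urls = hrefs.map { |href| absolute_url(page_url, href) }.compact
  if same_domain_only
    host = URI.parse(page_url).host
    urls = urls.select { |u| URI.parse(u).host == host }
  end
  urls.uniq
end

html = '<a href="strip0001.html">1</a> <a href="/about.html">About</a> ' \
       '<a href="http://other.example.org/ad">Ad</a>'
puts link_urls(html, "http://example.com/comics/index.html")
```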

Now, we could keep going, but we’d like to know what URLs it found here. An easy way to do this is to put the output on the screen somehow, and TextEdit happens to have an action that takes most any input and makes a document out of it. Drop in “New TextEdit Document” as action #3 and then give it a run.

Narrow the List

That’s a lot of links. More than that, it’s a lot of links to stuff we don’t care about. Rather than cleaning up the output from this workflow, let’s clean up the input a little. Safari has an action for us here, too. Drag in “Filter URLs” above the TextEdit action and set it to “Entire URL” “begins with”. Then take the part of the comic-page URLs that stays the same, right up to where they start changing, and put that in as the value.

Give it a run and see if it cleaned out the trash URLs.
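The "begins with" filter boils down to a prefix test. Here is a small Ruby sketch of it; the helper name and the example.com URLs are placeholders, not the real comic's addresses.

```ruby
# Hypothetical helper: keep only URLs that begin with the given prefix.
def filter_by_prefix(urls, prefix)
  urls.select { |url| url.start_with?(prefix) }
end

links = [
  "http://example.com/comics/strip0001.html",
  "http://example.com/comics/strip0002.html",
  "http://example.com/about.html",
  "http://example.com/store/",
]
puts filter_by_prefix(links, "http://example.com/comics/strip")
```

The prefix stops right where the URLs start to differ (the strip number here), so every comic page passes and everything else is dropped.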

Get the Image URLs

To download the images, we need their URLs, and Safari is there for us again with “Get Image URLs from Webpage”. Drag that in before the TextEdit action. Run the workflow once to see what the URLs look like and then drag in a “Filter URLs” below it to keep only the URLs that match the comic. Tweak this until you get just a list of comic URLs in the TextEdit document.
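In code, this step is an image-tag scrape followed by a filter. The sketch below is only an approximation in Ruby: the helper names, the example.com URLs, and the "/comics/images/" path fragment are all assumptions, and a regex is a poor substitute for a real HTML parser.

```ruby
require "uri"

# Hypothetical helper: absolute URLs of every <img src="..."> on a page.
def image_urls(html, page_url)
  srcs = html.scan(/<img\s[^>]*?src\s*=\s*["']([^"']+)["']/i).flatten
  srcs.map { |src| URI.join(page_url, src).to_s }
end

# Hypothetical helper: keep only images whose URL contains the fragment
# that identifies the comic's image folder.
def comic_images(urls, fragment)
  urls.select { |url| url.include?(fragment) }
end

html = '<img src="/images/logo.gif"> ' \
       '<img alt="comic" src="/comics/images/ab12cd34.gif">'
found = image_urls(html, "http://example.com/comics/strip0001.html")
puts comic_images(found, "/comics/images/")
```

Because the filter keys on the folder rather than the filename, it keeps working even when the files are named with random hashes.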

Unleash the Hounds

When the list of image URLs looks good, remove the TextEdit action and add a “Download URLs” action to the end in its place. Set the save destination as you desire, save the workflow in case of disaster, and then let it rip. Safari will now take that clean list of image links and save them to the folder you designated.
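A bare-bones Ruby version of the download step might look like the sketch below. The helper names, folder, and URL are invented for illustration, and it assumes plain HTTP links that do not redirect, since `Net::HTTP.get` will not follow redirects.

```ruby
require "net/http"
require "uri"
require "fileutils"

# Hypothetical helper: where a URL's file lands inside dest_dir.
def local_path(url, dest_dir)
  File.join(dest_dir, File.basename(URI.parse(url).path))
end

# Hypothetical helper: fetch each URL and write it into dest_dir.
# Assumes direct image links; redirects are not followed.
def download_all(urls, dest_dir)
  FileUtils.mkdir_p(dest_dir)
  urls.each do |url|
    File.binwrite(local_path(url, dest_dir), Net::HTTP.get(URI.parse(url)))
  end
end

# Only the path calculation runs here; call download_all with a real list
# of image URLs to actually fetch anything.
puts local_path("http://example.com/comics/images/ab12cd34.gif", "comics")
```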

Related Article

If you choose to use this for images other than comics, you may also be interested in another article I wrote here a while back… :)

About Adam Knight
Author Biography

Adam Knight is one of the founders of Mac Geekery and is a geek at heart. Programmer by day, hacker by night, his daily life revolves around the Macintosh platform, which he has used and programmed for since the early days of System 7, when his LCII replaced his Apple //c.

In between tech jobs, he’s managed to learn the basics of any web hacker: PHP, MySQL, Perl, Apache, Linux, *BSD, and the intricacies of ./configure --prefix=~/bombshelter/. Today, codepoet is concentrating on blogging again, writing some software for the Mac by himself (including Notae) and for his company (such as Switchblade), and has a few other toys coming out soon.

Bug him over AIM or email [link fixed].

Yes, comics, those are the kinds of graphics i’m after. <wink wink>

Thanks Adam. Good tip! I’ll try it out tonight.

Adam Knight:

I really did create it for the reason I said, but after creating it my mind did drift a little to its other possible uses and that would be where the related article link came in. ;)

I should really play with Automator more… I did essentially the same thing, but with PHP and about 2 hours of work. :(

@param string $url the url
@param string $path the local path
@param array $elems an array of tag names to check
@param array $qualifiers limit the urls based on path elements, file prefixes, extensions, etc
@param boolean $preserve_path preserve the same folder structure

load the dom
get elements by tag name for each $elems
check each url against each $qualifiers – if valid push to array
foreach the array curling the files

It took a while, but I got what I wanted… muahahaha. I think I would rather have used Automator; it probably would have taken me 30 minutes or less to set up. Live and learn.

Adam Knight:

It would have been faster, yes, but your solution allows for multiple criteria for selecting the image files to fetch. It’s a bit more powerful. :)

This is cool, i didn’t know Automator could do that. I mostly use Ruby+Mechanize for any content grabbing. You can parse the site easily, send forms etc. This for example grabs all xkcd comics:

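require 'rubygems'
require 'mechanize'

# Mechanize 1.0 and later name the class Mechanize (formerly WWW::Mechanize).
agent = Mechanize.new

base_url = "http://xkcd.com"
save_path = "/Users/rob/Desktop/xkcd/"
archive = agent.get(base_url + "/archive")

# Turn a comic title into a safe filename: lowercase, underscores, no punctuation.
def filenameize(s)
  s.downcase.gsub(" ", "_").gsub(/[^a-z0-9_]/, "").squeeze("_")
end

archive.links.each do |link|
  # Comic pages were named like /c123.html when this was written.
  next unless link.href && link.href.match(/\/c[0-9]+\.html/)
  page = agent.get(base_url + link.href)
  # The comic was the second image inside the ".s" container.
  img = page.search(".s img")[1]['src']
  file = agent.get(img)
  # Use the MIME subtype (e.g. "png" from "image/png") as the file extension.
  ext = file.response['content-type'][/\/(.*)/, 1]
  file.save_as(save_path + filenameize(link.text) + "." + ext)
  puts link.text + "... saved"
end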

Or you could write a grabber that harvests Yahoo Groups for images, for example :>

Mechanize Doc

I have been looking for a way to do something like this for a long time (also a fan of the Order of the Stick). Unfortunately, when I run the workflow, it just stops for some reason after anywhere from 40 to 100 files. I have tried the same setup for other sites, but they all have the same problem to varying degrees.

What may be happening is that you are bumping up against the bandwidth limitation for your IP address that some sites impose. I have also seen this happen.
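If that is the cause, spacing the requests out might help. This is only a guess, and the Ruby sketch below is a generic throttle rather than a fix for the Automator workflow. The helper name, delay, and URLs are made up, and the `pause` parameter exists only so the example can run without actually sleeping.

```ruby
# Hypothetical helper: yields each URL in turn, pausing between requests.
# `pause` is injectable so the sketch can run without really sleeping.
def each_throttled(urls, delay_seconds, pause: ->(s) { sleep(s) })
  urls.each_with_index do |url, i|
    pause.call(delay_seconds) if i > 0 # no wait before the first request
    yield url
  end
end

fetched = []
pauses = []
urls = ["http://example.com/a.gif", "http://example.com/b.gif"]
each_throttled(urls, 2, pause: ->(s) { pauses << s }) do |url|
  fetched << url # a real run would download the file here
end
puts "#{fetched.length} fetched, #{pauses.length} pause(s)"
```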

I’ve gone through this, and it grabs all the images very nicely.

My problem is that they don’t seem to be in the right order in the folder… I want to read through Order of the Stick on my way home, but if they aren’t in order, it’d be a pain. Any suggestions?

If you use this for xkcd, you’ve missed the real punchline to every comic, which is the mouseover tooltip.
