Tuesday, January 29, 2008

Things that keep me up at night..

I've tried to cut back on the amount of personal stuff that I publish here, and limit it just to professional topics. So, no more discussions about politics, what I ate for breakfast or drunken debauchery.

This, however, rides the fine line between personal and professional. The following things make me lose sleep:

  1. Everyone's favorite layer 2 protocol, Ethernet, puts the destination MAC address before the source MAC address. I've heard and speculated that this is for speed: instead of having to jump ahead another 6 bytes, devices with limited processing power can read in just the first 6 bytes and make a forwarding decision right away (see the first sketch after this list). Why, though, does this practice only exist at this layer? Sure, "source then destination" is burned into our brains, but what performance gains could have been had if, say, the destination IP address had been placed earlier in the IPv4 header?
  2. Why are UDP and TCP checksums so wildly different from, say, IP or ICMP checksums? A checksum exists so that a protocol that decides to implement it can be sanity checked -- essentially, has this layer and the stuff it is responsible for been damaged in transit? IPv4, not caring about the payload, only checksums its header fields. It is up to higher-level protocols to check themselves before they wreck themselves. ICMP takes a slightly more thorough approach and checksums both its header fields and its payload. UDP and TCP, however, take a grossly different approach -- in addition to computing a checksum over their respective headers and payloads, these two also fold in a pseudo-header of IP-layer information: the source and destination IP addresses, the protocol number and the segment length. WTF? A protocol should be able to compute its own checksum using whatever information is available to it: its header fields and whatever is in its payload, if anything. Data lower in the stack is simply not available in any sort of simplistic fashion, yet TCP and UDP decide to be special (see the second sketch after this list).
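To make the first point concrete, here is a minimal Ruby sketch of why destination-first matters: a forwarding device can decide where a frame goes after reading only its first 6 bytes. The frame bytes below are made up purely for illustration.

# A made-up 14-byte Ethernet header (dst MAC, src MAC, EtherType), plus payload.
frame  = ["001122334455" + "66778899aabb" + "0800"].pack("H*")
frame += "payload goes here"

# Because the destination comes first, a switch can make its forwarding
# decision after reading only the first 6 bytes of the frame.
dst_mac = frame[0, 6].unpack("C6").map { |b| "%02x" % b }.join(":")
src_mac = frame[6, 6].unpack("C6").map { |b| "%02x" % b }.join(":")

puts "forward on #{dst_mac} (source #{src_mac} is not needed yet)"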
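And to make the second point concrete, here is a rough Ruby sketch of the UDP checksum computation, pseudo-header and all. The addresses, ports and payload are made up; the algorithm is the standard Internet one's-complement checksum.

# Standard Internet checksum: one's-complement sum of 16-bit words.
def internet_checksum(data)
  data = data + "\0" if data.length.odd?                   # pad to a 16-bit boundary
  sum = data.unpack("n*").inject(0) { |acc, word| acc + word }
  sum = (sum & 0xffff) + (sum >> 16) while sum > 0xffff    # fold the carries back in
  ~sum & 0xffff
end

# UDP checksum: note the pseudo-header of IP-layer data that gets prepended.
def udp_checksum(src_ip, dst_ip, src_port, dst_port, payload)
  udp_len = 8 + payload.length
  pseudo  = src_ip + dst_ip + [0, 17, udp_len].pack("CCn") # 17 = UDP protocol number
  header  = [src_port, dst_port, udp_len, 0].pack("n4")    # checksum field zeroed
  internet_checksum(pseudo + header + payload)
end

# Made-up example values.
src_ip = [192, 168, 1, 10].pack("C4")
dst_ip = [10, 0, 0, 1].pack("C4")
printf("udp checksum: 0x%04x\n", udp_checksum(src_ip, dst_ip, 53, 1234, "hello"))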

Sunday, January 20, 2008

Hawler, the Ruby crawler

Over the years I've thrown together various bits of code that have crawling functionality built into them. There was termite, used to find backup copies, renames or common temporary locations of files across your entire web site. There was indexfinder, used to crawl your site and find anything that looked like a directory listing. There was also htcomment, used to ferret out all of the comments found in your HTML.

These tools were all fairly well tested and worked quite well, but every time I dusted off the code to fix a bug or add functionality, my CS books would scowl at me. The core crawling code was literally cut and pasted between tools. The problem with this is obvious -- a bug or missing bit of functionality in the core crawler code had to be fixed in numerous places. Design at its worst.

Starting maybe a month ago I decided to fix this problem. The result is Hawler, a Ruby gem that encapsulates what I deem to be core web crawling functionality into an easy-to-use package. I can now focus on writing the code that is unique to each particular task and not have to worry as much about the crawler bits. Its usage is quite simple, as described in the README.
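To give a feel for what building on Hawler looks like, here is a rough sketch of a tiny tool. The constructor arguments, handler signature and option names below are assumptions for the sake of illustration, not a quote of the README -- see the README for the real interface.

require 'rubygems'
require 'hawler'

# Handler run for each page fetched: print any URI whose body mentions "TODO".
# (The |uri, response| signature is an assumption, not taken from the README.)
handler = lambda do |uri, response|
  puts uri if response.body =~ /TODO/
end

crawler = Hawler.new("www.example.com", handler)
crawler.recurse = 1   # assumed option: follow links one hop deep
crawler.start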

As an example of Hawler's usage, I've put together a few tools that I've found quite useful so far. First is htgrep. It is exactly what it sounds like: grep for the web. How many times does the word "shot" occur within 1 hop of www.latimes.com? Let's find out, but sleep 1 second between each request (got to play nice) and use HEAD requests (-p) to only harvest links from pages that are likely to have them in the first place:

$ htgrep shot www.latimes.com -r 1 -s 1 -p | wc -l
43

Only 43? A peaceful day in LA! What about the distribution of HTTP error codes on spoofed.org? Use htcodemap:

$ htcodemap spoofed.org -r
Done -- codemap is spoofed.org-codemap.png

The result? Not too shabby:

What about drawing ridiculous maps of relationships within a website? Well, assuming you have enough RAM (blame graphviz/dot, not me!), enjoy htmap. As an example, here is a fairly deep crawl and mapping of spoofed.org:

$ htmap spoofed.org -r 2
Done, map is spoofed.org-map.png, spoofed.org-map.dot

The image is here.

I expect that a lot of cool tools will be born from Hawler, and I'll be sure to link and post them as they turn up. Until then, enjoy!

Comments and suggestions are very much welcome.