Showing posts with label ruby.

Saturday, January 7, 2012

Improving Nokogiri XPath operations through predictive hashing

A significant portion of my day job involves writing large amounts of code to massage, manipulate and transform data from various third parties into things that ultimately result in my "bread and butter", so to speak. But it's the weekend. What on earth am I doing writing about this now? Honestly, it's because I lose sleep over things like ugly, unreliable or poorly performing code. In some situations, my code has one or more of those three attributes for reasons that aren't really under my control -- vendors. Other times, honestly, it's because I am a perfectionist and my code is never good enough.

One bit of code that you could call "old faithful" is the code I use to parse XML files from a particular vendor. The files group data by year, cover 2002 through now, 2012, and range from 2 to 22 MB in size. My weapon of choice in this battle is Ruby 1.9.3 and Nokogiri. The majority of the operations performed against this data are searches for children of the root node that have a particular attribute. So, assuming we had a sample document like this:

<store>
  <food id="ID-2001-0001">
    <nutrition>
      <calories>100</calories>
    </nutrition>
  </food>
  <food id="ID-2001-1234"/>
</store>

The bulk of the searches would be things like looking for the food entry with ID ID-2011-0123, or, in XPath, '//store/food[@id="ID-2011-0123"]'. In reality, the documents are considerably larger -- roughly 6000 entries in each document, with each entry running from a few hundred bytes to several kilobytes -- and a batch run results in an average of 500 lookups spread across basically all years.
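
For reference, the naive lookup looks roughly like this (the per-year file naming here is my own stand-in, not the vendor's):

require 'nokogiri'

# Naive approach: open the year's document and run the XPath query on every
# lookup. The food-<year>.xml naming is a stand-in for illustration.
def find_entry(id)
  year = id.split('-')[1]
  doc = Nokogiri::XML(File.open("food-#{year}.xml"))
  doc.at_xpath("//store/food[@id='#{id}']")
end

find_entry('ID-2011-0123')   # => Nokogiri::XML::Element or nil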

Memory and CPU are not really a concern, but time is. While the documents are numerous, large and complex, it was never acceptable for the batch searches to take minutes or more. For a while, I was using the same "seat of the pants" tuning that shade-tree mechanics use -- it sure feels faster! But is it really faster?

From its inception, it was clear that opening the correct document, using XPath to locate the entry and then repeating this process for every lookup was not going to work, because the code would quickly get OOM killed. So, I investigated some different strategies for optimizing these lookups:

  1. Caching the parsed Nokogiri document for each year in a hash after it was first opened. Subsequent lookups for that year would not have to incur reparsing. XPath was still used within individual documents to locate entries.
  2. As a new year was encountered, opening the corresponding document and caching all of the entries for that year using the entry ID as a key in a hash, and the entry element itself as the value.
  3. Similar to the second approach, except that *ALL* possible entries were cached. First, every possible ID (ID-<year>-<4 digit sequence number>) was inserted into the hash with nil as its value. Then, all valid IDs parsed from the document were cached as in #2. The result is that every possible XML entry had a place in the hash (a sketch of this approach follows the list).
  4. Similar to the first approach, but instead of using XPath, walking the document and selecting nodes based on simple comparison operations. This was a technique I found on Jud's blog.
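
Here is a rough sketch of the third approach, the predictive hashing of the title. The file naming and the 0..9999 sequence range are assumptions for illustration; the real code obviously knows both.

require 'nokogiri'

# Approach 3: the first time a year is seen, seed the cache with every ID that
# could possibly exist for it (ID-<year>-<4 digit sequence>) pointing at nil,
# then overwrite the slots for entries that actually exist in the document.
# File naming and the 0..9999 range are illustrative assumptions.
class EntryCache
  def initialize
    @cache  = {}
    @loaded = {}
  end

  def [](id)
    year = id.split('-')[1]
    load_year(year) unless @loaded[year]
    @cache[id]   # nil means a plausible ID with no corresponding entry
  end

  private

  def load_year(year)
    (0..9999).each { |n| @cache[format('ID-%s-%04d', year, n)] = nil }
    doc = Nokogiri::XML(File.open("food-#{year}.xml"))
    doc.xpath('//store/food').each { |node| @cache[node['id']] = node }
    @loaded[year] = true
  end
end

entries = EntryCache.new
entries['ID-2011-0123']   # => Nokogiri::XML::Element or nil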

Now, to answer the burning question of which is faster, I utilized Ruby's Benchmark. Because the searches are often for IDs that don't exist, for each approach I sampled the results 10 times, comparing searches for 100 random (mostly nonexistent) entries against searches for 100 known-valid entries; a sketch of the harness follows the table. The results were impressive, hopefully correct, and a little curious:

Approach                            User time   System time   Total time    Real time
Random, Caching by year              6.380000      0.040000     6.420000     6.457234
Good, Caching by year               42.310000      0.050000    42.360000    42.502168
Random, Caching only valid entries  74.160000      0.290000    74.450000    74.750306
Good, Caching only valid entries     5.190000      0.050000     5.240000     5.272732
Random, Caching all entries          9.070000      0.080000     9.150000     9.194298
Good, Caching all entries            5.800000      0.030000     5.830000     5.865078
Random, using a walker              14.990000      0.090000    15.080000    15.143777
Good, using a walker                55.750000      0.060000    55.810000    55.998989
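
For the curious, the numbers above came out of a harness that looked more or less like this. The lookup method and ID list below are trivial stand-ins so the sketch runs on its own; the real code swaps in one of the four approaches above.

require 'benchmark'

# Stand-ins so the harness is self-contained; replace with a real approach.
VALID_IDS = (0..9999).map { |n| format('ID-2011-%04d', n) }

def lookup(id)
  VALID_IDS.include?(id) ? id : nil   # stand-in for an actual lookup strategy
end

def random_id
  format('ID-%d-%04d', rand(2002..2012), rand(0..9999))
end

Benchmark.bm(16) do |bm|
  10.times do
    bm.report('random lookups') { 100.times { lookup(random_id) } }
    bm.report('good lookups')   { VALID_IDS.sample(100).each { |id| lookup(id) } }
  end
end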

Saturday, November 28, 2009

Racket 1.0.6 Released

Over the Thanksgiving holiday, and thanks to the fact that I've been trapped indoors for two weeks, I've made some major improvements to Racket, released as version 1.0.6. For those not in the know, Racket is a Ruby gem used for reading, writing and handling raw packets in an intuitive manner. Between 1.0.2 and 1.0.6 there have been countless changes, including but not limited to:
  • Full ICMPv6 support
  • Much improved IPv6 support, thanks largely to Daniele Bellucci
  • Revamped, more efficient ICMP support (basically copied all the cool things from ICMPv6)
  • All encoding/decoding classes moved under their respective layer in Racket::L3, etc. (see the sketch after this list)
  • Large documentation, test and example improvements
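
To give a flavor of the reorganized layout, here is a rough sketch of building and sending a UDP datagram. The class and method names are from memory and may not match the released API exactly; the published docs and examples are authoritative.

require 'racket'

# Rough sketch only -- names are from memory and may differ from the released
# API; consult the gem's docs and examples for the real thing.
r = Racket::Racket.new
r.iface = 'eth0'

r.l3 = Racket::L3::IPv4.new
r.l3.src_ip   = '192.168.0.10'
r.l3.dst_ip   = '192.168.0.1'
r.l3.protocol = 17                    # UDP

r.l4 = Racket::L4::UDP.new
r.l4.src_port = 12345
r.l4.dst_port = 53
r.l4.payload  = 'hello'
r.l4.fix!(r.l3.src_ip, r.l3.dst_ip)   # fill in length/checksum

r.sendpacket
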
So, as usual, `gem install --source http://spoofed.org/files/racket racket` to install it and then take a stroll through the docs and examples. Enjoy!

Sunday, March 2, 2008

Racket -- Ruby Raw Packet Fun

This is one of those projects that I've been sitting on for a good 6+ months. Only over the last 2-3 have things really started to come together. I am happy to release Racket, a Ruby gem designed for crafting and analyzing raw packets.

Towards the end of Racket's initial development, I caught wind of Scruby, since that is what Metasploit 3 uses for much (most?) of its raw packet duties. In the TMTOWTDI spirit, I kept up development, and I actually think that Racket's purpose is a bit different from that of Scruby.

Installation is fairly simple:

gem install --source http://spoofed.org/files/racket racket

Documentation and examples are published but need some touching up. Among the more amusing/useful examples are:

  • cdp-spew: exactly what it sounds like. Creates and floods the network with random Cisco Discovery Protocol (CDP) packets
  • hsrp_takeover: passively listens for and actively performs "takeovers" for all discovered Hot Standby Router Protocol (HSRP) instances
  • tcp2udp: Listens for any TCP traffic and turns the packet back around, sending it back at the source as a UDP datagram. No point.

Racket requires that you have Joel VanderWerf's BitStruct and Marshall Beddoe's PcapRub installed.

Enjoy! Comments or suggestions are welcomed.

Sunday, January 20, 2008

Hawler, the Ruby crawler

Over the years I've thrown together various bits of code that have crawling functionality built into them. There was termite, used to find backup copies, renames or common temporary locations across your entire web site. There was indexfinder, used to crawl your site and find anything that looked like a directory listing. There was also htcomment, used to ferret out all of the comments found in your HTML.

These tools were all fairly well tested and worked quite well, but every time I dusted off the code and fixed a bug or added functionality, my CS books would scowl at me. The core crawling code was literally cut and pasted between tools. The problem with this is obvious -- a bug or missing bit of functionality in the core crawler code had to be fixed in numerous places. Design at its worst.

Starting maybe a month ago, I decided to fix this problem. The result is Hawler, a Ruby gem that encapsulates all of what I deem to be core web crawling functionality into an easy to use package. This means I can now focus on writing the code that is unique to each particular task and not have to worry as much about the crawler bits. Its usage is quite simple, as described in the README.

As an example of Hawler's usage, I've put together two tools that I've found quite useful so far. First is htgrep. It is exactly what it sounds like: grep for the web. How many times does the word "shot" occur within 1 hop of www.latimes.com? Let's find out, but sleep 1 second between each request (got to play nice) and utilize HEAD (-p) to only harvest links from pages that are likely to have them in the first place:

$ htgrep shot www.latimes.com -r 1 -s 1 -p | wc -l
43

Only 43? A peaceful day in LA! What about the distribution of HTTP error codes on spoofed.org? Use htcodemap:

$  htcodemap spoofed.org -r
Done -- codemap is spoofed.org-codemap.png

The result? Not too shabby:

What about drawing ridiculous maps of relationships within a website? Well, assuming you have enough RAM (blame graphviz/dot, not me!), enjoy htmap. As an example, here is a fairly deep crawl and mapping of spoofed.org:

$ htmap spoofed.org -r 2
Done, map is spoofed.org-map.png, spoofed.org-map.dot

The image is here.

I expect that a lot of cool tools will be born from Hawler, and I'll be sure to link and post them as they turn up. Until then, enjoy!

Comments and suggestions are very much welcome.