Saturday, January 7, 2012

Improving Nokogiri XPath operations through predictive hashing

A significant portion of my day job involves me writing large amounts of code to massage, manipulate and transform various different data sources from third parties into things that ultimately result in my "bread and butter", so to speak. But, its the weekend. What on earth am I doing writing about this now? Well, honestly, its because I lose sleep over things like ugly, unreliable or poorly performing code. In some situations, my code has one or more of those three attributes for reasons that aren't really under my control -- vendors. Other times, honestly, its because I am a perfectionist and my code is never good enough.

One bit of code that you could call "old faithful" is code that I use to parse XML files from a particular vendor. The files group data by year, cover 2002 until now, 2012, and are from 2 to 22M in size. My weapon of choice in this battle is Ruby 1.9.3 and Nokogiri. The majority of the operations that are performed against this data is searching for children of the root node that have a particular attribute. So, assuming we had a sample document like this:

<store>
  <food id="ID-2001-0001">
    <nutrition>
      <calories>100</calories>
    </nutrition>
  </food>
  <food id="ID-2001-1234"/>
</store>

The bulk of the searches would be things like looking for food with ID ID-2011-0123, or, in XPath, '//store/food[@id="ID-2011-0123"]'. In reality, the documents are considerably larger -- ~6000 entries in each document, each entry from a few hundred bytes to several kilobytes. And a batch run of results in an average of 500 lookups across basically all years.

Memory and CPU are not really a concern, but time is. While the documents are numerous, large and complex, it was never acceptable that the batch searches would take minutes or more. For a while, I was using the same "seat of the pants" tuning that shadetree mechanics use -- it sure feels faster! But, is it really faster?

From its initial inception, it was clear that opening the correct document, using XPath to locate the entry and then repeating this process was not going to work, because the code would quickly get OOM killed. So, I investigate some different strategies for optimizing these lookups:

  1. Caching the resulting Nokogiri documents, using a hash for each year after they were first opened. Subsequent lookups for that year would not have to incur reprocessing. XPath was still used within individual documents to locate entries.
  2. As a new year was encountered, opening the corresponding document and caching all of the entries for that year using the entry ID as a key in a hash, and the entry element itself as the value.
  3. Similar to the second approach, however *ALL* possible entries were cached. First, all possible entries were hashed using their predicted ID (ID-<year>-<4 digit sequence number>) as the key and nil as their value. Then, all valid IDs were parsed from the document we obtained and hashed using the approach in #2. The result is that all possible XML entries had a place in the hash.
  4. Similar to the first approach, but instead of using XPath, walking the document and selecting nodes based on simple comparison operations. This was a technique I found on Jud's blog

Now, to answer the burning question of which is faster, I utilized Ruby's Benchmark. Because the searches are often for IDs that don't exist, I sampled the results 10 times of searching for 100 random entries to searching for 100 only valid entries for each approach. The results were impressive, hopefully correct, and a little curious:

Approach User time System time Total time Real time
Random, Caching by year6.3800000.0400006.4200006.457234
Good, Caching by year42.3100000.05000042.36000042.502168
Random, Caching only valid entries74.1600000.290000 74.45000074.750306
Good, Caching only valid entries5.1900000.050000 5.2400005.272732
Random, Caching all entries9.0700000.0800009.1500009.194298
Good, Caching all entries5.8000000.0300005.8300005.865078
Random, using a walker14.9900000.09000015.08000015.143777
Good, using a walker55.7500000.06000055.81000055.998989

No comments: