A significant portion of my day job involves writing large amounts of code to massage, manipulate and transform data from various third parties into things that ultimately result in my "bread and butter", so to speak. But, it's the weekend. What on earth am I doing writing about this now? Well, honestly, it's because I lose sleep over things like ugly, unreliable or poorly performing code. In some situations, my code has one or more of those three attributes for reasons that aren't really under my control -- vendors. Other times, honestly, it's because I am a perfectionist and my code is never good enough.
One bit of code that you could call "old faithful" is the code I use to parse XML files from a particular vendor. The files group data by year, cover 2002 through the present (2012), and range from 2 MB to 22 MB in size. My weapon of choice in this battle is Ruby 1.9.3 and Nokogiri. The majority of the operations performed against this data are searches for children of the root node that have a particular attribute. So, assume we have a sample document like this:
```xml
<store>
  <food id="ID-2001-0001">
    <nutrition>
      <calories>100</calories>
    </nutrition>
  </food>
  <food id="ID-2001-1234"/>
</store>
```
The bulk of the searches are things like looking for the food entry with ID ID-2011-0123, or, in XPath, '//store/food[@id="ID-2011-0123"]'. In reality, the documents are considerably larger -- roughly 6000 entries in each document, with each entry ranging from a few hundred bytes to several kilobytes. A typical batch run results in an average of 500 lookups spread across essentially all years.
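For reference, a single naive lookup with Nokogiri might look something like this; the per-year file name and the helper are invented here for illustration, not taken from the production code:

```ruby
require 'nokogiri'

# Hypothetical layout: one document per year, e.g. "store-2011.xml".
def find_entry(year, entry_id)
  doc = Nokogiri::XML(File.read("store-#{year}.xml"))
  doc.at_xpath(%Q{//store/food[@id="#{entry_id}"]})
end

entry = find_entry(2011, 'ID-2011-0123')
puts entry['id'] if entry
```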
Memory and CPU are not really a concern, but time is. While the documents are numerous, large and complex, it was never acceptable for the batch searches to take minutes or more. For a while, I was using the same "seat of the pants" tuning that shade-tree mechanics use -- it sure feels faster! But is it really faster?
From its inception, it was clear that opening the correct document, using XPath to locate the entry, and then repeating this process for every lookup was not going to work, because the code would quickly get OOM killed. So, I investigated several different strategies for optimizing these lookups:
- Caching the parsed Nokogiri document for each year in a hash after it was first opened, so that subsequent lookups for that year would not have to incur reparsing. XPath was still used within individual documents to locate entries (sketched below).
- As a new year was encountered, opening the corresponding document and caching all of the entries for that year in a hash, using the entry ID as the key and the entry element itself as the value (sketched below).
- Similar to the second approach, except that *ALL* possible entries were cached. First, all possible entries were hashed using their predicted ID (ID-<year>-<4 digit sequence number>) as the key and nil as the value. Then, all valid IDs parsed from the document were filled in using the approach in #2. The result is that every possible XML entry had a place in the hash (also covered in the sketch of the second approach below).
- Similar to the first approach, but instead of using XPath, walking the document and selecting nodes with simple comparison operations. This was a technique I found on Jud's blog (sketched below).
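A minimal sketch of the first approach, under the same assumed per-year file naming: the parsed document for each year is cached in a hash, and XPath still does the per-document lookup.

```ruby
require 'nokogiri'

# Approach 1: cache the parsed document for each year; XPath does the lookup.
class YearDocCache
  def initialize
    @docs = {} # year string => Nokogiri::XML::Document
  end

  def find(entry_id)
    year = entry_id.split('-')[1]                     # "ID-2011-0123" => "2011"
    doc  = (@docs[year] ||= Nokogiri::XML(File.read("store-#{year}.xml")))
    doc.at_xpath(%Q{//store/food[@id="#{entry_id}"]})
  end
end
```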
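The second and third approaches replace the per-document XPath search with a plain Ruby hash keyed on entry ID; the third differs only in pre-seeding every predicted ID for a year with nil before filling in the entries that actually exist. A rough sketch of both, with the pre-seeding behind a flag (file names and the assumed 0000-9999 sequence range are mine):

```ruby
require 'nokogiri'

# Approach 2: index a year's entries by ID the first time that year is seen.
# Approach 3: same, but pre-seed every possible ID for the year with nil first.
class EntryCache
  def initialize(preseed = false)
    @preseed = preseed
    @entries = {} # entry ID => Nokogiri::XML::Element (or nil placeholder)
    @loaded  = {} # year string => true once that year's document is indexed
  end

  def find(entry_id)
    year = entry_id.split('-')[1]
    load_year(year) unless @loaded[year]
    @entries[entry_id]
  end

  private

  def load_year(year)
    if @preseed
      # Assumes the sequence number runs 0000-9999; adjust to the real range.
      (0..9999).each { |n| @entries[format('ID-%s-%04d', year, n)] = nil }
    end
    doc = Nokogiri::XML(File.read("store-#{year}.xml"))
    doc.xpath('//store/food').each { |node| @entries[node['id']] = node }
    @loaded[year] = true
  end
end
```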
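The fourth approach walks the element tree and compares attributes in Ruby rather than handing the query to XPath. This is my own rough reconstruction of the idea, not the code from Jud's blog:

```ruby
require 'nokogiri'

# Approach 4: walk the root's element children and compare attributes in Ruby
# instead of delegating the search to XPath.
def walk_for_entry(doc, entry_id)
  doc.root.element_children.find do |node|
    node.name == 'food' && node['id'] == entry_id
  end
end
```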
Approach | User time (s) | System time (s) | Total time (s) | Real time (s) |
---|---|---|---|---|
Random, Caching by year | 6.380000 | 0.040000 | 6.420000 | 6.457234 |
Good, Caching by year | 42.310000 | 0.050000 | 42.360000 | 42.502168 |
Random, Caching only valid entries | 74.160000 | 0.290000 | 74.450000 | 74.750306 |
Good, Caching only valid entries | 5.190000 | 0.050000 | 5.240000 | 5.272732 |
Random, Caching all entries | 9.070000 | 0.080000 | 9.150000 | 9.194298 |
Good, Caching all entries | 5.800000 | 0.030000 | 5.830000 | 5.865078 |
Random, using a walker | 14.990000 | 0.090000 | 15.080000 | 15.143777 |
Good, using a walker | 55.750000 | 0.060000 | 55.810000 | 55.998989 |
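The user/system/total/real columns above match the format that Ruby's standard Benchmark library prints. For reference, a minimal harness along these lines (with a placeholder ID list and the year-cache sketch from above) produces timings in that shape:

```ruby
require 'benchmark'

# Hypothetical driver: time ~500 lookups against one of the caches sketched above.
ids = File.readlines('lookup-ids.txt').map(&:strip)

Benchmark.bm(35) do |bm|
  bm.report('Good, Caching by year') do
    cache = YearDocCache.new
    ids.each { |id| cache.find(id) }
  end
end
```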