Ruby XML Parsing Benchmarks

April 23rd, 2008 Misc

Benchmark. Benchmark. Benchmark.

I had heard it a million times.

So rather than blindly trust that Hpricot was faster than REXML for my next project, I whipped up a quick benchmark.

$ ruby test/benchmarks/xml_parsing.rb 
Rehearsal -------------------------------------------
REXML     3.300000   1.190000   4.490000 (  4.637418)
Hpricot   9.010000   3.370000  12.380000 ( 12.645412)
--------------------------------- total: 16.870000sec

              user     system      total        real
REXML     3.320000   1.100000   4.420000 (  4.456072)
Hpricot   8.870000   3.340000  12.210000 ( 12.438566)

Wow - at least with the type of XML data I'll be working with, REXML is about 2.8x faster (4.46s vs 12.44s real time). See the full benchmark code. See update below...
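For anyone unfamiliar with the two-pass output above: it comes from Benchmark.bmbm in Ruby's standard library, which runs every block once as a rehearsal and then again for the timings that count, calling GC.start before each real measurement. Here is a minimal sketch of that structure. The two string-building workloads, WORDS and TIMINGS are hypothetical stand-ins of mine so the snippet runs without any XML library; they are not the parsers from my benchmark.

```ruby
require 'benchmark'

# Hypothetical stand-in workloads; the real benchmark parsed XML here.
TIMES = 10
WORDS = Array.new(5_000) { |i| "item#{i}" }

# Builds the result with +=, which allocates a new string on every step.
def join_with_plus(words)
  out = ''
  words.each { |w| out += w }
  out
end

# Builds the result with <<, which appends to the same string in place.
def join_with_append(words)
  out = +''
  words.each { |w| out << w }
  out
end

# bmbm prints a "Rehearsal" table first, then the real table with the
# user, system, total and real (wall-clock) columns.
TIMINGS = Benchmark.bmbm do |x|
  x.report('plus')   { TIMES.times { join_with_plus(WORDS) } }
  x.report('append') { TIMES.times { join_with_append(WORDS) } }
end

# bmbm returns one Benchmark::Tms per report, in the order they were declared.
TIMINGS.each { |tms| puts format('real: %.6f', tms.real) }
```

The numbers to compare are the ones in the second table; the rehearsal pass is only there to warm things up.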

P.S. libxml claims to be much faster than REXML, but I found it poorly documented and very unstable (at least on Leopard, it repeatedly crashed IRB with bus errors - yikes). If anyone else has had better luck with libxml, drop a note in the comments...

Update 4-24-08

Lee commented with a nice diff showing a much cleaner way to use Hpricot, which makes it about 2.8x faster than REXML (1.68s vs 4.72s real time)!

$ ruby test/benchmarks/xml_parsing.rb 
Result are identical
Rehearsal -------------------------------------------
REXML     3.370000   1.160000   4.530000 (  4.712390)
Hpricot   1.280000   0.390000   1.670000 (  1.703311)
---------------------------------- total: 6.200000sec

              user     system      total        real
REXML     3.360000   1.090000   4.450000 (  4.716999)
Hpricot   1.270000   0.380000   1.650000 (  1.684723)

Here is the new benchmark code.

Thanks Lee!
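The other thing Lee's diff adds is a sanity check: each parser is run once and the returned arrays are compared before anything is timed, so a fast but wrong implementation gets caught. Below is a rough sketch of that idea. To keep it runnable with REXML alone, I have swapped Hpricot for a second REXML approach (plain element traversal); SAMPLE_XML and both method names are made up for illustration, with element names borrowed from the diff.

```ruby
require 'rexml/document'

# Hypothetical sample data; only the element names come from the diff.
SAMPLE_XML = <<~DOC
  <QBXML>
    <QBXMLMsgsRs>
      <ItemQueryRs requestID="1">
        <ItemServiceRet><ListID>A-1</ListID></ItemServiceRet>
        <ItemInventoryRet><ListID>B-2</ListID></ItemInventoryRet>
      </ItemQueryRs>
    </QBXMLMsgsRs>
  </QBXML>
DOC

# Collects ListID values using the same '/*/*/*' XPath as the benchmark.
def list_ids_via_xpath(xml)
  doc = REXML::Document.new(xml)
  ids = []
  REXML::XPath.each(doc, '/*/*/*') do |node|
    next unless node.name == 'ItemQueryRs'
    node.elements.each { |element| ids << element.elements['ListID'].text }
  end
  ids
end

# Collects the same values by walking child elements level by level.
def list_ids_via_traversal(xml)
  doc = REXML::Document.new(xml)
  ids = []
  doc.root.elements.each do |msgs|
    msgs.elements.each('ItemQueryRs') do |rs|
      rs.elements.each { |element| ids << element.elements['ListID'].text }
    end
  end
  ids
end

# Compare the return values before timing anything.
xpath_ids = list_ids_via_xpath(SAMPLE_XML)
traversal_ids = list_ids_via_traversal(SAMPLE_XML)
if xpath_ids == traversal_ids
  puts 'Results are identical'
else
  puts 'Results are not the same!'
  puts "XPath values: #{xpath_ids.inspect}"
  puts "Traversal values: #{traversal_ids.inspect}"
end
```

Having each method return its array (rather than discarding the fetched values) is what makes the comparison possible, which is why the diff adds ary to both parse methods.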

Now, anyone want to add a libxml section to the benchmark?

--- --- ---

5 Comments

  1. Comment by Jeremy on 04/24/08
    Wow -- surprising and useful; thanks for making the effort!
  2. Comment by ay on 04/24/08
    I think you should change "hpricot_fetch(doc/:listid)" to "hpricot_fetch(e/:listid)" at line 27.
  3. Comment by Pazu on 04/24/08
    Come on, this isn't fair. hpricot isn't a strict XML parser -- it's an HTML parser, and HTML is a much more complex markup language that is harder to optimize for.
  4. Comment by Matt on 04/24/08
    libxml *is* way faster than REXML. And the documentation is usable enough, IMHO. Can't comment on its stability on Leopard as I'm still running Tiger. P.S. Have you taken a look at the REXML::XPath (and delegates) source? Scary stuff.
  5. Comment by Lee on 04/24/08
    I switched the Hpricot section to access the nodes similar to the way you did it with REXML and got the following results:
    Rehearsal -------------------------------------------
    REXML     0.960000   0.000000   0.960000 (  0.961413)
    Hpricot   0.370000   0.000000   0.370000 (  0.377359)
    ---------------------------------- total: 1.330000sec
    
                  user     system      total        real
    REXML     0.960000   0.000000   0.960000 (  0.954501)
    Hpricot   0.330000   0.000000   0.330000 (  0.331897)
    
    Here is the diff from the pastie to my version:
    --- 185724.txt	2008-04-24 09:00:58.000000000 -0600
    +++ rexml_h_bm.rb	2008-04-24 09:00:42.000000000 -0600
    @@ -8,25 +8,31 @@
     class Parse
       def self.rexml
         doc = REXML::Document.new(XML)
    +    ary = []
         REXML::XPath.each(doc, '/*/*/*') do |node|
           case node.name
           when 'ItemQueryRs'
             node.elements.each do |element|
    -          rexml_fetch(element, 'ListID')
    +          ary << rexml_fetch(element, 'ListID')
             end
           end
         end
    +    ary
       end
       
       def self.hpricot
    -    doc = Hpricot(XML)
    -    response_element = doc.search('*[@requestid]').first
    -    case response_element.name
    -    when 'itemqueryrs'
    -      doc.search('itemserviceret|itemnoninventoryret|itemotherchargeret|iteminventoryret').each do |e|
    -        hpricot_fetch(doc/:listid)
    +    doc = Hpricot.XML(XML)
    +    ary = []
    +    response_element = doc.search('/*/*/*').each do |node|
    +      next unless node.elem?
    +      case node.name
    +      when 'ItemQueryRs'
    +        node.containers.each do |element|
    +          ary << hpricot_fetch(element/'ListID')
    +        end
           end
         end
    +    ary
       end
       
       # rexml helper
    @@ -43,6 +49,16 @@
       end
     end
     
    +rexml_results = Parse.rexml
    +hpricot_results = Parse.hpricot
    +if rexml_results == hpricot_results
    +  puts "Result are identical"
    +else
    +  puts "Results are not the same!"
    +  puts "REXML values: #{rexml_results.inspect}"
    +  puts "Hpricot values: #{hpricot_results.inspect}"
    +end
    +
     TIMES = 10
     Benchmark.bmbm do |x|
       x.report('REXML') { TIMES.times { Parse.rexml } }
    
