Internal encoding conversion benchmarks

Ruby has a (somewhat justified) reputation for being slow, but one of the great things about it is that it is easy to write extensions for it in lightning-fast C. Despite the fact that for some time now C has been "the new assembly", extensions written in C integrate remarkably well with Ruby’s pure garbage-collected, object-oriented goodness.

I am working on a Rails application at the moment and pretty much all the textual content is going to be entered and stored as wikitext, my favorite free-form markup. I could do the parsing in Ruby, but I want this to be fast, so I didn’t even bother prototyping it in Ruby first: I dived straight into C from the beginning, using an ANTLR-generated lexer (built with the C target) and, on top of that, a custom parser/transformer in hand-coded C.

The core (Java) target of ANTLR itself uses UTF-16, but the C target uses UTF-16’s obsolete half-brother, UCS-2. Actually, it apparently uses UTF-32 under the covers, but the interface it exposes works with UCS-2. The problem with UCS-2 is that it can’t represent all Unicode characters, only a subset of them, because it doesn’t have a surrogate mechanism like UTF-16 does. For precisely that reason it is easier to deal with (you have a guarantee that every character is exactly 16 bits wide), but it sucks because it means that there are some valid Unicode input strings that you can’t convert to UCS-2, which in turn means that your ANTLR-generated C recognizer won’t be able to process them. UTF-32 doesn’t have this problem because it is both fixed-width and large enough to represent all of the Unicode code points, but the downside is that it takes up a little more space (probably not much of an issue for a wikitext parser, where most slabs of input text aren’t ever going to occupy more than a few KB). In any case, once the ANTLR C target’s API settles down into something stable I’ll consider making the switch to UTF-32 if it doesn’t require too much work (or I’ll just throw the whole ANTLR thing out the window and write a scanner using Ragel).
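
To make the surrogate point concrete, here is a small throwaway C snippet (nothing to do with the actual parser) showing the arithmetic UTF-16 uses to split a codepoint above U+FFFF into a pair of 16-bit code units; UCS-2 has no equivalent mechanism, which is why such codepoints simply can’t be represented in it:

#include <stdint.h>
#include <stdio.h>

/* Illustration only: encode a Unicode codepoint as UTF-16 code units.
   UCS-2 stops at the first branch; it has no way to express cp > 0xFFFF. */
static int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp <= 0xFFFF) {
        out[0] = (uint16_t)cp;          /* one 16-bit unit; this is all UCS-2 can do */
        return 1;
    }
    cp -= 0x10000;                      /* 20 bits remain */
    out[0] = 0xD800 | (cp >> 10);       /* high (lead) surrogate */
    out[1] = 0xDC00 | (cp & 0x3FF);     /* low (trail) surrogate */
    return 2;
}

int main(void)
{
    uint16_t units[2];
    int n = utf16_encode(0x1D11E, units);   /* MUSICAL SYMBOL G CLEF, outside the BMP */
    printf("%d unit(s): %04X %04X\n", n, units[0], units[1]);   /* prints: 2 unit(s): D834 DD1E */
    return 0;
}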

But I digress. I wanted to talk about benchmarking.

All that stuff about ANTLR was just background to explain that in my wikitext parser I have to convert from the input encoding (UTF-8) to UCS-2 for internal processing purposes. Then at the end I convert back to UTF-8 and return the results.
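
Schematically, the round trip at the extension boundary looks something like the sketch below. To be clear, this is not the real code: the module name and the helper functions (utf8_to_ucs2, ucs2_to_utf8, parse_ucs2) are placeholders I’ve invented for illustration, and the real thing has to deal with error handling and buffer management properly.

#include <stdint.h>
#include <stdlib.h>
#include <ruby.h>

/* Placeholder declarations; the real parser's conversion and parse
   functions have different names and signatures. */
extern uint16_t *utf8_to_ucs2(const char *in, long len, long *out_len);
extern char     *ucs2_to_utf8(const uint16_t *in, long len, long *out_len);
extern uint16_t *parse_ucs2(const uint16_t *in, long len, long *out_len);

/* Wikitext.parse(str): UTF-8 in, UCS-2 internally, UTF-8 back out. */
static VALUE wikitext_parse(VALUE self, VALUE str)
{
    long ucs2_len, parsed_len, utf8_len;
    uint16_t *ucs2   = utf8_to_ucs2(StringValuePtr(str), RSTRING_LEN(str), &ucs2_len);
    uint16_t *parsed = parse_ucs2(ucs2, ucs2_len, &parsed_len);
    char     *utf8   = ucs2_to_utf8(parsed, parsed_len, &utf8_len);
    VALUE out        = rb_str_new(utf8, utf8_len);
    free(ucs2); free(parsed); free(utf8);    /* assuming malloc'd buffers */
    return out;
}

void Init_wikitext(void)
{
    VALUE mWikitext = rb_define_module("Wikitext");
    rb_define_singleton_method(mWikitext, "parse", wikitext_parse, 1);
}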

I initially did this with the iconv module that comes with the Ruby standard library. Then, seeing as it is relatively simple to convert back and forth between UCS-2 and UTF-8, I decided to write some internal conversion methods in pure C, eliminating the dependency on iconv and hopefully being faster to boot. If I ever move to UTF-32, the methods should be easy to extend.
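
For flavor, here is a minimal, self-contained sketch of the kind of byte-twiddling involved. Again, this is not the actual code from the parser; it does no validation of malformed input and only covers the Basic Multilingual Plane, which is exactly the UCS-2 limitation discussed above.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Decode UTF-8 into UCS-2 code units; returns the number of units written.
   4-byte sequences (codepoints above U+FFFF) are skipped here because they
   cannot be represented in UCS-2. */
size_t utf8_to_ucs2(const unsigned char *in, size_t len, uint16_t *out)
{
    size_t i = 0, o = 0;
    while (i < len) {
        unsigned char b = in[i];
        if (b < 0x80) {                         /* 1-byte sequence: ASCII */
            out[o++] = b;
            i += 1;
        } else if ((b & 0xE0) == 0xC0) {        /* 2-byte sequence: U+0080..U+07FF */
            out[o++] = ((b & 0x1F) << 6) | (in[i + 1] & 0x3F);
            i += 2;
        } else if ((b & 0xF0) == 0xE0) {        /* 3-byte sequence: U+0800..U+FFFF */
            out[o++] = ((b & 0x0F) << 12) | ((in[i + 1] & 0x3F) << 6) | (in[i + 2] & 0x3F);
            i += 3;
        } else {                                /* 4-byte sequence: outside UCS-2, skip it */
            i += 4;
        }
    }
    return o;
}

/* Encode UCS-2 code units back into UTF-8; returns the number of bytes written. */
size_t ucs2_to_utf8(const uint16_t *in, size_t len, unsigned char *out)
{
    size_t i, o = 0;
    for (i = 0; i < len; i++) {
        uint16_t c = in[i];
        if (c < 0x80) {                         /* 1 byte */
            out[o++] = (unsigned char)c;
        } else if (c < 0x800) {                 /* 2 bytes */
            out[o++] = 0xC0 | (c >> 6);
            out[o++] = 0x80 | (c & 0x3F);
        } else {                                /* 3 bytes */
            out[o++] = 0xE0 | (c >> 12);
            out[o++] = 0x80 | ((c >> 6) & 0x3F);
            out[o++] = 0x80 | (c & 0x3F);
        }
    }
    return o;
}

int main(void)
{
    const char *s = "caf\xc3\xa9";              /* "café" in UTF-8 */
    uint16_t ucs2[16];
    unsigned char utf8[16];
    size_t n = utf8_to_ucs2((const unsigned char *)s, strlen(s), ucs2);
    size_t m = ucs2_to_utf8(ucs2, n, utf8);
    printf("%zu UCS-2 units, round-trips to %zu UTF-8 bytes\n", n, m);
    return 0;
}

Most of the real complexity is in deciding what to do with malformed sequences and characters outside the BMP (reject, replace or drop them), not in the bit-shifting itself.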

So I just did some benchmarking to compare the results. I basically converted in both directions with both implementations (the custom internal one and the iconv one), and I tried with four different classes of input: short strings with only ASCII characters in them, short strings with non-ASCII characters, big slabs with only ASCII characters, and big slabs with non-ASCII characters.

Each line in the following tables shows the time taken for 100,000 iterations; the first batch of figures is from the "rehearsal" run, followed by the "real thing":

Sanity checking: ....
Rehearsal --------------------------------------------------------------------
iconv (short ASCII to UCS-2)       0.490000   0.000000   0.490000 (  0.500046)
internal (short ASCII to UCS-2)    0.120000   0.010000   0.130000 (  0.116604)
iconv (short UTF-8 to UCS-2)       0.560000   0.000000   0.560000 (  0.561324)
internal (short UTF-8 to UCS-2)    0.130000   0.000000   0.130000 (  0.140209)
iconv (longer ASCII to UCS-2)      9.080000   0.030000   9.110000 (  9.222262)
internal (longer ASCII to UCS-2)   1.240000   0.010000   1.250000 (  1.263230)
iconv (longer UTF-8 to UCS-2)     12.940000   0.040000  12.980000 ( 13.131333)
internal (longer UTF-8 to UCS-2)   3.480000   0.020000   3.500000 (  3.548404)
iconv (short ASCII to UTF-8)       0.500000   0.000000   0.500000 (  0.503098)
internal (short ASCII to UTF-8)    0.130000   0.000000   0.130000 (  0.159429)
iconv (short UTF-8 to UTF-8)       0.560000   0.000000   0.560000 (  0.558365)
internal (short UTF-8 to UTF-8)    0.140000   0.000000   0.140000 (  0.148177)
iconv (longer ASCII to UTF-8)      9.150000   0.030000   9.180000 (  9.230961)
internal (longer ASCII to UTF-8)   1.520000   0.000000   1.520000 (  1.563547)
iconv (longer UTF-8 to UTF-8)     11.840000   0.040000  11.880000 ( 12.011918)
internal (longer UTF-8 to UTF-8)   4.670000   0.010000   4.680000 (  4.722022)
---------------------------------------------------------- total: 56.740000sec

                                       user     system      total        real
iconv (short ASCII to UCS-2)       0.490000   0.000000   0.490000 (  0.498087)
internal (short ASCII to UCS-2)    0.110000   0.000000   0.110000 (  0.117257)
iconv (short UTF-8 to UCS-2)       0.570000   0.000000   0.570000 (  0.570418)
internal (short UTF-8 to UCS-2)    0.130000   0.010000   0.140000 (  0.145102)
iconv (longer ASCII to UCS-2)      9.080000   0.020000   9.100000 (  9.217037)
internal (longer ASCII to UCS-2)   1.240000   0.010000   1.250000 (  1.264873)
iconv (longer UTF-8 to UCS-2)     12.940000   0.040000  12.980000 ( 13.138847)
internal (longer UTF-8 to UCS-2)   3.480000   0.010000   3.490000 (  3.526458)
iconv (short ASCII to UTF-8)       0.500000   0.000000   0.500000 (  0.509422)
internal (short ASCII to UTF-8)    0.140000   0.000000   0.140000 (  0.147287)
iconv (short UTF-8 to UTF-8)       0.560000   0.010000   0.570000 (  0.570325)
internal (short UTF-8 to UTF-8)    0.150000   0.000000   0.150000 (  0.170726)
iconv (longer ASCII to UTF-8)      9.170000   0.020000   9.190000 (  9.268911)
internal (longer ASCII to UTF-8)   1.520000   0.010000   1.530000 (  1.554178)
iconv (longer UTF-8 to UTF-8)     11.840000   0.030000  11.870000 ( 12.044493)
internal (longer UTF-8 to UTF-8)   4.670000   0.020000   4.690000 (  4.744175)

I was pleasantly surprised at the speed-up. As far as I know, the Ruby Iconv module is just a wrapper for the iconv library (which is itself written in C, I believe), so I wouldn’t have expected it to be so slow. And there are no clever optimization tricks or shortcuts in my implementation; it’s basically written based on the information here and here.

Observations:

  • For short strings the internal implementation is about 4 to 5 times faster
  • For longer strings the gap widens to as much as 7 times faster
  • Unsurprisingly, for both implementations processing input in the ASCII range is faster
  • Slightly more interestingly, for the internal implementation converting to UCS-2 is faster than converting from it; the pattern in the iconv figures isn’t as clear-cut