Among the new features shipped with Ruby 2.4 is improved Unicode support. Specifically, methods like upcase and downcase now work as expected, turning "ä" into "Ä" and back. This made me curious: what other Unicode improvements have been made since 2013, when I read André Arko's blog post Strings in Ruby are UTF-8 now… right??
I tested all of Ruby's string methods, not looking for technical errors but for violations of the "principle of least surprise." Specifically, my assumptions were that:
- Unique characters are unique: "e" and "ë" are different, just like "e" and "E" are.
- Single characters count as single characters, no matter how they're represented in Unicode. This means that "e" and "ë" are each a single character, even though the latter can be represented by two code points (an "e" followed by a combining diaeresis).
- Characters are immutable. Reversing a string of characters shouldn't alter the individual characters.
- Whitespace is treated as whitespace, including those tricky Unicode whitespace characters.
- Digits are treated as digits. The number 2 is always the number 2 no matter how it's written.
Unfortunately, most of Ruby's string manipulation methods fail these tests. If you're working with Unicode strings, you therefore have to be extremely careful which ones you use.
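To make the second assumption concrete, here is a minimal sketch of the two ways the same visible character can be stored. The variable names are mine, and the strings are written with escape sequences so that the representation is unambiguous regardless of how an editor saves the file.

```ruby
# Two ways to store what looks like the same character, "ä".
composed   = "\u00E4"   # one code point: LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "a\u0308"  # two code points: "a" followed by COMBINING DIAERESIS

puts composed.codepoints.inspect    # [228]
puts decomposed.codepoints.inspect  # [97, 776]

# Ruby counts code points, not what a reader perceives as characters.
puts composed.length    # 1
puts decomposed.length  # 2

# The two strings render identically but are not equal.
puts composed == decomposed  # false
```

The tests in this post use the decomposed form, which is why results such as a length of 2 show up for a single visible character.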
NOTE: After publication, some readers pointed out that many of the failures I mentioned wouldn't have happened if I had normalized the Unicode test strings. This is true. However, strings aren't automatically normalized by Ruby or Rails (in any of the apps I tested). These tests were always meant to illustrate the worst case, and I think they're still useful in that regard.
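As a rough illustration of what the readers meant, the sketch below normalizes a decomposed string to NFC with String#unicode_normalize (available in core Ruby since 2.2) before operating on it. The sample word and variable names are my own choices, not part of the original test suite.

```ruby
# "händ" written with "a" + COMBINING DIAERESIS (5 code points).
decomposed = "ha\u0308nd"

# NFC composes base letter + combining mark into one code point where possible.
normalized = decomposed.unicode_normalize(:nfc)

puts decomposed.length   # 5
puts normalized.length   # 4

# After normalization the "ä" is a single code point, so a plain "a" is no longer found.
puts normalized.include?("a")           # false
puts normalized.reverse == "dn\u00E4h"  # true
```

Normalization only helps when a precomposed code point exists for the combination, so it reduces the surprises rather than removing them entirely.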
Unicode tests with Ruby 2.4.0
Method | Test | Expected | Result | Verdict |
---|---|---|---|---|
#% | "%s" % "noël" | "noël" | "noël" | OK |
#* | "noël" * 2 | "noëlnoël" | "noëlnoël" | OK |
#<< | "noël" << "ë" | "noëlë" | "noëlë" | OK |
#<=> | "ä" <=> "z" | -1 | -1 | OK |
#== | "ä" == "ä" | true | true | OK |
#=~ | "ä" =~ /a./ | nil | 0 | Beware! |
#[] | "ä"[0] | "ä" | "a" | Beware! |
#[]= | "ä"[0] = "u" | "u" | "u" | OK |
#b | "ä".b.encoding.to_s | "ASCII-8BIT" | "ASCII-8BIT" | OK |
#bytes | "ä".bytes | [97, 204, 136] | [97, 204, 136] | OK |
#bytesize | "ä".bytesize | 3 | 3 | OK |
#byteslice | "ä".byteslice(1) | "\xCC" | "\xCC" | OK |
#capitalize | "ä".capitalize | "Ä" | "Ä" | OK |
#casecmp | "äa".casecmp("äz") | -1 | -1 | OK |
#center | "ä".center(3) | " ä " | "ä " | Beware! |
#chars | "ä".chars | ["ä"] | ["a", "̈"] | Beware! |
#chomp | "ä\n".chomp | "ä" | "ä" | OK |
#chop | "ä".chop | "" | "a" | Beware! |
#chr | "ä".chr | "ä" | "a" | Beware! |
#clear | "ä".clear | "" | "" | OK |
#codepoints | "ä".codepoints | [97, 776] | [97, 776] | OK |
#concat | "ä".concat("x") | "äx" | "äx" | OK |
#count | "ä".count("a") | 0 | 1 | Beware! |
#crypt | "123".crypt("ää") == "123".crypt("aa") | false | false | OK |
#delete | "ä".delete("a") | "ä" | "̈" | Beware! |
#downcase | "Ä".downcase | "ä" | "ä" | OK |
#dump | "ä".dump | "\"a\\u0308\"" | "\"a\\u0308\"" | OK |
#each_byte | "ä".each_byte.to_a | [97, 204, 136] | [97, 204, 136] | OK |
#each_char | "ä".each_char.to_a | ["ä"] | ["a", "̈"] | Beware! |
#each_codepoint | "ä".each_codepoint.to_a | [97, 776] | [97, 776] | OK |
#each_line | "ä".each_line.to_a | ["ä"] | ["ä"] | OK |
#empty? | "ä".empty? | false | false | OK |
#encode | "ä".encode("ASCII", undef: :replace) | "a?" | "a?" | OK |
#encoding | "ä".encoding.to_s | "UTF-8" | "UTF-8" | OK |
#end_with? | "ä".end_with?("ä") | true | true | OK |
#eql? | "ä".eql?("a") | false | false | OK |
#force_encoding | "ä".force_encoding("ASCII") | "a\xCC\x88" | "a\xCC\x88" | OK |
#getbyte | "ä".getbyte(2) | 136 | 136 | OK |
#gsub | "ä".gsub("a", "x") | "ä" | "ẍ" | Beware! |
#hash | "ä".hash == "a".hash | false | false | OK |
#include? | "ä".include?("a") | false | true | Beware! |
#index | "ä".index("a") | nil | 0 | Beware! |
#replace | "ä".replace("u") | "u" | "u" | OK |
#insert | "ä".insert(1, "u") | "äu" | "aü" | Beware! |
#inspect | "ä".inspect | "\"ä\"" | "\"ä\"" | OK |
#intern | "ä".intern | :ä | :ä | OK |
#length | "ä".length | 1 | 2 | Beware! |
#ljust | "ä".ljust(3, "_") | "ä__" | "ä_" | Beware! |
#lstrip | " ä".lstrip | "ä" | "ä" | OK |
#match | "ä".match("a") | nil | #<MatchData "a"> | Beware! |
#next | "ä".next | "ä" | "b̈" | Beware! |
#ord | "ä".ord | 97 | 97 | OK |
#partition | "händ".partition("a") | ["händ"] | ["h", "a", "̈nd"] | Beware! |
#prepend | "ä".prepend("ä") | "ää" | "ää" | OK |
#replace | "ä".replace("ẍ") | "ẍ" | "ẍ" | OK |
#reverse | "händ".reverse | "dnäh" | "dn̈ah" | Beware! |
#rpartition | "händ".rpartition("a") | ["händ"] | ["h", "a", "̈nd"] | Beware! |
#rstrip | "line ".rstrip | "line" | "line " | Beware! |
#scrub | "ä".scrub | "ä" | "ä" | OK |
#setbyte | s = "ä"; s.setbyte(0, "x".ord); s | "ẍ" | "ẍ" | OK |
#size | "ä".size | 1 | 2 | Beware! |
#slice | "ä".slice(0) | "ä" | "a" | Beware! |
#split | "ä".split("a") | ["ä"] | ["", "̈"] | Beware! |
#squeeze | "ää".squeeze("ä") | "ä" | "ää" | Beware! |
#start_with? | "ä".start_with?("a") | false | true | Beware! |
#strip | " line ".strip | "line" | " line " | Beware! |
#sub | "ä".sub("a", "x") | "ä" | "ẍ" | Beware! |
#succ | "ä".succ | "b̈" | "b̈" | OK |
#swapcase | "ä".swapcase | "Ä" | "Ä" | OK |
#to_c | "١".to_c | (1+0i) | (0+0i) | Beware! |
#to_f | "١".to_f | 1.0 | 0.0 | Beware! |
#to_i | "١".to_i | 1 | 0 | Beware! |
#to_r | "١".to_r | (1/1) | (0/1) | Beware! |
#to_sym | "ä".to_sym | :ä | :ä | OK |
#tr | "ä".tr("a", "b") | "ä" | "b̈" | Beware! |
#unpack | "ä".unpack("CCC") | [97, 204, 136] | [97, 204, 136] | OK |
#upto | "ä".upto("c̈").to_a | ["ä", "b̈", "c̈"] | ["ä", "b̈", "c̈"] | OK |
#valid_encoding? | "ä".valid_encoding? | true | true | OK |
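
For the rows marked "Beware!", the sketch below shows a few possible workarounds that rely only on core Ruby. The helper names (graphemes, unicode_strip, ascii_digits) are hypothetical and mine, the no-break space is just one example of a Unicode whitespace character, and the digit helper assumes the input uses Arabic-Indic digits only. Treat it as a starting point under those assumptions, not a complete solution.

```ruby
# 1. Characters: split on extended grapheme clusters with the \X regex escape,
#    so a base letter and its combining marks stay together.
def graphemes(str)
  str.scan(/\X/)
end

word = "ha\u0308nd"  # "händ" with a decomposed "ä"
puts graphemes(word).length                        # 4
puts graphemes(word).reverse.join == "dna\u0308h"  # true: the diaeresis stays on the "a"

# 2. Whitespace: the POSIX bracket [[:space:]] is Unicode-aware in Ruby regexes,
#    so it also matches characters such as NO-BREAK SPACE (U+00A0).
def unicode_strip(str)
  str.gsub(/\A[[:space:]]+|[[:space:]]+\z/, "")
end

padded = "\u00A0line\u00A0"
puts unicode_strip(padded) == "line"  # true

# 3. Digits: map Arabic-Indic digits (U+0660..U+0669) to ASCII before converting.
#    Other digit systems would need their own ranges.
def ascii_digits(str)
  str.tr("\u0660-\u0669", "0-9")
end

puts ascii_digits("\u0661\u0662").to_i  # 12
```

Working on grapheme clusters is slower than the built-in methods, so it is worth reserving for strings that actually come from user input.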