Among the new features shipped with Ruby 2.4 is improved Unicode support. Specifically, methods like upcase and downcase now work as expected on non-ASCII characters, turning "ä" into "Ä" and back. This made me curious about what other Unicode improvements have been made since 2013, when I read André Arko's blog post "Strings in Ruby are UTF-8 now… right?".
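
To make the case-mapping change concrete, here is a small sketch, assuming Ruby 2.4 or later. The characters are written as escape sequences so there is no doubt about which code points are involved; the variable names are only for illustration.

```ruby
# Precomposed characters, written as escapes so the code points are unambiguous.
lower = "\u00E4" # ä (LATIN SMALL LETTER A WITH DIAERESIS)
upper = "\u00C4" # Ä (LATIN CAPITAL LETTER A WITH DIAERESIS)

# Since Ruby 2.4, case mapping covers non-ASCII characters by default.
upcased   = lower.upcase
downcased = upper.downcase

# The :ascii option restricts mapping to A-Z/a-z, which mimics the old behaviour.
ascii_only = lower.upcase(:ascii)

puts upcased == upper      # the "ä" became "Ä"
puts downcased == lower    # and back again
puts ascii_only == lower   # left untouched when limited to ASCII
```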

I tested all of Ruby's string methods, not looking for technical errors but for violations of the "principle of least surprise." Specifically, my assumptions were that:

  1. Unique characters are unique: "e" and "ë" are different, just like "e" and "E" are.
  2. Single characters count as single characters, no matter how they're represented in Unicode. This means that "e" and "ë" are each a single character, even though the latter is represented by two code points.
  3. Characters are immutable. Reversing a string of characters shouldn't alter the individual characters.
  4. Whitespace is treated as whitespace, even those tricky Unicode whitespace characters.
  5. Digits are treated as digits. The number 2 is always the number 2 no matter how it's written.
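
The second assumption is the one that trips up most methods. As a rough illustration (the variable names are mine, not part of any API), the same visible character can be stored either as one precomposed code point or as a base letter plus a combining mark, and Ruby treats the two forms as different strings:

```ruby
precomposed = "\u00EB"  # ë as a single code point
decomposed  = "e\u0308" # e followed by COMBINING DIAERESIS

# Both render as "ë", but Ruby counts code points, not visible characters.
precomposed_length = precomposed.length
decomposed_length  = decomposed.length

# The underlying code points and bytes show where the difference comes from.
decomposed_codepoints = decomposed.codepoints
decomposed_bytes      = decomposed.bytes

# Without normalization, the two representations do not compare as equal.
same = (precomposed == decomposed)

puts precomposed_length, decomposed_length
puts decomposed_codepoints.inspect, decomposed_bytes.inspect
puts same
```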

Unfortunately, most of Ruby's string manipulation methods fail these tests. If you're working with Unicode strings, you therefore have to be extremely careful which ones you use.

NOTE: After publication, some readers pointed out that many of the failures I mentioned wouldn't have happened if I had normalized the Unicode test strings. This is true. However, strings aren't automatically normalized by Ruby or Rails (in any of the apps I tested). These tests were always meant to illustrate the worst case, and I think they're still useful in that regard.
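
For reference, the normalization the readers had in mind looks roughly like the sketch below, which uses String#unicode_normalize (available since Ruby 2.2). The test strings are my own and are written as escapes; note that normalization only helps when a precomposed form exists, which is not the case for every letter and mark combination.

```ruby
decomposed = "a\u0308" # "a" + COMBINING DIAERESIS, as in the tests below

# NFC composes base letter + combining mark into a single code point where possible.
normalized = decomposed.unicode_normalize(:nfc)

was_normalized  = decomposed.unicode_normalized?(:nfc)
length_after    = normalized.length
includes_plain  = normalized.include?("a")
reversed_word   = "ha\u0308nd".unicode_normalize(:nfc).reverse

# There is no precomposed "b with diaeresis", so this one stays two code points.
still_two = "b\u0308".unicode_normalize(:nfc).length

puts was_normalized, length_after, includes_plain, reversed_word, still_two
```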

Unicode tests with Ruby 2.4.0

Method Test Expected Result Verdict
#% "%s" % "noël" "noël" "noël" OK
#* "noël" * 2 "noëlnoël" "noëlnoël" OK
#<< "noël" << "ë" "noëlë" "noëlë" OK
#<=> "ä" <=> "z" -1 -1 OK
#== "ä" == "ä" true true OK
#=~ "ä" =~ /a./ nil 0 Beware!
#[] "ä"[0] "ä" "a" Beware!
#[]= "ä"[0] = "u" "u" "u" OK
#b "ä".b.encoding.to_s "ASCII-8BIT" "ASCII-8BIT" OK
#bytes "ä".bytes [97, 204, 136] [97, 204, 136] OK
#bytesize "ä".bytesize 3 3 OK
#byteslice "ä".byteslice(1) "\xCC" "\xCC" OK
#capitalize "ä".capitalize "Ä" "Ä" OK
#casecmp "äa".casecmp("äz") -1 -1 OK
#center "ä".center(3) " ä " "ä " Beware!
#chars "ä".chars ["ä"] ["a", "̈"] Beware!
#chomp "ä ".chomp "ä" "ä" OK
#chop "ä".chop "" "a" Beware!
#chr "ä".chr "ä" "a" Beware!
#clear "ä".clear "" "" OK
#codepoints "ä".codepoints [97, 776] [97, 776] OK
#concat "ä".concat("x") "äx" "äx" OK
#count "ä".count("a") 0 1 Beware!
#crypt "123".crypt("ää") == "123".crypt("aa") false false OK
#delete "ä".delete("a") "ä" "̈" Beware!
#downcase "Ä".downcase "ä" "ä" OK
#dump "ä".dump "\"a\\u0308\"" "\"a\\u0308\"" OK
#each_byte "ä".each_byte.to_a [97, 204, 136] [97, 204, 136] OK
#each_char "ä".each_char.to_a ["ä"] ["a", "̈"] Beware!
#each_codepoint "ä".each_codepoint.to_a [97, 776] [97, 776] OK
#each_line "ä".each_line.to_a ["ä"] ["ä"] OK
#empty? "ä".empty? false false OK
#encode "ä".encode("ASCII", undef: :replace) "a?" "a?" OK
#encoding "ä".encoding.to_s "UTF-8" "UTF-8" OK
#end_with? "ä".end_with?("ä") true true OK
#eql? "ä".eql?("a") false false OK
#force_encoding "ä".force_encoding("ASCII") "a\xCC\x88" "a\xCC\x88" OK
#getbyte "ä".getbyte(2) 136 136 OK
#gsub "ä".gsub("a", "x") "ä" "ẍ" Beware!
#hash "ä".hash == "a".hash false false OK
#include? "ä".include?("a") false true Beware!
#index "ä".index("a") nil 0 Beware!
#replace "ä".replace("u") "u" "u" OK
#insert "ä".insert(1, "u") "äu" "aü" Beware!
#inspect "ä".inspect "\"ä\"" "\"ä\"" OK
#intern "ä".intern :ä :ä OK
#length "ä".length 1 2 Beware!
#ljust "ä".ljust(3, "_") "ä__" "ä_" Beware!
#lstrip " ä".lstrip "ä" "ä" OK
#match "ä".match("a") nil #<MatchData "a"> Beware!
#next "ä".next "ä" "b̈" Beware!
#ord "ä".ord 97 97 OK
#partition "händ".partition("a") ["händ"] ["h", "a", "̈nd"] Beware!
#prepend "ä".prepend("ä") "ää" "ää" OK
#replace "ä".replace("ẍ") "ẍ" "ẍ" OK
#reverse "händ".reverse "dnäh" "dn̈ah" Beware!
#rpartition "händ".rpartition("a") ["händ"] ["h", "a", "̈nd"] Beware!
#rstrip "line ".rstrip "line" "line " Beware!
#scrub "ä".scrub "ä" "ä" OK
#setbyte s = "ä"; s.setbyte(0, "x".ord); s "ẍ" "ẍ" OK
#size "ä".size 1 2 Beware!
#slice "ä".slice(0) "ä" "a" Beware!
#split "ä".split("a") ["ä"] ["", "̈"] Beware!
#squeeze "ää".squeeze("ä") "ä" "ää" Beware!
#start_with? "ä".start_with?("a") false true Beware!
#strip " line ".strip "line" " line " Beware!
#sub "ä".sub("a", "x") "ä" "ẍ" Beware!
#succ "ä".succ "b̈" "b̈" OK
#swapcase "ä".swapcase "Ä" "Ä" OK
#to_c "١".to_c (1+0i) (0+0i) Beware!
#to_f "١".to_f 1.0 0.0 Beware!
#to_i "١".to_i 1 0 Beware!
#to_r "١".to_r (1/1) (0/1) Beware!
#to_sym "ä".to_sym :ä :ä OK
#tr "ä".tr("a", "b") "ä" "b̈" Beware!
#unpack "ä".unpack("CCC") [97, 204, 136] [97, 204, 136] OK
#upto "ä".upto("c̈").to_a ["ä", "b̈", "c̈"] ["ä", "b̈", "c̈"] OK
#valid_encoding? "ä".valid_encoding? true true OK
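
If you can't normalize your input, a few of the "Beware!" rows can be worked around by hand. The following is only a sketch under my own assumptions, not a complete solution: the graphemes helper is hypothetical, the test strings are mine, and the digit mapping handles Arabic-Indic digits only.

```ruby
# Hypothetical helper: split into extended grapheme clusters with \X,
# so combining marks stay attached to their base letter.
def graphemes(str)
  str.scan(/\X/)
end

word = "ha\u0308nd" # "händ" with a decomposed ä

grapheme_count = graphemes(word).length   # user-perceived characters
reversed = graphemes(word).reverse.join   # the diaeresis stays on the "a"

# [[:space:]] is Unicode-aware, so it also matches characters such as EM SPACE.
padded   = "\u2003line\u2003"
stripped = padded.gsub(/\A[[:space:]]+|[[:space:]]+\z/, "")

# Map ARABIC-INDIC digits (U+0660..U+0669) to ASCII digits before converting.
number = "\u0661\u0662".tr("\u0660-\u0669", "0-9").to_i

puts grapheme_count, reversed, stripped.inspect, number
```

On Ruby 2.5 and later, String#each_grapheme_cluster does the same job as the \X scan.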

    Starr Horne

    Starr Horne is a Rubyist and Chief JavaScripter at Honeybadger.io. When she's not neck-deep in other people's bugs, she enjoys making furniture with traditional hand-tools, reading history and brewing beer in her garage in Seattle.
