Have you ever had a bunch of data in an array, but needed to do a key/value lookup like you would with a hash? Fortunately, Ruby provides a mechanism for treating arrays as key-value structures. Let's check it out!
Introducing Array#assoc and Array#rassoc
Imagine that you've been given a magical stock-picking machine. Every few minutes it spits out a recommendation to buy or sell a stock. You've managed to hook it up to your computer, and you're receiving a stream of data that looks like this:
picks = [
["AAPL", "buy"],
["GOOG", "sell"],
["MSFT", "sell"]
]
To find the most recent guidance for Google, you could make use of the Array#assoc method. Here's what that looks like:
# Returns the first row of data where row[0] == "GOOG"
picks.assoc("GOOG") # => ["GOOG", "sell"]
To find the most recent "sell" recommendation, you could use the Array#rassoc method.
# Returns the first row of data where row[1] == "sell"
picks.rassoc("sell") # => ["GOOG", "sell"]
If no match is found, the methods return nil:
picks.assoc("CSCO") # => nil
picks.rassoc("hold") # => nil
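Because a miss returns nil rather than raising, it's worth deciding up front how to handle it. Here's a minimal sketch of one way to pull just the action out of a row and fall back to a default. The variable names and the "no recommendation" label are made up for illustration.

```ruby
picks = [
  ["AAPL", "buy"],
  ["GOOG", "sell"],
  ["MSFT", "sell"]
]

# Destructure the matching row; the leading underscore discards the ticker.
_, goog_action = picks.assoc("GOOG")
goog_action # => "sell"

# When assoc returns nil, destructuring leaves the variable nil too,
# so a plain || fallback works.
_, csco_action = picks.assoc("CSCO")
csco_label = csco_action || "no recommendation"
csco_label # => "no recommendation"
```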
Historical data
Hashes can't have more than one value for a single key. But arrays can have as many duplicates as you like. The assoc and rassoc methods do the sensible thing in this case and return the first matching row they find. This lets us do some pretty interesting things.
Our imaginary stock picking machine provides a stream of data. Eventually, it's going to change its mind about a particular company and tell me to buy what it previously told me to sell. In that case our data looks like:
picks = [
["GOOG", "buy"],
["AAPL", "sell"],
["AAPL", "buy"],
["GOOG", "sell"],
["MSFT", "sell"]
]
If I were putting all of this data into a hash, updating the recommendation for a particular stock would cause me to lose any previous recommendations for that stock. Not so with the array. I can keep prepending recommendations to the array, knowing that Array#assoc will always give me the most recent recommendation.
# Returns the first row of data where row[0] == "GOOG"
picks.assoc("GOOG") # => ["GOOG", "buy"]
So we get the key-value goodness of a hash, along with a free audit trail.
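To make the audit trail idea concrete, here's a rough sketch of what prepending and reading back history could look like. It assumes new rows are added with Array#unshift and that Array#select is good enough for pulling out the history. The new MSFT row is invented for the example.

```ruby
picks = [
  ["GOOG", "buy"],
  ["AAPL", "sell"],
  ["AAPL", "buy"],
  ["GOOG", "sell"],
  ["MSFT", "sell"]
]

# A new recommendation arrives: put it at the front so assoc finds it first.
picks.unshift(["MSFT", "buy"])
latest = picks.assoc("MSFT")
latest # => ["MSFT", "buy"]

# The older rows are still in the array, newest first.
history = picks.select { |ticker, _action| ticker == "MSFT" }
history # => [["MSFT", "buy"], ["MSFT", "sell"]]
```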
More than two columns
Another neat thing about assoc is that you're not limited to just two columns per array. You can have as many columns as you like. Suppose you added a timestamp to each buy/sell recommendation.
picks = [
["AAPL", "buy", "2015-08-17 12:11:55 -0700"],
["GOOG", "sell", "2015-08-17 12:10:00 -0700"],
["MSFT", "sell", "2015-08-17 12:09:00 -0700"]
]
Now when we use assoc or rassoc, we'll get the timestamp as well:
# The entire row is returned
picks.assoc("GOOG") # => ["GOOG", "sell", "2015-08-17 12:10:00 -0700"]
I hope you can see how useful this could be when dealing with data from CSV and other file formats that can have lots of columns.
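One thing to keep in mind with wider rows: assoc only ever compares against the first column and rassoc only against the second. To search on any other column you need something like Array#find. The sketch below shows the difference. find_by_column is a hypothetical helper, not part of Ruby.

```ruby
picks = [
  ["AAPL", "buy",  "2015-08-17 12:11:55 -0700"],
  ["GOOG", "sell", "2015-08-17 12:10:00 -0700"],
  ["MSFT", "sell", "2015-08-17 12:09:00 -0700"]
]

# rassoc only looks at row[1], so a timestamp in row[2] never matches.
by_rassoc = picks.rassoc("2015-08-17 12:10:00 -0700")
by_rassoc # => nil

# Hypothetical helper: first row whose column at `index` equals `value`.
def find_by_column(rows, index, value)
  rows.find { |row| row[index] == value }
end

by_column = find_by_column(picks, 2, "2015-08-17 12:10:00 -0700")
by_column # => ["GOOG", "sell", "2015-08-17 12:10:00 -0700"]
```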
Speed
Ruby's hashes will definitely outperform Array#assoc in most benchmarks. As the dataset gets bigger, the differences become more apparent. After all, hash lookups are O(1) on average, while array searches are O(n). However, in many cases the difference won't be large enough for you to worry about; it depends on the details.
Just for fun, I wrote a simple benchmark comparing hash lookup vs assoc for a 10 row dataset and for a 100,000 row dataset. As expected, the hash and array were in the same ballpark with the small dataset (the hash was still roughly 3.5x faster). With the large dataset, the hash dominated the array.
...though to be fair, I'm searching for the last element in the array, which is the worst case scenario for array searches.
require 'benchmark/ips' # from the benchmark-ips gem
require 'securerandom'

Benchmark.ips do |x|
  x.time = 5
  x.warmup = 2

  # Small dataset: random hex keys, looked up by the last key in the array
  short_array = (0..10).map { |i| [SecureRandom.hex, i] }
  short_hash = Hash[short_array]
  short_key = short_array.last.first

  # Large dataset: same shape, again searching for the last key
  long_array = (0..100_000).map { |i| [SecureRandom.hex, i] }
  long_hash = Hash[long_array]
  long_key = long_array.last.first

  x.report("short_array") { short_array.assoc(short_key) }
  x.report("short_hash") { short_hash[short_key] }
  x.report("long_array") { long_array.assoc(long_key) }
  x.report("long_hash") { long_hash[long_key] }

  x.compare!
end
# Calculating -------------------------------------
# short_array 91.882k i/100ms
# short_hash 149.430k i/100ms
# long_array 19.000 i/100ms
# long_hash 152.086k i/100ms
# -------------------------------------------------
# short_array 1.828M (± 3.4%) i/s - 9.188M
# short_hash 6.500M (± 4.8%) i/s - 32.426M
# long_array 205.416 (± 3.9%) i/s - 1.026k
# long_hash 6.974M (± 4.2%) i/s - 34.828M
# Comparison:
# long_hash: 6974073.6 i/s
# short_hash: 6500207.2 i/s - 1.07x slower
# short_array: 1827628.6 i/s - 3.82x slower
# long_array: 205.4 i/s - 33950.98x slower
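If the linear scan ever does become a bottleneck, one option is to pay the O(n) cost once and build a hash index that answers the same question assoc does. This is only a sketch under that assumption. The index and by_ticker names are made up, and the index has to be rebuilt (or updated) whenever rows are prepended.

```ruby
picks = [
  ["GOOG", "buy"],
  ["AAPL", "sell"],
  ["AAPL", "buy"],
  ["GOOG", "sell"],
  ["MSFT", "sell"]
]

# Build the index once. `||=` keeps the first row seen for each key,
# which matches what assoc would return.
index = picks.each_with_object({}) do |row, by_ticker|
  by_ticker[row[0]] ||= row
end

index["GOOG"] # => ["GOOG", "buy"]
index["CSCO"] # => nil

# By contrast, Array#to_h keeps the last value for a duplicated key.
last_wins = picks.to_h
last_wins["GOOG"] # => "sell"
```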