Confession time: I don't particularly like working with regular expressions. While I use them all the time, anything more complex than a /^foo.*$/
requires me to stop and think. I'm sure there are people who can decipher expressions like \A(?=\w{6,10}\z)(?=[^a-z]*[a-z])(?=(?:[^A-Z]*[A-Z]){3})
at a glance, but for me it takes several minutes of googling and makes me unhappy. It's quite a difference from reading Ruby.
If you're curious, the example above is taken from this article on regex lookaheads.
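For what it's worth, here's my best guess at what it does, worked out in Ruby. The comments are my reading of each lookahead, not gospel:

PATTERN = /\A(?=\w{6,10}\z)(?=[^a-z]*[a-z])(?=(?:[^A-Z]*[A-Z]){3})/
# (?=\w{6,10}\z)            the whole string is 6-10 word characters
# (?=[^a-z]*[a-z])          ..and contains at least one lowercase letter
# (?=(?:[^A-Z]*[A-Z]){3})   ..and contains at least three uppercase letters

PATTERN.match?("ABCdef") # => true
PATTERN.match?("abcdef") # => false (no uppercase letters)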
The Situation
At Honeybadger I'm currently working on improving our search UI. Like many search systems, ours uses a simple query language. Before my changes, if you wanted to search for a custom date range, you had to manually type in a query like so:
occurred:[2017-06-12T16:10:00Z TO 2017-06-12T17:10:00Z]
Ouch!
In the new search UI, we want to detect when you start typing a date-related query and pop up a helpful datepicker. And of course, the datepicker is just the beginning. Eventually we'll expand the context-sensitive hinting to cover more kinds of search terms. Here are a few examples:
assigned:jane@email.com context.user.id=100
resolved:false ignored:false occurred:[
params.article.title:"Starr's parser post" foo:'ba
I need to tokenize these strings in such a way that:
- Whitespace separates tokens, except when surrounded by '', "" or []
- Unquoted whitespace is its own token
- I can run tokens.join("") to exactly recreate the input string
For example:
tokenize(%[params.article.title:"Starr's parser post" foo:'ba])
=> ["params.article.title:\"Starr's parser post\"", " ", "foo:'ba"]
Using A Regular Expression
My first thought was to use a capturing regular expression to define what a valid token should look like, then use String#split to split the string into tokens. It's a pretty cool trick, actually:
# The parens in the regexp mean that the separator is added to the array
"foo bar baz".split(/(foo|bar|baz)/)
=> ["", "foo", " ", "bar", " ", "baz"]
This looked promising initially, despite the weird empty string at the start (split includes the text before the first match, even when it's empty). But my real-world regular expression was much more complex. My first draft looked like this:
/
(                       # Capture group so that split will include matching and non-matching strings
  (?:                   # The first character of the key, which is
    (?!\s)[^:\s"'\[]{1} # ..any valid "key" char not preceded by whitespace
    | ^[^:\s"'\[]{0,1}  # ..or any valid "key" char at the beginning of the line
  )
  [^:\s"'\[]*           # The rest of the "key" chars
  :                     # a colon
  (?:                   # The "value" chars, which are
    '[^']+'             # ..anything surrounded by single quotes
    | "[^"]+"           # ..or anything surrounded by double quotes
    | \[\S+\sTO\s\S+\]  # ..or anything like [x TO y]
    | [^\s"'\[]+        # ..or any string not containing whitespace or special chars
  )
)
/xi
Working with this gave me a sinking feeling. Every time I found an edge case I'd have to amend the regular expression, making it even more complex. In addition, it needed to work in JavaScript as well as Ruby, so certain features like negative lookbehind weren't available.
...It was about this time that the absurdity of all this struck me. The regular expression approach I was using was far more complicated than simply writing a parser from scratch would be.
Anatomy of a Parser
I'm no expert, but simple parsers are simple. All they do is:
- Step through a string, character by character
- Append each character to a buffer
- When a token-separating condition is encountered, save the buffer to an array and empty it.
Knowing this, we can set up a simple parser that splits strings by whitespace. It's roughly equivalent to "foo bar".split(/(\s+)/).
class Parser
  WHITESPACE = /\s/
  NON_WHITESPACE = /\S/

  def initialize
    @buffer = []
    @output = []
  end

  def parse(text)
    text.each_char do |c|
      case c
      when WHITESPACE
        flush if previous.match(NON_WHITESPACE)
        @buffer << c
      else
        flush if previous.match(WHITESPACE)
        @buffer << c
      end
    end
    flush
    @output
  end

  protected

  def flush
    if @buffer.any?
      @output << @buffer.join("")
      @buffer = []
    end
  end

  def previous
    @buffer.last || ""
  end
end

puts Parser.new.parse("foo bar baz").inspect
# Outputs ["foo", " ", "bar", " ", "baz"]
This is a step in the direction of what I want, but it's missing support for quotes and brackets. Fortunately, adding that only takes a few lines of code:
def parse(text)
  surround = nil
  text.each_char do |c|
    case c
    when WHITESPACE
      flush if previous.match(NON_WHITESPACE) && !surround
      @buffer << c
    when '"', "'"
      @buffer << c
      if !surround
        surround = c
      elsif surround == c
        flush
        surround = nil
      end
    when "["
      @buffer << c
      surround = c if !surround
    when "]"
      @buffer << c
      if surround == "["
        flush
        surround = nil
      end
    else
      flush if previous.match(WHITESPACE) && !surround
      @buffer << c
    end
  end
  flush
  @output
end
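Running it against the examples from earlier gives exactly the tokens I wanted:

puts Parser.new.parse(%[params.article.title:"Starr's parser post" foo:'ba]).inspect
# Outputs ["params.article.title:\"Starr's parser post\"", " ", "foo:'ba"]

puts Parser.new.parse("occurred:[2017-06-12T16:10:00Z TO 2017-06-12T17:10:00Z]").inspect
# Outputs ["occurred:[2017-06-12T16:10:00Z TO 2017-06-12T17:10:00Z]"]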
This code is only a bit longer than my regular-expression-based approach, but it's much, much more straightforward.
Parting thoughts
There's probably a regular expression out there that would work fine for my use case. If history is any guide, it's probably simple enough to make me look like a fool. :)
But I really enjoyed the chance to write this little parser. It broke me out of the rut I was in with the regex approach. As a nice bonus, I'm a lot more confident in the resulting code than I ever am with code that is based around complicated regular expressions.