Mastering regular expressions in Go

Strings are arguably the most important data type in programming. Without them, there would be no emails, text messages, or almost any other everyday computer activity, because text is the most accessible way of passing information to both technical and non-technical people.

All programming languages, regardless of whether they’re primitive or advanced, have a way to represent and handle strings because they are the backbone of every useful program. In this article, you will learn everything you need to know to start manipulating strings with regular expressions in Go.

Prerequisites

The only prerequisite for following along with this tutorial is a basic knowledge of Go. You can get familiar with Go, or quickly brush up on it, by going through my Go Beginner Series.

With the prerequisites out of the way, let's explore what regular expressions are in the next section.

Regular expressions

Regular expressions, also known as regex, are sequences of characters that define a search pattern. You can use them for tasks such as searching, search and replace, and input validation.

To put what regular expressions can do for you into context, let's compare solving the same problem with and without them. For example, if you need a function that finds and extracts URLs from a given text, you can implement it without regular expressions in Go like this:

package main

import "strings"

func extractURLs(input string) []string {
 var urls []string
 words := strings.Fields(input)

 for _, word := range words {
  if strings.HasPrefix(word, "http://") || strings.HasPrefix(word, "https://") || strings.HasPrefix(word, "www.") {
   urls = append(urls, word)
  }
 }

 return urls
}

The code above defines an extractURLs function that takes in a string and checks if it has a URL in it by breaking it down and looping through the words to find any word that starts with http://, https://, or www. and returns all of them as a slice.

Now, let's see how we can use regular expressions to solve this problem in Go:

func extractURLs(input string) []string {
 pattern := regexp.MustCompile(`https?://\S+|www\.\S+`)
 return pattern.FindAllString(input, -1)
}

The code above does the same job as the previous code block in far fewer lines (it needs the regexp package imported instead of strings). This is how regular expressions can help you do more with less code.

You would use the code above inside a Go main function like this:

func main() {
 input := `You can find me on https://www.google.com https://www.facebook.com www.twitter.com`
 urls := extractURLs(input)
 for _, url := range urls {
  println(url)
 }
}

You should then get a result that looks like this in the terminal:

https://www.google.com
https://www.facebook.com
www.twitter.com

Note: Don't worry if you don't understand the regular expression I used above; you will learn how it was composed and even start writing much more complex ones once you finish reading this article.

Regular expressions in Go

In Go, regular expressions are provided by the standard regexp package, which implements the syntax accepted by RE2. RE2 is a regular expression library that emphasizes efficiency and safety. It was developed by Google and designed to handle regular expressions with linear time complexity. This means that the time it takes to match a string with a regular expression is proportional to the length of the string. In contrast, traditional backtracking regular expression engines can exhibit exponential time complexity for certain patterns, making them vulnerable to catastrophic backtracking and causing performance issues.

However, while RE2 offers better performance and safety, it lacks some of the more advanced features found in other regex engines, such as backreferences and lookarounds, because they cannot be matched within the linear-time guarantee. Therefore, if your use case requires features RE2 does not support, you might need to use a different library or approach.
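
A quick way to see these limits for yourself is to compile a pattern that relies on a missing feature. The sketch below uses regexp.Compile, which returns an error instead of panicking. The compiles helper is a name made up for this example and is not part of the standard library:

```go
package main

import (
	"fmt"
	"regexp"
)

// compiles reports whether Go's regexp package accepts the pattern.
// It is a helper written for this example, not a standard library function.
func compiles(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

func main() {
	fmt.Println(compiles(`(\w+) \1`))    // false: \1 is a backreference
	fmt.Println(compiles(`foo(?=bar)`))  // false: (?=...) is a lookahead
	fmt.Println(compiles(`(\w+) (\w+)`)) // true: ordinary groups are fine
}
```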

Note: To understand the reasons RE2 was developed and the ideas behind it, read the series of articles on regular expression matching by Russ Cox, the author of RE2 and a long-time member of the Go team.

Regular expressions: syntax and concepts

In this section, we will explore the regular expressions syntax and its concepts while using Go's regexp package to see how each one works in detail.

Literal patterns

Literal patterns are characters that precisely match themselves in a regular expression. For example, the regular expression "Northern" would match the word "Northern" in a text like this:

package main

import (
 "fmt"
 "regexp"
)

func main() {
 pattern := regexp.MustCompile("Northern")
 text := "New York is in the Northern part of the United States."

 if pattern.MatchString(text) {
  fmt.Println("Correct region!")
 } else {
  fmt.Println("Wrong region!")
 }
}

The code above imports the fmt and regexp packages, defines a text variable containing the given string, and compiles the pattern with the MustCompile function from the regexp package. It then calls the MatchString method on the compiled pattern inside an if statement to print a message based on the result.
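
MustCompile panics when the pattern is invalid, which suits patterns that are fixed in your source code. For patterns that arrive at run time, for example from user input, regexp.Compile returns an error that you can handle. Here is a minimal sketch of that approach; the function name matchesUserPattern is hypothetical:

```go
package main

import (
	"fmt"
	"regexp"
)

// matchesUserPattern compiles a pattern supplied at run time and reports
// whether text matches it. The name is made up for this example.
func matchesUserPattern(pattern, text string) (bool, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return false, fmt.Errorf("invalid pattern %q: %w", pattern, err)
	}
	return re.MatchString(text), nil
}

func main() {
	ok, err := matchesUserPattern("Northern", "the Northern part")
	fmt.Println(ok, err) // true <nil>

	_, err = matchesUserPattern("North(ern", "the Northern part")
	fmt.Println(err != nil) // true: the unclosed parenthesis is rejected
}
```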

Note: Regular expressions are case-sensitive by default, so the code above would print Wrong region! if "Northern" in the text string was written in all lowercase.

Special characters

Regular expressions wouldn’t be handy if we could only use literal patterns as described above. Special characters are used to match much more extensive patterns in our code. There are different types of special characters in regular expressions, and we will look at the ones supported in the RE2 syntax specification in this article.

We’ll explore the special characters called flags and what they are used for in the next section.

Flags

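In regular expressions, flags change how a pattern is matched, for example by ignoring letter case or by letting anchors match at line breaks. In Go, you set a flag inside the pattern itself with the (?flag) syntax, such as (?i). Let's explore some of the common regular expression flags in this section.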

The `i` flag

The i flag, also known as the “ignore case” flag, is used in regular expressions to make pattern-matching case-insensitive. When this flag is included in a regular expression, it allows the pattern to match both uppercase and lowercase versions of letters without distinction.

For example, to match strings like “Apple”, “aPple”, “APPlE”, and so on, in addition to just “apple”, you can use the i flag in Go:

package main

import (
 "fmt"
 "regexp"
)

func main() {
 text := "Apples are delicious, but so are apples, aPpleS, and APPlE."

 pattern := regexp.MustCompile(`(?i)apple`)
 appleMatches := pattern.FindAllString(text, -1)
 fmt.Println("Case-insensitive matches for 'apple':", appleMatches)
}

The code above uses the FindAllString function to get all the matches for the word Apple, regardless of the spelling case, like so:

Case-insensitive matches for 'apple': [Apple apple aPple APPlE]

The `g` flag

The g flag, known as the "global" flag in engines such as JavaScript's, performs a global search for a pattern within the given string. Without this flag, those engines only match the first occurrence of the pattern in the string.

Go doesn't have a g flag. Instead, the method you call decides how many matches you get: FindString returns only the first match, while FindAllString returns every match when its second argument is -1, as we did in the previous code block.
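
To make the difference concrete, here is a small sketch that runs one pattern through FindString and FindAllString. The variable name appleRe and the sample text are chosen only for illustration:

```go
package main

import (
	"fmt"
	"regexp"
)

// appleRe is compiled once at package level so it can be reused.
var appleRe = regexp.MustCompile(`(?i)apple`)

func main() {
	text := "Apple pie, apple juice, and APPLE sauce."

	fmt.Println(appleRe.FindString(text))        // first match only: Apple
	fmt.Println(appleRe.FindAllString(text, 2))  // at most two matches: [Apple apple]
	fmt.Println(appleRe.FindAllString(text, -1)) // every match: [Apple apple APPLE]
}
```

The second argument of FindAllString is the maximum number of matches to return, and a negative value means no limit.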

The `m` flag

The m flag, short for “multiline”, affects the behavior of the ^ and $ anchors in a regular expression. By default, ^ matches the beginning of the entire string, and $ matches the end of the entire string. However, when the m flag is used, these anchors also match the beginning and end of individual lines within a multiline string. This is especially useful when you have a multiline text and want to apply the pattern to each line separately.

For instance, to match lines that begin with “Hello,” you can use the m flag like so:

package main

import (
 "fmt"
 "regexp"
)

func main() {
 input := "Hello World\nHello Golang\nHello Regexp"

 re := regexp.MustCompile(`(?m)^Hello`)
 matches := re.FindAllString(input, -1)

 fmt.Println(matches) // returns [Hello Hello Hello]
}
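
To see what the flag changes, you can run the same pattern with and without it. This is a small sketch; the countMatches helper is a name invented for the example:

```go
package main

import (
	"fmt"
	"regexp"
)

// countMatches returns how many non-overlapping matches pattern has in input.
// It is a helper written for this example.
func countMatches(pattern, input string) int {
	return len(regexp.MustCompile(pattern).FindAllString(input, -1))
}

func main() {
	input := "Hello World\nHello Golang\nHello Regexp"

	fmt.Println(countMatches(`^Hello`, input))     // 1: ^ only matches at the start of the text
	fmt.Println(countMatches(`(?m)^Hello`, input)) // 3: ^ also matches after each newline
}
```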

The `s` flag

The s flag lets the dot (.) match newline characters. By default, the dot matches any character except a newline. When this flag is set, the regular expression engine treats the input string as a one-line string, so a pattern such as .* can match across line breaks.

This is particularly useful when you want to match patterns that span multiple lines. For example, if you want to match the word 'Pattern' followed by any characters, including newlines, and then the word 'Example', you can do this in Go:

package main

import (
 "fmt"
 "regexp"
)

func main() {
 text := "Pattern\nspanning\nmultiple lines.\nExample."

 pattern := regexp.MustCompile(`(?s)Pattern.*Example`)
 patternMatches := pattern.FindAllString(text, -1)
 fmt.Println("Single line matches:", patternMatches)
} // returns Single line matches: [Pattern
// spanning
// multiple lines.
// Example]
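
The following sketch contrasts the same pattern with and without the s flag, and shows that flags can be combined inside one (?...) group. The matches helper is a name made up for this example:

```go
package main

import (
	"fmt"
	"regexp"
)

// matches reports whether pattern matches anywhere in text.
// It is a helper written for this example.
func matches(pattern, text string) bool {
	return regexp.MustCompile(pattern).MatchString(text)
}

func main() {
	text := "Pattern\nspanning\nmultiple lines.\nExample."

	fmt.Println(matches(`Pattern.*Example`, text))      // false: . stops at the first newline
	fmt.Println(matches(`(?s)Pattern.*Example`, text))  // true: . now matches newlines
	fmt.Println(matches(`(?is)pattern.*example`, text)) // true: i and s combined
}
```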

We’ll explore metacharacters in the next section.

Metacharacters

In regular expressions, metacharacters are special characters used to match more characters while keeping the pattern as simple as possible. We'll look at the most common ones and how to use them in Go in the following sections.

Character classes

Character classes allow you to match a specific set of characters. \d matches any digit, \w matches any word character (letters, digits, and underscore), and \s matches any whitespace character. In Go, these shorthand classes cover ASCII characters only. For example, you can find all words that contain at least one digit character in Go:

package main

import (
 "fmt"
 "regexp"
)

func main() {
 text := "The price is $20, the code is ABC123."

 pattern := regexp.MustCompile(`\w*\d\w*`) // \w* (not \w+) so that a lone digit such as "5" also counts
 matches := pattern.FindAllString(text, -1)

 fmt.Println(matches) // returns [20 ABC123]
}

Note: Capitalizing a shorthand character class makes it do the exact opposite. This means \D will match any non-digit character, \W will match any non-word character (anything other than letters, digits, and underscore), and \S will match any non-whitespace character.
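
Here is a short sketch that puts a few of these classes next to their negated forms. The findAll helper and the sample text are invented for this example, and %q is used so that spaces in the results stay visible:

```go
package main

import (
	"fmt"
	"regexp"
)

// findAll returns every match of pattern in text.
// It is a helper written for this example.
func findAll(pattern, text string) []string {
	return regexp.MustCompile(pattern).FindAllString(text, -1)
}

func main() {
	text := "Order 66, aisle B-12"

	fmt.Printf("%q\n", findAll(`\d+`, text)) // ["66" "12"]
	fmt.Printf("%q\n", findAll(`\D+`, text)) // ["Order " ", aisle B-"]
	fmt.Printf("%q\n", findAll(`\w+`, text)) // ["Order" "66" "aisle" "B" "12"]
	fmt.Printf("%q\n", findAll(`\W+`, text)) // [" " ", " " " "-"]
}
```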

Negated character set

A negated character set allows you to match any character except those specified in the set. It is denoted by placing ^ immediately after the opening bracket. For example, you can find all characters in a string that are neither vowels nor whitespace in Go:

package main

import (
 "fmt"
 "regexp"
)

func main() {
 text := "Hello, how are you?"

 pattern := regexp.MustCompile(`[^aeiouAEIOU\s]`)
 matches := pattern.FindAllString(text, -1)

 fmt.Println(matches) // returns [H l l , h w r y ?]
}

Ranges

Ranges allow you to specify a range of characters using a hyphen within a character class. For instance, [0-9] matches any digit. For example, you can match two-digit numbers from 10 to 49 in a text in Go:

package main

import (
 "fmt"
 "regexp"
)

func main() {
 text := "The numbers are 5, 25, 49, and 60."

 pattern := regexp.MustCompile(`[1-4][0-9]`)
 matches := pattern.FindAllString(text, -1)

 fmt.Println(matches) // returns [25 49]
}
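
Note that [1-4][0-9] on its own stops at 49 and would also match the "12" inside a longer number such as 125. One possible way to cover 10 through 50 inclusive is sketched below. It assumes the numbers are separated by non-word characters and borrows two features covered later in this article: word boundaries (\b) and alternation (|). The variable name tenToFifty is made up for the example:

```go
package main

import (
	"fmt"
	"regexp"
)

// tenToFifty matches whole numbers from 10 to 50.
// (?:...) groups the two alternatives without capturing them.
var tenToFifty = regexp.MustCompile(`\b(?:[1-4][0-9]|50)\b`)

func main() {
	text := "The numbers are 5, 10, 25, 49, 50, 51, 125, and 60."
	fmt.Println(tenToFifty.FindAllString(text, -1)) // [10 25 49 50]
}
```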

You can use ranges with letters in the same way:

package main

import (
 "fmt"
 "regexp"
)

func main() {
 text := "The quick brown fox jumps over the lazy dog."

 // Match any lowercase letter from 'a' to 'z'
 pattern := regexp.MustCompile("[a-z]")
 lowercaseLetters := pattern.FindAllString(text, -1)
 fmt.Println("Lowercase letters:", lowercaseLetters)

 // Match any word character (alphanumeric + underscore)
 pattern = regexp.MustCompile("[A-Za-z0-9_]")
 wordCharacters := pattern.FindAllString(text, -1)
 fmt.Println("Word characters:", wordCharacters)

 // Match any character that is NOT a space
 pattern = regexp.MustCompile("[^ ]")
 nonSpaceCharacters := pattern.FindAllString(text, -1)
 fmt.Println("Non-space characters:", nonSpaceCharacters)
}

The code above describes how to work with ranges with letters and should return the following result:

Lowercase letters: [h e q u i c k b r o w n f o x j u m p s o v e r t h e l a z y d o g]
Word characters: [T h e q u i c k b r o w n f o x j u m p s o v e r t h e l a z y d o g]
Non-space characters: [T h e q u i c k b r o w n f o x j u m p s o v e r t h e l a z y d o g .]

We’ll explore quantifiers in the next section.

Quantifiers

A quantifier specifies how many times the preceding character or group of characters must appear for the pattern to match. The following sections explain the special characters that represent quantifiers.

Asterisk (`*`)

The asterisk * symbol allows you to match zero or more occurrences of the preceding character or group. For example, if you want to match a k together with any os that come directly before it, you can use the asterisk like this in Go:

package main

import (
 "fmt"
 "regexp"
)

func main() {
 pattern := regexp.MustCompile(`o*k`)
 fmt.Println(pattern.FindString("oogenesis")) // returns ""
 fmt.Println(pattern.FindString("cook"))      // returns "ook"
 fmt.Println(pattern.FindString("oocyst"))    // returns ""
 fmt.Println(pattern.FindString("book"))      // returns "ook"
 fmt.Println(pattern.FindString("kite"))      // returns "k" (zero os before the k)
}

The code above uses the asterisk in a pattern and the FindString function to return the matched characters or an empty string if there is no match.

Plus (`+`)

The plus + symbol indicates that the preceding character(s) must occur one or more times in the given string. Here is an example:

package main

import (
 "fmt"
 "regexp"
)

func main() {
 pattern := regexp.MustCompile(`go+`)
 fmt.Println(pattern.FindAllString("goo", -1))   // returns [goo]
 fmt.Println(pattern.FindAllString("goooo", -1)) // returns [goooo]
 fmt.Println(pattern.FindAllString("g", -1))     // returns []
}

The code above matches the letter 'g' followed by one or more 'o' characters.

Curly braces (`{min,max}`)

You can go beyond the '+' quantifier by using curly braces to define how many times the preceding element may occur. For example, you can match strings that have a letter 'g' with a minimum of 2 and a maximum of 4 letters 'o' after it:

package main

import (
 "fmt"
 "regexp"
)

func main() {
 pattern := regexp.MustCompile(`go{2,4}`)
 fmt.Println(pattern.FindAllString("go", -1))        // returns []
 fmt.Println(pattern.FindAllString("goo", -1))       // returns [goo]
 fmt.Println(pattern.FindAllString("goooooooo", -1)) // returns [goooo]
}

Note: The pattern will return just the first four letters 'o' after the letter 'g' if there are more than 4.
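
Curly braces also accept two shorter forms: {n} for exactly n occurrences and {n,} for n or more. The sketch below tries both; the findAll helper and the sample text are made up for this example:

```go
package main

import (
	"fmt"
	"regexp"
)

// findAll returns every match of pattern in text.
// It is a helper written for this example.
func findAll(pattern, text string) []string {
	return regexp.MustCompile(pattern).FindAllString(text, -1)
}

func main() {
	text := "go goo gooo goooo"

	fmt.Println(findAll(`go{3}`, text))  // exactly three: [gooo gooo]
	fmt.Println(findAll(`go{2,}`, text)) // two or more: [goo gooo goooo]
}
```

In the first result, the second gooo comes from goooo, because {3} stops after three letters 'o'.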

Lazy quantifiers

By default, quantifiers are greedy, meaning they try to match as much as possible. Adding a ? after a quantifier makes it lazy, causing it to match as little as possible.

For example, if you want to match the first quoted word or words in a string, you can use a lazy quantifier to do so:

package main

import (
 "fmt"
 "regexp"
)

func main() {
 pattern := regexp.MustCompile(`".*?"`)
 fmt.Println(pattern.FindString(`"foo" "bar" "baz"`)) // returns "foo"
}

The code above uses the FindString function to only match the first word(s) enclosed in "" regardless of the characters it contains.
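
Running the greedy and lazy versions side by side makes the difference easier to see. This is a small sketch; the variable names greedyQuote and lazyQuote are chosen for illustration:

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	greedyQuote = regexp.MustCompile(`".*"`)  // .* takes as much as it can
	lazyQuote   = regexp.MustCompile(`".*?"`) // .*? takes as little as it can
)

func main() {
	text := `"foo" "bar" "baz"`

	// Greedy: runs from the first quote to the last quote in the text.
	fmt.Println(greedyQuote.FindString(text))

	// Lazy: stops at the first closing quote.
	fmt.Println(lazyQuote.FindString(text))

	// Lazy with FindAllString: one match per quoted word.
	fmt.Println(len(lazyQuote.FindAllString(text, -1))) // 3
}
```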

We’ll explore repetition in the next section.

Repetition

Repetition refers to how often a pattern can repeat within the input string. You've seen how to implement repetition with quantifiers in the previous sections. The following sections show how to repeat whole groups with parentheses and cover two related building blocks: the dot and the pipe.

Parentheses (`()`)

Parentheses group parts of a regex, allowing you to apply quantifiers or other operations to the whole group instead of a single character. For example, to redact a text that has multiple instances of the word "kill", you can use parentheses with the ReplaceAll method in Go:

package main

import (
 "fmt"
 "regexp"
)

func main() {
 text := `The detective carefully examined the crime scene. It was a gruesome sight—the victim had been brutally killed. The evidence pointed towards a professional hitman as the perpetrator. The investigators were determined to solve the case and bring the killer to justice. The motive behind the killing remained unclear, but the police were determined to uncover the truth.`
 pattern := regexp.MustCompile(`(kill)+`)
 modStr := pattern.ReplaceAll([]byte(text), []byte("---"))
 fmt.Println(string(modStr))
}

The code above replaces all instances of the word kill in the given text with --- and returns it like this:

The detective carefully examined the crime scene. It was a gruesome sight—the victim had been brutally ---ed. The evidence pointed towards a professional hitman as the perpetrator. The investigators were determined to solve the case and bring the ---er to justice. The motive behind the ---ing remained unclear, but the police were determined to uncover the truth.
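
The effect of the parentheses is easiest to see when the quantifier is applied with and without them. In this sketch, the findAll helper and the sample text are made up for illustration:

```go
package main

import (
	"fmt"
	"regexp"
)

// findAll returns every match of pattern in text.
// It is a helper written for this example.
func findAll(pattern, text string) []string {
	return regexp.MustCompile(pattern).FindAllString(text, -1)
}

func main() {
	text := "hahaha haaa"

	// Without a group, + applies only to the letter a.
	fmt.Println(findAll(`ha+`, text)) // [ha ha ha haaa]

	// With a group, + applies to the whole sequence "ha".
	fmt.Println(findAll(`(ha)+`, text)) // [hahaha ha]
}
```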

Dot (`.`)

The dot (.) matches any single character in a specific position, except a newline unless the s flag is set. For example, you can use it to match any three-character sequence that starts with the letter h and ends with a t:

package main

import (
 "fmt"
 "regexp"
)

func main() {
 pattern := regexp.MustCompile(`h.t`) 
 fmt.Println(pattern.MatchString("hat")) // true
 fmt.Println(pattern.MatchString("hot")) // true
 fmt.Println(pattern.MatchString("bit")) // false
}

Pipe (`|`)

The | character is used for alternation: it allows you to match either the pattern on its left or the pattern on its right. For example, you can match either cat or dog in a given text:

package main

import (
 "fmt"
 "regexp"
)

func main() {
 pattern := regexp.MustCompile(`dog|cat`)
 fmt.Println(pattern.MatchString("dog"))  // true
 fmt.Println(pattern.MatchString("cat"))  // true
 fmt.Println(pattern.MatchString("bird")) // false
}

We’ll look at how to escape special characters in the next section.

Escaping

Understanding how to escape special characters is essential to avoid errors when using regular expressions. For example, if you want to match the dot . character in a string, you need to escape it using the backslash \ character:

package main

import (
 "fmt"
 "regexp"
)

func main() {
 re := regexp.MustCompile(`\d+\.\d+`)
 fmt.Println(re.MatchString("3.14"))  // true
 fmt.Println(re.MatchString("42"))    // false
}

The code uses a \ in the pattern to avoid the special behavior of the . in the pattern.
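
With quantifiers, character classes, alternation, and escaping covered, the URL pattern from the beginning of this article can be read piece by piece. The sketch below annotates it. The one piece not covered so far is ?, which as a quantifier means "zero or one". The variable name urlPattern and the sample text are made up for this example:

```go
package main

import (
	"fmt"
	"regexp"
)

// urlPattern is the pattern from the start of the article.
//
//	https?  "http" followed by an optional "s" (? means zero or one)
//	://     the literal characters "://"
//	\S+     one or more non-whitespace characters
//	|       or
//	www\.   the literal text "www." (the dot is escaped)
//	\S+     one or more non-whitespace characters
var urlPattern = regexp.MustCompile(`https?://\S+|www\.\S+`)

func main() {
	text := "Docs: http://example.com and https://go.dev, mirror at www.example.org"
	fmt.Println(urlPattern.FindAllString(text, -1))
	// [http://example.com https://go.dev, www.example.org]
}
```

Notice that the second match keeps its trailing comma, because \S+ accepts any non-whitespace character. A stricter pattern would be needed if that matters for your use case.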

Anchors

Anchors are used to specify the position of a match within a string. The two most common anchors are ^ (caret) for the start of a line/string and $ (dollar) for the end of a line/string.

For example, you can match sentences that start with the word 'start' and end with the word 'end':

package main

import (
 "fmt"
 "regexp"
)

func main() {
 pattern := regexp.MustCompile(`^start.*end$`)
 fmt.Println(pattern.MatchString("start at the end"))    // true
 fmt.Println(pattern.MatchString("start at the bottom")) // false
}
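
Anchors matter most when you validate input, because MatchString returns true as soon as the pattern matches anywhere in the text. The following sketch compares an unanchored and an anchored pattern; the variable names are made up for this example:

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	containsFourDigits = regexp.MustCompile(`\d{4}`)   // four digits anywhere
	exactlyFourDigits  = regexp.MustCompile(`^\d{4}$`) // four digits and nothing else
)

func main() {
	for _, s := range []string{"2024", "abc12345xyz"} {
		fmt.Println(s, containsFourDigits.MatchString(s), exactlyFourDigits.MatchString(s))
	}
	// 2024 true true
	// abc12345xyz true false
}
```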

Word boundaries

Word boundaries, written as \b, match positions between a word character (letters, digits, and underscore) and a non-word character, or the start or end of the text. For example, to find every word that starts with 'bet', regardless of case, you can put a word boundary at the start of the pattern:

package main

import (
 "fmt"
 "regexp"
)

func main() {
 pattern := regexp.MustCompile(`(?i)\bbet`)
 fmt.Println(pattern.FindAllString("Betty's better bet was to buy the blue blouse.", -1)) // [Bet bet bet]
}

You can do the opposite by putting the word boundary at the end of the pattern:

package main

import (
 "fmt"
 "regexp"
)

func main() {
 re := regexp.MustCompile(`(?i)sion\b`)
 fmt.Println(re.FindAllString("After much discussion, the team came to a consensus on the vision for the project.", -1)) // [sion sion]
}

The code above returns instances of words that end in "sion" in the string.

There's more. You can also match exact words by adding word boundaries to the start and end of the pattern:

package main

import (
 "fmt"
 "regexp"
)

func main() {
 re := regexp.MustCompile(`(?i)\bcat\b`)
 fmt.Println(re.FindAllString("cat is in the feline category of animals", -1)) // [cat]
}

The code above returns only the standalone word cat and ignores the cat inside category.

Grouping

Grouping is used to create sub-patterns within a regular expression by enclosing them in parentheses (). This is useful for applying quantifiers or alternations to a group of characters. For example, you can combine several groups of alternatives in one pattern:

package main

import (
 "fmt"
 "regexp"
)

func main() {
 pattern := regexp.MustCompile(`(quick|lazy) (brown|gray) (fox|dog)`)
 text := "The lazy gray dog jumps over the lazy fox"
 fmt.Println(pattern.FindAllString(text, -1)) // returns [lazy gray dog]
}

The code above matches one alternative from each of the three groups, in order and separated by single spaces.

Capturing groups

Capturing groups are used to group and capture parts of a pattern in a regular expression. Capturing groups are enclosed within parentheses, instructing the regular expression engine to remember the matched content within the group. For example, you can use capturing groups to rearrange an incorrect date format:

package main

import (
 "fmt"
 "regexp"
)

func main() {
 pattern := regexp.MustCompile(`(\d{2})-(\d{2})-(\d{4})`)
 modifiedDate := pattern.ReplaceAllString("Date: 12-31-2022", "$2/$1/$3")
 fmt.Println(modifiedDate)
}

The code above rearranges the given date from a month-day-year format to a day/month/year format. In the replacement string, $1, $2, and $3 refer to the text captured by the first, second, and third groups. It returns the new date:

Date: 31/12/2022
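
Numbered references get hard to follow as patterns grow. Go also supports named groups, written as (?P<name>...), which you can reference as ${name} in a replacement string or look up with the SubexpIndex method. Here is a sketch of the same date rearrangement with named groups; the names dateRe and reorder are made up for this example:

```go
package main

import (
	"fmt"
	"regexp"
)

// dateRe captures the three parts of a month-day-year date by name.
var dateRe = regexp.MustCompile(`(?P<month>\d{2})-(?P<day>\d{2})-(?P<year>\d{4})`)

// reorder rewrites month-day-year dates in s as day/month/year.
func reorder(s string) string {
	return dateRe.ReplaceAllString(s, "${day}/${month}/${year}")
}

func main() {
	fmt.Println(reorder("Date: 12-31-2022")) // Date: 31/12/2022

	// SubexpIndex returns the position of a named group in the match slice.
	m := dateRe.FindStringSubmatch("12-31-2022")
	fmt.Println(m[dateRe.SubexpIndex("year")]) // 2022
}
```

The braces are worth keeping even for numbered groups: a reference like $1x is read as a group named 1x, whereas ${1}x is read as group 1 followed by the letter x.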

Matching Unicode in regular expressions

Handling Unicode text is just as important as handling ASCII text, because it makes your applications usable by people who write in other languages and scripts.

Matching Unicode text with regular expressions in Go is done with the Unicode character classes defined in the RE2 syntax specification, written as \p{Name}, where Name is a script or a general category. For example, to match Chinese characters, you need to use the Han script:

package main

import (
 "fmt"
 "regexp"
)

func main() {
 text := "Hello world! 世界你好！"

 pattern := regexp.MustCompile(`\p{Han}`)
 fmt.Println(pattern.FindAllString(text, -1))
}

The code above returns the Chinese characters in the given string in the terminal:

[世 界 你 好]

Note: The complete list of supported Unicode scripts and categories can be found in the RE2 syntax specification.
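
Unicode classes also help with accented letters, because \w in Go only covers ASCII letters, digits, and underscore. The sketch below compares \w+ with \p{L}+, where L is the Unicode category for letters. The variable names are made up, and the sample text uses \u escapes so the accented characters are unambiguous:

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	asciiWord   = regexp.MustCompile(`\w+`)    // ASCII letters, digits, underscore
	unicodeWord = regexp.MustCompile(`\p{L}+`) // letters from any script
)

func main() {
	text := "na\u00efve caf\u00e9" // "naïve café"

	fmt.Println(asciiWord.FindAllString(text, -1))   // [na ve caf]
	fmt.Println(unicodeWord.FindAllString(text, -1)) // [naïve café]
}
```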

Handling regex tasks in Go

Now that you understand the regular expressions needed to create complex patterns, how the Go regexp package helps you to match and work with them, and most importantly, how to use the RE2 syntax, you are ready to start working with regular expressions in Go applications.

In this section, we will explore some of the most common use cases of regular expressions in software development and how to solve them with the help of the Go regexp package.

Form validation

One of the most common use cases of regular expressions in software development is form validation. For example, you can quickly validate email addresses before allowing a user to sign up on your application:

package main

import (
 "fmt"
 "regexp"
)

func main() {
 pattern := `^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$`
 emailRegex := regexp.MustCompile(pattern)

 email := "user@example.com"
 if emailRegex.MatchString(email) {
  fmt.Println("Valid email address")
 } else {
  fmt.Println("Invalid email address")
 }
}

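The code above uses a regular expression pattern with the MatchString method to check whether the input looks like an email address before printing a message based on that. Keep in mind that this pattern is a practical approximation: it accepts the common address shapes but does not implement the full email address specification.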
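
In a real application you would typically wrap the check in a function and compile the pattern once. Here is a minimal sketch of that idea with a few sample inputs; the names emailRe and isValidEmail are made up for this example:

```go
package main

import (
	"fmt"
	"regexp"
)

// emailRe is compiled once at package level and reused for every check.
var emailRe = regexp.MustCompile(`^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$`)

// isValidEmail reports whether s has the shape of an email address
// according to the simple pattern above.
func isValidEmail(s string) bool {
	return emailRe.MatchString(s)
}

func main() {
	inputs := []string{
		"user@example.com",                // true
		"first.last+tag@mail.example.org", // true
		"user@example",                    // false: no top-level domain
		"user name@example.com",           // false: contains a space
		"@example.com",                    // false: nothing before the @
	}
	for _, in := range inputs {
		fmt.Printf("%-32s %v\n", in, isValidEmail(in))
	}
}
```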

Extract specific data from the text

Similar to the use case above, you can use regular expressions to extract phone numbers from a user’s bio to automatically fill the phone number field:

package main

import (
 "fmt"
 "regexp"
)

func main() {
 phonePattern := `(\d{3})-(\d{3})-(\d{4})`
 phoneRegex := regexp.MustCompile(phonePattern)

 text := "Contact us at 123-456-7890"
 matches := phoneRegex.FindStringSubmatch(text)

 if len(matches) >= 4 {
  areaCode, prefix, number := matches[1], matches[2], matches[3]
  formattedNumber := fmt.Sprintf("(%s) %s-%s", areaCode, prefix, number)
  fmt.Println("Formatted Phone Number:", formattedNumber) // Formatted Phone Number: (123) 456-7890
 } else {
  fmt.Println("No phone number found")
 }
}

The code above extracts the phone number from the given text and formats it properly for further use.
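
FindStringSubmatch stops at the first match. If a text may contain several phone numbers, FindAllStringSubmatch returns the groups for each one. The sketch below builds on the same pattern; the names phoneRe and formatPhones are made up for this example:

```go
package main

import (
	"fmt"
	"regexp"
)

var phoneRe = regexp.MustCompile(`(\d{3})-(\d{3})-(\d{4})`)

// formatPhones finds every 123-456-7890 style number in text and
// returns each one formatted as (123) 456-7890.
func formatPhones(text string) []string {
	var out []string
	for _, m := range phoneRe.FindAllStringSubmatch(text, -1) {
		// m[0] is the whole match; m[1], m[2], and m[3] are the groups.
		out = append(out, fmt.Sprintf("(%s) %s-%s", m[1], m[2], m[3]))
	}
	return out
}

func main() {
	text := "Call 123-456-7890 or 987-654-3210 after 5pm."
	for _, p := range formatPhones(text) {
		fmt.Println(p)
	}
	// (123) 456-7890
	// (987) 654-3210
}
```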

Masking credit card numbers

Another common use case of regular expressions is masking users' credit card numbers before they're displayed on the app. Here is an example of how to do this in Go:

package main

import (
 "fmt"
 "regexp"
)

func main() {
 cardNumber := "1234-5678-9012-3456"
 re := regexp.MustCompile(`(\d{4})-(\d{4})-(\d{4})-(\d{4})`)
 masked := re.ReplaceAllString(cardNumber, "$1-****-****-****") // returns 1234-****-****-****
 fmt.Println(masked)
}

The code above only shows the first four digits of the card number and masks the rest by replacing them with '****'.

HTML tag cleanup

Allowing users to submit HTML tags to your application is risky, as it is a common vector for attacks such as cross-site scripting (XSS). You can strip simple HTML tags from user input using regular expressions before processing it. Treat this as a quick cleanup rather than a complete sanitizer, since a pattern like this does not handle malformed markup:

package main

import (
 "fmt"
 "regexp"
)

func main() {
 html := "<p>Hello <strong>world</strong></p>"
 re := regexp.MustCompile(`<[^>]*>`)
 cleaned := re.ReplaceAllString(html, "")
 fmt.Println(cleaned)
}

The code above removes all the HTML syntax from the string before processing it.
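
If you would rather keep what the user typed and only make it safe to display, escaping is an alternative to stripping. The following sketch puts the regex approach next to html.EscapeString from Go's standard library; the names tagRe and stripTags are made up for this example:

```go
package main

import (
	"fmt"
	"html"
	"regexp"
)

var tagRe = regexp.MustCompile(`<[^>]*>`)

// stripTags removes anything that looks like an HTML tag from s.
func stripTags(s string) string {
	return tagRe.ReplaceAllString(s, "")
}

func main() {
	input := "<p>Hello <strong>world</strong></p>"

	// Stripping throws the markup away.
	fmt.Println(stripTags(input)) // Hello world

	// Escaping keeps the text but turns the markup into harmless entities.
	fmt.Println(html.EscapeString(input))
	// &lt;p&gt;Hello &lt;strong&gt;world&lt;/strong&gt;&lt;/p&gt;
}
```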

Extracting hex color codes

If you are building an application for designers, you might want to create a brilliant helper that detects and extracts hex color codes from their description to help document the design they’re talking about. Here is an example of how to do this in Go:

package main

import (
 "fmt"
 "regexp"
)

func main() {
 css := "The colors used in this project are #FFFFFF, #000000, and #FF0000."
 re := regexp.MustCompile(`#[0-9A-Fa-f]{6}`)
 colors := re.FindAllString(css, -1)
 fmt.Println(colors) // returns [#FFFFFF #000000 #FF0000]
}

The code above extracts the hex color codes detected in the text.
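
Hex colors are often written in a three-digit shorthand as well, such as #FFF. One way to accept both lengths is sketched below. It assumes a color code ends at a word boundary, so a malformed five-digit code is not partially matched. The variable name hexColorRe is made up for this example:

```go
package main

import (
	"fmt"
	"regexp"
)

// hexColorRe matches six-digit or three-digit hex colors.
// The trailing \b rejects codes with the wrong number of digits.
var hexColorRe = regexp.MustCompile(`#(?:[0-9A-Fa-f]{6}|[0-9A-Fa-f]{3})\b`)

func main() {
	text := "Valid: #FFF, #1a2b3c. Invalid: #12345, #GGGGGG."
	fmt.Println(hexColorRe.FindAllString(text, -1)) // [#FFF #1a2b3c]
}
```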

Extract hashtags

A similar use case to the above is extracting hashtags from a given text. You can do this easily in Go:

package main

import (
 "fmt"
 "regexp"
)

func main() {
 text := "Exciting times ahead! #NewProject #LaunchDay"
 re := regexp.MustCompile(`#\w+`)
 hashtags := re.FindAllString(text, -1)
 fmt.Println(hashtags) // [#NewProject #LaunchDay]
}

The code above extracts and returns hashtags from the given text.

Conclusion

Regular expressions are arguably one of the most essential concepts in programming for every developer to understand, as they allow you to perform complex string manipulation tasks easily using patterns.

In this article, you've learned about regular expressions and the core concepts of the RE2 syntax that Go uses, including flags, character sets, ranges, and repetition.

Whew! That was a long one! Thank you so much for reading. I hope this article has achieved its aim of demystifying regular expressions and how you can use them with the Go regexp package to perform different string manipulation tasks in Go applications.