Strings are arguably the most critical data type in programming, as they make so many things possible. For example, without strings, there would be no emails, text messages, or almost any other computer activity that you can think of, because text is the most accessible way of passing information to both technical and non-technical people.
Virtually every programming language, whether low-level or high-level, has a way to represent and handle strings because they are the backbone of most useful programs. In this article, you will learn what you need to know to start manipulating strings with regular expressions in Go.
Prerequisites
The only prerequisite for following along with this tutorial is a basic knowledge of Go. You can get familiar with Go, or quickly brush up on it, by going through my Go Beginner Series.
With the prerequisites out of the way, let's explore what regular expressions are in the next section.
Regular expressions
Regular expressions, also known as regex, are sequences of characters that define a search pattern. They can be used to perform tasks such as search and replace or input validation.
To put what regular expressions can do for you into context, let’s look at how you would solve the same problem with and without them. For example, if you need to implement a function that finds and extracts URLs from a given text, you can implement it without regular expressions in Go like this:
package main
import "strings"
func extractURLs(input string) []string {
var urls []string
words := strings.Fields(input)
for _, word := range words {
if strings.HasPrefix(word, "http://") || strings.HasPrefix(word, "https://") || strings.HasPrefix(word, "www.") {
urls = append(urls, word)
}
}
return urls
}
The code above defines an extractURLs function that takes in a string and checks if it has a URL in it by breaking it down and looping through the words to find any word that starts with http://, https://, or www. and returns all of them as a slice.
Now, let's see how we can use regular expressions to solve this problem in Go:
import "regexp"
func extractURLs(input string) []string {
pattern := regexp.MustCompile(`https?://\S+|www\.\S+`)
return pattern.FindAllString(input, -1)
}
The code above does the same thing as the previous code block but with fewer lines of code; this is how regular expressions can help you write elegant code with less.
You would use the code above inside a Go main function like this:
func main() {
input := `You can find me on https://www.google.com https://www.facebook.com www.twitter.com`
urls := extractURLs(input)
for _, url := range urls {
println(url)
}
}
You should then get a result that looks like this in the terminal:
https://www.google.com
https://www.facebook.com
www.twitter.com
Note: Don't worry if you don't understand the regular expression I used above; you will learn how it was composed and even start writing much more complex ones once you finish reading this article.
Regular expressions in Go
In Go, regular expressions are provided by the standard library's regexp package, which accepts the same syntax as RE2. RE2 is a regular expression library developed at Google that emphasizes efficiency and safety, and both it and Go's regexp package are designed to match in linear time. This means that the time it takes to match a string with a regular expression is proportional to the length of the string. In contrast, traditional backtracking regular expression engines can exhibit exponential time complexity for certain patterns, which makes them vulnerable to catastrophic backtracking and the performance issues that come with it.
However, while this design offers better performance and safety, it lacks some of the more advanced features found in other regex engines, such as backreferences and lookarounds. Repetition counts written in the {min,max} form are also capped at 1000. Therefore, if your use case requires features RE2 does not support, you might need to use a different library or approach.
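As a rough illustration of these limits, the sketch below passes a few patterns to regexp.Compile, which returns an error for syntax the package does not accept instead of panicking. The compiles helper and the sample patterns are made up for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// compiles reports whether Go's regexp package accepts the pattern.
// It is a hypothetical helper used only for this illustration.
func compiles(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

func main() {
	patterns := []string{
		`(\w+) \1`,   // backreference
		`foo(?=bar)`, // lookahead
		`go{2,4}`,    // bounded repetition
	}
	for _, p := range patterns {
		fmt.Printf("%q accepted: %v\n", p, compiles(p))
	}
}
```

The first two patterns are rejected and the third is accepted. Using regexp.Compile rather than regexp.MustCompile is usually the safer choice when the pattern comes from user input, because MustCompile panics on an invalid pattern.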
Note: To understand the reasons RE2 was developed and the ideas behind it, read the series of articles on regular expression matching by Russ Cox, the author of RE2 and a longtime member of the Go team.
Regular expressions: syntax and concepts
In this section, we will explore regular expression syntax and concepts while using Go's regexp package to see how each one works in detail.
Literal patterns
Literal patterns are characters that precisely match themselves in a regular expression. For example, the regular expression "Northern" would match the word "Northern" in a text like this:
package main
import (
"fmt"
"regexp"
)
func main() {
pattern := regexp.MustCompile("Northern")
text := "New York is in the Northern part of the United States."
if pattern.MatchString(text) {
fmt.Println("Correct region!")
} else {
fmt.Println("Wrong region!")
}
}
The code above imports the fmt and regexp packages, defines a text variable containing the given string, and defines a pattern variable that uses the MustCompile function from the regexp package to compile the pattern. It then calls the MatchString method with the text inside an if statement to print a message based on the result.
Note: Regular expressions are case-sensitive by default, so the code above would print Wrong region! if "Northern" in the text string was written in all lowercase.
Special characters
Regular expressions wouldn’t be handy if we could only use literal patterns as described above. Special characters are used to match much more extensive patterns in our code. There are different types of special characters in regular expressions, and we will look at the ones supported in the RE2 syntax specification in this article.
We’ll explore the special characters called flags and what they are used for in the next section.
Flags
In regular expressions, flags change how a pattern is matched. Let's explore some of the common regular expression flags in this section.
The i flag
The i flag, also known as the “ignore case” flag, is used in regular expressions to make pattern-matching case-insensitive. When this flag is included in a regular expression, it allows the pattern to match both uppercase and lowercase versions of letters without distinction.
For example, to match strings like “Apple”, “aPple”, “APPlE”, and so on, in addition to just “apple”, you can use the i flag in Go:
package main
import (
"fmt"
"regexp"
)
func main() {
text := "Apples are delicious, but so are apples, aPpleS, and APPlE."
pattern := regexp.MustCompile(`(?i)apple`)
appleMatches := pattern.FindAllString(text, -1)
fmt.Println("Case-insensitive matches for 'apple':", appleMatches)
}
The code above uses the FindAllString method to get all the matches for the word apple, regardless of letter case, like so:
Case-insensitive matches for 'apple': [Apple apple aPple APPlE]
The g flag
The g flag, known as the "global" flag in engines such as JavaScript's, performs a global search for a pattern within the given string. Without this flag, those engines only match the first occurrence of the pattern in the string.
Go doesn't support the g flag. To match all occurrences of a pattern in a string, use the FindAllString method instead, as we did in the previous code block.
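To make the difference concrete, here is a minimal sketch that contrasts FindString, which returns only the leftmost match, with FindAllString, whose second argument limits the number of matches. The firstMatch and upTo helpers are hypothetical names used only for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

var applePattern = regexp.MustCompile(`(?i)apple`)

// firstMatch returns only the leftmost match, or "" if there is none.
func firstMatch(text string) string {
	return applePattern.FindString(text)
}

// upTo returns at most n matches; a negative n means "no limit".
func upTo(text string, n int) []string {
	return applePattern.FindAllString(text, n)
}

func main() {
	text := "Apples are delicious, but so are apples, aPpleS, and APPlE."
	fmt.Println(firstMatch(text)) // prints Apple
	fmt.Println(upTo(text, 2))    // prints [Apple apple]
	fmt.Println(upTo(text, -1))   // prints [Apple apple aPple APPlE]
}
```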
The m flag
The m flag, short for “multiline”, affects the behavior of the ^ and $ anchors in a regular expression. By default, ^ matches the beginning of the entire string, and $ matches the end of the entire string. However, when the m flag is used, these anchors also match the beginning and end of individual lines within a multiline string. This is especially useful when you have a multiline text and want to apply the pattern to each line separately.
For instance, to match lines that begin with “Hello,” you can use the m flag like so:
package main
import (
"fmt"
"regexp"
)
func main() {
input := "Hello World\nHello Golang\nHello Regexp"
re := regexp.MustCompile(`(?m)^Hello`)
matches := re.FindAllString(input, -1)
fmt.Println(matches) // returns [Hello Hello Hello]
}
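For comparison, the following sketch runs the same kind of pattern with and without the m flag, so the effect on both ^ and $ is visible side by side. The findAll helper is a hypothetical convenience wrapper, not part of the regexp package.

```go
package main

import (
	"fmt"
	"regexp"
)

// findAll compiles the pattern and returns every match in the input.
func findAll(pattern, input string) []string {
	return regexp.MustCompile(pattern).FindAllString(input, -1)
}

func main() {
	input := "Hello World\nHello Golang\nHello Regexp"

	fmt.Println(findAll(`^Hello`, input))     // prints [Hello]
	fmt.Println(findAll(`(?m)^Hello`, input)) // prints [Hello Hello Hello]
	fmt.Println(findAll(`\w+$`, input))       // prints [Regexp]
	fmt.Println(findAll(`(?m)\w+$`, input))   // prints [World Golang Regexp]
}
```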
The s flag
The s flag, short for "single line", lets the dot (.) match newline characters. By default, the dot matches any character except a newline, so a pattern such as .* stops at the end of a line. When this flag is set, the regular expression engine effectively treats the input as a one-line string, and .* can match across line breaks.
This is particularly useful when you want to match patterns that span multiple lines. For example, if you want to match the word 'Pattern' followed by any characters, including newlines, and then the word 'Example', you can do this in Go:
package main
import (
"fmt"
"regexp"
)
func main() {
text := "Pattern\nspanning\nmultiple lines.\nExample."
pattern := regexp.MustCompile(`(?s)Pattern.*Example`)
patternMatches := pattern.FindAllString(text, -1)
fmt.Println("Single line matches:", patternMatches)
} // returns Single line matches: [Pattern
// spanning
// multiple lines.
// Example]
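The sketch below shows what happens to the same text when the s flag is left out, and how flags can be combined in a single group such as (?is). The matches helper is a hypothetical name used only for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// matches reports whether the pattern matches anywhere in the text.
func matches(pattern, text string) bool {
	return regexp.MustCompile(pattern).MatchString(text)
}

func main() {
	text := "Pattern\nspanning\nmultiple lines.\nExample."

	// Without the s flag, the dot stops at the first newline.
	fmt.Println(matches(`Pattern.*Example`, text)) // prints false

	// With the s flag, the dot also matches newlines.
	fmt.Println(matches(`(?s)Pattern.*Example`, text)) // prints true

	// Flags can be combined: case-insensitive and single-line.
	fmt.Println(matches(`(?is)pattern.*example`, text)) // prints true
}
```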
We’ll explore metacharacters in the next section.
Metacharacters
In regular expressions, metacharacters are special characters used to match more characters while keeping the pattern as simple as possible. We'll look at the most common ones and how to use them in Go in the following sections.
Character classes
Character classes allow you to match a specific set of characters. For example, \d matches any digit, \w matches any word character (alphanumeric + underscore), and \s matches any whitespace character. Using these classes, you can find all words that contain at least one digit character in Go:
package main
import (
"fmt"
"regexp"
)
func main() {
text := "The price is $20, the code is ABC123."
pattern := regexp.MustCompile(`\w*\d\w*`)
matches := pattern.FindAllString(text, -1)
fmt.Println(matches) // returns [20 ABC123]
}
Note: Capitalizing the character classes will make them do the exact opposite. This means \D will match any non-digit character, \W will match any non-word character (anything other than alphanumeric characters and the underscore), and \S will match any non-whitespace character.
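As a small sketch of the negated classes in practice, the example below uses \D to strip everything that is not a digit and \S+ to split a string into runs of non-whitespace characters. The digitsOnly and tokens helpers are hypothetical names chosen for this illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	nonDigit     = regexp.MustCompile(`\D`)  // any character that is not a digit
	nonSpaceRuns = regexp.MustCompile(`\S+`) // one or more non-whitespace characters
)

// digitsOnly removes every non-digit character from s.
func digitsOnly(s string) string {
	return nonDigit.ReplaceAllString(s, "")
}

// tokens returns the runs of non-whitespace characters in s.
func tokens(s string) []string {
	return nonSpaceRuns.FindAllString(s, -1)
}

func main() {
	fmt.Println(digitsOnly("(123) 456-7890")) // prints 1234567890
	fmt.Println(tokens("  go   is fun "))     // prints [go is fun]
}
```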
Negated character set
A negated character set allows you to match any character except those specified in the set. It is denoted by placing ^ as the first character inside the square brackets. For example, you can find all characters in a string that are neither vowels nor whitespace in Go:
package main
import (
"fmt"
"regexp"
)
func main() {
text := "Hello, how are you?"
pattern := regexp.MustCompile(`[^aeiouAEIOU\s]`)
matches := pattern.FindAllString(text, -1)
fmt.Println(matches) // returns [H l l , h w r y ?]
}
Ranges
Ranges allow you to specify a range of characters using a hyphen within a character class. For instance, [0-9] matches any digit. For example, you can match two-digit numbers between 10 and 49 in a text in Go:
package main
import (
"fmt"
"regexp"
)
func main() {
text := "The numbers are 5, 25, 49, and 60."
pattern := regexp.MustCompile(`[1-4][0-9]`)
matches := pattern.FindAllString(text, -1)
fmt.Println(matches) // prints [25 49]
}
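One caveat worth sketching: a range pattern like this matches anywhere in the text, including inside longer numbers. Under the assumption that you only want standalone two-digit numbers, wrapping the pattern in \b word boundaries is one way to avoid partial matches. The findAll helper and the sample text are made up for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// findAll compiles the pattern and returns every match in the text.
func findAll(pattern, text string) []string {
	return regexp.MustCompile(pattern).FindAllString(text, -1)
}

func main() {
	text := "The numbers are 5, 25, 49, 60, 125, and 250."

	// Unanchored: also picks up "12" from 125 and "25" from 250.
	fmt.Println(findAll(`[1-4][0-9]`, text)) // prints [25 49 12 25]

	// With word boundaries: only standalone two-digit numbers match.
	fmt.Println(findAll(`\b[1-4][0-9]\b`, text)) // prints [25 49]
}
```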
You can use ranges with letters in the same way:
package main
import (
"fmt"
"regexp"
)
func main() {
text := "The quick brown fox jumps over the lazy dog."
// Match any lowercase letter from 'a' to 'z'
pattern := regexp.MustCompile("[a-z]")
lowercaseLetters := pattern.FindAllString(text, -1)
fmt.Println("Lowercase letters:", lowercaseLetters)
// Match any word character (alphanumeric + underscore)
pattern = regexp.MustCompile("[A-Za-z0-9_]")
wordCharacters := pattern.FindAllString(text, -1)
fmt.Println("Word characters:", wordCharacters)
// Match any character that is NOT a space
pattern = regexp.MustCompile("[^ ]")
nonSpaceCharacters := pattern.FindAllString(text, -1)
fmt.Println("Non-space characters:", nonSpaceCharacters)
}
The code above describes how to work with ranges with letters and should return the following result:
Lowercase letters: [h e q u i c k b r o w n f o x j u m p s o v e r t h e l a z y d o g]
Word characters: [T h e q u i c k b r o w n f o x j u m p s o v e r t h e l a z y d o g]
Non-space characters: [T h e q u i c k b r o w n f o x j u m p s o v e r t h e l a z y d o g .]
We’ll explore quantifiers in the next section.
Quantifiers
A quantifier specifies how many times the preceding character or group must appear for the pattern to match. The following sections explain the special characters that represent quantifiers.
Asterisk (*)
The asterisk (*) matches zero or more occurrences of the preceding character or group. For example, if you want to match a k preceded by any number of os, including none, you can use the asterisk like this in Go:
package main
import (
"fmt"
"regexp"
)
func main() {
pattern := regexp.MustCompile(`o*k`)
fmt.Println(pattern.FindString("oogenesis")) // returns ""
fmt.Println(pattern.FindString("cook")) // returns "ook"
fmt.Println(pattern.FindString("oocyst")) // returns ""
fmt.Println(pattern.FindString("book")) // returns "ook"
}
The code above uses the asterisk in a pattern and the FindString method to return the matched characters or an empty string if there is no match.
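Because the asterisk also accepts zero occurrences, a pattern such as o*k matches a bare k as well. The sketch below shows that case, along with the closely related ? quantifier (zero or one occurrence), which is part of the RE2 syntax even though it does not get its own section here. The variable names and sample strings are made up for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	zeroOrMore = regexp.MustCompile(`o*k`)     // any number of o's, then k
	optional   = regexp.MustCompile(`colou?r`) // the u may appear once or not at all
)

func main() {
	fmt.Printf("%q\n", zeroOrMore.FindString("kite")) // prints "k"
	fmt.Printf("%q\n", zeroOrMore.FindString("look")) // prints "ook"

	fmt.Println(optional.FindAllString("color, colour, colouur", -1)) // prints [color colour]
}
```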
Plus (+)
The plus (+) symbol indicates that the preceding character or group must occur one or more times in the given string. Here is an example:
package main
import (
"fmt"
"regexp"
)
func main() {
pattern := regexp.MustCompile(`go+`)
fmt.Println(pattern.FindAllString("goo", -1)) // returns [goo]
fmt.Println(pattern.FindAllString("goooo", -1)) // returns [goooo]
fmt.Println(pattern.FindAllString("g", -1)) // returns []
}
The code above matches a string that has the letter 'g' followed by one or more occurrences of the letter 'o'.
Curly Braces ({min,max})
You can extend the functionality of the '+' quantifier by using curly braces to define a specific range for the preceding element to occur. For example, you can match strings that have a letter 'g' with a minimum of 2 and a maximum of 4 letters 'o' after it:
package main
import (
"fmt"
"regexp"
)
func main() {
pattern := regexp.MustCompile(`go{2,4}`)
fmt.Println(pattern.FindAllString("go", -1)) // returns []
fmt.Println(pattern.FindAllString("goo", -1)) // returns [goo]
fmt.Println(pattern.FindAllString("goooooooo", -1)) // returns [goooo]
}
Note: The pattern will return just the first four letters 'o' after the letter 'g' if there are more than 4.
Lazy quantifiers
By default, quantifiers are greedy, meaning they try to match as much as possible. Adding a ? after a quantifier makes it lazy, causing it to match as little as possible.
For example, if you want to match the first quoted word or words in a string, you can use a lazy quantifier to do so:
package main
import (
"fmt"
"regexp"
)
func main() {
pattern := regexp.MustCompile(`".*?"`)
fmt.Println(pattern.FindString(`"foo" "bar" "baz"`)) // returns "foo"
}
The code above uses the FindString method to match only the first run of text enclosed in "", regardless of the characters it contains. Note that the returned match includes the surrounding quotation marks.
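To see why the lazy form matters, the following sketch runs the greedy and lazy versions of the same pattern against the same input. The variable names are hypothetical and chosen only for this comparison.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	greedy = regexp.MustCompile(`".*"`)  // matches as much as possible
	lazy   = regexp.MustCompile(`".*?"`) // matches as little as possible
)

func main() {
	input := `"foo" "bar" "baz"`

	// Greedy: runs from the first quote to the last one.
	fmt.Println(greedy.FindString(input)) // prints "foo" "bar" "baz"

	// Lazy: stops at the first closing quote.
	fmt.Println(lazy.FindString(input)) // prints "foo"

	// Lazy with FindAllString yields each quoted word separately.
	fmt.Println(len(lazy.FindAllString(input, -1))) // prints 3
}
```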
We’ll explore repetition in the next section.
Repetition
Repetition refers to how often a pattern can repeat within the input string. You've seen how to implement repetition with quantifiers in the previous sections. The following sections cover a few more building blocks that are often combined with quantifiers: grouping with parentheses, the dot, and alternation with the pipe.
Parentheses (())
Parentheses group parts of a regex so that you can apply quantifiers or other operations to the whole group. For example, to redact a text that has multiple instances of the word "kill", you can use parentheses with the ReplaceAll method in Go:
package main
import (
"fmt"
"regexp"
)
func main() {
text := `The detective carefully examined the crime scene. It was a gruesome sight—the victim had been brutally killed. The evidence pointed towards a professional hitman as the perpetrator. The investigators were determined to solve the case and bring the killer to justice. The motive behind the killing remained unclear, but the police were determined to uncover the truth.`
pattern := regexp.MustCompile(`(kill)+`)
modStr := pattern.ReplaceAll([]byte(text), []byte("---"))
fmt.Println(string(modStr))
}
The code above replaces all instances of the word kill in the given text with --- and returns it like this:
The detective carefully examined the crime scene. It was a gruesome sight—the victim had been brutally ---ed. The evidence pointed towards a professional hitman as the perpetrator. The investigators were determined to solve the case and bring the ---er to justice. The motive behind the ---ing remained unclear, but the police were determined to uncover the truth.
Dot (.)
The dot (.) matches any single character except a newline in a specific position. For example, you can use it to match any three-character sequence that starts with the letter h and ends with a t:
package main
import (
"fmt"
"regexp"
)
func main() {
pattern := regexp.MustCompile(`h.t`)
fmt.Println(pattern.MatchString("hat")) // true
fmt.Println(pattern.MatchString("hot")) // true
fmt.Println(pattern.MatchString("bit")) // false
}
Pipe (|)
The | character, known as alternation, lets you match one of several patterns: the match succeeds if the pattern on either its left or its right side matches. For example, you can match either cat or dog in a given text:
package main
import (
"fmt"
"regexp"
)
func main() {
pattern := regexp.MustCompile(`dog|cat`)
fmt.Println(pattern.MatchString("dog")) // true
fmt.Println(pattern.MatchString("cat")) // true
fmt.Println(pattern.MatchString("bird")) // false
}
We’ll look at how to escape special characters in the next section.
Escaping
Understanding how to escape special characters is essential to avoid errors when using regular expressions. For example, if you want to match the dot (.) character in a string, you need to escape it using the backslash (\) character:
package main
import (
"fmt"
"regexp"
)
func main() {
re := regexp.MustCompile(`\d+\.\d+`)
fmt.Println(re.MatchString("3.14")) // true
fmt.Println(re.MatchString("42")) // false
}
The code uses a \ in the pattern to avoid the special behavior of the . in the pattern.
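When the text to match comes from a variable rather than being typed by hand, escaping each special character yourself is error-prone. The sketch below uses regexp.QuoteMeta, which returns a string with all regular expression metacharacters escaped. The containsLiteral helper is a hypothetical name for this example; for a plain substring check, strings.Contains would normally be simpler, so this approach is mostly useful when the literal is embedded in a larger pattern.

```go
package main

import (
	"fmt"
	"regexp"
)

// containsLiteral reports whether text contains literal exactly as written,
// treating any special characters in literal as ordinary characters.
func containsLiteral(text, literal string) bool {
	re := regexp.MustCompile(regexp.QuoteMeta(literal))
	return re.MatchString(text)
}

func main() {
	fmt.Println(regexp.QuoteMeta("3.14"))  // prints 3\.14
	fmt.Println(regexp.QuoteMeta("1+1=2")) // prints 1\+1=2

	// Unescaped, the + is a quantifier, so the pattern does not match the text.
	fmt.Println(regexp.MustCompile("1+1=2").MatchString("1+1=2")) // prints false

	// Escaped, every character is matched literally.
	fmt.Println(containsLiteral("1+1=2", "1+1=2")) // prints true
}
```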
Anchors
Anchors are used to specify the position of a match within a string. The two most common anchors are ^ (caret) for the start of a line/string and $ (dollar) for the end of a line/string.
For example, you can match sentences that start with the word 'start' and end with the word 'end':
package main
import (
"fmt"
"regexp"
)
func main() {
pattern := regexp.MustCompile(`^start.*end$`)
fmt.Println(pattern.MatchString("start at the end")) // true
fmt.Println(pattern.MatchString("start at the bottom")) // false
}
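Anchors matter most when a pattern is used for validation, because MatchString succeeds if the pattern matches anywhere in the input. The following sketch contrasts an unanchored and an anchored version of the same pattern; the isNumeric helper and the variable names are assumptions made for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	anyDigits  = regexp.MustCompile(`\d+`)   // matches digits anywhere in the input
	onlyDigits = regexp.MustCompile(`^\d+$`) // the whole input must be digits
)

// isNumeric is a hypothetical validator built on the anchored pattern.
func isNumeric(s string) bool {
	return onlyDigits.MatchString(s)
}

func main() {
	fmt.Println(anyDigits.MatchString("abc123")) // prints true
	fmt.Println(isNumeric("abc123"))             // prints false
	fmt.Println(isNumeric("12345"))              // prints true
}
```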
Word boundaries
Word boundaries, written as \b, match positions between word characters (alphanumeric and underscore) and non-word characters. For example, to find every word that starts with 'bet', regardless of case, you can use a word boundary:
package main
import (
"fmt"
"regexp"
)
func main() {
pattern := regexp.MustCompile(`(?i)\bbet`)
fmt.Println(pattern.FindAllString("Betty's better bet was to buy the blue blouse.", -1)) // [Bet bet bet]
}
You can do the opposite by putting the word boundary at the end of the pattern:
package main
import (
"fmt"
"regexp"
)
func main() {
re := regexp.MustCompile(`(?i)sion\b`)
fmt.Println(re.FindAllString("After much discussion, the team came to a consensus on the vision for the project.", -1)) // [sion sion]
}
The code above returns instances of words that end in "sion" in the string.
There's more. You can also match exact words by adding word boundaries to the start and end of the pattern:
package main
import (
"fmt"
"regexp"
)
func main() {
re := regexp.MustCompile(`(?i)\bcat\b`)
fmt.Println(re.FindAllString("cat is in the feline category of animals", -1)) // [cat]
}
The code above matches cat only where it appears as a whole word, so the cat inside category is ignored.
Grouping
Grouping is used to create sub-patterns within a regular expression by enclosing them in parentheses (). This is useful for applying quantifiers or alternations to a group of characters. For example, you can combine several alternations in a single pattern:
package main
import (
"fmt"
"regexp"
)
func main() {
pattern := regexp.MustCompile(`(quick|lazy) (brown|gray) (fox|dog)`)
text := "The lazy gray dog jumps over the lazy fox"
fmt.Println(pattern.FindAllString(text, -1)) // returns [lazy gray dog]
}
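The code above matches one alternative from each of the three groups, in order and separated by spaces. The trailing lazy fox is not matched because the pattern requires a color between the two words.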
Capturing groups
Capturing groups are used to group and capture parts of a pattern in a regular expression. Capturing groups are enclosed within parentheses, instructing the regular expression engine to remember the matched content within the group. For example, you can use capturing groups to rearrange an incorrect date format:
package main
import (
"fmt"
"regexp"
)
func main() {
pattern := regexp.MustCompile(`(\d{2})-(\d{2})-(\d{4})`)
modifiedDate := pattern.ReplaceAllString("Date: 12-31-2022", "$2/$1/$3")
fmt.Println(modifiedDate)
}
The code above rearranges the given date from a month-day-year format to a day/month/year format using the captured matches and returns the new date:
Date: 31/12/2022
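Numbered references like $1 can become hard to read as patterns grow. As an alternative sketch, the same rearrangement can be written with named capturing groups, using the (?P<name>...) syntax and ${name} in the replacement. The toDayFirst and yearOf helpers are hypothetical, and SubexpIndex assumes Go 1.15 or later.

```go
package main

import (
	"fmt"
	"regexp"
)

var datePattern = regexp.MustCompile(`(?P<month>\d{2})-(?P<day>\d{2})-(?P<year>\d{4})`)

// toDayFirst rewrites month-day-year dates as day/month/year.
// The ${name} form avoids ambiguity: $1x would be read as a group named "1x".
func toDayFirst(s string) string {
	return datePattern.ReplaceAllString(s, "${day}/${month}/${year}")
}

// yearOf returns the year of the first date found in s, or "" if none.
func yearOf(s string) string {
	m := datePattern.FindStringSubmatch(s)
	if m == nil {
		return ""
	}
	return m[datePattern.SubexpIndex("year")]
}

func main() {
	fmt.Println(toDayFirst("Date: 12-31-2022")) // prints Date: 31/12/2022
	fmt.Println(yearOf("Date: 12-31-2022"))     // prints 2022
}
```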
Matching Unicode in regular expressions
Text is not limited to ASCII, and handling Unicode correctly matters if your application serves users who write in other languages and scripts.
Matching Unicode text with regular expressions in Go is done with the Unicode character classes defined in the RE2 syntax specification, which can be selected by script name. For example, to match Chinese characters, you need to use the Han script:
package main
import (
"fmt"
"regexp"
)
func main() {
text := "Hello world! 世界你好!"
pattern := regexp.MustCompile(`\p{Han}`)
fmt.Println(pattern.FindAllString(text, -1))
}
The code above returns the Chinese characters in the given string in the terminal:
[世 界 你 好]
Note: The complete list of supported Unicode scripts can be found in the RE2 syntax specification.
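Script names are not the only Unicode classes available. As a short sketch, the example below compares \w, which only covers ASCII word characters in Go, with the general category \p{L} (any letter) and the Cyrillic script class. The findAll helper and the sample text are made up for this illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// findAll compiles the pattern and returns every match in the text.
func findAll(pattern, text string) []string {
	return regexp.MustCompile(pattern).FindAllString(text, -1)
}

func main() {
	text := "Hello, мир! 世界"

	// \w is ASCII-only, so non-Latin words are skipped.
	fmt.Println(findAll(`\w+`, text)) // prints [Hello]

	// \p{L} matches letters from any script.
	fmt.Println(findAll(`\p{L}+`, text)) // prints [Hello мир 世界]

	// A script class narrows the match to one writing system.
	fmt.Println(findAll(`\p{Cyrillic}+`, text)) // prints [мир]
}
```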
Handling regex tasks in Go
Now that you understand the regular expression concepts needed to create complex patterns, how the Go regexp package helps you to match and work with them, and most importantly, how to use the RE2 syntax, you are ready to start working with regular expressions in Go applications.
In this section, we will explore some of the most common use cases of regular expressions in software development and how to solve them with the help of the Go regexp package.
Form validation
One of the most common use cases of regular expressions in software development is form validation. For example, you can quickly validate email addresses before allowing a user to sign up on your application:
package main
import (
"fmt"
"regexp"
)
func main() {
pattern := `^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$`
emailRegex := regexp.MustCompile(pattern)
email := "user@example.com"
if emailRegex.MatchString(email) {
fmt.Println("Valid email address")
} else {
fmt.Println("Invalid email address")
}
}
The code above uses a regular expression pattern with the MatchString method to check whether the input looks like a valid email address before printing a message based on that.
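In an application that validates many addresses, one common arrangement is to compile the pattern once at package level and reuse it, since a compiled Regexp is safe for concurrent use. The sketch below assumes the same pattern as above; the isValidEmail helper and the sample inputs are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// emailRegex is compiled once and reused by every call to isValidEmail.
var emailRegex = regexp.MustCompile(`^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$`)

// isValidEmail reports whether s matches the pattern from the example above.
func isValidEmail(s string) bool {
	return emailRegex.MatchString(s)
}

func main() {
	inputs := []string{
		"user@example.com",
		"user@example",
		"@example.com",
		"user name@example.com",
	}
	for _, in := range inputs {
		fmt.Printf("%q valid: %v\n", in, isValidEmail(in))
	}
}
```

Only the first input passes. The others fail because of the missing top-level domain, the empty local part, and the space, respectively.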
Extract specific data from the text
Similar to the use case above, you can use regular expressions to extract phone numbers from a user’s bio to automatically fill the phone number field:
package main
import (
"fmt"
"regexp"
)
func main() {
phonePattern := `(\d{3})-(\d{3})-(\d{4})`
phoneRegex := regexp.MustCompile(phonePattern)
text := "Contact us at 123-456-7890"
matches := phoneRegex.FindStringSubmatch(text)
if len(matches) >= 4 {
areaCode, prefix, number := matches[1], matches[2], matches[3]
formattedNumber := fmt.Sprintf("(%s) %s-%s", areaCode, prefix, number)
fmt.Println("Formatted Phone Number:", formattedNumber) // Formatted Phone Number: (123) 456-7890
} else {
fmt.Println("No phone number found")
}
}
The code above extracts the phone number from the given text and formats it properly for further use.
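If a text may contain more than one phone number, FindAllStringSubmatch returns the captured groups for every match instead of only the first. Here is a minimal sketch under that assumption; the formatPhones helper and the sample text are made up for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

var phoneRegex = regexp.MustCompile(`(\d{3})-(\d{3})-(\d{4})`)

// formatPhones finds every phone number in text and reformats each one.
func formatPhones(text string) []string {
	var formatted []string
	// Each element m holds the full match in m[0] and the groups in m[1:].
	for _, m := range phoneRegex.FindAllStringSubmatch(text, -1) {
		formatted = append(formatted, fmt.Sprintf("(%s) %s-%s", m[1], m[2], m[3]))
	}
	return formatted
}

func main() {
	text := "Call 123-456-7890 or 987-654-3210."
	for _, p := range formatPhones(text) {
		fmt.Println(p)
	}
}
```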
Masking credit card numbers
Another common use case of regular expressions is masking users' credit card numbers before they're displayed on the app. Here is an example of how to do this in Go:
package main
import (
"fmt"
"regexp"
)
func main() {
cardNumber := "1234-5678-9012-3456"
re := regexp.MustCompile(`(\d{4})-(\d{4})-(\d{4})-(\d{4})`)
masked := re.ReplaceAllString(cardNumber, "$1-****-****-****") // returns 1234-****-****-****
fmt.Println(masked)
}
The code above only shows the first four digits of the card number and masks the rest by replacing them with '****'.
HTML tag cleanup
Allowing users to input HTML tags into your application is risky, as it is a common vector for attacks such as cross-site scripting (XSS). As a simple first pass, you can strip HTML tags from user input with a regular expression before processing the data further:
package main
import (
"fmt"
"regexp"
)
func main() {
html := "<p>Hello <strong>world</strong></p>"
re := regexp.MustCompile(`<[^>]*>`)
cleaned := re.ReplaceAllString(html, "")
fmt.Println(cleaned)
}
The code above removes all the HTML syntax from the string before processing it.
Extracting hex color codes
If you are building an application for designers, you might want to create a brilliant helper that detects and extracts hex color codes from their description to help document the design they’re talking about. Here is an example of how to do this in Go:
package main
import (
"fmt"
"regexp"
)
func main() {
css := "The colors used in this project are #FFFFFF, #000000, and #FF0000."
re := regexp.MustCompile(`#[0-9A-Fa-f]{6}`)
colors := re.FindAllString(css, -1)
fmt.Println(colors) // prints [#FFFFFF #000000 #FF0000]
}
The code above extracts the hex color codes detected in the text.
Extract hashtags
A similar use case to the above is extracting hashtags from a given text. You can do this easily in Go:
package main
import (
"fmt"
"regexp"
)
func main() {
text := "Exciting times ahead! #NewProject #LaunchDay"
re := regexp.MustCompile(`#\w+`)
hashtags := re.FindAllString(text, -1)
fmt.Println(hashtags) // prints [#NewProject #LaunchDay]
}
The code above extracts and returns hashtags from the given text.
Conclusion
Regular expressions are arguably one of the most essential concepts in programming for developers to understand, as they allow you to perform complex string manipulation tasks easily using patterns.
In this article, you’ve learned about regular expressions and the core concepts of the RE2 syntax that Go uses, including flags, character sets, ranges, and repetition.
Whew! That was a long one! Thank you so much for reading. I hope this article has achieved its aim of demystifying regular expressions and how you can use them with the Go regexp package to perform different string manipulation tasks in Go applications.