Regular Expressions in R

A regular expression (regex) is a sequence of characters, some of them with special meaning, that defines a search pattern for text. Regular expressions serve many purposes, such as identifying sequences of numbers, formatted addresses, special strings, parts of names and so on.

On Linux and other Unix-like systems, regular expression searches are traditionally run with the grep command. R provides a function of the same name, grep(), for these tasks, as we will see in the following sections.

Components of Regular Expressions in R

A regular expression is made up of literal characters together with special characters and symbols that shape the search pattern we are looking for. It helps to learn the most commonly used symbols first. These are listed below.

  • Dot (.) – Matches any single character (some engines, such as PCRE, exclude the newline).
  • Pipe (|) – Specifies alternation: the pattern on either side of the pipe may match.
  • Square brackets [] – Match any one of the characters listed within the brackets.
  • Hyphen (-) – Specifies a character range inside square brackets, as in [a-m] or [A-Z].
  • Caret (^) – Placed first inside square brackets, excludes the listed characters; [^0-9] matches any character that is not a digit.
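The symbols above can be tried out with grepl(), which reports for each string whether the pattern matches. The following is a small sketch; the words vector is made-up sample data, not part of the original examples.

```r
# Hypothetical sample data for illustration
words <- c("cat", "cot", "cut", "dog", "c4t")

dot_match     <- grepl("c.t", words)       # TRUE TRUE TRUE FALSE TRUE
pipe_match    <- grepl("cat|dog", words)   # TRUE FALSE FALSE TRUE FALSE
bracket_match <- grepl("c[ao]t", words)    # TRUE TRUE FALSE FALSE FALSE
range_match   <- grepl("c[a-z]t", words)   # TRUE TRUE TRUE FALSE FALSE
negated_match <- grepl("c[^0-9]t", words)  # TRUE TRUE TRUE FALSE FALSE
```

Note that the dot accepts the digit in "c4t", while the range [a-z] and the negated class [^0-9] both reject it.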

Other than these, we also need symbols known as anchors to construct regular expressions. Anchors match a position, such as the beginning or end of a word or a string, rather than a character. They are:

  • Caret (^) – Matches the beginning of a string.
  • Dollar ($) – Matches the end of a string.
  • \\< – Matches the beginning of a word (written here as it is typed in an R string).
  • \\> – Matches the end of a word.
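A brief sketch of the four anchors in action, using R's default regex engine. The phrases vector is hypothetical sample data chosen so that "cat" appears at different positions.

```r
# Hypothetical sample data for illustration
phrases <- c("the cat", "concatenate", "cat food", "tomcat")

starts_string <- grepl("^cat", phrases)     # FALSE FALSE TRUE FALSE
ends_string   <- grepl("cat$", phrases)     # TRUE FALSE FALSE TRUE
starts_word   <- grepl("\\<cat", phrases)   # TRUE FALSE TRUE FALSE
ends_word     <- grepl("cat\\>", phrases)   # TRUE FALSE TRUE TRUE
```

The word anchors look at word boundaries anywhere in the string, which is why "the cat" matches \\<cat even though it does not match ^cat.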

Often you will also need to specify how many occurrences to match. For example, a telephone number might be a string containing 10 digits. The number of occurrences is specified with symbols called quantifiers. These are:

  • Star (*) – Match the preceding item zero or more times.
  • Plus (+) – Match the preceding item one or more times.
  • Question mark (?) – Match the preceding item at most once, that is, zero or one time.
  • {n} – Match the preceding item exactly n times.
  • {n,} – Match the preceding item at least n times.
  • {0,n} – Match the preceding item at most n times (some engines also accept the shorthand {,n}).
  • {n,m} – Match the preceding item at least n and at most m times.
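To see how the quantifiers differ, the sketch below applies each one to the letter b in a small made-up vector. The patterns are anchored with ^ and $ so that the whole string has to match.

```r
# Hypothetical sample data: "a", then 0 to 3 b's, then "c"
x <- c("ac", "abc", "abbc", "abbbc")

star     <- grepl("^ab*c$", x)      # zero or more b: TRUE TRUE TRUE TRUE
plus     <- grepl("^ab+c$", x)      # one or more b:  FALSE TRUE TRUE TRUE
optional <- grepl("^ab?c$", x)      # zero or one b:  TRUE TRUE FALSE FALSE
exactly2 <- grepl("^ab{2}c$", x)    # exactly two b:  FALSE FALSE TRUE FALSE
atleast2 <- grepl("^ab{2,}c$", x)   # two or more b:  FALSE FALSE TRUE TRUE
one_to_2 <- grepl("^ab{1,2}c$", x)  # one or two b:   FALSE TRUE TRUE FALSE
```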

Characters in search patterns are sometimes grouped into classes for easier reading. Each character class has a representative symbol and can be used to match a large number of characters belonging to that class. Some of these are:

  • \d – represents a digit and \D represents a non-digit.
  • \w – represents a word character (a letter, a digit or an underscore) and \W represents a non-word character.
  • [[:xdigit:]] – represents a hexadecimal digit.
  • \s – represents a whitespace character and \S represents a non-whitespace character.

Finally, when we want to match any of these special characters literally, it is necessary to escape them. For example, the dot already carries a meaning in a regex, so to match an actual dot in your string you precede it with a backslash, as in \. Inside an R string literal the backslash itself must be doubled, so this pattern is typed as "\\." and the digit class as "\\d".
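The sketch below shows character classes and escaping as they are typed in R, where every backslash in the pattern is doubled inside the string literal. The items vector is hypothetical sample data. The last line uses the fixed = TRUE argument of grepl(), which treats the pattern as a literal string and so avoids escaping altogether.

```r
# Hypothetical sample data for illustration
items <- c("v1.2", "v1x2", "a b", "x_y")

escaped_dot   <- grepl("\\d\\.\\d", items)        # TRUE FALSE FALSE FALSE
unescaped_dot <- grepl("\\d.\\d", items)          # TRUE TRUE FALSE FALSE
has_space     <- grepl("\\s", items)              # FALSE FALSE TRUE FALSE
all_word      <- grepl("^\\w+$", items)           # FALSE TRUE FALSE TRUE
literal_dot   <- grepl(".", items, fixed = TRUE)  # TRUE FALSE FALSE FALSE
```

The unescaped dot also accepts the x in "v1x2", which is exactly the mistake that escaping prevents.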

Examples of Regular Expressions in R

Now that we know the components of regular expressions in R, we can put them to use in some examples.

  • ^The – Matches strings that begin with The (including words such as Then or There).
  • boat$ – Matches strings that end with boat.
  • ^The.*boat$ – Matches strings that begin with The, continue with zero or more other characters and end with boat.
  • ab*c – Matches ac, abc, abbc, abbbc, abbbbc and so on.
  • ab+c – Matches abc, abbc, abbbc, abbbbc and so on.
  • \d{3}-\d{4} – Matches three digits followed by a hyphen and four more digits, for example 456-1223 or 222-7658.
  • a..... – Matches a lower case a followed by any five characters, such as a six-letter word starting with a.
  • ^[^anc].+ – Matches any string of length at least two that doesn’t start with a, n or c.

By combining these components, you can construct regular expressions for a wide range of patterns.
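Several of the patterns above can be checked directly in R. This is a minimal sketch with made-up sentences, phone strings and words; note the doubled backslashes needed inside R strings.

```r
# Hypothetical sample data for illustration
sentences <- c("The red boat", "The boat sank", "A small boat", "Then we left")
phones    <- c("456-1223", "2227658", "call 222-7658")
fruit     <- c("apples", "apple", "banana")

begins_the <- grepl("^The", sentences)          # TRUE TRUE FALSE TRUE
ends_boat  <- grepl("boat$", sentences)         # TRUE FALSE TRUE FALSE
the_boat   <- grepl("^The.*boat$", sentences)   # TRUE FALSE FALSE FALSE
phone_like <- grepl("\\d{3}-\\d{4}", phones)    # TRUE FALSE TRUE
six_with_a <- grepl("^a.....$", fruit)          # TRUE FALSE FALSE
```

The fourth sentence shows that ^The also accepts "Then", and the third phone string shows that an unanchored pattern matches anywhere inside a longer string.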

The grep() function in R

The grep() function matches a pattern against each element of a character vector and returns the indices of the elements that match. Its arguments control the form in which the search results are returned.

First, we create a character vector to search for the required patterns.

> strvec <- c("Beamite", "Gazelow", "Gazairy", "Pantheon", "Chimeton", "Sandite", "Zebrawl", "Barrazel", "Bellibou", "Sandapi" )

This is a vector of some randomly generated fantasy character names. Let us start using our grep function on this vector.

# Get the indices of all the names in the vector starting with B
> grep("^B",strvec)
[1] 1 8 9

To get the names rather than the indices, we just add the value=TRUE argument to the grep function.

> grep("^B",strvec, value=TRUE)
[1] "Beamite"  "Barrazel" "Bellibou"
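Two other arguments of grep() are worth a quick look: ignore.case makes the match case-insensitive, and invert returns the elements that do not match. The sketch below repeats the strvec definition so that it runs on its own.

```r
strvec <- c("Beamite", "Gazelow", "Gazairy", "Pantheon", "Chimeton",
            "Sandite", "Zebrawl", "Barrazel", "Bellibou", "Sandapi")

# Names containing b or B anywhere, not only at the start
any_b <- grep("b", strvec, ignore.case = TRUE)            # 1 7 8 9

# Names that do NOT start with B
not_b <- grep("^B", strvec, value = TRUE, invert = TRUE)
```

Without the ^ anchor, "Zebrawl" is picked up as well, because the pattern is allowed to match anywhere in the string.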

Similarly, if you wish to check whether a pattern matches each element of a string vector, you can use the grepl function instead of grep. It returns a logical vector with one TRUE or FALSE per element.

Suppose that we wish to know which of the names are exactly 7 characters long. We write the following regular expression and feed it as a pattern to the grepl function below.

> grepl("^(.){7}$", strvec )
 [1]  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE
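Because grepl() returns a logical vector, its result can be used directly for subsetting or counting. A short sketch, repeating strvec so that it is self-contained; the comparison with nchar() is only a sanity check and not part of the original example.

```r
strvec <- c("Beamite", "Gazelow", "Gazairy", "Pantheon", "Chimeton",
            "Sandite", "Zebrawl", "Barrazel", "Bellibou", "Sandapi")

is_seven <- grepl("^.{7}$", strvec)

seven_names <- strvec[is_seven]   # keep only the 7-character names
n_seven     <- sum(is_seven)      # TRUE counts as 1, so this is 6

# Sanity check: the regex agrees with counting characters directly
agrees <- identical(is_seven, nchar(strvec) == 7)
```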

Other Functions to Handle Regular Expressions in R

If you would like to locate the pattern within the string, use the regexpr() function. It returns the starting position of the first match in each element, together with the length of that match.

> regexpr("ite$",strvec)
 [1]  5 -1 -1 -1 -1  5 -1 -1 -1 -1
attr(,"match.length")
 [1]  3 -1 -1 -1 -1  3 -1 -1 -1 -1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

We are searching for all the strings ending in “ite” here. The first and sixth elements of our string vector end with it, and in both of them “ite” starts at the fifth character. Therefore, the position is reported as 5 for each of these.

If the pattern doesn’t occur, the function returns -1. The match.length attribute of the result gives the length of the matched text, which is 3 in the present case.
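The position and length reported by regexpr() are enough to cut the matched text out by hand with substring(). The following sketch assumes the same strvec as above and repeats its definition to stay self-contained.

```r
strvec <- c("Beamite", "Gazelow", "Gazairy", "Pantheon", "Chimeton",
            "Sandite", "Zebrawl", "Barrazel", "Bellibou", "Sandapi")

m <- regexpr("ite$", strvec)

hit   <- m != -1                          # elements where the pattern occurs
start <- as.integer(m)[hit]               # 5 5
len   <- attr(m, "match.length")[hit]     # 3 3

# Extract from the start position to start + length - 1
matched <- substring(strvec[hit], start, start + len - 1)   # "ite" "ite"
```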

If you wish to retrieve the matched substring instead of its position, the function to use is regmatches().

> regmatches(strvec,regexpr("S.*", strvec) )
[1] "Sandite" "Sandapi"

This function takes the match positions computed by regexpr() and extracts the substrings that match the expression. Elements without a match are dropped from the result.
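regexpr() only reports the first match in each element. Base R also provides gregexpr(), which finds every match; combined with regmatches() it returns a list with one character vector per input element. A small sketch with two of the names:

```r
# Two names from the vector used above
x <- c("Bellibou", "Gazelow")

# Every run of one or more lower case vowels in each name
vowel_runs <- regmatches(x, gregexpr("[aeiou]+", x))

first_name_runs  <- vowel_runs[[1]]   # "e" "i" "ou"
second_name_runs <- vowel_runs[[2]]   # "a" "e" "o"
```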

Finally, if you wish to match a pattern and replace the matched text with a new string, sub() is the function to go for. Suppose that we wish to change B’s to D’s in our string vector. We can do this in the following manner.

> sub("[Bb]","D",strvec)
 [1] "Deamite"  "Gazelow"  "Gazairy"  "Pantheon" "Chimeton" "Sandite"  "ZeDrawl" 
 [8] "Darrazel" "Dellibou" "Sandapi"

Observe that the sub() function only replaces the first occurrence of the pattern in each string. The string “Bellibou” has become “Dellibou”, with its second b left unchanged. To replace all the occurrences of the letter, we use the gsub() function instead, g meaning global substitution.

> gsub("[Bb]","D",strvec)
 [1] "Deamite"  "Gazelow"  "Gazairy"  "Pantheon" "Chimeton" "Sandite"  "ZeDrawl" 
 [8] "Darrazel" "DelliDou" "Sandapi"
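The difference between the two functions is easiest to see on a single string with several matches. In this sketch the input string is made up for illustration; the last line shows that the ignore.case argument can stand in for the [Bb] character class.

```r
# Hypothetical input with several b's
s <- "Bellibou babble"

first_only <- sub("[Bb]", "D", s)                   # "Dellibou babble"
all_of_them <- gsub("[Bb]", "D", s)                 # "DelliDou DaDDle"
same_result <- gsub("b", "D", s, ignore.case = TRUE)
```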

Conclusion

Regular expression support is an important feature of a programming language. The operations that R offers for regular expressions greatly ease data preprocessing tasks, where string handling is often not easy. The grep function, combined with a string processing package such as stringr, helps programmers a lot.
