What absolutely every Programmer should know about regular expressions

No, I am not going into theoretical definitions. I am going to talk about regexes as they are used in today's programming languages and applications.

I am surprised how many programmers think regexes are complicated and ask, e.g. on Stack Overflow, for a regex for a specific task. They will most probably get an answer if they ask nicely. But most of the time their specifications are incomplete or wrong, so they end up with a regex that works for some examples but not for all of their real data. When they notice that, they have no clue why it is not working for that case or how to fix it. Yet it is not difficult to learn the basics and to understand at least basic regexes.

Here I want to explain the absolutely necessary basics for writing and understanding simple regexes, so that you can use them more efficiently, search for help with the correct vocabulary, or at least read them.

Not all features that I explain here are available in all regex flavours; the only solution then is to check the documentation of the flavour you use. A good source of regex information is regular-expressions.info. There is also a feature list for a lot of different regex flavours.

Regular Expressions

The first thing to know is that a regex describes a pattern of characters. This enables you to find that pattern inside a text. A very simple pattern would be

/Foo/ will find “Foo”, “Foobar”, “Foooo” and “BarFoo”. It is case sensitive, so it will not find “foo”!

The slashes around the pattern do not belong to it; they are the regex delimiters. That's Perl style; how a regex is denoted correctly depends on the language.
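
As a small sketch of how this looks without delimiters, here is the same pattern in Python, where the pattern is passed as a plain string to the standard re module. Python is only chosen for illustration, and the helper name finds is made up for this example.

```python
import re

def finds(pattern, text):
    """Return True if the pattern matches anywhere inside text."""
    return re.search(pattern, text) is not None

# The pattern is an ordinary raw string; no slashes are needed.
for text in ["Foo", "Foobar", "Foooo", "BarFoo", "foo"]:
    print(text, finds(r"Foo", text))
```

Only the last line of output should show False, because the match is case sensitive.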

Metacharacters

Now, there are some characters that have a special meaning in a regex. They are often called “Metacharacters”. Those are

\ [ ^ $ . | ( ) ? * +

If you want to match one of those characters literally, you have to escape it with the special character \ (the backslash).

The . is a very special character: it matches any single character except newline characters.

/F.o/ will find “Foo“, “Fxobar”, “F&ooo”

/F\.o/ will find “F.o”, “F.oooo”, but not “Foo”, “Fxobar”, “F&ooo”
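
The difference between the plain and the escaped dot can be checked with a few lines of Python. This is only a sketch: the helper first_match is a made-up name, while re.escape is a standard library function that adds the backslashes to a literal string for you.

```python
import re

def first_match(pattern, text):
    """Return the matched substring, or None if there is no match."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

# The unescaped dot accepts any single character except a newline.
print(first_match(r"F.o", "F&ooo"))    # F&o
print(first_match(r"F.o", "F\no"))     # None

# The escaped dot only accepts a literal ".".
print(first_match(r"F\.o", "F.oooo"))  # F.o
print(first_match(r"F\.o", "Fxobar"))  # None

# re.escape escapes the metacharacters of a literal string.
literal = re.escape("1+1=2?")
print(first_match(literal, "Is 1+1=2? Yes."))  # 1+1=2?
```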

Quantifiers

You can repeat the preceding character or group by using a quantifier. The general form is

{x,y} where x is the minimum number of occurrences and y is the maximum. If x equals y, write only {x}; if there should be no upper limit, leave y empty: {x,}. So

/o{2}/ will find the “oo” in “Fooo”

For convenience there are some shortcuts:

? is {0,1} and means match 0 or 1 times; it makes the previous character or group optional

+ is {1,} and means match 1 or more times

* is {0,} and means match 0 or more times

/Fo+/ will find “Foo”, “Foobar”, “Foooo” and “BarFoo”

/Fo+b?/ will find “Foo”, “Foobar”, “Foooo” and “BarFoo”
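
To see what each quantifier actually consumes, it helps to print the matched text instead of only asking whether there is a match. The following Python lines are a sketch with a made-up helper name, first_match.

```python
import re

def first_match(pattern, text):
    """Return the matched substring, or None if there is no match."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

print(first_match(r"o{2}", "Fooo"))      # oo
print(first_match(r"Fo+", "Foooo"))      # Foooo
print(first_match(r"Fo+b?", "Foobar"))   # Foob  (the optional "b" is present)
print(first_match(r"Fo+b?", "BarFoo"))   # Foo   (the optional "b" is absent)
print(first_match(r"Fo*", "Fxobar"))     # F     (zero "o" is allowed)
print(first_match(r"Fo{2,3}", "Foooo"))  # Fooo  (at most three "o")
print(first_match(r"Fo+", "Fxobar"))     # None  (at least one "o" is required)
```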

Character Classes

You can also define your own set of characters for a position where more than one character is allowed, but where . would match too many.

/F[ox]o/ will find “Foo“, “Fxobar”, but not “F&ooo”

You can put as many characters inside such a class as you want, but [abcdne] still matches only one character (out of that class). If you want to match more, you need to put a quantifier after the class.

Metacharacters inside a character class lose their special meaning. So

/Fo[+]/ will match “Fo+”

But other characters get a special meaning, or change their meaning, inside a character class. I haven’t told you the meaning of ^ until now, but it is different inside a character class, at least when it is the first character. [^o] is a negated character class; this construct matches any character except “o”.

/F[^o]+/ will match the “Fx” in “Fxo”, but will not match “Foo”

- creates a range in a character class. [a-m] matches every character in the ASCII table from “a” to “m”.

/F[a-q]+/ would match “Foo”, “Foobar”, “Foooo” and “BarFoo”

So please, if you want to add a literal dash “-” to your character class, escape it (or put it as the first or last character in the class), otherwise it will define a range and match much more than you want.
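
The character class rules can be tried out in Python as follows. This is an illustrative sketch; first_match is a made-up helper, and the sample strings for the dash are invented.

```python
import re

def first_match(pattern, text):
    """Return the matched substring, or None if there is no match."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

print(first_match(r"F[ox]o", "Fxobar"))   # Fxo
print(first_match(r"F[ox]o", "F&ooo"))    # None
print(first_match(r"Fo[+]", "Fo+"))       # Fo+   ("+" is literal inside the class)
print(first_match(r"F[^o]+", "Fxo"))      # Fx    (negated class stops at "o")
print(first_match(r"F[^o]+", "Foo"))      # None
print(first_match(r"F[a-q]+", "Foobar"))  # Fooba ("r" is outside the range)

# An unescaped dash between two characters defines a range ...
print(re.findall(r"[a-z]", "a-m-z"))      # ['a', 'm', 'z']
# ... while an escaped dash is just a dash.
print(re.findall(r"[a\-z]", "a-m-z"))     # ['a', '-', '-', 'z']
```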

There are some predefined classes for your convenience:

\w is a word character, that means letters, digits and the underscore. What counts as a letter depends on your language: either only the ASCII letters (the worst case) or all Unicode code points with the property letter.

\d is a digit

\s is a whitespace character, e.g. space, tab and newlines.

If the letter is uppercase, it is the negated form of that class, e.g. the negated form of \w is \W.
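
A short Python sketch of the predefined classes, using invented sample strings. In Python 3, \w is Unicode-aware for text patterns by default, and the re.ASCII flag restricts it to ASCII; other languages may behave differently, so check your flavour.

```python
import re

text = "Order 66: 3 x item_7\tdone"

digits = re.findall(r"\d+", text)
words = re.findall(r"\w+", text)
spaces = re.findall(r"\s", text)
non_words = re.findall(r"\W+", text)

print(digits)     # ['66', '3', '7']
print(words)      # ['Order', '66', '3', 'x', 'item_7', 'done']
print(spaces)     # [' ', ' ', ' ', ' ', '\t']
print(non_words)  # [' ', ': ', ' ', ' ', '\t']

# "na\u00efve caf\u00e9" contains two non-ASCII letters.
accented = "na\u00efve caf\u00e9"
unicode_words = re.findall(r"\w+", accented)          # two whole words
ascii_words = re.findall(r"\w+", accented, re.ASCII)  # words break at the accents
print(len(unicode_words), ascii_words)                # 2 ['na', 've', 'caf']
```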

Groups

You can group things together by using parentheses (). By default such a group is a capturing group. That means it stores the text matched by that part of the pattern, which can then be accessed inside the pattern by using backreferences. In my experience, capturing costs a bit of time, so if you don’t need that partial result, use a non-capturing group. A group whose content starts with a ? is a group with a special meaning, and most of these do not capture. The plain non-capturing group is (?:pattern).

/F(?:oo){2}/ would match “Foooo“, but not “Foo”

/F(oo)\1/ would match “Foooo”, but not “Foo”. \1 is a backreference to the text matched inside the parentheses. So this requires “oo” to be matched inside the parentheses, and then two more “o” are needed because of the backreference.
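
The difference between capturing and non-capturing groups becomes visible when you inspect the match object. The Python lines below are a sketch; the digit examples at the end are invented to show that a backreference repeats the matched text, not the pattern.

```python
import re

# Capturing group: the text matched by (oo) is stored as group 1.
capturing = re.search(r"F(oo)\1", "Foooo")
print(capturing.group(0))  # Foooo
print(capturing.group(1))  # oo

# Non-capturing group: same overall match, but nothing is stored.
non_capturing = re.search(r"F(?:oo){2}", "Foooo")
print(non_capturing.group(0))  # Foooo
print(non_capturing.groups())  # ()

# Too short for the group plus its backreference.
print(re.search(r"F(oo)\1", "Foo"))  # None

# (\d)\1 needs the same digit twice, not just any two digits.
print(re.search(r"(\d)\1", "1223").group(0))  # 22
print(re.search(r"(\d)\1", "1234"))           # None
```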

Alternations

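Another important construct is the alternation, written with the | character. It matches either the pattern on its left or the pattern on its right.

/Foo|Bar/ would match “Foo” or “Bar”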
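
A brief Python sketch of alternation with invented sample strings. It also shows one thing that is easy to trip over: the | splits the whole pattern unless you limit it with a group.

```python
import re

both = re.findall(r"Foo|Bar", "FooBar Baz Foo")
print(both)  # ['Foo', 'Bar', 'Foo']

# Without a group this reads as: "Foo" or "Bar!".
ungrouped = re.findall(r"Foo|Bar!", "Foo! Bar!")
print(ungrouped)  # ['Foo', 'Bar!']

# With a non-capturing group this reads as: ("Foo" or "Bar") followed by "!".
grouped = re.findall(r"(?:Foo|Bar)!", "Foo! Bar!")
print(grouped)  # ['Foo!', 'Bar!']
```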

Anchors

As the last building block of a pattern I want to talk about anchors. Anchors are zero-width assertions. That means they don’t match a character, they match a position. Anchors are important for defining where a pattern should match. There are three important anchors:

^ matches the start of the string

$ matches the end of the string

\b matches a word boundary. A word boundary is a position with a \w character on one side and either a \W character or the start or end of the string on the other side.

/^Foo/ matches “Foo”, “Foo bar text”, but not “This is a Foo text”

/\bFoo\b/ matches “Foo”, “Foo bar text”, “This is a Foo text”, but not “Foobar”

/Foo$/ matches “Foo”, “This is Foo” but not “Foo bar text”
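
The anchor examples can be verified with a few lines of Python. This is a sketch with a made-up helper, finds; the last part lists the positions where \b matches to show that an anchor consumes no characters.

```python
import re

def finds(pattern, text):
    """Return True if the pattern matches anywhere inside text."""
    return re.search(pattern, text) is not None

print(finds(r"^Foo", "Foo bar text"))           # True
print(finds(r"^Foo", "This is a Foo text"))     # False
print(finds(r"\bFoo\b", "This is a Foo text"))  # True
print(finds(r"\bFoo\b", "Foobar"))              # False
print(finds(r"Foo$", "This is Foo"))            # True
print(finds(r"Foo$", "Foo bar text"))           # False

# Every \b match is empty; only its position is of interest.
boundaries = [m.start() for m in re.finditer(r"\b", "Foo bar")]
print(boundaries)  # [0, 3, 4, 7]
```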

Options

The matching behaviour of the regex can be modified by options or modifiers.

i makes the pattern match letters case-insensitively. /a/i would match “a” and “A”.

m is the multiline modifier. It changes the behaviour of the ^ and $ anchors so that they match at the start and end of each line instead of only at the start and end of the string. The $ anchor will then match before a \n character.

s is the single-line modifier. It changes the behaviour of the dot . so that it also matches newline characters.
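
In Python the modifiers are not appended to the pattern but passed as flags: re.IGNORECASE for i, re.MULTILINE for m and re.DOTALL for s. The sketch below uses an invented three-line string; other languages spell these options differently.

```python
import re

text = "foo\nFoo\nbar"

# i: case-insensitive matching
ignore_case = re.findall(r"foo", text, re.IGNORECASE)
print(ignore_case)  # ['foo', 'Foo']

# m: ^ and $ match at every line, not only at the ends of the string
whole_string = re.findall(r"^\w+$", text)
per_line = re.findall(r"^\w+$", text, re.MULTILINE)
print(whole_string)  # []
print(per_line)      # ['foo', 'Foo', 'bar']

# s: the dot also matches the newline
without_s = re.search(r"foo.Foo", text)
with_s = re.search(r"foo.Foo", text, re.DOTALL)
print(without_s)                     # None
print(with_s.group(0) == "foo\nFoo")  # True

# Many flavours, Python included, also accept inline modifiers.
inline = re.findall(r"(?i)foo", text)
print(inline)  # ['foo', 'Foo']
```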

The End

Of course these are really only the very basics and I have not told you everything, but this leaves me some more things to write about in the future.

Thank you if you have read this far; my first blog post got a bit longer than I expected. I hope this post helped someone, at least a little bit. Please tell me what you think, or leave a comment if you found a mistake somewhere.

One comment

  1. The part about quantifiers is a bit short, therefore I wrote another blog post about the details of quantifiers: You do know Quantifiers. Really?