Perl - Regular expressions

From LXF Wiki

Perl Tutorial part 2

(Original version written by Marco Fioretti for Linux Format magazine issue 70.)


Scared by abstruse Perl operators and regular expressions? Check them again...


In this second leg of our journey through Perldom we'll see how to manipulate Perl's more complex variables, namely arrays and hashes, and then introduce the real black magic of Perl: its regular expressions.

How to ruin your life with Regular Expressions

Perl probably has the most complete regular expression set of any computer language. To see with your own eyes how perverse, er, powerful, they can be, check out the longest regular expression ever seen by humans at www.ex-parrot.com/~pdw/Mail-RFC822-Address.html. Rumour has it that it validates email addresses, but don't look at it for too long. To write your own instead, the ultimate reference is “Mastering Regular Expressions” by J. Friedl (www.oreilly.com/catalog/regex2/).

Everything with arrays and lists

Perl list literals, once discovered, are hard to ignore. They are unnamed, ordered sequences of scalars enclosed in parentheses, which work similarly to arrays. Imagine you want to assign values to two or more variables in a single instruction, or swap them in any way. This is the way to do it with list literals:

($X,$Y,$Z) = ($Y,$Z,$X);  # Circular shift
($Name,$Surname,$Phone) = ('John','Smith',5556791);
($DARTH_VADER,@JEDI) = ('Anakin Skywalker', 'Yoda', 'Obi-Wan', 'Mace Windu');

The first two lines are pretty self-explanatory. The last one combines in a single list one scalar ($DARTH_VADER) and one named array (@JEDI). Everybody knows what happens if we assign to it the list on the right side: to the Dark Side of the Force young Anakin goes alone. $DARTH_VADER, being a scalar, can only hold one value. Since the rest of the left side list holds nothing but one array, @JEDI, all the other knights in the right side list go there, in the same order. Let's now define some planets:

@STAR_WARS_PLANETS = ('Naboo', 'Tatooine', 'Geonosis');

and then add Coruscant and Alderaan right after Tatooine:

splice (@STAR_WARS_PLANETS, 2,0, ('Coruscant', 'Alderaan'));

The splice() function deletes, adds or replaces elements inside an array. Its first argument is the array itself. Next comes the index (starting from zero!) from which we wish to splice. The third parameter is how many elements must be removed. Here I don't want to remove anything, just add more stuff, so I write zero. The last, optional argument is the list to be added at the position previously specified. When it is missing, the effect of splice is simply to remove elements.
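To see the three uses of splice side by side, here is a small sketch. The planet 'Kamino' and the arrays @REPLACED and @REMOVED are made up for this illustration; the rest follows the example above.

```perl
use strict;
use warnings;

my @STAR_WARS_PLANETS = ('Naboo', 'Tatooine', 'Geonosis');

# Insert two planets at index 2 without removing anything.
splice(@STAR_WARS_PLANETS, 2, 0, 'Coruscant', 'Alderaan');
# Now: Naboo, Tatooine, Coruscant, Alderaan, Geonosis

# Replace one element at index 0; splice returns what it removed.
my @REPLACED = splice(@STAR_WARS_PLANETS, 0, 1, 'Kamino');

# No list argument: just remove two elements starting at index 1.
my @REMOVED = splice(@STAR_WARS_PLANETS, 1, 2);

print join(', ', @STAR_WARS_PLANETS), "\n";   # prints "Kamino, Alderaan, Geonosis"
print join(', ', @REMOVED), "\n";             # prints "Tatooine, Coruscant"
```

Note that splice changes the array in place, so each call works on whatever the previous one left behind.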

Perl has a sort function which, by default, treats all elements as strings, even when they are numbers, and compares them character by character. Digits therefore come before letters, and 180 comes before 3. If you type the following code at the prompt:

perl -e "@A_LIST = ('Dominions', 180, 3, '10, Downing St.','Admiralty'); print join( \"\n\", sort @A_LIST), \"\n\";"

what you'll get is:

10, Downing St.
180
3
Admiralty
Dominions

To sort by some other criterion, give sort the name of a comparison subroutine:

@SORTED_LIST = sort AS_I_WANT @UNORDERED_LIST;

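AS_I_WANT is a subroutine that compares two scalars, which sort hands to it in the special variables $a and $b, and returns a negative number, zero or a positive number (typically -1, 0 or 1) depending on which of them comes first by the desired criteria. We'll look at subroutines in the following issues.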
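As a taste of what such a subroutine can look like, here is a minimal sketch that sorts numbers as numbers. The sample values and the @DEFAULT_LIST array are invented for the example; the <=> operator is Perl's built-in numeric comparison.

```perl
use strict;
use warnings;

my @UNORDERED_LIST = (180, 3, 25, 10);

# sort puts the two elements being compared in $a and $b.
# <=> compares them as numbers and returns -1, 0 or 1.
sub AS_I_WANT { return $a <=> $b; }

my @SORTED_LIST  = sort AS_I_WANT @UNORDERED_LIST;   # numeric order
my @DEFAULT_LIST = sort @UNORDERED_LIST;             # default string order

print "@SORTED_LIST\n";    # prints "3 10 25 180"
print "@DEFAULT_LIST\n";   # prints "10 180 25 3"
```

Swapping $a and $b in the subroutine would give descending order instead.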

One last thing about arrays. There is something that no Perl hacker can live without and that can be treated like one, even if it really doesn't look like an array. Luckily, it's very easy to use. I am talking of nothing less than our beloved STDIN, the input text stream of every well behaved Unix program. We are mentioning STDIN right now because, believe it or not, it can be loaded into an array in a heartbeat:

@LINES = <STDIN>;

There. In just one instruction, you have loaded every line of input text into a separate element of @LINES. Handy, isn't it?
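Here is a sketch of what you typically do next. So that it runs without anybody typing anything, it reads from an in-memory filehandle ($INPUT, opened on the made-up string $SAMPLE_TEXT) instead of STDIN; with real input you would write <STDIN> in its place.

```perl
use strict;
use warnings;

# Stand-in for real input: a filehandle opened on a string.
my $SAMPLE_TEXT = "first line\nsecond line\nthird line\n";
open(my $INPUT, '<', \$SAMPLE_TEXT) or die "Cannot open in-memory file: $!";

my @LINES = <$INPUT>;   # list context: one element per line
close($INPUT);

chomp(@LINES);          # strip the trailing newline from every element

print scalar(@LINES), " lines read\n";   # prints "3 lines read"
print "Last line: $LINES[-1]\n";         # prints "Last line: third line"
```

Keep in mind that this loads the whole input into memory at once, which is fine for small files and less so for huge ones.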

More about Hashes

In the first part of this tutorial we introduced hashes, that is groups of scalars which are indexed through other scalars (keys). Now, once you have some Perl hash, chances are you'll want to do something only to its keys, or only to its values, with or without taking into account which keys they are associated with. After all, if you were only interested in storing those values in a fixed relative order, you would have just used a regular array, wouldn't you?
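A short sketch of both cases, using Perl's built-in keys and values functions. The contents of %STAR_WARS_ACTORS and the names @ROLES and @ACTORS are assumptions made up for this example.

```perl
use strict;
use warnings;

# Sample data, invented for the illustration
my %STAR_WARS_ACTORS = (
    'Leia' => 'Carrie Fisher',
    'Han'  => 'Harrison Ford',
    'Luke' => 'Mark Hamill',
);

my @ROLES  = keys   %STAR_WARS_ACTORS;   # only the keys, in no guaranteed order
my @ACTORS = values %STAR_WARS_ACTORS;   # only the values

# Sorting the keys gives a repeatable order to walk through the hash.
foreach my $ROLE (sort keys %STAR_WARS_ACTORS) {
    print "$ROLE is played by $STAR_WARS_ACTORS{$ROLE}\n";
}
```

The loop prints Han first, then Leia, then Luke, because the keys were sorted before use.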

Now, unlike arrays, Perl hashes are not arranged in any specific numerical order, nor in the order in which the keys are added. This is done for performance and other programming reasons, and the practical result is that hash values must always be addressed by key instead of position.

This is true even when you want to delete some elements, or to check if they exist and have been initialized with a defined value. In the first case, the right way (the only one, actually) to remove a key and its associated value from a hash is to do it with the delete() function, which takes as argument the element to remove. Using this function is necessary because if you, for example, assigned an empty value to the element, nothing would be removed. That element would still exist, just with an empty value.

As a matter of fact, sometimes the first thing to do with a hash element is just to discover if it exists, and if any value has been assigned to it. To do this Perl provides two functions named (who would have guessed?) exists() and defined(), to be used as follows:

if (exists  $STAR_WARS_ACTORS{'Leia'}) { print "key found\n"; }      # do something...
if (defined $STAR_WARS_ACTORS{'Leia'}) { print "value defined\n"; }  # do something else...

The first block is executed only if the hash actually contains a key equal to the 'Leia' string, regardless of what the associated value is. The second goes a bit further: it will be true only if there is a key equal to 'Leia' *AND* its associated value is defined, that is, something other than undef.
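The difference between the three functions is easiest to see on one element. This is only a sketch: the hash contents and the three flag variables are invented for the example.

```perl
use strict;
use warnings;

my %STAR_WARS_ACTORS = ('Leia' => 'Carrie Fisher', 'Han' => 'Harrison Ford');

# Assigning undef keeps the key but leaves its value undefined.
$STAR_WARS_ACTORS{'Leia'} = undef;
my $EXISTS_AFTER_UNDEF  = exists($STAR_WARS_ACTORS{'Leia'})  ? 1 : 0;
my $DEFINED_AFTER_UNDEF = defined($STAR_WARS_ACTORS{'Leia'}) ? 1 : 0;

# delete removes the key together with its value.
delete $STAR_WARS_ACTORS{'Leia'};
my $EXISTS_AFTER_DELETE = exists($STAR_WARS_ACTORS{'Leia'}) ? 1 : 0;

print "$EXISTS_AFTER_UNDEF $DEFINED_AFTER_UNDEF $EXISTS_AFTER_DELETE\n";   # prints "1 0 0"
```

After the undef assignment the key is still there but has no value; only delete makes it disappear.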

Regular expressions

Perl was born to manipulate great quantities of text. In order to achieve this it has developed maybe the most complete and powerful set of regular expression facilities of all programming languages. A regular expression, or regex, is a description, using a custom syntax, of the structure of a string of text. The characteristics of such a string are expressed by mixing pieces of regular text with special metacharacters that describe its properties. Some of them are listed in the box.

Regular expression mini cheat sheet

This is a list of the most common special characters found in Perl regular expressions and of their meaning. Photocopy it and always keep it close to your keyboard: it will be a real time saver.

.	Any single character except a newline
^	Beginning of the string
$	End of the string
*	Zero or more of the previous character or group
+	One or more of the previous character or group
?	Zero or one of the previous character or group
\n	New line
\t	Tab
\w	Word characters: letters (regardless of case), digits and the underscore
\W	Every character which is not a letter, a digit or an underscore
\d	Old fashioned digits: 0, 1 etc... up to 9
\D	Everything but digits
\s	Whitespaces: space, tab, newline, etc
\S	Any non-whitespace character
\b	Word boundary
|	Alternative between two values (A|B)
[]	Square brackets delimit a character class
()	Parentheses group a subpattern and remember the substring it matched

Since the characters above are special, when you need to match one of them literally, for example a plus sign, it must be preceded (escaped) by a backslash:

A+ 	# matches one or more capital A's
\+ 	# matches one plus sign
\++  	# matches one or more plus signs
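Here is a small sketch that puts several entries of the cheat sheet to work together with an escaped plus sign. The sample strings and the @SUMS array are invented for the illustration.

```perl
use strict;
use warnings;

my @SUMS;
foreach my $TEXT ('12 + 30', '7+5', '12 30', 'a+b') {
    # ^ and $  : the pattern must cover the whole string
    # \d+      : one or more digits
    # \s*      : zero or more whitespace characters
    # \+       : a literal plus sign (escaped, so it is not a quantifier)
    if ($TEXT =~ /^\d+\s*\+\s*\d+$/) {
        push @SUMS, $TEXT;
    }
}

print join(' | ', @SUMS), "\n";   # prints "12 + 30 | 7+5"
```

Only the first two strings look like "number plus number": the third has no plus sign and the fourth has no digits.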

The beauty and ultimate purpose of all this is that if you can describe any string in full detail you can also tell a script how to find specific patterns and how to alter them automatically. Not surprisingly, all this is best explained with some examples:

/Jedi/
/\bJedi\b/
/^Jedi$/
/Jedi/i
/Jedi|Sith/

The first regex is true whenever a string contains (maybe as part of a longer word) the “Jedi” substring. If “Jedi” matters only when it is a complete word, enclose it between word boundary markers (\b) as in the second regex. The third regex goes even further: it will be true only when Jedi is both at the beginning (^) and at the end ($) of the string, that is when Jedi is the only word in it. By default a Perl regex is case sensitive, so if you need to ignore case and don't feel like typing “JEDI|jedi|JedI” and all the other possible combinations, use the i modifier as in the fourth line for a case insensitive match. Finally, the last regex is true whenever Perl finds either a Jedi or a Sith. Of course, all these markers can be combined to describe strings much more complex than those in the example.
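To check these five regexes against real strings, you could try something like the following sketch. The sample strings, the labels and the %MATCHES hash are made up for the example; the =~ operator it uses to apply a regex to a string is Perl's standard binding operator.

```perl
use strict;
use warnings;

my %MATCHES;
foreach my $TEXT ('Jedi', 'The Jedi Council', 'Jedis', 'jedi', 'Sith Lord') {
    my @HITS;
    push @HITS, 'substring'    if $TEXT =~ /Jedi/;
    push @HITS, 'whole word'   if $TEXT =~ /\bJedi\b/;
    push @HITS, 'only word'    if $TEXT =~ /^Jedi$/;
    push @HITS, 'any case'     if $TEXT =~ /Jedi/i;
    push @HITS, 'Jedi or Sith' if $TEXT =~ /Jedi|Sith/;
    $MATCHES{$TEXT} = join(', ', @HITS);
    print "$TEXT => $MATCHES{$TEXT}\n";
}
```

'Jedis' fails the whole word test because the letter s follows Jedi, so there is no word boundary there, while 'jedi' is caught only by the case insensitive regex.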

All this is worth knowing because when Perl recognizes the text pattern described by a regex it can take any actions you want or modify the corresponding string according to your instructions. Here are the two corresponding formats:

if ($STRING =~ m/some regex here/) {do something}
$STRING =~ s/some regex here/some other text pattern/;

The actual regex is delimited by slashes, and a string is bound to it by the =~ operator. When the slashes are preceded by an m character it means “does the string match this regex?”. When there is an s instead of an m, and some other text between slashes right after the regex, it means “take $STRING and, inside it, replace whatever matches the regex with the text written between the last two slashes”. Regexes are extremely flexible. Partly, this is due to the fact that they can contain scalar variables and can remember in other special variables which text they found:

$JEDI = 'Anakin';
s/Master $JEDI/The future Darth Vader/g;
s/Master (Obi Wan|Yoda)/The Jedi Knight $1/;

Here we start by saying that all the occurrences of 'Master Anakin' (since Perl replaces the variable with its current value) must be replaced with 'The future Darth Vader'. The 'g' modifier at the end means “globally”: without it only the first match would have been replaced. Since there is no =~ here, both substitutions act on Perl's default variable, $_.

The second regex shows another really neat feature. It matches whenever one of the two Jedi mentioned is called 'Master'. Since the names are in parentheses, they are not forgotten when found, but saved in the special variable $1. Therefore, the same regex will replace 'Master Yoda' with 'The Jedi Knight Yoda' or 'Master Obi Wan' with 'The Jedi Knight Obi Wan'. If more matches must be remembered they are associated in the same way to $2, $3 and so on.
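Here is a sketch of the same two techniques applied to an explicit string with =~. The sample sentence in $STRING is invented, and the alternation uses Obi Wan and Mace Windu purely as an illustration.

```perl
use strict;
use warnings;

my $JEDI   = 'Anakin';
my $STRING = 'Master Anakin met Master Obi Wan, then Master Anakin left.';

# $JEDI is interpolated into the regex; g replaces every occurrence.
$STRING =~ s/Master $JEDI/The future Darth Vader/g;

# The parentheses capture the name that matched; $1 reuses it.
$STRING =~ s/Master (Obi Wan|Mace Windu)/The Jedi Knight $1/;

print "$STRING\n";
# prints "The future Darth Vader met The Jedi Knight Obi Wan, then The future Darth Vader left."
```

Without the g modifier on the first substitution, the second 'Master Anakin' in the sentence would have been left untouched.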