Regular Expressions
Lesson 1

Perl Regular Expressions

Regular expressions (sometimes called regexes or regexps) are expressions used for matching patterns of text. They are used for everything from simple searches to complex search-and-replace procedures. If you have used the advanced mode on any of the major Internet search engines, then you have already used a form of pattern matching that is close in spirit to regular expressions.
The purpose of this module is to familiarize you with the major features of Perl's regular expressions so that you can use them as effectively and seamlessly as possible. This is not intended to be an exhaustive regular expression tutorial, since regular expressions can be a deep subject. Many people find that regular expressions look cryptic and decide that they are too difficult and not worth the trouble. Once you get past that first impression, you will begin to use them more effectively and find your programs becoming smaller, faster, and more efficient.

How Perl ties into Regular Expressions

Most software is written to work with and modify data in one format or another. Perl was originally designed as a system for processing logs and summarizing and reporting on the information. Because of this focus, a large proportion of the functions built into Perl are dedicated to the extraction and recombination of information. For example, Perl includes functions for splitting a line by a sequence of delimiters, and it can recombine the line later using a different set.

If you cannot do what you want with the built-in functions, then Perl also provides a mechanism for regular expressions. We can use a regular expression to extract information or as an advanced search and replace tool. Perl also provides a related transliteration operator (tr///), which is not itself a regular expression, for converting or stripping individual characters from a string.

In this module, we are going to concentrate on the data-manipulation features built into Perl, from the basics of numerical calculations through to basic string handling. We will also look at the regular expression mechanism, how it works, and how it integrates into the Perl language. Finally, we will take the opportunity to look at the Unicode character system. Unicode is a standard for representing text that covers not only the ASCII characters, each of which fits in a single byte, but also characters that need more than one byte when encoded, including those with accents.
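The following is a minimal sketch of the operations described above: splitting and rejoining a line, extracting fields with a regular expression, search and replace, and transliteration. The log line and its colon-separated layout are assumptions made up for illustration, not a format used elsewhere in this module.

```perl
use strict;
use warnings;

# Hypothetical log line; the field layout is an assumption for illustration.
my $line = "2024-01-15:error:disk full:/dev/sda1";

# Split on one delimiter, then recombine with a different one.
my @fields = split /:/, $line;
my $tabbed = join "\t", @fields;

# Extract information with capture groups (list context returns the captures).
my ($year, $month, $day) = $line =~ /^(\d{4})-(\d{2})-(\d{2})/;

# Search and replace on a copy of the line, leaving $line untouched.
(my $masked = $line) =~ s{/dev/\w+}{<device>};

# Transliteration: tr/// maps characters one-for-one and is not a regex.
(my $upper = $fields[1]) =~ tr/a-z/A-Z/;       # convert to upper case
(my $digits_only = $fields[0]) =~ tr/0-9//cd;  # strip everything but digits

print "$tabbed\n";
print "$year $month $day\n";
print "$masked\n";
print "$upper $digits_only\n";
```

Copying into a new variable before applying s/// or tr/// keeps the original string intact, which is usually what you want when several transformations start from the same input.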

Perl Regex

Many will argue that regular expressions are the most powerful aspect of the Perl language. Using regular expressions can make your programs smaller, faster, and much easier to read and maintain. Perl's implementation of regular expressions is particularly rich. It has all the enhancements of modern tools like egrep, plus a number of handy tricks all its own. If you take the time to learn about them, your ability to write powerful filters in Perl will greatly improve.

Regular expressions are available in many types of tools and editors, but their power is most fully exposed when available as part of a programming language. Examples include Java, Visual Basic and VBScript, JavaScript and ECMAScript, C, C++, C#, elisp, Perl, Python, Tcl, Ruby, PHP, sed, and awk. In fact, regular expressions are at the very heart of many programs written in some of these languages. At a low level, a regular expression describes a chunk of text. You might use it to verify a user's input, or perhaps to sift through large amounts of data. On a higher level, regular expressions allow you to master your data and control it.
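As a small sketch of the two low-level uses just mentioned, the example below verifies a piece of user input and sifts a list of lines. The username rule, the function name is_valid_username, and the sample log lines are all hypothetical and chosen only for illustration.

```perl
use strict;
use warnings;

# Hypothetical rule: a username is 3 to 16 ASCII letters, digits, or underscores.
# \A and \z anchor the pattern to the start and end of the whole string.
sub is_valid_username {
    my ($input) = @_;
    return $input =~ /\A[A-Za-z0-9_]{3,16}\z/ ? 1 : 0;
}

# Sifting data: keep only the lines that begin with the word ERROR.
my @log = ("INFO start", "ERROR disk full", "WARN slow", "ERROR timeout");
my @errors = grep { /^ERROR\b/ } @log;

print is_valid_username("alice_01") ? "ok\n" : "rejected\n";
print "$_\n" for @errors;
```

Anchoring with \A and \z matters for validation: without them the pattern would accept any string that merely contains a valid username somewhere inside it.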


Scenario

Imagine a scenario where you are given the job of checking the pages on a web server for doubled words (such as "this this"), a common problem with documents subject to heavy editing. Your job is to create a solution that will:
  1. Accept any number of files to check, report each line of each file that has doubled words, highlight (using standard ANSI escape sequences) each doubled word, and ensure that the source filename appears with each line in the report.
  2. Work across lines, even finding situations where a word at the end of one line is repeated at the beginning of the next.
  3. Find doubled words despite capitalization differences, such as with 'The the', and allow differing amounts of whitespace (spaces, tabs, newlines, and the like) to lie between the words.
  4. Find doubled words even when separated by HTML tags. HTML tags are for marking up text on World Wide Web pages, for example, to make a word bold:
    'this is <B>very</B> very important'
This is a real problem that needs to be solved. I ran such a tool on what I had written so far and was surprised at how many doubled words had slipped in. There are many programming languages one could use to solve the problem, but one with regular expression support can make the job substantially easier.
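To give a feel for how a regular expression helps here, the following is a minimal sketch that covers requirements 2 through 4 for text already held in a single string. It is not a complete solution: reading files and reporting filenames are left out, and the function name highlight_doubled_words is a hypothetical one chosen for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text with doubled words wrapped in ANSI
# reverse-video escape sequences, plus the number of doubled pairs found.
sub highlight_doubled_words {
    my ($text) = @_;
    my $count = ($text =~ s{
        \b ([a-z]+)                 # a word, captured as $1
        ( (?: \s | <[^>]+> )+ )     # whitespace and tags between the words
        ( \1 ) \b                   # the same word again, captured as $3
    }{\e[7m$1\e[m$2\e[7m$3\e[m}gix);
    return ($text, $count || 0);
}

my ($report, $found) = highlight_doubled_words(
    "This is <B>very</B>\nvery important, The  the end."
);
print "$found doubled pairs\n$report\n";
```

The /i modifier handles capitalization differences, \s matches newlines as well as spaces and tabs so a pair can span two lines, and the /x modifier allows the pattern to be laid out with comments. The \b after the backreference prevents false matches such as 'the theory'.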


I recommend Jeffrey Friedl's book, Mastering Regular Expressions, for a detailed look into regular expressions.