Perl Regex

Regular expressions are among the most powerful features of the Perl language.
Using regular expressions can make your programs smaller, faster, and much easier to read and maintain. Perl's implementation of regular expressions is particularly rich. It has all the enhancements of modern tools like egrep, plus a number of handy tricks all its own.
I know that regular expressions seem cryptic to most programmers the first time they see them. Indeed, they can look like transmission line noise on a bad hair day. But, if you take the time to learn about them, your ability to write powerful filters in Perl will greatly improve.

Regular expressions are available in many types of tools (editors, word processors, system tools, database engines, and such), but their power is most fully exposed when available as part of a programming language. Examples include Java and JScript, Visual Basic and VBScript, JavaScript and ECMAScript, C, C++, C#, elisp, Perl, Python, Tcl, Ruby, PHP, sed, and awk. In fact, regular expressions are the very heart of many programs written in some of these languages. There's a good reason that regular expressions are found in so many diverse languages and applications: they are extremely powerful. At a low level, a regular expression describes a chunk of text. You might use it to verify a user's input, or perhaps to sift through large amounts of data. On a higher level, regular expressions allow you to master your data. Control it. Put it to work for you. To master regular expressions is to master your data.
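To make "describes a chunk of text" concrete, here is a minimal Perl sketch of the two low-level uses just mentioned: verifying input and sifting through data. The helper names looks_like_date and error_lines, and the choice of a YYYY-MM-DD date as the input format, are assumptions made only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true when $input has the shape YYYY-MM-DD.
# It checks only the shape of the text, not whether the date exists.
sub looks_like_date {
    my ($input) = @_;
    # \A and \z anchor the pattern to the whole string.
    return $input =~ /\A\d{4}-\d{2}-\d{2}\z/ ? 1 : 0;
}

# Hypothetical helper: keep only the lines that contain the word
# "error", ignoring capitalization.
sub error_lines {
    my @lines = @_;
    return grep { /\berror\b/i } @lines;
}

for my $candidate ('2024-02-30', 'tomorrow') {
    printf "%-12s %s\n", $candidate,
        looks_like_date($candidate) ? 'accepted' : 'rejected';
}
```

Note that '2024-02-30' is accepted: the pattern describes what the text looks like, so checking that the date really exists would need additional logic.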




Scenario

Imagine a scenario where you are given the job of checking the pages on a web server for doubled words (such as "this this"), a common problem with documents subject to heavy editing. Your job is to create a solution that will:
  1. Accept any number of files to check, report each line of each file that has doubled words, highlight (using standard ANSI escape sequences) each doubled word, and ensure that the source filename appears with each line in the report.
  2. Work across lines, even finding situations where a word at the end of one line is repeated at the beginning of the next.
  3. Find doubled words despite capitalization differences, such as with 'The the', and allow differing amounts of whitespace (spaces, tabs, newlines, and the like) to lie between the words.
  4. Find doubled words even when separated by HTML tags. HTML tags mark up text on World Wide Web pages, for example, to make a word bold:
    'this is <B>very</B> very important'
This is a real problem that needs to be solved. I ran such a tool on what I had written so far and was surprised at how many doubled words had slipped in. There are many programming languages one could use to solve the problem, but one with regular expression support can make the job substantially easier.
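The four requirements can be sketched in a few lines of Perl. This is one possible approach under stated assumptions, not a definitive solution: the helper name doubled_word_report is hypothetical, each file is assumed small enough to read into memory at once, and words are taken to be runs of the letters a-z.

```perl
use strict;
use warnings;

# Hypothetical helper: takes a chunk of text and a label (for example a
# filename) and returns one report line per source line that contains a
# doubled word.
sub doubled_word_report {
    my ($text, $label) = @_;

    # \b([a-z]+)          a word, captured as $1
    # ((?:\s|<[^>]+>)+)   whitespace and/or HTML tags between the words ($2)
    # (\1\b)              the same word again; /i ignores capitalization ($3)
    # Each word of the pair is wrapped in ANSI reverse video (\e[7m ... \e[m).
    return () unless $text =~
        s/\b([a-z]+)((?:\s|<[^>]+>)+)(\1\b)/\e[7m$1\e[m$2\e[7m$3\e[m/ig;

    # Keep only the lines that now carry a highlight, prefixed with the label.
    return map { "$label: $_" } grep { /\e/ } split /\n/, $text;
}

# Read each file named on the command line whole, so that a pair split
# across two lines is still found. Nothing happens without arguments.
if (@ARGV) {
    for my $file (@ARGV) {
        open my $fh, '<', $file or die "Cannot open $file: $!";
        local $/;    # slurp mode: read the entire file in one go
        my $text = <$fh>;
        close $fh;
        print "$_\n" for doubled_word_report($text, $file);
    }
}
```

Saved under a name of your choosing (say, finddbl.pl, also an assumption), it could be run as perl finddbl.pl page1.html page2.html. One limitation of this sketch: in a run of three identical words, only the first two are highlighted as a pair.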