m Magic

Mar 12

by Sebastian on 12.03.2014 at 17:14 in English, Perl

Perl's regular expression engine is one of the most flexible and powerful pattern matching and manipulation tools. "Easy" and "powerful" often behave like magnetic poles of the same kind: they can't be together. The "s" and "m" suffix modifiers supported by the Perl RegEx engine are an exception: they are easy to understand and still very powerful.

Speaking of two suffix modifiers is not completely correct: there are three cases, because "nothing" (no suffix modifier) also defines a fixed behaviour.

Together they affect three metacharacters: "^", "$" and ".". No modifier means:

"^" - match the beginning of the variable
"$" - match the end of the variable, or just before a newline at the very end
"." - match any character except "\n" (line feed); "\r" (carriage return) is matched like any other character
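
The defaults can be checked with a small sketch. The sample string and the variable names are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical two-line sample string, not taken from the article.
my $text = "alpha one\nbeta two\n";

# Without modifiers, "^" only anchors at the very start of the string,
# so even with /g only the first word is found.
my @starts = $text =~ /^(\w+)/g;

# "$" anchors at the very end, or just before a trailing newline.
my ($last_word) = $text =~ /(\w+)$/;

# "." stops at "\n", so ".*" cannot run past the first line.
my ($dot_run) = $text =~ /(.*)/;

print "starts: @starts\n";        # starts: alpha
print "last word: $last_word\n";  # last word: two
print "dot run: $dot_run\n";      # dot run: alpha one
```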

The "s" modifier behind a RegEx treats the string as a "single line", changing only one pattern:

"." - match any character, including "\n"
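
A minimal sketch of the difference, using a made-up sample string:

```perl
use strict;
use warnings;

# Hypothetical sample string for illustration.
my $text = "alpha one\nbeta two\n";

my ($without_s) = $text =~ /(.*)/;    # stops before the first "\n"
my ($with_s)    = $text =~ /(.*)/s;   # runs through to the end of the string

printf "without /s: %d characters\n", length $without_s;   # 9
printf "with /s:    %d characters\n", length $with_s;      # 19
```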

"m" is the counterpart, a "multiline mode". It can be combined with "s" and changes the two other patterns compared to the defaults without any modifier:

"^" - match the beginning of every line (including the first one, which is the same as the beginning of the variable)
"$" - match the end of every line, that is, just before every "\n" (and at the end of the variable)
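
As a rough illustration with a made-up sample string, "m" lets both anchors hit on every line:

```perl
use strict;
use warnings;

# Hypothetical sample string for illustration.
my $text = "alpha one\nbeta two\n";

# With /m, "^" anchors at the start of every line.
my @first_words = $text =~ /^(\w+)/mg;

# With /m, "$" anchors at the end of every line.
my @last_words = $text =~ /(\w+)$/mg;

print "first words: @first_words\n";   # first words: alpha beta
print "last words: @last_words\n";     # last words: one two
```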

Here is a sample XML snippet:

<grouptag>
   <text>Something I'ld like to know</text>
   <text>More text</text>
</grouptag>

There are many more or less good XML parser modules on CPAN, but for small, predictably formatted input a RegEx can be faster than a full parser. Matching everything within the grouptag should be easy using

/<grouptag>(.*)<\/grouptag>/s

This RegEx captures everything between the opening and closing grouptag, that is, all the text tags and their content, in $1.
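
Applied to the sample snippet, this could look as follows. It is only a sketch: the variable names $xml and $inner are chosen for illustration.

```perl
use strict;
use warnings;

# The sample snippet from above, stored in a (hypothetically named) variable.
my $xml = <<'XML';
<grouptag>
   <text>Something I'ld like to know</text>
   <text>More text</text>
</grouptag>
XML

# In list context the match returns the captured group.
# /s lets ".*" cross the line endings between the tags.
my ($inner) = $xml =~ /<grouptag>(.*)<\/grouptag>/s;

print "Captured:$inner";
```

Because ".*" is greedy, input with several grouptag elements would be captured from the first opening tag to the last closing tag.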

The following line loops through all texts, assuming pretty-printed XML:

while (/^\s*<text>(.*?)<\/text>\s*$/gsm) {

The "^" will match the beginning and "$" the end of every line (because of the "m" modifier used).
"\s*" after the beginning and before the end consumes any indentation and trailing whitespace. "\s" always matches spaces, tabs and line endings.
"(.*?)" contains three parts:
- ".*" matches everything including any newlines within the text because the "s" modifier is active.
- "?" limits the previous ".*" match to "as short as possible". Otherwise ".*" would also swallow every following </text>\s+<text>, up to the last </text>.
- "( )" puts the result into $1 to be used within the while loop.
The "g" modifier makes each evaluation of the loop condition find only one match and remember where it ended, so the next loop iteration continues with the next <text> block.
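
Put together, a complete loop over the sample snippet might look like this. It is a minimal sketch: $xml and @texts are names chosen for illustration, and the loop body simply collects the captured texts.

```perl
use strict;
use warnings;

# The sample snippet from above, stored in a (hypothetically named) variable.
my $xml = <<'XML';
<grouptag>
   <text>Something I'ld like to know</text>
   <text>More text</text>
</grouptag>
XML

my @texts;

# In scalar context, /g continues from where the previous match ended,
# so every iteration sees the next <text> element.
while ($xml =~ /^\s*<text>(.*?)<\/text>\s*$/gsm) {
    push @texts, $1;
}

print "$_\n" for @texts;
# Something I'ld like to know
# More text
```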