Technically, PCRE regexes are powerful enough to match almost anything: with backreferences and recursive subpatterns they go well beyond the regular languages.
In reality, a complex PCRE regex will almost always be more difficult to maintain than a parser-combinator or hand-rolled parser in a more traditional language.
People saying "Regexes can't match HTML, use an html library" are wrong to say regexes are incapable of it, but they're right to say to use a library meant for the job.
The same is true for almost any regular expression that takes advantage of PCRE features, especially backreferences.
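To make that concrete, here's a minimal Python sketch (Python's `re` supports the relevant PCRE-style backreference feature) showing a single backreference recognizing a^n b a^n, a classic non-regular language that no true regular expression can match:

```python
import re

# A backreference lets the pattern demand the same number of a's on both
# sides of the b, something a plain regular expression cannot express.
pattern = re.compile(r"^(a*)b\1$")

assert pattern.match("aabaa")       # n = 2: matches
assert pattern.match("b")           # n = 0: matches
assert not pattern.match("aabaaa")  # unbalanced: no match
```

The power comes at a cost: backreferences force backtracking, which is exactly the maintainability and performance trap the thread is describing.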
In addition, a regex will only match HTML correctly if you write a very complex one. With a naive regex for an HTML tag's contents, you'll find that you can still match that text inside a <script> tag even though it isn't HTML. So now you need to figure out when you're inside a script tag and exclude that, or whether you're inside an HTML attribute string, and before you know it you have a 2000-character regex that no one else will be able to read, all because you didn't want to use an HTML parsing library, where getting a tag's value correctly would be a single XPath expression or CSS selector away.
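Here's a small Python sketch of exactly that failure mode, using the stdlib `html.parser` in place of a full HTML library (the document and class names are made up for illustration):

```python
import re
from html.parser import HTMLParser

html_doc = '<p>real text</p><script>var s = "<p>not real text</p>";</script>'

# The naive pattern happily "finds" the <p> inside the script's string literal.
naive = re.findall(r"<p>(.*?)</p>", html_doc)
# naive == ['real text', 'not real text']

# A real parser tracks context: it treats <script> contents as raw data,
# not markup, so only the genuine paragraph is reported.
class ParagraphCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.paragraphs.append(data)

collector = ParagraphCollector()
collector.feed(html_doc)
# collector.paragraphs == ['real text']
```

With a proper library the same extraction is one selector, e.g. `soup.select("p")` in BeautifulSoup, and the script-tag edge case is handled for you.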
There's no arguing the fact that regexes are a poor fit for HTML, but maybe this is the wrong time to use that ridiculous email regex as an example, since TFA features a highly readable, fully compliant email matching regex as its main example.
It also doesn't point out that matching email addresses in general is a nightmare because the standard is one of those "we'll just allow everything everybody is doing right now" type standards that have a million different little quibbles.
No matter what language or programming style you use it's going to be ugly because it's an ugly problem.
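A quick Python sketch of the problem: the pattern below is a made-up "reasonable-looking" email regex, yet it rejects addresses that are perfectly valid under RFC 822/5322:

```python
import re

# A typical "looks sensible" email pattern (hypothetical, for illustration).
naive = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")

# Both of these are valid per the RFC grammar, but the naive pattern
# rejects them:
oddities = [
    '"john smith"@example.com',  # quoted local part may contain spaces
    'user@[192.168.1.1]',        # domain literal instead of a hostname
]

assert naive.match("alice@example.com")          # the easy case works
assert all(not naive.match(a) for a in oddities) # the quibbles do not
```

This is why the only regex that is actually compliant is the enormous machine-generated one, and why many people settle for "send a confirmation email and see if it arrives" instead.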
What the article features as its main example is a PCRE pattern that matches email addresses, and compared to its original EBNF incarnation it is highly unreadable, due to all the syntax noise required to graft backtracking support onto traditional regex syntax (not to mention that its performance is almost certainly far from optimal).
> It provides the same functionality as RFC::RFC822::Address, but uses Perl regular expressions rather than the Parse::RecDescent parser. This means that the module is much faster to load as it does not need to compile the grammar on startup.
Of course, if perl were a statically compiled language, the cost of compiling the grammar could be done at compile time.
Perl5 is not very meaningful for such performance comparisons, because on the one hand its regex implementation is heavily optimized, while on the other hand Perl5's performance on "normal" procedural code is terrible (e.g. Perl5 is about an order of magnitude slower than CPython on Gabriel's Takeuchi-function benchmark).
This result generalises to most interpreted languages, though. PHP, Python, JavaScript, etc. all have highly optimised regex engines, and regexes can consequently be a good optimisation technique when using those languages.
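A small Python sketch of that optimisation pattern (the example data is made up, and actual timings will vary by workload): the same extraction, done once character by character in interpreted bytecode and once by handing the whole scan to the C-implemented regex engine.

```python
import re

text = "order 17 shipped, order 42 pending, order 7 delayed"

# Hand-rolled scan: every character is touched in interpreted bytecode.
def digits_by_loop(s):
    nums, cur = [], ""
    for ch in s:
        if ch.isdigit():
            cur += ch
        elif cur:
            nums.append(int(cur))
            cur = ""
    if cur:
        nums.append(int(cur))
    return nums

# Same extraction pushed into the regex engine, which runs as native code.
def digits_by_regex(s):
    return [int(m) for m in re.findall(r"\d+", s)]

assert digits_by_loop(text) == digits_by_regex(text) == [17, 42, 7]
```

On large inputs the regex version typically wins by a wide margin in CPython, precisely because the per-character work moves out of the interpreter.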
> People saying "Regexes can't match HTML, use an html library" are wrong to say regexes are incapable of it, but they're right to say to use a library meant for the job.
Plot twist: the html library is built upon regexes (at least in part).
Every parser is partially built upon regexes. You have to go all the way to Haskell, Prolog, or similar languages before you get better options than regexes for building them.
But they are not built solely of regexes. They always add control structures that complement regexes in the places where regexes are weakest.
Even real regular regexes can be used when nesting is limited, which is true for most real-world HTML, XML, and JSON. Still, you're better off using libraries.
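For instance, here's a Python sketch under the assumption of genuinely flat markup, `<meta>` tags, which contain no nested tags. A true regular pattern suffices, though note this toy pattern also assumes a fixed attribute order and double quotes, which is exactly the kind of brittleness the library saves you from:

```python
import re

# Flat, non-nested markup: no recursion needed, so a plain regular
# pattern can extract every name/content pair in one pass.
head = (
    '<meta name="author" content="Jane Doe">'
    '<meta name="keywords" content="regex,parsing">'
)
meta = dict(re.findall(r'<meta name="([^"]*)" content="([^"]*)">', head))
# meta == {'author': 'Jane Doe', 'keywords': 'regex,parsing'}
```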
The words of the statement matter in specificity. You can parse HTML with a powerful regular expression, but it's not a good tool for the job. That said, I find it a wonderful tool to extract specific portions of an HTML document.
If you actually just care about retrieving a few specific bits of data within a page, I've found parsing libraries (including ones that allow for CSS selectors) to be just as brittle to changes as regular expression extraction, and not all that much easier to use, given a good grasp of both technologies.
That said, if you need to alter an HTML document in some non-trivial way, parsing is probably the way to go.
We had two versions of a particular app once. One used BeautifulSoup to parse the page and pull out the relevant elements. The other used some crusty old regex patterns. At the end of the day the regex version required about half the maintenance the tag-soup version did. IMHO the difference was that the regex version took some of the content into consideration, while the tag-only version was more sensitive to otherwise invisible changes under the hood.
Not to mention the sheer difference in performance between the two. I've found regexes to be orders of magnitude faster than parsers, for extracting data, that is.