Summoning Cthulhu by Parsing HTML with Regular Expressions
Regular expressions are an incredibly powerful tool. It is no wonder that so many people have flocked to internet forums to ask, “can I use regex to parse HTML”? According to this legendary Stack Overflow post, the answer is a resounding no. In fact, the post implies that attempting to parse HTML with regex will summon a Lovecraftian entity known as Z̷͎̐a̶͉̓l̵̟͛g̷̺̐ȏ̸̙, and should therefore be avoided at all costs.
“A mere glimpse of the world of regex parsers for HTML will instantly transport a programmer's consciousness into a world of ceaseless screaming.”
— Stack Overflow User
Numerous reasons are given in the Stack Overflow thread for why parsing HTML with regex is a no-go. Intimidating terminology such as “Chomsky Type 2 grammar” is thrown around. Commenters claim that it is “mathematically impossible”, with one reply comparing it to dividing by zero. Looks like there’s nothing more to see here. After all, if the top replies on Stack Overflow say it’s impossible, then it must be impossible, right?
Well yes, but actually no.
Don’t get me wrong. Using regex to parse HTML is a very, very bad idea. However, the reasoning in that Stack Overflow thread is not entirely sound in the context of modern programming languages. In this blog post, I’ll explain why, in some programming languages, if you really really wanted to, you could write a regex to match HTML.
Languages and Regular Expressions
To clear up the confusion surrounding regexes and HTML, we must first figure out what “regular expression” actually means. Unfortunately, this term can mean two different things depending on context. On the one hand, there are “regular expressions” as they appear in modern programming languages and libraries. On the other hand, there is the formal concept of “regular expressions” as it appears in computer science, and in particular the study of languages. The latter notion is the one which causes so many people to believe that parsing HTML is not possible with regular expressions. To see why, we must dive into the formal theory of languages.
For our purposes, a language is simply a set of strings. The set of all email addresses is a language. So is the set of all phone numbers. Importantly, a language may be infinite; that is, it may contain infinitely many strings. For example, the set $\{\varepsilon,\>\text{A},\>\text{AA},\>\text{AAA},\>\text{AAAA},\>…\}$ is an infinite language (here, we use $\varepsilon$ to represent the empty string). We want a simple way to completely describe a language, even if it is extremely large or infinite. The previous example can be described informally as “the set of strings which consist of $\text{A}$ repeated zero or more times”. This description, which is itself finite, completely describes an infinite set of strings.
For complex languages, however, informal descriptions are too verbose. Furthermore, we want a computer to automatically check whether or not a string belongs to a language. Since computers cannot understand natural language (yet), informal descriptions will not suffice.
Enter regular expressions. Just like the regular expressions used in everyday programming, formal regular expressions are used to concisely describe some large set of strings. In formal language theory, regular expressions consist of:
- “Alphabet” symbols, such as $\text{A}$ or $\text{B}$
- The $+$ symbol, which means “or”
- The $\cdot$ symbol, which means “concatenate”
- The Kleene star, $*$, which means “repeat zero or more times”
- Parentheses, $($ and $)$, for grouping
(For simplicity, I am skipping over some subtler parts of the formal definition; you can find a complete definition here.)
Here are a few examples of formal regular expressions, along with the languages they describe:
| Formal Regex | Language |
|---|---|
| $\text{A}+\text{B}$ | $\{\text{A},\>\text{B}\}$ |
| $\text{A} \cdot \text{B} \cdot \text{C}$ | $\{\text{ABC}\}$ |
| $\text{A} \cdot \text{B} \cdot (\text{C} + \text{D})$ | $\{\text{ABC},\> \text{ABD}\}$ |
| $\text{A}^*$ | $\{\varepsilon,\>\text{A},\>\text{AA},\>\text{AAA},\>\text{AAAA},\>…\}$ |
| $(\text{A} \cdot \text{B})^*$ | $\{\varepsilon,\>\text{AB},\>\text{ABAB},\>\text{ABABAB},\>…\}$ |
| $\text{A} \cdot \text{B}^*$ | $\{\text{A},\>\text{AB},\>\text{ABB},\>\text{ABBB},\>…\}$ |
It is apparent that these regular expressions are not nearly as powerful as the regular expressions built into modern programming languages. At the very least, formal regexes are far more verbose. Whereas in a modern regex you can simply write something like `[1-4]*`, a formal regex would require you to write $(1+2+3+4)^*$. Nevertheless, formal regexes are perfectly capable of describing simple languages such as phone numbers.
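As a quick sanity check (my own, not something from the Stack Overflow thread), we can confirm in Ruby, the language all of the later examples in this post use, that the character class really is just shorthand for the spelled-out alternation:

```ruby
# The character class [1-4] is shorthand for the alternation (1|2|3|4),
# so both patterns accept exactly the same strings.
shorthand   = /\A[1-4]*\z/
spelled_out = /\A(1|2|3|4)*\z/

["", "1234", "4411", "105"].each do |s|
  puts "#{s.inspect}: #{shorthand.match?(s)} / #{spelled_out.match?(s)}"
end
# "": true / true
# "1234": true / true
# "4411": true / true
# "105": false / false
```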
Regular and Non-Regular Languages
If a language can be described by a formal regular expression, we call it a regular language. In a perfect world, all languages would be regular. But this is not a perfect world. Consider the following language: $\{\varepsilon,\>\text{AB},\>\text{AABB},\>\text{AAABBB},\>…\}$. Let’s call this language $L_1$. This language contains all strings which consist of $\text{A}$ repeated $n$ times, followed by $\text{B}$ repeated $n$ times, for some $n \geq 0$. Although it is very simple to describe $L_1$ informally, it turns out that it is not a regular language! There does not exist any regular expression which can describe $L_1$ (we can prove this using something called the pumping lemma).
A language which cannot be described by a regular expression is called non-regular. In general, regular expressions fail to describe languages which require an arbitrary amount of memory to parse. To see what I mean, let’s take a look at the following regular expression:
\[\text{A}+(\text{A}\cdot\text{A})+(\text{A}\cdot\text{A}\cdot\text{A})\]
This describes the set of strings consisting of the letter $\text{A}$, repeated one, two, or three times. If we wanted to check whether a string is in this language, we would need to count the number of $\text{A}$s it contains; as soon as this count exceeds 3, we can stop and conclude that the string is not in the language. This process only ever requires a fixed amount of memory: we know that our count will never need to grow past a certain limit.
On the other hand, to check if a string is in our non-regular language $L_1$, we need to count the number of $\text{A}$s and the number of $\text{B}$s, and check that those two numbers are equal. If the string starts with 100 $\text{A}$s, we would need enough memory to count up to 100; if the string starts with 1,000,000 $\text{A}$s, we would need enough memory to count up to 1,000,000. Larger numbers require more memory, so we cannot put a hard limit on the amount of memory required in advance. In other words, parsing strings in $L_1$ requires an unlimited amount of memory!
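To make the difference concrete, here is a rough Ruby sketch of my own of the two membership checks. In the first, the counter is capped at 3, so memory use is fixed no matter how long the input is; in the second, the counters have to grow with the input:

```ruby
# Regular language: the letter "A" repeated one to three times.
# The counter never needs to exceed 4, so a fixed amount of memory suffices.
def one_to_three_as?(str)
  count = 0
  str.each_char do |ch|
    return false unless ch == "A"
    count += 1
    return false if count > 3   # bail out as soon as the cap is exceeded
  end
  count.between?(1, 3)
end

# Non-regular language L1: "A" repeated n times, followed by "B" repeated n times.
# The counters must grow with the input, so no fixed memory limit can be
# promised in advance.
def in_l1?(str)
  a_count = 0
  b_count = 0
  str.each_char do |ch|
    case ch
    when "A"
      return false if b_count > 0   # an "A" after a "B" is never allowed
      a_count += 1
    when "B"
      b_count += 1
    else
      return false
    end
  end
  a_count == b_count
end

one_to_three_as?("AAA")   # => true
one_to_three_as?("AAAA")  # => false
in_l1?("AAABBB")          # => true
in_l1?("AAB")             # => false
```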
This gives us a good rule of thumb for figuring out whether a language is regular or not. Given some language, come up with an algorithm for checking whether a string is in the language. If the algorithm takes a fixed, finite amount of memory, independent of the size of the input string, then the language is regular. If you try very hard and cannot come up with such an algorithm, then it is very likely that the language is non-regular (although you would still need to use other methods to prove this with certainty).
As another example, we can look at the set of all palindromes. One way to check whether a string is a palindrome is to reverse it and check that the reversed copy is equal to the original string. This process requires storing a reversed copy of the string in memory. Since the string can be arbitrarily large, we require an arbitrarily large amount of memory. Any algorithm that I can come up with for determining whether a string is a palindrome runs into this same issue. By our rule of thumb, the set of palindromes is a non-regular language; this is indeed a fact which can be formally proved.
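In Ruby that check is a one-liner, but notice where the memory goes: `reverse` has to build a copy that is as large as the input itself.

```ruby
def palindrome?(str)
  str == str.reverse   # the reversed copy grows with the input string
end

palindrome?("TENET")  # => true
palindrome?("TENT")   # => false
```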
So what about HTML? `<p>Hello World!</p>` is valid HTML. `<p><p>Hello World!</p></p>` is also valid HTML. On the other hand, `<p><p>Hello World!</p>` is not valid HTML: there are two opening `<p>` tags, and only one closing `</p>` tag.

Even in this very specific case, we need to store a count in memory to check whether something is valid HTML: we need to count the number of opening `<p>` tags, count the number of closing `</p>` tags, and check that those two numbers are equal. More generally, to parse HTML, we need to keep track of all opening tags in memory, and later check that each one has a corresponding closing tag. There can be an arbitrary number of tags, so we need an arbitrary amount of memory to parse them. By this (highly informal) justification, HTML should be non-regular; and as with the palindrome example, it is possible to formally prove that this is indeed the case! This is why all those Stack Overflow users were so adamant that you cannot parse HTML with regular expressions.
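To make the “keep track of all opening tags” idea concrete, here is a toy tag checker of my own. It is nowhere near a real HTML parser, but it shows where the memory goes: the stack of open tags can grow without bound as the nesting gets deeper.

```ruby
# Toy checker for properly nested tags such as <p>...</p>.
# Not a real HTML parser; it only exists to show that every opening tag
# must be remembered until its closing tag shows up.
def balanced_tags?(input)
  stack = []
  input.scan(%r{<(/?)([a-z]+)>}) do |slash, name|
    if slash.empty?
      stack.push(name)                    # remember the opening tag
    else
      return false if stack.pop != name   # must match the most recent opener
    end
  end
  stack.empty?                            # every opener needs a closer
end

balanced_tags?("<p>Hello World!</p>")        # => true
balanced_tags?("<p><p>Hello World!</p></p>") # => true
balanced_tags?("<p><p>Hello World!</p>")     # => false
```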
Context-Free Grammars
Since regular expressions are not very powerful, we need a better way of describing languages. One way of doing this is by using a context-free grammar (CFG). A CFG is essentially a set of instructions, called production rules, which tell you how to construct strings in a language. This concept is easier to explain by example. Recall the language $L_1$, which we looked at in the previous section. The following is a CFG which describes $L_1$:
\[\begin{align*} \langle S \rangle &\>\>\>\to\>\>\> \text{A} \langle S \rangle \text{B}, \>\> \varepsilon \end{align*}\]
The symbol $\langle S \rangle$ is the start symbol. When we want to construct a string in $L_1$, we start by writing down $\langle S \rangle$. Then, the above CFG tells us that we can replace $\langle S \rangle$ with either $\text{A} \langle S \rangle \text{B}$ or the empty string $\varepsilon$. As long as the symbol $\langle S \rangle$ is still present, we can continue replacing it according to these rules. For example, to generate the string $\text{AABB}$, we would do the following:
\[\begin{align*} 1.\>\> &\langle S \rangle \\ 2.\>\> &\text{A} \langle S \rangle \text{B} \\ 3.\>\> &\text{AA} \langle S \rangle \text{BB} \\ 4.\>\> &\text{AABB} \\ \end{align*}\]
For the first 3 steps, we replace $\langle S \rangle$ with $\text{A} \langle S \rangle \text{B}$. For step 4, we replace $\langle S \rangle$ with the empty string.
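If it helps to see the production rule as code, here is a direct (and entirely unofficial) transliteration into a recursive Ruby function, where the depth argument decides how many times we expand $\langle S \rangle$ before replacing it with the empty string:

```ruby
# <S> -> A <S> B   (expand again)
# <S> -> ""        (stop)
def generate_s(depth)
  return "" if depth.zero?             # replace <S> with the empty string
  "A" + generate_s(depth - 1) + "B"    # replace <S> with A <S> B
end

generate_s(0)  # => ""
generate_s(2)  # => "AABB"
generate_s(4)  # => "AAAABBBB"
```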
We immediately see that context-free grammars are more powerful than regular expressions: $L_1$ can be described by a CFG, but not by a regular expression.
The set of palindromes consisting of capital English letters (which, as argued above, is a non-regular language) can also be described by a CFG:
\[\begin{align*} \langle S \rangle &\>\>\>\to\>\>\> \langle X \rangle,\>\> \langle Y \rangle,\>\> \varepsilon \\ \langle X \rangle &\>\>\>\to\>\>\> \text{A} \langle S \rangle \text{A}, \>\> \text{B} \langle S \rangle \text{B},\>\> \text{C} \langle S \rangle \text{C},\>\> ...,\>\> \text{Z} \langle S \rangle \text{Z} \\ \langle Y \rangle &\>\>\>\to\>\>\> \text{A}, \>\> \text{B},\>\> \text{C},\>\> ...,\>\> \text{Z} \end{align*}\]
This CFG is much more complicated, and has many production rules. The first line says that we can replace the start symbol $\langle S \rangle$ with $\langle X \rangle$, or $\langle Y \rangle$, or the empty string. In turn, we can replace $\langle X \rangle$ with $\text{A} \langle S \rangle \text{A}$, or $\text{B} \langle S \rangle \text{B}$, or $\text{C} \langle S \rangle \text{C}$, etc. The third line says that we can replace $\langle Y \rangle$ with any letter of the alphabet.
To construct the palindrome $\text{ABCCBA}$ using this CFG, we would do the following:
\[\begin{align*} 1.\>\> &\langle S \rangle \\ 2.\>\> &\langle X \rangle \\ 3.\>\> &\text{A} \langle S \rangle \text{A} \\ 4.\>\> &\text{A} \langle X \rangle \text{A} \\ 5.\>\> &\text{AB} \langle S \rangle \text{BA} \\ 6.\>\> &\text{AB} \langle X \rangle \text{BA} \\ 7.\>\> &\text{ABC} \langle S \rangle \text{CBA} \\ 8.\>\> &\text{ABCCBA} \\ \end{align*}\]
To construct the palindrome $\text{TENET}$, we would do the following:
\[\begin{align*} 1.\>\> &\langle S \rangle \\ 2.\>\> &\langle X \rangle \\ 3.\>\> &\text{T} \langle S \rangle \text{T} \\ 4.\>\> &\text{T} \langle X \rangle \text{T} \\ 5.\>\> &\text{TE} \langle S \rangle \text{ET} \\ 6.\>\> &\text{TE} \langle Y \rangle \text{ET} \\ 7.\>\> &\text{TENET} \\ \end{align*}\]
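Purely as another illustration (mine, not the post’s), the palindrome grammar also translates directly into a recursive recognizer, with one branch per production rule:

```ruby
# Recognise palindromes over A-Z by following the grammar:
#   <S> -> <X> | <Y> | ""          (letter-wrapped palindrome, single letter, or empty)
#   <X> -> A <S> A | ... | Z <S> Z
#   <Y> -> A | ... | Z
def grammar_palindrome?(str)
  return true if str.empty?                          # <S> -> ""
  return str.match?(/\A[A-Z]\z/) if str.length == 1  # <S> -> <Y>
  ("A".."Z").cover?(str[0]) &&                       # <S> -> <X>: same capital letter on
    str[0] == str[-1] &&                             # both ends, with a (possibly empty)
    grammar_palindrome?(str[1..-2])                  # palindrome in between
end

grammar_palindrome?("TENET")   # => true
grammar_palindrome?("ABCCBA")  # => true
grammar_palindrome?("TENT")    # => false
```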
HTML can also be described by a context-free grammar. It is, however, extremely complicated. One production rule for HTML might look like this:
\[\begin{align*} \langle S \rangle &\>\>\>\to\>\>\> \text{<p>} \langle S \rangle \text{</p>} \end{align*}\]
This rule says that valid HTML can consist of an opening `<p>` tag and closing `</p>` tag, with valid HTML in-between. The CFG would include a similar production rule for every possible HTML tag. Of course, even this single production rule is highly oversimplified, and omits many other things which may be included in a `<p>` tag, such as attributes (e.g. `<p style="color: red">`). Unfortunately, the full CFG for HTML would be too complex to write down here.
Regular Expressions Today
“I'm going to make my own regular expressions! With blackjack! And hookers!”
— A modern programmer implementing a modern regex engine
Regular expressions in modern programming languages are far more powerful than the ones defined in the formal theory of languages. To illustrate, consider the following Ruby regex:
```ruby
/\A
  (?<S> (A \g<S> B)?)
\z/x
```
The above regex takes advantage of an advanced feature called recursive subexpression calls to mimic the structure of a CFG. The important parts are on the second line; let’s go through them step by step. The outermost part, `(?<S> ...)`, means “create a group called `<S>`”. The inner contents tell Ruby what the group `<S>` consists of. In this case, it consists of `(A \g<S> B)?`. That is, it consists of the letter `A`, followed by the group `<S>`, followed by `B`. The `?` at the end means that the whole thing is optional, or in other words, can be replaced with the empty string. To summarize: this regex matches strings which consist of `<S>`, where `<S>` is either `A<S>B` or the empty string. This should look familiar: it’s exactly our CFG for $L_1$, written in a way that can be understood by Ruby’s regex engine! So, this Ruby “regex” is actually describing a non-regular language!
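You can try it out directly (the variable name is mine); it accepts exactly the strings of $L_1$:

```ruby
l1_regex = /\A
  (?<S> (A \g<S> B)?)
\z/x

l1_regex.match?("")        # => true
l1_regex.match?("AABB")    # => true
l1_regex.match?("AAABBB")  # => true
l1_regex.match?("AAB")     # => false
l1_regex.match?("ABAB")    # => false
```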
Similarly, the following Ruby regex mimics the CFG for palindromes we had above:
```ruby
/\A
  (?<S> (\g<X> | \g<Y>)?)
  (?<X> (?<letter>[A-Z]) \g<S> \k<letter+0>){0}
  (?<Y> [A-Z]){0}
\z/x
```
This will match palindromes consisting of capital English letters.
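Again, a quick check (the variable name is mine):

```ruby
palindrome_regex = /\A
  (?<S> (\g<X> | \g<Y>)?)
  (?<X> (?<letter>[A-Z]) \g<S> \k<letter+0>){0}
  (?<Y> [A-Z]){0}
\z/x

palindrome_regex.match?("TENET")   # => true
palindrome_regex.match?("ABCCBA")  # => true
palindrome_regex.match?("")        # => true
palindrome_regex.match?("AB")      # => false
```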
The takeaway from all this is that Ruby regexes are powerful enough to describe CFGs. And since HTML can be described by a CFG, you should be able to write a Ruby regex which matches HTML. I, however, am not bored enough to actually attempt such a thing. Perhaps you, dear reader, are willing to undertake this task. However, if you do so, I must warn you that some unfortunate side effects may occur, including (but not limited to) fits of rage, loss of sanity, and the summoning of Cthulhu himself.
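If you want just a small taste without the full ritual, here is a toy pattern of my own in the same recursive style. It matches nothing more than plain text wrapped in properly nested `<p>` tags; attributes, other tags, and everything else that makes HTML actual HTML are deliberately missing.

```ruby
# A toy "HTML" matcher: plain text wrapped in properly nested <p> tags.
# Emphatically not an HTML parser.
toy_html = /\A
  (?<S> (<p> \g<S> <\/p> | [^<>]*)?)
\z/x

toy_html.match?("<p>Hello World!</p>")        # => true
toy_html.match?("<p><p>Hello World!</p></p>") # => true
toy_html.match?("<p><p>Hello World!</p>")     # => false
```

Extending it to the rest of HTML is left to braver souls than me.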