Abrazolica

Home Archive Tags About RSS

Counting with Regular Expressions

posted on: Friday, May 24th, 2013, 11:41

A regular expression is used to define a set of words constructed from a finite alphabet of letters. The set of words so defined is called a regular language. The best way to understand what this means is by example. Let the alphabet consist of four letters \(A=\{a,b,c,d\}\). One simple language is all the two letter words that begin with the letter \(a\). The regular expression defining the language is: \(aa+ab+ac+ad\). The \(+\) operator in the expression is analogous to a logic \(or\) operator or a set theory union operator. The expression can be simplified a bit by writing it as \(a(a+b+c+d)\) which represents the concatenation of \(a\) with any of the possible one letter words.

This example shows the two basic operations used to construct regular expressions: union and concatenation. If \(R\) and \(S\) are regular expressions then \(R+S\) and \(RS\) are also regular expressions. The simplest regular expression is just a single letter, \(R=a\) or \(R=\lambda\) where \(\lambda\) is an empty string i.e. a string with no letters. Multiple concatenations can be written in a manner analogous to exponentiation, \(ababab=(ab)^3\) or if \(R=ab\) then \(RRR=R^3\). Note that \((ab)^3\ne a^3b^3=aaabbb\).

Another operator used in regular expressions is called the Kleene star. It is defined as follows: \(R^*=\lambda + R + R^ 2 +R^3 +\cdots\). A starred regular expression can appear zero or more times concatenated with itself. Using the Kleene star, the regular expression for all possible words is simply \((a+b+c+d)^*\).

A language can have words of many different lengths. The number of words with a given length can be found using something called a generating function which we will denote as \(N(z)\). For regular languages \(N(z)\) will be either a polynomial or a rational function of two polynomials that can be expanded as a power series. In both cases the coefficient of \(z^n\) will equal the number of words that have length \(n\). \(N(z)\) can be found directly from the regular expression by replacing each alphabet letter by \(z\) and replacing each starred expression, \(R^*\) by \((1-R(z))^{-1}\) where \(R(z)\) is gotten by replacing each alphabet letter in \(R\) by \(z\). The \(+\) operators are then treated as addition operators and concatenation is treated as multiplication. For example the generating function for the language defined by \((a+b)(c+d)^*\) becomes

\[N(z)=(z+z)(1-z-z)^{-1}=\frac{2z}{1-2z}\]

The following examples show how to use these ideas to solve different kinds of counting problems.

A binary alphabet has only two letters \(A=\{a,b\}\). The regular expression for all possible words that can be formed is \((a+b)^*\). Simple combinatorial reasoning says that the number of words of length \(n\) must be \(2^n\) since there are two choices for each of the \(n\) letters. You get the same answer by deriving the generating function from the regular expression. Replace \(a\) and \(b\) with \(z\) and transform the star using the rule given above. This results in the following generating function:

\[N(z)=\frac{1}{1-2z}=1+2z+4z^2+8z^3+\cdots\]

The coefficient of \(z^n\) in the power series expansion is \(2^n\), the same answer we got using a combinatorial argument.

The problem can easily be generalized to an alphabet with \(m\) letters, \(A=\{a_1, a_2, \ldots, a_m\}\). The regular expression for all possible words is \((a_1 + a_2 + \cdots + a_m)^*\). Converting this into a generating function gives:

\[N(z)=\frac{1}{1-mz}=1+mz+m^2z^2+m^3z^3+\cdots\]

The coefficient of \(z^n\) in the power series expansion is \(m^n\), a result you can also get using combinatorial reasoning.

Using the same alphabet with \(m\) letters, we want to count words that contain exactly \(k\) copies of \(a_1\). The regular expression for this language is \((a_2+a_3+\cdots +a_m)^*(a_1(a_2+a_3+\cdots +a_m)^*)^k\). Converting this into a generating function gives:

\[N(z)=\frac{1}{1-(m-1)z}\left(\frac{z}{1-(m-1)z}\right)^k=\frac{z^k}{(1-(m-1)z)^{k+1}}\]

The power series expansion of the generating function is:

\[N(z)=\sum_{n=k}^{\infty}\binom{n}{k}(m-1)^{n-k}z^n\]

So the number of words of length \(n\) with exactly \(k\) copies of \(a_1\) is \(\binom{n}{k}(m-1)^{n-k}\). The combinatorial interpretation is that there are \(\binom{n}{k}\) ways to choose which \(k\) positions become \(a_1\) and there are \(m-1\) choices for each of the remaining \(n-k\) positions. For the binary alphabet \(A=\{a,b\}\), \(m=2\) and the number of binary words that contain \(a\) exactly \(k\) times is \(\binom{n}{k}\). For the alphabet \(A=\{a,b,c\}\), \(m=3\) and the number of words that contain \(a\) exactly \(k\) times is \(\binom{n}{k}2^{n-k}\).

Given the alphabet \(A=\{a,b,c,d\}\), how many words of length \(n\) contain exactly one \(c\) and one \(d\)? The regular expression for the language is \((a+b)^*(c(a+b)^*d+d(a+b)^*c)(a+b)^*\) and the generating function is:

\[N(z)=\frac{1}{1-2z}\frac{2z^2}{1-2z}\frac{1}{1-2z}=\frac{2z^2}{(1-2z)^3}\]

In the power series expansion the coefficient of \(z^n\) is \(\binom{n}{2}2^{n-1}\).

Using the same alphabet, how many words of length \(n\) contain exactly two copies each of \(c\) and \(d\)? The words in this language are specified by regular expressions of the form: \((a+b)^*x(a+b)^*x(a+b)^*x(a+b)^*x(a+b)^*\) where two of the four \(x\) will equal \(c\) and two will equal \(d\). The language is completely specified by \(\binom{4}{2}=6\) regular expressions of this type. The generating function for each of the regular expressions is given by the above equation for alphabets with \(m=3\) letters and \(k=4\) copies of a given letter (the alphabet is \(A=\{a,b,x\}\) and there are 4 copies of \(x\)). The generating function for the language is therefore:

\[N(z)=\frac{6z^4}{(1-2z)^5}\]

Can you show how to generalize this result to a language where each word contains exactly \(r\) copies of \(c\) and \(s\) copies of \(d\)? Show that the generating function in this case is:

\[N(z)=\binom{r+s}{r}\frac{z^{r+s}}{(1-2z)^{r+s+1}}\]

Suppose now that we have two alphabets, \(A=\{a_1,a_2,\ldots,a_l\}\) with \(l\) letters, and \(B=\{b_1,b_2,\ldots,b_m\}\) with \(m\) letters. We want a language where each word can have any number of letters from \(A\) but every letter of \(B\) must occur a specified number of times. The letter \(b_i\), \(i=1,2,\ldots,m\) must occur exactly \(j_i\) times in each word for a total of \(k=j_1+j_2+\cdots+j_m\) letters from the \(B\) alphabet. Let \(b\) represent any of the \(b_i\) letters, then a regular expression specifying an acceptable word will look like: \((a_1+a_2+\cdots+a_l)^*(b(a_1+a_2+\cdots+a_l)^*)^k\). The entire language will be specified by \(\binom{k}{j_1 j_2 \cdots j_m}\) such regular expressions. The generating function for the language is then

\[N(z)=\binom{k}{j_1 j_2 \cdots j_m}\frac{z^k}{(1-lz)^{k+1}}\]

There are many interesting counting problems involving runs and patterns in words. For example, given the binary alphabet \(A=\{a,b\}\), how many words of length \(n\) have no runs of \(k\) or more \(a\)'s? The regular expression for a run of less than \(k\) \(a\)'s is \(R_k=\lambda+a+a^2+\cdots+a^{k-1}\) and the generating function is

\[N_k(z)=1+z+z^2+\cdots+z^{k-1}=\frac{1-z^k}{1-z}\]

The regular expression for the language is \((R_kb)^*R_k\) and the generating function is

\[N(z)=\frac{N_k(z)}{1-zN_k(z)}=\frac{1-z^k}{1-2z+z^{k+1}}\]

We can also look at words with no runs of \(k\) or more \(a\)'s or \(b\)'s. For no runs of more than a single \(a\) or \(b\) the regular expression is \((\lambda+a)(ba)^*(\lambda+b)\) and the generating function is

\[N(z)=\frac{(1+z)^2}{1-z^2}=\frac{1+z}{1-z}\]

If you expand this you find that there are only two words for any length \(n\) which you would expect since a word must start with either \(a\) or \(b\) and then alternate.

For no runs of more than two \(a\)'s or \(b\)'s the regular expression is \((\lambda+a+a^2)((b+b^2)(a+a^2))^*(\lambda+b+b^2)\) and the generating function is

\[N(z)=\frac{(1+z+z^2)^2}{1-(z+z^2)^2}=\frac{1+z+z^2}{1-z-z^2}\]

See if you can show that for general \(k\), the generating function is:

\[N(z)=\frac{(1+z+z^2+\cdots+z^k)^2}{1-(z+z^2+\cdots+z^k)^2}=\frac{1-z^{k+1}}{1-2z+z^{k+1}}\]

These examples only scratch the surface of what you can do with regular expressions in combinatorics.