Depuration of context-free grammars

Date

March 24, 2026

Given a context-free grammar G=(V,T,\mathcal P,S), the term depuration refers to the process of constructing an equivalent grammar G' more structured and without clearly redundant variables.

A variable of G is useful if it is involved in a production of a string of only terminal symbols, that is a variable is useful if it is reachable from S and it is productive.

We can define the reachable variables recursively follows: S is reachable from S (clearly), and if A is reachable and we have a rule A\to \alpha with \alpha\in (V\cup T)^* then all the variables in \alpha are also reachable. Once we identified the reachable variables R\subseteq T, we can then remove all the the productions B\to \beta with B\notin R.

We can define productive variables as the ones that eventually are able to generate a terminal string. The definition is recursive similar to the one before: if A\to \alpha and \alpha\in T^*, then A is productive; if A\to\alpha with \alpha\in (V\cup T)^* and every variable in \alpha is productive, then A is productive. Once we identified the productive variables R\subseteq T, we can then remove all the the productions B\to \beta with B\notin R or with some variable in \beta not in R.

Warning

The order in which we perform the operations matters. Consider the grammar \begin{align*} S&\to AB \mid a\\ B&\to b \end{align*} if we remove first the unreachable and then the unproductive variables you still get a grammar with useless symbols. You need first to remove the unproductive variables, and then the unreachable ones.

A production of the form A\to \lambda is a \lambda-production.

A variable A is nullable if it can derive the empty word \lambda. We can define recursively nullable variables: if A\to \lambda is a production, then A is nullable; if A\to\alpha and \alpha\in V^* with all the variables in \alpha nullable then A is also nullable.

Once we identified the nullable variables, we can then remove all productions of the form A\to \lambda (with A\neq S), and for every production A\to \alpha B\beta with B nullable, add the production A\to \alpha\beta, and keep symplifying the new set of productions. Unless A=S, never write productions A\to\lambda that might occur.

For example removing the \lambda-productions from the grammar \begin{align*} S&\to ABC \mid \lambda\\ A&\to \lambda\\ B&\to\lambda\\ C&\to\lambda \end{align*} gives the grammar S\to ABC \mid AB \mid AC \mid BC \mid A \mid B \mid C \mid \lambda.

A production of the form A\to B is called unit production. The variable A is clearly redundant, its only job is to get expanded into B. Productions of the form A\to a where a is a terminal are not unit productions. The technique to remove all unit productions that simply expands them until they disappear often works but not always. Consider for instance the productions A\to B,\ B\to C,\ C\to A. What works is the following:

  • find all unit pairs, i.e. pairs of variables (A,B) where B can be derived from A. For example (A,B) are a unit pair in the grammar A\to BC,\ C\to \lambda. As for the other cases, there is an inductive construction of unit pairs: for each variable A, (A,A) is a unit pair; and if (A,B) is a unit pair and B\to C is a production, the (A,C) is also a unit pair.
  • for each unit pair (A,B) add all the productions A\to \alpha to G', where B\to \alpha is a nonunit production in \mathcal P. For the unit pair (A,A) we add as productions in G' all nonunit productions of A.

For example, removing the unit rules from the grammar \begin{align*} S&\to X_1\\ X_1&\to X_2\\ \vdots &\\ X_{n-1} &\to X_n\\ X_n&\to w \end{align*} with w terminal word, gives the grammar

\begin{align*} S&\to w\\ X_1&\to w\\ \vdots &\\ X_{n-1} &\to w\\ X_n&\to w \end{align*}

Theorem 1 There is an order in which to apply the removal of useless variables, \lambda-productions, and unit productions to obtain an equivalent grammar. What is this order?