Pumping lemma for context-free languages

Date

April 20, 2026

Theorem 1 The language \{a^nb^nc^n \mid n\in \mathbb N\} is not a context-free language.

Proof (click to expand)

We prove this by contradiction. Suppose L=\{a^nb^nc^n \mid n\in \mathbb N\} was context-free and let G a context-free grammar generating it. Since every context-free grammar can be put in Chomsky Normal Form we can assume w.l.o.g. that G is in Chomsky Normal Form. The crucial point of having the grammar in Chomsky Normal Form is that the parse trees are binary trees. Hence, in particular, to generate a word w the parse tree must have depth at least \log_2|w|.

The idea of the argument is similar to the one for the regular languages: start with a very long word in L, which the grammar hence must generate, and show that we can construct another word not in L which G also generates.

Let N be the number of non-terminals of G. Take the word z=a^{2^N}b^{2^N}c^{2^N}. The parse tree for z must have depth at least \log_2|a^{2^N}b^{2^N}c^{2^N}|>N+1. So there is a root-to-leaf path \pi in the tree mentioning at least N+1 non-terminals. By the pigeonhole principle, there must be a non-terminal which appears in \pi at least twice. There might be more than one non-terminal that repeats multiple times or one that repeats more than twice. Choose as repeated terminal the first occurrence of a repetition in \pi scanning it from the leaf. Let A be this non-terminal. The bottom-most occurrence of A and occurrence of A immediately before that in \pi give 3 trees inside the parsing tree for z:

the one rooted at the starting symbol (T_0);
the one rooted at the bottom-most occurrence of A (T_2);
the one rooted at the occurrence of A immediately before that (T_1).

This partitions z naturally into 5 parts, z=uvwxy:

u and y are the parts of z belonging to T_0 but not T_1 (u at the left and y at the right);
v and x are the parts of z belonging to T_1 but not T_2 (v at the left and x at the right);
w is the part of z that belongs to T_2.

See Figure 1 below for a diagram representing T_0,T_1,T_2 and uvwxy.

Figure 1: The decomposition induced by the two repeated variables A.

Now, we can take the tree T_1 and place it at the root of T_2 over and over. This gives valid productions in G for all the words of the form uv^iwx^iy, for each i\in \mathbb N.

See Figure 2 and Figure 3 for a diagram representing the parsing tree resulting from i=0 and i=2.

Figure 2: The diagram corresponding to pumping with i=0.

Figure 3: The diagram corresponding to pumping with i=2.

What do we know about u,v,w,x,y? By the choice of A, the depth of T_1 is at most N and hence |vwx|\leq 2^N. Moreover, since the parsing tree is binary, T_1\neq T_2, which means |vx|\geq 1.

We can take i=0, that is the word uwy and show it cannot belong to L. Since |vwx|\leq 2^N this word cannot mention as, bs, and cs at the same time. No matter what vwx is there is always at least one symbol s\in \{a,b,c\} not mentioned in vwx and hence modifying vwx into w will not change the s^{2^N} part of z. But |vx|\geq 1, so going from vwx to w we removed at least one symbol different from s. Hence it is impossible that uwy has the form a^kb^kc^k for some k, that is uwy\notin L. Contradiction.

The argument used to prove Theorem 1 can be abstracted to the following statement which is known as the pumping lemma for context-free languages.

If we can show that
for every n\in \mathbb N there exists a word z\in L with |z|\geq n s.t. for every possible decomposition of z as z=uvwxy with |vwx|\leq n and |vx|\geq 1, there exists i\in \mathbb N s.t. uv^iwx^iy\notin L,
then L is not a context-free language.

As for the pumping lemma for regular languages, the use of the pumping lemma for context-free languages is essentially describing a winning strategy for the following game. Given a language we want to prove is not context-free:

the “adversary” picks n\in \mathbb N;
we choose a word z\in L with |z|\geq n (as a function of n);
the adversary chooses u,v,w,x,y s.t.
- z=uvwxy
- |vwx|\leq n
- |vx|\geq 1
we choose i\in \mathbb N (as a function of n,u,v,w,x,y)
if we prove that uv^iwx^iy\notin L we win (i.e. we proved L is not context-free).

Hence a “template” for a proof using the pumping lemma for context-free languages is as follow:

We prove that L is not context-free using the pumping lemma for context-free languages. Let n be the pumping-lemma contant. Consider the word z=..... We have that |z|\geq n and z\in L (this needs an argument, possibly trivial, but you need to check it). Given a decomposition of z as z=uvwxy, with |vwx|\leq n and |vx|\geq 1, consider i=..... We prove that uv^iwx^iy\notin L (this needs an argument, usually quite non-trivial).

As an example, see how the template is filled following the argument we saw for Theorem 1:

We prove that L is not context-free using the pumping lemma for context-free languages. Let n be the pumping-lemma contant. Consider the word z=a^nb^nc^n. It is immediate to check that |z|\geq n and z\in L. Given a decomposition of z as z=uvwxy, with |vwx|\leq n and |vx|\geq 1, consider i=0. We prove that uv^iwx^iy\notin L. Since |vwx|\leq n this word cannot mention as, bs, and cs at the same time. No matter what vwx is there is always at least one symbol s\in \{a,b,c\} not mentioned in vwx and hence modifying vwx into w will not change the s^n part of z. But |vx|\geq 1, so going from vwx to w we removed at least one symbol different from s. Hence it is impossible that uwy has the form a^kb^kc^k for some k, that is uwy\notin L.

Although the overall structure is similar, usually applications of the pumping-lemma for context-free languages require a more detailed case analysis than the pumping lemma for regular languages. This typicaly leads to quite “inelegant” proofs, and as in the case of regular languages often it is possible to prove the non-context-freenes of languages using closure properties and the known fact that some notable language is not context-free.

Corollary 1 The language \{w\in \{a,b,c\} \mid |w|_a=|w|_b=|w|_c\} is not context-free.

Proof (click to expand)

The set of context-free languages is closed under intersection with a regular language, that is if A is context-free and B is regular then A\cap B is also context-free. If L=\{w\in \{a,b,c\} \mid |w|_a=|w|_b=|w|_c\} was a context free language, then L\cap \mathscr{L}(a^*b^*c^*)=\{a^nb^nc^n\mid n\in \mathbb N\} would also be context-free. The language \mathscr{L}(a^*b^*c^*) is obviously regular (we wrote a regular expression for it). This contradicts Theorem 1, hence L cannot be a context-free language.

Another consequence of Theorem 1 is for instance that the set of context-free languages is not closed under intersection.

Corollary 2 The set of context-free languages is not closed under intersection.

Proof (click to expand)

The two languages \{a^nb^mc^m\mid n,m\in \mathbb N\} and \{a^mb^mc^n\mid n,m\in \mathbb N\} are context-free and their intersection is the language \{a^nb^nc^n\mid n\in \mathbb N\}. By Theorem 1 this is not context-free, so the set of context-free languages cannot be closed under intersection.