구문 분석

어휘 분석을 통해 소스 코드가 토큰(token)들의 스트림으로 변환되었다면, 컴파일러의 다음 주요 단계는 구문 분석(Syntax Analysis) 입니다. 구문 분석기는 이 토큰 스트림을 입력으로 받아 프로그램의 문법적 구조를 파악하고, 이를 보통 파스 트리(Parse Tree) 또는 구문 트리(Syntax Tree) 형태로 표현합니다. 이 과정을 파싱(Parsing) 이라고도 합니다.

어휘 분석에서 토큰의 패턴을 정의하기 위해 정규 표현식(Regular Expression)을 사용했던 것처럼, 구문 분석에서는 프로그램의 구조, 즉 문장들이나 구문 요소들이 어떻게 결합되어 올바른 프로그램을 형성하는지를 정의하기 위해 문맥 자유 문법(Context-Free Grammar, CFG)을 사용합니다.

문맥 자유 문법 (Context-Free Grammar, CFG)

정규 표현식과 유한 오토마타는 어휘 단위(토큰)를 인식하는 데는 강력하지만, 중첩(nesting) 구조나 재귀적인 구조를 표현하는 데는 한계가 있습니다. 예를 들어, 괄호의 짝 맞추기, begin-end 블록 구조, 또는 재귀적인 수식 구조 등은 정규 언어만으로는 완벽하게 표현하기 어렵습니다. 이러한 2차원적인, 계층적인 구조를 표현하기 위해서는 정규 표현식보다 표현력이 더 강한 문맥 자유 문법이 필요합니다.

회문(Palindrome) 은 문맥 자유 문법으로 표현할 수 있지만 정규 언어로는 표현할 수 없는 대표적인 예입니다. 회문은 앞에서부터 읽으나 뒤에서부터 읽으나 동일한 문자열을 말합니다 (예: "madam", "racecar", 또는 알파벳 [math]\displaystyle{ \{0, 1\} }[/math] 상에서는 "0110", "101"). 회문을 정의하는 문맥 자유 문법은 다음과 같이 재귀적으로 정의될 수 있습니다:

기본 케이스 (Bases):
- [math]\displaystyle{ P \rightarrow \epsilon }[/math] (빈 문자열은 회문이다)
- [math]\displaystyle{ P \rightarrow 0 }[/math] ('0' 자체는 회문이다)
- [math]\displaystyle{ P \rightarrow 1 }[/math] ('1' 자체는 회문이다)
귀납적 단계 (Induction): 만약 문자열 [math]\displaystyle{ w }[/math]가 회문이라면, 다음도 회문입니다:
- [math]\displaystyle{ P \rightarrow 0P0 }[/math] ([math]\displaystyle{ w }[/math]의 양쪽에 '0'을 붙인 [math]\displaystyle{ 0w0 }[/math]도 회문이다)
- [math]\displaystyle{ P \rightarrow 1P1 }[/math] ([math]\displaystyle{ w }[/math]의 양쪽에 '1'을 붙인 [math]\displaystyle{ 1w1 }[/math]도 회문이다)

모든 문맥 자유 문법은 이처럼 재귀적인 정의를 통해 언어를 생성합니다.

문맥 자유 문법의 형식적 정의

문맥 자유 문법 [math]\displaystyle{ G }[/math]는 다음 네 가지 구성요소의 튜플 [math]\displaystyle{ (V, T, S, P) }[/math]로 정의됩니다:

[math]\displaystyle{ V }[/math]: 논터미널(Non-terminal) 기호들의 유한한 집합입니다. 논터미널은 문법 규칙을 통해 더 작은 단위로 확장될 수 있는 구문 변수(syntactic variable)를 나타냅니다 (예: Expression, Statement, NounPhrase).
[math]\displaystyle{ T }[/math]: 터미널(Terminal) 기호들의 유한한 집합입니다. 터미널은 문법의 가장 기본적인 단위로, 더 이상 확장되지 않는 기호들입니다. 어휘 분석기가 생성하는 토큰들이 여기에 해당합니다 (예: id, number, +, if). [math]\displaystyle{ V }[/math]와 [math]\displaystyle{ T }[/math]는 서로소 집합입니다 ([math]\displaystyle{ V \cap T = \emptyset }[/math]).
[math]\displaystyle{ S }[/math]: [math]\displaystyle{ V }[/math]에 속하는 특별한 논터미널 기호로, 시작 심볼(Start Symbol) 이라고 합니다. 문법은 이 시작 심볼로부터 유도를 시작합니다.
[math]\displaystyle{ P }[/math]: 생성 규칙(Production rule) 들의 유한한 집합입니다. 각 생성 규칙은 [math]\displaystyle{ A \rightarrow \alpha }[/math] 형태를 가집니다.
- [math]\displaystyle{ A \in V }[/math]는 논터미널입니다.
- [math]\displaystyle{ \alpha \in (V \cup T)^* }[/math]는 터미널과 논터미널 기호들의 문자열입니다 ([math]\displaystyle{ \alpha }[/math]는 빈 문자열 [math]\displaystyle{ \epsilon }[/math]일 수도 있습니다).
- 이 규칙은 "논터미널 [math]\displaystyle{ A }[/math]는 문자열 [math]\displaystyle{ \alpha }[/math]로 대체될(파생될) 수 있다"라고 읽습니다.

유도 (Derivation)

문맥 자유 문법 [math]\displaystyle{ G = (V, T, S, P) }[/math]가 주어졌을 때, 문자열이 이 문법에 의해 생성될 수 있는지를 확인하는 과정을 유도(Derivation) 라고 합니다. 유도는 생성 규칙을 반복적으로 적용하여 시작 심볼로부터 특정 문자열을 만들어내는 과정입니다.

직접 유도 (Direct Derivation, [math]\displaystyle{ \Rightarrow }[/math]): 만약 [math]\displaystyle{ \alpha A \beta }[/math]가 터미널과 논터미널로 이루어진 문자열이고 ([math]\displaystyle{ A \in V }[/math], [math]\displaystyle{ \alpha, \beta \in (V \cup T)^* }[/math]), [math]\displaystyle{ A \rightarrow \gamma }[/math]가 [math]\displaystyle{ G }[/math]의 생성 규칙 중 하나라면, [math]\displaystyle{ \alpha A \beta }[/math]는 [math]\displaystyle{ \alpha \gamma \beta }[/math]를 직접 유도한다고 하며, [math]\displaystyle{ \alpha A \beta \Rightarrow \alpha \gamma \beta }[/math]로 표기합니다. "문맥 자유"라는 이름은 논터미널 [math]\displaystyle{ A }[/math]를 [math]\displaystyle{ \gamma }[/math]로 대체할 때 [math]\displaystyle{ A }[/math]의 주변 문맥(context)인 [math]\displaystyle{ \alpha }[/math]와 [math]\displaystyle{ \beta }[/math]에 영향을 받지 않는다는 의미에서 유래합니다.

유도 클로저 ([math]\displaystyle{ \Rightarrow^* }[/math]): [math]\displaystyle{ \Rightarrow^* }[/math]는 0번 이상의 직접 유도 과정을 나타내는 관계입니다.
- 기본 케이스: 임의의 문자열 [math]\displaystyle{ \alpha \in (V \cup T)^* }[/math]에 대해, [math]\displaystyle{ \alpha \Rightarrow^* \alpha }[/math] 입니다 (0번 유도).
- 귀납적 케이스: 만약 [math]\displaystyle{ \alpha \Rightarrow^* \beta }[/math]이고 [math]\displaystyle{ \beta \Rightarrow \gamma }[/math]이면, [math]\displaystyle{ \alpha \Rightarrow^* \gamma }[/math] 입니다.

어떤 문자열 [math]\displaystyle{ w }[/math]가 문법 [math]\displaystyle{ G }[/math]에 포함되는지 확인하려면, 시작 심볼 [math]\displaystyle{ S }[/math]로부터 [math]\displaystyle{ S \Rightarrow^* w }[/math]가 가능한지 보면 됩니다. 파싱(Parsing)은 이러한 유도 과정을 역으로 수행하여, 주어진 터미널 문자열(토큰 스트림)이 시작 심볼로부터 어떻게 유도되었는지를 찾아내는 과정입니다.

문장 형식 (Sentential Forms): 문법 [math]\displaystyle{ G=(V, T, S, P) }[/math]에서 시작 심볼 [math]\displaystyle{ S }[/math]로부터 유도될 수 있는 모든 문자열 [math]\displaystyle{ \alpha \in (V \cup T)^* }[/math] (즉, [math]\displaystyle{ S \Rightarrow^* \alpha }[/math]를 만족하는 [math]\displaystyle{ \alpha }[/math])를 문장 형식이라고 합니다. 문장 형식은 터미널과 논터미널을 모두 포함할 수 있습니다.

문장 (Sentence): 문장 형식 중에서 논터미널을 전혀 포함하지 않는, 즉 터미널 기호로만 이루어진 문자열을 해당 문법의 문장이라고 합니다.

문법의 언어 (Language of a Grammar): 문법 [math]\displaystyle{ G }[/math]가 생성하는 언어 [math]\displaystyle{ L(G) }[/math]는 [math]\displaystyle{ G }[/math]의 모든 문장들의 집합입니다. 즉, [math]\displaystyle{ L(G) = \{ w \in T^* \mid S \Rightarrow^* w \} }[/math] 입니다.