구문 분석

어휘 분석을 통해 소스 코드가 토큰(token)들의 스트림으로 변환되었다면, 컴파일러의 다음 주요 단계는 구문 분석(Syntax Analysis) 입니다. 구문 분석기는 이 토큰 스트림을 입력으로 받아 프로그램의 문법적 구조를 파악하고, 이를 보통 파스 트리(Parse Tree) 또는 구문 트리(Syntax Tree) 형태로 표현합니다. 이 과정을 파싱(Parsing) 이라고도 합니다.

어휘 분석에서 토큰의 패턴을 정의하기 위해 정규 표현식(Regular Expression)을 사용했던 것처럼, 구문 분석에서는 프로그램의 구조, 즉 문장들이나 구문 요소들이 어떻게 결합되어 올바른 프로그램을 형성하는지를 정의하기 위해 문맥 자유 문법(Context-Free Grammar, CFG)을 사용합니다.

문맥 자유 문법 (Context-Free Grammar, CFG)

정규 표현식과 유한 오토마타는 어휘 단위(토큰)를 인식하는 데는 강력하지만, 중첩(nesting) 구조나 재귀적인 구조를 표현하는 데는 한계가 있습니다. 예를 들어, 괄호의 짝 맞추기, begin-end 블록 구조, 또는 재귀적인 수식 구조 등은 정규 언어만으로는 완벽하게 표현하기 어렵습니다. 이러한 2차원적인, 계층적인 구조를 표현하기 위해서는 정규 표현식보다 표현력이 더 강한 문맥 자유 문법이 필요합니다.

회문(Palindrome) 은 문맥 자유 문법으로 표현할 수 있지만 정규 언어로는 표현할 수 없는 대표적인 예입니다. 회문은 앞에서부터 읽으나 뒤에서부터 읽으나 동일한 문자열을 말합니다 (예: "madam", "racecar", 또는 알파벳 [math]\displaystyle{ \{0, 1\} }[/math] 상에서는 "0110", "101"). 회문을 정의하는 문맥 자유 문법은 다음과 같이 재귀적으로 정의될 수 있습니다:

기본 케이스 (Bases):
- [math]\displaystyle{ P \rightarrow \epsilon }[/math] (빈 문자열은 회문이다)
- [math]\displaystyle{ P \rightarrow 0 }[/math] ('0' 자체는 회문이다)
- [math]\displaystyle{ P \rightarrow 1 }[/math] ('1' 자체는 회문이다)
귀납적 단계 (Induction): 만약 문자열 [math]\displaystyle{ w }[/math]가 회문이라면, 다음도 회문입니다:
- [math]\displaystyle{ P \rightarrow 0P0 }[/math] ([math]\displaystyle{ w }[/math]의 양쪽에 '0'을 붙인 [math]\displaystyle{ 0w0 }[/math]도 회문이다)
- [math]\displaystyle{ P \rightarrow 1P1 }[/math] ([math]\displaystyle{ w }[/math]의 양쪽에 '1'을 붙인 [math]\displaystyle{ 1w1 }[/math]도 회문이다)

모든 문맥 자유 문법은 이처럼 재귀적인 정의를 통해 언어를 생성합니다.

문맥 자유 문법의 형식적 정의

문맥 자유 문법 [math]\displaystyle{ G }[/math]는 다음 네 가지 구성요소의 튜플 [math]\displaystyle{ (V, T, S, P) }[/math]로 정의됩니다:

[math]\displaystyle{ V }[/math]: 논터미널(Non-terminal) 기호들의 유한한 집합입니다. 논터미널은 문법 규칙을 통해 더 작은 단위로 확장될 수 있는 구문 변수(syntactic variable)를 나타냅니다 (예: Expression, Statement, NounPhrase).
[math]\displaystyle{ T }[/math]: 터미널(Terminal) 기호들의 유한한 집합입니다. 터미널은 문법의 가장 기본적인 단위로, 더 이상 확장되지 않는 기호들입니다. 어휘 분석기가 생성하는 토큰들이 여기에 해당합니다 (예: id, number, +, if). [math]\displaystyle{ V }[/math]와 [math]\displaystyle{ T }[/math]는 서로소 집합입니다 ([math]\displaystyle{ V \cap T = \emptyset }[/math]).
[math]\displaystyle{ S }[/math]: [math]\displaystyle{ V }[/math]에 속하는 특별한 논터미널 기호로, 시작 심볼(Start Symbol) 이라고 합니다. 문법은 이 시작 심볼로부터 유도를 시작합니다.
[math]\displaystyle{ P }[/math]: 생성 규칙(Production rule) 들의 유한한 집합입니다. 각 생성 규칙은 [math]\displaystyle{ A \rightarrow \alpha }[/math] 형태를 가집니다.
- [math]\displaystyle{ A \in V }[/math]는 논터미널입니다.
- [math]\displaystyle{ \alpha \in (V \cup T)^* }[/math]는 터미널과 논터미널 기호들의 문자열입니다 ([math]\displaystyle{ \alpha }[/math]는 빈 문자열 [math]\displaystyle{ \epsilon }[/math]일 수도 있습니다).
- 이 규칙은 "논터미널 [math]\displaystyle{ A }[/math]는 문자열 [math]\displaystyle{ \alpha }[/math]로 대체될(파생될) 수 있다"라고 읽습니다.

유도 (Derivation)

문맥 자유 문법 [math]\displaystyle{ G = (V, T, S, P) }[/math]가 주어졌을 때, 문자열이 이 문법에 의해 생성될 수 있는지를 확인하는 과정을 유도(Derivation) 라고 합니다. 유도는 생성 규칙을 반복적으로 적용하여 시작 심볼로부터 특정 문자열을 만들어내는 과정입니다.

직접 유도 (Direct Derivation, [math]\displaystyle{ \Rightarrow }[/math]): 만약 [math]\displaystyle{ \alpha A \beta }[/math]가 터미널과 논터미널로 이루어진 문자열이고 ([math]\displaystyle{ A \in V }[/math], [math]\displaystyle{ \alpha, \beta \in (V \cup T)^* }[/math]), [math]\displaystyle{ A \rightarrow \gamma }[/math]가 [math]\displaystyle{ G }[/math]의 생성 규칙 중 하나라면, [math]\displaystyle{ \alpha A \beta }[/math]는 [math]\displaystyle{ \alpha \gamma \beta }[/math]를 직접 유도한다고 하며, [math]\displaystyle{ \alpha A \beta \Rightarrow \alpha \gamma \beta }[/math]로 표기합니다. "문맥 자유"라는 이름은 논터미널 [math]\displaystyle{ A }[/math]를 [math]\displaystyle{ \gamma }[/math]로 대체할 때 [math]\displaystyle{ A }[/math]의 주변 문맥(context)인 [math]\displaystyle{ \alpha }[/math]와 [math]\displaystyle{ \beta }[/math]에 영향을 받지 않는다는 의미에서 유래합니다.

유도 클로저 ([math]\displaystyle{ \Rightarrow^* }[/math]): [math]\displaystyle{ \Rightarrow^* }[/math]는 0번 이상의 직접 유도 과정을 나타내는 관계입니다.
- 기본 케이스: 임의의 문자열 [math]\displaystyle{ \alpha \in (V \cup T)^* }[/math]에 대해, [math]\displaystyle{ \alpha \Rightarrow^* \alpha }[/math] 입니다 (0번 유도).
- 귀납적 케이스: 만약 [math]\displaystyle{ \alpha \Rightarrow^* \beta }[/math]이고 [math]\displaystyle{ \beta \Rightarrow \gamma }[/math]이면, [math]\displaystyle{ \alpha \Rightarrow^* \gamma }[/math] 입니다.

어떤 문자열 [math]\displaystyle{ w }[/math]가 문법 [math]\displaystyle{ G }[/math]에 포함되는지 확인하려면, 시작 심볼 [math]\displaystyle{ S }[/math]로부터 [math]\displaystyle{ S \Rightarrow^* w }[/math]가 가능한지 보면 됩니다. 파싱(Parsing)은 이러한 유도 과정을 역으로 수행하여, 주어진 터미널 문자열(토큰 스트림)이 시작 심볼로부터 어떻게 유도되었는지를 찾아내는 과정입니다.

문장 형식 (Sentential Forms): 문법 [math]\displaystyle{ G=(V, T, S, P) }[/math]에서 시작 심볼 [math]\displaystyle{ S }[/math]로부터 유도될 수 있는 모든 문자열 [math]\displaystyle{ \alpha \in (V \cup T)^* }[/math] (즉, [math]\displaystyle{ S \Rightarrow^* \alpha }[/math]를 만족하는 [math]\displaystyle{ \alpha }[/math])를 문장 형식이라고 합니다. 문장 형식은 터미널과 논터미널을 모두 포함할 수 있습니다.

문장 (Sentence): 문장 형식 중에서 논터미널을 전혀 포함하지 않는, 즉 터미널 기호로만 이루어진 문자열을 해당 문법의 문장이라고 합니다.

문법의 언어 (Language of a Grammar): 문법 [math]\displaystyle{ G }[/math]가 생성하는 언어 [math]\displaystyle{ L(G) }[/math]는 [math]\displaystyle{ G }[/math]의 모든 문장들의 집합입니다. 즉, [math]\displaystyle{ L(G) = \{ w \in T^* \mid S \Rightarrow^* w \} }[/math] 입니다.

유도 방식: 좌측 최단 유도와 우측 최단 유도

하나의 문장 형식에서 여러 개의 논터미널이 존재할 경우, 어떤 논터미널을 먼저 확장할지에 대한 선택이 필요합니다. 이를 위해 표준적인 유도 방식이 정의되어 있습니다:

좌측 최단 유도 (Leftmost Derivation, [math]\displaystyle{ \Rightarrow_l }[/math] 또는 [math]\displaystyle{ \Rightarrow_{lm} }[/math]):

각 유도 단계에서 항상 가장 왼쪽에 있는 논터미널 을 선택하여 생성 규칙을 적용합니다. 만약 [math]\displaystyle{ S \Rightarrow_l^* \alpha }[/math]이면, [math]\displaystyle{ \alpha }[/math]를 좌측 문장 형식(left sentential form) 이라고 합니다.

우측 최단 유도 (Rightmost Derivation, [math]\displaystyle{ \Rightarrow_r }[/math] 또는 [math]\displaystyle{ \Rightarrow_{rm} }[/math]):

각 유도 단계에서 항상 가장 오른쪽에 있는 논터미널 을 선택하여 생성 규칙을 적용합니다. 만약 [math]\displaystyle{ S \Rightarrow_r^* \alpha }[/math]이면, [math]\displaystyle{ \alpha }[/math]를 우측 문장 형식(right sentential form) 이라고 합니다. 우측 최단 유도는 특히 LR 파서와 같이 상향식(bottom-up) 파싱에서 유도 과정을 역순으로 구성하는 데 사용됩니다.

어떤 문법에 대해 주어진 문자열이 그 문법에 속한다면, 그 문자열에 대한 좌측 최단 유도와 우측 최단 유도가 각각 존재합니다 (때로는 유일하게 존재).

파스 트리 (Parse Tree)

파스 트리(Parse Tree) 또는 구체적인 구문 트리(Concrete Syntax Tree, CST)는 문맥 자유 문법으로부터 특정 문자열이 어떻게 유도되었는지를 시각적으로 보여주는 트리 형태의 표현입니다. 파스 트리는 유도 과정에서 논터미널이 생성 규칙에 의해 대체되는 순서의 변화는 무시하고, 어떤 규칙이 어떤 부분에 적용되었는지만을 나타냅니다. 즉, 동일한 규칙들을 사용하여 동일한 문자열을 유도했다면, 유도 순서(예: 좌측 최단 또는 우측 최단)에 관계없이 동일한 파스 트리가 생성됩니다.

파스 트리는 다음과 같은 속성을 가집니다:

루트 노드(Root Node): 문법의 시작 심볼 [math]\displaystyle{ S }[/math]로 레이블링됩니다.
내부 노드(Internal Node): 논터미널 기호 [math]\displaystyle{ A \in V }[/math]로 레이블링됩니다.
단말 노드(Leaf Node): 터미널 기호 [math]\displaystyle{ a \in T }[/math] 또는 빈 문자열 [math]\displaystyle{ \epsilon }[/math]으로 레이블링됩니다.
생성 규칙과의 관계: 만약 어떤 내부 노드가 논터미널 [math]\displaystyle{ A }[/math]로 레이블링되어 있고, 그 자식 노드들이 왼쪽에서 오른쪽으로 [math]\displaystyle{ X_1, X_2, \dots, X_k }[/math]로 레이블링되어 있다면, [math]\displaystyle{ A \rightarrow X_1 X_2 \dots X_k }[/math]는 반드시 문법 [math]\displaystyle{ P }[/math]에 포함된 생성 규칙이어야 합니다. (만약 [math]\displaystyle{ A \rightarrow \epsilon }[/math] 규칙이 적용되었다면, [math]\displaystyle{ A }[/math] 노드는 [math]\displaystyle{ \epsilon }[/math]으로 레이블링된 단일 자식 노드를 가집니다.)

파스 트리의 단말 노드들을 왼쪽에서 오른쪽으로 순서대로 읽으면 원래 유도된 문자열(문장)을 얻을 수 있습니다.

예시: 문법 [math]\displaystyle{ E \rightarrow E + E \mid E \star E \mid (E) \mid \text{id} }[/math] 에 대해 문자열 id + id를 유도하는 한 가지 방법은 [math]\displaystyle{ E \Rightarrow E+E \Rightarrow \text{id}+E \Rightarrow \text{id}+\text{id} }[/math] 입니다. 또 다른 유도 순서로 [math]\displaystyle{ E \Rightarrow E+E \Rightarrow E+\text{id} \Rightarrow \text{id}+\text{id} }[/math]가 있을 수 있습니다. 하지만 이 두 유도 과정은 동일한 파스 트리를 생성합니다.

만약 문법이 [math]\displaystyle{ E \rightarrow -E \mid (E) \mid E+E \mid E \star E \mid \text{id} }[/math] 이고 문자열이 -(id+id) 라면, 그 파스 트리는 다음과 같은 구조를 가질 수 있습니다:

루트 E는 자식으로 -와 다른 E (E1)를 가집니다.
E1은 자식으로 (, 다른 E (E2), )를 가집니다.
E2는 자식으로 또 다른 E (E3), +, 그리고 E (E4)를 가집니다.
E3과 E4는 각각 자식으로 id를 가집니다.