구문 분석: Difference between revisions

Revision as of 11:36, 28 May 2025

어휘 분석을 통해 소스 코드가 토큰(token)들의 스트림으로 변환되었다면, 컴파일러의 다음 주요 단계는 구문 분석(Syntax Analysis) 입니다. 구문 분석기는 이 토큰 스트림을 입력으로 받아 프로그램의 문법적 구조를 파악하고, 이를 보통 파스 트리(Parse Tree) 또는 구문 트리(Syntax Tree) 형태로 표현합니다. 이 과정을 파싱(Parsing) 이라고도 합니다.

어휘 분석에서 토큰의 패턴을 정의하기 위해 정규 표현식(Regular Expression)을 사용했던 것처럼, 구문 분석에서는 프로그램의 구조, 즉 문장들이나 구문 요소들이 어떻게 결합되어 올바른 프로그램을 형성하는지를 정의하기 위해 문맥 자유 문법(Context-Free Grammar, CFG)을 사용합니다.

문맥 자유 문법 (Context-Free Grammar, CFG)

정규 표현식과 유한 오토마타는 어휘 단위(토큰)를 인식하는 데는 강력하지만, 중첩(nesting) 구조나 재귀적인 구조를 표현하는 데는 한계가 있습니다. 예를 들어, 괄호의 짝 맞추기, begin-end 블록 구조, 또는 재귀적인 수식 구조 등은 정규 언어만으로는 완벽하게 표현하기 어렵습니다. 이러한 2차원적인, 계층적인 구조를 표현하기 위해서는 정규 표현식보다 표현력이 더 강한 문맥 자유 문법이 필요합니다.

회문(Palindrome) 은 문맥 자유 문법으로 표현할 수 있지만 정규 언어로는 표현할 수 없는 대표적인 예입니다. 회문은 앞에서부터 읽으나 뒤에서부터 읽으나 동일한 문자열을 말합니다 (예: "madam", "racecar", 또는 알파벳 [math]\displaystyle{ \{0, 1\} }[/math] 상에서는 "0110", "101"). 회문을 정의하는 문맥 자유 문법은 다음과 같이 재귀적으로 정의될 수 있습니다:

기본 케이스 (Bases):
- [math]\displaystyle{ P \rightarrow \epsilon }[/math] (빈 문자열은 회문이다)
- [math]\displaystyle{ P \rightarrow 0 }[/math] ('0' 자체는 회문이다)
- [math]\displaystyle{ P \rightarrow 1 }[/math] ('1' 자체는 회문이다)
귀납적 단계 (Induction): 만약 문자열 [math]\displaystyle{ w }[/math]가 회문이라면, 다음도 회문입니다:
- [math]\displaystyle{ P \rightarrow 0P0 }[/math] ([math]\displaystyle{ w }[/math]의 양쪽에 '0'을 붙인 [math]\displaystyle{ 0w0 }[/math]도 회문이다)
- [math]\displaystyle{ P \rightarrow 1P1 }[/math] ([math]\displaystyle{ w }[/math]의 양쪽에 '1'을 붙인 [math]\displaystyle{ 1w1 }[/math]도 회문이다)

모든 문맥 자유 문법은 이처럼 재귀적인 정의를 통해 언어를 생성합니다.

문맥 자유 문법의 형식적 정의

문맥 자유 문법 [math]\displaystyle{ G }[/math]는 다음 네 가지 구성요소의 튜플 [math]\displaystyle{ (V, T, S, P) }[/math]로 정의됩니다:

[math]\displaystyle{ V }[/math]: 논터미널(Non-terminal) 기호들의 유한한 집합입니다. 논터미널은 문법 규칙을 통해 더 작은 단위로 확장될 수 있는 구문 변수(syntactic variable)를 나타냅니다 (예: Expression, Statement, NounPhrase).
[math]\displaystyle{ T }[/math]: 터미널(Terminal) 기호들의 유한한 집합입니다. 터미널은 문법의 가장 기본적인 단위로, 더 이상 확장되지 않는 기호들입니다. 어휘 분석기가 생성하는 토큰들이 여기에 해당합니다 (예: id, number, +, if). [math]\displaystyle{ V }[/math]와 [math]\displaystyle{ T }[/math]는 서로소 집합입니다 ([math]\displaystyle{ V \cap T = \emptyset }[/math]).
[math]\displaystyle{ S }[/math]: [math]\displaystyle{ V }[/math]에 속하는 특별한 논터미널 기호로, 시작 심볼(Start Symbol) 이라고 합니다. 문법은 이 시작 심볼로부터 유도를 시작합니다.
[math]\displaystyle{ P }[/math]: 생성 규칙(Production rule) 들의 유한한 집합입니다. 각 생성 규칙은 [math]\displaystyle{ A \rightarrow \alpha }[/math] 형태를 가집니다.
- [math]\displaystyle{ A \in V }[/math]는 논터미널입니다.
- [math]\displaystyle{ \alpha \in (V \cup T)^* }[/math]는 터미널과 논터미널 기호들의 문자열입니다 ([math]\displaystyle{ \alpha }[/math]는 빈 문자열 [math]\displaystyle{ \epsilon }[/math]일 수도 있습니다).
- 이 규칙은 "논터미널 [math]\displaystyle{ A }[/math]는 문자열 [math]\displaystyle{ \alpha }[/math]로 대체될(파생될) 수 있다"라고 읽습니다.

유도 (Derivation)

문맥 자유 문법 [math]\displaystyle{ G = (V, T, S, P) }[/math]가 주어졌을 때, 문자열이 이 문법에 의해 생성될 수 있는지를 확인하는 과정을 유도(Derivation) 라고 합니다. 유도는 생성 규칙을 반복적으로 적용하여 시작 심볼로부터 특정 문자열을 만들어내는 과정입니다.

직접 유도 (Direct Derivation, [math]\displaystyle{ \Rightarrow }[/math]): 만약 [math]\displaystyle{ \alpha A \beta }[/math]가 터미널과 논터미널로 이루어진 문자열이고 ([math]\displaystyle{ A \in V }[/math], [math]\displaystyle{ \alpha, \beta \in (V \cup T)^* }[/math]), [math]\displaystyle{ A \rightarrow \gamma }[/math]가 [math]\displaystyle{ G }[/math]의 생성 규칙 중 하나라면, [math]\displaystyle{ \alpha A \beta }[/math]는 [math]\displaystyle{ \alpha \gamma \beta }[/math]를 직접 유도한다고 하며, [math]\displaystyle{ \alpha A \beta \Rightarrow \alpha \gamma \beta }[/math]로 표기합니다. "문맥 자유"라는 이름은 논터미널 [math]\displaystyle{ A }[/math]를 [math]\displaystyle{ \gamma }[/math]로 대체할 때 [math]\displaystyle{ A }[/math]의 주변 문맥(context)인 [math]\displaystyle{ \alpha }[/math]와 [math]\displaystyle{ \beta }[/math]에 영향을 받지 않는다는 의미에서 유래합니다.

유도 클로저 ([math]\displaystyle{ \Rightarrow^* }[/math]): [math]\displaystyle{ \Rightarrow^* }[/math]는 0번 이상의 직접 유도 과정을 나타내는 관계입니다.
- 기본 케이스: 임의의 문자열 [math]\displaystyle{ \alpha \in (V \cup T)^* }[/math]에 대해, [math]\displaystyle{ \alpha \Rightarrow^* \alpha }[/math] 입니다 (0번 유도).
- 귀납적 케이스: 만약 [math]\displaystyle{ \alpha \Rightarrow^* \beta }[/math]이고 [math]\displaystyle{ \beta \Rightarrow \gamma }[/math]이면, [math]\displaystyle{ \alpha \Rightarrow^* \gamma }[/math] 입니다.

어떤 문자열 [math]\displaystyle{ w }[/math]가 문법 [math]\displaystyle{ G }[/math]에 포함되는지 확인하려면, 시작 심볼 [math]\displaystyle{ S }[/math]로부터 [math]\displaystyle{ S \Rightarrow^* w }[/math]가 가능한지 보면 됩니다. 파싱(Parsing)은 이러한 유도 과정을 역으로 수행하여, 주어진 터미널 문자열(토큰 스트림)이 시작 심볼로부터 어떻게 유도되었는지를 찾아내는 과정입니다.

문장 형식 (Sentential Forms): 문법 [math]\displaystyle{ G=(V, T, S, P) }[/math]에서 시작 심볼 [math]\displaystyle{ S }[/math]로부터 유도될 수 있는 모든 문자열 [math]\displaystyle{ \alpha \in (V \cup T)^* }[/math] (즉, [math]\displaystyle{ S \Rightarrow^* \alpha }[/math]를 만족하는 [math]\displaystyle{ \alpha }[/math])를 문장 형식이라고 합니다. 문장 형식은 터미널과 논터미널을 모두 포함할 수 있습니다.

문장 (Sentence): 문장 형식 중에서 논터미널을 전혀 포함하지 않는, 즉 터미널 기호로만 이루어진 문자열을 해당 문법의 문장이라고 합니다.

문법의 언어 (Language of a Grammar): 문법 [math]\displaystyle{ G }[/math]가 생성하는 언어 [math]\displaystyle{ L(G) }[/math]는 [math]\displaystyle{ G }[/math]의 모든 문장들의 집합입니다. 즉, [math]\displaystyle{ L(G) = \{ w \in T^* \mid S \Rightarrow^* w \} }[/math] 입니다.

@@ Line 2: / Line 2: @@
 어휘 분석에서 토큰의 패턴을 정의하기 위해 정규 표현식(Regular Expression)을 사용했던 것처럼, 구문 분석에서는 프로그램의 구조, 즉 문장들이나 구문 요소들이 어떻게 결합되어 올바른 프로그램을 형성하는지를 정의하기 위해 '''문맥 자유 문법(Context-Free Grammar, CFG)'''을 사용합니다.
+== 문맥 자유 문법 (Context-Free Grammar, CFG) ==
+정규 표현식과 유한 오토마타는 어휘 단위(토큰)를 인식하는 데는 강력하지만, 중첩(nesting) 구조나 재귀적인 구조를 표현하는 데는 한계가 있습니다. 예를 들어, 괄호의 짝 맞추기, <code>begin-end</code> 블록 구조, 또는 재귀적인 수식 구조 등은 정규 언어만으로는 완벽하게 표현하기 어렵습니다. 이러한 2차원적인, 계층적인 구조를 표현하기 위해서는 정규 표현식보다 표현력이 더 강한 문맥 자유 문법이 필요합니다.
+'''회문(Palindrome)''' 은 문맥 자유 문법으로 표현할 수 있지만 정규 언어로는 표현할 수 없는 대표적인 예입니다. 회문은 앞에서부터 읽으나 뒤에서부터 읽으나 동일한 문자열을 말합니다 (예: "madam", "racecar", 또는 알파벳 <math>\{0, 1\}</math> 상에서는 "0110", "101"). 회문을 정의하는 문맥 자유 문법은 다음과 같이 재귀적으로 정의될 수 있습니다:
+* '''기본 케이스 (Bases):'''
+** <math>P \rightarrow \epsilon</math> (빈 문자열은 회문이다)
+** <math>P \rightarrow 0</math> ('0' 자체는 회문이다)
+** <math>P \rightarrow 1</math> ('1' 자체는 회문이다)
+* '''귀납적 단계 (Induction):''' 만약 문자열 <math>w</math>가 회문이라면, 다음도 회문입니다:
+** <math>P \rightarrow 0P0</math> (<math>w</math>의 양쪽에 '0'을 붙인 <math>0w0</math>도 회문이다)
+** <math>P \rightarrow 1P1</math> (<math>w</math>의 양쪽에 '1'을 붙인 <math>1w1</math>도 회문이다)
+모든 문맥 자유 문법은 이처럼 재귀적인 정의를 통해 언어를 생성합니다.
+=== 문맥 자유 문법의 형식적 정의 ===
+문맥 자유 문법 <math>G</math>는 다음 네 가지 구성요소의 튜플 <math>(V, T, S, P)</math>로 정의됩니다:
+* <math>V</math>: '''논터미널(Non-terminal)''' 기호들의 유한한 집합입니다. 논터미널은 문법 규칙을 통해 더 작은 단위로 확장될 수 있는 구문 변수(syntactic variable)를 나타냅니다 (예: <code>Expression</code>, <code>Statement</code>, <code>NounPhrase</code>).
+* <math>T</math>: '''터미널(Terminal)''' 기호들의 유한한 집합입니다. 터미널은 문법의 가장 기본적인 단위로, 더 이상 확장되지 않는 기호들입니다. 어휘 분석기가 생성하는 토큰들이 여기에 해당합니다 (예: <code>id</code>, <code>number</code>, <code>+</code>, <code>if</code>). <math>V</math>와 <math>T</math>는 서로소 집합입니다 (<math>V \cap T = \emptyset</math>).
+* <math>S</math>: <math>V</math>에 속하는 특별한 논터미널 기호로, '''시작 심볼(Start Symbol)''' 이라고 합니다. 문법은 이 시작 심볼로부터 유도를 시작합니다.
+* <math>P</math>: '''생성 규칙(Production rule)''' 들의 유한한 집합입니다. 각 생성 규칙은 <math>A \rightarrow \alpha</math> 형태를 가집니다.
+** <math>A \in V</math>는 논터미널입니다.
+** <math>\alpha \in (V \cup T)^*</math>는 터미널과 논터미널 기호들의 문자열입니다 (<math>\alpha</math>는 빈 문자열 <math>\epsilon</math>일 수도 있습니다).
+** 이 규칙은 "논터미널 <math>A</math>는 문자열 <math>\alpha</math>로 대체될(파생될) 수 있다"라고 읽습니다.
+=== 유도 (Derivation) ===
+문맥 자유 문법 <math>G = (V, T, S, P)</math>가 주어졌을 때, 문자열이 이 문법에 의해 생성될 수 있는지를 확인하는 과정을 '''유도(Derivation)''' 라고 합니다. 유도는 생성 규칙을 반복적으로 적용하여 시작 심볼로부터 특정 문자열을 만들어내는 과정입니다.
+* '''직접 유도 (Direct Derivation, <math>\Rightarrow</math>):''' 만약 <math>\alpha A \beta</math>가 터미널과 논터미널로 이루어진 문자열이고 (<math>A \in V</math>, <math>\alpha, \beta \in (V \cup T)^*</math>), <math>A \rightarrow \gamma</math>가 <math>G</math>의 생성 규칙 중 하나라면, <math>\alpha A \beta</math>는 <math>\alpha \gamma \beta</math>를 '''직접 유도한다'''고 하며, <math>\alpha A \beta \Rightarrow \alpha \gamma \beta</math>로 표기합니다. "문맥 자유"라는 이름은 논터미널 <math>A</math>를 <math>\gamma</math>로 대체할 때 <math>A</math>의 주변 문맥(context)인 <math>\alpha</math>와 <math>\beta</math>에 영향을 받지 않는다는 의미에서 유래합니다.
+* '''유도 클로저 (<math>\Rightarrow^*</math>):''' <math>\Rightarrow^*</math>는 0번 이상의 직접 유도 과정을 나타내는 관계입니다.
+** '''기본 케이스:''' 임의의 문자열 <math>\alpha \in (V \cup T)^*</math>에 대해, <math>\alpha \Rightarrow^* \alpha</math> 입니다 (0번 유도).
+** '''귀납적 케이스:''' 만약 <math>\alpha \Rightarrow^* \beta</math>이고 <math>\beta \Rightarrow \gamma</math>이면, <math>\alpha \Rightarrow^* \gamma</math> 입니다.
+어떤 문자열 <math>w</math>가 문법 <math>G</math>에 포함되는지 확인하려면, 시작 심볼 <math>S</math>로부터 <math>S \Rightarrow^* w</math>가 가능한지 보면 됩니다.
+'''파싱(Parsing)'''은 이러한 유도 과정을 역으로 수행하여, 주어진 터미널 문자열(토큰 스트림)이 시작 심볼로부터 어떻게 유도되었는지를 찾아내는 과정입니다.
+* '''문장 형식 (Sentential Forms):''' 문법 <math>G=(V, T, S, P)</math>에서 시작 심볼 <math>S</math>로부터 유도될 수 있는 모든 문자열 <math>\alpha \in (V \cup T)^*</math> (즉, <math>S \Rightarrow^* \alpha</math>를 만족하는 <math>\alpha</math>)를 '''문장 형식'''이라고 합니다. 문장 형식은 터미널과 논터미널을 모두 포함할 수 있습니다.
+* '''문장 (Sentence):''' 문장 형식 중에서 논터미널을 전혀 포함하지 않는, 즉 터미널 기호로만 이루어진 문자열을 해당 문법의 '''문장'''이라고 합니다.
+* '''문법의 언어 (Language of a Grammar):''' 문법 <math>G</math>가 생성하는 언어 <math>L(G)</math>는 <math>G</math>의 모든 문장들의 집합입니다. 즉, <math>L(G) = \{ w \in T^* \mid S \Rightarrow^* w \}</math> 입니다.