Compiler Principles Lab 1: Building a C-Minus-f Lexer with FLEX

HNU Compiler Principles lab 1: complete the lexical_analyzer.l file according to the cminus-f lexical rules to produce a working lexer.

This post contains no images; the results are shown as copied terminal output instead.

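Lab requirements:

Complete lexical_analyzer.l according to the cminus-f lexical rules so that the lexer outputs, for each recognized token, its text, its type, line (the line on which it appears), pos_start (where it starts on that line) and pos_end (where it ends, exclusive).
For example:
Given the input text: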

 int a;

the recognition result should be:

int     280     1       2       5
a       285     1       6       7
;       270     1       7       8

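cminus-f lexical rules

C MINUS is a subset of the C language; its grammar is described in detail in the Chapter 9 appendix of "Compiler Construction: Principles and Practice". cminus-f extends C MINUS with floating-point operations.

Keywords: else if int return void while float
Special symbols: + - * / < <= > >= == != = ; , ( ) [ ] { } /* */
Identifiers (ID) and integers (NUM) are defined by the following regular expressions: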

letter = a|...|z|A|...|Z
digit = 0|...|9
ID = letter+
INTEGER = digit+
FLOAT = (digit+. | digit*.digit+)
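As a cross-check on the FLOAT definition above, here is a minimal hand-written C sketch that accepts exactly the strings described by digit+. | digit*.digit+, reading the . as a literal dot. The function name is_cminus_float is made up for illustration and is not part of the lab code.

```c
#include <ctype.h>
#include <stddef.h>

// Hypothetical helper: returns 1 if s is a complete FLOAT lexeme
// (digit+. or digit*.digit+), 0 otherwise.
int is_cminus_float(const char *s) {
    size_t i = 0;
    size_t before = 0;  // digits before the dot
    size_t after = 0;   // digits after the dot

    while (isdigit((unsigned char)s[i])) { i++; before++; }
    if (s[i] != '.') return 0;  // a FLOAT must contain a dot
    i++;
    while (isdigit((unsigned char)s[i])) { i++; after++; }

    // The whole string must be consumed, and a lone "." is not a FLOAT.
    return s[i] == '\0' && (before + after) > 0;
}
```

Under this sketch, "1.", ".5" and "1.5" are accepted, while "." and "15" are rejected.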

Comments are written as /*...*/ and may span more than one line. Comments cannot be nested.

/*...*/

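Note: [, ], and [] are three different tokens. [] is used to declare an array type, and no whitespace is allowed between the two brackets.
- a[] should be recognized as two tokens: a, []
- a[1] should be recognized as four tokens: a, [, 1, ]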
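The two bracket examples above come down to longest match: when [ is immediately followed by ], the two-character token wins. Here is a small C sketch of that idea, assuming a simplified scanner that only knows letters, digits, brackets and whitespace. The name count_tokens is hypothetical, and unlike the real lexer it simply skips whitespace instead of returning a BLANK token.

```c
#include <ctype.h>
#include <stddef.h>

// Hypothetical helper: count the tokens in s, preferring "[]" over "[" alone.
int count_tokens(const char *s) {
    int count = 0;
    size_t i = 0;
    while (s[i] != '\0') {
        unsigned char c = (unsigned char)s[i];
        if (isspace(c)) {          // whitespace separates tokens and is not counted
            i++;
            continue;
        }
        if (isalpha(c)) {          // ID = letter+
            while (isalpha((unsigned char)s[i])) i++;
        } else if (isdigit(c)) {   // INTEGER = digit+
            while (isdigit((unsigned char)s[i])) i++;
        } else if (c == '[' && s[i + 1] == ']') {
            i += 2;                // "[]" with nothing in between is one token
        } else {
            i++;                   // any other character is a one-character token
        }
        count++;
    }
    return count;
}
```

With this sketch, a[] yields 2 tokens and a[1] yields 4; a[ ] (with a space inside) yields 3, because the brackets are no longer adjacent.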

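Tricky parts of the lab

Git operations: the commonly used commands are listed in the lab handout.

To clone the lab repository, open your local working directory and run:
`git clone https://gitee.com/<your-gitee-username>/cminus_compiler-2022-fall.git`
To commit your work, run the following from the working directory:
git add *
git commit -m "commit message"
Then push to the repository:
git push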

Environment setup

sudo apt-get install llvm bison flex
Then check the installation by running flex --version and bison --version.

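Basic use of FLEX

First, FLEX reads the specification of a lexical scanner from an input file (*.lex) or from stdin and generates a C source file, lex.yy.c. Then lex.yy.c is compiled and linked against the -lfl library to produce the executable a.out. Finally, a.out analyzes its input stream and converts it into a sequence of tokens.
Example: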

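%{
//Code between %{ and %} is copied verbatim to the top of the generated lex.yy.c; you can put declarations and definitions here
#include <stdio.h>
#include <string.h>
int chars = 0;
int words = 0;
%}
%%
 /*You can write patterns here using the regular expressions you are familiar with*/
 /*You can use C code to specify the action taken when a pattern matches*/
 /*The yytext pointer points to the input text of the current match*/
 /*The left part ([a-zA-Z]+) is the regular expression to match; the right part ({ chars += strlen(yytext);words++;}) is the action executed once it matches*/
[a-zA-Z]+ { chars += strlen(yytext);words++;}
. {}/*For all other characters, do nothing and continue*/
%%
int main(int argc, char **argv){
    //yylex() is the lexing routine provided by flex; by default it reads from stdin
    yylex();
    printf("look, I find %d words of %d chars\n", words, chars);
    return 0;
}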

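Character conventions in lex:

Format	Meaning
a	the character a
"a"	the character a, even if a is a metacharacter
\a	escape: the literal character a when a is a metacharacter
a*	zero or more repetitions of a
a+	one or more repetitions of a
a?	an optional a
a|b	a or b
(a)	a itself
[abc]	any one of the characters a, b, c
[a-d]	any one of the characters a, b, c, d
{xxxxx}	the regular expression named xxxxx
.	any single character except a newline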

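Lab design

Finding the token type for each symbol

The file cminus_compiler-2023-fall/include/lexical_analyzer.h defines cminus_token_type (see the appendix).

Mapping to regular expressions

According to the cminus-f lexical rules: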

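1. Keywords
else if int return void while float
2. Special symbols
+ - * / < <= > >= == != = ; , ( ) [ ] { } /* */
3. Identifiers (ID) and integers (NUM), defined by the following regular expressions:
letter = a|...|z|A|...|Z
digit = 0|...|9
ID = letter+
INTEGER = digit+
FLOAT = (digit+. | digit*.digit+)
4. Comments are written as `/*...*/` and may span more than one line. Comments cannot be nested.
/*...*/
- Note: `[`, `]`, and `[]` are three different tokens. `[]` is used to declare an array type, and no whitespace is allowed between the two brackets.
- `a[]` should be recognized as two tokens: `a`, `[]`
- `a[1]` should be recognized as four tokens: `a`, `[`, `1`, `]`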

Writing the regular expressions and their matching actions

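This part defines the rules for the C Minus tokens. Patterns are written as regular expressions; when a token's pattern contains special characters, they must be escaped with a backslash (\). Actions are written in C: they determine each token's start and end position within its line and return the token type. That return value becomes the return value of yylex().
Each action has two steps. First, update lines, pos_start and pos_end. Second, return the recognized token type.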
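The first step is the same arithmetic in every rule, so it can be sketched as one small C helper. This is only an illustration: the lab code keeps lines, pos_start and pos_end as global variables, while the Cursor struct and the advance function here are hypothetical names. I also assume the counters start at 1, which is what the output later in this post suggests.

```c
#include <string.h>

// Hypothetical container for the three counters the actions update.
typedef struct {
    int lines;
    int pos_start;
    int pos_end;
} Cursor;

// Step 1 of an action: the new token starts where the previous one ended
// and ends strlen(text) characters later (the end position is exclusive).
void advance(Cursor *c, const char *text) {
    c->pos_start = c->pos_end;
    c->pos_end = c->pos_start + (int)strlen(text);
}
```

Feeding it "int", then " ", then "gcd" from a cursor at position 1 gives 1 to 4 for int and 5 to 8 for gcd. The second step is just the return of the token type, which the rules below do directly.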
Operators:

\+ {pos_start=pos_end;pos_end=pos_start+1;return ADD;}
\- {pos_start=pos_end;pos_end=pos_start+1;return SUB;}
\* {pos_start=pos_end;pos_end=pos_start+1;return MUL;}
\/ {pos_start=pos_end;pos_end=pos_start+1;return DIV;}
\< {pos_start=pos_end;pos_end=pos_start+1;return LT;}
"<=" {pos_start=pos_end;pos_end=pos_start+2;return LTE;}
\> {pos_start=pos_end;pos_end=pos_start+1;return GT;}
">=" {pos_start=pos_end;pos_end=pos_start+2;return GTE;}
"==" {pos_start=pos_end;pos_end=pos_start+2;return EQ;}
"!=" {pos_start=pos_end;pos_end=pos_start+2;return NEQ;}
\= {pos_start=pos_end;pos_end=pos_start+1;return ASSIN;}

Punctuation:

\; {pos_start=pos_end;pos_end=pos_start+1;return SEMICOLON;}
\, {pos_start=pos_end;pos_end=pos_start+1;return COMMA;}
\( {pos_start=pos_end;pos_end=pos_start+1;return LPARENTHESE;}
\) {pos_start=pos_end;pos_end=pos_start+1;return RPARENTHESE;}
\[ {pos_start=pos_end;pos_end=pos_start+1;return LBRACKET;}
\] {pos_start=pos_end;pos_end=pos_start+1;return RBRACKET;}
\{ {pos_start=pos_end;pos_end=pos_start+1;return LBRACE;}
\} {pos_start=pos_end;pos_end=pos_start+1;return RBRACE;}

Keywords:

else {pos_start=pos_end;pos_end=pos_start+4;return ELSE;}
if {pos_start=pos_end;pos_end=pos_start+2;return IF;}
int {pos_start=pos_end;pos_end=pos_start+3;return INT;}
float {pos_start=pos_end;pos_end=pos_start+5;return FLOAT;}
return {pos_start=pos_end;pos_end=pos_start+6;return RETURN;}
void {pos_start=pos_end;pos_end=pos_start+4;return VOID;}
while {pos_start=pos_end;pos_end=pos_start+5;return WHILE;}

Identifiers and numbers

[a-zA-Z]+ {pos_start=pos_end;pos_end=pos_start+strlen(yytext);return IDENTIFIER;}
[0-9]+ {pos_start=pos_end;pos_end=pos_start+strlen(yytext);return INTEGER;}
[0-9]*\.[0-9]+ {pos_start=pos_end;pos_end=pos_start+strlen(yytext);return FLOATPOINT;}
"[]" {pos_start=pos_end;pos_end=pos_start+2;return ARRAY;}
[a-zA-Z] {pos_start=pos_end;pos_end=pos_start+1;return LETTER;} /* never matches: the [a-zA-Z]+ rule above already accepts any single letter */
[0-9]+\. {pos_start=pos_end;pos_end=pos_start+strlen(yytext);return FLOATPOINT;}

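Everything else

When the lexer scans a newline (\r\n on Windows, \n on Linux, \r on classic Mac OS), the line counter lines is incremented and pos_start and pos_end are updated.
Because flex-generated lexers use a longest-match strategy and the comment delimiters /* */ contain regular-expression metacharacters, the comment pattern is fairly complex. When a comment is recognized, the change in start and end positions has to be accounted for, and a multi-line comment must also update lines.
For an invalid token, the lexer simply returns ERROR.

\n {return EOL;} /* newline */
\/\*([^\*]|(\*)*[^\*\/])*(\*)*\*\/ {return COMMENT;} /* comment */
" " {return BLANK;} /* space */
\t {return BLANK;} /* tab */
. {pos_start=pos_end;pos_end=pos_start+strlen(yytext);return ERROR;} /* error */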
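The comment pattern is the hardest one to read. Its body can never contain the two characters that close a comment, so a match always ends at the first closing delimiter after the opening one. A hand-written C sketch of the same check, using the hypothetical name comment_length, might look like this:

```c
#include <stddef.h>
#include <string.h>

// Hypothetical helper: if s begins with a comment, return its length in
// characters (delimiters included); otherwise return 0.
size_t comment_length(const char *s) {
    // A comment must open with slash-star.
    if (s[0] != '/' || s[1] != '*') return 0;

    // Search for the closing delimiter after the opening one, so the star of
    // the opening delimiter is not reused as part of the closing one.
    const char *end = strstr(s + 2, "*/");
    if (end == NULL) return 0;  // unterminated comment

    return (size_t)(end - s) + 2;
}
```

This also shows why comments cannot nest: scanning stops at the first closing delimiter, and whatever follows is lexed as ordinary tokens.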

The final additions

 /******************TODO*********************/
 /****Fill in all flex patterns and actions here  start******/
 //STUDENT TO DO
\+ {pos_start=pos_end;pos_end=pos_start+1;return ADD;}
\- {pos_start=pos_end;pos_end=pos_start+1;return SUB;}
\* {pos_start=pos_end;pos_end=pos_start+1;return MUL;}
\/ {pos_start=pos_end;pos_end=pos_start+1;return DIV;}
\< {pos_start=pos_end;pos_end=pos_start+1;return LT;}
"<=" {pos_start=pos_end;pos_end=pos_start+2;return LTE;}
\> {pos_start=pos_end;pos_end=pos_start+1;return GT;}
">=" {pos_start=pos_end;pos_end=pos_start+2;return GTE;}
"==" {pos_start=pos_end;pos_end=pos_start+2;return EQ;}
"!=" {pos_start=pos_end;pos_end=pos_start+2;return NEQ;}
\= {pos_start=pos_end;pos_end=pos_start+1;return ASSIN;}
\; {pos_start=pos_end;pos_end=pos_start+1;return SEMICOLON;}
\, {pos_start=pos_end;pos_end=pos_start+1;return COMMA;}
\( {pos_start=pos_end;pos_end=pos_start+1;return LPARENTHESE;}
\) {pos_start=pos_end;pos_end=pos_start+1;return RPARENTHESE;}
\[ {pos_start=pos_end;pos_end=pos_start+1;return LBRACKET;}
\] {pos_start=pos_end;pos_end=pos_start+1;return RBRACKET;}
\{ {pos_start=pos_end;pos_end=pos_start+1;return LBRACE;}
\} {pos_start=pos_end;pos_end=pos_start+1;return RBRACE;}
else {pos_start=pos_end;pos_end=pos_start+4;return ELSE;}
if {pos_start=pos_end;pos_end=pos_start+2;return IF;}
int {pos_start=pos_end;pos_end=pos_start+3;return INT;}
float {pos_start=pos_end;pos_end=pos_start+5;return FLOAT;}
return {pos_start=pos_end;pos_end=pos_start+6;return RETURN;}
void {pos_start=pos_end;pos_end=pos_start+4;return VOID;}
while {pos_start=pos_end;pos_end=pos_start+5;return WHILE;}
[a-zA-Z]+ {pos_start=pos_end;pos_end=pos_start+strlen(yytext);return IDENTIFIER;}
[0-9]+ {pos_start=pos_end;pos_end=pos_start+strlen(yytext);return INTEGER;}
[0-9]*\.[0-9]+ {pos_start=pos_end;pos_end=pos_start+strlen(yytext);return FLOATPOINT;}
"[]" {pos_start=pos_end;pos_end=pos_start+2;return ARRAY;}
[a-zA-Z] {pos_start=pos_end;pos_end=pos_start+1;return LETTER;}
[0-9]+\. {pos_start=pos_end;pos_end=pos_start+strlen(yytext);return FLOATPOINT;}
\n {return EOL;}
\/\*([^\*]|(\*)*[^\*\/])*(\*)*\*\/ {return COMMENT;}
" " {return BLANK;}
\t {return BLANK;}
. {pos_start=pos_end;pos_end=pos_start+strlen(yytext);return ERROR;}
 /****Fill in all flex patterns and actions here  end******/

And the supplementary C code

A comment can span several lines, so it needs extra analysis when it is recognized: on every newline \n inside it, increment lines and reset pos_end.

            case COMMENT:
                //STUDENT TO DO
                {
                    pos_start=pos_end;
                    pos_end=pos_start+2;
                    int i=2;
                    while(yytext[i]!='*' || yytext[i+1]!='/'){
                        if(yytext[i]=='\n'){
                            lines=lines+1;
                            pos_end=1;
                        }
                        else
                            pos_end=pos_end+1;
                        i=i+1;
                    }
                    pos_end=pos_end+2;
                    break;
                }
            case BLANK:
                //STUDENT TO DO
                {
                    pos_start=pos_end;
                    pos_end=pos_start+1;
                    break;
                }
            case EOL:
                //STUDENT TO DO
                {
                    lines+=1;
                    pos_end=1;
                    break;
                }
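To check the COMMENT branch by hand, the same loop can be pulled out into a standalone C function. This is a sketch under assumptions: the Counters struct and track_comment are hypothetical names standing in for the lexer's global counters, and text must be a complete, terminated comment such as the one the flex pattern hands over in yytext.

```c
// Hypothetical stand-in for the lexer's global counters.
typedef struct {
    int lines;
    int pos_start;
    int pos_end;
} Counters;

// Walk over one complete comment and update the counters the way the
// COMMENT branch does. text must start with the opening delimiter and
// contain the closing one.
void track_comment(Counters *c, const char *text) {
    c->pos_start = c->pos_end;
    c->pos_end = c->pos_start + 2;      // skip the opening delimiter
    int i = 2;
    while (text[i] != '*' || text[i + 1] != '/') {
        if (text[i] == '\n') {          // newline inside the comment
            c->lines += 1;
            c->pos_end = 1;             // positions restart on the new line
        } else {
            c->pos_end += 1;
        }
        i++;
    }
    c->pos_end += 2;                    // account for the closing delimiter
}
```

For a two-line comment whose second line is five characters long (including the closing delimiter), starting from line 1, position 1, this leaves lines at 2 and pos_end at 6, so the next token on that line starts at column 6.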

Verifying the results

Results

Following the steps in the lab handout, I ran the commands below and got this output:

sunny2004@sunny2004-VirtualBox:~/lab1/cminus_compiler-2023-fall$ mkdir build
sunny2004@sunny2004-VirtualBox:~/lab1/cminus_compiler-2023-fall$ cd build
sunny2004@sunny2004-VirtualBox:~/lab1/cminus_compiler-2023-fall/build$ cmake ../
-- The C compiler identification is GNU 5.4.0
-- The CXX compiler identification is GNU 5.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found FLEX: /usr/bin/flex (found version "2.6.4") 
-- Found BISON: /usr/bin/bison (found version "3.5.1") 
-- Found LLVM 10.0.0
-- Using LLVMConfig.cmake in: /usr/lib/llvm-10/cmake
-- Configuring done
-- Generating done
-- Build files have been written to: /home/sunny2004/lab1/cminus_compiler-2023-fall/build
sunny2004@sunny2004-VirtualBox:~/lab1/cminus_compiler-2023-fall/build$ make lexer
[ 20%] [FLEX][lex] Building scanner with flex 2.6.4
lexical_analyzer.l:60: warning, rule cannot be matched
Scanning dependencies of target flex
[ 40%] Building C object src/lexer/CMakeFiles/flex.dir/lex.yy.c.o
lexical_analyzer.l: In function ‘analyzer’:
lexical_analyzer.l:92:5: warning: suggest parentheses around assignment used as truth value [-Wparentheses]
lexical_analyzer.l: At top level:
/home/sunny2004/lab1/cminus_compiler-2023-fall/build/src/lexer/lex.yy.c:1320:17: warning: ‘yyunput’ defined but not used [-Wunused-function]
     static void yyunput (int c, char * yy_bp )
                 ^
/home/sunny2004/lab1/cminus_compiler-2023-fall/build/src/lexer/lex.yy.c:1363:16: warning: ‘input’ defined but not used [-Wunused-function]
     static int input  (void)
                ^
[ 60%] Linking C static library ../../libflex.a
[ 60%] Built target flex
Scanning dependencies of target lexer
[ 80%] Building C object tests/lab1/CMakeFiles/lexer.dir/main.c.o
[100%] Linking C executable ../../lexer
[100%] Built target lexer
sunny2004@sunny2004-VirtualBox:~/lab1/cminus_compiler-2023-fall/build$ cd ..
sunny2004@sunny2004-VirtualBox:~/lab1/cminus_compiler-2023-fall$ ./build/lexer
usage: lexer input_file output_file
sunny2004@sunny2004-VirtualBox:~/lab1/cminus_compiler-2023-fall$ ./build/lexer ./tests/lab1/testcase/1.cminus out
[START]: Read from: ./tests/lab1/testcase/1.cminus
[END]: Analysis completed.
sunny2004@sunny2004-VirtualBox:~/lab1/cminus_compiler-2023-fall$ head -n 5 out
int	280	1	1	4
gcd	285	1	5	8
(	272	1	9	10
int	280	1	10	13
u	285	1	14	15
sunny2004@sunny2004-VirtualBox:~/lab1/cminus_compiler-2023-fall$ python3 ./tests/lab1/test_lexer.py
Find 6 files
[START]: Read from: ./tests/lab1/testcase/3.cminus
[END]: Analysis completed.
[START]: Read from: ./tests/lab1/testcase/2.cminus
[END]: Analysis completed.
[START]: Read from: ./tests/lab1/testcase/6.cminus
[END]: Analysis completed.
[START]: Read from: ./tests/lab1/testcase/1.cminus
[END]: Analysis completed.
[START]: Read from: ./tests/lab1/testcase/5.cminus
[END]: Analysis completed.
[START]: Read from: ./tests/lab1/testcase/4.cminus
[END]: Analysis completed.
sunny2004@sunny2004-VirtualBox:~/lab1/cminus_compiler-2023-fall$ diff ./tests/lab1/token ./tests/lab1/TA_token
sunny2004@sunny2004-VirtualBox:~/lab1/cminus_compiler-2023-fall$

If everything is correct, diff prints nothing; any output means something is wrong.

Uploading to gitee

git commit -m "lab1-result"
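If this is your first commit, Ubuntu will tell you something like this:
Please tell me who you are. Run
git config --global user.email "you@example.com"
git config --global user.name "your Name"
to set your account's default identity.
Omit --global to set the identity only in this repository.
In that case, run the two git config commands and then run git commit -m "lab1-result" again.
Then run git push to upload your work to the gitee repository (I forgot to copy this part of the output; the lab handout describes it in detail, so just follow that).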

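Reflections

Learned and consolidated regular expressions
Got familiar with working with gitee
It was a bumpy ride with a lot of debugging, but I finished just before the check-off. Compiler principles is hard ┭┮﹏┭┮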

Appendix 1: cminus_token_type

typedef enum cminus_token_type{
//Operators
ADD = 259, 	// plus: +
SUB = 260, 	// minus: -
MUL = 261, 	// multiplication: *
DIV = 262, 	// division: /
LT = 263, 	// less than: <
LTE = 264, 	// less than or equal: <=
GT = 265, 	// greater than: >
GTE = 266, 	// greater than or equal: >=
EQ = 267, 	// equal: ==
NEQ = 268, 	// not equal: !=
ASSIN = 269,	// single equals sign: =
//Punctuation
SEMICOLON = 270,	// semicolon: ;
COMMA = 271, 		// comma: ,
LPARENTHESE = 272, 	// left parenthesis: (
RPARENTHESE = 273, 	// right parenthesis: )
LBRACKET = 274, 	// left square bracket: [
RBRACKET = 275, 	// right square bracket: ]
LBRACE = 276, 		// left brace: {
RBRACE = 277, 		// right brace: }
//Keywords
ELSE = 278, 	// else
IF = 279, 		// if
INT = 280, 		// int
FLOAT = 281, 	// float
RETURN = 282, 	// return
VOID = 283, 	// void
WHILE = 284,	// while
//ID and NUM
IDENTIFIER = 285,	// variable name, e.g. a, low, high
INTEGER = 286, 		// integer, e.g. 10, 1
FLOATPOINT = 287,	// floating-point number, e.g. 11.1
ARRAY = 288,		// array, e.g. []
LETTER = 289,		// single letter, e.g. a, z
//others
EOL = 290,			// newline, \n or \0
COMMENT = 291,		// comment
BLANK = 292,		// space
ERROR = 258			// error
} Token;

typedef struct{
char text[256];
int token;
int lines;
int pos_start;
int pos_end;
} Token_Node;
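Each line of the lexer's output file corresponds to one Token_Node. The sketch below shows how such a node could be rendered in the tab-separated form seen in the results above. The format_token helper is hypothetical; the lab's own driver may print the fields differently.

```c
#include <stdio.h>

typedef struct {
    char text[256];
    int token;
    int lines;
    int pos_start;
    int pos_end;
} Token_Node;

// Hypothetical helper: write "text<TAB>token<TAB>lines<TAB>pos_start<TAB>pos_end"
// into buf and return the number of characters snprintf would have written.
int format_token(const Token_Node *t, char *buf, size_t size) {
    return snprintf(buf, size, "%s\t%d\t%d\t%d\t%d",
                    t->text, t->token, t->lines, t->pos_start, t->pos_end);
}
```

A node holding the text int, the type 280 (INT), line 1 and positions 1 to 4 renders as the first line of the head -n 5 output shown earlier.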