Monday, September 30, 2013

Compilers Project1: Flex

I decided to use flex for the lexer, so I've been learning the syntax and writing small programs for each feature I have to add that wasn't in the provided sample.

Running Flex
Create the file:     sample.flex
Run flex on it:      flex sample.flex
    This creates a lex.yy.c file
Compile and link:    gcc lex.yy.c -lfl
Run the executable:  ./a.out


Keywords
I started out doing keywords, because I just have to match a list of words, which doesn't require a complicated regex. I tested it by inputting every word in the list and making sure the lexer printed the correct output format, in this case: (KEYWORD symbol). And they all worked! But then I tried a more complicated input and got some strange output:
please return to me darling
ple(KEYWORD as)e (KEYWORD return) to me darl(KEYWORD in)g 
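After I researched word boundaries in flex and talked to my professor, we determined that this won't happen once the keyword rules are in the full lexer file. Flex always picks the rule with the longest match, so "please" and "darling" will be matched in full by the identifier rule instead of being split around a keyword. (When two rules match the same length, flex uses the one listed first, so the keyword rules need to come before the identifier rule.)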
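To make the longest-match idea concrete, here is a minimal sketch in Python (not flex) that imitates the rule selection: try every rule at the current position, keep the longest match, and break ties in favor of the earliest rule. The rule names, the three-keyword list, and the tokenize helper are all made up for illustration; the real lexer would have the full keyword list and many more rules.

```python
import re

# Hypothetical keyword list (the real one is much longer).
KEYWORDS = ["as", "in", "return"]

# Rules in priority order, like the rules section of a flex file.
# Longer keywords go first inside the alternation so the keyword rule
# itself reports its longest possible match.
RULES = [
    ("KEYWORD", re.compile("|".join(sorted(KEYWORDS, key=len, reverse=True)))),
    ("ID", re.compile(r"[A-Za-z_][A-Za-z_0-9]*")),
    ("WS", re.compile(r"[ \t\n]+")),
]

def tokenize(text):
    """Longest match wins; ties go to the rule listed first."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Strict '>' keeps the earlier rule when lengths are equal.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name != "WS":  # whitespace is skipped, not reported
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

print(tokenize("please return to me darling"))
```

With the identifier rule present, "please" and "darling" come out as single ID tokens, while "return" is still a KEYWORD because it ties with the identifier rule and the keyword rule is listed first.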


Punctuation
The next thing I decided to tackle was punctuation. Like keywords, punctuation is relatively simple because I just need to match a list of operators and delimiters. Once again, I tested my regex by inputting every legal punctuation token. Once those all passed, I tried some illegal punctuation.
~=
 *  (PUNCT "~")(PUNCT "=")
I'm not sure if the result is valid, or if I should be throwing an error. I'll have to talk to my professor about it (I suspect the latter).
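One data point for this question is what CPython's own tokenizer does with the same input. The sketch below uses the standard tokenize module and a small helper, op_tokens, that is made up for this example. It only shows how the reference implementation lexes the text; the project spec may still ask for something different.

```python
import io
import tokenize

def op_tokens(source):
    """Return the operator/delimiter strings CPython's tokenizer finds."""
    readline = io.StringIO(source).readline
    return [tok.string
            for tok in tokenize.generate_tokens(readline)
            if tok.type == tokenize.OP]

# "~" and "=" are each legal punctuation on their own, so the tokenizer
# should report them as two separate tokens.
print(op_tokens("~=\n"))
```

If the reference tokenizer splits the input this way, then rejecting "~=" would be the parser's job rather than the lexer's.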


String Literals
I was putting off doing literals because Python has 5 different types and they looked really complicated at first glance. However, each literal is defined by a grammar that is broken up into small, clearly defined pieces.
stringliteral   ::=  [stringprefix](shortstring | longstring)
stringprefix    ::=  "r" | "u" | "R" | "U"
shortstring     ::=  "'" shortstringitem* "'" | '"' shortstringitem* '"'
longstring      ::=  "'''" longstringitem* "'''" | '"""' longstringitem* '"""'
shortstringitem ::=  shortstringchar | stringescapeseq
longstringitem  ::=  longstringchar | stringescapeseq
shortstringchar ::=  <any source character except "\" or newline or the quote>
longstringchar  ::=  <any source character except "\">
stringescapeseq ::=  "\" <any source character>
All I had to do was translate each definition into a regex. I did run into a problem: my regex for string_literal caused an error, and I'm still working on figuring out why.
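As an illustration of the translation step, here is a minimal sketch of the shortstring part of the grammar, written with Python's re module rather than flex. The variable names are made up, and long strings and a backslash followed by a newline are left out. One thing to watch for when porting a pattern like this to flex: the double quote is a flex metacharacter (it starts a quoted literal), so it has to be escaped there. Whether that is related to the error above is only a guess.

```python
import re

# stringescapeseq: a backslash followed by any character
# (a backslash followed by a newline is not handled in this sketch).
ESCAPE = r"\\."

# shortstring: quote, then any mix of ordinary chars and escapes, then quote.
# The character class excludes backslash, newline, and the closing quote.
SHORT_SQ = rf"'(?:[^\\\n']|{ESCAPE})*'"
SHORT_DQ = rf'"(?:[^\\\n"]|{ESCAPE})*"'

# stringprefix is optional: r, u, R, or U.
PREFIX = r"[rRuU]?"

SHORTSTRING = re.compile(rf"{PREFIX}(?:{SHORT_SQ}|{SHORT_DQ})")

for sample in ["'abc'", r'"a\"b"', "r'x'", "'abc"]:
    print(repr(sample), bool(SHORTSTRING.fullmatch(sample)))
```

The unterminated 'abc is rejected because the pattern insists on a closing quote, and an escaped quote inside the string does not end it.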
 

Bytes Literals
Bytes literals are really similar to string literals. I followed the same process.
bytesliteral   ::=  bytesprefix(shortbytes | longbytes)
bytesprefix    ::=  "b" | "B" | "br" | "Br" | "bR" | "BR" | "rb" | "rB" | "Rb" | "RB"
shortbytes     ::=  "'" shortbytesitem* "'" | '"' shortbytesitem* '"'
longbytes      ::=  "'''" longbytesitem* "'''" | '"""' longbytesitem* '"""'
shortbytesitem ::=  shortbyteschar | bytesescapeseq
longbytesitem  ::=  longbyteschar | bytesescapeseq
shortbyteschar ::=  <any ASCII character except "\" or newline or the quote>
longbyteschar  ::=  <any ASCII character except "\">
bytesescapeseq ::=  "\" <any ASCII character>
As with string_literal, my regex for bytes_literal produced an error.


Integer Literals
The integer literals were one of the most straightforward of all the literals in Python. They're made up of decimal, binary, hex, and octal digits.
integer        ::=  decimalinteger | octinteger | hexinteger | bininteger
decimalinteger ::=  nonzerodigit digit* | "0"+
nonzerodigit   ::=  "1"..."9"
digit          ::=  "0"..."9"
octinteger     ::=  "0" ("o" | "O") octdigit+
hexinteger     ::=  "0" ("x" | "X") hexdigit+
bininteger     ::=  "0" ("b" | "B") bindigit+
octdigit       ::=  "0"..."7"
hexdigit       ::=  digit | "a"..."f" | "A"..."F"
bindigit       ::=  "0" | "1"
This file compiled just fine, but when I was testing it I found that the numbers 1-9 were not recognized (unless they were followed by a 0). I know the problem is with my regex for nonzero_digit (which is just [1-9]), but I'm not sure how to fix it. I tried adding a caret to the beginning (^[1-9]), but then I got an error when I tried to compile...
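For comparison, here is a direct translation of the integer grammar using Python's re module (not flex; the names are made up for this sketch). Written this way, a lone digit from 1 to 9 is accepted, because digit* allows zero following digits. That suggests, though it is only a guess, that the bug is in how the pieces are grouped or combined rather than in [1-9] itself. As for the caret: outside of square brackets, ^ in flex anchors a pattern to the start of a line, so it would not help here in any case.

```python
import re

# Each piece mirrors one line of the grammar.
DECIMAL = r"[1-9][0-9]*|0+"      # nonzerodigit digit* | "0"+
OCT = r"0[oO][0-7]+"             # "0" ("o" | "O") octdigit+
HEX = r"0[xX][0-9a-fA-F]+"       # "0" ("x" | "X") hexdigit+
BIN = r"0[bB][01]+"              # "0" ("b" | "B") bindigit+

# The outer group keeps the "|" inside DECIMAL from leaking into
# whatever the pattern is combined with later.
INTEGER = re.compile(rf"(?:{OCT}|{HEX}|{BIN}|{DECIMAL})")

for sample in ["7", "10", "000", "0x1F", "0b101", "0o17", "012"]:
    print(sample, bool(INTEGER.fullmatch(sample)))
```

The last sample, 012, is rejected on purpose: the grammar only allows leading zeros when every digit is zero.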


Floating Point Literals
I would say that floating point literals were the most straightforward literals, but imaginary literals consist of just one definition.
floatnumber   ::=  pointfloat | exponentfloat
pointfloat    ::=  [intpart] fraction | intpart "."
exponentfloat ::=  (intpart | pointfloat) exponent
intpart       ::=  digit+
fraction      ::=  "." digit+
exponent      ::=  ("e" | "E") ["+" | "-"] digit+
Anyway, float literals were the only ones I didn't have any trouble with, but since the definitions are so expressive there may be some boundary cases I just didn't think to test.
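A few boundary cases worth trying are sketched below, again with Python's re module standing in for flex and with made-up variable names. The pattern is a line-by-line translation of the grammar above, and the samples cover a missing integer part, a missing fraction, and exponents with and without a sign.

```python
import re

INTPART = r"[0-9]+"                      # digit+
FRACTION = r"\.[0-9]+"                   # "." digit+
EXPONENT = r"[eE][+-]?[0-9]+"            # ("e" | "E") ["+" | "-"] digit+

# pointfloat ::= [intpart] fraction | intpart "."
POINTFLOAT = rf"(?:(?:{INTPART})?{FRACTION}|{INTPART}\.)"

# exponentfloat ::= (intpart | pointfloat) exponent
EXPONENTFLOAT = rf"(?:{INTPART}|{POINTFLOAT}){EXPONENT}"

FLOATNUMBER = re.compile(rf"(?:{EXPONENTFLOAT}|{POINTFLOAT})")

should_match = ["3.14", "10.", ".001", "1e100", "3.14e-10", "0e0", "1.e5", ".5E+3"]
should_fail = ["1", ".", "1e", "e5", "1.5.2", "1e+"]

for sample in should_match + should_fail:
    print(sample, bool(FLOATNUMBER.fullmatch(sample)))
```

Note that a plain "1" is rejected here because it is an integer literal, not a float, so in the combined lexer it would fall to the integer rule.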


Imaginary Literals
Last but not least, imaginary literals. One line, very easy.
imagnumber ::=  (floatnumber | intpart) ("j" | "J")
And since imaginary literals are essentially floating point literals with an added j or J, I didn't have any problems with these either!


The only things I have left to do on this project are:
  1. More logic for indent/dedent (the code is there, I just need to bend it to my will)
  2. EOF (shouldn't be an issue)
  3. Put everything together in one file and do lots of testing!
