Running Flex
Create the file: sample.flex
Run flex on it: flex sample.flex
This creates a lex.yy.c file
Compile and link: gcc lex.yy.c -lfl
Run the executable: ./a.out
Keywords
I started with keywords, because I just have to match a list of words, which doesn't require a complicated regex. I tested it by inputting every word in the list and making sure it output the correct syntax, in this case (KEYWORD symbol). And they all worked! But then I tried a more complicated input and got some strange output:
please return to me darling
ple(KEYWORD as)e (KEYWORD return) to me darl(KEYWORD in)g
After I researched word boundaries in flex and talked to my professor, we determined that once I put the rules for keywords into the full lexer file this wouldn't happen again, because the words "please" and "darling" would be interpreted as identifiers, which are the longest matching regex.
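The longest-match behaviour can be sketched outside of flex. The following Python snippet is only an illustration: the shortened keyword list, the identifier pattern, and the `scan` helper are assumptions, not the rules from the real lexer. It imitates how flex chooses between rules: take the rule that matches the most text, and on a tie take the rule listed first.

```python
import re

# Hypothetical, shortened keyword list; the real lexer would list every Python keyword.
KEYWORDS = ["as", "in", "return"]

# Candidate rules in priority order, as (token name, compiled pattern) pairs.
RULES = [
    ("KEYWORD", re.compile("|".join(sorted(KEYWORDS, key=len, reverse=True)))),
    ("ID", re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),  # assumed identifier pattern
]

def scan(text):
    """Scan left to right, taking the longest match; ties go to the earlier rule."""
    pos, out = 0, []
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Strictly longer wins, so an equal-length match keeps the earlier rule.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"unexpected character {text[pos]!r}")
        out.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return out

print(scan("please return to me darling"))
```

With an identifier rule present, "please" and "darling" are each consumed whole, and "return" still comes out as a keyword because the keyword rule is listed first. In a flex file this would mean placing the keyword rules above the identifier rule.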
Punctuation
The next thing I decided to tackle was punctuation. Like keywords, punctuation is relatively simple because I just need to match a list of operators and delimiters. Once again, I tested my regex by inputting every legal punctuation token. Once those all passed, I tried some illegal punctuation:
~=
(PUNCT "~")(PUNCT "=")
I'm not sure if the result is valid, or if I should be throwing an error. I'll have to talk to my professor about it (I suspect the latter).
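One way to check what a lexer is expected to do with `~=` is to ask Python's own tokenizer. The sketch below uses the standard `tokenize` module; the `operator_tokens` helper is a name made up for this example. Python's lexical specification lists `~` as an operator and `=` as a delimiter, so a purely lexical pass may reasonably emit two tokens and leave the rejection to the parser.

```python
import io
import tokenize

def operator_tokens(source):
    """Return the operator/delimiter strings Python's tokenizer finds in source."""
    readline = io.StringIO(source).readline
    return [
        tok.string
        for tok in tokenize.generate_tokens(readline)
        if tok.type == tokenize.OP
    ]

print(operator_tokens("~=\n"))
print(operator_tokens("x //= 2\n"))
```

If the reference tokenizer splits `~=` into two tokens, that suggests the two PUNCT tokens are acceptable lexer output, though the assignment's requirements may differ.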
String Literals
I was putting off doing literals because Python has 5 different types and they looked really complicated at first glance. However, each literal is defined by a grammar that is broken up well and clearly defined.
stringliteral ::= [stringprefix](shortstring | longstring)
stringprefix ::= "r" | "u" | "R" | "U"
shortstring ::= "'" shortstringitem* "'" | '"' shortstringitem* '"'
longstring ::= "'''" longstringitem* "'''" | '"""' longstringitem* '"""'
shortstringitem ::= shortstringchar | stringescapeseq
longstringitem ::= longstringchar | stringescapeseq
shortstringchar ::= <any source character except "\" or newline or the quote>
longstringchar ::= <any source character except "\">
stringescapeseq ::= "\" <any source character>
All I had to do was translate each definition into a regex. I did run into a problem, where my regex for string_literal caused an error, but <I'm still working on figuring out the problem>.
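As a cross-check on the translation, the string literal grammar can be written as a Python `re` pattern and tested on its own. This is a sketch under assumptions: the variable names and the `is_string_literal` helper are invented here, and the syntax is Python's, not flex's. Two differences matter when porting: Python tries alternatives in order, so the long-string forms are listed first, and the lazy `*?` used for long strings has no flex equivalent.

```python
import re

STRING_PREFIX = r"[rRuU]?"

# shortstring: no raw newline or matching quote, but any escaped character is allowed
SHORT_SINGLE = r"'(?:[^\\\n']|\\[\s\S])*'"
SHORT_DOUBLE = r'"(?:[^\\\n"]|\\[\s\S])*"'

# longstring: anything except a bare backslash, stopping at the first closing triple quote
LONG_SINGLE = r"'''(?:[^\\]|\\[\s\S])*?'''"
LONG_DOUBLE = r'"""(?:[^\\]|\\[\s\S])*?"""'

# Long forms come first so that ''' is not read as an empty '' followed by a quote.
STRING_LITERAL = re.compile(
    f"{STRING_PREFIX}(?:{LONG_SINGLE}|{LONG_DOUBLE}|{SHORT_SINGLE}|{SHORT_DOUBLE})"
)

def is_string_literal(text):
    """True when the whole of text is one string literal under the grammar above."""
    return STRING_LITERAL.fullmatch(text) is not None

print(is_string_literal("'abc'"))
print(is_string_literal("'abc"))
```

In flex, the double-quote character starts a quoted literal inside a pattern, so quotes and backslashes in these rules need escaping; that may be worth checking when a string rule fails to compile.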
Bytes Literals
Bytes literals are really similar to string literals, so I followed the same process.
bytesliteral ::= bytesprefix(shortbytes | longbytes)
bytesprefix ::= "b" | "B" | "br" | "Br" | "bR" | "BR" | "rb" | "rB" | "Rb" | "RB"
shortbytes ::= "'" shortbytesitem* "'" | '"' shortbytesitem* '"'
longbytes ::= "'''" longbytesitem* "'''" | '"""' longbytesitem* '"""'
shortbytesitem ::= shortbyteschar | bytesescapeseq
longbytesitem ::= longbyteschar | bytesescapeseq
shortbyteschar ::= <any ASCII character except "\" or newline or the quote>
longbyteschar ::= <any ASCII character except "\">
bytesescapeseq ::= "\" <any ASCII character>
And, as with string_literal, I got an error for my bytes_literal regex.
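The bytes grammar differs from the string grammar in two ways: the prefix is mandatory, and every character must be ASCII. A Python `re` sketch of those two points follows; the names and the `is_bytes_literal` helper are assumptions, and the pattern is in Python syntax, not flex syntax.

```python
import re

# bytesprefix: b, B, br, Br, bR, BR, rb, rB, Rb, RB
BYTES_PREFIX = r"(?:[bB][rR]?|[rR][bB])"

ASCII = r"[\x00-\x7f]"
ESCAPE = r"\\" + ASCII  # bytesescapeseq: a backslash followed by any ASCII character

# The negative lookahead removes the excluded characters from the ASCII range.
SHORT_SINGLE = rf"'(?:(?![\\\n']){ASCII}|{ESCAPE})*'"
SHORT_DOUBLE = rf'"(?:(?![\\\n"]){ASCII}|{ESCAPE})*"'
LONG_SINGLE = rf"'''(?:(?!\\){ASCII}|{ESCAPE})*?'''"
LONG_DOUBLE = rf'"""(?:(?!\\){ASCII}|{ESCAPE})*?"""'

BYTES_LITERAL = re.compile(
    f"{BYTES_PREFIX}(?:{LONG_SINGLE}|{LONG_DOUBLE}|{SHORT_SINGLE}|{SHORT_DOUBLE})"
)

def is_bytes_literal(text):
    """True when the whole of text is one bytes literal under the grammar above."""
    return BYTES_LITERAL.fullmatch(text) is not None

print(is_bytes_literal("b'abc'"))
print(is_bytes_literal("'abc'"))
```

The lookahead is a Python convenience. In flex, the same restriction would more likely be spelled as an explicit character class.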
Integer Literals
The integer literals were one of the most straightforward of all the literals in Python. They're made up of decimal, binary, hex, and octal digits.
integer ::= decimalinteger | octinteger | hexinteger | bininteger
decimalinteger ::= nonzerodigit digit* | "0"+
nonzerodigit ::= "1"..."9"
digit ::= "0"..."9"
octinteger ::= "0" ("o" | "O") octdigit+
hexinteger ::= "0" ("x" | "X") hexdigit+
bininteger ::= "0" ("b" | "B") bindigit+
octdigit ::= "0"..."7"
hexdigit ::= digit | "a"..."f" | "A"..."F"
bindigit ::= "0" | "1"
This file compiled just fine, but when I was testing it I found that the numbers 1-9 were not recognized (unless they were followed by a 0). I know the problem is with my regex for nonzero_digit (which is just [1-9]), but I'm not sure how to fix it. I tried adding a caret to the beginning (^[1-9]), but then I got an error when I tried to compile...
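To separate grammar problems from flex problems, the integer grammar can be checked in isolation. The Python `re` sketch below uses names chosen for this example. It shows that `nonzerodigit digit*` accepts a lone digit, because `digit*` may match nothing. That suggests, without proving it, that the failure comes from how the flex definitions are written or combined, not from `[1-9]` itself. Note also that a leading `^` in flex anchors a rule to the start of a line, which is not what is wanted here.

```python
import re

DECIMAL = r"[1-9][0-9]*|0+"      # nonzerodigit digit* | "0"+
OCT = r"0[oO][0-7]+"
HEX = r"0[xX][0-9a-fA-F]+"
BIN = r"0[bB][01]+"

INTEGER = re.compile(f"(?:{OCT}|{HEX}|{BIN}|{DECIMAL})")

def is_integer(text):
    """True when the whole of text is one integer literal under the grammar above."""
    return INTEGER.fullmatch(text) is not None

for sample in ["7", "10", "000", "0x1F", "0b101", "0o17", "012", "0x"]:
    print(sample, is_integer(sample))
```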
Floating Point Literals
I would say that floating point literals were the most straightforward literals, but imaginary literals consist of one definition.
floatnumber ::= pointfloat | exponentfloat
pointfloat ::= [intpart] fraction | intpart "."
exponentfloat ::= (intpart | pointfloat) exponent
intpart ::= digit+
fraction ::= "." digit+
exponent ::= ("e" | "E") ["+" | "-"] digit+
Anyway, float literals were the only ones I didn't have any trouble with, but since the definitions are so expressive there may be some boundary cases I just didn't think to test.
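For hunting boundary cases, the float grammar can be written as a Python `re` pattern and run against a list of edge inputs. This is a sketch with invented names, in Python regex syntax rather than flex syntax. The sample list is a starting point, not an exhaustive test set.

```python
import re

INTPART = r"[0-9]+"
FRACTION = r"\.[0-9]+"
EXPONENT = r"[eE][+-]?[0-9]+"
POINTFLOAT = rf"(?:(?:{INTPART})?{FRACTION}|{INTPART}\.)"
EXPONENTFLOAT = rf"(?:{INTPART}|{POINTFLOAT}){EXPONENT}"

FLOATNUMBER = re.compile(f"(?:{EXPONENTFLOAT}|{POINTFLOAT})")

def is_float(text):
    """True when the whole of text is one float literal under the grammar above."""
    return FLOATNUMBER.fullmatch(text) is not None

# Edge inputs: trailing dot, leading dot, exponent with no fraction, bare pieces.
for sample in ["3.14", "10.", ".001", "1e100", "3.14e-10", "1.e5", "1", ".", "e5", "1e"]:
    print(repr(sample), is_float(sample))
```

Inputs such as `10.`, `.001`, and `1.e5` are the ones most easily missed, since each leaves out a part that looks mandatory at first glance.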
Imaginary Literals
Last but not least, imaginary literals. One line, very easy.
imagnumber ::= (floatnumber | intpart) ("j" | "J")
And since imaginary literals are essentially floating point literals with an added j or J, I didn't have any problems with these either!
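The imaginary rule reuses the float definitions, which a short Python `re` sketch makes explicit. The names are assumptions for this example. The point for the combined lexer is that `10j` should come out as one token, not as an integer followed by an identifier. Longest match gives this result in flex as long as the imaginary rule is present.

```python
import re

INTPART = r"[0-9]+"
POINTFLOAT = rf"(?:(?:{INTPART})?\.[0-9]+|{INTPART}\.)"
FLOATNUMBER = rf"(?:(?:{INTPART}|{POINTFLOAT})[eE][+-]?[0-9]+|{POINTFLOAT})"

# imagnumber ::= (floatnumber | intpart) ("j" | "J")
IMAGNUMBER = re.compile(rf"(?:{FLOATNUMBER}|{INTPART})[jJ]")

def is_imaginary(text):
    """True when the whole of text is one imaginary literal under the grammar above."""
    return IMAGNUMBER.fullmatch(text) is not None

for sample in ["3.14j", "10j", "1e-3J", ".5j", "10.j", "j", "3.14"]:
    print(repr(sample), is_imaginary(sample))
```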
The only things I have left to do on this project are:
- More logic for indent/dedent (the code is there, I just need to bend it to my will)
- EOF (shouldn't be an issue)
- Put everything together in one file and do lots of testing!
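For the indent/dedent item, the usual approach is an indentation stack, as described in Python's lexical analysis documentation. The Python sketch below is not the code from this project: the function name, the token shapes, and the assumption that indentation uses spaces only (no tabs) are all made up for illustration.

```python
def indentation_tokens(lines):
    """Return a token list with INDENT/DEDENT markers, using an indentation stack."""
    stack = [0]   # open indentation widths; column 0 is always on the stack
    tokens = []
    for line in lines:
        stripped = line.lstrip(" ")
        if not stripped.strip():
            continue  # blank lines do not change the indentation level
        width = len(line) - len(stripped)
        if width > stack[-1]:
            # Deeper than the current level: open one new block.
            stack.append(width)
            tokens.append("INDENT")
        else:
            # Shallower: close blocks until the widths line up again.
            while width < stack[-1]:
                stack.pop()
                tokens.append("DEDENT")
            if width != stack[-1]:
                raise IndentationError("dedent does not match any outer level")
        tokens.append(("LINE", stripped.rstrip("\n")))
    # At end of input, close every block that is still open.
    tokens.extend("DEDENT" for _ in stack[1:])
    return tokens

print(indentation_tokens(["if x:\n", "    y\n", "q\n"]))
```

The final loop is where EOF handling meets indentation: any blocks still open at end of input each produce a DEDENT.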