Tuesday, October 22, 2013

Compilers Project 2: Getting Started

I'm working on a parser for Python.

My code can be found here: https://bitbucket.org/ashley_dunn/compilers-project-2

I spoke with my professor about how I can go about tackling the parser, and he said there are two primary paths I can take. First, I could write my own parser by hand (not recommended), utilizing the fact that Python has an LL(1) grammar and can be parsed with a recursive descent parser. Second, I could use a derivative-based parser tool.
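To give a feel for the hand-written route, here is a minimal sketch of a recursive descent parser for a toy arithmetic grammar. This is not Python's real grammar; the grammar, the `tokenize` helper, and the `Parser` class are all made up for illustration. The idea is that each nonterminal becomes one method, and a single token of lookahead is enough to pick the next branch.

```python
import re

# Toy LL(1) grammar (an assumption for illustration, not Python's grammar):
#   expr   -> term (('+' | '-') term)*
#   term   -> factor (('*' | '/') factor)*
#   factor -> NUMBER | '(' expr ')'

TOKEN_RE = re.compile(r"\s*(\d+|[-+*/()])")

def tokenize(text):
    """Split the input into number and operator tokens."""
    tokens = []
    pos = 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        # One token of lookahead; None means end of input.
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, token):
        if self.peek() != token:
            raise SyntaxError(f"expected {token!r}, got {self.peek()!r}")
        self.pos += 1

    def expr(self):
        node = self.term()
        while self.peek() in ("+", "-"):
            op = self.peek()
            self.pos += 1
            node = (op, node, self.term())
        return node

    def term(self):
        node = self.factor()
        while self.peek() in ("*", "/"):
            op = self.peek()
            self.pos += 1
            node = (op, node, self.factor())
        return node

    def factor(self):
        token = self.peek()
        if token == "(":
            self.pos += 1
            node = self.expr()
            self.expect(")")
            return node
        if token is not None and token.isdigit():
            self.pos += 1
            return int(token)
        raise SyntaxError(f"unexpected token {token!r}")

def parse(text):
    """Parse the whole input and return a nested-tuple syntax tree."""
    parser = Parser(tokenize(text))
    tree = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return tree
```

Calling `parse("1 + 2 * 3")` builds a tree in which the multiplication sits below the addition, because `term` is called from inside `expr`.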

As usual, I'm starting out by doing some research. Here are a few of the sources I'm going to check out.
Option 1 (hand-written parser):

Option 2 (derivative-based parsing tool):


And here are some handy references:

It was highly recommended to me to use derp (derivative-based Racket parsing tool), so I'll probably end up doing that. However, I plan on researching both options just because they're interesting.

Monday, October 21, 2013

Compilers Project 1: Update 3

I'M FINISHED!!! Or close enough for me to turn in if I were taking the Compilers class.

I ended up with 4 start states for string literals and 4 for bytes literals. They were pretty straightforward and were massively easier than what I was originally doing. My code is still a wee bit buggy. For example, here's some output for some regexes I tried in my string_start_states.l file:

([^\\]|"\\".)*
"""What up, bee?!"""        
dsf
(LIT "What up, bee?!"""
dsf
")

([^\\"]|"\\".)*
"""What up, "bee"?!"""
(LIT "What up, ")"(LIT "bee")"(LIT "?!")

([^\\"]|"\\".)*|"\""|"\"\""
"""What up, "bee"?!"""
(LIT "What up, ")(LIT """)(LIT "bee")(LIT """)(LIT "?!")
The first gets into an infinite loop because the "0 or more of anything except a backslash" matches the 3 quotes that are supposed to end the state. The second treats the quotes as single quotes and the third is just so wrong...

But I ran my lexer on the provided test code and the output was as expected except in two places -- both of which are \ as a continuation character. >_<
So I ought to fix that and escape sequences in strings. But my program is FINALLY essentially done, and that feels great!

Thursday, October 10, 2013

Compilers Project 1: Update 2

I talked to my professor about the issues I was running into, specifically when lexing strings and bytes literals. Apparently the way to handle this is lexical states, which I would have known if I was actually taking the compilers class. I found this page and this, which look like really good references! So the plan is to create 4 start states for the different kinds of strings -- one single quote ('), one double quote ("), three single quotes ('''), and three double quotes ("""). And then I can handle them appropriately.
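Here is a rough sketch of the start-state idea, written in Python rather than as flex rules. The `lex_string` function and the `DELIMITERS` tuple are hypothetical names for illustration, not code from my lexer. The opening delimiter picks the state, and only that same delimiter can end it, so a lone " inside a """ string is just ordinary text.

```python
# The four string "start states", longest delimiters first so that
# three quotes are not mistaken for an empty string followed by a quote.
DELIMITERS = ('"""', "'''", '"', "'")

def lex_string(text, pos):
    """Lex one string literal starting at text[pos].

    Returns (body, next_pos). Escape sequences are kept verbatim.
    """
    for delim in DELIMITERS:
        if text.startswith(delim, pos):
            break
    else:
        raise SyntaxError("not at a string literal")

    i = pos + len(delim)
    body = []
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text):
            # A backslash always consumes the next character too.
            body.append(text[i:i + 2])
            i += 2
        elif text.startswith(delim, i):
            # Only the delimiter that opened the state can close it.
            return "".join(body), i + len(delim)
        elif text[i] == "\n" and len(delim) == 1:
            raise SyntaxError("newline in single-quoted string")
        else:
            body.append(text[i])
            i += 1
    raise SyntaxError("unterminated string literal")
```

In flex terms, each entry in `DELIMITERS` would correspond to one start condition, and the loop body corresponds to the rules that are active only inside that condition.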

I also learned that all of the logic for indents/dedents is in place, and I really don't actually have to do much of anything for that. The only change I need to make is editing what he's already printing in the function ("{" for an indent, "}" for a dedent) to what the output should be ("(INDENT)" for an indent, "(DEDENT)" for a dedent).
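For reference, the usual way to produce these tokens is a stack of indentation widths. The sketch below is my own Python illustration of that idea, not the professor's code; `indent_tokens` is a made-up name, and it assumes indentation uses spaces only.

```python
def indent_tokens(lines):
    """Return the lines with "(INDENT)" / "(DEDENT)" markers interleaved.

    Assumes indentation is made of spaces only (no tabs).
    """
    stack = [0]  # indentation widths of the currently open blocks
    out = []
    for line in lines:
        stripped = line.lstrip(" ")
        if not stripped.strip():
            continue  # blank lines do not affect indentation
        width = len(line) - len(stripped)
        if width > stack[-1]:
            # Deeper than the current block: open a new one.
            stack.append(width)
            out.append("(INDENT)")
        else:
            # Shallower: close blocks until the widths line up again.
            while width < stack[-1]:
                stack.pop()
                out.append("(DEDENT)")
            if width != stack[-1]:
                raise IndentationError("inconsistent dedent")
        out.append(stripped.rstrip("\n"))
    # Close any blocks still open at end of file.
    while len(stack) > 1:
        stack.pop()
        out.append("(DEDENT)")
    return out
```

A single line can trigger several "(DEDENT)" markers at once, which is why the stack is popped in a loop rather than just once.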

Last but not least, I was using "EOF" for the end-of-file marker, but it was matching the string "EOF". I changed that to "<<EOF>>" and now that's working.

My plan is to finish this, and start the next project by Tuesday. Look for one more update on this project (to hopefully say that I figured everything out!), and then work on the Python parser!

Monday, October 7, 2013

Compilers Project 1: Update 1

In my last post, I was experiencing some issues with string, bytes, and integer literals. Since then I've fixed the integers (1-9 weren't being recognized) by changing my regex a little. I had:
decimal_int {nonzero_digit}*{zero}+
and changed it to:
decimal_int {nonzero_digit}{digit}*|{zero}+
and now it works!
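The difference is easy to check outside of flex. The snippet below translates both patterns to Python's `re` module, assuming `nonzero_digit` is `[1-9]`, `digit` is `[0-9]`, and `zero` is `0` (those definitions aren't shown above, so that part is a guess). The old pattern required at least one trailing zero, which is why plain 1-9 never matched.

```python
import re

# Assumed expansions of the flex definitions (not shown in the post):
#   nonzero_digit -> [1-9], digit -> [0-9], zero -> 0
OLD_DECIMAL_INT = re.compile(r"[1-9]*0+")
NEW_DECIMAL_INT = re.compile(r"[1-9][0-9]*|0+")

def matches(pattern, text):
    """True if the whole text is matched by the pattern."""
    return pattern.fullmatch(text) is not None

for sample in ["7", "10", "123", "0", "000", "007"]:
    print(sample, matches(OLD_DECIMAL_INT, sample), matches(NEW_DECIMAL_INT, sample))
```

With the new pattern, "7" and "123" are accepted, "0" and "000" still are, and "007" is rejected by both.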

Unfortunately, I was never able to pinpoint the problem with my string and bytes literals. Despite this, I decided to go ahead and start putting all of my rules into one lexer file. Excluding the broken rules mentioned above, my lexer seems to be working pretty well!

Other than that, I still haven't fixed the logic for indents and dedents, but I'm not sure how to go about it... I have a meeting with my professor tomorrow, so I'm planning on asking him about my string/bytes literals and indents/dedents. I'm really close to being finished!