Monday, November 18, 2013

Compilers Project 2: Derp

My work in writing the parser is pretty straightforward. I was given some stub code, which consists of three main files:
  1. python-ast.grm.sx: This is a simplified Python grammar. It is actually the only file I'll be editing, as I must add derp reductions to each of the rules.
  2. pyparse.rkt: This file converts the output from my lexer to tokens that derp can parse.
  3. derivative-parsers.rkt: This file is derp. It applies the reductions in python-ast.grm.sx to the rules and produces an abstract syntax tree.
Reductions
  • Derp has a number of recognized reductions. The most basic one is (red L f), where L is a language and f is a function. This reduction applies f to each parse of the language.
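Derp itself is Racket, but the idea is easy to sketch in Python (everything below is invented for illustration, not derp's actual API): a language's parses form a set of trees, and red maps f over that set.

```python
# Hypothetical Python sketch of (red L f): a reduction maps the function f
# over every parse that the language L produced. Not derp's real API.
def red(parses, f):
    """Apply f to each parse tree in the set of parses."""
    return {f(t) for t in parses}

# Suppose parsing produced these (tag, value) parse trees:
parses = {("NUMBER", 42), ("NUMBER", 7)}

# Reduce each tagged pair to just its value:
reduced = red(parses, lambda t: t[1])
print(reduced)  # {42, 7} (set order may vary)
```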

  • (--> L f) is equivalent to (red L f).

  • (car L) is equivalent to (red L (λ (t) (car t))). That is, f in this case is an anonymous (lambda) function that applies car to each parse of the language. In Racket, car returns the first element of a pair.
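Sketching the same idea in Python (illustrative names only): if each parse is a pair, the car reduction keeps just the first element.

```python
# Illustrative Python sketch of (car L): each parse is a pair (a Racket cons
# cell), and the reduction keeps only the first element of each one.
def car_reduction(parses):
    return {pair[0] for pair in parses}

# Each parse here is a pair such as (keyword, body):
parses = {("if", "suite1"), ("while", "suite2")}
print(car_reduction(parses))  # {'if', 'while'} (set order may vary)
```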

  • (@--> L f) is equivalent to (red L (λ (t) (apply f t))). In the following example:
(@--> (seq! `SYMBOL "," `example) cons)
- `SYMBOL is a pattern to be matched
- `example is a rule
- cons is a Racket function that adds an item to the beginning of a list

- L = (seq! `SYMBOL "," `example)
- f = cons
Each sequence that matches the pattern (`SYMBOL "," `example) is cons-ed onto a list.
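In Python terms (a rough sketch, not derp's API, and assuming seq! keeps only the backquoted components of the sequence), the difference from red is the apply: f receives the members of the parse as separate arguments instead of one tuple.

```python
# Sketch of (@--> L f): Racket's (apply f t) unpacks the parse t, so f sees
# its elements as separate arguments -- Python's f(*t).
def at_reduction(parses, f):
    return [f(*t) for t in parses]

def cons(head, tail):
    """Rough Python stand-in for Racket's cons onto a list."""
    return [head] + tail

# Assuming the "," is hidden, (seq! `SYMBOL "," `example) parses
# to a pair (symbol, rest):
parses = [("x", ["y", "z"])]
print(at_reduction(parses, cons))  # [['x', 'y', 'z']]
```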

  • (>--> L c ...) is equivalent to (red L (λ (t) (match t c ...))).

  • ($--> L e ...) parses into the value of (begin e ...), with $$ bound to the match of L and ($ n) returning the nth member of $$. In the following example:
($--> SYMBOL (list $$))

- SYMBOL is a language
- list is a Racket function that returns a list containing its arguments
- $$ is bound to the entire match of L
- L = SYMBOL
- e = (list $$)
The match of SYMBOL is wrapped in a list, which becomes the parse's value. Replacing $$ with ($ 3) would instead return a list containing just the third member of the match.
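A Python sketch of the binding (hypothetical, just to make $$ and ($ n) concrete): the whole match plays the role of $$, and indexing into it plays the role of ($ n).

```python
# Sketch of ($--> L e ...): the body e runs with $$ bound to the whole match
# of L, and ($ n) picks out member n of $$. Here dd stands in for $$.
def dollar_reduction(match_value, body):
    return body(match_value)

# ($--> SYMBOL (list $$)) wraps the whole match in a list:
print(dollar_reduction("x", lambda dd: [dd]))  # ['x']

# With a sequence match, ($ 2) picks member 2 of $$ (0-indexed here):
print(dollar_reduction(("a", "b", "c"), lambda dd: [dd[2]]))  # ['c']
```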
Reductions in Action
The rule with the fewest dependencies in python-ast.grm.sx is atom, which looks like:
(atom            (or (seq "(" (opt tuple_or_test) ")") 
                     (seq "[" (opt testlist) "]") 
                     (seq "{" (opt dictorsetmaker) "}") 
                     NAME 
                     NUMBER 
                     (rep+ STRING) 
                     "..." 
                     "None" 
                     "True" 
                     "False"))
Applying reductions to the bottom 7 languages (starting with NAME), we get something that looks like:
(atom            (or (seq "(" (opt tuple_or_test) ")") 
                     (seq "[" (opt testlist) "]") 
                     (seq "{" (opt dictorsetmaker) "}") 
                     (>--> NAME [a a])
                     (>--> NUMBER [a a])
                     (>--> (rep+ STRING) [a a])
                     ($--> "..." 'Ellipsis)
                     (>--> "None" [a a])
                     (>--> "True" [a a])
                     (>--> "False" [a a])))
That's all the work I've done so far. Next I'm going to continue with atom, applying reductions to the first 3 languages. And after that I'll work on the other rules.

Tuesday, November 5, 2013

Compilers Project 2: Parsing with Derivatives

I definitely don't have the best handle on parsing with derivatives... but! Here's what I know based on this and this, which are based on this and this:
  • the derivative of a regex with respect to a character produces a second regex
  • the second regular expression "matches what the original expression would match, assuming it had just matched the character"
Here's what I think I know based on that information:
  • the derivative of a regex with respect to a character eliminates all strings that do not start with the character, and eliminates the character from strings that do start with the character.
Example: the derivative of the following language

{ "I", "am", "a", "happy", "camper" }

with respect to a yields

{ "m", ε }

Where ε is the empty string: deriving "a" with respect to 'a' consumes the whole string and leaves ε, which indicates a successful match. Words that don't start with 'a' (e.g. "I") derive to the null pattern, ∅, which indicates no match.
Of course, that's a trivial example. In a "real-world" context, a match is against a whole word, character by character. So a more realistic example is matching the word "happy" against the language above. Taking the derivative with respect to 'h':

D_h({ "I", "am", "a", "happy", "camper" }) = { "appy" }

because every word that doesn't start with 'h' derives to ∅ -- except "happy", which derives to "appy". Next, the derivative is taken with respect to 'a':

D_a({ "appy" }) = { "ppy" }

And so on until the result is the empty string. An empty string means that the match was successful, while the null pattern indicates that the match failed.
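For a finite language like the one above, the derivative is easy to write down in Python (a toy sketch of the idea, not how derp represents languages):

```python
# Derivative of a finite language (a set of strings) with respect to a
# character c: drop strings that don't start with c, strip c from the rest.
def derivative(language, c):
    return {w[1:] for w in language if w[:1] == c}

L = {"I", "am", "a", "happy", "camper"}
print(derivative(L, "a"))   # {'m', ''} -- '' is the empty string ε

# Matching "happy" character by character:
L1 = derivative(L, "h")     # {'appy'}  (everything else went to ∅, i.e. dropped)
L2 = derivative(L1, "a")    # {'ppy'}
# ...continuing with 'p', 'p', 'y' leaves {''}: the match succeeded.
```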



Next steps
Derp, which stands for Derivative Parsers, is a tool written by my professor in Racket to make parsing easier. I'm planning on using derp to write my parser for Python.

The documentation can be found here: http://matt.might.net/teaching/compilers/spring-2013/derp.html

Monday, November 4, 2013

Compilers Project 2: Recursive Descent Parser

According to Wikipedia
"In computer science, a recursive descent parser is a kind of top-down parser built from a set of mutually recursive procedures (or a non-recursive equivalent) where each such procedure usually implements one of the production rules of the grammar."
What does this even mean? Basically, given a grammar, there is a function for each rule in the grammar that handles that rule. These functions can call each other or themselves, and together they construct a tree that represents the input (e.g. a program).

For example, the grammar:
(funcdef (seq "def" NAME parameters ":" suite))
(parameters (seq "(" (opt paramlist) ")"))
(paramlist (seq (seq NAME (rep (seq "," NAME))) (opt ",")))
leads to the following program structure:
class Parser(object):
    def __init__(self, string):
        ….
    def funcdef(self):
        ….
    def parameters(self):
        ….
    def paramlist(self):
        ….
    def suite(self):
        ….
And given the string:
def double(x):
    return x*x
We get a tree that looks like this (excluding the definition for suite):
                  def double(x):
                      return x*x
                    /          \
          def double(x)      return x*x
               /                  \
             (x)                 ….
             /
            x

(most of the information in this post is from this source: http://matt.might.net/articles/parsing-regex-with-recursive-descent/)

Tuesday, October 22, 2013

Compilers Project 2: Getting Started

I'm working on a parser for Python.

My code can be found here: https://bitbucket.org/ashley_dunn/compilers-project-2

I spoke with my professor about how I can go about tackling the parser, and he said there are two primary paths I can take. First, I could write my own parser by hand (not recommended), utilizing the fact that Python has an LL(1) grammar and can be parsed with a recursive descent parser. Second, I could use a derivative-based parser tool.

As usual, I'm starting out by doing some research. Here are a few of the sources I'm going to check out.
Option 1 (hand-written parser):

Option 2 (derivative-based parsing tool):


And here are some handy references:

It was highly recommended to me to use derp (derivative-based Racket parsing tool), so I'll probably end up doing that. However, I plan on researching both options just because they're interesting.

Monday, October 21, 2013

Compilers Project 1: Update 3

I'M FINISHED!!! Or close enough for me to turn in if I were taking the Compilers class.

I ended up with 4 start states for string literals and 4 for bytes literals. They were pretty straightforward and massively easier than what I was originally doing. My code is still a wee bit buggy. For example, here's some output for some regexes I tried in my string_start_states.l file:

([^\\]|"\\".)*
"""What up, bee?!"""        
dsf
(LIT "What up, bee?!"""
dsf
")

([^\\"]|"\\".)*
"""What up, "bee"?!"""
(LIT "What up, ")"(LIT "bee")"(LIT "?!")

([^\\"]|"\\".)*|"\""|"\"\""
"""What up, "bee"?!"""
(LIT "What up, ")(LIT """)(LIT "bee")(LIT """)(LIT "?!")
The first gets into an infinite loop because the "0 or more of anything except a backslash" matches the 3 quotes that are supposed to end the state. The second treats the quotes as single quotes and the third is just so wrong...
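The first problem is easy to reproduce with Python's re module, which is greedy in the same way the flex rule is (the string below mimics my test input):

```python
import re

# "Zero or more of anything except a backslash" happily eats the closing
# triple quote too, so the lexer never sees the end of the string.
body = 'What up, bee?!"""'           # text after the opening """
m = re.match(r'([^\\]|\\.)*', body)
print(m.group(0))                    # the whole string, closing quotes included
```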

But I ran my lexer on the provided test code and the output was as expected except in two places -- both of which are \ as a continuation character. >_<
So I ought to fix that and escape sequences in strings. But my program is FINALLY essentially done, and that feels great!

Thursday, October 10, 2013

Compilers Project 1: Update 2

I talked to my professor about the issues I was running into, specifically when lexing strings and bytes literals. Apparently the way to handle this is lexical states, which I would have known if I was actually taking the compilers class. I found this page and this, which look like really good references! So the plan is to create 4 start states for the different kinds of strings -- one single quote ('), one double quote ("), three single quotes ('''), and three double quotes ("""). And then I can handle them appropriately.

I also learned that all of the logic for indents/dedents is in place, and I really don't actually have to do much of anything for that. The only change I need to make is editing what he's already printing in the function ("{" for an indent, "}" for a dedent) to what the output should be ("(INDENT)" for an indent, "(DEDENT)" for a dedent).

Last but not least, I was using "EOF" for the end-of-file marker, but it was matching the string "EOF". I changed that to "<<EOF>>" and now that's working.

My plan is to finish this, and start the next project by Tuesday. Look for one more update on this project (to hopefully say that I figured everything out!), and then work on the Python parser!

Monday, October 7, 2013

Compilers Project 1: Update 1

In my last post, I was experiencing some issues with string, bytes, and integer literals. Since then I've fixed the integers (single digits 1-9 weren't being recognized) by changing my regex a little. I had:
decimal_int {nonzero_digit}*{zero}+
and changed it to:
decimal_int {nonzero_digit}{digit}*|{zero}+
and now it works!
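Translating the two flex rules into Python's re (the character classes are my reconstruction of the named definitions) shows the difference:

```python
import re

digit, nonzero, zero = r"[0-9]", r"[1-9]", r"0"

old = re.compile(rf"{nonzero}*{zero}+")          # broken: requires a trailing 0
new = re.compile(rf"{nonzero}{digit}*|{zero}+")  # fixed

print(bool(old.fullmatch("7")))    # False -- why 1-9 alone weren't recognized
print(bool(new.fullmatch("7")))    # True
print(bool(new.fullmatch("120")))  # True
print(bool(new.fullmatch("0")))    # True
```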

Unfortunately, I was never able to pinpoint the problem with my string and bytes literals. Despite this, I decided to go ahead and start putting all of my rules into one lexer file. Excluding the broken rules mentioned above, my lexer seems to be working pretty well!

Other than that, I still haven't fixed the logic for indents and dedents, but I'm not sure how to go about it... I have a meeting with my professor tomorrow, so I'm planning on asking him about my string/bytes literals and indents/dedents. I'm really close to being finished!