Monday, November 18, 2013

Compilers Project 2: Derp

My work in writing the parser is pretty straightforward. I was given some stub code, which consists of three main files:
  1. python-ast.grm.sx: This is a simplified Python grammar. It is actually the only file I'll be editing, as I must add derp reductions to each of the rules.
  2. pyparse.rkt: This file converts the output from my lexer to tokens that derp can parse.
  3. derivative-parsers.rkt: This file is derp. It applies the reductions in python-ast.grm.sx to the rules and produces an abstract syntax tree.
Reductions (to be updated with examples)
  • Derp has a number of recognized reductions. The most basis one is (red L f) where L is a language and f is a process. This reduction applies the process to each parse of the language.
<example>

  • (--> L f) is equivalent to (red L f).

  • (car L) is equivalent to (red L (λ (t) (car t))). That is, the process (f) in this case is an anonymous (lambda) function that applies car to each parse of the language. In Racket, car returns the first element in a pair (t).
<example>

  • (@--> L f) is equivalent to (red L (λ (t) (apply f t))). In the following example:
(@--> (seq! `SYMBOL "," `example) cons)
- `SYMBOL is a pattern to be matched
- `example is a rule
- cons is a Racket function that adds an item to the beginning of a list

- L
=
(seq! `SYMBOL "," `example)
- f = cons
Each sequence that matches the pattern (`SYMBOL "," `example) is cons-ed onto a list.

  • (>--> L c ...) is equivalent to (red L (λ (t) (match t c ...))).
<example>

  • ($--> L e ...) parses into the value of (begin e ...), with $$ bound to the match of L and ($ n) returning the nth member of $$. In the following example:
($--> SYMBOL (list $$))

- SYMBOL is a language
- list is a Racket function that returns a list containing its arguments
- $$ is essentially "foreach"
- L = SYMBOL
- e = (list $$)
Each matched symbol is added to a list, which is then returned. Replacing $$ with $3 would just return a list containing the third match.
Reductions in Action
The rule with the least number of dependencies in python-ast.grm.sx is atom, which looks like:
(atom            (or (seq "(" (opt tuple_or_test) ")") 
                     (seq "[" (opt testlist) "]") 
                     (seq "{" (opt dictorsetmaker) "}") 
                     NAME 
                     NUMBER 
                     (rep+ STRING) 
                     "..." 
                     "None" 
                     "True" 
                     "False"))
Applying reductions to the bottom 7 languages (starting with NAME), we get something that looks like:
(atom            (or (seq "(" (opt tuple_or_test) ")") 
                     (seq "[" (opt testlist) "]") 
                     (seq "{" (opt dictorsetmaker) "}") 
                     (>--> NAME [a a])
                     (>--> NUMBER [a a])
                     (>--> (rep+ STRING) [a a])
                     ($--> "..." 'Ellipsis)
                     (>--> "None" [a a])
                     (>--> "True" [a a])
                     (>--> "False" [a a])))
That's all the work I've done so far. Next I'm going to continue with atom, applying reductions to the first 3 languages. And after that I'll work on the other rules.

Tuesday, November 5, 2013

Compilers Project 2: Parsing with Derivatives

I definitely don't have the best handle on parsing with derivatives... but! Here's what I know based on this and this, which are based on this and this:
  • the derivative of a regex with respect to a character produces a second regex
  • the second regular expression "matches what the original expression would match, assuming it had just matched the character"
Here's what I think I know based on that information:
  • the derivative of a regex with respect to a character eliminates all strings that do not start with the character, and eliminates the character from strings that do start with the character.
Example: the derivative of the following language

{ "I", "am", "a", "happy", "camper" }

with respect to a yields

{ "m", ε }

Where ε is the empty string (since the 'a' is eliminated, but not because there wasn't a match. The null pattern, ∅, is used for no match (i.e. "I"))
Of course, that's a trivial example. In a "real-world" context, a match is against a whole word that is matched character by character. So a more realistic example would be to match against the word "happy":
"I am a happy camper"
Dh(c) = "appy" 
because everything is eliminated with respect to 'h' -- except "happy", which derives to "appy". 
Next, the derivative is taken in respect to 'a':
"appy"
Da(c) = "ppy"
And so on until the result is the empty string. An empty string means that the match was successful, while the null pattern indicates that the match failed.



Next steps
Derp, which stands for Derivative Parsers, is a tool written by my professor in Racket to make parser easier. I'm planning on using derp to write my parser for Python.

The documentation can be found here: http://matt.might.net/teaching/compilers/spring-2013/derp.html

Monday, November 4, 2013

Compilers Project 2: Recursive Descent Parser

According to Wikipedia
"In computer science, a recursive descent parser is a kind of top-down parser built from a set of mutually recursive procedures (or a non-recursive equivalent) where each such procedure usually implements one of the production rules of the grammar."
What does this even mean? Basically, given a grammar, there is a function for each term in the grammar that handles that term. These functions can call each other or themselves, and they end up constructing a tree that represents the input (e.g. a program).

For example, the grammar:
(funcdef (seq "def" NAME parameters ":" suite))
(parameters (seq "(" (opt paramlist) ")"))
(paramlist (seq (seq NAME (rep (seq "," NAME))) (opt ",")))
leads to the following program structure:
class Parser(object):
    def __init__(self, string):
        ….
    def funcdef(self):
        ….
    def parameters(self):
        ….
    def paramlist(self):
        ….
    def suite(self):
        ….
And given the string:
def double(x):
    return x*x
We get a tree that looks like this (excluding the definition for suite):
                        def double(x):
                            return x*x
                          /                  \
                def double(x)    return x*x
                /                                         \
             (x)                                            ….
            /
          x

(most of the information in this post is from this source: http://matt.might.net/articles/parsing-regex-with-recursive-descent/)