The Pseudocode Translator Project: September 2013

Monday, September 30, 2013

Compilers Project1: Flex

I decided to use flex for the lexer, so I've been learning the syntax and writing small programs for each feature I have to add that wasn't in the provided sample.

Running Flex
Create the file:                            sample.flex
Run flex on it:                             flex sample.flex
    This creates a lex.yy.c file
Compile and link:                       gcc lex.yy.c -lfl
Run the executable:                    ./a.out

Keywords
I started out doing keywords, because I just have to match a list of words, which doesn't require a complicated regex. I tested it by inputting every word in the list and making sure it output the correct syntax. In this case: (KEYWORD symbol). And they all worked! But then I tried a more complicated input and got some strange input:

please return to me darling
ple(KEYWORD as)e (KEYWORD return) to me darl(KEYWORD in)g

After researching word boundaries in flex and talking to my professor, we determined that once I put the rules for keywords into the full lexer file this wouldn't happen again because the words "please" and "darling" would be interpreted as identifiers, which are the longest matching regex.

Punctuation
The next thing I decided to tackle was punctuation. Like keywords, punctuation is relatively simple because I just need to match a list of operators and delimiters. Once again, I tested my regex by inputting every legal punctuation. Once those all passed I inputted some illegal punctuation.

~=
* (PUNCT "~")(PUNCT "=")

I'm not sure if the result is valid, or if I should be throwing an error. I'll have to talk to my professor about it (I suspect the latter).

String Literals
I was putting off doing literals because Python has 5 different types and they looked really complicated at first glance. However, each literal is defined by a grammar, which are broken up well and clearly defined.

stringliteral   ::=  [stringprefix](shortstring | longstring)
stringprefix    ::=  "r" | "u" | "R" | "U"
shortstring     ::=  "'" shortstringitem* "'" | '"' shortstringitem* '"'
longstring      ::=  "'''" longstringitem* "'''" | '"""' longstringitem* '"""'
shortstringitem ::=  shortstringchar | stringescapeseq
longstringitem  ::=  longstringchar | stringescapeseq
shortstringchar ::=  <any source character except "\" or newline or the quote>
longstringchar  ::=  <any source character except "\">
stringescapeseq ::=  "\" <any source character>

All I had to do was translate each definition into a regex. I did run into a problem, where my regex for string_literal caused an error, but <I'm still working on figuring out the problem>.

Bytes Literals
Bytes literals are really similar to string literals. I followed the same process.

bytesliteral   ::=  bytesprefix(shortbytes | longbytes)
bytesprefix    ::=  "b" | "B" | "br" | "Br" | "bR" | "BR" | "rb" | "rB" | "Rb" | "RB"
shortbytes     ::=  "'" shortbytesitem* "'" | '"' shortbytesitem* '"'
longbytes      ::=  "'''" longbytesitem* "'''" | '"""' longbytesitem* '"""'
shortbytesitem ::=  shortbyteschar | bytesescapeseq
longbytesitem  ::=  longbyteschar | bytesescapeseq
shortbyteschar ::=  <any ASCII character except "\" or newline or the quote>
longbyteschar  ::=  <any ASCII character except "\">
bytesescapeseq ::=  "\" <any ASCII character>

And similarly to string_literals, I got an error for my regex for bytes_literal.

Integer Literals
The integer literals were one of the most straightforward of all the literals in Python. They're made up of decimal, binary, hex, and octal digits.

integer        ::=  decimalinteger | octinteger | hexinteger | bininteger
decimalinteger ::=  nonzerodigit digit* | "0"+
nonzerodigit   ::=  "1"..."9"
digit          ::=  "0"..."9"
octinteger     ::=  "0" ("o" | "O") octdigit+
hexinteger     ::=  "0" ("x" | "X") hexdigit+
bininteger     ::=  "0" ("b" | "B") bindigit+
octdigit       ::=  "0"..."7"
hexdigit       ::=  digit | "a"..."f" | "A"..."F"
bindigit       ::=  "0" | "1"

This file compiled just fine, but when was testing it I found that the numbers 1-9 were not recognized (unless they were followed by a 0). I'm know the problem is with my regex for nonzero_digit (which is just [1-9]), but I'm not sure how to fix it. I tried adding a caret to the beginning (^[1-9]), but then I got an error when I tried to compile...

Floating Point Literals
I would say that floating point literals were the most straightforward literals, but imaginary literals consist of one definition.

floatnumber   ::=  pointfloat | exponentfloat
pointfloat    ::=  [intpart] fraction | intpart "."
exponentfloat ::=  (intpart | pointfloat) exponent
intpart       ::=  digit+
fraction      ::=  "." digit+
exponent      ::=  ("e" | "E") ["+" | "-"] digit+

Anyway, float literals were the only ones I didn't have any trouble with, but since the definitions are so expressive there may be some boundary cases I just didn't think to test.

Imaginary Literals
Last, but not least imaginary literals. One line, very easy.

imagnumber ::=  (floatnumber | intpart) ("j" | "J")

And since imaginary literals are essentially floating point literals with an added j or J I didn't have any problems with these either!

The only things I have left to do on this project are:

More logic for indent/dedent (the code is there, I just need to bend it to my will)
EOF (shouldn't be an issue)
Put everything together in one file and do lots of testing!

Thursday, September 19, 2013

Compilers Project 1: Getting Started

I'm working on a lexer for Python.

My code can be found here: https://bitbucket.org/ashley_dunn/compilers-project-1.

I'm starting out by doing some research. Specifically reading these:

Based on the little reading I have done so far, I'm leaning towards Racket for the lexer (since I'm using it in my Programming Languages class, and it seems like an appropriate tool for the job).

Output:

There are 8 valid tokens to parse:

(NEWLINE) -- for a logical newline.

possible newlines include:
\n \r \r\n

(INDENT) -- for a logical increase in indentation.

should be spaces, not tabs

(DEDENT) -- for a logical decrease in indentation.
(ID name) -- for an identifier.

possible identifiers match this regex: [A-Za-z_][A-Za-z_0-9]*

(LIT value) -- for a literal value.

possible literals are too numerous to copy pasta here, so check them out in this link.

(KEYWORD symbol) -- for an keyword.

possible keywords include:

False      class      finally    is         return
None       continue   for        lambda     try
True       def        from       nonlocal   while
and        del        global     not        with
as         elif       if         or         yield
assert     else       import     pass
break      except     in         raise

(PUNCT text) -- for operators and delimiters.

possible delimiters include:

(       )       [       ]       {       }
,       :       .       ;       @       =
+=      -=      *=      /=      //=     %=
&=      |=      ^=      >>=     <<=     **=

possible operators include:

+       -       *       **      /       //      %
<<      >>      &       |       ^       ~
<       >       <=      >=      ==      !=

(ENDMARKER) -- for the end of the input.
If you encounter a lexical error, print (ERROR "explanation") and quit.

The sample file provided in the project description (written in lex, which I may end up using instead of Racket) already takes care of newline, indent/dedent (mostly), id, and some operators/delimiters. So I need to add:

a little more logic for indent/dedent
support for literals
support for keywords
more operators and delimeters
and support for EOF

Tuesday, September 17, 2013

What's Next?

I talked to my professor about the research I've done so far and where I should focus my efforts next. He said that compiling Python to Java is probably the most accessible choice, and more interesting than compiling from Java to Python.

So my next steps are:

To read this: http://matt.might.net/articles/compiling-to-java/
Work through the first three projects of the compilers class
Extend Python to be even more pseudocode-y (optional colons, support for symbols and words, etc)

Thursday, September 12, 2013

Syntax Analysis Example 3

I don't know if I'm going to support Object Oriented Programming (OOP) in my "language" (at least not initially), but this example introduces some new features that I haven't encountered in the others, so I thought I'd include it. (I'll still comment on the class definition syntax, just in case I do decide to include OOP in the future).

class DisjointSet

private int[] parent
private int[] rank

DisjointSet (n)
    parent = new int[n]
    rank = new int[n]
    for (i = 1; i ≤ n; i++)
parent[i] = i
        rank[i] = 0

int find (x)
    while (x ≠ parent[x])
        x = parent[x]
    return x

union (x, y)
    rx = find(x)
    ry = find(y)
    if rx == ry
        return
    else if rank[rx] > rank[ry]
        parent[ry] = rx
    else if rank[rx] < rank[ry]
        parent[rx] = ry
    else
        parent[rx] = ry
        rank[ry] = rank[ry] + 1

First of all, the definition syntax is pretty close already. All the first line needs is a colon after the class name. Python doesn't do member variable declarations, so I guess I would just ignore those two lines (although, it does have a strange way of defining visibility, so maybe I would want to keep track of them being private?).

Python uses __init__ ( ) for constructor syntax, instead of a method with the same name as the class. I think this would be pretty easy to translate (just replace the name with __init__ and keep the same arguments). However, some languages can have multiple constructors, and users might want to emulate this in their pseudocode, whereas Python can only have one init function. This problem is solved in Python by using default values in the init function that can be overridden, but it might be tricky to try and meld multiple constructors together this way...

One last OOP concept: since Python doesn't do member variable declarations, every class variable must be prefixed with "self" (e.g. self.parent). I would probably handle this by storing any variables that are initialized in the constructor somewhere, and then appending "self." to the front of every instance of these variables. And speaking of the self keyword, this is also required to be the first keyword in any class method declaration. If I can properly recognize classes and functions in general, I think it would be relatively easy to add "self, " to the beginning of every methods' argument list.

Alright, new for loop syntax! This one has a few things that the "for each" syntax doesn't have: parenthesis and semicolons. In general I was thinking about allowing the user to use semicolons, but just ignoring them. However, that wouldn't really work in this case. I think I would just allow this syntax to be valid (since it's such a prevalent construct in some many languages). In a way this is easy. It's essentially equal to "for i in range (1, n+1)" (I do need to pay attention to whether the operator is less than or less than equal to, in order to determine whether I need to add one, since Python's range goes up to n-1). The tricky part is the increment value... Python's range naturally increments by 1, but you can add an optional argument to specify how much it should increment or decrement by. Since the last term in the standard for loop is an expression, I would have to parse this individually to determine the user's intent (e.g. the increment operator indicates that the loop should increment by 1 every time, but i *=2 indicates that the loop should increment by i*2 every time).

There are some instances in here in which my professor used symbols that aren't found on a standard keyboard (≠ and ≤). I think I can safely ignore these, since most people won't be able to type these, and I think he only included them because this code was in some slides.

This example also includes a while loop. This is pretty much the same as in Python, except for the missing colon, so I would just add that during translation.

Two last things: first, is the introduction of "else if". This should be allowed, but it will be replaced by "elif" during translation. And finally, his "if" and "else if" lines don't have parenthesis. I'll have to think about whether I ever want to allow this, but for now I think this will be a compiler/interpreter error (expected '(' line __).

CONCLUSION:

OOP?
Standard for loop syntax (initial; final; increment) should be allowed.
Standard while loop syntax should be allowed.
"else if", "elsif", "elseif", and "elif" should be allowed and treated as an "elif".

Syntax Analysis Example 2

dfs (G)
    clock = 1
    for each v in V
        visited[v] = false
    for each v in V
        if !visited[v]
            explore(G, v)

explore (G, v)
    visited[v] = true
    previsit(v)
    for each (v,u) in E
        if !visited[u]
            explore(u)
    postvisit(G, v)

previsit(v)
    pre[v] = clock++

postvisit(v)
    post[v] = clock++

This example is much for "codey" than the previous example (i.e. it looks more like real code). One thing I noticed in this example that I don't remember seeing in the other is inconsistency of spacing between function names and parenthesis. I guess I could allow 0 or 1 space, but I'm not sure how other languages handle that.

We have a couple new concepts in this example -- primarily "for each" syntax and array access. Python has a "for each" loop, but without the "each". So converting this to Python would be relatively simple; just drop the "each" and add a ":" to the end. Python also does array access with the square bracket notation, so that doesn't need changing.

However, Python strictly uses word representations for logical operators, instead of symbols (e.g. ! vs not). I would want the user to be able to use either (I usually remember to use "and" and "or" when I switch back to Python from a more strict language, but I sometimes forget to use "not" instead of the bang).

Python has augmented assignment operators, but it doesn't allow for increment or decrement operators. That is you can write "x += 1", but you can't write "x++". I like the increment/decrement operators, though. They're simple and quick. I think I might simply my life by just having them be postfix, though. No prefix allowed! (but I'm sure there would be some users that would want prefix... hmm)

CONCLUSION:

There should be 0 or 1 space between a keyword and its parenthesis.
The word "each" is allowed, but ignored in a "for each" loop.
Consequently, the word "foreach" is allowed, but will be treated as "for each".
Array access should be done with square bracket notation.
Logical operators can be represented by their English label or symbols (e.g. "and" vs "&&" or "not" vs "!").
Increment and decrement operators are allowed, but will only be treated as postfix.
Augmented assignment operators are allowed.

Syntax Analysis Example 1

I found some examples of pseudocode from my algorithms class. I'm going to post some of them with my analysis.

blendedMultiply (x, y)
    if (|x| and |y| are both less than THRESHOLD)
        return gradeSchoolMultiply(x, y)
    else
        split x into halves xL and xR
        split y into halves yL and yR
        P1 = blendedMultiply(xL, yL)
        P2 = blendedMultiply(xR, yR)
        P3 = blendedMultiply(xL+xR, yL+yR)
        return P1*2n + (P3 - P1 - P2)*2n/2 + P2

The first thing that jumps out at me is that there is a lot of English here that would be difficult to parse; specifically the contents of the if statement, and the two split lines. Other than that, this is pretty similar to "real" code. blendedMultiply is clearly a function with two arguments (untyped, which won't be a problem translating to Python, but would be going to a typed language like C++ or Java). The assignments of P1, P2, and P3 look pretty straightforward. And all of the return statements look pretty straightforward.

I think if I were parsing this, I would look for camel-case words followed by parenthesis for function definition and function calls (could also be a word made up of letter, numbers, and underscores, like blended_multiply). I suppose I could determine whether the function was being defined or called by first looking to see if the name already existed (probably a function call then), or by looking to see if the arguments already existed (if not, then it's probably a function definition). Then I could add the "def" and ":" as needed.

For the if statement, I could probably write | | into the language as representing the magnitude of a variable, which could be changed to len(x) in Python. I think it would be somewhat easy to check for "less than" and replace with "<" (but maybe it's not as trivial as I think). "Are both" could be dropped, since there's already an "and". Although, I would need to rearrange the structure a little: "if(len(x) < THRESHOLD and len(y) < THRESHOLD):". I think capitalized words are generally considered to be constants, so I could just look for a defined constant somewhere.

I'm really not sure how to handle the split lines at all... the last return statement is easy, though. After all, math is math.

My professor indented his code in a Pythonic way, which is also how I write my pseudocode (it's way more readable that way). I like the idea of using whitespace to parse my language, but I have friends who complain about Python's required indentation. I'm thinking that newlines will be required, but indentation will be optional (but we'll see).

CONCLUSION:

Functions should be defined as camel-case or underscore-separated words followed by a parenthesis with 0 or more arguments.
Functions are called the same way that they are defined.
if statements should contain their conditionals in parenthesis.
Assignment should be done with one equals operator
Math is math (although exponents might be tricky... Python uses ** instead of ^)
Arithmetic symbols can also be represented by their English label (e.g. "less than" vs. "<" or "mod"/"modulus"/"modulo" vs "%").
Constants should be named with all caps.

Wednesday, September 11, 2013

Designing A Programming Language

It sounds like I'm probably going to be going with choice 1 and making a pseudocode "language" with very flexible syntax. I'm not going to lie, this project seems big, scary, and overwhelming. To psyche myself out for the project I thought I'd do some Googling of "how to design a programming language". One of the first hits was an essay by Paul Graham. I'm a big fan of his, so I was interested to see what he has to say on the topic. A few points that stood out to me:

Design for Yourself and Your Friends. "If you look at the history of programming languages, a lot of the best ones were languages designed for their own authors to use, and a lot of the worst ones were designed for other people to use."

One of the reasons I've been feeling overwhelmed by this project (besides the sheer number of ways to write pseudocode in general) is trying to think of ways that other people would write pseudocode. I should just write this tool for me and the kind of pseudocode I use, and I can extend the project for more general use in the future if I want to continue to work on it.

Aim for Brevity. "I think almost anything you can do to make programs shorter is good."

I think pseudocode is generally not very verbose, but this seems good to keep in mind.

Speed Comes from Profilers. "What they need is a language that can show them what parts of their own programs need to be rewritten. That's where speed comes from in practice. So maybe it would be a net win if language implementors took half the time they would have spent doing compiler optimizations and spent it writing a good profiler instead."

I don't know if this will be applicable to my project, but again, it seems good to keep in mind.

You Need an Application to Drive the Design of a Language. "This may not be an absolute rule, but it seems like the best languages all evolved together with some application they were being used to write."

My original idea was some sort of translator tool, but a language looked to be the right way to solve the problem. So I think I'm good here. My language will be written to create the pseudocode to real language application.

A Language Has to Be Good for Writing Throwaway Programs.

Definitely. I mean, that's the whole idea behind pseudocode, really.

Object-Oriented Programming. "I realize this is a controversial one, but I don't think object-oriented programming is such a big deal. (...) What this means for language design, I think, is that you shouldn't build object-oriented programming in too deeply."

I hadn't really thought about OOP in my "language", but it is the paradigm that I've been taught... I'm undecided on this (although, I don't think classes/functions are very prevalent in most of the pseudocode I've seen or written. At least, not explicitly)

I also found this wikiHow, which was surprisingly helpful. It pointed out a lot of things that I hadn't thought of, like semantics (too many to list here). Specifically, I found this to be worth thinking about: "Be careful, in order to keep your language in the Context-Free language category or something inside it, your parser generator and you will appreciate it later on." I'm going to have to walk a fine line between syntax that's easily parsed and syntax that captures the spirit of pseudocode. It also emphasizes designing the language carefully before trying to implement anything. Since I don't know much about programming language design I would like to spend time reading up on that and compilers before I actually design my "language".

And finally, this is an excellent Stack Overflow-esque question and answer that brings up a lot of good points. Once again, design carefully -- "write down precisely what the rules are for determining what is a legal and illegal program (...) Write down these rules as precisely as you possibly can.". And this was intriguing advice, considering my third option for the project: "One of the best ways to get started writing a compiler is by writing a high-level-language-to-high-level-language compiler."

I have a lot to chew on now. I think next I'm going to try to find some more resources on actually designing and implementing a new programming language. And also start narrowing down what I actually want my syntax to be (with examples!).

Monday, September 9, 2013

Choice 3: Compiler (in prog)

There is a precedent for this. See:

http://nuitka.net

https://github.com/evancz/Elm

http://code.google.com/p/shedskin/

http://developers.facebook.com/blog/post/2010/02/02/hiphop-for-php--move-fast/

PROS:

Could be a stepping stone to my grand plan

CONS:

Not my original idea
It's been done before (many times, by people much smarter than me)

CONCLUSIONS:

I like the idea because it seems more manageable.

Choice 2: Macros

I don't have the greatest grasp of macros, but I'll try to explain them as best I can.

The kind of macros that I might want to include in my pseudocode "language" are different from those in C/C++. Macros in C just do text replacement when the file is preprocessed. This is not incredibly helpful with code because one macro cannot be referenced from another; it's not very extensible.

I would want to use syntactic macros, which can transform the language itself. These are used primarily in the Lisp family, but can also be found in Scala and other languages. Apparently the reason macros work so well in Lisp-style languages is because Lisp's source code is very similar to Lisp data? Everything in Lisp is data made up of atoms... so a macro replacement is just another atom in the data? (at least according to this source, which is the best explanation I've found, but still kind of confusing)

PROS:

The user can redefine anything
The "language" would be highly scalable and extensible

CONS:

Not really what I had in mind for this project. If the user has to define macros to extend the language then it's not just a simple converter from pseudocode to another language
I don't fully understand macros, so how could I implement them? (this shouldn't really be a con since I don't know how to implement any of my choices at this point)

CONCLUSION:

Macros sound cool, but I don't think they're the right tool for the job

Thursday, September 5, 2013

Choice 1: TMTOWTDI

The first possibility for implementing this project is to write my own language with very flexible syntax, so multiple ways of writing the same thing would be valid.

Example: FizzBuzz

1.

for numbers in range 1-100
if number is divisible by 3 print Fizz
        if number is divisible by 5 print Buzz
        if number is divisible by 3 and 5 print FizzBuzz
        else print number

    2.

    for each number 0 < i < 101 {
        if (i % 3 == 0) {
            print "Fizz"
}
if (i % 5 == 0) {
print "Buzz
}
if (i % 3 == 0 and i % 5 == 0) {
print "FizzBuzz"
}
else {
print i
}
}

    3.

    while i < 101
       if 15 divides i puts FizzBuzz
   elseif 5 divides i puts Buzz
       elseif 3 divides i puts Fizz
       else puts i

4.

while i < 101
    switch
case !mod 3 and !mod 5, print FizzBuzz
case !mod3, print Fizz
case mod3 and !mod5, print Buzz
else print i

So you could potentially have punctuation, like parenthesis, curly braces, arithmetic symbols, semicolons. Or not! Symbols could be written out instead of being used (% vs mod). There are multiple ways to word things (puts vs print; if/else vs switch/case). Whitespace is is optional; you could indent or not, newlines or not.

I'm feeling overwhelmed with just these few permutations! I think if I took this approach I'd have to start out small... and if I keep up with the project after my independent study I could maybe crowd-source the translator to add more possibilities. (OSS, anyone?)

PROS:

Closest to my original idea

CONS:

More possibilities for syntax may mean code that's harder to read (although pseudocode is supposed to be easy to understand)
More work for me -- not only in defining the various possible syntaxes, but also writing a verbose compiler
All the possibilities also mean that I'll have to focus my efforts, which can be difficult

CONCLUSION:

This is beginning to sound like a terrible idea... =/
I'll research more into the specifics of how to implement a language like this if I end up picking this path (not likely)

Wednesday, September 4, 2013

First thoughts

Well, it turns out that I'm far from the only person to think it would be great if there were some sort of pseudocode translator... I mean, the first three hits of the Google search for that phrase are people on Stack Overflow wondering about the logistics of such a thing! And the more I think about it the less sure I am of what I actually want to create.

On the one hand, I could make a "pseudocode language", which to me would kind of defeat the point of my original intention... But it would be something like Perl or Ruby, in that "there's more than one way to do it!" I would just write the language such that the user is allowed to write many different forms of pseudocode and still have it work. The downside of that, to me, is that the user still has to learn some syntax, and the compiler would be MASSIVE! In short, less work for the user (although more than I had hoped) and more work for me! Alternatively in the same vein, I could allow the user to define macros, in order to make the language extremely customizable.

On the other hand, I could forget the idea of pseudocode pretty much altogether and just focus on a compiler. Python is pretty similar to pseudocode already, so I could write a compiler from Python to Java or C++ instead. If I'm going to make the user learn some syntax, it might as well be for a language that is already well established. Plus, I think writing just a compiler, rather than designing a new language and then a compiler for it, would be less work for me.

Introduction

Last semester I took algorithms. Oftentimes the professor would provide us with some pseudocode and ask us questions about it or to analyze the efficiency of the algorithm. I like concrete examples, so I'd often want to run the code to get a feel for it. In order to do that I would have to translate my professor's pseudocode into a "real" language, like Python. I got to thinking (like programmers do), "wouldn't it be great if there was a tool that took pseudocode as input and provided real code as output?!"

I researched the idea a little then, but must not have used very good Google search terms because it didn't look like anyone had done it. I was going to change the world!

Then I started talking to another professor whose research specialties include programming languages and compilers. He said my idea sounded like a good project for an independent study. And thus "The Pseudocode Translator Project" was born! This blog will be my journey from knowing essentially nothing about writing a programming language or compiler to, hopefully, having a working pseudocode translator for public consumption.