Making a parser generator doesn’t have to be hard.

I’m inspired to write this because of Russ Cox’ regular expression blogpost, which follows a similar format, although I’m not about to compare my implementation to others.

Recently, I decided to create a project called CCC, or Collin’s Compiler Creator, mostly because I want to use parser generators, but feel disingenuous if I were to use an existing one like Bison or YACC. I have the same problem as John Carmack, if I didn’t write all of the code, I don’t feel like I wrote any of the code. But aside from that, I embarked on making a parser generator… and finished in the same night, and in this post we will make another.

Before we begin, the complete code can be downloaded here.

Step 0: How does a parser generator work?

A parser generator simply takes in a list of tokens to parse for, and outputs code to do the actual parsing. The generator itself only outputs the code to be used, so effectively no matter what language you choose to write it in, it is an ahead of time operation. Why would this be useful? Well, I invite you to check out the source code for Blæst, which I wrote my own parser for without the use of a parser generator. To put it lightly, that parser is a mess of spaghetti code wrapped in duct tape and glue. It is horribly inefficient, and filled with bugs. And aside from all of that is horrible to look at, and even worse to add new features to. Now compare that to a parser generator, I could’ve generated those almost 1,000 lines by just feeding it an array of tokens to look for, and when it finds them, just have it return a number corresponding to the token it found. And apart from that, rather than have each token use its own string compare (something this is slow and inefficient), it could combine them, since it knows what strings to compare ahead of this. This approach makes the code much more easy to maintain, and also makes everything easier to upgrade and even port to different languages.

Step 1: Lets get a list of tokens

For this tutorial I will use Javascript. I personally like Javascript because it is fairly similar to languages like C++ or Java, which most people will be familiar with, and if you aren’t, Javascript itself is probably familair to you. I have rarely met someone who can’t at least read Javascript. If you have a problem with this, check out the Github for CCC, its written entirely in… well… C.

The first thing we need to do is get a list of tokens. For this I’ll just create an array that is already populated, you can get these values however you wish, maybe making inputs on a website or via the command line or whatever. But the array should look something like this once you’re done.

// Our list of tokens
var tokens = ["Hello", "Hi", "Goodbye", "Good", "World", "Wordlist"];

I have chosen these specific words because when we construct trees for making the branching parser, they are going to come in handy for demonstrating how we can reuse branches.

Step 2: Create the trees

Now we need to create our actual parser tree. A parser tree is the sequence of letters needed to make the word. We call it a tree because it can branch. For example, both “Hello” and “Hi” start with “H”, so our “H” branch then branches off into an “e” and an “i” branch. This allows our parser to effectively string compare anything that starts with “H” together. For reference, standard string compares go through entire strings in one pass, it would string compare our input to “Hello” by comparing each letter of our input to “Hello”, which takes time. Then it would do the same with “Hi”. Both “Hello” and “Hi” start with an “H”, so if our input doesn’t start with an “H”, we can be assured it isn’t either “Hello” or “Hi”, so we don’t waste our time.

Creating these branches is actually quite easy. We simply walk through our word, creating branches for points we don’t already have. First we need to go through every token, then create a list of “next letters” for each current letter. If that “next letter” is the next letter of our word, just go to it and repeat. If not we add it and go from there.

// We need to create a 'state' which holds the possible next letters, and a state for those as well
var initialState = {
        id: 0,
        next: []
};

// Keep track of how many states we have made for code generation later
var stateCounter = 0;

// Loop through our tokens
for(let i = 0; i < tokens.length; i++){
  
  // Our current token we are checking
  let currentToken = tokens[i];
  
  console.log("Current token: " + currentToken);
  
  // Reset to the initial state since we are on a new token
  let currentState = initialState;
  
  // Now loop through every letter in our token
  letterLoop: for(let j = 0; j < currentToken.length; j++){
    
    // Our current token letter
    let currentLetter = currentToken.at(j);
    
    // Go through every possible next letter
    for(let k = 0; k < currentState.next.length; k++){
      
      let possibleNextLetter = currentState.next[k];
      // If the possible next letter is our current letter, we follow it
      if(possibleNextLetter.letter == currentLetter){
        
        console.log("Found branch for: " + currentLetter);
        
        // Set the new current state to the branch we are following
        currentState = possibleNextLetter.state;
        
        // Go back to the letter loop
        continue letterLoop;
      }
    }
    
    console.log("Adding " + currentLetter);

    // If we get down here, its because we couldn't find a branch, so we need to add one
    currentState.next.push({
      letter: currentLetter,
      state: {
        id: ++stateCounter,
        next: []
      }
    });
    // Now set our state to the new state we made
    currentState = currentState.next.at(-1).state;
    console.log("ID: " + currentState.id);
  }
  
  console.log("Adding end of word marker");
  
  // Add our end of word marker, this contains the token number we return
  currentState.next.push({
    letter: 0,
    state: {
        id: ++stateCounter,
        next: []
    },
    token: i + 1
  });
}

Now we should have a tree of every possible letter combination, if it were drawn out it should look something like this:

   [Start]
   /  |  \
  H   G   W
 /|   |   |
i e   o   o
| |   |   |
$ l   o   r
  |   |   |\
  l   d   l d
  |  /|   | |
  o $ b   d l
  |   |   | |
  $   y   $ i
      |     |
      e     s
      |     |
      $     t
            |
            $

Where $ is an end of token marker.

Step 3: Code generation

This is easily the hardest part, because it takes the most thinking, turning our states into code. Almost every parser generator I’ve seen uses a state machine for parsing, and lucky for us, they are very easy to implement. A basic state machine looks like this:

var state = 0;
switch(state){
  case 0:
    state = 1;
    break;
  case 1:
    state = 0;
    break;
  default:
    state = 1;
    break;
}

We just need to make our state machine slightly more useful than this one.

First we need to come up with our states… wait… we made those already. The state numbers in the case are just the state IDs we just made. So then to set the next state we just need to check for the letters… wait… we did that too. So we literally just loop through every state we generated and output the information. How easy is that!?

The generator may look complicated, but thats just because its not very clean looking. Just read through it and you’ll see it s simply just outputting switch/case statements.

console.log("BELOW IS THE GENERATED PARSER CODE");

// Print the head to our function

// function parse(token){
//   let i = 0;
//   let state = 0;
//   switch(state){
console.log("function parse(token){");
console.log("\tlet i = 0;");
console.log("\tlet state = 0;");
console.log("\twhile(true){");
console.log("\t\tswitch(state){");

// Now go though each state and print out the imporant bits
// To do this I'll simply create a recursive function.

function printState(state){
 
  // Print our case
  console.log("\t\t\tcase " + state.id + ":");

  // Make our case for the next letter
  console.log("\t\t\t\tswitch(token.at(i)){");
  
  // And our conditions
  for(let i = 0; i < state.next.length; i++){

    let condition = state.next[i];

    // If our letter is 0, we are at the end of the string, in Javascript, the string.at should return undefined at the end, so we check for that
    if(condition.letter == 0){
     // If we are actually at the end of the string, return our token number
     console.log("\t\t\t\t\tcase undefined: return " + condition.token + ";");
    }
    else{
      // Otherwise make it so we go to the next state

      // case 'nextLetter':
      //   i++;
      //   state = nextStateId;
      //   break;
      console.log("\t\t\t\t\tcase '" + condition.letter + "':");
      console.log("\t\t\t\t\t\ti++;");
      console.log("\t\t\t\t\t\tstate = " + condition.state.id + ";");
      console.log("\t\t\t\t\t\tbreak;");
    }
  }
  // Make sure if we can't branch anymore, we return 0 so we don't hit an infinite loop
  
  // defualt: reutrn 0;
  console.log("\t\t\t\t\tdefault: return 0;");

  // And finally just close up the switch/case and this current case for the state machine.
  console.log("\t\t\t\t}");
  console.log("\t\t\tbreak;");
  
  // Now go through every next state and print it out again
  for(let i = 0; i < state.next.length; i++){
    // recursion is fun
    // recursion is fun
    // recursion is fun
    // recursion is fun
    printState(state.next[i].state);
  }
}

// Call the function on the initial state and it should go through everything 
printState(initialState);

// Now finish up the function
console.log("\t\t}");
console.log("\t}");
console.log("}");

// Now everything is done

And with that, concludes our parser generator. Now this generator is not the best. For example, if the token doesn’t start at the beginning of the line, for example “gHello”, it will not match it. It will also only match full words, so “Helloworld” will not match for “Hello”. Its also case sensitive.

Running the program will output a function ‘parse’ in the console. The basic usage of this is as follows. parse(string to parse). It will return a number, 0 if it did not match, or a number 1 to … if it did. and n corresponds to the array position of the token + 1. For example, in this example, “Hello” returns 1 because it is the first element in the array.

The output of the example run here should look something like this:

Current token: Hello
Adding H
ID: 1
Adding e
ID: 2
Adding l
ID: 3
Adding l
ID: 4
Adding o
ID: 5
Adding end of word marker
Current token: Hi
Found branch for: H
Adding i
ID: 7
Adding end of word marker
Current token: Goodbye
Adding G
ID: 9
Adding o
ID: 10
Adding o
ID: 11
Adding d
ID: 12
Adding b
ID: 13
Adding y
ID: 14
Adding e
ID: 15
Adding end of word marker
Current token: Good
Found branch for: G
Found branch for: o
Found branch for: o
Found branch for: d
Adding end of word marker
Current token: World
Adding W
ID: 18
Adding o
ID: 19
Adding r
ID: 20
Adding l
ID: 21
Adding d
ID: 22
Adding end of word marker
Current token: Wordlist
Found branch for: W
Found branch for: o
Found branch for: r
Adding d
ID: 24
Adding l
ID: 25
Adding i
ID: 26
Adding s
ID: 27
Adding t
ID: 28
Adding end of word marker
BELOW IS THE GENERATED PARSER CODE
function parse(token){
	let i = 0;
	let state = 0;
	while(true){
		switch(state){
			case 0:
				switch(token.at(i)){
					case 'H':
						i++;
						state = 1;
						break;
					case 'G':
						i++;
						state = 9;
						break;
					case 'W':
						i++;
						state = 18;
						break;
					default: return 0;
				}
			break;
			case 1:
				switch(token.at(i)){
					case 'e':
						i++;
						state = 2;
						break;
					case 'i':
						i++;
						state = 7;
						break;
					default: return 0;
				}
			break;
			case 2:
				switch(token.at(i)){
					case 'l':
						i++;
						state = 3;
						break;
					default: return 0;
				}
			break;
			case 3:
				switch(token.at(i)){
					case 'l':
						i++;
						state = 4;
						break;
					default: return 0;
				}
			break;
			case 4:
				switch(token.at(i)){
					case 'o':
						i++;
						state = 5;
						break;
					default: return 0;
				}
			break;
			case 5:
				switch(token.at(i)){
					case undefined: return 1;
					default: return 0;
				}
			break;
			case 6:
				switch(token.at(i)){
					default: return 0;
				}
			break;
			case 7:
				switch(token.at(i)){
					case undefined: return 2;
					default: return 0;
				}
			break;
			case 8:
				switch(token.at(i)){
					default: return 0;
				}
			break;
			case 9:
				switch(token.at(i)){
					case 'o':
						i++;
						state = 10;
						break;
					default: return 0;
				}
			break;
			case 10:
				switch(token.at(i)){
					case 'o':
						i++;
						state = 11;
						break;
					default: return 0;
				}
			break;
			case 11:
				switch(token.at(i)){
					case 'd':
						i++;
						state = 12;
						break;
					default: return 0;
				}
			break;
			case 12:
				switch(token.at(i)){
					case 'b':
						i++;
						state = 13;
						break;
					case undefined: return 4;
					default: return 0;
				}
			break;
			case 13:
				switch(token.at(i)){
					case 'y':
						i++;
						state = 14;
						break;
					default: return 0;
				}
			break;
			case 14:
				switch(token.at(i)){
					case 'e':
						i++;
						state = 15;
						break;
					default: return 0;
				}
			break;
			case 15:
				switch(token.at(i)){
					case undefined: return 3;
					default: return 0;
				}
			break;
			case 16:
				switch(token.at(i)){
					default: return 0;
				}
			break;
			case 17:
				switch(token.at(i)){
					default: return 0;
				}
			break;
			case 18:
				switch(token.at(i)){
					case 'o':
						i++;
						state = 19;
						break;
					default: return 0;
				}
			break;
			case 19:
				switch(token.at(i)){
					case 'r':
						i++;
						state = 20;
						break;
					default: return 0;
				}
			break;
			case 20:
				switch(token.at(i)){
					case 'l':
						i++;
						state = 21;
						break;
					case 'd':
						i++;
						state = 24;
						break;
					default: return 0;
				}
			break;
			case 21:
				switch(token.at(i)){
					case 'd':
						i++;
						state = 22;
						break;
					default: return 0;
				}
			break;
			case 22:
				switch(token.at(i)){
					case undefined: return 5;
					default: return 0;
				}
			break;
			case 23:
				switch(token.at(i)){
					default: return 0;
				}
			break;
			case 24:
				switch(token.at(i)){
					case 'l':
						i++;
						state = 25;
						break;
					default: return 0;
				}
			break;
			case 25:
				switch(token.at(i)){
					case 'i':
						i++;
						state = 26;
						break;
					default: return 0;
				}
			break;
			case 26:
				switch(token.at(i)){
					case 's':
						i++;
						state = 27;
						break;
					default: return 0;
				}
			break;
			case 27:
				switch(token.at(i)){
					case 't':
						i++;
						state = 28;
						break;
					default: return 0;
				}
			break;
			case 28:
				switch(token.at(i)){
					case undefined: return 6;
					default: return 0;
				}
			break;
			case 29:
				switch(token.at(i)){
					default: return 0;
				}
			break;
		}
	}
}

With all that, I hope you enjoyed learning the joy of creating a parser generator. Overall this only took me a few hours to write, and it was quite fun to do so, so I really recommend doing it. I have uploaded a gist of the complete code for this parser, so use that for whatever you want.

This entry was posted in Programming and tagged , , , . Bookmark the permalink.

One Response to Making a parser generator doesn’t have to be hard.

  1. Brooks says:

    This is literally my homework assignment. It’s due Tuesday at 11:59.

Leave a Reply

Your email address will not be published. Required fields are marked *