Compiler

chapter 14



In the last tutorial, we took a C++ program file, opened it, read from it , converted all syntax to tokens , closed the file and sent the tokens to another file, finally we used the sendToparser() function to send the token to parser.cpp. So our next step in this tutorial is how the compiler translates the tokens sent to it into an assembly language code. We start by initializing some functions which we called earlier.
Next, let us discuss the different classes used by the compiler and their functions. These classes are stored in parser.h file. The symbolTable is a common class for all classes, we would discuss that later.
14.1 class tokenClass{ }
When the sendToparser() sends tokens to the compiler, before any processing the compiler will first of all store the tokens until all the tokens have been sent to the compiler and the run(0) is executed. So to store the tokens we use a class called tokenClass. We can also use a struct to store the tokens. The tokenClass is a public class and some of its members includes:TokenType type this stores the type of token e.g TOKEN_IF, TOKEN_WHILE etc. TokenType mType the mType will later be used on identifiers to store their type e.g string count => count is an identifier of type string , therefore count will stored as type= TOKEN_IDENTIFIER and mType = TOKEN_STRING. int lineNo, stores the line number which is sent alongside the tokens and would be used for error reporting to indicate the line number where an error occurred. int index this is a serial counter and stores the index of each token in the tokenClass. indexI they are used for only identifiers and stores the index of an identifier in the identifier class e.g int count => count is an identifier and is stored as TOKEN_IDENTIFIER while the name 'count' is stored in the identifier class and its index in the identifierClass is called 'indexI'. string pseudoName this will store the register name where each token is stored e.g register x10, x11 etc. The tokenClass has two constructors, which are accessed by the two sendToparser() function. One has 'indexI' to store items also in the identifier class.
14.2 class identifierClass{ }
This class is used to store identifier names. Its index is the 'indexI' and is stored in the tokenClass . It has only the string identifier that holds the identifier name. e.g int charcoal => charcoal is an identifier and is stored as TOKEN_IDENTIFIER in the tokenClass while the name 'charcoal' is stored in the identifier class at an indexI which is stored in the token class.
14.3 class funcList{ }
We need to store functions and very soon you will know why. Functions are stored in function list. The function list has public members which includes: string identifier this stores the function name. funcNo this stores the index of the function token in the token table. noArgument stores the number of argument the function contains. Vn stores the number of local variables contained in a function. Vs stores the index where the local variables of a function start in the variable class. As stores the index where the first argument of the function is stored in the argument list. Ae stores the index where the last argument of the function is stored in the argument list. call when zero, it means the function has been declared but not initialized and when one , it indicates that the function has been declared and initialized.
14.4 class lVlist{ }
This class stores all the local variables of all the functions. Its public members includes: string vIdentifier This stores the variable name. TokenType vType it stores the type variable e.g interger variable(TOKEN_VAR), string variable(TOKEN_STRING) etc. int Vno stores the local variable index amongst other variables within the same function, it will be used to assign registers to the variable. int funcIndex this stores the index of the function that contains the variable in the function list.
14.5 class gVlist{ }
This class stores global variables. It has similar members as lvlist{ } above except it does not have a function index since it does not belong to any function. It does not also contain the index since global variables are all in one class.
14.6 class argumentList{ }
This class stores lists of all the function arguments. Its members includes: TokenType argType stores the type of argument e.g func(string arg) => argtype will be equal to TOKEN_STRING. string argIdentifier stores the argument name , in the example above the name is 'arg'. int argNo stores the index of the argument within the function that called it, in the example below the argNo is zero since it is the first argument within the function. funcIndex stores the index of the function that owns the argument in the function list.
14.7 class symbolTable{ }
This class is a common class or a friend class to all the other classes. Here, all the class variables are initialized. They are set to a max of 1000 items.
Also all the indexes or counters for storing variables in the classes are also members of this class. As you can see they are a lot of indexes, which are used by each class to store data. The class constructor initializes the indexes to zero and the classes to null. It has two member functions: insert() used to insert tokens to token table and insertIdentifier() used to insert identifiers.
14.8 sendToparser()
This is the entry point of the scanner to the compiler. After token generation, tokens are sent to the compiler through the sendToparser() function. This functions and other functions that follows belong to the compiler.cpp or parser.cpp as I named it. This function takes the token type, its line number and passes it as an argument to the symbolTable insert() function. If it contains a numToken in the case of numbers, identifiers etc. it will call the symbolTable insertIdentifier() function.
14.9 void symbolTable::insertIdentifier()
This function inserts identifier names to the tokenIdentifier class. It is called by the sendToparser() when the token has a 'numToken' value. It takes as argument the identifier, then it creates an empty class called p. then passes the argument to P and adds it to the identifier class at the st.indexI and increments. The st.indexI is also stored in the token class as we would see in the next function.
14.10 void symbolTable::insert()
This function inserts token into the tokenTable class. TOKEN_STRING, TOKEN_FUNCALL, TOKEN_NUMBER, TOKEN_PIN, TOKEN_FUN, TOKEN_IDENTIFIER are the only tokens that have data for the identifier class. This function checks if the token to store belong to anyone in this group. If they belong then it inserts the 'st.indexI' value into the token class else it simply ignores it.
At this point, we have successfully inserted tokens into the tokenclass and their identifiers into the identifier class. This process of insertion continues until the last token which is TOKEN_EOF. Then the run function is called with argument of zero. Before we discuss the run function we would discuss some intermediate helper functions.
Above is a list of global variables which we would use, we would discuss them along the line of our discussion. On line 27. Is a stack named scope Count and its of type string. We use it to keep track of what block we are in example function block, IF block, WHILE block etc. The symbol table is declared on line 27 and the parser.h file is included on line 3. The rest are global variables initialized to zero.
Above are series of stack and arrays which we would also use. The registers we will discuss are RISC-V standard registers. The first array saveG[12], has list of global variables labelled g8 up to g27. These represents the global variable registers. Next is the arg[8]. This stores the argument registers used in assembly. SaveV[12], stores the saved variables registers. TempV[7], stores the temporary registers. Instruction[17], stores the RISC-V instructions we would use. Branch[6] stores branch instructions. On line 40, is another stack of type token Class named char_stack which we would also use shortly.
14.11 bool expect()
Before we talk about the run() function let us discuss the expect(). This functions compares tokens with other token in the tokenTable. It takes two arguments: one is called 'count' which is the index in the tokenTable of the token to compare while the other is the token you are comparing it with. It compares upto four tokens given their starting index. It returns true if the token in the tokenTable matches the token you are comparing it. It will be used throughout this course to compare tokens. Example; suppose we would like to check for an if statement, we can call expect as expect(TOKEN_IF, TOKEN_LEFT_BRACE, count), since 'if(' marks the beginning of an IF statement.
14.12 void error()
The error() function exits the code execution whenever an error occurs. When an error occur, the type of error and its line number are passed to the error() function which outputs the error and its line number before exiting.
14.13 void run(int count)
The run() function is the entry point of the compiler where it starts its compiling our tokens. It starts by check for function token, if it does expect a function then it calls the body() on line 1416. The count is a very important variable, it keeps track of what token we are currently reading in the token Table. The token.cpp calls run() with an argument of zero. Which means the count starts from zero which is the first item in the token table. Note that the count is not a global variable, rather it is passed around as an argument from one function to another and also as a return variable in most cases.
Example of variable declaration in C++ is 'void compile('. It starts with a return type and an identifier. Remember in token.cpp an identifier followed by a left parenthesis is no longer a TOKEN_IDENTIFIER rather it becomes a TOKEN_FUN which means a function token. Therefore, a return type token followed by a function token signifies a function declaration as seen on line 1391, 1392, 1393. These three lines checks if a return type is followed by a function token using the expect function. If false, we call the body() function. If true, the counter which is initially pointing on the return type token is incremented to point now on the function token 1394. The count which now points to the TOKEN_FUN is passed to the global 'funcNo' variable which will later be stored in the function list table. The addFunction() which is a Boolean function returns true if the function was successfully added to the function list class else it calls an error on the line number of the current token which is the TOKEN_FUN. We would also call addArgument which adds the function argument into the argument list class. Example for a function declaration given below:
void compile(int arg0, int arg1);
void => TOKEN_VOID
compile(int arg0, int arg1) => TOKEN_FUN
; => function declaration else if it is '{' => function initialized
So the addargument() will get its values from the token identifier class using the 'indexI' in the token table of the function. If a semicolon comes next then it is a function declaration else if we expect a left brace then it is a function initialized and we use the 'call' value in the function table to indicate this. If the function is initialized we also print out the name in the command line just like in assembly and call the introduce() function. An error is thrown if the function is neither declared nor initialized. We increment count to point to the next item within the function or outside the function if it was only declared. Then we push to the stack function name and the 'func_' keyword. This stack is useful as it indicates what scope we are currently in. when the function terminates using a right brace we also pop it out. The scopedepth is increased anytime we add an item to the stack because some items repeat itself example if, while, for statement etc. they appear multiple times and can be pushed many times so the scopedepth is used to different them by adding it also as part of the stack as we will see later it make sure each item is unique. It is only incremented without being attached since a function can be initialized once except for polymorphism which is not allowed here. Finally we call the body().
14.14 addFunction()
This functions adds the function name to the functionlist class. Example:
int compiler(int arg1, int arg2);
The TOKEN_FUN above has a value in the identifier table equal to => compiler(int arg1, arg2)
The addFunction() have to extract 'compiler' and add it to the function list table. On line 147 The global function 'tempCount' is set to zero. This function assigns register to function variables, therefore for every new function we add, we set it to zero. The 'funcCount' variable is used as index to store items in the function list. We start by setting the funcName to an empty string. Using 'funcNo' as the index of the TOKEN_FUN and 'indexI' as the index of its data in the tokenIdentifier class, we obtain 'compiler(int arg1, int arg2)'. The for loop, loops through all the character accumulating it in the 'funcName' until it meets the '(' then it breaks. We now have our function name and would need to insert it to our function list. On line 160 we check if such function has been declared earlier . When funcIndex is equal to zero then it means we haven't stored any function yet. If the function already exist we also make sure sure it has been initialized and report an error on the token line else we add the new function on line 169. We only add the function number and function name and set other parameters to zero. Then we increment funcCount which serves as the index and return true. In our previous function run(), we set call->1 if the function was initialized by checking for a left brace or semicolon.
14.15 addArgument()
This function will be used to add function arguments to the argument list class. We start by initializing some variables we would use. The 'As'(argument start) and 'Ae'(argument end) are two variables that are part of the function list and stores the index of the argument list where the function argument starts and end. The 'st.indexA' is the argument list index and whenever we want to add function argument we first of all set its current index as the 'Ae' as shown on line 81.
Just like in the previous function, we use the for loop to get the other part of the identifier example:
Compile(int arg0, bool arg1, string arg2)
While the addFunction() extracted compile and discarded the rest, the addArgument(), discards compile and extracts in between the parenthesis while splitting each argument. On line 77 the string funcArgs is initialized to an empty string and would be used as an accumulator for the for loop. The 't' variable which is initialized to zero, is set to one on line 86 this enables 'funcArgs ' to start accumulating on line 138. The argument has a type eg. Int, string, bool etc. when the funcArgs accumulates any of them we set the type on 118, 123, 128, 133 and if we meet a blank space we do nothing 136. 'type' is a local token variable. After accumulating the type we set the 'funcArgs' to zero and now it starts to accumulate the argument name or identifier. A comma (in the case of multiple arguments) or a right parenthesis (in case of multiple, single, empty arguments) marks the end of accumulating the identifier as will be shown on line 88, 105.
On line 88 the right parenthesis marks the end of the argument. Then if the funcArgs is empty or the type is not set then it is an empty argument function. We set the 'Ae' the same value as the 'As', increment it and return. Else if the type and funcArg is set then we add the argument to the argument table. We add the argument type, argument name, argument number and funcCount. The argumentNo is the argument index within the function. E.g:
Compile(arg0, arg1) => arg0 local index or argument number is '0' while its argument list index or st.indexA can be any value depending on how many other function argument have been added to the list. FuncCount is the index of the function that owns the arguments. 97, In assembly we move all arguments into a saved register in case we need to set new arguments for a function call. Using the argumentNo to select the register from the argV[] array and also using the argMv as index to select the saved register from the saveV[] array , we apply the move command to move the argument to the new register. We then increment st.indexA, argumentNo, and argMv while setting funcArgs back to empty string and the 'Ae' to the current st.indexA. The argMv is initialized to 5, because the argument variables will be stored from the fifth register and above while the function local variables will be stored in 0-4. If it is quite small, the designer can adjust it.
On line 105 the comma indicates that they are more arguments. The major difference with line 88 is that right parenthesis marks the end. Therefore, if the type is not and we couldn't grab the funcArgs it means the argument is invalid. Also we do not set 'Ae' since the comma means more arguments. Finally, on line 142 we save the number of argument on the function list return true and end.
14.16 Introduce()
This function is called once at the beginning of every function. According to the standard RISC-V, we allocate space on the stack to create the initial stack frame. we push all saved register, return address (can be omitted for a leaf program), and frame pointer on to the stack. Then make the frame pointer point to the beginning of the callers stack frame. Everything is easy except the saved variables. At this point we do not know how many saved variables we need to push to stack since we are still at the beginning of the function. Therefore, we need to count all the local variables before executing the above instruction.
This function takes as argument the function name and the count pointing to the opening left brace of the function. On line 1356 we increment the 'cnt' to now point to the next token withi the function and initialize a local variable 'braces' to 1. We would loop throughout the tokens until we get to the closing brace while counting the number of local variables we meet . braces is set to one indicating we counted the opening braces and anyother opening brace increments this variable and closing brace decrements it until brace is set to zero. On line 1371, if we meet TOKEN_EOF without the closing brace , then we throw an unexpected end of file error. Line 1366, 1367, 1368 indicate how a variable is declared. A token identifier type followed by a TOKEN_IDENTIFIER means a variable declaration. The argCnt keeps track of how variables we encounter. On line 1376 we reset 'cnt' because we would reuse it. Then the argCnt is multiplied by 16. Here, each variable is 16-bit . We also 32-bit for the return address variable and the frame pointer and we would create this space on the stack on line 1379. Using a loop and making sure it does not go below 32(two spaces for return address and frame pointer). We push the saved registers to the stack. On line 1385, 1386 we push the return address(X1) and frame pointer(X8) unto the stack. And finally set the frame pointer to point to the beginning of the callers stack frame. We set the Vn of the function list to 'braces'. We would use this value while closing the function, rather than count variables again we just access and use it.
14.17 terminates()
This function executes at the end of every function. It is called when a closing braces is found and a function is on top of the scope. We reverse every step taken in the introduce() function. From the function table the 'Vn' variable is fetched, this variable consists of the number of local variables times 16bits. Using a while loop we pop all the items in the stack into the saved variable register array using the cnt as the index. We make sure the loop does not go below 32its, because a fixed 32bits is used for the return address(x1) and frame pointer(x8) which is loaded from the stack, the stack is then incremented by the 'Vn' value and a jump to the return address is called.
14.18 Body()
This is the body of the compiler. Here, the compiler performs semantic analysis. It compares patterns, and calls the assigned function. The expect function plays a major role here. The body takes as argument the count variable which is passed by the run() function.1285 Then it starts with a while loop which loops until it gets the TOKEN_EOF and ends the process. Within the loop it tests for different patterns. On line 1287,1288,1289 we test for a function declaration. If it returns true, we check if the scope is empty. If it is not it means we are declaring a function within a scope(function, if, while, for etc.). This returns an error else if the scope is empty we call the run() function using the count as argument. care must be taken to make sure we pass the count around and return its new value from functions that alter it . The count serves as a pointer to the current token we are accessing within the tokenTable class.
1295 we test for function call. Major difference between function declaration (TOKEN_FUN) and function call (TOKEN_FUNCALL) is that the function call does have a return type . In token.cpp if a function is not preceeded by a return type we generate TOKEN_FUNCALL instead of TOKEN_FUN. To compile the function call, the compiler first pushes to the scope the word 'function' indicating that we are in a function call while appending the scopedepth which makes it unique. We increment the scopedepth to get a new value for subsequent push and set the global 'funcArgc' to zero since the now called function will counts its own arguments. Then we call the funExp() function using the count as an argument. This function will alter the count , therefore it will return the new count and set it as count. After this function has executed we pop the scope and set the funcArgc back to zero.
1303,1304,1305 We test for variable declaration. A variable type followed by a TOKEN_IDENTIFIER indicates a variable declaration. We call the tokenVar() function with the count as argument and the new count as the return variable.
1308 TOKEN_COUT followed by TOKEN_SHIFT_LEFT is the c++ print statement. We call the coutVar() function with the count as argument and the new count as the return variable.
1311 While compiling expression, the tenary operator is a sudden meet. It is only marked by '?' within an expression, as such we do not have a pattern for it except if we meet a '?' within an expression. Example:
F = (g*23) > d ? w+2 : 12*c;
Initially, the compiler will read: TOKEN_IDENTIFER(f) and TOKEN_EQUAL(=) and determine it is an expression, not until it gets to '?' then will simply push into the scope tenary and scopedepth indicating we are actually in a tenary operation. Next while we are in the body and tenary is on top of the scope then we call the tenary() function.
1314, An 'if' token followed by 'TOKEN_LEFT_PAREN' i.e 'if(' marks an IF statement, so we call the ifstatement() function.
1317, An identifier followed by a colon marks a label e.g 'compile:' .To compile to assembly the compiler will simply output the identifier with a colon which in assembly is a label. Then we increment the count by 2 to point to the next item after the colon.
1321, The TOKEN_WHILE followed by TOKEN_LEFT_PAREN marks a while loop. We call the whileLoop() function with count as argument and it returns the new count.
1324, The TOKEN_FOR followed by TOKEN_LEFT_PAREN marks a for loop. We call the forLoop() function.
1327 When we meet the return keyword we call the ret() function.
1330 The continue keyword followed by a semicolon marks the continue keyword. We call continueFunc() function.
1333, A TOKEN_IDENTIFIER or a TOKEN_PIN is passed in as an expression. We call the expression() function. This function handles all the compiler AL computations.
1336, For the switch statement. A TOKEN_SWITCH followed by a TOKEN_LEFT_PAREN marks the beginning of a switch statement. This calls the sw() function. A TOKEN_CASE marks the case statement and this calls caSeS() function. The default case is marked by TOKEN_DEFAULT followed by TOKEN_COLON. The break statement which transitions from one case to another is marked by TOKEN_BREAK followed by TOKEN_SEMICOLON. This break statement is different from breaks used to exit loop so that is why we also make sure that 'sw' is within the top scope. We call the swBreak() function for both the break and default option.
1347, The TOKEN_BREAK followed by a TOKEN_SEMICOLON marks a break statement. This break statement is different from the break found in the switch statement above. This one is used to exit loops. We make sure the 'sw' is not in the scope then we call the nBreak() function.
1351, A closing brace calls the rightbrace() function and finally if we can't figure out what token we are
expecting then we print out the token and call the error function. And finally when we meet the TOKEN_EOF we exit the function.
There is a function called expression(). This function handles all expressions and calculations. We would call this function very often in the subsequent unit until we finally discuss what it does.
14.19 int tokenVar(int count)
This function is called when we declare a variable: identifierType , identifier.
So we call the addVariable() which adds the value to the lVlist class. We then increment the pointer to move from the identifierType to the identifier. We then check if it is preceeded by a semi colon and we increment it by two e.g int counter;, in this case we only add the variable to the class list. If it is not a semicolon then it is an expression e.g int counter = a + b;. Then we call the expression function and pass as argument the start and end of the expression. E.g Int counter = a + b;
The count variable is currently pointing at the counter variable and that is the start of the expression and the end is the TOKEN_SEMICOLON. The expression will be responsible for compiling counter = a+b;. In a case we are not able to add the variable. We throw an error. Finally, we return the count.
14.20 int Tenary(int count)
We call this function when we meet the tenary operator within an expression e.g: F = (w > 0)? 12 : A;
Initially we called an expression to compile from count pointing at 'F' to semicolon, then we meet the tenary operator. We then call expression on (w>0) as we would see later after which we call expression on F = 12: As seen on line 1036. And finally F=A; on line 1039. For the example above, the algorithm:
  • F = 12 (Line 1036)
  • branch to tenaryend (Line 1037)
  • tenary1 (Line 1038)
  • F = A (Line 1039)
  • tenaryend (Line 1040)
On line 1041, we pop tenary from scope and set tenaryC a global variable we would use later to zero and finally return count.
14.21 int ifStatement(int counter)
If statement is expressed as :if(condition){.
Most programming language do not make the opening brace compulsory. Here, it is compulsory. On line 1048, we move the pointer from TOKEN_IF to '('. Then we use a while loop to make sure that the opening brace is there else we throw an error. paren is used to keep track of the number of the beginning '(' and close ')' of the if statement. Whenever it is 0, meaning that the opened parenthesis have all been closed and the next token is not TOKEN_LEFT_PAREN or '{' or opening brace then we throw an error. Else everything is ok, we simply push 'if' to the scope then call the expression function starting from the opening parenthesis which is the count pointer until the opening brace which ends the if expression. We return count on line 1065.
14.22 int whileLoop(int count)
The while loop is expressed as: while(condition){
We would repeat exactly the same procedure used for the if statement. The major difference is that the while loop has an increment variable which is within the body of the loop. Also you exit an if statement when you hit the closing brace but in a while loop when you get the closing brace you jump back to the beginning, only the while condition can exit the loop.
14.23 int forLoop(int count)
for(initialization; condition; pre-step){. The for loop as we can see contains three expression. The pre-step is used to increment or decrement the initialization variable and it is only calculated at the end of the loop. On line 1093 we increment the count by 2, which moves the pointer from TOKEN_FOR to the initialization expression. Next we check if the variable is being declared or just initialized. E.g for(int I = 0; condition; pre-step) and for(I = 0; condition; pre-step)..So if the variable is being declared also, we would call the addVariable function. Then for both condition we call the expression function and increment the count if the variable was declared so that the pointer moves from the identifierType. The expression starts from the current count and ends in the first semicolon. Next we push 'for' into the scope and print it also. Then we call expression() for the condition.
The pre-step is calculated at the end of the loop, so lets use the global variable forcount to mark the count at this point and then skip it. Next we push 'for' into the scope with the forcount variable, we will still extract the forcount later. Just like in if statement and while loop we test for the opening brace. But unlike the while loop and if statement, the for loop uses the count variable to test instead of a temporary variable called counter. So, this actually shifts the pointer to the opening brace unlike the rest that were not shifted at all. So we increment the count and return it.
14.24 int ret(int count)
return(expression);. We increment the pointer from TOKEN_RETURN to '('. Next we push 'return' to the scope and call expression starting from the count to TOKEN_SEMICOLON. Next an unconditional jump to the end of the function. We pop the scope and return count. The end of the function runs the terminate function which contains some preliminaries before exiting a function or returning from a function.
14.25 int continueFunc(int count)
continue;. We increment the pointer by 2, to point to the next item , then we call an unconditional jump which jumps to the beginning of the scope.
14.26 int sw(int count)
switch(switch variable){. Lets increment the count by two, this moves the pointer from TOKEN_SWITCH to the case variable. We use the global variable 'swCase' to mark the index of the switch variable. Next we increment the pointer and push 'sw_case' into the scope. We check for { case(. This token combination comes immediately after the switch. We increment by two which moves the pointer from the switch variable to the next item after the opening brace and we finally return the count.
14.27 int caSeS(int count)
case(case variable):. We increment the count to point to the '('. Then we push 'switch' to the scope and pass the case variable to expression. The expression function will then compare the switch variable to the case variable. Our case statement does not expect an opening brace. We return count.
14.28 int swBreak(int count)
break;. This TOKEN_BREAK is used to mark the end of a case statement. It is different from break in a loop. We increment it by two to jump to the next item. We end every case by jumping to the end of the switch if the case is true else we jump to the next item.
14.29 nBreak(int count)
break;. We move count 2 steps to point to the next token. Then jump unconditional to the scope before the last scope. Usually break is found within an if statement in a loop. So we remove the 'if' in scope so we can get the loop.
14.30 int rightBrace(int count)
This function treats how a closing brace is treated for each scope. We increment it to point to the next scope:
  • ELSE statement:
  • IF statement:
  • SWITCH statement:
  • WHILE loop:
  • FUNCTION
  • FOR LOOP
14.31 int coutVar(int count)
cout, expression1, expression2, expression n;. Let's increment the TOKEN_COUT this leaves the pointer on the first TOKEN_SHIFT_LEFT. We then push 'cout' to the scope. Using a while and an expect which starts at the current token and ends at TOKEN_SEMI_COLON which ends all 'cout' statement we loop the cout statement. The countsLL variable counts the number of TOKEN_SHIFT_LEFT which is equivalent to the number of items to output. If TOKEN_EOF is found within the loop we throw an error.
Using countSLL in a loop, we call the expression for all the times to print. CounterEnd is used as the new count and it returns a new value every time it calls expression. The expression starts at the current counterEnd and ends at the next TOKEN_SHIFT_LEFT except the last item which is when countSLL is equal to 1, then the expression ends at the TOKEN_SEMICOLON. Finally, the scope is popped, funArgc is set back to zero, the global cout function is called and count+1 is returned.
14.32 void addExpressList()
This function would be used to add new tokens to the expressList.
14.33 void addTokentable()
This function is used to add new tokens to the tokenTable.
14.34 int funcExp(int count)
TOKEN_FUNCALL. This function is called when we call a function. Eg. Swap(), control() etc. First some local variables are initialized.
Next, check if the has been initialized. We take count which points on the called function as index to the token table. The indexI is retrieved and used to get the function name from the tokenIdentifier table. This value is compared against all the function names in the functionList table. FuncIndex is the index in the function table and is equal to funcCount – 1. If the function exists, we increment the funcNexist variable, then retrieve the number of argument it has and break out of the loop.
Next, if funcNexist is equal to zero, it means the function has not been declared, so an error is thrown. We increment the count to point at the opening parenthesis after the function name and this index is stored in the global variable storeC. We increment count again to point to the first item within the function parenthesis.
Check if tempCount is greater than zero. This variable keeps track of local variables. If greater than zero it means, we have local variables. Then multiply by 16bits and allocate space then push to stack.
Just like we did in coutVar() function, push all the function argument to the expression() function. When they are more than one argument we call the expression function with the counter as start and TOKEN_COMMA as the end. When we are on the last argument we use the TOKEN_RIGHT_PAREN as the end in the expression.
Read this last part carefully, and compare it with line 861-864 in the expression function. E.g
F = 25 + SWAP(25-C, 8*2-W) – 25; The expression above is assumed as an ordinary expression until the count indexes the swap function and calls funcExp(). Then as we saw earlier, using a loop 25-C and 8*2-W were both sent to the expression() function. A new token is added to the token table at index storeC which is the index of '(' and TOKEN_IDENTIFIER is added at index storeC – 1 which is the TOKEN_FUNCALL and the last token TOKEN_START is added at the current counter-1 which is index of ')'.
When funcExp returns to the expression() function that called it. The expression will skip from TOKEN_START to TOKEN_END which is the function argument and the funcCall will now be TOKEN_IDENTIFIER which is the result of the funcCall. Already in line 743, we called a jump to the function name.
14.35 void checkIdentifier(int index)
This function is used to assign registers to variables in the expressList. The expression function will push some variables into the expressList as we would see later. But before they are pushed this function assigns variables to them. We first check if we are in a function then we check local variables else we skip and check global variables. If we are in the function then we get from the funcList the index in the argumentList where the function argument starts (As) and where it ends (Ae), this customizes our search to only the arguments of that particular function. Then also the index in the localVariable list where the function variable starts (Vs). Using a while loop , loop through the argumentTable and localVariable table using the Ae, As, Vs now endA, startA, startV as index in the loop. St.indexV is the current index of the localVariable list. Compare them with the tokenIdentifier[tokenTable] using the checkIdentifier argument as the index. If a match is found we return from the function and then also assign pseudoName which will hold the register name and the mType which will hold identifier type to the stored values in the table. The variable register are found in the saveV[] array while the argument variables are found in the argV[] array. ArgV[] from 5 above are used to store argument.
Same procedure is repeated for the global variable, but here we check the whole list. I will still optimize this part to use binary search. If we make it to the bottom of the function without returning, then we throw an error of undefined variable.
14.36 int Expression()
The expression() function is the main function used to compile arithmetic and logical expression. This function uses the postfix notation to solve an expression e.g:
Given F= 25*2+8/2-7(A-9)+3*B-C;
  • The expression function takes two arguments 1. The index of the first token 2. The last token
  • 1. From our example above the index of 'F' is 0. And the last token is TOKEN_SEMICOLON
  • 2. We use a while loop to count the total number of tokens in our case = 23
  • Therefore our tokens start from 0(counterStart) and ends at 23(counterEnd).
  • Looping through our tokens from counterEnd down to counterStart using the counterEnd as the token index:
  • The last token is ignored, in our case it is TOKEN_SEMICOLON.We rearrange all the tokens and push them into the expressList.
  • Next 'C' is pushed into the expressList. '-' is pushed into the stack. 'B' is pushed into the expressList. '*' has higher precedence than '-' which is ontop of the stack so it is moved into the stack also. '3' is moved into the expressList.
  • expressList => C,B,3. STACK=>*, -.
  • Next is '+' which is lower than top of the stack. So we pop everything in the stack starting from the top until the top of stack is lower or equal to '+'. Then we push '+'
  • expressList => C,B,3,*, -. STACK=> +
  • ')' is pushed to stack. It marks a beginning of new items. '9' is pushed to expressList. '-' is pushed to stack. 'A' is pushed to expressList. '(' will pop everything in stack into the expressList, until we meet ')', then we pop it.
  • Before::: expressList => C,B,3,*, -, 9,A. STACK=> (, -, ), +
  • After:::: expressList => C,B,3,*, -, 9,A,-. STACK=> +
  • Whenever an identifier is directly adjacent to '(', or ')' add a '*' to it e.g 7(A-9)4 becomes 7*( A-9)*4
  • Therefore '*' is added to the stack. '7' is added to the expressList. '-' is lower than '*' which is on top of the stack, therefore we pop the stack into the expressList until we meet a lower operator than '-'.
  • Before:::: expressList => C,B,3,*, -, 9, A,-, 7. STACK=> *, +
  • After:::: expressList => C,B,3,*, - , 9, A,-, 7, * . STACK=>-, +.
  • '2' is added to the expressList. '/' is added to the stack. '8' is added to expressList. '+' is lower than top of stack so we pop top of stack.
  • expressList => C,B,3,*, - , 9, A,-, 7, *, 2, 8, /, -. STACK=>+, +.
  • '2' is pushed to expressList. '*' is pushed to stack. '25' is pushed to expressList. '=' has the lowest precedence we pop everything in stack.
  • expressList => C,B,3,*, - , 9, A,-, 7, *, 2, 8, /, -, 2, 25, *, +, +. STACK=> =
  • 'F' is pushed to expressList. When we reach the end we pop all item in the stack into the expressList.
  • expressList => C,B,3,*, - , 9, A,-, 7, *, -, 2, 8, /, 2, 25, *, +, +, F, =. STACK=> empty
  • The expressList above will further be compiled when we discuss the readExpressList() function.
  • Using a while loop we loop starting from the first token which is TOKEN_IDENTIFIER('f') until we get to the terminating token, TOKEN_SEMICOLON while incrementing count. The loop also makes sure that parenthesis are all closed before terminating by making sure braces=0. Within the loop if we expect TOKEN_LEFT_PAREN we increment braces and if we meet TOKEN_RIGHT_PAREN we decrement braces until braces== 0. In the loop, If we expect TOKEN_EOF we throw an error of unexpected end of file. Also within the token list enum , Tokens above TOKEN_OUTPUT are not expected in an expression e.g includes TOKEN_IF, TOKEN_WHILE and all tokens above TOKEN_OUTPUT. TOKEN_SEMICOLON is also not expected since it is the terminating token of the loop and the expression.
  • If we suddenly expect '?' or tenary operator. E.g F= (25 > 2) ?: a-2 : c*3 ;
    Remember our loop was counting from 'f' to TOKEN_SEMICOLON before meeting a tenary operator and for tenary operations we have to evaluate (25>2) first to determine what value to assign to 'f'. So, we first of all push the 'tenary' as a string into the scopeCount and increment scopedepth and at the end of the tenary operation we pop it out. Then we use the global variable called tenary to store the 'F' token index which is counterStart. We then increment counterStart by two to now point '('. Since the current count is at '?'. We break out of the loop. Therefore, the loop has counted from '(' to ')'. Therefore, we have counted (25 > 2) which is the first expression we need to compile before determining the value of 'F'.
  • Assuming we meet a function call within an expression E.g F= 2*a-c+compile(c, d)-14 ;
    When we meet this compile function, we first push string 'function' into the scopeCount and increment scopedepth, then we set the funcArg to zero so as to count new functions argument. We call funExp() function , this function compiles the function call and also alters the count, so we use the new count value as its return value and also pass the current count value as an argument to it. At the end of this function the count will now point at '-'. We decrement the count then pop the scopeCount and set funcArgs back to zero and continue counting in the loop.
  • We also check for unwanted tokens and finally if the braces == 0 and we expect our terminating token we simply break the loop.
  • At the end of the loop, we decrement the count to point to the last value before the TOKEN_SEMICOLON, we also set the counterEnd as the new count value while the counterStart is the start value before the loop. This current count value will be returned to its calling function.
  • Remember we are using the postfix notation e.g: F = 25*a+16-b-(12/a);
    In postfix notation the expression above becomes: 'a12/b16a25*+- - F='. To solve the postfix expression, we loop again from counterEnd down to counterStart. The position of F is given as counterStart while ')' is given as counterEnd. We use a new tokenClass called expressionList whose index is 'st.index' indicating it is a variable of symbolTable. To solve the expression above we start from ')'. We first push it to the stack. Next is 'a' since it is a variable (or string or number, or bool) we push it to the expressList. Next is '/', we push it to the stack. When we meet '(', we pop everything in the stack until we get to ')', and push it to the expressList. So the expressList now becomes 'a12/', we do not push parenthesis into the expressList. Next is '-', we push it to the stack. Next is a variable 'b' we push it to the expressList. Next is '-' we push it to the stack. '16', will be pushed to the expressList. '+', will be pushed to the stack since it has higher precedence than '-'. 'a' will be pushed to the expressList. Multiplication will be pushed to the stack. '25' will be pushed to the expresslist. '=' has lower precedence than -, +, so we pop them from the stack into the expressList till its empty. So the expresslist will now be 'a12/b16a25*+--'. We now push '=' into the stack. Then finally 'f' is pushed into the expressList. Now we are done we pop all variable from stack to the expresslist. Which now becomes 'a12/b16a25*+- -F='. Take note of this result we will solve it in readExpressList() function but for now this what the expression() function does, which is to arrange operators and variables into the expressList. Assuming we want to push a sign to the stack and the stack top is of lower precedence, then we pop it to the expressList and re-check again until the top of the stack have a lower precedence before pushing it in. So this algorithm is what we would solve now. String expressions can only be of the form : e.g string car = "car";=> (TOKEN_STRINGVAR, TOKEN_IDENTIFIER, TOKEN_EQUAL, TOKEN_STRING).
    . So we check for string expression that do not obey this rule.
  • TOKEN_IDENTIFIER are function variables. Within this new while loop, whenever we meet a TOKEN_IDENTIFIER we assign register value to it using the checkIdentifier() function.
  • Tokens less than TOKEN_BANG are Arithmetic and logical operators. When a plus appear with another operator e.g 2-+2, 2*+2, 2/+2. Then the plus sign is ignored. If a minus sign appears with another operator. Then we push all signs less than TOKEN_MINUS in the stack into the express list and then add zero to the expresslist eg. 2*-3 will now become 2*0-3. So what we did is to insert a zero before the minus, if we do not do this the readExpressList() function as we would see later wont function properly.
  • The TOKEN_START and TOKEN_END will be discussed later. TOKEN_COLON, TOKEN_LEFT_BRACE and TOKEN_TENARY are simply ignored since they are not part of an ALU logic.
  • To use the Boolean values 'TRUE' and 'FALSE' in the expressionList we first of all convert them to TOKEN_NUMBER, set TOKEN_FALSE to '0' and TOKEN_TRUE to '1'.
  • Next in case we have an expression in which the right or left parenthesis is directly after or before a number or pin or identifier then we append a multiplication to it. E.g 5(2-d) becomes 5*(2-d) or (w+3)y becomes (w+3)*y.
  • If we have closing parenthesis we push it to the stack and decrement the braceC. This identifier is used to make sure all parenthesis was closed. The pushed parenthesis will mark a separate expression and isolates the current expression we are calculating.
  • Then when we have an opening parenthesis, since we checking from the back we would increment the braceC. This marks the end of the isolated expression, so we pop all items on the stack until we meet a closing parenthesis. If we do not meet a closing parenthesis until the stack is empty, then we throw an error of missing closing parenthesis.
  • Tokens less than TOKEN_LEFT_PAREN from the list are all operators either arithmetic or logical. So when we meet an operator we compare its precedence with the operator on top of the stack. If it is greater than the top of the stack we simply add it to the stack else if it is less than the operator on top of the stack, we pop the top of the stack and push it to the expression list until we get an operator that it is greater than and simply add it to the stack. We also stop when we meet a ')', which marks an isolated expression or when the stack is empty.
  • We decrement the counterEnd and loop again.
  • At the end of the Loop, we pop the rest of the items on the stack into our expressList except the parenthesis since they are not part of an AL operation.
  • Now at this point all our expression variables have been rearranged into the expressionList. Then for expressions in: return statement, tenary operation, function call, cout and switch statement they are some extra operations before we finalize the calculation.
  • Remember when we encountered the tenary operator in an expression eg: F = (W > 0)? C-2 : C+2;
    When We started by first solving it like a normal expression until we met the ''?" operator , then we marked position of 'F' to a global variable 'tenary'. Because we will still solve the expression F = C – 2 and F = C + 2. So we need the position of F. Next we first incremented counter to start from '(' to ')', thereby we solved (W>0) at this point another global variable 'tenaryC' which is initialized to zero indicates that we are solving the tenary condition, after which it increments. As we will see later, the tenary function will call expression again to solve C - 2 and C + 2 and at this point tenaryC has already been incremented indicating we are now solving the values. Therefore, we would add 'F=' to the expressionList which completes each of the two conditions.
  • Next, the return statement. In RISC-V and most compilations. The return statement argument is moved to the argument register before a function call. This register is 'x10' which is the first register in the argV[] array. For example : return(0);
    When we pass '0' to the expression function we need to convert it to x10 = 0
    Therefore, for every return statement we generate two tokens and add to the expression list . These tokens are TOKEN_EQUAL and TOKEN_IDENTIFIER whose mType = 'x10'.
  • Just like in return statement. All function arguments are passed to the expression() function and assigned to the argument register e.g swap(f-2, 25, w+14);
    The function above has three arguments and each of them is assignment to an argument register by generating and adding two tokens to the expressionList. TOKEN_EQUAL and TOKEN_IDENTIFIER which is the argument register. The funArgC is used to select a new register for each argument that gets passed to the expression() function.
  • When we call a 'cout' all values are pushed to an argument just like in function call.
    X10 = "hello world", X11 = swap(), X12 = X, X13 = 25.
  • Finally, in a switch statement. e.g
    switch(a){case(1); case(2); case(3);}
    The index of 'a' is set as a global variable called 'swCase'. Therefore, whenever we are solving the case statement we always assign the swCase to each case. i.e swCase = 1; swCase = 2; swCase = 3;
    Therefore, for every case statement we add 'swCase' and TOKEN_EQUAL to the expressionList.
  • Finally, we call the readExpressList() function with the st.indexE as the start and 0 as the end. We increment the count to point to the next TOKEN and return the count.
14.37 void readExpressList(int index, int indexEnd)
  • expressList => C,B,3,*, - , 9, A,-, 7, *, -, 2, 8, /, 2, 25, *, +, +, F, =. STACK=> empty
  • Above is the expressList we calculated. The readExpressList function will loop continuously through the list searching for the combination. identifier, identifier, operator , extract it and replace it with TOKEN_OUTPUT. Identifier represents all tokens that are not operators including: TOKEN_NUMBER, TOKEN_STRING, TOKEN_BOOL, TOKEN_IDENTIFIER etc.
  • It starts from index 0. First combination is B, 3, *. It extracts it from the expressList to the call() function. So expressList becomes:
  • TO => TOKEN_OUTPUT.
  • expressList => C, TO, - , 9, A,-, 7, *, -, 2, 8, /, 2, 25, *, +, +, F, =
  • expressList => TO , 9, A,-, 7, *, -, 2, 8, /, 2, 25, *, +, +, F, =
  • expressList => TO , TO, 7, *, -, 2, 8, /, 2, 25, *, +, +, F, =
  • expressList => TO , TO, -, 2, 8, /, 2, 25, *, +, +, F, =
  • expressList => TO , 2, 8, /, 2, 25, *, +, +, F, =
  • expressList => TO , TO, 2, 25, *, +, +, F, =
  • expressList => TO , TO, TO, +, +, F, =
  • expressList => TO , TO, +, F, =
  • expressList => TO , F, =
  • expressList => TO
  • So, this is how the readExpressList() function reduces the expressList until it is empty or remaining one TO.
  • Unary operators: !, ++, --. The expressList also traces their patterns identifier, unary operator.
  • The function takes two arguments. 1. The start index in the expressList and the end index. When the expression function first calls it. The start is usually the total number of items in the list and the end is zero.
  • The function calls three other functions. 1. CheckBang() – Used to compile '!' or TOKEN_BANG. 2. CheckLor() – Used to compile '||' or TOKEN_LOGICAL_OR. 3. CheckLand() – Used to compile 'and' or TOKEN_LOGICAL_AND.
  • These three tokens above alter the algorithm used to compile the expressList, therefore these special functions are used to compile them separately.
  • The for loop, loops from indexEnd down to index. The first pattern is for unary operators (++,--). It searches for identifier, unary operator. Identifier => TOKEN_PIN, TOKEN_NUMBER, TOKEN_OUTPUT, TOKEN_IDENTIFIER.
  • If we spot a pattern, then we pick the tokens and their index, 'i' into the call function which compiles it. Then using another for loop we delete them from the express list.
  • We also check for the second pattern. identifier, identifier, operator. Identifier represent all tokens in the expressList greater than TOKEN_SEMICOLON. They include the ones listed above.
  • We call the call() function again inserting the tokens with their index.
  • Tokens which are less than TOKEN_LOGICAL_OR in the token table are conditional operators. They do not return any output therefore its three combination are completely deleted eg. W > A has no output, so delete all. But 2 + r, returns an output so we delete two tokens from the combination and use the last one to create TOKEN_OUTPUT for it.
  • Most times, in a conditional statement and loops we have a single Boolean variable e.g if(n) or while(n), or f = (n) ? 0:1, etc ; where n is a Boolean value. We compile it on line 629
  • Finally if the difference between the start and finish is greater than '1', we call the readExpressList again.
14.38 int checkBang(int index, int indexEnd)
We call this function when the expressList contains TOKEN_BANG or '!'. Assume we have :
  • !(p > 3 || n); where n is a Boolean variable. After passing through the expression() function. Our expressList will contain.
  • n , 3 , p , > , || , !
  • The function will loop through the expressList searching for the '!' token. If unsuccessful it returns the original index else, it calls another function called countBang() using the index of '!' as argument. If they are more than one instances '!', it will call countBang all the time. After that it deletes the '!' token using a for loop and returns the new index.
14.39 int countBang(int index)
  • n , 3 , p , > , || , !
  • The expressList is given above. And the index of '!' is given as argument to the function.
  • We use one variables. Operand which keeps track when the expression ends. To do this, we loop through the expressList whenever we meet an identifier we increment operand. When we meet an operator we decrement operand.
  • from our expressList above, looping from right to left:
  • ||: operand = -1.
  • >: operand = -2
  • p: operand = -1
  • 3: operand = 0
  • n: operand = 1
  • Whenever operand is equal to one we terminate the loop. If the identifier is a Boolean variable we apply the '!' to it by calling the call function and passing the token and the our index token which is '!' to the call function. Tokens below TOKEN_STAR are all identifiers and tokens below TOKEN_LOGICAL_OR are all logical operators. If we meet logical operators we call changeSign function , this function inverts the logical operators.
  • This is how we treat '!' in our expressList.
14.40 void changeSign(int i)
This function inverts logical operators whenever it is preceeded by '!' or TOKEN_BANG.
14.41 int checkLand(int index, int indexEnd)
  • This function compiles '&&' or TOKEN_LOGICAL_AND. Just like above, it loops through the expressList searching for the operator. If unsuccessful it returns without doing anything. If successful, it deletes '&&' using a for loop, then it returns the index by subtracting countAnd() function. The countAnd() function takes as argument the index of '&&' and returns the total number of compiled tokens which will be deducted from the main index and returned to the calling function.
14.42 int countAnd(int index)
This function compiles '&&' expression, it takes as argument the index of '&&'. It uses two operators. Operand that tracks the end of the expression and counter that tracks the number of compiled tokens.
  • given expressList => n, 3, p, >, &&
  • The expressList is given above.
  • Loop through the expressList whenever we meet an identifier we increment operand. When we meet an operator we decrement operand. If the operand is one it means we have hit one half of the '&&' operator expression, we set operator to zero and start afresh.
  • from our expressList above, looping from right to left:
  • >: operand = -1
  • p: operand = 0
  • 3: operand = 1
  • n: operand = 1
  • Every '&&' operator have two halves. part1 && part2. Whenever we count the second part we terminate the loop to avoid adding tokens that do not belong the '&&' operation. The variable 'tn' keeps track and exits the loop when tn == 2. By doing 'Index-I', we calculate the number of tokens in each part and when we are done we add them up as compiled tokens and subtract them from the original tokens. Whenever operand is equal to one, it means we have counted one part. Then we take that part to the readExpressList to compile and set operand back to zero.
14.43 int checkLor(int index, int indexEnd)
  • CheckLor (check logical Or) and checkLand(check Logical And) are two similar functions. The former compiles expression with '||' while the later compiles expression with '&&'. CheckLor takes as argument the start and end index of the tokens in the expressList. Then it loops through the expressList starting from the start index to the end searching for '||'.
  • Once the index is spotted, Using a for loop, it is deleted. Just like in checkLand, here the countSignO() is called using the index as argument to compile it. This function returns the number of compiled tokens which is subtracted from the main index and returned to the calling function.
14.44 int countSignO(int index, int countR)
  • Given, if(x0 || x1 || x2 || x3)
  • expressList: x3, x2, x1, x0, ||, ||, ||
  • The expression above is compiled as:
  • bgt x0, 0, cOr1
  • bgt x1, 0, cOr1
  • cOr1:
  • bgt x2, 0, cOr2
  • bgt x3, 0, cOr2
  • cOr2:
  • the rest of the code follows:
14.45 void call(token1, token2, token3, i)
This function outputs the compiled tokens to the command line.
  • It takes three four arguments. Three consecutive tokens that matched the compile pattern. identifier, identifier, operator. The last argument is the index of the first token.
  • Strings only uses the assignment operator '='. Anyother operator gives error.
  • The second token register is used to store computation and registers are only assigned to identifiers. Therefore if the second token is a number or pin we would assign a temporary register to it, and add the value in the identifier table into the register using 'addi'. Also, it is converted to TOKEN_OUTPUT since it would contain result of a computation.
  • Next we compile arithmetic instructions. The 'instrCall' holds the operation and is selected from the instruction array using 'token3' – 1. The 'storeReg' is the destination register, while 'reg1' and 'reg2' are the two computed registers. We add 'I' to the instruction name if the token1 is a number, this stands for 'immediate'.
  • We use the load and loadi (ld, li) as the instruction for assignment operators. String variables are also assigned here.
  • Finally we compute logical operations.
14.46 void call(token1, token2, index)
  • This function compiles unary operations. Strings are not allowed in unary operations.
  • Three operators ++, --, ! are compiled . ++ and – are only applied to identifiers whose type are number or TOKEN_VAR, while ! is applied to identifiers whose type is boolean else an error is called. For '!' operator we simply xor the register with xFFFF.


Tokenizer Loading...