What is a Lexical Analyzer?
We'll write a lexical analyzer in C, that is, a C program that detects the tokens in a line of C code. The lexical analyzer, sometimes referred to as a scanner, is the first phase of a compiler. It converts the input program into a series of tokens and passes them to the syntax analyzer.
A token is the smallest meaningful unit of code. It can be a keyword, identifier, constant, string literal, or symbol (an operator or a separator).
Below are some examples of the token kinds in C.
1) Keywords:
Examples- for, while, if, return etc.
2) Identifier:
Examples- variable name, function name etc.
3) Operators:
Examples- '+', '++', '-' etc.
4) Separators:
Examples- ',', ';' etc.
Lexical Analyzer in C Program
Below is the code for a lexical analyzer written in C.
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

// Returns 'true' if the character is a DELIMITER.
bool isDelimiter(char ch)
{
    if (ch == ' ' || ch == '+' || ch == '-' || ch == '*' ||
        ch == '/' || ch == ',' || ch == ';' || ch == '>' ||
        ch == '<' || ch == '=' || ch == '(' || ch == ')' ||
        ch == '[' || ch == ']' || ch == '{' || ch == '}')
        return (true);
    return (false);
}

// Returns 'true' if the character is an OPERATOR.
bool isOperator(char ch)
{
    if (ch == '+' || ch == '-' || ch == '*' ||
        ch == '/' || ch == '>' || ch == '<' ||
        ch == '=')
        return (true);
    return (false);
}

// Returns 'true' if the string is a VALID IDENTIFIER.
bool validIdentifier(char* str)
{
    if (str[0] == '0' || str[0] == '1' || str[0] == '2' ||
        str[0] == '3' || str[0] == '4' || str[0] == '5' ||
        str[0] == '6' || str[0] == '7' || str[0] == '8' ||
        str[0] == '9' || isDelimiter(str[0]) == true)
        return (false);
    return (true);
}

// Returns 'true' if the string is a KEYWORD.
bool isKeyword(char* str)
{
    if (!strcmp(str, "if") || !strcmp(str, "else") ||
        !strcmp(str, "while") || !strcmp(str, "do") ||
        !strcmp(str, "break") || !strcmp(str, "continue") ||
        !strcmp(str, "int") || !strcmp(str, "double") ||
        !strcmp(str, "float") || !strcmp(str, "return") ||
        !strcmp(str, "char") || !strcmp(str, "case") ||
        !strcmp(str, "sizeof") || !strcmp(str, "long") ||
        !strcmp(str, "short") || !strcmp(str, "typedef") ||
        !strcmp(str, "switch") || !strcmp(str, "unsigned") ||
        !strcmp(str, "void") || !strcmp(str, "static") ||
        !strcmp(str, "struct") || !strcmp(str, "goto"))
        return (true);
    return (false);
}

// Returns 'true' if the string is an INTEGER.
bool isInteger(char* str)
{
    int i, len = strlen(str);

    if (len == 0)
        return (false);
    for (i = 0; i < len; i++) {
        if ((str[i] != '0' && str[i] != '1' && str[i] != '2' &&
             str[i] != '3' && str[i] != '4' && str[i] != '5' &&
             str[i] != '6' && str[i] != '7' && str[i] != '8' &&
             str[i] != '9') || (str[i] == '-' && i > 0))
            return (false);
    }
    return (true);
}

// Returns 'true' if the string is a REAL NUMBER.
bool isRealNumber(char* str)
{
    int i, len = strlen(str);
    bool hasDecimal = false;

    if (len == 0)
        return (false);
    for (i = 0; i < len; i++) {
        if ((str[i] != '0' && str[i] != '1' && str[i] != '2' &&
             str[i] != '3' && str[i] != '4' && str[i] != '5' &&
             str[i] != '6' && str[i] != '7' && str[i] != '8' &&
             str[i] != '9' && str[i] != '.') ||
            (str[i] == '-' && i > 0))
            return (false);
        if (str[i] == '.')
            hasDecimal = true;
    }
    return (hasDecimal);
}

// Extracts the SUBSTRING str[left..right] into newly allocated memory.
char* subString(char* str, int left, int right)
{
    int i;
    char* subStr = (char*)malloc(sizeof(char) * (right - left + 2));

    for (i = left; i <= right; i++)
        subStr[i - left] = str[i];
    subStr[right - left + 1] = '\0';
    return (subStr);
}

// Parsing the input STRING.
void parse(char* str)
{
    int left = 0, right = 0;
    int len = strlen(str);

    // right < len keeps every str[right] read inside the string
    while (right < len && left <= right) {
        if (isDelimiter(str[right]) == false)
            right++;

        if (isDelimiter(str[right]) == true && left == right) {
            if (isOperator(str[right]) == true)
                printf("'%c' IS AN OPERATOR\n", str[right]);

            right++;
            left = right;
        } else if ((isDelimiter(str[right]) == true && left != right)
                   || (right == len && left != right)) {
            char* subStr = subString(str, left, right - 1);

            if (isKeyword(subStr) == true)
                printf("'%s' IS A KEYWORD\n", subStr);
            else if (isInteger(subStr) == true)
                printf("'%s' IS AN INTEGER\n", subStr);
            else if (isRealNumber(subStr) == true)
                printf("'%s' IS A REAL NUMBER\n", subStr);
            else if (validIdentifier(subStr) == true
                     && isDelimiter(str[right - 1]) == false)
                printf("'%s' IS A VALID IDENTIFIER\n", subStr);
            else if (validIdentifier(subStr) == false
                     && isDelimiter(str[right - 1]) == false)
                printf("'%s' IS NOT A VALID IDENTIFIER\n", subStr);

            free(subStr); // release the substring once it is classified
            left = right;
        }
    }
    return;
}

// DRIVER FUNCTION
int main()
{
    // maximum length of string is 99 characters plus the '\0'
    char str[100] = "";

    printf("Enter the String: \n");
    // the field width keeps scanf() from overflowing str
    scanf("%99[^\n]", str);

    parse(str); // calling the parse function

    return (0);
}
Sample Input and Output
Input
int a = b + 4c
Output
'int' IS A KEYWORD
'a' IS A VALID IDENTIFIER
'=' IS AN OPERATOR
'b' IS A VALID IDENTIFIER
'+' IS AN OPERATOR
'4c' IS NOT A VALID IDENTIFIER
Explanation of Code
Main Function
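int main()
{
    // maximum length of string is 99 characters plus the '\0'
    char str[100] = "";

    printf("Enter the String: \n");
    // the field width keeps scanf() from overflowing str
    scanf("%99[^\n]", str);

    parse(str); // calling the parse function

    return (0);
}
Everything in main() is ordinary C. The only part that may need explanation is the format string passed to scanf(), so let's look at it.
%[^\n] is a scanset conversion. It resembles a regular-expression character class, but it is part of the scanf() format syntax.
[...]: matches a nonempty sequence of characters from the scanset (the set of characters listed between the brackets).
^: means that the scanset is “negated”, so it matches every character that is not listed.
^\n: the scanset is all characters except \n.
In simple words, %[^\n] means read until a “\n”, that is, until the Enter key.
Note that the conversion is %[^\n] on its own; no s follows it. A field width such as %99[^\n] keeps scanf() from writing past the end of the 100-byte array, and the array is passed as str, not &str.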
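We could read the line with gets() instead of scanf(), but gets() cannot limit how much it reads and was removed from the language in C11. fgets() is the safe alternative. It keeps the trailing newline, so we strip it before parsing. With fgets() the code looks something like this:
fgets(str, sizeof(str), stdin);
str[strcspn(str, "\n")] = '\0';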
isDelimiter
bool isDelimiter(char ch)
{
    if (ch == ' ' || ch == '+' || ch == '-' || ch == '*' ||
        ch == '/' || ch == ',' || ch == ';' || ch == '>' ||
        ch == '<' || ch == '=' || ch == '(' || ch == ')' ||
        ch == '[' || ch == ']' || ch == '{' || ch == '}')
        return (true);
    return (false);
}
First, what is a delimiter? A delimiter is a character that marks the boundary between pieces of text. Commas (,), semicolons (;), quotes (“, ‘), braces ({}), pipes (|), and slashes (/) are all common delimiters. When software saves sequential or tabular data, it uses a preset character to delimit each item of data. In our lexer, a delimiter is any character that ends the current token: a space, an operator, a comma, a semicolon, or a bracket.
== means “equal to”. When ch is equal to any of the listed delimiters, the function returns true; otherwise it returns false.
isOperator
bool isOperator(char ch)
{
    if (ch == '+' || ch == '-' || ch == '*' ||
        ch == '/' || ch == '>' || ch == '<' ||
        ch == '=')
        return (true);
    return (false);
}
It works like isDelimiter(). When ch is equal to any of the listed operators, it returns true; otherwise false. Because it looks at one character at a time, a two-character operator such as ++ is reported as two separate + operators.
validIdentifier
bool validIdentifier(char* str)
{
    if (str[0] == '0' || str[0] == '1' || str[0] == '2' ||
        str[0] == '3' || str[0] == '4' || str[0] == '5' ||
        str[0] == '6' || str[0] == '7' || str[0] == '8' ||
        str[0] == '9' || isDelimiter(str[0]) == true)
        return (false);
    return (true);
}
This function checks whether the string is a valid identifier. An identifier cannot begin with a digit or a special character (delimiter), so the function looks only at the first character of the string.
If the first character is a digit or a delimiter, the function returns false; otherwise it returns true.
isKeyword
bool isKeyword(char* str)
{
    if (!strcmp(str, "if") || !strcmp(str, "else") ||
        !strcmp(str, "while") || !strcmp(str, "do") ||
        !strcmp(str, "break") || !strcmp(str, "continue") ||
        !strcmp(str, "int") || !strcmp(str, "double") ||
        !strcmp(str, "float") || !strcmp(str, "return") ||
        !strcmp(str, "char") || !strcmp(str, "case") ||
        !strcmp(str, "sizeof") || !strcmp(str, "long") ||
        !strcmp(str, "short") || !strcmp(str, "typedef") ||
        !strcmp(str, "switch") || !strcmp(str, "unsigned") ||
        !strcmp(str, "void") || !strcmp(str, "static") ||
        !strcmp(str, "struct") || !strcmp(str, "goto"))
        return (true);
    return (false);
}
To follow this function, we need to understand strcmp(). The C library function int strcmp(const char *str1, const char *str2) compares the string pointed to by str1 with the string pointed to by str2.
Its return value indicates how the two strings compare:
- < 0: the first character that does not match has a lower value in str1 than in str2.
- 0: the contents of both strings are equal
- > 0: the first character that does not match has a greater value in str1 than in str2.
It is common for comparison functions to return 0 on “equals” so that they can also return a negative number for “less than” and a positive number for “greater than”. strcmp() and memcmp() work like this.
In C conditions, however, zero is false and nonzero is true, because that is how the flow control and logical operators work. That is why we use the ! operator. ! means logical NOT: it yields 1 only if its operand is 0.
So when the string is equal to one of the keywords, strcmp() returns 0, the ! operator turns it into 1, and the if condition is satisfied. An if statement runs its body only when the condition is true (nonzero).
So when the string is equal to a keyword the function returns true; otherwise it returns false.
If you are having trouble with this section, the code below shows how an if statement treats 0 and 1.
#include <stdio.h>

int main()
{
    int i = 0; // Change to 1 to see the magic

    if (i) {
        printf("Hello World");
    }
    return 0;
}
isInteger
bool isInteger(char* str)
{
    int i, len = strlen(str);

    if (len == 0)
        return (false);
    for (i = 0; i < len; i++) {
        if ((str[i] != '0' && str[i] != '1' && str[i] != '2' &&
             str[i] != '3' && str[i] != '4' && str[i] != '5' &&
             str[i] != '6' && str[i] != '7' && str[i] != '8' &&
             str[i] != '9') || (str[i] == '-' && i > 0))
            return (false);
    }
    return (true);
}
First, we check whether the string length is 0. An empty string cannot be an integer, so the function returns false straight away.
Otherwise the for loop visits every character. The if condition is true when the character is not one of the digits 0 to 9, or when it is a minus sign (-) at any position other than the first. In either case the function returns false.
You may ask why we test for minus (-) here when isOperator() already handles it. It is just a precautionary measure. A minus sign is not a digit, so the first part of the condition already rejects it, and the code behaves the same without the || (str[i] == '-' && i > 0) part. You can try it.
So, in simple words, this function checks for digits. If the string contains any character other than a digit it returns false; otherwise true.
isRealNumber
bool isRealNumber(char* str)
{
    int i, len = strlen(str);
    bool hasDecimal = false;

    if (len == 0)
        return (false);
    for (i = 0; i < len; i++) {
        if ((str[i] != '0' && str[i] != '1' && str[i] != '2' &&
             str[i] != '3' && str[i] != '4' && str[i] != '5' &&
             str[i] != '6' && str[i] != '7' && str[i] != '8' &&
             str[i] != '9' && str[i] != '.') ||
            (str[i] == '-' && i > 0))
            return (false);
        if (str[i] == '.')
            hasDecimal = true;
    }
    return (hasDecimal);
}
It is similar to the isInteger() function, but it also accepts a decimal point (.). It returns true only if every character is a digit or a decimal point and at least one decimal point is present; otherwise false.
Parse
void parse(char* str)
{
    int left = 0, right = 0;
    int len = strlen(str);

    // right < len keeps every str[right] read inside the string
    while (right < len && left <= right) {
        if (isDelimiter(str[right]) == false)
            right++;

        if (isDelimiter(str[right]) == true && left == right) {
            if (isOperator(str[right]) == true)
                printf("'%c' IS AN OPERATOR\n", str[right]);

            right++;
            left = right;
        } else if ((isDelimiter(str[right]) == true && left != right)
                   || (right == len && left != right)) {
            char* subStr = subString(str, left, right - 1);

            if (isKeyword(subStr) == true)
                printf("'%s' IS A KEYWORD\n", subStr);
            else if (isInteger(subStr) == true)
                printf("'%s' IS AN INTEGER\n", subStr);
            else if (isRealNumber(subStr) == true)
                printf("'%s' IS A REAL NUMBER\n", subStr);
            else if (validIdentifier(subStr) == true
                     && isDelimiter(str[right - 1]) == false)
                printf("'%s' IS A VALID IDENTIFIER\n", subStr);
            else if (validIdentifier(subStr) == false
                     && isDelimiter(str[right - 1]) == false)
                printf("'%s' IS NOT A VALID IDENTIFIER\n", subStr);

            free(subStr); // release the substring once it is classified
            left = right;
        }
    }
    return;
}
We are getting to the end of the code. parse() walks through the input string one character at a time. To keep track of where we are, it uses two index variables, left and right: left marks the start of the current token, and right moves forward until it reaches a delimiter or the end of the string. The rest is a chain of conditional statements. I will elaborate on the one line that may not be obvious.
char* subStr = subString(str, left, right - 1);
To identify a piece of the input, we first check whether the current character is a delimiter standing on its own; if it is an operator, we print it. Otherwise we pass the range to the subString() function to create a substring.
So, we call subString() with the string and the positions where the token starts and finishes, given by the left and right variables. You may ask why we pass right - 1 instead of right. When this branch runs, right is the index of the delimiter that ended the token (or of the terminating \0 at the end of the string), so the last character of the token is at right - 1.
If you pass right without decreasing it, the substring also includes that delimiter, for example a trailing space, and the keyword and number checks no longer match. Go give it a try and see what happens.
subString
char* subString(char* str, int left, int right)
{
    int i;
    char* subStr = (char*)malloc(sizeof(char) * (right - left + 2));

    for (i = left; i <= right; i++)
        subStr[i - left] = str[i];
    subStr[right - left + 1] = '\0';
    return (subStr);
}
Here we allocate memory dynamically so that each substring gets exactly the space it needs.
char* subStr = (char*)malloc(sizeof(char) * (right - left+2));
The name “malloc” stands for memory allocation. This line declares a character pointer called subStr and points it at a newly allocated block of memory. How much memory does it allocate? sizeof(char) multiplied by (right - left + 2) bytes.
char is a character type and sizeof(char) is defined to be 1. A char*, on the other hand, is a pointer to a character, and sizeof(char*) depends on the environment. It is typically 4 in a 32-bit environment and 8 in a 64-bit environment.
But you may ask why we add 2, not 1 or something else. The substring runs from index left to index right inclusive, which is right - left + 1 characters, and we need one more byte for the terminating null character (\0) that every C string ends with. With only + 1, writing the \0 would go one byte past the end of the allocated block, which is undefined behavior even if the program appears to run. Anything more than + 2 works but wastes memory.
After terminating the string with \0, we return subStr to the parse function. The caller owns this memory and should release it with free() when it is done with it.
This concludes our code explanation.
With what you learned today and a good understanding of core C programming concepts, you can now implement a lexical analyzer in C.
Contact us if you have any questions about the lexical analyzer program above or its explanation.