diff --git a/HACKING.md b/HACKING.md deleted file mode 100644 index feb31b0..0000000 --- a/HACKING.md +++ /dev/null @@ -1,59 +0,0 @@ -Hacking CompilerDesign -====================== -## Getting started -1. [Install `git` for your platform](http://git-scm.com/). Click the Download button. -2. Once installed, in Git Bash (or the Terminal in Mac or Linux), configure git: - - git config --global user.name "FirstNameGoesHere LastNameGoesHere" - git config --global user.email yourEmailaddressGoesHereButDontTypeThisInLiterally - -3. [Fork the project here.](https://github.com/lawrancej/CompilerDesign/fork) -4. Clone the project locally. In Git Bash (or the Terminal in Mac or Linux), type: - - git clone https://github.com/YourGithubLoginNameGoesHereButDontTypeThisInLiterally/CompilerDesign.git - cd CompilerDesign - -5. Once in `CompilerDesign`, set up remote repositories and install required dependencies (Java, LaTeX, Pandoc, diction, LibreOffice). - - ./collaborators.sh setup - ./generate.sh install - -6. Build CompilerDesign, and check for issues. - - ./generate.sh pdf - ./generate.sh check - -7. [Learn how to contribute.](#how-to-contribute) See the [git cheat sheet](git.md) - - git pull upstream master # Get the latest and greatest. - git checkout -b issueXYZ # Work on an issue in a new topic branch, based off of the upstream master branch. - ... Hack away ... # Your text is free from content, style, grammar and spelling errors, right? - ./check.sh all 03 # Let's see if everything's good in section 03. (Replace the number as necessary.) - git commit -a -m "Fixed issueXYZ" # Great! Commit your changes. - git push origin issueXYZ # Push your changes to your repo. Send in a pull request. - git checkout master # Switch back to master. Rinse and repeat. - -## How to contribute. -The Saylor Foundation has compiled a [free compiler course](http://www.saylor.org/courses/cs304/), but a Creative Commons licensed textbook is not yet available. 
- -### Find (or open) an issue to work on -Version control is not a substitute for communication, so we use github's issue tracker to manage our participation. - - - [Work on open issues in github's issue tracker](https://github.com/lawrancej/CompilerDesign/issues) (comment on issues to get dibs). - * Pro tip: pair up and split the work on an issue with someone else. E.g., you write, they proofread. - * [Follow the conventions.](CONVENTIONS.md) - - - [Open new issues.](https://github.com/lawrancej/CompilerDesign/issues/new) - * [Examine the compiler course mapping outline for deficiencies](http://www.saylor.org/content/coursemapping/CourseMappingFormCS304.xls) - -### Review contributions -[Review contributions for quality issues (comment on pull requests).](https://github.com/lawrancej/CompilerDesign/pulls) - - - [Does the contribution follow the conventions?](CONVENTIONS.md) - -### Use topic branches for your work -Topic branches isolate chunks of work so that it's easier to merge in changes. - -### Send in a pull request for feedback -Switch to your branch in github, and send in a pull request that describes what you did. -Do so when you think your changes are ready to be merged in, but do not hesitate to push works in progress. diff --git a/textbook/01-overview.md b/textbook/01-overview.md index 16c6b8d..731fb14 100644 --- a/textbook/01-overview.md +++ b/textbook/01-overview.md @@ -338,14 +338,20 @@ Executable Code: is the code that runs on your machines, which is usually linked Last, Object Code: is act as the transitional form between the source code and the Executable code. ### Platform Independent Compilers -Platform Independent compilers compiles the source code irrespective of the platform(operating systems) on which it is being compiled. -Java compiler is one example of Platform Independent Compilers. All operating system uses same java compiler. -When java compiler compiles the java source code, it outputs java byte code which is not directly executable. 
+Platform-independent compilers compile source code irrespective of the platform (operating system) on which it is compiled. + +The Java compiler is one example of a platform-independent compiler. +All operating systems use the same Java compiler. + +When the Java compiler compiles Java source code, it outputs Java bytecode, which is not directly executable. + The java byte code is interpreted to machine language through JVM(Java Virtual Machine) in respective platform. ### Hardware Compilation -Hardware compilation is the process of compiling a program lagnuage into a digital circuit. -Hardware compilers produce implementation of hardware from some specification of hardware. +Hardware compilation is the process of compiling a programming language into a digital circuit. + +Hardware compilers produce an implementation of hardware from a specification of that hardware. + Instead of producing machine code which most of the software compiler does, hardware compiler compiles a program into some hardware designs. # Compiler Design @@ -398,7 +404,8 @@ Athough it adds another step, IR provides advantage of abstraction and cleaner s Compiler analyzes the source code to create intermediate representation of source code in front end. #### Manages Symbol Table -Symbol table is compile-time data structure which holds information needed to locate and relocate a program's symbolic definitions and references. +A symbol table is a compile-time data structure which holds information needed to locate and relocate a program's symbolic definitions and references. + Compiler manages symbol table when it analyzes the source code. This is done in several steps. @@ -418,7 +425,8 @@ There are usually only a small number of tokens for a programming language: cons Lexical analyzer is responsible for lexical analysis. #### Syntax Analysis -In this phase, the token from lexical analysis is parsed to determine the grammatical structure of source code. 
+In this phase, the tokens from lexical analysis are parsed to determine the grammatical structure of the source code. + Syntax analysis is closely related with semantic analysis. Normally, a parse tree is built in this process. It determines if the source code of the program is syntatically correct or not so that the program can be further processed for semantic analysis. diff --git a/textbook/02-lexical-analysis.md b/textbook/02-lexical-analysis.md index 225bc7f..e5c5d2e 100644 --- a/textbook/02-lexical-analysis.md +++ b/textbook/02-lexical-analysis.md @@ -58,20 +58,22 @@ Lexical Analysis To find if a language is regular, one must employ a *pumping lemma*: - All sufficiently long words in a regular language may be "pumped." - - A middle section of the word can be repeated any number of times to produce a new word which also lies within the same language. + - A middle section of the word can be repeated any number of times, and each resulting word is still in the language. - i.e. abc, abbc, abbbc, etc. - In a regular language $L$, there exists an integer $p$ depending only on said language that every string $w$ of "pumping length" $p$ can be written as $w = xyz$ satisfying the following conditions: 1. $|y| \ge 1$ 2. $|xy| \le p$ 3. for all $i \ge 0$, $xy^iz \in L$ - - Where $y$ is the substring that can be pumped. + - Where $y$ is the pumpable substring. [If the language is finite, it is regular](#why-are-all-finite-languages-regular)? ### Why are all finite languages regular? > TODO: prove this - + ### What is a regular grammar? A regular grammar is a [formal grammar](#what-is-a-grammar) limited to productions of the following forms: @@ -95,7 +97,7 @@ Match a single character. #### Operations: -If `a` and `b` are regular expressions, then the following are regular expressions: +If `a` and `b` are regular expressions, then so are the following: - `ab`. Catenation. Match `a` followed by `b`. 
@@ -103,12 +105,14 @@ If `a` and `b` are regular expressions, then the following are regular expressio Match `a` or `b`. - `a*`. Kleene closure. Matches `a` zero or more times. - -### What is a finite automaton? -A finite automaton, or finite state machine, can only be in a finite number of states in which it transititons between. + +### What is a finite state machine? +A finite state machine, also known as a finite automaton, can only be in one of a finite number of states, between which it transitions. An example is that when an automaton sees a symbol for input. -It then transititons to another state based on the next input symbol. +It then transitions to another state based on the next input symbol. It has: @@ -120,7 +124,7 @@ It has: ### What is an nondeterministic finite automaton? It is a finite automaton in which we have a choice of where to go next. -The set of transitions is from (state, character) to set of states. +The set of transitions is from (state, character) to a set of states. ### What is a deterministic finite automaton? It is a finite automaton in which we have only one possible next state. @@ -130,54 +134,58 @@ The set of transitions is from (state, character) to state. ### What is the difference between deterministic and nondeterministic? Deterministic finite automaton's (DFA's) are specific in regard to the input that they accept and the output yielded by the automaton. -The next state that the machine goes to is literally determined by the input string it is given. -A nondeterministic finite automaton is not as particular, and depending on its state and input, could change into a several -possible new states. +The input string determines the next state that the machine goes to. +A nondeterministic finite automaton is not as particular. +Depending on its state and input, it could change into any of several possible new states. -Simple put the difference between a DFA and an NFA is that a DFA has no epilsons between the transitional states. 
-The reasons that this makes a difference is that when we place an epsilon between our states it is not always possible to figure out the correct path to go without looking aheard in the current string we are parsing. -This means that we are using something that is nondeterminsitic. -Where as if we know the correct path to go at all times, it is determnistic. +One difference between a DFA and an NFA is that a DFA has no epsilon transitions between its states. -Deterministic and nondeterministic are very similar and there is no huge difference between them. -The main difference is that nondeterministic essentially chooses on a whim which state to go to while deterministic does not do this at random. +When an epsilon transition sits between states, it is not always possible to figure out the correct path to take without looking ahead in the string being parsed. +Such an automaton is nondeterministic. +Whereas if we know the correct path to take at all times, it is deterministic. -### How to convert an NFA to a DFA? -Since both automaton's only accept regular languages as input, an NFA is able to be simplified and converted to a DFA. +Deterministic and nondeterministic automata are similar, with one distinct difference between them. +The main difference is that a nondeterministic automaton may have several possible next states for the same input, while a deterministic automaton has exactly one. -The process is called a powerset (or subset) construction and it takes the possible states of the NFA and translates them +### How to convert an NFA to a DFA? +Since both kinds of automaton accept exactly the regular languages, any NFA can be converted to an equivalent DFA. +The process, referred to as the powerset (or subset) construction, takes the possible states of the NFA and translates them into a map of states accessible to a DFA. 
-This process is not without a cost, since deterministic finite automaton's are -much less complex than their nondeterministic counterparts there will always be a loss of potential states in conversion. -All of the states of the NFA will still exist, but they will be unreachable from the origin once converted and thus obsoleted. -A converted NFA will have N^2 the number of states when converted where N is the number of states that the NFA originally had. +This process is not without a cost. -### What is the derivative of a regular expression? +Deterministic finite automata are simpler to execute than their nondeterministic counterparts, but the conversion can multiply the number of states. +Every set of NFA states is a potential DFA state, but sets that are unreachable from the start state can be discarded. +In the worst case, the converted DFA has 2^N states, where N is the number of states in the original NFA. +### What is the derivative of a regular expression? + ### What is a scanner (lexical analyzer)? > TODO: Merge these definitions. Some of these definitions are misconceptions, which we should include to address why they're wrong. A scanner is a program in a parser that converts characters into tokens. -This already has the information it needs about whatever characters that can be tokenized. -This then matches any string that was put in to possible tokens and processes said information. +It contains information about what it can tokenize. +It matches input strings to possible tokens and processes the information. -Lexical analysis or scanning is the process where the stream of characters making up the -source program is read from left-to-right and grouped into tokens. +Lexical analysis or scanning +- A process that reads the stream of characters making up the source program from left to right and groups them into tokens. Tokens are sequences of characters with a collective meaning. 
There are usually only a small number of tokens for a programming language: constants (integer, double, char, string, etc.), operators (arithmetic, relational, logical), punctuation, and reserved words. -A lexical analyzer is a piece of software that takes in a string as input, from that string it generates tokens based off of pre-defined rules. -This is done to help for the actual compilation proccess later, as well as error checking. +A lexical analyzer is a piece of software that takes in a string as input, then generates tokens based off of pre-defined rules. +This helps for the compilation process and error checking later on. #### Example -Lets take a look at some basic code with some basic rules. + int a = sum(7,3) -We define the rules as. +Rules: + VARIABLE_TYPE = int | float | double | char ASSIGNMENT_OPERATOR = = OPEN_PARANTHESIS = ( @@ -186,16 +194,14 @@ DIVIDER = , NUMBER = all numbers NAME = any that remain -Using these rules we can now figure out what everything in this piece of code is. +These rules simplify understanding the code sample below: VARIABLE_TYPE NAME ASSIGNMENT_OPERATOR NAME OPEN_PARENTHESIS NUMBER DIVIDER NUMBER CLOSE_PARANTHESIS -We can pass that on to the next step of the compilation proccess and it will now know what each of those words/symbols means. +The analyzer passes these values to the next step of the compilation process to process. + + -Scanner, also know as Lexical analyzer or Lexer is a program which performs lexical analysis. -It converts a sequence of characters into string of characters with a collective meaning following some rules. -These rules contain identifier, assignment operator, number etc. -The lexical analyzer takes a source program as input, and produces a stream of tokens as output. Source Program -----> Lexical Analyzer ---------> Token stream | @@ -205,10 +211,9 @@ Source Program -----> Lexical Analyzer ---------> Token stream > TODO: Let's use SVG instead of ASCII art. 
-A Scanner is used within lexical analysis to match token character strings that -are passed through it. -Scanners use finite-state machines (FSM) to hold all possible combinations of tokens -so they may quickly process large amounts of data. +Lexical analysis uses a scanner to match the strings passed into it against token patterns. + +Scanners use finite-state machines (FSM) to hold all possible combinations of tokens so they may quickly process large amounts of data. A program or function which can parse a sequence of characters into usable tokens. Sequences are typically delimited in some way using characters (i.e. @@ -218,32 +223,45 @@ Sequences are typically delimited in some way using characters (i.e. Examples > TODO: Add some examples - + ### What is a lexeme? -A lexeme is a string of characters that follow a set of rules in a language, which is then categorized by a [token][#what-is-a-token]. +A lexeme is a string of characters that follows a set of rules in a language and is categorized by a [token](#what-is-a-token). ### What is a token? - -A token is a single element of a programming language. Tokens could be keywords ( a word that is reserved by a program because the word has a special meaning), operators (elements in a program that are usually used to assist in testing conditions (OR, AND, =, >, etc.)), or punctuation marks. A token is a single element of a programming language. -Tokens could be keywords, operators, or punctuation marks. - -A token is a string of characters that are categorized based on the types used (e.g., IDENTIFIER, NUMBER, COMMA). +Tokens could be keywords (words reserved by the language because they have a special meaning), operators (elements usually used in computing values or testing conditions, such as OR, AND, =, >), or punctuation marks. + + +A token is a string of characters categorized based on the types used (e.g., IDENTIFIER, NUMBER, COMMA). They are frequently defined by regular expressions. 
-Tokens are generally formed by having a lexical analyzer read the input sent to it, identify the lexemes in the input, then categorizes them into the tokens. +Tokens are generally formed by having a lexical analyzer read the input sent to it, identify the lexemes, then categorizes them into the tokens. #### Example - + + + Consider this example for clarification: Input: int x = 3; -- int is a numeric variable type. -- x is an identifier variable. -- = is an assignment operator. -- 3 is a number value. -- ; is the end of a statement. +- int is a numeric variable type token. +- x is an identifier variable token. +- = is an assignment operator token. +- 3 is a value token. +- ; is the end of a statement token. diff --git a/textbook/03-parsing.md b/textbook/03-parsing.md index fca4a68..1d6204b 100644 --- a/textbook/03-parsing.md +++ b/textbook/03-parsing.md @@ -326,12 +326,14 @@ What the table says is simply that for our non terminal symbol "S" we have three The stack sequence for our string "bababa" is as follows: [b,a,b,a,b,a,`$`] -The first step for the parser is to look at the input symbol "b" and the stack-top symbol S. Since "b" is the input symbol, the parser compares that to the stack-top symbol S, +The first step for the parser is to look at the input symbol "b" and the stack-top symbol S. +Since "b" is the input symbol, the parser compares that to the stack-top symbol S, and since the rule for "b" is to replace "b" with "bS", the stack now becomes: [b,S,a,b,a,b,a,`$`] -Since the input symbol "b" did not match the stack-top symbol S, the "b" is put as the stack-top symbol and not processed further in the first step. Had it been a match, we would further +Since the input symbol "b" did not match the stack-top symbol S, the "b" is put as the stack-top symbol and not processed further in the first step. 
+Had it been a match, we would further process the terminal symbol as defined by the production rules (for example if the first symbol was S, we could have applied any of the three rules producing a stack of [a,a,b,a,b,a,`$`] or [a,s,a,b,a,b,a,`$`]). @@ -347,7 +349,8 @@ with the Non terminal stack becoming: [S,`$`] -The third iteration continues on and processes the input character "a". Now since we have two production rules with "a" listed, the parser has a choice. Also, our parser +The third iteration continues on and processes the input character "a". Now since we have two production rules with "a" listed, the parser has a choice. +Also, our parser only has a lookahead of 1. We will assume the parser is lazy and takes the rules sequentially, so our production rule on the input symbol "a" will be refactored by rule 1 which is simply "a". Again the input symbol and stack-top symbol do not match so the "a" is not removed yet but is refactored as so by rule 1 and the stack-top symbol becomes "a": [a,b,a,b,a,`$`] @@ -362,7 +365,8 @@ and writing rule #1 to the output stream: [3,1] -Again our input symbol is "b" so we process as we did in the first and second iteration. For brevity's sake I will keep it shorthand: +Again our input symbol is "b" so we process as we did in the first and second iteration. +For brevity's sake I will keep it shorthand: current stack: [b,a,b,a,`$`] @@ -393,12 +397,15 @@ After the Non-terminal stack is resituated, we re-evaluate, and because the Non- [b,a,`$`] [3,1,3,1] -As you can see where this is going, I'll sum up the next two. +As you can see where this is going, I'll sum up the next two. + [b,a,`$`] => [3,1,3,1,3] [a,`$`] => [3,1,3,1,3,1] -Once our parser reaches the special terminator character, it knows it has done it's job and is done. -It's important to note that had we instead chosen rule #2 to replace A, it would have produced the same output. 
In fact, it would be a good excercise to prove this result +Once our parser reaches the special terminator character, it knows it has done its job and is done. + +It's important to note that had we instead chosen rule #2 to replace A, it would have produced the same output. +In fact, it would be a good exercise to prove this result yourself. Excercises 1. Given the same grammar and production rules, what would be the output stream produced by an LL(1) parser for the string "aabaa"? diff --git a/textbook/04-ast-and-symbol-tables.md b/textbook/04-ast-and-symbol-tables.md index dae59ce..94c85d0 100644 --- a/textbook/04-ast-and-symbol-tables.md +++ b/textbook/04-ast-and-symbol-tables.md @@ -1,90 +1,90 @@ - -\pagebreak - - - -Abstract Syntax Trees and Symbol Tables -======================================= -### What is an abstract syntax tree? -An abstract Syntax Tree is the data structure compilers/interpreters use in order to perform the actual code generation. -It represents the hirearchy of the programmers code. -An important note is that not all syntax of the code is displayed in the tree in the case of grouping paranthesis. - -#### Example - - int doSomething(int a) - { - if(a > 10) - a = a % 10; - return a; - } - -![Abstract syntax tree for `doSomething`.](images/ast-example.svg) - -An [abstract syntax tree (AST)](http://en.wikipedia.org/wiki/Abstract_syntax_tree) is simply a tree representation of the structure of source code. -Each node of the tree represents a part of the code. - -The "abstract" part of the AST comes from the fact that the tree does not represent the syntax down to the character level. -Tokens like parenthesis and brackets are not nodes on the tree, and are instead represented implicitly by the structure of the tree itself. - -If the code cannot be represented accurately as a tree, the parsed language is not [context-free](#what-is-a-context-free-language). 
- -In addition to representing the structure of the code, the AST is the output of a parser. -Every node is a structure of a particular type of node. - -Each node is created by creating a function which will return a pointer to a structure that will signify that node. - -#### Example - - struct Signature - { - struct AttributeList *attributes; - struct Identifier *name; - struct DeclarationList *arguments; - struct TerminationSet *responses; - }; - extern struct Signature *node_signature ( - struct AttributeList *attributes, - struct Identifier *name, - struct DeclarationList *arguments, - struct TerminationSet *responses); - - The example above is from the following website: - http://www.ansa.co.uk/ANSATech/95/Primary/155101.pdf - -### What is the difference between an abstract syntax tree and a parse tree? - -Parse Tree: are the rules to match the input text where as a syntax tree record the structure of the input. - - -Syntax Tree: It will be less sensitivity from the "Parse tree" as it focuses more on the structure of the language not the grammar. - - + +\pagebreak + + + +Abstract Syntax Trees and Symbol Tables +======================================= +### What is an abstract syntax tree? +An abstract syntax tree is the data structure compilers and interpreters use to perform the actual code generation. +It represents the hierarchy of the programmer's code. +An important note is that not all syntax of the code appears in the tree; grouping parentheses, for example, are omitted. + +#### Example + + int doSomething(int a) + { + if(a > 10) + a = a % 10; + return a; + } + +![Abstract syntax tree for `doSomething`.](images/ast-example.svg) + +An [abstract syntax tree (AST)](http://en.wikipedia.org/wiki/Abstract_syntax_tree) is simply a tree representation of the structure of source code. +Each node of the tree represents a part of the code. + +The "abstract" part of the AST comes from the fact that the tree does not represent the syntax down to the character level. 
+Tokens like parentheses and brackets are not nodes on the tree, and are instead represented implicitly by the structure of the tree itself. + +If the code cannot be represented accurately as a tree, the parsed language is not [context-free](#what-is-a-context-free-language). + +In addition to representing the structure of the code, the AST is the output of a parser. +Every node is a structure specific to its kind of node. + +Each node is created by a function that returns a pointer to a structure representing that node. + +#### Example + + struct Signature + { + struct AttributeList *attributes; + struct Identifier *name; + struct DeclarationList *arguments; + struct TerminationSet *responses; + }; + extern struct Signature *node_signature ( + struct AttributeList *attributes, + struct Identifier *name, + struct DeclarationList *arguments, + struct TerminationSet *responses); + + The example above is from the following website: + http://www.ansa.co.uk/ANSATech/95/Primary/155101.pdf + +### What is the difference between an abstract syntax tree and a parse tree? + +Parse tree: records the grammar rules used to match the input text, whereas a syntax tree records the structure of the input. + + +Syntax tree: less detailed than the parse tree, as it focuses more on the structure of the program than on the grammar. + + diff --git a/textbook/05-semantic-analysis.md b/textbook/05-semantic-analysis.md index e3d767f..0ca93ad 100644 --- a/textbook/05-semantic-analysis.md +++ b/textbook/05-semantic-analysis.md @@ -137,4 +137,5 @@ In C, a direct function call is statically bound (bound before the program is ru In C++, a virtual method call is dynamically bound (bound during runtime). Since generally speaking a specific type of a polymorphic object is not known before runtime, the executed function must be dynamically bound. - \ No newline at end of file + \ No newline at end of file
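
The static versus dynamic binding distinction in the last hunk's context can be sketched in C++. This is a minimal illustration under stated assumptions, not code from the textbook: `Shape`, `Circle`, `name`, `kind`, and `describe` are hypothetical names chosen for the example.

```cpp
#include <string>

// Hypothetical types for illustration only; none of these names come from the textbook.
struct Shape {
    virtual ~Shape() = default;
    // Virtual: the callee is chosen at run time from the object's actual type.
    virtual std::string name() const { return "shape"; }
    // Non-virtual: the callee is fixed at compile time by the static type.
    std::string kind() const { return "shape"; }
};

struct Circle : Shape {
    // Overrides the virtual method, so calls through a Shape reference land here.
    std::string name() const override { return "circle"; }
    // Hides Shape::kind but does not override it.
    std::string kind() const { return "circle"; }
};

// The call to describe itself is a direct call, so it is statically bound, as in C.
// Inside, kind() is statically bound to Shape::kind; name() is dynamically bound.
std::string describe(const Shape& s) {
    return s.kind() + "/" + s.name();
}
```

Calling `describe(Circle{})` should give `shape/circle`: the compiler sees only a `Shape` reference, so it binds `kind()` to `Shape::kind` before the program runs, while `name()` is looked up at run time and reaches `Circle::name`.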