From 68e9c0dd30bd4766e3eec01d1594f1eb2c598803 Mon Sep 17 00:00:00 2001 From: Philip On Date: Sun, 5 Aug 2012 18:02:26 -0400 Subject: [PATCH 1/4] Why can't i get back my textbook folder --- textbook/00-preface.md | 26 ++ textbook/01-overview.md | 438 ++++++++++++++++++ textbook/02-lexical-analysis.md | 249 +++++++++++ textbook/03-parsing.md | 487 +++++++++++++++++++++ textbook/04-ast-and-symbol-tables.md | 90 ++++ textbook/05-semantic-analysis.md | 140 ++++++ textbook/06-intermediate-representation.md | 95 ++++ textbook/07-optimization.md | 323 ++++++++++++++ textbook/08-code-generation.md | 73 +++ textbook/background.md | 120 +++++ 10 files changed, 2041 insertions(+) create mode 100644 textbook/00-preface.md create mode 100644 textbook/01-overview.md create mode 100644 textbook/02-lexical-analysis.md create mode 100644 textbook/03-parsing.md create mode 100644 textbook/04-ast-and-symbol-tables.md create mode 100644 textbook/05-semantic-analysis.md create mode 100644 textbook/06-intermediate-representation.md create mode 100644 textbook/07-optimization.md create mode 100644 textbook/08-code-generation.md create mode 100644 textbook/background.md diff --git a/textbook/00-preface.md b/textbook/00-preface.md new file mode 100644 index 0000000..a63baf9 --- /dev/null +++ b/textbook/00-preface.md @@ -0,0 +1,26 @@ +\pagebreak + +Preface +======= + +> Tell me and I'll forget; show me and I may remember; involve me and I'll understand. +--- Chinese proverb + +This book aims to teach the concepts of compiler design Socratically through questions and follow-up. +It keeps definitions and explanations brief, because these alone do not teach. +Instead, it focuses on showing examples, counterexamples, and identifying misconceptions. +Examples and counterexamples illustrate terms to improve retention. +Understanding material involves clearing misconceptions, which is why the book identifies them. + +This book is a response to the Saylor Foundation's bounty for open textbooks. 
It is under the [Creative Commons Attribution 3.0 Unported License](http://creativecommons.org/licenses/by/3.0/).
Therefore, anyone may read, alter, and improve upon this book free of charge.

Undergraduate students wrote this book in a compiler design class.
Without their contributions, this book would not be possible.
On behalf of my coauthors, I dedicate this book to students everywhere who learn for the sake of learning.
I hope you find this book useful, and would love [feedback](https://github.com/lawrancej/CompilerDesign) for improvements.

[Joey Lawrance](mailto:joey.lawrance@gmail.com) (Editor), July 2012

[![CC-BY](images/cc-by.svg)](http://creativecommons.org/licenses/by/3.0/)

diff --git a/textbook/01-overview.md b/textbook/01-overview.md
new file mode 100644
index 0000000..16c6b8d
--- /dev/null
+++ b/textbook/01-overview.md
@@ -0,0 +1,438 @@

\pagebreak

Introduction
============

## Overview

### What is a compiler?

A compiler translates from a source [language](#what-is-a-language) to a target language.

Examples:

- GCC, Clang, and Visual C++ translate C into machine code
- LaTeX and Pandoc translate document markup into PDF, HTML, etc.

Follow-up:

- [How do compilers work](#what-are-the-phases-of-a-compiler)?
- [Who developed the first compiler](#who-was-grace-hopper)?

### What are the phases of a compiler?
Compilers consist of several distinct phases, split between the front end and the back end.

![Phases of a compiler](images/compiler-phases.svg)

#### Front end
The front end processes the source language and consists of these phases:

- [Scanning (Lexical analysis)](#what-is-a-scanner). Split source code (a [string](#what-is-a-string)) into a token sequence.
- [Parsing (Syntactic analysis)](#what-is-a-parser). Check whether the token sequence conforms to the language grammar, and construct the [parse tree](#what-is-a-parse-tree) or [abstract syntax tree](#what-is-an-abstract-syntax-tree).
- [Type checking (Semantic analysis)](#what-is-a-type-checker). Check whether the program is [semantically valid](#what-is-semantics).

#### Back end
The back end generates the target language and consists of these phases:

- [Translation](#what-is-a-translator). Convert an abstract syntax tree into an [intermediate representation](#what-is-an-intermediate-representation).
- [Analysis](#what-is-analysis). Collect information necessary for optimization.
- [Optimization](#what-is-optimization). Improve [intermediate representation](#what-is-an-intermediate-representation) code.
- [Code generation](#what-is-code-generation). Produce machine code from an intermediate representation or an [abstract syntax tree](#what-is-an-abstract-syntax-tree).

## Compilers and interpreters

### What is an interpreter?
An interpreter reads in source code and executes it immediately, without producing an executable.

#### Examples:

- Debuggers
- Scripting languages

#### Follow-up:

- [How do interpreters work](#how-do-interpreters-work)?
- [Which is better, compilers or interpreters](#which-is-better-compilers-or-interpreters)?

### How do interpreters work?
Interpreters share many [phases of a compiler](#what-are-the-phases-of-a-compiler), but execute code instead of [generating machine code](#what-is-code-generation).
Interpreter implementations vary:

- Trivial interpreters execute code while parsing (e.g., early versions of Lisp, Python, Perl, and Basic).
- Traditional interpreters omit the code generator and execute the intermediate representation.
- Complex interpreters execute precompiled code as part of a compiler-interpreter system.

### Which is better, compilers or interpreters?
It depends.

Because an interpreter never produces an executable, interpreted code is always up to date.
However, an interpreter must process source code every time it executes, so it can be slower than compiled machine code.

### The C compiler is in C; how can that be?
With the exception of the first C compiler, it is possible to write a C compiler in C and then compile it using another existing C compiler.
Writing the first C compiler in C required [bootstrapping](#what-is-bootstrapping-and-how-does-it-work).

### What is bootstrapping and how does it work?
The term "bootstrapping" comes from the saying "to pull yourself up by your bootstraps," which means to improve yourself as a result of your own efforts.
In computing, the term often describes the act of building a system using itself, or a previous version of the system.
More specifically, when referring to compilers, bootstrapping means writing a compiler in its own target language, creating a self-hosting compiler that can compile its own source code.

The first few versions of a compiler for a new language are written in an existing, reliable language, until the new compiler becomes reliable enough to be self-hosting.
The first couple of C compilers were written in assembly, but now they are written in C.

Examples of self-hosting compilers:

- Basic
- C
- C++
- Java
- Python
- Scheme

## Theory of computation

### What is a language?
A [set](#what-is-a-set) of [strings](#what-is-a-string).
Typically, a [formal grammar](#what-is-a-grammar) defines the language.

Examples and counterexamples:

- "I love you dearly!" is in English.
- "Love I dearly you!" is not in English, despite the English words.
- "int main() { return 0; }" is in C.

### What is a grammar?
A grammar consists of:

- A set of [productions](#what-is-a-production).
- A set of terminals.
- A set of nonterminals.
- A start symbol (a nonterminal).

#### Example
This two-production grammar (written in a variant of [Backus Naur Form](#what-is-backus-naur-form), with regular-expression notation on the right-hand sides) matches balanced parentheses:

    Parens → ( Parens )*
    Parens → [^()]*

### What is a production?
A production, or rewriting rule, consists of a left-hand side (LHS) and a right-hand side (RHS).

$LHS \to RHS$

[Depending on the class of grammar](#what-is-chomskys-hierarchy), the left-hand side and right-hand side can be sequences of [terminals](#what-is-a-terminal) and [nonterminals](#what-is-a-nonterminal).

### What is a nonterminal?
A nonterminal is anything in a grammar that can be replaced, and corresponds to [parent nodes](#what-is-a-parent-node) in a [parse tree](#what-is-a-parse-tree).

### What is a terminal?
A terminal is a primitive unit in a grammar (a [symbol](#what-is-a-symbol) or [token](#what-is-a-token)) that corresponds to the [leaf nodes](#what-is-a-leaf-node) in a [parse tree](#what-is-a-parse-tree).
Terminal symbols/tokens cannot be broken down further.

Example, given the productions:

1. $S \to Sg$
2. $S \to gS$

Here, $g$ is a terminal because no production can rewrite it.
$S$, however, is a nonterminal because both productions can rewrite it.

### What is a containment hierarchy?
A containment hierarchy is a hierarchical ordering of nested sets that are uniquely different from each other.
There are two types of containment hierarchy: one where the parent includes its children (subsumptive), and one where the parent is made up of its children (compositional).

- Subsumptive: all cars are vehicles, but not all vehicles are cars, so the vehicle class subsumes the car class.
- Compositional: cars contain engines and tires, so the car class is composed of the engine and tire objects.

### What is Chomsky's hierarchy?
The Chomsky hierarchy, as the name implies, is a [containment hierarchy](#what-is-a-containment-hierarchy) of classes of [formal grammars](#what-is-a-grammar).
The hierarchy consists of four levels:

1. [Unrestricted grammars](#what-is-an-unrestricted-grammar). Recognized by [Turing machines](#what-is-a-turing-machine).
2. [Context-sensitive grammars](#what-is-a-context-sensitive-grammar).
Recognized by a [bounded Turing machine](#what-is-a-bounded-turing-machine).
3. [Context-free grammars](#what-is-a-context-free-grammar). Recognized by a [pushdown automaton](#what-is-a-pushdown-automaton).
4. [Regular grammars](#what-is-a-regular-grammar). Recognized by a [finite state machine](#what-is-a-finite-automaton).

#### Follow-up questions
- [What are the implications of Chomsky's hierarchy](#what-are-the-implications-of-chomskys-hierarchy)?

### What are the implications of Chomsky's hierarchy?
The difference between regular, context-free, and context-sensitive languages is in the structure of strings.
The difference has nothing to do with meaning or semantics.

Context-sensitive
------------------
Rules are of the form:

$\alpha A \beta \to \alpha B \beta$

$S \to \epsilon$

where

- $A, S \in N$
- $\alpha, \beta, B \in (N \cup \Sigma)^*$
- $B \neq \epsilon$

Context-free
------------------
Rules are of the form:

$A \to \gamma$

where

- $A \in N$
- $\gamma \in (N \cup \Sigma)^*$

Regular
------------------
Rules are of the form:

$A \to a$

$A \to \epsilon$

$A \to aB$

where

- $A, B \in N$ and $a \in \Sigma$

### What is an unrestricted grammar?
An unrestricted grammar's productions can include sequences of terminals and nonterminals on both the left and right hand sides of productions.

### What is a context-sensitive grammar?
> TODO: define context-sensitive here.

Why does more than one symbol on the left-hand side make a grammar no longer context-free?

To explain this, let's look at a grammar in which that occurs:

$A \to h$

$B \to k$

$AB \to Asd$

Now, when a derivation reaches the sentential form $AB$, do we rewrite $A$ and $B$ separately, or rewrite $AB$ as a unit?
This cannot be determined using the rules of the grammar alone; we would need other information, namely the surrounding context.
This means that the grammar is not context-free.

## History of compilers

### Who was [Grace Hopper](http://www.smbc-comics.com/?id=2516)?

![Grace Hopper.
Official U.S.
Navy Photograph.](images/grace-hopper.jpg)

Grace Hopper developed the first compiler for a computer programming language and influenced subsequent programming languages.
Her [distinguished naval career](#what-did-grace-hoppers-naval-career-have-to-do-with-compilers) led to her [contributions to computer science](#what-did-grace-hopper-contribute-to-computer-science).

#### Follow-up questions

- [What did Grace Hopper contribute to computer science](#what-did-grace-hopper-contribute-to-computer-science)?
- [What did Grace Hopper's naval career have to do with compilers](#what-did-grace-hoppers-naval-career-have-to-do-with-compilers)?

### What did Grace Hopper contribute to computer science?
Grace Hopper:

 - Conceptualized machine-independent programming languages.
 - Coined the term "compiler".
 - Popularized the term "debugging".
 - Influenced the design of COBOL.
 - Guided the standardization of Fortran and COBOL.

#### What did Grace Hopper's naval career have to do with compilers?
The Navy's David Taylor Model Basin was one of the government agencies that sponsored the development of COBOL.
Grace Hopper's position in the Navy allowed her to work with the latest technology of the time, and it was the Navy that assigned her the task of overseeing the development of a set of programs and procedures for validating COBOL compilers, as part of a Navy-wide standardization program.

 - Sworn into the United States Navy Reserve in 1943.
 - Volunteered to serve in the WAVES.
 - Trained at Smith College in Northampton, MA.
 - Graduated first in her class in 1944.
 - Assigned to the Bureau of Ships Computation Project at Harvard University as a lieutenant.
 - Served on the Mark I computer programming staff.
 - Was declined entry to the regular Navy due to her age.
 - Continued serving in the Navy Reserve.
 - Continued working in the Harvard Computation Lab until 1949 under a Navy contract.

#### Honors
 - Computer Sciences Man of the Year award from the Data Processing Management Association in 1969.
 - Made a Distinguished Fellow of the British Computer Society in 1973.
 - Defense Distinguished Service Medal in 1986.
 - Computer History Museum Fellow Award in 1987.
 - Golden Gavel Award at the Toastmasters International convention in 1988.
 - National Medal of Technology in 1991.

## Purpose

As mentioned in the definition, a compiler translates a source language into a target language.
The purpose of a compiler, then, is to make high-level languages easy for the computer to understand, because the computer only understands 0s and 1s.
In addition, a compiler lets programmers communicate with hardware.

### Translate Source Language to Target Language
The purpose of a compiler is to translate a program into computer language.

### Object Code and Executables
Let's first define source code, object code, and executables, and then discuss how they relate to the compiler.
Source code is the code that programmers write and run on their machines.
Executable code is the code that runs on your machine, and is linked together from object code.
Finally, object code acts as the transitional form between source code and executable code.

### Platform Independent Compilers
Platform-independent compilers compile source code irrespective of the platform (operating system) on which the compilation happens.
The Java compiler is one example of a platform-independent compiler: all operating systems use the same Java compiler.
When the Java compiler compiles Java source code, it outputs Java bytecode, which is not directly executable.
The Java bytecode is interpreted into machine language by the Java Virtual Machine (JVM) on each platform.
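The compile-once, interpret-per-platform idea described above can be sketched in miniature. This toy instruction set and virtual machine are illustrative inventions, not Java's actual bytecode: the "compiler" turns an expression tree into a portable instruction list, and a small VM (one per platform) executes it.

```python
def compile_expr(node):
    """Compile a nested tuple like ('+', 2, ('*', 3, 4)) into stack code."""
    if isinstance(node, int):
        return [("PUSH", node)]
    op, left, right = node
    # Postorder: operands first, then the operator (stack-machine style).
    return compile_expr(left) + compile_expr(right) + [(op, None)]

def run(bytecode):
    """A minimal stack-based VM; each platform would ship its own."""
    stack = []
    for op, arg in bytecode:
        if op == "PUSH":
            stack.append(arg)
        elif op == "+":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "*":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
    return stack.pop()

code = compile_expr(("+", 2, ("*", 3, 4)))
print(run(code))  # 14
```

The instruction list plays the role of bytecode: it is produced once, and any machine with a `run` implementation can execute it.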
### Hardware Compilation
Hardware compilation is the process of compiling a programming language into a digital circuit.
Hardware compilers produce implementations of hardware from a specification of hardware.
Instead of producing machine code, as most software compilers do, a hardware compiler compiles a program into a hardware design.

# Compiler Design

## One-Pass vs Multi-Pass

### One Pass
A one-pass compiler passes through each part of a compilation unit only once and immediately translates each part into its final machine code.
The implementation of a one-pass compiler is much easier, since there is no need to keep track of special cases: there is one well-defined treatment of all code.
While the one-pass method is also much faster, it has some inherent disadvantages.
One-pass compilers are unable to generate programs as efficient, due to their limited scope, and they need forward declaration of identifiers.
Loops, subroutines, and modules can need more than one pass to optimize them effectively.

### Multi-Pass
> TODO: add Source-to-Source Compilation Possible (Translators)
> TODO: add Source-Bytecode-Native Code

A multi-pass compiler traverses the program multiple times.
Each pass takes the result of the previous pass as input and creates an intermediate output.
This retraversal gives the multi-pass compiler a much bigger scope, as it allows the compiler to see the entire program being compiled, as opposed to a one-pass compiler, which can only see a small portion of the program at a time.
A multi-pass compiler is also easier to prove correct.
Each pass is its own self-contained unit, which can be checked for correctness independently of the others.
Each intermediate step performs simpler operations that are easier to prove correct.

## Structure

### Front End
The front end of the compiler analyzes the source code that is being compiled.
It also creates the intermediate representation (IR) of the source code and manages the symbol table.

#### Create Intermediate Representation
Normally, a compiler first translates the source code into some form of intermediate representation.
Although it adds another step, the IR provides the advantages of abstraction and a cleaner separation between the front end and the back end.
The compiler analyzes the source code to create the intermediate representation in the front end.

#### Manages Symbol Table
The symbol table is a compile-time data structure that holds the information needed to locate and relocate a program's symbolic definitions and references.
The compiler manages the symbol table as it analyzes the source code.
This is done in several steps.

#### Steps

#### Preprocessing
Preprocessing is the process of performing preliminary operations on source code before it is actually compiled.
Only a few compilers include this step.
In this phase, the preprocessor looks through the source code to find specific instructions for the compilation process.
C, C++, and C# use a preprocessor.

#### Lexical Analysis
Lexical analysis, or scanning, is the process where the stream of characters making up the source program is read from left to right and grouped into tokens.
Tokens are sequences of characters with a collective meaning.
There are usually only a small number of token kinds for a programming language: constants (integer, double, char, string, etc.), operators (arithmetic, relational, logical), punctuation, and reserved words.
The lexical analyzer is responsible for lexical analysis.

#### Syntax Analysis
In this phase, the tokens from lexical analysis are parsed to determine the grammatical structure of the source code.
Syntax analysis is closely related to semantic analysis.
Normally, a parse tree is built in this process.
Syntax analysis determines whether the source code of the program is syntactically correct, so that the program can be further processed by semantic analysis.

#### Semantic Analysis
In this phase, semantic information is added to the parse tree that was built during syntax analysis.
Semantic analysis consists of tracking variable types, function types, and declaration types, and of type checking.
It checks whether all of the variables, functions, and classes are properly defined.
Typically, the symbol table is created during this phase.

### Back End
> TODO: add Back End

#### Steps
> TODO: add Analysis
> TODO: add Optimization
> TODO: add Code Generation

diff --git a/textbook/02-lexical-analysis.md b/textbook/02-lexical-analysis.md
new file mode 100644
index 0000000..225bc7f
--- /dev/null
+++ b/textbook/02-lexical-analysis.md
@@ -0,0 +1,249 @@

\pagebreak

Lexical Analysis
================
### What is a regular language?
[Regular expressions](#what-is-a-regular-expression) define the regular languages.
[Regular grammars](#what-is-a-regular-grammar) and [finite automata](#what-is-a-finite-automaton) recognize regular languages.

#### Follow-up questions
- [What is a regular expression](#what-is-a-regular-expression)?
- [How can you tell if a language is regular](#how-can-you-tell-if-a-language-is-regular)?
- [What is a finite automaton](#what-is-a-finite-automaton)?
- [What is a regular grammar](#what-is-a-regular-grammar)?

### How can you tell if a language is regular?
To show that a language is *not* regular, one can employ the *pumping lemma*, a necessary condition that every regular language satisfies:

- All sufficiently long words in a regular language may be "pumped."
    - A middle section of the word can be repeated any number of times to produce a new word that also lies within the same language.
    - e.g., abc, abbc, abbbc, etc.
- For a regular language $L$, there exists an integer $p$ (the "pumping length"), depending only on $L$, such that every string $w \in L$ with $|w| \ge p$ can be written as $w = xyz$ satisfying the following conditions:
    1. $|y| \ge 1$
    2. $|xy| \le p$
    3. for all $i \ge 0$, $xy^iz \in L$
    - Here, $y$ is the substring that can be pumped.

[If a language is finite, it is regular](#why-are-all-finite-languages-regular).

### Why are all finite languages regular?
> TODO: prove this

### What is a regular grammar?
A regular grammar is a [formal grammar](#what-is-a-grammar) limited to productions of the following forms:

- $A \to a B$
- $B \to c$
- $D \to \epsilon$

Regular grammars also define the regular languages.

### What is a regular expression?
Regular expressions consist of:

#### Primitives:

- $\emptyset$. The empty set.
Reject everything.
- $\epsilon$. The empty string.
Match the empty string: "".
- `c`. Character.
Match a single character.

#### Operations:

If `a` and `b` are regular expressions, then the following are regular expressions:

- `ab`. Catenation.
Match `a` followed by `b`.
- `a|b`. Alternation.
Match `a` or `b`.
- `a*`. Kleene closure.
Match `a` zero or more times.

### What is a finite automaton?
A finite automaton, or finite state machine, has a finite number of states that it transitions between.
When the automaton reads an input symbol, it transitions to another state determined by that symbol.

It has:

- A start state
- A set of states
- A set of accepting states
- A set of transitions from (state, character) to something

### What is a nondeterministic finite automaton?
It is a finite automaton in which we have a choice of where to go next.

The set of transitions is from (state, character) to a set of states.

### What is a deterministic finite automaton?
It is a finite automaton in which we have only one possible next state.
The set of transitions is from (state, character) to a single state.

### What is the difference between deterministic and nondeterministic?
Deterministic finite automata (DFAs) are specific in regard to the input they accept and the output they yield.
The next state the machine goes to is literally determined by its current state and the input symbol.
A nondeterministic finite automaton (NFA) is not as particular: depending on its state and input, it could change into one of several possible new states.

Simply put, a DFA also has no epsilon transitions between states.
When an epsilon transition sits between states, it is not always possible to pick the correct path without looking ahead in the string being processed; that is what makes the machine nondeterministic.
If the correct next step is known at every point, the machine is deterministic.

A common misconception is that a nondeterministic automaton "chooses on a whim" which state to go to; rather, it accepts a string if *any* sequence of choices leads to an accepting state.

### How to convert an NFA to a DFA?
Since both kinds of automata accept exactly the regular languages, an NFA can be converted to an equivalent DFA.

The process is called the powerset (or subset) construction: it takes sets of possible NFA states and turns each reachable set into a single DFA state.
This process is not without cost, because the DFA must track every set of NFA states the input could reach.
Subset-states that turn out to be unreachable from the start state are simply discarded.
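The subset construction just described can be sketched as follows. This is a simplified version that ignores epsilon transitions, and the example NFA with its state names is made up for illustration:

```python
from collections import deque

def nfa_to_dfa(nfa, start, alphabet):
    """Subset construction: each DFA state is a frozenset of NFA states.
    `nfa` maps (state, char) -> set of next states; epsilon transitions
    are omitted to keep the sketch short."""
    start_set = frozenset([start])
    seen = {start_set}
    queue = deque([start_set])
    dfa = {}
    while queue:
        current = queue.popleft()
        for ch in alphabet:
            # Union of everywhere the NFA could go from any state in `current`.
            nxt = frozenset(s for q in current for s in nfa.get((q, ch), set()))
            dfa[(current, ch)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return dfa, seen

# An illustrative NFA over {0,1} for strings ending in "01"
# (nondeterministic on '0' in state A).
nfa = {("A", "0"): {"A", "B"}, ("A", "1"): {"A"}, ("B", "1"): {"C"}}
dfa, states = nfa_to_dfa(nfa, "A", "01")
print(len(states))  # 3 reachable DFA states: {A}, {A,B}, {A,C}
```

Note that only the subset-states actually reachable from the start set are generated, which is why the result here has just three states rather than all eight subsets of {A, B, C}.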
In the worst case, the resulting DFA has $2^N$ states, where $N$ is the number of states that the NFA originally had.

### What is the derivative of a regular expression?
The derivative of a regular expression $r$ with respect to a character $c$ is a regular expression that matches exactly the strings $w$ such that $cw$ matches $r$.
For example, the derivative of `ab|b` with respect to `a` is `b`.

> TODO: expand on derivatives and their use in regular expression matching.

### What is a scanner (lexical analyzer)?
> TODO: Merge these definitions.
Some of these definitions are misconceptions, which we should include to address why they're wrong.

A scanner is a program in a parser that converts characters into tokens.
It already has the information it needs about which characters can be tokenized.
It then matches the input string against the possible tokens and processes that information.

Lexical analysis, or scanning, is the process where the stream of characters making up the source program is read from left to right and grouped into tokens.
Tokens are sequences of characters with a collective meaning.
There are usually only a small number of token kinds for a programming language: constants (integer, double, char, string, etc.), operators (arithmetic, relational, logical), punctuation, and reserved words.

A lexical analyzer is a piece of software that takes a string as input and generates tokens from it based on predefined rules.
This is done to help the actual compilation process later, as well as for error checking.

#### Example

Let's take a look at some basic code with some basic rules:

    int a = sum(7,3)

We define the rules as:

    VARIABLE_TYPE = int | float | double | char
    ASSIGNMENT_OPERATOR = =
    OPEN_PARENTHESIS = (
    CLOSE_PARENTHESIS = )
    DIVIDER = ,
    NUMBER = all numbers
    NAME = any that remain

Using these rules, we can now figure out what everything in this piece of code is:

    VARIABLE_TYPE NAME ASSIGNMENT_OPERATOR NAME OPEN_PARENTHESIS NUMBER DIVIDER NUMBER CLOSE_PARENTHESIS

We can pass that on to the next step of the compilation process, and it will now know what each of those words/symbols means.
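The rule set above can be sketched as a working scanner. The concrete regular expressions and their ordering (keywords before the `NAME` fallback, whitespace skipped) are our assumptions; unknown characters are simply not matched in this sketch:

```python
import re

# Ordered token rules: earlier alternatives win, so the keyword rule
# beats the NAME fallback, and NAME catches whatever identifiers remain.
TOKEN_RULES = [
    ("VARIABLE_TYPE", r"\b(?:int|float|double|char)\b"),
    ("NUMBER", r"\d+"),
    ("ASSIGNMENT_OPERATOR", r"="),
    ("OPEN_PARENTHESIS", r"\("),
    ("CLOSE_PARENTHESIS", r"\)"),
    ("DIVIDER", r","),
    ("NAME", r"[A-Za-z_]\w*"),
    ("SKIP", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_RULES))

def tokenize(source):
    """Yield (token_kind, lexeme) pairs, skipping whitespace."""
    for match in MASTER.finditer(source):
        if match.lastgroup != "SKIP":
            yield match.lastgroup, match.group()

for kind, lexeme in tokenize("int a = sum(7,3)"):
    print(kind, lexeme)
```

Running this on `int a = sum(7,3)` yields exactly the token sequence shown above: `VARIABLE_TYPE NAME ASSIGNMENT_OPERATOR NAME OPEN_PARENTHESIS NUMBER DIVIDER NUMBER CLOSE_PARENTHESIS`.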
A scanner, also known as a lexical analyzer or lexer, is a program that performs lexical analysis.
It converts a sequence of characters into strings of characters with a collective meaning by following rules for identifiers, assignment operators, numbers, etc.
The lexical analyzer takes a source program as input, and produces a stream of tokens as output.

    Source Program -----> Lexical Analyzer ---------> Token stream
                                 |
                                 |
                                 v
                           Error Message

> TODO: Let's use SVG instead of ASCII art.

A scanner is used within lexical analysis to match token character strings that are passed through it.
Scanners use finite-state machines (FSMs) to encode all possible token patterns, so they can quickly process large amounts of data.

A scanner is a program or function that can parse a sequence of characters into usable tokens.
Tokens are typically delimited in some way using characters (e.g., `,`, `|`, `~`).

#### Follow-up:
Examples
> TODO: Add some examples

### What is a lexeme?
A lexeme is a string of characters that follows a set of rules in a language, and which is then categorized by a [token](#what-is-a-token).

### What is a token?

A token is a single element of a programming language.
Tokens could be keywords (words reserved by the language because they have a special meaning), operators (elements usually used to test conditions, e.g., OR, AND, =, >), or punctuation marks.

A token is a string of characters categorized by its type (e.g., IDENTIFIER, NUMBER, COMMA).
Token types are frequently defined by regular expressions.
Tokens are generally formed by having a lexical analyzer read the input sent to it, identify the lexemes in the input, and then categorize them into tokens.
#### Example

Consider this example for clarification.
Input: `int x = 3;`

- `int` is a numeric variable type.
- `x` is an identifier variable.
- `=` is an assignment operator.
- `3` is a number value.
- `;` is the end of a statement.

diff --git a/textbook/03-parsing.md b/textbook/03-parsing.md
new file mode 100644
index 0000000..fca4a68
--- /dev/null
+++ b/textbook/03-parsing.md
@@ -0,0 +1,487 @@

\pagebreak

The next step of the compilation process is parsing.
Parsing takes input from the lexical analysis step and builds a parse tree, which will be used in future steps to develop the machine code.
In this unit, we will define parsing and identify its uses.
We will also discuss two parsing strategies, top-down parsing and bottom-up parsing, examining what it means to approach parsing from each standpoint and taking a look at an example of each.
By the end of the unit, you will understand parsing techniques with regard to compilers and be able to discuss each of the two main approaches.

Parsing
=======

### 3.1 Parsing Overview
Syntax analysis, also known as parsing, is the process of analyzing tokens and recombining them into a syntax tree.

#### 3.1.1 Function
Syntax analysis verifies that the input's syntax is valid.

##### 3.1.1.1 Input: Tokens from Lexical Analysis
Lexical analysis splits the input into tokens, which the syntax analyzer then recombines into a syntax tree.

##### 3.1.1.2 Output: Program Parse Tree
The parse tree is assembled during syntax analysis according to the syntax specification.
The leaves of the parse tree are the tokens generated during lexical analysis.
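A tiny recursive-descent parser illustrates how the leaves of the parse tree are exactly the tokens from lexical analysis. The grammar (`Sum -> NUMBER ('+' NUMBER)*`), the token names, and the list-based tree shape here are illustrative assumptions:

```python
def parse_sum(tokens):
    """Parse tokens against Sum -> NUMBER ('+' NUMBER)*.
    Returns a parse tree as a nested list whose leaves are the tokens."""
    pos = 0

    def expect(kind):
        nonlocal pos
        tok_kind, lexeme = tokens[pos]
        assert tok_kind == kind, f"expected {kind}, got {tok_kind}"
        pos += 1
        return (tok_kind, lexeme)  # each leaf is a token, unchanged

    tree = ["Sum", expect("NUMBER")]
    while pos < len(tokens) and tokens[pos][0] == "PLUS":
        tree.append(expect("PLUS"))
        tree.append(expect("NUMBER"))
    return tree

tokens = [("NUMBER", "1"), ("PLUS", "+"), ("NUMBER", "2")]
print(parse_sum(tokens))  # ['Sum', ('NUMBER', '1'), ('PLUS', '+'), ('NUMBER', '2')]
```

The interior node (`"Sum"`) comes from the grammar, while every leaf is one of the input tokens; a real parser builds deeper trees the same way, one function per nonterminal.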
#### 3.1.2 Examples

##### 3.1.2.1 Given an Arbitrary Function

##### 3.1.2.2 Produce:

###### 3.1.2.2.1 Parser Input

###### 3.1.2.2.2 Parse Tree

#### 3.1.3 Context-Free Grammar

### 3.2 Top-Down Parsing

#### 3.2.1 Traversing a Parse Tree

##### 3.2.1.1 Definition

##### 3.2.1.2 Example

#### 3.2.2 Backus-Naur Form Production Rules

#### 3.2.3 LL Parser

#### 3.2.4 Process

##### 3.2.4.1 Starts at Left-most Symbol Yielded from Production Rule

##### 3.2.4.2 Continues to Next Production Rule for Each Non-Terminal Symbol

##### 3.2.4.3 Proceeds "Down" the Parse Tree

### 3.3 Bottom-Up Parsing

#### 3.3.1 Definition

#### 3.3.2 Example

#### 3.3.3 Process

##### 3.3.3.1 Identify Terminal Symbols First

##### 3.3.3.2 Combine Terminal Symbols to Produce Nonterminals

### What is a context-free language?
A language generated by a [context-free grammar](#what-is-a-context-free-grammar).

### What is a context-free grammar?
A context-free grammar is a [formal grammar](#what-is-a-grammar) in which:

- The left-hand side of every [production](#what-is-a-production) is a single [nonterminal](#what-is-a-nonterminal) symbol.
- The right-hand side of every production is a sequence of terminals and nonterminals.
If the sequence is empty, as in $A \to \epsilon$, the nonterminal [derives](#what-is-a-derivation) the empty string.

#### Examples
This grammar is [context-free](#what-is-a-context-free-grammar), but [improper](#what-is-an-improper-context-free-grammar), because it is impossible to derive $B$ into just terminal symbols.

$B \to hB$

This grammar is [context-free](#what-is-a-context-free-grammar) and [regular](#what-is-a-regular-grammar) (it matches `h*`).

$B \to hB$

$B \to \epsilon$

This grammar is [context-free](#what-is-a-context-free-grammar), but not [regular](#what-is-a-regular-grammar), since it has [left-recursion](#what-is-left-recursion) (it matches balanced parentheses).
$S \to S (S)$
$S \to \epsilon$

#### Follow-up questions

- [How can you tell if a language is context-free](#how-can-you-tell-if-a-language-is-context-free)?
- [Is English context-free](http://cs.haifa.ac.il/~shuly/teaching/08/nlp/complexity.pdf)?
- [When a language is context free, do terminals have only one meaning](#what-are-the-implications-of-chomskys-hierarchy)?
- [Is infinite recursion allowed in context-free grammars](#what-is-left-recursion)?

### How can you tell if a language is context-free?
A language is context-free if some context-free grammar generates it.
A grammar is context-free if the left-hand side of every production is exactly one nonterminal symbol; that is, every rule has the form $A \to \alpha$, where $A$ is a single nonterminal and $\alpha$ is a (possibly empty) sequence of terminal and nonterminal symbols.

Whether a language is context-free can also be tested with a mathematical property that all context-free languages share, known as the pumping lemma.
A language $L$ is context-free only if there exists an integer pumping length $q \ge 1$ such that any string $t$ in $L$ with $|t| \ge q$ can be written as

$t = jmacs$

where $t$ is split into the substrings $j$, $m$, $a$, $c$, and $s$ under the following conditions:

a. $|mac| \le q$
b. $|mc| \ge 1$
c. $jm^nac^ns$ is in the language $L$ for every $n \ge 0$

All of these conditions must hold; a language that fails the pumping lemma is not context-free.

### What is left recursion?
Left recursion occurs in a grammar when a nonterminal can derive a string that begins with that same nonterminal as its leftmost symbol.
For example:

    Z -> Zy | e

### What is the difference between a regular language and a context free language?

[Formal regular expressions](#what-is-a-regular-expression) define [regular languages](#what-is-a-regular-language),
and can be accepted by [deterministic and non-deterministic](#what-is-the-difference-between-deterministic-and-nondeterministic) [finite state machines](#what-is-a-finite-automaton).
Regular languages cannot express arbitrary nesting, such as [recursion](#what-is-recursion).
[Context-free grammars](#what-is-a-context-free-grammar) define context-free languages, and can be accepted by [pushdown automata](#what-is-a-pushdown-automaton).

#### Example:

- The [language](#what-is-a-language) of balanced parentheses is context-free, but not regular.
Thus, it is impossible to construct a regular expression (but possible to construct a context-free grammar) that matches balanced parentheses.

### What is a derivation?

A derivation is the sequence of rule applications by which the start symbol of a grammar is rewritten, step by step, into a sentence of the language.

For example, with the rules $S \to aSb$ and $S \to \epsilon$, the string "aabb" has the derivation $S \Rightarrow aSb \Rightarrow aaSbb \Rightarrow aabb$.

### What is a leftmost derivation?

A leftmost derivation replaces the leftmost nonterminal at each step of the derivation until no nonterminals remain.

    Grammar:                    Leftmost derivation of "678":

    J -> J M                    J
    J -> M                      => J M
    M -> 5 | 6 | 7 | 8 | 9      => J M M
                                => M M M
                                => 6 M M
                                => 6 7 M
                                => 6 7 8

### What is a rightmost derivation?

A rightmost derivation replaces the rightmost nonterminal at each step of the derivation until no nonterminals remain.
    Grammar:                    Rightmost derivation of "765":

    J -> J M                    J
    J -> M                      => J M
    M -> 5 | 6 | 7 | 8 | 9      => J 5
                                => J M 5
                                => J 6 5
                                => M 6 5
                                => 7 6 5

### What is an ambiguous grammar?

A grammar is ambiguous if some string it generates has more than one leftmost derivation.
Equivalently, the grammar yields more than one parse tree for that string.
There isn't always a correct choice for a compiler to make when it processes a language with an ambiguous grammar, because different parse trees can imply different meanings for the same string.

Given the ambiguous grammar below, two different leftmost derivations of "x + x + x" are shown:

    X -> X + X | x

    X => X + X          X => X + X
      => x + X            => X + X + X
      => x + X + X        => x + X + X
      => x + x + X        => x + x + X
      => x + x + x        => x + x + x

### What is a LL(k) grammar?

An LL(k) grammar can be parsed from the top down, reading input from left to right without backtracking and producing a leftmost derivation.
The (k) refers to the number of lookahead tokens the parser may consider when choosing which rule to apply.
A grammar is LL(k) if every derivation step can be decided from the rules of the grammar and at most k tokens of lookahead.

### What is a LR(k) grammar?

An LR(k) grammar can be parsed from the bottom up, reading input from left to right without backtracking and producing a rightmost derivation in reverse.
Like LL(k) grammars, the (k) refers to the number of lookahead tokens.

### What is Backus-Naur Form?
A BNF specification is a set of derivation rules, written as

    <symbol> ::= __expression__

where `<symbol>` is a nonterminal, and the `__expression__` consists of one or more sequences of symbols; alternative sequences are separated by the vertical bar, '|', indicating a choice, the whole being a possible substitution for the symbol on the left.
Symbols that never appear on a left side are terminals.
On the other hand, symbols that appear on a left side are nonterminals and are always enclosed between the pair <>.

The '::=' means that the symbol on the left must be replaced with the expression on the right.


### LL Parser
The LL parser is a top-down parser that works on some context-free grammars.


The basic operation needs three things:

* An input buffer to hold the code to be parsed
* A stack structure to store the terminal and nonterminal symbols (see above for the explanation of terminal and nonterminal symbols)
* A parsing table, which might contain rules such as identifier syntax and reserved words, used to decide what to do with the next token

As an aside, LL parsers become LL(k) parsers for k lookahead tokens.

Example: Consider an LL(1) parser. The first L tells us that the parser reads its input from Left to right, the second L tells us that it produces a Leftmost derivation, and the (1) tells us that it uses one token of lookahead.

Let the following grammar represent a context-free grammar:

    S => a | aS | bS

or

1. S => a
2. S => aS
3. S => bS

Example strings:

    a
    aa
    aabaa
    aaaaba
    baaaaa
    bababa

For our purposes, let us demonstrate an LL(1) parser on the string "bababa".

An LL(1) parse table for this grammar would look like this:

    -----------------------------
    Non-terminal |  a   | b | `$`
    -----------------------------
         S       | 1, 2 | 3 |  -
    -----------------------------

Note that the special symbol `$` denotes the end of the input.
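Such a table can be sketched in code as a dictionary keyed by (nonterminal, lookahead); this representation is an illustrative assumption, not a standard API. Note that rules 1 and 2 both begin with `a`, so the cell for (S, a) holds two candidates; strictly speaking this means the grammar is not LL(1), and the walkthrough below has to resolve that choice.

```python
# LL parse table for S => a (1) | aS (2) | bS (3), keyed by
# (nonterminal, lookahead). Both rules 1 and 2 apply on 'a',
# so the grammar is not strictly LL(1); a true LL(1) table
# would hold at most one rule per cell.
PARSE_TABLE = {
    ("S", "a"): [1, 2],
    ("S", "b"): [3],
    ("S", "$"): [],   # no rule: '$' with S still on the stack is an error
}

def rules_for(nonterminal, lookahead):
    """Return the candidate production rules for this table cell."""
    return PARSE_TABLE.get((nonterminal, lookahead), [])

print(rules_for("S", "b"))  # [3]
```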
What the table says is simply that for our nonterminal symbol S, each column gives the rule to apply for a given lookahead token.
The numbers in the cells correspond to the production rules stated above.

The parser keeps a stack of grammar symbols, starting with [S,`$`], and reads the input:

    b a b a b a `$`

The first step for the parser is to look at the input symbol "b" and the stack-top symbol S.
The table entry for (S, "b") is rule 3, so the parser replaces S with "bS", and the stack becomes:

    [b,S,`$`]

and the output stream records rule #3:

    [3]

Now the stack-top symbol is the terminal "b", which matches the input symbol "b", so the parser pops it and advances the input, leaving:

    stack: [S,`$`]    remaining input: a b a b a `$`

The next input symbol is "a", and here the parser has a choice: both rule 1 (S => a) and rule 2 (S => aS) begin with "a", and our parser only has one token of lookahead.
(This is exactly the conflict visible in the parse table; the grammar is not truly LL(1).)
Choosing rule 1 would leave input behind after the stack empties, so the parser must choose rule 2, replacing S with "aS":

    stack: [a,S,`$`]    output: [3,2]

The "a" on top of the stack matches the input "a", so it is popped and the input advances:

    stack: [S,`$`]    remaining input: b a b a `$`
The parse continues in the same way. For brevity's sake I will keep it shorthand:

On input "b", apply rule 3 and match the "b":

    stack: [S,`$`]    remaining input: a b a `$`    output: [3,2,3]

On input "a" (with more input after it), apply rule 2 and match the "a":

    stack: [S,`$`]    remaining input: b a `$`    output: [3,2,3,2]

On input "b", apply rule 3 and match the "b":

    stack: [S,`$`]    remaining input: a `$`    output: [3,2,3,2,3]

On the final "a", rule 1 (S => a) is now the right choice, since no input follows.
(Strictly, seeing that this "a" is the last one requires looking past a single token of lookahead; this is the price of our non-LL(1) grammar.)
The parser replaces S with "a", matches it, and advances:

    stack: [`$`]    remaining input: `$`    output: [3,2,3,2,3,1]

Once the parser reaches the special terminator character on both the stack and the input, it knows it has done its job and accepts.
The output stream [3,2,3,2,3,1] lists the productions used from left to right, which is exactly a leftmost derivation of "bababa"; it would be a good exercise to check this derivation by hand.

Exercises

1. Given the same grammar and production rules, what would be the output stream produced by an LL(1) parser for the string "aabaa"?
2.
If we added a production rule S => acS, what would the parse table and output stream be for the string "aaacaca"?


### What is a pushdown automaton?
A pushdown automaton (PDA) is a finite state machine with [stack](#what-is-a-stack) memory.

Each transition is chosen from the current state, the current input symbol, and the symbol at the top of the stack, and may push or pop stack symbols.


> TODO: It'd be nice to have a picture of a pushdown automaton, in a vector format such as SVG.

### What is a deterministic pushdown automaton?

A deterministic pushdown automaton (DPDA) is a pushdown automaton in which at most one transition is possible for any combination of state, input symbol, and stack-top symbol.
Operations are performed only on the top of the stack.
Grammars accepted by deterministic pushdown automata must not be ambiguous: since a DPDA has only one possible action at any moment, it cannot explore alternative parses, and not all context-free languages can be recognized deterministically.

### What is a nondeterministic pushdown automaton?

A nondeterministic pushdown automaton (NPDA) may have several possible transitions for a given input symbol and stack configuration.
NPDAs are capable of handling any context-free grammar and conceptually create multiple branches to test all possibilities.
Some inputs may even yield multiple outcomes.
To handle such cases, the automaton makes use of backtracking.
Nondeterministic pushdown automata are slower in practice than deterministic pushdown automata, because they must consider more possibilities.

### What is a parser?
A parser:

- Checks for [syntax errors](#what-is-a-syntax-error)
- Constructs a [parse tree](#what-is-a-parse-tree) or an [abstract syntax tree](#what-is-an-abstract-syntax-tree).
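To sketch both duties at once, here is a minimal recursive-descent parser (a common top-down technique) for the balanced-parentheses grammar $S \to (S)S \mid \epsilon$ discussed earlier. The nested-list tree representation and function names are illustrative assumptions.

```python
# Minimal recursive-descent parser for the grammar S -> ( S ) S | epsilon.
# It checks syntax and builds a nested-list parse tree; the representation
# (each S node is a list of its children) is an illustrative choice.
def parse(text):
    pos = 0

    def parse_S():
        nonlocal pos
        children = []
        if pos < len(text) and text[pos] == "(":
            pos += 1                    # consume '('
            inner = parse_S()
            if pos >= len(text) or text[pos] != ")":
                raise SyntaxError(f"expected ')' at position {pos}")
            pos += 1                    # consume ')'
            children = [inner, parse_S()]
        return children                 # empty list = S -> epsilon

    tree = parse_S()
    if pos != len(text):
        raise SyntaxError(f"unexpected {text[pos]!r} at position {pos}")
    return tree

print(parse("(())()"))  # [[[], []], [[], []]]
```

A real parser would consume tokens from a scanner rather than raw characters; single characters stand in for tokens here to keep the sketch short.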
Typically, a [scanner](#what-is-a-scanner) first [tokenizes](#what-is-tokenization) the source code into a [token](#what-is-a-token) [sequence](#what-is-a-sequence) that the parser reads as input.
However, scannerless parsers work directly with source code as input.

Parsers do not [produce assembly or object code](#what-is-code-generation).

Follow-ups:

- [How do parsers work](#how-do-parsers-work)?

### What is a syntax error?

The syntax of a language is the set of rules that determines what is allowed in that language.
It specifies how a program can be written using statements, loops, functions, and so on, and how each of these constructs is formed.
In short, the syntax of a programming language is the grammar of that language.
A syntactically correct program can be successfully compiled or interpreted into machine language.
A syntactically incorrect program cannot be compiled; it is said to contain syntax errors.
Syntax errors are errors in the structure of the language: when a program does not follow the syntax of its language, it has a syntax error.

For example, in the Java programming language, a semicolon (;) is required at the end of every statement.
If a semicolon is missing after a statement in Java, you will get a syntax error when you try to compile your program.
Let's see this in a real statement in Java.
Say a programmer wants to display "Hello World!" on the monitor using Java:

    System.out.println("Hello Word!");

This statement displays "Hello Word!" (without the quotation marks) when executed, even though the programmer meant to type "Hello World!". It is a syntactically correct statement and will compile without any error.

    System.out.println("Hello World!")

In the second example, the program will get a syntax error even though "Hello World!" is typed correctly.
As we can see, the programmer forgot to type the semicolon at the end of the statement.
In Java, the semicolon at the end of a statement is part of the syntax.
Hence, a program that does not follow the syntax of the language will get a compilation error, because the program contains syntax errors.
A programmer therefore has to have detailed knowledge of the syntax of a language to be an expert in that programming language.

A parser reads the token sequence that the scanner produced from the source code.
It checks that the tokens form a syntactically valid structure; if there are no syntax errors, later phases convert the result to assembly or object code.

### What is a parse tree?

A parse tree for a grammar G is a tree where

- The root is the start symbol for G
- The interior nodes are the nonterminals of G
- The leaf nodes are the terminal symbols of G.
- The children of a node T (from left to right) correspond to the symbols on the right hand side of some production for T in G.


Every terminal string generated by a grammar has a corresponding parse tree; every valid parse tree represents a string generated by the grammar (called the yield of the parse tree).

### How do parsers work?
diff --git a/textbook/04-ast-and-symbol-tables.md b/textbook/04-ast-and-symbol-tables.md
new file mode 100644
index 0000000..94c85d0
--- /dev/null
+++ b/textbook/04-ast-and-symbol-tables.md
@@ -0,0 +1,90 @@

\pagebreak



Abstract Syntax Trees and Symbol Tables
=======================================
### What is an abstract syntax tree?
An abstract syntax tree is the data structure that compilers and interpreters use to perform the actual code generation.
It represents the hierarchy of the programmer's code.
An important note is that not all syntax of the code appears in the tree; grouping parentheses, for example, are omitted.
#### Example

    int doSomething(int a)
    {
        if(a > 10)
            a = a % 10;
        return a;
    }

![Abstract syntax tree for `doSomething`.](images/ast-example.svg)

An [abstract syntax tree (AST)](http://en.wikipedia.org/wiki/Abstract_syntax_tree) is simply a tree representation of the structure of source code.
Each node of the tree represents a part of the code.

The "abstract" part of the AST comes from the fact that the tree does not represent the syntax down to the character level.
Tokens like parentheses and brackets are not nodes on the tree; they are represented implicitly by the structure of the tree itself.

If the code cannot be represented accurately as a tree, the parsed language is not [context-free](#what-is-a-context-free-language).

In addition to representing the structure of the code, the AST is the output of a parser.
Every node in the tree is a structure for a particular kind of construct, and each node is created by a function that returns a pointer to the structure for that node.

#### Example

    struct Signature
    {
        struct AttributeList *attributes;
        struct Identifier *name;
        struct DeclarationList *arguments;
        struct TerminationSet *responses;
    };
    extern struct Signature *node_signature (
        struct AttributeList *attributes,
        struct Identifier *name,
        struct DeclarationList *arguments,
        struct TerminationSet *responses);

The example above is from the following website:
http://www.ansa.co.uk/ANSATech/95/Primary/155101.pdf

### What is the difference between an abstract syntax tree and a parse tree?

A parse tree records how the rules of the grammar matched the input text, whereas an abstract syntax tree records the structure of the input.
The syntax tree is less sensitive to the grammar: it focuses on the structure of the language rather than on the grammar used to parse it.
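To make the distinction concrete, here is a small AST sketch for the expression `a % 10` from the `doSomething` example above; the node class names are illustrative assumptions, not a fixed compiler API. A parse tree for `(a % 10)` would also record the parentheses, whereas the AST below is identical with or without them.

```python
# A tiny AST sketch: nodes for the expression a % 10 from the
# doSomething example. Grouping parentheses never appear as nodes;
# the tree structure alone encodes precedence. Names are illustrative.
class Num:
    def __init__(self, value):
        self.value = value

class Var:
    def __init__(self, name):
        self.name = name

class BinOp:
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right

def evaluate(node, env):
    """Walk the AST, looking up variables in env."""
    if isinstance(node, Num):
        return node.value
    if isinstance(node, Var):
        return env[node.name]
    if node.op == "%":
        return evaluate(node.left, env) % evaluate(node.right, env)
    raise ValueError(f"unknown operator {node.op}")

# a % 10 -- the same AST whether the source read "a % 10" or "(a % 10)"
tree = BinOp("%", Var("a"), Num(10))
print(evaluate(tree, {"a": 23}))  # 3
```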
diff --git a/textbook/05-semantic-analysis.md b/textbook/05-semantic-analysis.md
new file mode 100644
index 0000000..e3d767f
--- /dev/null
+++ b/textbook/05-semantic-analysis.md
@@ -0,0 +1,140 @@

\pagebreak




Semantic Analysis
=================

### What is semantics?

Semantics is the field concerned with the rigorous mathematical study of the meaning of programming languages.
It evaluates the meaning of the syntactically legal strings of a specific programming language by describing the computation they perform.
Syntactically illegal strings have no meaning to evaluate; they simply fail to compile.
Semantics describes the process a computer follows when executing a program in that language.

### What is static semantics?

Static semantics are enforced at compile time.
Examples of static semantic errors are undeclared variables and type mismatches.
These errors can be detected by the parser, or in separate semantic analysis passes.


The semantic analyzer starts by traversing the [abstract syntax tree](#what-is-an-abstract-syntax-tree) created by the parser.
For each scope in the program, the semantic analyzer processes the declarations and adds new entries to the [symbol table](#abstract-syntax-trees-and-symbol-tables).
At this point, the semantic analyzer reports variables with multiple declarations.
Next, the analyzer processes the statements in the program.
This serves the dual purpose of finding uses of undeclared variables and linking the nodes of the AST to the symbol table.
Lastly, the semantic analyzer processes all of the statements in the program again.
This time, the analyzer uses the symbol table information from the previous step to find type errors.



### What is runtime semantics?

Runtime semantics are enforced during the execution of the program.
Examples of this include division by zero and out-of-bounds array indexing.
These checks display informative error messages to the user.
Conversely, adding these checks often results in slower execution times.

Dynamic checking is also used by higher-level languages such as Java and C++ to allow for polymorphism.
Since objects can be of multiple types, it is difficult to do all type checking during compilation.
Two separate paths in the code can mean the difference between a shape being a square or a circle.
Instead, at run time, when we create a new Circle object and try to place it into a variable declared as a Shape, a separate check in the running program determines whether or not the circle can be treated as a shape.

### What is type-checking?

Type-checking verifies that each operation in a program receives operands of the types it expects.
Multi-stage compilers check the semantics of a program in a nonlinear fashion by walking the parse tree, to which a symbol table is added during semantic analysis.

### What is space-time complexity?

When dealing with code there are a few considerations to keep in mind.
The first thing to think about is how much memory the application may use.
The next is how much time the application will take to run.
The goal is, of course, to keep both time and memory usage as low as possible.
The problem is that in some instances we must trade space for time or time for space.
In imperative languages, we can judge how fast a program is by counting how many operations it performs.
With regard to memory, we can look at how much is allocated at any given point.
Because we can examine both dimensions of the complexity, we can adjust them by changing our algorithms or using different ones altogether.
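The space-for-time trade can be made concrete with memoization, which spends memory on a cache of results to cut running time. This sketch is a generic illustration, not specific to compilers:

```python
from functools import lru_cache

def fib_slow(n):
    """Exponential time, constant extra space."""
    return n if n < 2 else fib_slow(n - 1) + fib_slow(n - 2)

@lru_cache(maxsize=None)
def fib_fast(n):
    """Linear time, but spends memory caching every intermediate result."""
    return n if n < 2 else fib_fast(n - 1) + fib_fast(n - 2)

print(fib_fast(30))  # 832040
```

Both functions compute the same values; the cached version is dramatically faster at the cost of storing every intermediate result.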
For a more in-depth analysis of how to analyze the space-time complexity of a program, see:
[A Functional Semantics for Space and Time by Catherine Hope](http://www.cs.nott.ac.uk/Research/fop/hope-thesis.pdf)

#### Object Binding

In programming, object binding is the association of objects with identifiers.
This binding can occur both statically and dynamically.
In C, a direct function call is statically bound (bound before the program is run).
In C++, a virtual method call is dynamically bound (bound during runtime).
Since, generally speaking, the specific type of a polymorphic object is not known before runtime, the executed function must be dynamically bound.

\ No newline at end of file
diff --git a/textbook/06-intermediate-representation.md b/textbook/06-intermediate-representation.md
new file mode 100644
index 0000000..5b02199
--- /dev/null
+++ b/textbook/06-intermediate-representation.md
@@ -0,0 +1,95 @@

\pagebreak



###### Types

###### Types of Types

###### Primitive
A primitive type is one of the base units in which data can be stored.
In the C language, for example, the variable types char, int, and double are each primitive types.
C also has structures, which combine primitive types into one variable type.
While the data inside a structure can be primitive, the structure itself is not.

    struct Color
    {
        int red;
        int green;
        int blue;
    };

Color in this example is not a primitive type, because it is not a base unit of the language.
red, green, and blue are primitives because they are integers, and integers are a base unit in C.

###### Reference

A reference is a value that tells a program where specific data exists.
References are commonly referred to as pointers because they "point to" the data they reference.
This means that the pointer holds the address where the data is stored in memory rather than the actual data.
References can also point to the first element of an array.
The array can then be traversed by incrementing the pointer by the size of the type held by the array.

###### Null

Null is the computational equivalent of nothing.
It can be used in many ways and is implemented differently in different programming languages.
Most commonly, null equals zero or a pointer to address zero (memory location 0x0000).

###### Object

###### Function

###### Type Checking

###### Static Typing

###### Dynamic Typing

###### Strong Typing

###### Weak Typing
\ No newline at end of file
diff --git a/textbook/07-optimization.md b/textbook/07-optimization.md
new file mode 100644
index 0000000..f3a94c6
--- /dev/null
+++ b/textbook/07-optimization.md
@@ -0,0 +1,323 @@

\pagebreak



Optimization
============

### What is optimization?
Optimization is the penultimate [compiler phase](#what-are-the-phases-of-a-compiler).
Optimizers improve code performance, size, and efficiency toward an optimum.

#### Example optimizations:

- [Peephole optimization](#what-is-peephole-optimization)
- [Loop unrolling](#what-is-loop-unrolling)
- [Method inlining](#what-is-method-inlining)
- [Dead code](#what-is-dead-code) elimination

### What is the point of optimization?
Unoptimized programs do not fully exploit underlying hardware capabilities, since [high-level languages](#what-is-a-high-level-langauge) abstract away from machine code.
Therefore, optimization can make programs:

- Faster.
- More efficient.
- Smaller.

### What is peephole optimization?
Peephole optimizers replace small subsequences of instructions with fewer or faster instructions.
The sequence of instructions that the optimization operates on is called the "peephole" or "window".
#### Example
Depending on the target language, a peephole optimizer would replace the following code:

    j = i * 16;

with this faster code (a left bit shift is faster than generic multiplication):

    j = i << 4;

### What is static single assignment (SSA)?

### What is loop unrolling?
Loop unrolling, or loop unwinding, removes or precalculates control operations.
The optimization improves speed by removing expensive branches, but comes at the cost of space complexity.

Loop unrolling includes these optimizations:

- Precalculating the end-of-loop condition
- Precalculating pointer increments
- Optimizing memory access
- Running independent iterations in parallel

#### Example

Original code:

    for (int i = 0; i < 5; i++)
    {
        if (i == 0)
            Console.WriteLine("I'm the beginning");
        else if (i % 2 == 0)
            Console.WriteLine("I'm even");
        else
            Console.WriteLine("I'm odd");
    }

Unrolled loop:

    // i = 0
    Console.WriteLine("I'm the beginning");
    // i = 1
    Console.WriteLine("I'm odd");
    // i = 2
    Console.WriteLine("I'm even");
    // i = 3
    Console.WriteLine("I'm odd");
    // i = 4
    Console.WriteLine("I'm even");

### What is method inlining?
Method inlining is a compiler optimization that replaces a function call with the body of the called function.
This optimization can improve running time by removing call overhead, but may make the program bigger.

### What is dead code?
Dead code is any code whose result is never used, and therefore a waste of resources.

Examples:

- Definitions of uncalled functions
- Computations that do not affect output
- `if (false) { /* Dead code */ }`


### Complexity


#### Many Optimizations Are NP-Complete

In compiler design, many code optimization problems are NP-complete or even undecidable.
NP-complete problems are decision problems for which finding a solution takes a long and inefficient amount of time, yet verifying a solution can be done quickly.
NP stands for "nondeterministic polynomial time", referring to the running time of an algorithm that can exhibit different behaviors on different runs.
Undecidable problems are decision problems for which no single algorithm can always lead to a correct answer.


#### Memory and Other Limitations

Memory limitations restrict optimization, as optimization is a CPU-heavy and memory-heavy process.
In addition, even the programmer's time spent waiting for the compiler to finish places practical restrictions on optimization.

### Effectiveness


#### What is the Target Architecture?

Architectural patterns are patterns that represent general functions required by the system.
The target architecture is one of those patterns: it gives programmers a base for what they want to achieve.
The programmer then programs around this architecture, taking the details into account.
The programmer cannot always accommodate all of the details, though, as they may conflict with the architecture.

##### The Machine on which the Program Will Run

When designing a program, the programmer should always have the target computer in mind.
The target computer may have fewer capabilities than the one on which the software is being developed: a program may run on your own machine, but not on the machine on which it is meant to run.

##### What are the Factors that affect Effectiveness?

The factors that decide whether the program will run well on the target computer are: CPU registers, pipelining, cache, and the hardware design.

###### What are CPU Registers?

CPU registers are a small amount of storage on the processor that can be accessed more quickly than main memory.
Registers can improve the effectiveness of compiled programs: values that are accessed frequently can be kept in registers to improve performance.

###### What is Pipelining?
A pipeline is a set of data processing elements connected so that the output of one element is the input of the next.
The idea is to get work done more efficiently, like an assembly line.
Each element is in charge of one part, and the elements often operate in parallel so that everything gets done faster and more efficiently.

###### What is a Cache?
A cache is storage where the computer keeps data temporarily so that future requests for that data can be served faster.
Caches are generally small, for cost efficiency and efficient use of data.
Effective use of the cache allows for great efficiency.

###### What is Hardware Design?

The hardware design matters for effectiveness because, if you don't know what the hardware is going to be, you cannot write for that hardware.
For example, when writing in assembly, you cannot write for an Intel processor architecture when the computer has an AMD processor.
It is good to know the hardware design, as it leads to an effective program.

#### Host Architecture


##### The Machine Doing the Compilation


##### Factors


###### CPU Speed

CPU speed can be a major factor in compilation times.
The speed at which a compiler can perform lexical analysis, parsing, code generation, optimization, and so on depends heavily on the efficiency of the CPU.

###### Pipelining

Software pipelining is a method of optimizing loops via out-of-order execution (instructions are executed in order of data availability rather than code position), where the reordering is done by a compiler rather than by the processor.

###### RAM Capacity and Architecture

Like CPU speed, RAM speed can be a major factor in compilation times.
Many CPU-bound operations rely on RAM to temporarily store information about the operations being executed; thus the amount of data that can be stored, and the speed at which it can be read and written, affect the speed at which these operations complete.

###### Hard Disks
Certain phases of compilation, such as the reading of source files and headers and the writing of object files, are disk-intensive.
The read/write speed of the disk directly affects the rate at which these operations can be executed.
While disk read/write speeds do not have as large an impact on compilation times as CPU and RAM performance, they still affect the overall execution time.

##### Program Usage


###### Release vs Debugging

In a debug build, complete symbolic debug information is emitted for testing and debugging purposes.
Code optimization is not a priority in debug builds.
Release builds do not emit the symbolic debugging info, reducing the size of the final executable file.
The speed of execution may vary between debug and release builds depending on the compiler.

###### Peephole Optimization
Peephole optimization works on very small sets of instructions in generated code at a time.
It refactors each set so that it can be replaced by a faster, more efficient code segment.

Take the statements:

    a = b + c;
    d = a + e;

What is actually implemented in machine code is:

    MOV b, R0
    ADD c, R0
    MOV R0, a
    MOV a, R0
    ADD e, R0
    MOV R0, d

This machine code can be optimized.

It can be condensed to:

    MOV b, R0
    ADD c, R0
    MOV R0, a
    ADD e, R0
    MOV R0, d

By taking the code at its base level and optimizing it there, the code will run faster and execute fewer instructions to achieve the same result.
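The transformation above can be sketched as a tiny peephole pass over a list of instructions; the tuple instruction format is an illustrative assumption, not any real assembler's syntax:

```python
# Peephole pass: delete a redundant "MOV x, y" immediately followed by
# "MOV y, x" (a reload of a value that is already in place).
# The ("OP", src, dst) tuple format is an illustrative assumption.
def peephole(instructions):
    out = []
    i = 0
    while i < len(instructions):
        cur = instructions[i]
        nxt = instructions[i + 1] if i + 1 < len(instructions) else None
        if (nxt and cur[0] == nxt[0] == "MOV"
                and cur[1] == nxt[2] and cur[2] == nxt[1]):
            out.append(cur)   # keep the store, drop the redundant reload
            i += 2
        else:
            out.append(cur)
            i += 1
    return out

code = [("MOV", "b", "R0"), ("ADD", "c", "R0"), ("MOV", "R0", "a"),
        ("MOV", "a", "R0"), ("ADD", "e", "R0"), ("MOV", "R0", "d")]
print(peephole(code))  # the six instructions condensed to five
```

Running the pass on the six-instruction sequence from the text yields the five-instruction sequence shown above.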
+
diff --git a/textbook/08-code-generation.md b/textbook/08-code-generation.md
new file mode 100644
index 0000000..39c0dc5
--- /dev/null
+++ b/textbook/08-code-generation.md
@@ -0,0 +1,73 @@
+\pagebreak
+
+
+
+Code generation
+===============
+
+### What is code generation?
+Code generation is the final [compiler phase](#what-are-the-phases-of-a-compiler).
+It produces code in the target language, which is typically a machine language (e.g., x86, ARM), but may be assembly or even a high-level language.
+
+The code generator is distinct from the [parser](#what-is-a-parser) and the [translator](#what-is-a-translator).
+
+Code generators try to optimize the generated code in several ways, including using faster instructions, using fewer instructions,
+exploiting available registers, and avoiding redundant computations.
+
diff --git a/textbook/background.md b/textbook/background.md
new file mode 100644
index 0000000..6d3474d
--- /dev/null
+++ b/textbook/background.md
@@ -0,0 +1,120 @@
+\pagebreak
+
+Background
+==========
+What should I know already to write a compiler?
+
+- Most of the ACM [Body of Knowledge](http://www.acm.org/education/curricula/ComputerScience2008.pdf)
+- [How to be a programmer](http://samizdat.mines.edu/howto/HowToBeAProgrammer.html)
+- [Discrete structures](#discrete-structures)
+- [Algorithms and data structures](#algorithms-and-data-structures)
+- [Software Engineering](#software-engineering)
+
+Discrete structures
+-------------------
+Writing a compiler requires familiarity with discrete structures.
+
+- What does [discrete](#what-does-discrete-mean) mean?
+- What is the difference between [sets](#what-is-a-set), [sequences](#what-is-a-sequence), and [bags](#what-is-a-bag)?
+- What is the difference between an [alphabet](#what-is-an-alphabet), a [string](#what-is-a-string), and a [character](#what-is-a-character)?
+- What is a [stack](#what-is-a-stack)?
+
+### What does discrete mean?
+
Discrete means:

- The opposite of continuous
- Separate
- Distinct

Discrete does not mean:

- Respecting privacy
- Avoiding attention

That's discreet.

### What is a set?
An unordered, possibly infinite, collection of unique objects.

Examples and counterexamples:

- {apple,pear,banana} is a set.
- {apple,apple,pear} is not a set, because apple is not unique.
It's a [bag](#what-is-a-bag).
- The integers form an infinite set.
- {1,2,3} is the same set as {3,1,2}, because order does not matter.

### What is a stack?
A sequence that is accessed only at one end, called the top: objects are pushed onto and popped off the top, so the last object pushed is the first object popped.

### What is a bag?
An unordered, possibly infinite, collection of objects.

Examples:

- {apple,apple,pear} is a bag.
- {1,2,1} is the same bag as {1,1,2}, because order does not matter.

### What is a sequence?
An ordered collection of objects.

Examples:

- (apple, pear, banana) is a sequence.
- (apple, apple, pear) is a sequence, because objects need not be unique.
- (1,2,3) is a different sequence from (3,1,2), because order matters.

### What is a string?
A sequence of characters.

Examples:

- "This is a string"
- "Strings are surrounded by quote marks"
- "This string" is not "this string." Case and punctuation matter.

### What is a character?
A symbol in an alphabet.

### What is an alphabet?
A finite set (of symbols).

Examples:

- The Roman and Greek alphabets
- Numerals
- ASCII
- Unicode

Algorithms and data structures
------------------------------
Writing a compiler requires working with trees.

- What is a tree?
- What is the difference between an inorder, preorder, and postorder traversal?

### What is code?
A [sequence](#what-is-a-sequence) of [instructions](#what-is-an-instruction).

### What is an instruction?
A basic operation that a machine can perform.
+
Examples:

- Arithmetic instructions (e.g., addition, subtraction, multiplication, division)
- Logic and bitwise instructions (e.g., and, or, not, exclusive or, shift-left, shift-right)
- Control instructions (e.g., goto, jump)
- Relational instructions (e.g., equal, less than, greater than)
- Data movement instructions (e.g., move)

Software Engineering
--------------------
Knowledge of some design patterns, version control, and testing is necessary to write a compiler.

- [What are design patterns?](#what-are-design-patterns)
- [What is the visitor design pattern?](#what-is-the-visitor-design-pattern)
- What is the composite design pattern?
- What is the strategy design pattern?

### What are design patterns?

### What is the visitor design pattern?

From c6e0ab9d70d1125a874b706c2883544444fa7d2d Mon Sep 17 00:00:00 2001
From: Philip On
Date: Sun, 5 Aug 2012 21:00:34 -0400
Subject: [PATCH 2/4] Shortened some sentences and fixed more grammatical issues

---
 textbook/01-overview.md          | 22 +++++---
 textbook/02-lexical-analysis.md  | 92 +++++++++++++++++---------------
 textbook/03-parsing.md           | 21 +++++---
 textbook/05-semantic-analysis.md |  3 +-
 4 files changed, 80 insertions(+), 58 deletions(-)

diff --git a/textbook/01-overview.md b/textbook/01-overview.md
index 16c6b8d..731fb14 100644
--- a/textbook/01-overview.md
+++ b/textbook/01-overview.md
@@ -338,14 +338,20 @@ Executable Code: is the code that runs on your machines, which is usually linked
 Last, Object Code: is act as the transitional form between the source code and the Executable code.

 ### Platform Independent Compilers
-Platform Independent compilers compiles the source code irrespective of the platform(operating systems) on which it is being compiled.
-Java compiler is one example of Platform Independent Compilers. All operating system uses same java compiler.
-When java compiler compiles the java source code, it outputs java byte code which is not directly executable.
+Platform Independent compilers compile the source code irrespective of the platform (operating system) on which it is compiled.
+
+The Java compiler is one example of a platform-independent compiler.
+All operating systems use the same Java compiler.
+
+When the Java compiler compiles Java source code, it outputs Java bytecode, which is not directly executable.
+
 The java byte code is interpreted to machine language through JVM(Java Virtual Machine) in respective platform.

 ### Hardware Compilation
-Hardware compilation is the process of compiling a program lagnuage into a digital circuit.
-Hardware compilers produce implementation of hardware from some specification of hardware.
+Hardware compilation is the process of compiling a programming language into a digital circuit.
+
+Hardware compilers produce an implementation of hardware from some specification of hardware.
+
 Instead of producing machine code which most of the software compiler does, hardware compiler compiles a program into some hardware designs.

 # Compiler Design
@@ -398,7 +404,8 @@ Athough it adds another step, IR provides advantage of abstraction and cleaner s
 Compiler analyzes the source code to create intermediate representation of source code in front end.

 #### Manages Symbol Table
-Symbol table is compile-time data structure which holds information needed to locate and relocate a program's symbolic definitions and references.
+A symbol table is a compile-time data structure which holds the information needed to locate and relocate a program's symbolic definitions and references.
+
 Compiler manages symbol table when it analyzes the source code.
 This is done in several steps.
@@ -418,7 +425,8 @@ There are usually only a small number of tokens for a programming language: cons
 Lexical analyzer is responsible for lexical analysis.

 #### Syntax Analysis
-In this phase, the token from lexical analysis is parsed to determine the grammatical structure of source code.
+In this phase, the tokens from lexical analysis are parsed to determine the grammatical structure of the source code.
+
 Syntax analysis is closely related to semantic analysis.
 Normally, a parse tree is built in this process.
 It determines if the source code of the program is syntactically correct or not so that the program can be further processed for semantic analysis.

diff --git a/textbook/02-lexical-analysis.md b/textbook/02-lexical-analysis.md
index 225bc7f..182954a 100644
--- a/textbook/02-lexical-analysis.md
+++ b/textbook/02-lexical-analysis.md
@@ -58,14 +58,14 @@ Lexical Analysis

 To find if a language is regular, one must employ a *pumping lemma*:

 - All sufficiently long words in a regular language may be "pumped."
-  - A middle section of the word can be repeated any number of times to produce a new word which also lies within the same language.
+  - The middle section of the word can be repeated any number of times to produce a new word that also lies within the language.
  - i.e.
abc, abbc, abbbc, etc.
- In a regular language $L$, there exists an integer $p$ (the "pumping length"), depending only on the language, such that every string $w \in L$ with $|w| \ge p$ can be written as $w = xyz$ satisfying the following conditions:
 1. $|y| \ge 1$
 2. $|xy| \le p$
 3. for all $i \ge 0$, $xy^iz \in L$
-  - Where $y$ is the substring that can be pumped.
+  - Where $y$ is the pumpable substring.

[If the language is finite, it is regular](#why-are-all-finite-languages-regular)?
@@ -95,7 +95,7 @@ Match a single character.

 #### Operations:

-If `a` and `b` are regular expressions, then the following are regular expressions:
+If `a` and `b` are regular expressions, then the following are also regular expressions:

- `ab`. Catenation.
Match `a` followed by `b`.

- `a|b`. Alternation.
Match `a` or `b`.

- `a*`. Kleene closure.
Matches `a` zero or more times.

### What is a finite automaton?
-A finite automaton, or finite state machine, can only be in a finite number of states in which it transititons between.
+A finite automaton, can only be in a finite number of states in which it transitions between.

Consider, for example, an automaton that sees an input symbol.
It then transitions to another state based on the next input symbol.

It has:

@@ -120,7 +120,7 @@ It has:

 ### What is a nondeterministic finite automaton?
 It is a finite automaton in which we have a choice of where to go next.
-The set of transitions is from (state, character) to set of states.
+The set of transitions is from (state, character) to a set of states.

 ### What is a deterministic finite automaton?
 It is a finite automaton in which we have only one possible next state.
@@ -130,27 +130,28 @@ The set of transitions is from (state, character) to state.

 ### What is the difference between deterministic and nondeterministic?
 Deterministic finite automata (DFAs) are specific in regard to the input that they accept and the output yielded by the automaton.
-The next state that the machine goes to is literally determined by the input string it is given.
-A nondeterministic finite automaton is not as particular, and depending on its state and input, could change into a several
-possible new states.
+The input string determines the next state that the machine goes to.
+A nondeterministic finite automaton is not as particular.
+Depending on its state and input, it could move to any of several possible new states.

-Simple put the difference between a DFA and an NFA is that a DFA has no epilsons between the transitional states.
-The reasons that this makes a difference is that when we place an epsilon between our states it is not always possible to figure out the correct path to go without looking aheard in the current string we are parsing.
-This means that we are using something that is nondeterminsitic.
-Where as if we know the correct path to go at all times, it is determnistic.
+The difference between a DFA and an NFA is that a DFA has no epsilon transitions between states.

When epsilon transitions are present, it is not always possible to figure out the correct path without looking ahead in the string being parsed.
This is what makes a finite automaton nondeterministic.
Whereas if we know the correct path to take at all times, it is deterministic.

Deterministic and nondeterministic automata are similar, with one distinct difference between them.
A nondeterministic automaton may choose among several possible next states, while a deterministic automaton always has exactly one.

### How to convert an NFA to a DFA?
Since both kinds of automata accept exactly the regular languages, any NFA can be simplified and converted to an equivalent DFA.
The process, referred to as the powerset (or subset) construction, takes the possible states of the NFA and translates them
into a map of states accessible to a DFA.
This process is not without a cost.
+
+Deterministic finite automata can be much larger than their nondeterministic counterparts; the conversion may create many new states.
+All states of the NFA still exist inside the subset states, but subsets that are unreachable from the start state can be discarded.
+A converted NFA has at most 2^N states, where N is the number of states before conversion.

### What is the derivative of a regular expression?

### What is a scanner (lexical analyzer)?
> TODO: Merge these definitions.
Some of these definitions are misconceptions, which we should include to address why they're wrong.

A scanner is a program in a parser that converts characters into tokens.
-This already has the information it needs about whatever characters that can be tokenized.
-This then matches any string that was put in to possible tokens and processes said information.
+It contains information about what it can tokenize.
+It matches input strings to possible tokens and processes the information.

-Lexical analysis or scanning is the process where the stream of characters making up the
-source program is read from left-to-right and grouped into tokens.
+Lexical analysis, or scanning, reads the stream of characters making up the source program from left to right and groups them into tokens.

Tokens are sequences of characters with a collective meaning.
There are usually only a small number of tokens for a programming language: constants (integer, double, char, string, etc.), operators (arithmetic, relational, logical), punctuation, and reserved words.

-A lexical analyzer is a piece of software that takes in a string as input, from that string it generates tokens based off of pre-defined rules.
-This is done to help for the actual compilation proccess later, as well as error checking.
+A lexical analyzer is a piece of software that takes in a string as input, then generates tokens based on pre-defined rules.
+This helps with the compilation process and error checking later on.

#### Example
-Lets take a look at some basic code with some basic rules.
+
int a = sum(7,3)

-We define the rules as.
+The rules are defined as follows:
+
VARIABLE_TYPE = int | float | double | char
ASSIGNMENT_OPERATOR = =
OPEN_PARENTHESIS = (
CLOSE_PARENTHESIS = )
DIVIDER = ,
NUMBER = all numbers
NAME = any that remain

-Using these rules we can now figure out what everything in this piece of code is.
+These rules simplify understanding the code sample below:

VARIABLE_TYPE NAME ASSIGNMENT_OPERATOR NAME OPEN_PARENTHESIS NUMBER DIVIDER NUMBER CLOSE_PARENTHESIS

-We can pass that on to the next step of the compilation proccess and it will now know what each of those words/symbols means.
+These values are passed to the next step of the compilation process, and the analyzer will understand them.

Scanner, also know as Lexical analyzer or Lexer is a program which performs lexical analysis.
-It converts a sequence of characters into string of characters with a collective meaning following some rules.
+It converts a sequence of characters into a string with a collective meaning following some rules.
These rules contain identifier, assignment operator, number etc.
The lexical analyzer takes a source program as input, and produces a stream of tokens as output.

Source Program -----> Lexical Analyzer ---------> Token stream
|

> TODO: Let's use SVG instead of ASCII art.

-A Scanner is used within lexical analysis to match token character strings that
-are passed through it.
+The lexcial analysis uses a scanner to match strings passed into it to token characters.
+
Scanners use finite-state machines (FSM) to hold all possible combinations of tokens so they may quickly process large amounts of data.

@@ -218,23 +220,27 @@ Sequences are typically delimited in some way using characters (i.e.
Examples

> TODO: Add some examples
-
+
### What is a lexeme?
-A lexeme is a string of characters that follow a set of rules in a language, which is then categorized by a [token][#what-is-a-token].
+A lexeme is a string of characters that follow a set of rules in a language, categorized by a [token][#what-is-a-token].

### What is a token?
-A token is a single element of a programming language. Tokens could be keywords ( a word that is reserved by a program because the word has a special meaning), operators (elements in a program that are usually used to assist in testing conditions (OR, AND, =, >, etc.)), or punctuation marks.
+A token is a single element of a programming language.
+Tokens could be keywords (a word reserved by a program because the word has a special meaning), operators (elements in a program usually used to assist in testing conditions (OR, AND, =, >, etc.)), or punctuation marks.

A token is a single element of a programming language.
Tokens could be keywords, operators, or punctuation marks.
-
-A token is a string of characters that are categorized based on the types used (e.g., IDENTIFIER, NUMBER, COMMA).
+
+A token is a string of characters categorized based on the types used (e.g., IDENTIFIER, NUMBER, COMMA).
They are frequently defined by regular expressions.

-Tokens are generally formed by having a lexical analyzer read the input sent to it, identify the lexemes in the input, then categorizes them into the tokens.
+Tokens are generally formed by having a lexical analyzer read the input sent to it, identify the lexemes, and then categorize them into tokens.

#### Example
-
+
Consider this example for clarification:

Input: int x = 3;

- int is a numeric variable type.
- x is an identifier variable.
- = is an assignment operator.
-- 3 is a number value.
+- 3 is a value.
- ; is the end of a statement.
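The powerset (subset) construction described earlier in this chapter's hunks can be made concrete. Below is a minimal illustrative sketch, not part of the book's patches; the NFA encoding (a dict mapping (state, symbol) pairs to sets of successor states, with epsilon transitions omitted for brevity) is an assumption made here:

```python
from itertools import chain

def nfa_to_dfa(alphabet, delta, start, accepting):
    """Subset construction: each DFA state is a frozenset of NFA states,
    and only subsets reachable from the start state are ever created."""
    start_set = frozenset([start])
    dfa_delta = {}
    seen = {start_set}
    worklist = [start_set]
    while worklist:
        current = worklist.pop()
        for symbol in alphabet:
            # Union of all NFA successors of the states in `current`.
            successor = frozenset(chain.from_iterable(
                delta.get((s, symbol), ()) for s in current))
            dfa_delta[(current, symbol)] = successor
            if successor not in seen:
                seen.add(successor)
                worklist.append(successor)
    dfa_accepting = {S for S in seen if S & accepting}
    return dfa_delta, start_set, dfa_accepting

# Example NFA: strings over {a, b} that end in "ab".
delta = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}
dfa, start, accepting = nfa_to_dfa("ab", delta, 0, {2})

def accepts(word):
    state = start
    for ch in word:
        state = dfa[(state, ch)]
    return state in accepting
```

For this three-state NFA only three of the 2^3 possible subsets (`{0}`, `{0,1}`, `{0,2}`) are reachable, which illustrates why unreachable subset states can simply be discarded after conversion.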
diff --git a/textbook/03-parsing.md b/textbook/03-parsing.md
index fca4a68..1d6204b 100644
--- a/textbook/03-parsing.md
+++ b/textbook/03-parsing.md
@@ -326,12 +326,14 @@ What the table says is simply that for our non terminal symbol "S" we have three
 The stack sequence for our string "bababa" is as follows:

 [b,a,b,a,b,a,`$`]

-The first step for the parser is to look at the input symbol "b" and the stack-top symbol S. Since "b" is the input symbol, the parser compares that to the stack-top symbol S,
+The first step for the parser is to look at the input symbol "b" and the stack-top symbol S.
+Since "b" is the input symbol, the parser compares that to the stack-top symbol S,
 and since the rule for "b" is to replace "b" with "bS", the stack now becomes:

 [b,S,a,b,a,b,a,`$`]

-Since the input symbol "b" did not match the stack-top symbol S, the "b" is put as the stack-top symbol and not processed further in the first step. Had it been a match, we would further
+Since the input symbol "b" did not match the stack-top symbol S, the "b" is put as the stack-top symbol and not processed further in the first step.
+Had it been a match, we would further
 process the terminal symbol as defined by the production rules (for example if the first symbol was S, we could have applied any of the three rules producing a stack of [a,a,b,a,b,a,`$`]
 or [a,s,a,b,a,b,a,`$`]).
@@ -347,7 +349,8 @@ with the Non terminal stack becoming:

 [S,`$`]

-The third iteration continues on and processes the input character "a". Now since we have two production rules with "a" listed, the parser has a choice. Also, our parser
+The third iteration continues on and processes the input character "a". Now since we have two production rules with "a" listed, the parser has a choice.
+Also, our parser
 only has a lookahead of 1. We will assume the parser is lazy and takes the rules sequentially, so our production rule on the input symbol "a" will be refactored by rule 1 which is simply "a".
Again the input symbol and stack-top symbol do not match so the "a" is not removed yet but is rewritten by rule 1, and the stack-top symbol becomes "a":

[a,b,a,b,a,`$`]
@@ -362,7 +365,8 @@ and writing rule #1 to the output stream:

[3,1]

-Again our input symbol is "b" so we process as we did in the first and second iteration. For brevity's sake I will keep it shorthand:
+Again our input symbol is "b" so we process as we did in the first and second iteration.
+For brevity's sake I will keep it shorthand:

current stack:

[b,a,b,a,`$`]
@@ -393,12 +397,15 @@ After the Non-terminal stack is resituated, we re-evaluate, and because the Non-

[b,a,`$`]
[3,1,3,1]

-As you can see where this is going, I'll sum up the next two.
+As you can see where this is going, I'll sum up the next two.
+

[b,a,`$`] => [3,1,3,1,3]
[a,`$`] => [3,1,3,1,3,1]

-Once our parser reaches the special terminator character, it knows it has done it's job and is done.
-It's important to note that had we instead chosen rule #2 to replace A, it would have produced the same output. In fact, it would be a good excercise to prove this result
+Once our parser reaches the special terminator character, it knows it has done its job and is done.
+
+It's important to note that had we instead chosen rule #2 to replace A, it would have produced the same output.
+In fact, it would be a good exercise to prove this result
 yourself.

Exercises

1. Given the same grammar and production rules, what would be the output stream produced by an LL(1) parser for the string "aabaa"?

diff --git a/textbook/05-semantic-analysis.md b/textbook/05-semantic-analysis.md
index e3d767f..0ca93ad 100644
--- a/textbook/05-semantic-analysis.md
+++ b/textbook/05-semantic-analysis.md
@@ -137,4 +137,5 @@ In C, a direct function call is statically bound (bound before the program is ru
 In C++, a virtual method call is dynamically bound (bound during runtime).
Since generally speaking a specific type of a polymorphic object is not known before runtime, the executed function must be dynamically bound.
- 
\ No newline at end of file
+ 
\ No newline at end of file

From 19a1f23e8ff4edb234e5c8dd8b829c4f902dc993 Mon Sep 17 00:00:00 2001
From: Philip On
Date: Sun, 5 Aug 2012 21:16:57 -0400
Subject: [PATCH 3/4] Properly commented the covered topics so check.sh uncovered works and displays the correct topics not covered

---
 textbook/02-lexical-analysis.md | 58 ++++++++++++++++++++-------------
 1 file changed, 35 insertions(+), 23 deletions(-)

diff --git a/textbook/02-lexical-analysis.md b/textbook/02-lexical-analysis.md
index 182954a..d01041d 100644
--- a/textbook/02-lexical-analysis.md
+++ b/textbook/02-lexical-analysis.md
@@ -71,7 +71,9 @@ abc, abbc, abbbc, etc.

 ### Why are all finite languages regular?
 > TODO: prove this
-
+

### What is a regular grammar?
A regular grammar is a [formal grammar](#what-is-a-grammar) limited to productions of the following forms:
@@ -103,9 +105,11 @@ If `a` and `b` are regular expressions, then the following are also considered s

Match `a` or `b`.

- `a*`. Kleene closure.
Matches `a` zero or more times.
-
+
### What is a finite state machine?
A finite state machine, also known as an automaton, can only be in a finite number of states, between which it transitions.

Consider, for example, an automaton that sees an input symbol.
It then transitions to another state based on the next input symbol.

It has:
@@ -154,7 +158,9 @@ All states of the NFA will still exist, but they will be unreachable from the or
 A converted NFA has at most 2^N states, where N is the number of states before conversion.

### What is the derivative of a regular expression?
-
+

### What is a scanner (lexical analyzer)?
> TODO: Merge these definitions.
Some of these definitions are misconceptions, which we should include to address why they're wrong. @@ -194,10 +200,8 @@ VARIABLE_TYPE NAME ASSIGNMENT_OPERATOR NAME OPEN_PARENTHESIS NUMBER DIVIDER NUMB These values are passed to the next step of the compilation process, and the analyzer will understand them. -Scanner, also know as Lexical analyzer or Lexer is a program which performs lexical analysis. -It converts a sequence of characters into a string with a collective meaning following some rules. -These rules contain identifier, assignment operator, number etc. -The lexical analyzer takes a source program as input, and produces a stream of tokens as output. + + Source Program -----> Lexical Analyzer ---------> Token stream | @@ -207,10 +211,9 @@ Source Program -----> Lexical Analyzer ---------> Token stream > TODO: Let's use SVG instead of ASCII art. -The lexcial analysis uses a scanner to match strings passed into it to token characters. +The lexical analysis uses a scanner to match strings passed into it to token characters. -Scanners use finite-state machines (FSM) to hold all possible combinations of tokens -so they may quickly process large amounts of data. +Scanners use finite-state machines (FSM) to hold all possible combinations of tokens so they may quickly process large amounts of data. A program or function which can parse a sequence of characters into usable tokens. Sequences are typically delimited in some way using characters (i.e. @@ -220,18 +223,18 @@ Sequences are typically delimited in some way using characters (i.e. Examples > TODO: Add some examples - ### What is a lexeme? A lexeme is a string of characters that follow a set of rules in a language, categorized by a [token][#what-is-a-token]. ### What is a token? - A token is a single element of a programming language. 
Tokens could be keywords (a word reserved by a program because the word has a special meaning), operators (elements in a program usually used to assist in testing conditions (OR, AND, =, >, etc.)), or punctuation marks.

-A token is a single element of a programming language.
-Tokens could be keywords, operators, or punctuation marks.
-
A token is a string of characters categorized based on the types used (e.g., IDENTIFIER, NUMBER, COMMA).
They are frequently defined by regular expressions.

Tokens are generally formed by having a lexical analyzer read the input sent to it, identify the lexemes, and then categorize them into tokens.

#### Example

-
+
+
Consider this example for clarification:

Input: int x = 3;

-- int is a numeric variable type.
-- x is an identifier variable.
-- = is an assignment operator.
-- 3 is a value.
-- ; is the end of a statement.
+- int is a numeric variable type token.
+- x is an identifier variable token.
+- = is an assignment operator token.
+- 3 is a value token.
+- ; is the end of a statement token.

From f4d1d8ca7ba7ea4c9af8b2c68fdcdb5a3e9c1cfa Mon Sep 17 00:00:00 2001
From: Philip On
Date: Sun, 5 Aug 2012 21:21:45 -0400
Subject: [PATCH 4/4] Fixed more passive voices

---
 textbook/02-lexical-analysis.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/textbook/02-lexical-analysis.md b/textbook/02-lexical-analysis.md
index d01041d..e5c5d2e 100644
--- a/textbook/02-lexical-analysis.md
+++ b/textbook/02-lexical-analysis.md
@@ -184,7 +184,7 @@ This helps with the compilation process and error checking later on.

 int a = sum(7,3)

-The rules are defined as follows:
+Rules:

 VARIABLE_TYPE = int | float | double | char
 ASSIGNMENT_OPERATOR = =
@@ -198,7 +198,7 @@ These rules simplify understanding the code sample below:

 VARIABLE_TYPE NAME ASSIGNMENT_OPERATOR NAME OPEN_PARENTHESIS NUMBER DIVIDER NUMBER CLOSE_PARENTHESIS

-These values are passed to the next step of the compilation process, and the analyzer will understand them.
+The analyzer passes these values to the next step of the compilation process.
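The token rules that these hunks refine can be sketched as a minimal rule-driven scanner. This is an illustrative sketch, not part of the book's patches: the rule names follow the example above (with the parenthesis tokens spelled consistently), and Python's `re` module stands in for a hand-built finite-state machine:

```python
import re

# Token rules from the example; order matters, so the keyword rule
# is tried before the catch-all NAME rule.
TOKEN_RULES = [
    ("VARIABLE_TYPE",       r"\b(?:int|float|double|char)\b"),
    ("NUMBER",              r"\d+"),
    ("ASSIGNMENT_OPERATOR", r"="),
    ("OPEN_PARENTHESIS",    r"\("),
    ("CLOSE_PARENTHESIS",   r"\)"),
    ("DIVIDER",             r","),
    ("NAME",                r"[A-Za-z_]\w*"),  # any identifier that remains
    ("SKIP",                r"\s+"),           # whitespace is discarded
]

def scan(source):
    """Split source text into (token, lexeme) pairs."""
    pattern = re.compile("|".join(
        "(?P<%s>%s)" % (name, rule) for name, rule in TOKEN_RULES))
    return [(m.lastgroup, m.group())
            for m in pattern.finditer(source) if m.lastgroup != "SKIP"]

print([token for token, lexeme in scan("int a = sum(7,3)")])
# ['VARIABLE_TYPE', 'NAME', 'ASSIGNMENT_OPERATOR', 'NAME',
#  'OPEN_PARENTHESIS', 'NUMBER', 'DIVIDER', 'NUMBER', 'CLOSE_PARENTHESIS']
```

Scanning `int a = sum(7,3)` reproduces exactly the token sequence given in the example, which is then handed to the next phase of compilation.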