{"id":8466,"date":"2017-10-02T14:31:12","date_gmt":"2017-10-02T21:31:12","guid":{"rendered":"https:\/\/blogs.msdn.microsoft.com\/vbteam\/?p=8466"},"modified":"2024-07-05T12:36:14","modified_gmt":"2024-07-05T19:36:14","slug":"roslyn-primer-part-i-anatomy-of-a-compiler","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/vbteam\/roslyn-primer-part-i-anatomy-of-a-compiler\/","title":{"rendered":"Roslyn Primer &#8211; Part I: Anatomy of a Compiler"},"content":{"rendered":"<p>So, you\u2019ve heard that VB (and C#) are open source now and you want to dive in and contribute. If you haven\u2019t spent your life building compilers, you probably don\u2019t know where to start. No worries, I\u2019ll walk you through it. This post is the first of a series of blog posts focused on the Roslyn codebase. They\u2019re intended as a primer for prototyping language features proposed on <a href=\"https:\/\/github.com\/dotnet\/vblang\">the VB Language Design repo<\/a> and for contributing compiler and IDE features and bug fixes to <a href=\"https:\/\/github.com\/dotnet\/roslyn\">the Roslyn repo<\/a>, both on GitHub. Despite the topic, these posts are written from the perspective of someone who\u2019s never taken a course in compilers (I haven\u2019t).<\/p>\n<h2>Phases of compilation<\/h2>\n<p>At a high level, here\u2019s what happens:<\/p>\n<ol>\n<li>Scanning (also called lexing)<\/li>\n<li>Parsing<\/li>\n<li>Semantic Analysis (also called Binding)<\/li>\n<li>Lowering<\/li>\n<li>Emit<\/li>\n<\/ol>\n<p>Some phases overlap and infringe on others a bit but that\u2019s basically what the compiler is doing.<\/p>\n<h2>Compiling is a lot like reading<\/h2>\n<p>By analogy, when you read this blog post you look at a series of characters. You decide that some runs of letters form words, some characters are punctuation, and some are whitespace. That\u2019s what the scanner does. Then you decide that some punctuation groups things into a parenthetical, or a quotation, or terminates a sentence. 
Some dots are decimal points in numbers or abbreviations or initialisms. That\u2019s what the parser does. Then you import your massive vocabulary of what words mean and look at all the words and decide what those words refer to and in combination what the sentences mean. Occasionally, you find a word with multiple meanings (overloaded terms) and you look at some amount of context to decide which of the multiple meanings is intended (like overload resolution). All of that assignment of meaning is semantic analysis. Lowering and emit don\u2019t really have natural language equivalents other than perhaps translating from one language to another (think of it like translating an article from modern English to simplified English to another very primitive language).<\/p>\n<h2>But you\u2019re way smarter than a compiler<\/h2>\n<p>Of course, you don\u2019t do all of this one phase at a time. You don\u2019t read a sentence in three passes because you can usually pick out words and sentences and their meaning all at once. But the compiler isn\u2019t as smart as a human, so it does these things in phases to keep the problems simple. Every now and then, I get a bug report where someone says, \u201cthe compiler decided I meant <em>that<\/em>\u00a0but obviously I meant this other thing because <em>that<\/em>\u00a0doesn\u2019t make any sense\u201d. The compiler doesn\u2019t know something doesn\u2019t make sense until phase 3. And once it knows that, it can\u2019t go back to phase 1 or 2 to correct itself (unlike you and me).<\/p>\n<h2>\u201cCompiling\u201d HelloWorld<\/h2>\n<p>Let\u2019s go back to programming languages and look at what the compiler does to compile a simple program. 
The simple program consists of just the statement <strong>Call Console.WriteLine(\u201cHello, World!\u201d)<\/strong>.<\/p>\n<h2>Scanning<\/h2>\n<p><strong>The Scanner<\/strong> runs over all the text in the files and breaks down everything into tokens:<\/p>\n<ul>\n<li>Keyword \u2013 Call<\/li>\n<li>Identifier \u2013 Console<\/li>\n<li>Dot<\/li>\n<li>Identifier \u2013 WriteLine<\/li>\n<li>Left Parenthesis<\/li>\n<li>String \u2013 \u201cHello, World!\u201d<\/li>\n<li>Right Parenthesis<\/li>\n<\/ul>\n<p>These tokens are just like words and punctuation in natural languages. Whitespace isn\u2019t usually important since it just separates tokens. But in VB, some whitespace, like a newline, is significant and is interpreted as an \u201cEndOfStatement\u201d token.<\/p>\n<h2>Parsing<\/h2>\n<p><strong>The Parser<\/strong> then looks at the list of tokens and sees how those tokens go together:<\/p>\n<ul>\n<li>Parse a statement.<\/li>\n<li>Look at the first token. Found a Call keyword. That starts a Call statement. Parse a Call statement.<\/li>\n<li>A Call statement starts with the Call keyword and then an expression. Parse an expression.<\/li>\n<li>Look at the next token. Found an identifier \u201cConsole\u201d. That\u2019s a name expression.<\/li>\n<li>This might be part of a bigger expression. Look for things that could go after an identifier to make an even bigger expression.<\/li>\n<li>Found a dot. An identifier followed by a dot is the beginning of a member access expression. Look for a name. Found another identifier \u201cWriteLine\u201d. This is a member access that says \u201cConsole.WriteLine\u201d.<\/li>\n<li>Still could be part of a bigger expression (maybe there are more dots after this?). Look for another continuing token.<\/li>\n<li>Found a left parenthesis. You can\u2019t just have a left parenthesis after an expression \u2013 this must be an invocation expression.<\/li>\n<li>An invocation looks like an expression followed by an argument list. 
An argument list is a list of expressions (it\u2019s more complicated than this but ignore that) separated by commas. Parse expressions and commas until you hit a right parenthesis.<\/li>\n<li>Found a string literal expression. The argument list has one argument.<\/li>\n<\/ul>\n<p>The parse produces a tree that looks like this:<\/p>\n<ul>\n<li>CallStatement\n<ul>\n<li><strong>CallKeyword<\/strong><\/li>\n<li>InvocationExpression\n<ul>\n<li>MemberAccessExpression\n<ul>\n<li>IdentifierName\n<ul>\n<li><strong>IdentifierToken<\/strong><\/li>\n<\/ul>\n<\/li>\n<li><strong>DotToken<\/strong><\/li>\n<li>IdentifierName\n<ul>\n<li><strong>IdentifierToken<\/strong><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li>ArgumentList\n<ul>\n<li><strong>OpenParenthesisToken<\/strong><\/li>\n<li>SimpleArgument\n<ul>\n<li>StringLiteralExpression\n<ul>\n<li><strong>StringToken<\/strong><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li><strong>CloseParenthesisToken<\/strong><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li><strong>EndOfFileToken<\/strong><\/li>\n<\/ul>\n<p>That\u2019s a source file!<\/p>\n<h2>Semantic Analysis<\/h2>\n<p>To be clear, the compiler still has no idea (unlike you and me) that Console.WriteLine is a shared method on the Console class in the System namespace and that it has an overload that takes one string parameter and returns nothing. After all, anyone could make a class called Console. Maybe there isn\u2019t a method called WriteLine. Maybe WriteLine is a type. That\u2019s a dumb name for a type but the compiler doesn\u2019t know that. If it is a type, then the program doesn\u2019t make any sense. Piecing all of that together is semantic analysis.<\/p>\n<p><strong>The Binder<\/strong> looks at the references provided to the compiler: the namespaces, types, type members in those references, the project-level imports, and the imports in your source file. 
And then it starts figuring out what\u2019s what.<\/p>\n<ul>\n<li>What does Console mean?\n<ul>\n<li>Is there something called Console in scope?\n<ul>\n<li>Checking the containing block: No.<\/li>\n<li>Checking the containing method: No.<\/li>\n<li>Checking the containing type: No.<\/li>\n<li>Checking the containing type\u2019s base types: No.<\/li>\n<li>Checking the containing type\u2019s containing type or namespace: No.<\/li>\n<li>Checking the containing namespace\u2019s containing namespaces: No.<\/li>\n<li>Are there import statements? No.<\/li>\n<li>Are there project-level imports? Yes.\n<ul>\n<li>Check each namespace imported one by one.<\/li>\n<li>Found one and only one? Yes.<\/li>\n<\/ul>\n<\/li>\n<li>Console is a type. This must be a shared member.<\/li>\n<li>Look for shared member named WriteLine in [mscorlib]System.Console type.<\/li>\n<li>Found 19 of them. They\u2019re all methods.<\/li>\n<li>Bind all the argument expressions.\n<ul>\n<li>One argument is a string literal. String literal has content \u201cHello, World!\u201d and type of [mscorlib]System.String.<\/li>\n<\/ul>\n<\/li>\n<li>Based on the number and types of the arguments, it checks how many of the 19 methods could take one string argument. In VB, the answer is 14. But there are rules that decide which ones are better, and it turns out that the one that actually takes a string is better than the one that takes Object, the one that takes a string followed by an empty ParamArray argument list, and the ones that require an implicit narrowing conversion to one of the numeric types or Boolean, or the intrinsic conversion from String to Char or Char array.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p>The compiler has determined that the program is an invocation of the shared void [mscorlib]System.Console::WriteLine(string) method. 
The call passes the string literal \u201cHello, World!\u201d.<\/p>\n<h2>Lowering<\/h2>\n<p>Lowering takes high-level language constructs that only exist in VB and translates them to lower-level constructs that the CLR\/JIT compiler understands. Here are some examples of things that don&#8217;t exist at the Intermediate Language (IL) level:<\/p>\n<ul>\n<li>Loops: IL only has goto (called &#8220;br&#8221; for branching) and conditional goto (brtrue for branch when true and brfalse for branch when false).<\/li>\n<li>Variable scope: All variables are &#8220;in scope&#8221; for the entire method.<\/li>\n<li>Using blocks: IL only has try\/catch\/finally so the compiler <em>lowers<\/em> a using block into a try\/finally block that initializes a variable and disposes of it in the finally block.<\/li>\n<li>Lambda expressions: The compiler first translates lambdas into ordinary methods. If they capture any local variables, the compiler has to translate those variables into fields of an object behind the scenes.<\/li>\n<li>Iterator methods: The compiler translates an Iterator method with Yield statements inside to a giant state machine, which is essentially just a giant Select Case that says, &#8220;last time you called me I was at step 1 so skip to step 2 this time&#8221;.<\/li>\n<\/ul>\n<p>Even though IL has a much simpler set of instructions than a higher-level language like VB, everything you can write in a VB program is ultimately composed of simple instructions, in the same way that the greatest works of English literature still use just 26 letters. All of the simplicity, safety, and expressiveness of a higher-level language is what makes VB so powerful.<\/p>\n<p>This example of a simple call to a Shared method isn&#8217;t very complex. IL already understands method calls and string literals so there isn&#8217;t really any lowering to be done.<\/p>\n<h2>Emit<\/h2>\n<p>Emit is simple. 
Once the compiler digests your program into simple operations the CLR understands, it writes out these operations (usually to disk) into a binary file in a well-specified format.<\/p>\n<h2>Wrapping up<\/h2>\n<p>In this post, we looked at what a compiler does abstractly and how that process compares to how a human being might read a page of text. In the next post, we&#8217;ll dive into how the Visual Basic compiler specifically is organized.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>So, you\u2019ve heard that VB (and C#) are open source now and you want to dive in and contribute. If you haven\u2019t spent your life building compilers, you probably don\u2019t know where to start. No worries, I\u2019ll walk you through it. This post is the first of a series of blog posts focused on the [&hellip;]<\/p>\n","protected":false},"author":258,"featured_media":8818,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[192,8,195],"tags":[130],"class_list":["post-8466","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-featured","category-misc","category-visual-basic","tag-roslyn"],"acf":[],"blog_post_summary":"<p>So, you\u2019ve heard that VB (and C#) are open source now and you want to dive in and contribute. If you haven\u2019t spent your life building compilers, you probably don\u2019t know where to start. No worries, I\u2019ll walk you through it. 
This post is the first of a series of blog posts focused on the [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/vbteam\/wp-json\/wp\/v2\/posts\/8466","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/vbteam\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/vbteam\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/vbteam\/wp-json\/wp\/v2\/users\/258"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/vbteam\/wp-json\/wp\/v2\/comments?post=8466"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/vbteam\/wp-json\/wp\/v2\/posts\/8466\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/vbteam\/wp-json\/wp\/v2\/media\/8818"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/vbteam\/wp-json\/wp\/v2\/media?parent=8466"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/vbteam\/wp-json\/wp\/v2\/categories?post=8466"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/vbteam\/wp-json\/wp\/v2\/tags?post=8466"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}