{"id":31653,"date":"2021-01-27T12:35:09","date_gmt":"2021-01-27T19:35:09","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/dotnet\/?p=31653"},"modified":"2021-01-27T12:35:09","modified_gmt":"2021-01-27T19:35:09","slug":"using-c-source-generators-to-create-an-external-dsl","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/dotnet\/using-c-source-generators-to-create-an-external-dsl\/","title":{"rendered":"Using C# Source Generators to create an external DSL"},"content":{"rendered":"<p>This post looks at how to use C# Source Generators to build an <a href=\"https:\/\/en.wikipedia.org\/wiki\/Domain-specific_language\">external DSL<\/a> to represent mathematical expressions.<\/p>\n<p>The code for this post is on the <a href=\"https:\/\/github.com\/dotnet\/roslyn-sdk\/blob\/master\/samples\/CSharp\/SourceGenerators\/SourceGeneratorSamples\/MathsGenerator.cs\">roslyn-sdk repository<\/a>.<\/p>\n<h2>A recap of C# Source Generators<\/h2>\n<p>There are two other articles describing C# Source Generators on this blog, <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/introducing-c-source-generators\/\">Introducing C# Source Generators<\/a> and <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/new-c-source-generator-samples\/\">New C# Source Generator Samples<\/a>. If you&#8217;re new to generators, you might want to read them first.<\/p>\n<p>Let&#8217;s just remind ourselves of what they are. You can think of a Source Generator as a function that runs at compile time. 
It takes some inputs and produces C# code.<\/p>\n<pre><code class=\"dotnetcli\">Program Parse Tree -&gt; Additional Files -&gt; File Specific Options -&gt; C# Code\r\n<\/code><\/pre>\n<p>This conceptual view is implemented in the <code>ISourceGenerator<\/code> interface.<\/p>\n<pre><code class=\"csharp\">    public interface ISourceGenerator {\r\n        void Execute(GeneratorExecutionContext context);\r\n        void Initialize(GeneratorInitializationContext context);\r\n    }\r\n<\/code><\/pre>\n<p>You implement the <code>Execute<\/code> method and get the inputs through the <code>context<\/code> object. The <code>Initialize<\/code> method is used less often.<\/p>\n<p>The <code>context<\/code> parameter to <code>Execute<\/code> contains the inputs.<\/p>\n<ul>\n<li><code>context.Compilation<\/code> is the parse tree for the program and everything else needed by the compiler (settings, references, etc.).<\/li>\n<li><code>context.AdditionalFiles<\/code> gives you the additional files in the project.<\/li>\n<li><code>context.AnalyzerConfigOptions.GetOptions<\/code> provides the options for each additional file.<\/li>\n<\/ul>\n<p>Additional files are added to the project file using the syntax below. Note the file-specific options, which you can retrieve in your generator code.<\/p>\n<pre><code class=\"xml\">&lt;AdditionalFiles Include=\"Cars.csv\" CsvLoadType=\"OnDemand\" CacheObjects=\"true\" \/&gt;\r\n<\/code><\/pre>\n<p>You are not limited to these inputs. A C# generator is just a bit of code that runs at compile time. The code can do whatever it pleases. For example, it could download information from a website (not a good idea). But the three inputs above are the most logical ones as they are part of the project. Using them is the recommended approach.<\/p>\n<p>As a side note, another metaphor for source generators is the anthropomorphization of the compiler. Mrs. 
Compiler goes about her business of generating the parse tree and then she stops and asks you: &#8220;Do you have anything to add to what I have done so far?&#8221;<\/p>\n<h2>The scenario<\/h2>\n<p>You work for an engineering company that employs many mathematicians. The formulas that underpin the business are spread throughout the large C# codebase. The company would like to centralize them and make them easy to write and understand for their mathematicians.<\/p>\n<p>They would like the calculations to be written in pure math, but have the same performance as C# code. For example, they would like the code to end up being inlined at the point of usage. Here is an example of what they would like to write:<\/p>\n<pre><code class=\"dotnetcli\">AreaSquare(l)       = pow(l, 2)\r\nAreaRectangle(w, h) = w * h\r\nAreaCircle(r)       = pi * r * r\r\nQuadratic(a, b, c)  = {-b + sqrt(pow(b,2) - 4 * a * c)} \/ (2 * a)\r\n\r\nGoldenRatio         = 1.61803\r\nGoldHarm(n)         = GoldenRatio + 1 * \u2211(i, 1, n, 1 \/ i)\r\n\r\nD(x', x'', y', y'') = sqrt(pow([x'-x''],2) + pow([y'-y''], 2))\r\n<\/code><\/pre>\n<p>You notice several things that differentiate this language from C#:<\/p>\n<ol>\n<li>No type annotations.<\/li>\n<li>Different kinds of parentheses.<\/li>\n<li>Characters in identifiers that are invalid in C#.<\/li>\n<li>Special syntax for the summation symbol (<code>\u2211<\/code>).<\/li>\n<\/ol>\n<p>Despite the differences, the language structure is similar to C# methods and properties. 
You think you should be able to translate each line of the language to a snippet of valid C# code.<\/p>\n<p>You decide to use Source Generators for this task because they plug directly into the normal compiler workflow and because in the future the code might need to access the parse tree for the enclosing program.<\/p>\n<p>One could use Regex substitutions to go from this language to C#, but that approach is problematic for two reasons.<\/p>\n<ol>\n<li>The language structure is not completely identical to C# (e.g., you need to generate special code for <code>\u2211<\/code>).<\/li>\n<li>More importantly, you expose yourself to <a href=\"https:\/\/owasp.org\/www-community\/attacks\/Code_Injection\">code injection attacks<\/a>. A disgruntled mathematician could write code to mint bitcoins inside your language. By properly parsing the language you can whitelist the available functions.<\/li>\n<\/ol>\n<h2>Hooking up the inputs<\/h2>\n<p>Here is the implementation of the <code>Execute<\/code> method for the <code>ISourceGenerator<\/code> interface.<\/p>\n<pre><code class=\"csharp\">        public void Execute(GeneratorExecutionContext context)\r\n        {\r\n\r\n            foreach (AdditionalText file in context.AdditionalFiles)\r\n            {\r\n                if (Path.GetExtension(file.Path).Equals(\".math\", StringComparison.OrdinalIgnoreCase))\r\n                {\r\n                    if(!libraryIsAdded)\r\n                    {\r\n                        context.AddSource(\"___MathLibrary___.cs\", SourceText.From(libraryCode, Encoding.UTF8));\r\n                        libraryIsAdded = true;\r\n                    }\r\n                    \/\/ Load formulas from .math files\r\n                    var mathText = file.GetText();\r\n                    var mathString = \"\";\r\n\r\n                    if(mathText != null)\r\n                    {\r\n                        mathString = mathText.ToString();\r\n                    } else\r\n                    
{\r\n                        throw new Exception($\"Cannot load file {file.Path}\");\r\n                    }\r\n\r\n                    \/\/ Get name of generated namespace from file name\r\n                    string fileName = Path.GetFileNameWithoutExtension(file.Path);\r\n\r\n                    \/\/ Parse and gen the formulas functions\r\n                    var tokens = Lexer.Tokenize(mathString);\r\n                    var code = Parser.Parse(tokens);\r\n\r\n                    var codeFileName = $@\"{fileName}.cs\";\r\n\r\n                    context.AddSource(codeFileName, SourceText.From(code, Encoding.UTF8));\r\n                }\r\n            }\r\n        }\r\n<\/code><\/pre>\n<p>The code scans the additional files from the project file and operates on the ones with the extension <code>.math<\/code>.<\/p>\n<p>Firstly, it adds to the project a C# library file containing some utility functions. Then it gets the text for the Math file (aka the formulas), parses the language, and generates C# code for it.<\/p>\n<p>This snippet is the minimum code to hook up a new language into your C# project. You can do more here. You can inspect the parse tree or gather more options to influence the way the language is parsed and generated, but this is not necessary in this case.<\/p>\n<h2>Writing the parser<\/h2>\n<p>This section is standard compiler fare. If you are familiar with lexing, parsing, and generating code, you can jump directly to the next section. If you are curious, read on.<\/p>\n<p>We are implementing the following two lines from the code above.<\/p>\n<pre><code class=\"csharp\">var tokens = Lexer.Tokenize(mathString);\r\nvar code = Parser.Parse(tokens);\r\n<\/code><\/pre>\n<p>The goal of these lines is to take the Math language and generate the following valid C# code. 
You can then call any of the generated functions from your existing code.<\/p>\n<pre><code class=\"csharp\">using static System.Math;\r\nusing static ___MathLibrary___.Formulas; \/\/ For the ___MySum___ function\r\n\r\nnamespace Maths {\r\n\r\n    public static partial class Formulas {\r\n\r\n        public static double  AreaSquare (double  l ) =&gt; Pow ( l , 2 ) ;\r\n        public static double  AreaRectangle (double  w ,double  h ) =&gt; w * h ;\r\n        public static double  AreaCircle (double  r ) =&gt; PI * r * r ;\r\n        public static double  Quadratic (double  a ,double  b ,double  c ) =&gt; ( - b + Sqrt ( Pow ( b , 2 ) - 4 * a * c ) ) \/ ( 2 * a ) ;\r\n\r\n        public static double  GoldenRatio =&gt; 1.61803 ;\r\n        public static double  GoldHarm (double  n ) =&gt; GoldenRatio + 1 * ___MySum___ ((int) 1 ,(int) n ,i =&gt;  1 \/ i ) ;\r\n\r\n        public static double  D (double  xPrime ,double  xSecond ,double  yPrime ,double  ySecond ) =&gt; Sqrt ( Pow ( ( xPrime - xSecond ) , 2 ) + Pow ( ( yPrime - ySecond ) , 2 ) ) ;\r\n\r\n    }\r\n}\r\n<\/code><\/pre>\n<p>I only touch on the most important points of the implementation; the full code is <a href=\"https:\/\/github.com\/dotnet\/roslyn-sdk\/blob\/master\/samples\/CSharp\/SourceGenerators\/SourceGeneratorSamples\/MathsGenerator.cs\">here<\/a>.<\/p>\n<p>This is not production code. For the sake of simplicity, I had to fit it in one sample file without external dependencies. It is probably wiser to use a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Comparison_of_parser_generators\">parser generator<\/a> to future-proof the implementation and avoid errors.<\/p>\n<p>With those caveats out of the way, the lexer is Regex-based. 
It uses the following <code>Token<\/code> definition and regular expressions.<\/p>\n<pre><code class=\"csharp\">    public enum TokenType {\r\n        Number,\r\n        Identifier,\r\n        Operation,\r\n        OpenParens,\r\n        CloseParens,\r\n        Equal,\r\n        EOL,\r\n        EOF,\r\n        Spaces,\r\n        Comma,\r\n        Sum,\r\n        None\r\n    }\r\n\r\n    public struct Token {\r\n        public TokenType Type;\r\n        public string Value;\r\n        public int Line;\r\n        public int Column;\r\n    }\r\n\r\n\/\/ ... More code not shown\r\n\r\n        static (TokenType, string)[] tokenStrings = {\r\n            (TokenType.EOL,         @\"(\\r\\n|\\r|\\n)\"),\r\n            (TokenType.Spaces,      @\"\\s+\"),\r\n            (TokenType.Number,      @\"[+-]?((\\d+\\.?\\d*)|(\\.\\d+))\"),\r\n            (TokenType.Identifier,  @\"[_a-zA-Z][`'\"\"_a-zA-Z0-9]*\"),\r\n            (TokenType.Operation,   @\"[\\+\\-\/\\*]\"),\r\n            (TokenType.OpenParens,  @\"[(\\[{]\"),\r\n            (TokenType.CloseParens, @\"[)\\]}]\"),\r\n            (TokenType.Equal,       @\"=\"),\r\n            (TokenType.Comma,       @\",\"),\r\n            (TokenType.Sum,         @\"\u2211\")\r\n        };\r\n<\/code><\/pre>\n<p>The <code>Tokenize<\/code> function simply converts the source text into a sequence of tokens.<\/p>\n<pre><code class=\"csharp\">\r\n        using Tokens = System.Collections.Generic.IEnumerable&lt;MathsGenerator.Token&gt;;\r\n\r\n        static public Tokens Tokenize(string source) {\r\n<\/code><\/pre>\n<p>It is too long to show here. 
Follow the link above for the gory details.<\/p>\n<p>The parser&#8217;s grammar is described below.<\/p>\n<pre><code class=\"csharp\">    \/* EBNF for the language\r\n        lines   = {line} EOF\r\n        line    = {EOL} identifier [lround args rround] equal expr EOL {EOL}\r\n        args    = identifier {comma identifier}\r\n        expr    = [plus|minus] term { (plus|minus) term }\r\n        term    = factor { (times|divide) factor };\r\n        factor  = number | var | func | sum | matrix | lround expr rround;\r\n        var     = identifier;\r\n        func    = identifier lround expr {comma expr} rround;\r\n        sum     = \u2211 lround identifier comma expr comma expr comma expr rround;\r\n    *\/\r\n<\/code><\/pre>\n<p>It is implemented as a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Recursive_descent_parser\">recursive descent parser<\/a>.<\/p>\n<p>The <code>Parse<\/code> function, shown below, illustrates a few of the design decisions.<\/p>\n<pre><code class=\"c\">        public static string Parse(Tokens tokens) {\r\n            var globalSymbolTable   = new SymTable();\r\n            var symbolTable         = new SymTable();\r\n            var buffer              = new StringBuilder();\r\n\r\n            var en = tokens.GetEnumerator();\r\n            en.MoveNext();\r\n\r\n            buffer = Lines(new Context {\r\n                tokens = en,\r\n                globalSymbolTable = globalSymbolTable,\r\n                symbolTable = symbolTable,\r\n                buffer = buffer\r\n                });\r\n            return buffer.ToString();\r\n\r\n        }\r\n\r\n<\/code><\/pre>\n<ul>\n<li><code>globalSymbolTable<\/code> is used to store the symbols that are whitelisted and the global symbols that are generated during the parsing of the language.<\/li>\n<li><code>symbolTable<\/code> holds the parameters of a function and is cleared at the start of each new line.<\/li>\n<li><code>buffer<\/code> contains the C# code that is generated while 
parsing.<\/li>\n<li><code>Lines<\/code> is the first mutually recursive function and maps to the first line of the grammar.<\/li>\n<\/ul>\n<p>A typical example of one such recursive function is below.<\/p>\n<pre><code class=\"csharp\">        private static void Line(Context ctx) {\r\n            \/\/ line    = {EOL} identifier [lround args rround] equal expr EOL {EOL}\r\n\r\n            ctx.symbolTable.Clear();\r\n\r\n            while(Peek(ctx, TokenType.EOL))\r\n                Consume(ctx, TokenType.EOL);\r\n\r\n            ctx.buffer.Append(\"\\tpublic static double \");\r\n\r\n            AddGlobalSymbol(ctx);\r\n            Consume(ctx, TokenType.Identifier);\r\n\r\n            if(Peek(ctx, TokenType.OpenParens, \"(\")) {\r\n                Consume(ctx, TokenType.OpenParens, \"(\"); \/\/ Just round parens\r\n                Args(ctx);\r\n                Consume(ctx, TokenType.CloseParens, \")\");\r\n            }\r\n\r\n            Consume(ctx, TokenType.Equal);\r\n            Expr(ctx);\r\n            ctx.buffer.Append(\" ;\");\r\n\r\n            Consume(ctx, TokenType.EOL);\r\n\r\n            while(Peek(ctx, TokenType.EOL))\r\n                Consume(ctx, TokenType.EOL);\r\n        }\r\n\r\n<\/code><\/pre>\n<p>This shows the manipulation of both symbol tables, the utility functions that advance the token stream, the calls to the other recursive functions, and the emission of the C# code.<\/p>\n<p>Not very elegant, but it gets the job done.<\/p>\n<p>We whitelist all the functions in the <code>Math<\/code> class.<\/p>\n<pre><code class=\"csharp\">        static HashSet&lt;string&gt; validFunctions =\r\n            new HashSet&lt;string&gt;(typeof(System.Math).GetMethods().Select(m =&gt; m.Name.ToLower()));\r\n<\/code><\/pre>\n<p>For most tokens, there is a straightforward translation to C#.<\/p>\n<pre><code class=\"csharp\">        private static StringBuilder Emit(Context ctx, Token token) =&gt; token.Type switch\r\n        {\r\n            TokenType.EOL           
=&gt; ctx.buffer.Append(\"\\n\"),\r\n            TokenType.CloseParens   =&gt; ctx.buffer.Append(')'), \/\/ All parens become rounded\r\n            TokenType.OpenParens    =&gt; ctx.buffer.Append('('),\r\n            TokenType.Equal         =&gt; ctx.buffer.Append(\"=&gt;\"),\r\n            TokenType.Comma         =&gt; ctx.buffer.Append(token.Value),\r\n\r\n            \/\/ Identifiers are normalized and checked for injection attacks\r\n            TokenType.Identifier    =&gt; EmitIdentifier(ctx, token),\r\n            TokenType.Number        =&gt; ctx.buffer.Append(token.Value),\r\n            TokenType.Operation     =&gt; ctx.buffer.Append(token.Value),\r\n            TokenType.Sum           =&gt; ctx.buffer.Append(\"MySum\"),\r\n            _                       =&gt; Error(token, TokenType.None)\r\n        };\r\n<\/code><\/pre>\n<p>Identifiers, however, need special treatment: they are checked against the whitelisted symbols, and characters that are invalid in C# are replaced with valid strings.<\/p>\n<pre><code class=\"csharp\">        private static StringBuilder EmitIdentifier(Context ctx, Token token) {\r\n            var val = token.Value;\r\n\r\n            if(val == \"pi\") {\r\n                ctx.buffer.Append(\"PI\"); \/\/ Doesn't follow pattern\r\n                return ctx.buffer;\r\n            }\r\n\r\n            if(validFunctions.Contains(val)) {\r\n                ctx.buffer.Append(char.ToUpper(val[0]) + val.Substring(1));\r\n                return ctx.buffer;\r\n            }\r\n\r\n            string id = token.Value;\r\n            if(ctx.globalSymbolTable.Contains(token.Value) ||\r\n                          ctx.symbolTable.Contains(token.Value)) {\r\n                foreach (var r in replacementStrings) {\r\n                    id = id.Replace(r.Key, r.Value);\r\n                }\r\n                return ctx.buffer.Append(id);\r\n            } else {\r\n                throw new Exception($\"{token.Value} not a known identifier or function.\");\r\n            }\r\n        
}\r\n<\/code><\/pre>\n<p>There is a lot more that could be said about the parser. In the end, the implementation is not important. This one is far from perfect.<\/p>\n<h2>Practical advice<\/h2>\n<p>As you build your own Source Generators, there are a few things that make the process smoother.<\/p>\n<ul>\n<li>Write most code in a standard <code>Console<\/code> project. When you are happy with the result, copy and paste it to your source generator. This gives you a good developer experience (e.g., stepping through the code line by line) for most of your work.<\/li>\n<li>If you still have problems after copying your code to the source generator, use <code>Debugger.Launch<\/code> to launch the debugger at the start of the <code>Execute<\/code> function.<\/li>\n<li>Visual Studio currently has no ability to unload a source generator once loaded. Modifications to the generator itself will only take effect after you close and reopen your solution.<\/li>\n<\/ul>\n<p>These are teething problems that hopefully will be fixed in new releases of Visual Studio. For now, you can use the above workarounds.<\/p>\n<h2>Conclusion<\/h2>\n<p>Source generators allow you to embed external DSLs into your C# project. 
This post shows how to do this for a simple mathematical language.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This post looks at creating an external DSL to represent mathematical expressions using C<\/p>\n","protected":false},"author":34619,"featured_media":31655,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[685,196,756],"tags":[],"class_list":["post-31653","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-dotnet","category-dotnet-core","category-csharp"],"acf":[],"blog_post_summary":"<p>This post looks at creating an external DSL to represent mathematical expressions using C<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/posts\/31653","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/users\/34619"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/comments?post=31653"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/posts\/31653\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/media\/31655"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/media?parent=31653"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/categories?post=31653"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/tags?post=31653"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":tr
ue}]}}