{"id":2177,"date":"2015-07-21T16:34:28","date_gmt":"2015-07-21T16:34:28","guid":{"rendered":"https:\/\/www.microsoft.com\/reallifecode\/index.php\/2015\/07\/21\/recursive-descent-formula-parsing-in-c-in-net\/"},"modified":"2020-03-15T16:01:37","modified_gmt":"2020-03-15T23:01:37","slug":"recursive-descent-formula-parsing-in-c-in-net","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/ise\/recursive-descent-formula-parsing-in-c-in-net\/","title":{"rendered":"Recursive Descent Formula Parsing in C# in .NET"},"content":{"rendered":"<p>Intrinsic to any script-based application is the notion of parsing. In CloudSheet, described in detail in the forthcoming TED Case Study entitled \u201cMassively Scalable Graph-Based Computation in Azure,\u201d a formula parser tokenizes arbitrarily complex Excel-like formulas making subsequent evaluation faster and more efficient. This paper will cover the parsing component.<a href=\"#_ftn1\">1<\/a><\/p>\n<h2 id=\"overview-of-the-solution\">Overview of the Solution<\/h2>\n<p>The objective of the parser is to tokenize arbitrarily complex formulas of differing types to enable rapid and accurate evaluation. Example formulas include:<\/p>\n<div class=\"highlighter-rouge\">\n<pre class=\"highlight\"><code>=1+3\r\n=sum(a1:a3)\r\n=sum(a1:a3,5)+pi()\/2\r\n=pmt(a1\/12,(a2*36),b5)\r\n=stock(ibm)*a5+stock(msft)*a6\r\n=concatenate(\u201ctest\u201d,b12)\r\n=data(sheetcsvssample.op)\r\n=select(a5,3,2,5)\r\n<\/code><\/pre>\n<\/div>\n<p>And so on. As can be seen, formulas can return either numeric or string-based results, concatenation being an example of the latter. The current implementation supports around 47 numeric functions and 10 string functions (a subset of current Excel functionality), although more are being added as needed.<\/p>\n<p>The parser, which is contained within the \u201cCell\u201d grain (actor) in CloudSheet, is passed a text string from the client (either typed in by a user or loaded from a file). After tokenizing, the tokens are passed to the evaluator, which calculates a response. As the model is recalculated, the evaluator is called to update its result based on potentially new data. The parser is only invoked when the input formula is changed; the evaluator is called on every recalculation where necessary (i.e., in which a predecessor value is updated).<\/p>\n<h2 id=\"implementation\">Implementation<\/h2>\n<p>The implementation chosen is a recursive-descent parser.<a href=\"#_ftn2\">2<\/a> The parsing method is invoked with two arguments:<\/p>\n<div class=\"highlighter-rouge\">\n<pre class=\"highlight\"><code>private void Parse(List&lt;Formentry&gt; numStack, List&lt;Formentry&gt; opStack)\r\n{\r\n    while (this.index &lt; this.txt.Length)\r\n    {\r\n        Formentry tok = GetToken();\r\n  ...\r\n }\r\n<\/code><\/pre>\n<\/div>\n<p>The arguments are the number stack and the operator stack. The former holds formula arguments, the latter holds formula operators, such as arithmetic operators or functions.<\/p>\n<p>The act of tokenizing creates a data structure called a \u201cFormentry\u201d (formula entry) which describes the token. The various types of tokens are enumerated below:<\/p>\n<div class=\"highlighter-rouge\">\n<pre class=\"highlight\"><code>public enum TokenTypes\r\n{\r\n    UNDEFINED=-1,\r\n    NUMBER = 0,\r\n    CELLREF = 1,\r\n    OPERATOR = 2,\r\n    NAME = 3,\r\n    RANGE = 4,\r\n    SUBSTRING = 5,\r\n    FUNCTION = 6,\r\n    UNARY = 7,\r\n    PRECEDENCE = 8,\r\n    DATE=9,\r\n    ARGSEP=10,\r\n    LPAREN=11,\r\n    RPAREN=12,\r\n    FUNC = 13,\r\n    COMPARISON = 14,\r\n    RESOLVEDRANGE=15,\r\n    STRINGFUNCTION=16,\r\n    STARTPARSE=98,\r\n    ENDPARSE = 99\r\n}\r\n<\/code><\/pre>\n<\/div>\n<p>Note that a Formentry can describe either arguments or operators. Some descriptions follow:<\/p>\n<ul>\n<li>CELLREF indicates that the token refers to another cell. Cells referred to in this way are predecessors to the current cell; the current cell is thus a successor to the referred-to cell. As predecessors change, the recalculation method in this cell is invoked to get the most recent value of the referred-to cell(s).<\/li>\n<li>ARGSEP indicates a separator between arguments in a formula \u2013 the comma in <code class=\"highlighter-rouge\">=sum(3,4)<\/code> for example.<\/li>\n<li>LPAREN and RPAREN indicate parentheses and thus subexpressions (and thus recursion, as will be described below)<\/li>\n<li>NUMBERs are always maintained internally in double-precision format even if they are input as integers, in order to have a common format for operations, to avoid casts. The client is responsible for presentation formatting.<\/li>\n<li>A FUNCTION is a numeric function returning a double-precision value; a STRINGFUNCTION returns a string.<\/li>\n<\/ul>\n<p>The GetToken() method is the lexer. It converts a text element into a token, represented by a Formentry. Here is an excerpt:<\/p>\n<div class=\"highlighter-rouge\">\n<pre class=\"highlight\"><code>\/* unary operators have the highest precedence *\/\r\nif ((unary = ScanForUnary()) != -1)\r\n{\r\n    f.tokentype = TokenTypes.UNARY;\r\n    f.value = unary;\r\n    this.index++;\r\n    return f;\r\n}\r\n\/* logical comparison operators (=, &lt;, &gt;, etc) *\/\r\nif ((comp = ScanForComparison()) != -1)\r\n{\r\n    f.tokentype = TokenTypes.COMPARISON;\r\n    f.value = comp;\r\n    this.index++;\r\n    return f;\r\n}\r\n<\/code><\/pre>\n<\/div>\n<p>The lexer calls various worker subroutines (here it checks for unary operators and comparison operators) sequentially to determine the type of token, and returns it to the main parsing routine. This makes it trivial to add new types of lexemes, or tokens. (The _index global variable points to the current location in the input string. The lexer is responsible for ensuring that it always points to the next character after the detected lexeme.)<\/p>\n<p>A parenthesis indicates a subexpression, and thus needs recursion. When this circumstance is detected, new instances of the number stack and operator are created and the parser invokes itself against the subexpression. For example, when a function is detected, a new level of precedence (recursion) is created:<\/p>\n<div class=\"highlighter-rouge\">\n<pre class=\"highlight\"><code>if (tok.tokentype == TokenTypes.FUNCTION)\r\n{\r\n    this.expectingvalue = true;\r\n    Formentry f = Push(numStack, opStack);\r\n    f.tokentype = TokenTypes.FUNCTION;\r\n    f.func = (int)tok.value;\r\n    Parse(f.newnumstack, f.newopstack);\r\n    this.expectingvalue = false;\r\n    continue;\r\n}\r\n<\/code><\/pre>\n<\/div>\n<p>Note the <code class=\"highlighter-rouge\">push()<\/code> function which:<\/p>\n<ul>\n<li>Creates a special PRECEDENCE Formentry<\/li>\n<li>Creates new instances of the number stack and operator stack<\/li>\n<li>Increases the stack depth<\/li>\n<li>Creates a STARTPARSE Formentry on the new stack<\/li>\n<li>Creates and returns the first token Formentry on the new stack<\/li>\n<\/ul>\n<p>A corresponding <code class=\"highlighter-rouge\">pop()<\/code> function also exists.<\/p>\n<p>At the end of parsing, a set of data structures holding the tokens and their relationships to one another is created and presented to the evaluator. Here is an example of the result of a simple parsing session:<\/p>\n<p> <img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2015\/07\/image001-2.png\" alt=\"Image image001\" width=\"1009\" height=\"321\" class=\"aligncenter size-full wp-image-11224\" srcset=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2015\/07\/image001-2.png 1009w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2015\/07\/image001-2-300x95.png 300w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2015\/07\/image001-2-768x244.png 768w\" sizes=\"(max-width: 1009px) 100vw, 1009px\" \/><\/p>\n<p>Each box indicates a formentry. The STARTPARSE and ENDPARSE Formentries are convenience structures which optimize the evaluation process. In this case, there is obviously no recursion.<a href=\"#_ftn3\">3<\/a><\/p>\n<p>A more complex example is depicted below:<\/p>\n<p> <img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2015\/07\/image002.png\" alt=\"Image image002\" width=\"998\" height=\"558\" class=\"aligncenter size-full wp-image-11225\" srcset=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2015\/07\/image002.png 998w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2015\/07\/image002-300x168.png 300w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2015\/07\/image002-768x429.png 768w\" sizes=\"(max-width: 998px) 100vw, 998px\" \/><\/p>\n<p>Here, multiple levels of recursion are shown. Each instance of a PRECEDENCE Formentry indicates a subexpression. At evaluation time, a depth-first traversal of the tree occurs such that as evaluation progresses, the deepest subexpression is evaluated (here: A1\/12), then the next-deepest, and so on. The parser supports an arbitrary level of depth and complexity.<\/p>\n<h2 id=\"challenges\">Challenges<\/h2>\n<p>Of course, what is difficult in any general purpose parser with a relatively open syntax is accounting for every possible case. A formula like <code class=\"highlighter-rouge\">=pmt((a1\/12),(a2<em>30),a3)<\/code> is quite easy to parse as the precedence is made explicit by the existence of parentheses, while an equally valid version (far more popular with users because of its intuitiveness) exists: <code class=\"highlighter-rouge\">=pmt(a1\/12, a2<\/em>30, a3)<\/code>. Here the ARGSEP (comma) signals that backtracing must be used to reparse the now-delimited expression.<\/p>\n<p>In addition, with all the recursion, keeping track of all the levels of precedence can be challenging. A variable <code class=\"highlighter-rouge\">stackdepth<\/code> has been of immense help in debugging issues here. If at the end of a parse <code class=\"highlighter-rouge\">stackdepth<\/code> is not zero, it is likely that due to a recursion problem.<\/p>\n<h1 id=\"opportunities-for-reuse\">Opportunities for Reuse<\/h2>\n<p>The parser can be reused in other contexts requiring parsing and evaluation (which was not covered here) of formulas using Excel semantics and syntax.<\/p>\n<p>A command-line version of the parser (useful for debugging) is kept at <a href=\"https:\/\/github.com\/barrybriggs\/Parser\">https:\/\/github.com\/barrybriggs\/Parser<\/a> .<\/p>\n<hr \/>\n<p><a href=\"#_ftn1\">1<\/a> <a name=\"_ftn1\"><\/a>Leaving the evaluator as \u201can exercise to the reader.\u201d There, I\u2019ve always wanted to say that in a technical document.<\/p>\n<p><a href=\"#_ftn2\">2<\/a> <a name=\"_ftn2\"><\/a>The curious reader is directed to the classic work on compilers: Alfred Aho, Ravi Sethi, and Jeffrey D. Ullman, <em>Compilers: Principles, Techniques and Tools;<\/em> Addison-Wesley. I have the 1988 edition. A slightly simpler explanation can be found here: <a href=\"http:\/\/www.engr.mun.ca\/~theo\/Misc\/exp_parsing.htm\">http:\/\/www.engr.mun.ca\/~theo\/Misc\/exp_parsing.htm<\/a><a name=\"_ftn2\"><\/a>.<\/p>\n<p><a href=\"#_ftn3\">3<\/a> <a name=\"_ftn3\"><\/a>I trust readers will forgive the crudity of these diagrams. Perhaps we need a sort of \u201cFeynmann diagram\u201d for recursive data structures.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Creating a parser to tokenize arbitrarily complex Excel-like formulas of differing types to enable rapid and accurate evaluation. <\/p>\n","protected":false},"author":21345,"featured_media":11223,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[11],"tags":[60,121],"class_list":["post-2177","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-big-data","tag-azure","tag-cloudsheet"],"acf":[],"blog_post_summary":"<p>Creating a parser to tokenize arbitrarily complex Excel-like formulas of differing types to enable rapid and accurate evaluation. <\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/2177","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/users\/21345"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/comments?post=2177"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/2177\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media\/11223"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media?parent=2177"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/categories?post=2177"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/tags?post=2177"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}