September 29th, 2020

How did we end up parsing Savvyday 29 Oatmeal 94 as Saturday 29 October 1994?

Some time ago, we learned that the InternetTimeToSystemTime function manages to parse “Savvyday 29 Oatmeal 94” as “Saturday 29 October 1994”. How did that happen? Is it finding the date with the shortest English Levenshtein distance?

Nothing that fancy.

Warning: This article discusses implementation details, which are not contractual. The algorithm is subject to change in the future. The only thing that InternetTimeToSystemTime formally guarantees is that it can parse properly-formatted HTTP timestamps.

The parsing is very simple. The official format for HTTP date strings is

  1. DayOfWeek, Day Month Year Hour Minute Second GMT

In practice, not everybody follows the rules, so the parser accepts these three formats:

  1. DayOfWeek Day Month Year Hour Minute Second TZ
  2. DayOfWeek Month Day Hour Minute Second TZ Year
  3. DayOfWeek Month Day Hour Minute Second Year TZ

After discarding non-alphanumerics, the parser takes each word in the input string and converts it to a number somehow. If it consists of digits, then it’s parsed to a number in the usual way. If it consists of alphabetics, then it’s parsed to a number by trying to match it against the list of valid tokens:

| DayOfWeek | Month | | TZ |
|---|---|---|---|
| Sun = 0 | Jan = 1 | Jul = 7 | GMT |
| Mon = 1 | Feb = 2 | Aug = 8 | UTC |
| Tue = 2 | Mar = 3 | Sep = 9 | |
| Wed = 3 | Apr = 4 | Oct = 10 | |
| Thu = 4 | May = 5 | Nov = 11 | |
| Fri = 5 | Jun = 6 | Dec = 12 | |
| Sat = 6 | | | |

If no exact match is found, then we look for the entry that shares the most initial characters with the word being parsed. If there is a unique such entry, then the parsed value is the one given for it in the table. If no entry shares any initial characters, or the longest match is not unique, then parsing fails.¹

Since there is only one time zone permitted in HTTP time/date strings, all we have to remember is “Yup, it’s a time zone. There’s a time zone marker here.”

For example, the string “Savvyday” is not in the above table, but it does share the following prefixes:

| Length 1 | Length 2 |
|---|---|
| S(un) | Sa(t) |
| S(ep) | |

The longest match is length 2, and there’s only one such match, so the word “Savvyday” is parsed as if it were “Sat”.

Similarly, “Oatmeal” has only one match: Oct (length 1).
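The matching rule described above can be sketched in a few lines of Python. This is an illustration only, not the actual implementation: the names `TOKENS`, `shared_prefix_len`, and `match_token` are made up for this sketch, and case-insensitive comparison is an assumption that the article does not confirm.

```python
# Hypothetical sketch of the "longest unique shared prefix" rule.
# The token names come from the table in the article.
TOKENS = [
    "Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat",
    "Jan", "Feb", "Mar", "Apr", "May", "Jun",
    "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
    "GMT", "UTC",
]

def shared_prefix_len(a, b):
    """Count how many leading characters a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def match_token(word):
    """Return the matching token name, or None if parsing fails.

    An exact match needs no special case: it is simply the longest
    possible shared prefix, and the tokens are all distinct.
    Assumption: letters are compared without regard to case.
    """
    w = word.lower()
    best_len, best = 0, []
    for tok in TOKENS:
        n = shared_prefix_len(w, tok.lower())
        if n > best_len:
            best_len, best = n, [tok]    # new longest match
        elif n == best_len and n > 0:
            best.append(tok)             # tie at the current longest length
    return best[0] if len(best) == 1 else None
```

With this sketch, `match_token("Savvyday")` gives `"Sat"` and `match_token("Oatmeal")` gives `"Oct"`, while a word such as `"Ju"` fails because Jun and Jul tie at length 2.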

After everything is parsed into a number, we decide which of the three formats we are looking at.

If the second word was parsed from digits, then we are in case 1. If the seventh word was parsed from letters, then we are in case 2. Otherwise, we are in case 3.
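As a rough illustration of that decision, here is a Python sketch. The helper names `split_words` and `pick_format` are hypothetical, and two details are assumptions: that a word is a maximal run of ASCII letters and digits, and that an input with fewer than seven words falls through to case 3.

```python
import re

def split_words(text):
    # Assumption: after non-alphanumerics are discarded, each maximal
    # run of ASCII letters and digits counts as one word.
    return re.findall(r"[0-9A-Za-z]+", text)

def pick_format(words):
    """Return 1, 2, or 3 according to the rule in the article."""
    if len(words) >= 2 and words[1].isdigit():
        return 1    # second word is digits: DayOfWeek Day Month Year ...
    if len(words) >= 7 and words[6].isalpha():
        return 2    # seventh word is letters: ... Second TZ Year
    return 3        # otherwise: ... Second Year TZ
```

For example, `pick_format(split_words("Sat, 29 Oct 1994 08:49:37 GMT"))` yields 1, and so does the four-word input “Savvyday 29 Oatmeal 94”, because only the second word is consulted.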

Once we’ve decided what case we’re in, we know where the year is. If the caller provided a two-digit year, upgrade it to a four-digit year.

Finally, we copy the fields into the output structure. If a field is missing, it is taken from the current date and time.
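A minimal sketch of this last step, assuming that “missing” means the input ran out of words before the layout did. The names `LAYOUTS` and `assign_fields` are invented for illustration, and the current time is passed in as `now` so the result is reproducible. The two-digit year upgrade is deliberately left out because the article does not state the rule.

```python
from datetime import datetime

# Field order for each of the three accepted formats.
LAYOUTS = {
    1: ("weekday", "day", "month", "year", "hour", "minute", "second", "tz"),
    2: ("weekday", "month", "day", "hour", "minute", "second", "tz", "year"),
    3: ("weekday", "month", "day", "hour", "minute", "second", "year", "tz"),
}

def assign_fields(case, numbers, now):
    """Pair the parsed numbers with field names; fill gaps from `now`."""
    fields = dict(zip(LAYOUTS[case], numbers))
    fields.pop("tz", None)    # the time zone marker has no value to store
    defaults = {
        # datetime counts Monday as 0; the article's table counts Sunday as 0.
        "weekday": (now.weekday() + 1) % 7,
        "year": now.year,
        "month": now.month,
        "day": now.day,
        "hour": now.hour,
        "minute": now.minute,
        "second": now.second,
    }
    for name, value in defaults.items():
        fields.setdefault(name, value)    # only fills fields that are absent
    # Not modeled: upgrading a two-digit year to four digits.
    return fields
```

Calling `assign_fields(1, [6, 29, 10, 94], datetime(2020, 9, 29, 12, 30, 45))` keeps the four parsed values and takes the hour, minute, and second from `now`.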

That’s it. Nothing fancy. The algorithm is optimized for the case where the string follows the correct format. If you pass something that’s not in the correct format, it does what it can. Sometimes it even comes up with something vaguely sensible!

Usually not.²

¹ If you think about it, this can be done very quickly by a simple decision tree:

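| Character 1 | Character 2 | Character 3 | Result |
|---|---|---|---|
| A | P | | Apr |
| | U | | Aug |
| D | | | Dec |
| F | E | | Feb |
| | R | | Fri |
| G | | | GMT |
| J | A | | Jan |
| | U | L | Jul |
| | | N | Jun |
| M | A | R | Mar |
| | | Y | May |
| | O | | Mon |
| N | | | Nov |
| O | | | Oct |
| S | A | | Sat |
| | E | | Sep |
| | U | | Sun |
| T | H | | Thu |
| | U | | Tue |
| U | | | UTC |
| W | | | Wed |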
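One way to picture that decision tree is as nested dictionaries, where a string is a leaf and a dictionary means another character is needed. This is a sketch, not the real code: `TREE` and `walk` are hypothetical names, and folding the input to upper case is an assumption.

```python
# Each key is the next character; a str value is a final result,
# a dict value means the match is still ambiguous.
TREE = {
    "A": {"P": "Apr", "U": "Aug"},
    "D": "Dec",
    "F": {"E": "Feb", "R": "Fri"},
    "G": "GMT",
    "J": {"A": "Jan", "U": {"L": "Jul", "N": "Jun"}},
    "M": {"A": {"R": "Mar", "Y": "May"}, "O": "Mon"},
    "N": "Nov",
    "O": "Oct",
    "S": {"A": "Sat", "E": "Sep", "U": "Sun"},
    "T": {"H": "Thu", "U": "Tue"},
    "U": "UTC",
    "W": "Wed",
}

def walk(word):
    """Follow the tree one character at a time; None means parsing fails."""
    node = TREE
    for ch in word.upper():    # assumption: case does not matter
        node = node.get(ch)
        if node is None:
            return None        # no entry continues with this character
        if isinstance(node, str):
            return node        # reached a leaf; later characters are ignored
    return None                # ran out of characters while still ambiguous
```

Here `walk("Savvyday")` stops after two characters with `"Sat"`, and `walk("Oatmeal")` stops after one with `"Oct"`.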

² For example, Friday Friday Friday Friday Friday Friday Friday Friday parses to “day 5 of month 5 year 5, hour 5 minute 5, and 5 seconds” or “May 5, 2005 at 05:05:05 GMT”.


Author

Raymond has been involved in the evolution of Windows for more than 30 years. In 2003, he began a Web site known as The Old New Thing which has grown in popularity far beyond his wildest imagination, a development which still gives him the heebie-jeebies. The Web site spawned a book, coincidentally also titled The Old New Thing (Addison Wesley 2007). He occasionally appears on the Windows Dev Docs Twitter account to tell stories which convey no useful information.

9 comments


  • Clockwork-Muse

    This same sort of logic undergirds part of the .NET HTTP date parsing too. That particular implementation was also returning a `DateTime` with `DateTimeKind` set to… `Unspecified`. Which means that if your server wasn’t running in UTC, date/time operations could return wrong results (not that C#’s `DateTime` is particularly friendly for general date/time operations anyways).

    The real travesty, however, is that HTTP dates aren’t in the ISO standard format.

  • Thiago Macieira

    > For example, Friday Friday Friday Friday Friday Friday Friday Friday parses to “day 5 of month 5 year 5, hour 5 minute 5, and 5 seconds” or “May 5, 2005 at 05:05:05 GMT”.

    Which is not a Friday.

    • Raymond Chen (Microsoft employee, Author)

      Man, I wish I had thought of that joke.

      • Rich G

        Why does output vary at all according to the day of the week, when the day of the month along with the month and year ought to be both necessary and sufficient for determining the date from any meaningful input? (Unless one is either using the weekday to infer the century, which seems unlikely given the crude algorithm hinted at and the closing example; or else counting days in the first week of Creation, which...

      • Thiago Macieira

        In this parser, you're probably right that parsing the day of the week was superfluous and unnecessary. At best you get information you already had; at worst you have conflicting or insufficient information. Suppose the input was "Sat, 30 Sep 2020 09:51:02": what do you do? 2020-09-30 is not a Saturday, it's a Wednesday. Similarly, if the input was "Sat, Sept 2020", you can't tell which Saturday in September was meant.

        So why should a parser...

      • Rich G

        Thanks, a good point. Intrigued, I experimented with InternetTimeToSystemTime to see whether any combinations of input are parsed as multiples of weeks, but found only the expected proliferation of multiples of years, months, days, hours, minutes and seconds (or mixtures of some of these with the current date or time). On the other hand, "W" is the only character prefix to a single number which returns a non-error result; but this is as an offset...

  • Nick

    It’s too bad we don’t have any days or months in English that begin with the letter C. I’d really like it if “Chicken chicken chicken chicken chicken chicken chicken chicken” parsed as a date.

    This does sort of raise the question of just _how_ liberal you should be when accepting input. When you get the wrong answer to the wrong question for the wrong reasons then something has gone, uh wrong, somewhere.

    • Brian Olsen

      If you prefer turkey, “Turkey turkey turkey turkey turkey turkey turkey turkey” should be valid.

  • 紅樓鍮

    This makes me appreciate compile-time regular expressions even more. Were there no regular expression-to-C compilers back in the 1990s though?