How did we end up parsing Savvyday 29 Oatmeal 94 as Saturday 29 October 1994?

Raymond Chen

Some time ago, we learned that the Internet­Time­To­System­Time function manages to parse “Savvyday 29 Oatmeal 94” as “Saturday 29 October 1994”. How did that happen? Is it finding the date with the shortest English Levenshtein distance?

Nothing that fancy.

Warning: This article discusses implementation details, which are not contractual. The algorithm is subject to change in the future. The only thing that Internet­Time­To­System­Time formally guarantees is that it can parse properly-formatted HTTP timestamps.

The parsing is very simple. The official format for HTTP date strings is

  1. DayOfWeek, Day Month Year Hour Minute Second GMT

In practice, not everybody follows the rules, so the parser accepts these three formats:

  1. DayOfWeek Day Month Year Hour Minute Second TZ
  2. DayOfWeek Month Day Hour Minute Second TZ Year
  3. DayOfWeek Month Day Hour Minute Second Year TZ

After discarding non-alphanumerics, the parser takes each word in the input string and converts it to a number somehow. If it consists of digits, then it’s parsed to a number in the usual way. If it consists of alphabetics, then it’s parsed to a number by trying to match it against the list of valid tokens:

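| DayOfWeek | Month | Month (cont.) | TZ |
|---|---|---|---|
| Sun = 0 | Jan = 1 | Jul = 7 | GMT |
| Mon = 1 | Feb = 2 | Aug = 8 | UTC |
| Tue = 2 | Mar = 3 | Sep = 9 | |
| Wed = 3 | Apr = 4 | Oct = 10 | |
| Thu = 4 | May = 5 | Nov = 11 | |
| Fri = 5 | Jun = 6 | Dec = 12 | |
| Sat = 6 | | | |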

If no exact match is found, then we look for an entry which shares the most initial characters with the word being parsed. If there is a unique such entry, then the parsed value is as given in the table. If there is no such entry, or the longest match is not unique, then parsing fails.¹

Since there is only one time zone permitted in HTTP time/date strings, all we have to remember is “Yup, it’s a time zone. There’s a time zone marker here.”

For example, the string “Savvyday” is not in the above table, but it does share the following prefixes:

| Length 1 | Length 2 |
|---|---|
| S(un) | Sa(t) |
| S(ep) | |

The longest match is length 2, and there’s only one such match, so the word “Savvyday” is parsed as if it were “Sat”.

Similarly, “Oatmeal” has only one match: Oct (length 1).
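The matching rule can be sketched in a few lines of Python. This is only an illustration of the rule as described above, not the actual implementation. The names `TOKEN_NAMES`, `shared_prefix_len`, and `match_token` are made up for the sketch, and the case-insensitive comparison is an assumption, since the description doesn't say how case is handled.

```python
# Token names from the table above (days, months, time zone markers).
TOKEN_NAMES = [
    "Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat",
    "Jan", "Feb", "Mar", "Apr", "May", "Jun",
    "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
    "GMT", "UTC",
]


def shared_prefix_len(a, b):
    """Count how many leading characters a and b have in common."""
    count = 0
    # Assumption: comparison ignores case.
    for x, y in zip(a.lower(), b.lower()):
        if x != y:
            break
        count += 1
    return count


def match_token(word):
    """Return the table entry that `word` resolves to, or None on failure."""
    # First try an exact match.
    for name in TOKEN_NAMES:
        if name.lower() == word.lower():
            return name
    # Otherwise look for the entry sharing the most initial characters.
    lengths = {name: shared_prefix_len(word, name) for name in TOKEN_NAMES}
    best = max(lengths.values())
    if best == 0:
        return None  # nothing shares even one character
    winners = [name for name, n in lengths.items() if n == best]
    # The longest match must be unique, or parsing fails.
    return winners[0] if len(winners) == 1 else None
```

With this sketch, `match_token("Savvyday")` gives `"Sat"` and `match_token("Oatmeal")` gives `"Oct"`, while a bare `"S"` fails because Sun, Sat, and Sep all tie at length 1.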

After everything is parsed into a number, we decide which of the three formats we are looking at.

If the second word was parsed from digits, then we are in case 1. If the seventh word was parsed from letters, then we are in case 2. Otherwise, we are in case 3.
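As a rough sketch of that decision, here is one way to split a string into words and pick the case. The helper names `tokenize` and `classify` are hypothetical. Treating every run of non-alphanumerics as a word separator, and treating a mixed word such as “29th” as letters, are assumptions made to keep the sketch short.

```python
import re


def tokenize(text):
    """Split on non-alphanumerics and tag each word as 'digits' or 'letters'."""
    words = re.findall(r"[A-Za-z0-9]+", text)
    # Assumption: anything that is not all digits counts as letters.
    return [("digits" if w.isdigit() else "letters", w) for w in words]


def classify(tokens):
    """Return 1, 2, or 3 for the three accepted layouts."""
    # Case 1: the second word was parsed from digits (DayOfWeek Day ...).
    if len(tokens) > 1 and tokens[1][0] == "digits":
        return 1
    # Case 2: the seventh word was parsed from letters (... TZ Year).
    if len(tokens) > 6 and tokens[6][0] == "letters":
        return 2
    # Case 3: everything else (... Year TZ).
    return 3
```

For instance, `classify(tokenize("Sat, 29 Oct 1994 19:43:31 GMT"))` lands in case 1 because the second word is `29`, and a string of eight alphabetic words lands in case 2 because its seventh word is made of letters.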

Once we’ve decided what case we’re in, we know where the year is. If the caller provided a two-digit year, upgrade it to a four-digit year.

Finally, we copy the fields into the output structure. If a field is missing, it is taken from the current date and time.
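The copying step might look something like the sketch below. It is a guess at the shape of the logic, not the real code: `LAYOUTS` and `extract_fields` are invented names, “missing” is assumed to mean that the input ran out of words, and the current time is assumed to be taken in UTC. The two-digit year upgrade is left out because the exact rule isn't stated here.

```python
from datetime import datetime, timezone

# Field order for each of the three accepted layouts.
LAYOUTS = {
    1: ["day_of_week", "day", "month", "year", "hour", "minute", "second", "tz"],
    2: ["day_of_week", "month", "day", "hour", "minute", "second", "tz", "year"],
    3: ["day_of_week", "month", "day", "hour", "minute", "second", "year", "tz"],
}


def extract_fields(values, case, now=None):
    """Map parsed numbers onto named fields; fill gaps from the current time."""
    # Assumption: "current date and time" means the current UTC time.
    now = now or datetime.now(timezone.utc)
    fields = {
        # Python counts Monday as 0; the token table counts Sunday as 0.
        "day_of_week": (now.weekday() + 1) % 7,
        "day": now.day,
        "month": now.month,
        "year": now.year,
        "hour": now.hour,
        "minute": now.minute,
        "second": now.second,
    }
    # Assumption: a "missing" field is one the input was too short to supply.
    for name, value in zip(LAYOUTS[case], values):
        if name != "tz":  # the time zone marker carries no number
            fields[name] = value
    return fields
```

Passing eight 5s with case 2 yields 5 for every field, and passing only `[3, 30, 9, 2020]` with case 1 takes the hour, minute, and second from `now`.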

That’s it. Nothing fancy. The algorithm is optimized for the case where the string follows the correct format. If you pass something that’s not in the correct format, it does what it can. Sometimes it even comes up with something vaguely sensible!

Usually not.²

¹ If you think about it, this can be done very quickly by a simple decision tree:

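| Character 1 | Character 2 | Character 3 | Result |
|---|---|---|---|
| A | P | | Apr |
| A | U | | Aug |
| D | | | Dec |
| F | E | | Feb |
| F | R | | Fri |
| G | | | GMT |
| J | A | | Jan |
| J | U | L | Jul |
| J | U | N | Jun |
| M | A | R | Mar |
| M | A | Y | May |
| M | O | | Mon |
| N | | | Nov |
| O | | | Oct |
| S | A | | Sat |
| S | E | | Sep |
| S | U | | Sun |
| T | H | | Thu |
| T | U | | Tue |
| U | | | UTC |
| W | | | Wed |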
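One way to picture that tree is as nested dictionaries, where a string is a leaf and a dictionary means “look at the next character”. This is a sketch for illustration only. The names `TREE` and `walk` are made up, and uppercasing the input first is an assumption about case handling.

```python
# Each level is keyed by the next character; a string value is a final result.
TREE = {
    "A": {"P": "Apr", "U": "Aug"},
    "D": "Dec",
    "F": {"E": "Feb", "R": "Fri"},
    "G": "GMT",
    "J": {"A": "Jan", "U": {"L": "Jul", "N": "Jun"}},
    "M": {"A": {"R": "Mar", "Y": "May"}, "O": "Mon"},
    "N": "Nov",
    "O": "Oct",
    "S": {"A": "Sat", "E": "Sep", "U": "Sun"},
    "T": {"H": "Thu", "U": "Tue"},
    "U": "UTC",
    "W": "Wed",
}


def walk(word):
    """Follow the tree one character at a time; None means parsing fails."""
    node = TREE
    for ch in word.upper():  # assumption: matching ignores case
        node = node.get(ch)
        if node is None:
            return None  # no branch for this character
        if isinstance(node, str):
            return node  # reached a leaf; remaining characters are ignored
    return None  # ran out of characters while still ambiguous
```

Here `walk("Savvyday")` stops at `"Sat"` after two characters, `walk("Oatmeal")` stops at `"Oct"` after one, and `walk("S")` fails because the tree still has three branches to choose from.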

² For example, Friday Friday Friday Friday Friday Friday Friday Friday parses to “day 5 of month 5 year 5, hour 5 minute 5, and 5 seconds” or “May 5, 2005 at 05:05:05 GMT”.

9 comments

  • Nick .

    It’s too bad we don’t have any days or months in English that begin with the letter C. I’d really like it if “Chicken chicken chicken chicken chicken chicken chicken chicken” parsed as a date.

    This does sort of raise the question of just _how_ liberal you should be when accepting input. When you get the wrong answer to the wrong question for the wrong reasons, then something has gone, uh, wrong, somewhere.

  • Thiago Macieira

    > For example, Friday Friday Friday Friday Friday Friday Friday Friday parses to “day 5 of month 5 year 5, hour 5 minute 5, and 5 seconds” or “May 5, 2005 at 05:05:05 GMT”.

    Which is not a Friday.

      • Rich G

        Why does output vary at all according to the day of the week, when the day of the month along with the month and year ought to be both necessary and sufficient for determining the date from any meaningful input? (Unless one is either using the weekday to infer the century, which seems unlikely given the crude algorithm hinted at and the closing example; or else counting days in the first week of Creation, which on any reckoning predates RFC 822.)

        • Thiago Macieira

          In this parser, you’re probably right that parsing the day of the week was superfluous and unnecessary. At best you get information you already had; at worst you have conflicting or insufficient information. Suppose the input was “Sat, 30 Sep 2020 09:51:02”: what do you do? 2020-09-30 is not a Saturday, it’s a Wednesday. Similarly, if the input was “Sat, Sept 2020”, you can’t tell which Saturday in September was meant.

          So why should a parser even learn to parse days of the week? Because some date formats use the week number, like “Wed, week 40, 2020”, which is unambiguous (assuming sender and receiver agree on how to number weeks, such as by adopting the ISO standard). There’s only one (ISO) week 40 in the year 2020 and there’s only one Wednesday in each week. This particular parser might have shared code with a different parser that accepted weekday-weeknumber as input, such as strptime with %U, %V or %W placeholders.

          At work, many people even refer to dates as “WW40.3”, where WW stands for “Work Week”, but the definition of the week numbers and of which day is which is underspecified and does not match any of %U, %V or %W. A shorthand that was meant to convey information instead causes misunderstandings and back-and-forth emails asking for clarification. Just write “2020-09-30”.

          • Rich G

            Thanks, a good point. Intrigued, I experimented with InternetTimeToSystemTime to see whether any combinations of input are parsed as multiples of weeks, but found only the expected proliferation of multiples of years, months, days, hours, minutes and seconds (or mixtures of some of these with the current date or time). On the other hand, “W” is the only character prefix to a single number which returns a non-error result; but this is as an offset of days rather than weeks from 2000-00-00, and is presumably simply because “W” coincidentally is the only single letter that is unambiguously the start of only a particular day of the week. Of course weekday lookup numbers are needed for the reverse function InternetTimeFromSystemTime but it remains puzzling why they were factored into InternetTimeToSystemTime when nothing similar to the ISO formats is accepted.

            Belated correction: Ah, InternetTimeToSystemTime does faithfully convert the lookup value for the substring in the weekday position to the corresponding SYSTEMTIME.wDayOfWeek integer, indeed it does so regardless of whether it is the correct weekday for the specified date, and without causing the function to return an error if the combination gives a calendar discrepancy. So that explains why the substring is converted to a lookup value. Conversely, InternetTimeFromSystemTime entirely ignores the SYSTEMTIME.wDayOfWeek value in the input, calculating the weekday substring from the year/month/day and not returning an error even if the input is beyond 0 to 6. (SYSTEMTIME.wMilliseconds also has no effect on InternetTimeFromSystemTime except that it does cause the function to return an error if the input is beyond 0 to 999.)

  • Clockwork-Muse

    This same sort of logic undergirds part of the .NET HTTP date parsing too. That particular implementation was also returning a `DateTime` with its `DateTimeKind` set to… `Unspecified`. Which means that if your server wasn’t running in UTC, date/time operations could return wrong results (not that C#’s `DateTime` is particularly friendly for general date/time operations anyways).

    The real travesty, however, is that HTTP dates aren’t in the ISO standard format.