June 11th, 2015

Keep your eye on the code page: Is this string CP_ACP or UTF-8?

A customer had a problem with strings and code pages.

The customer has a password like "Müllwagen" for a particular user. Note the umlaut over the u. That character is encoded as the two bytes C3 BC according to UTF-8. When the customer passes this password to the Logon­User function in order to authenticate the user, the call fails, claiming that the password is invalid.

If we encode the ü as the single byte FC, then the call to Logon­User succeeds.

Therefore, if the string is in UTF-8 form, it needs to be converted, and to do this we use the Multi­Byte­To­Wide­Char function. Once converted, the logon is successful.

The problem is that we are not sure if the password being given to the application will encode the ü as C3 BC or as FC. If it arrives as FC, and we try to convert it with the Multi­Byte­To­Wide­Char function, the ü is converted to U+FFFD.

If I take the FC-encoded string and convert it with the Multi­Byte­To­Wide­Char function, passing CP_ACP as the first parameter, then it converts successfully (no U+FFFD), and the call to Logon­User is successful.

For the application, the customer does not want to distinguish the two cases or implement any retry logic or anything like that. Can you help us understand the issue, what we are doing wrong, and how we can fix it?

As the problem is stated, you are screwed.

You have a bunch of bytes, and you don’t know what encoding they are in. The byte sequence C3 BC might be a UTF-8 encoding of ü, or it could be a CP_ACP encoding of ý. You are stuck with guessing. But for something as important as passwords, you shouldn’t guess. You need to know for sure, because an incorrect guess will generate audit entries, and may cause the user to become locked out of the account due to too many incorrect passwords.

This means that you need to make sure that whoever is passing you the string also tells you what encoding it is using.

The customer liaison replied,

Thanks. I went back and talked to the customer, and it turns out that the password is always in UTF-8 form, so the problem is solved. We will always pass CP_UTF8 when converting the string.

Topics
Code

Author

Raymond has been involved in the evolution of Windows for more than 30 years. In 2003, he began a Web site known as The Old New Thing which has grown in popularity far beyond his wildest imagination, a development which still gives him the heebie-jeebies. The Web site spawned a book, coincidentally also titled The Old New Thing (Addison Wesley 2007). He occasionally appears on the Windows Dev Docs Twitter account to tell stories which convey no useful information.

0 comments

Discussion are closed.