June 19th, 2024

On the sadness of treating counted strings as null-terminated strings

There are a number of data types which represent a counted string. Some of them are in the C++ standard library, like std::string and std::wstring. Some of them are Windows-specific, like BSTR or HSTRING. Be careful when treating these counted strings as null-terminated strings.

Treating a counted string as a null-terminated string is a lossy operation, because any embedded nulls in the counted string are mistakenly interpreted as the end of the string.

std::string s = "hello\0world"s;

// This prints "hello<nul>world"
std::cout << s << std::endl;

// Copy it through c_str
std::string t = s.c_str();

// This prints "hello"
std::cout << t << std::endl;

The embedded null in the string s is treated as the string terminator when we interpret the c_str() as a null-terminated string, and the last part of the string is lost.
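The loss is easy to measure by comparing lengths before and after the round trip. Here is a minimal sketch; the helper names MakeCounted and RoundTripThroughCStr are made up for illustration.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: builds the 11-character counted string from the example.
std::string MakeCounted()
{
    using namespace std::string_literals;
    return "hello\0world"s; // operator""s receives both the pointer and the length
}

// Hypothetical helper: copies by way of the null-terminated view of the string.
std::string RoundTripThroughCStr(std::string const& s)
{
    return std::string(s.c_str()); // length is recomputed by scanning for '\0'
}
```

The original has a size() of 11, but the copy has a size() of only 5.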

Now, you wouldn’t be so silly as to copy a std::string that way, seeing as there is a copy constructor right there.

std::string t = s; // use the copy constructor

But when you’re converting between different counted string types, you may be tempted to use the null-terminated string as the intermediary.

// widget.Name() returns a winrt::hstring,
// but we want to manipulate it as a std::wstring
std::wstring name(widget.Name().c_str());

Not only is there a performance penalty here, because the std::wstring constructor has to go look for the terminating null character, but there is also a security vulnerability: If an attacker puts an embedded null in a string, they might be able to sneak past a security check or validation.

bool IsAllowedName(std::wstring const& name)
{
    return name == L"alice" || name == L"bob";
}

void ProcessWidget(Widget const& widget)
{
    if (!IsAllowedName(widget.Name().c_str())) {
        throw winrt::hresult_access_denied();
    }

    ⟦ continue processing ⟧
}

An attacker could bypass the access check by using a widget whose name is "alice\0haha", and it will be considered to have an allowed name, since the embedded null causes the std::wstring passed to IsAllowedName() to consist only of the characters leading up to the null terminator.
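The bypass can be reproduced without any Windows headers. The sketch below assumes a plain std::wstring as a stand-in for the counted winrt::hstring, and the names AttackerName, CheckViaCStr, and CheckCounted are hypothetical.

```cpp
#include <string>

bool IsAllowedName(std::wstring const& name)
{
    return name == L"alice" || name == L"bob";
}

// Hypothetical attacker-controlled name: 10 characters with an embedded null.
std::wstring AttackerName()
{
    return std::wstring(L"alice\0haha", 10);
}

// Lossy path: the temporary std::wstring is built from a null-terminated pointer,
// so it contains only the characters before the embedded null.
bool CheckViaCStr(std::wstring const& counted)
{
    return IsAllowedName(counted.c_str());
}

// Length-preserving path: the whole counted string takes part in the comparison.
bool CheckCounted(std::wstring const& counted)
{
    return IsAllowedName(counted);
}
```

With the attacker's name, CheckViaCStr returns true but CheckCounted returns false.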

As another example, you might want to print a BSTR, which is also a counted string type, although the representation is that of a pointer to the first wchar_t. This means that you can often pretend that a BSTR is a null-terminated string, but the danger is that any embedded null will cause you to stop processing the string before you get to the end.

void PrintBstr(BSTR bstr)
{
    std::wcout << bstr; // BSTR is a wchar_t*, so use the wide stream
}

Sidebar: There’s another danger, namely that the BSTR might be nullptr, which represents a zero-length string. However, trying to << a (wchar_t*)nullptr is undefined behavior and typically crashes, because the << operator will dereference the null pointer while searching for the null terminator.
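To illustrate what treating the pointer as a counted string involves, here is a portable sketch that takes the length as a separate parameter. The helper name CountedView is made up; on Windows, the length would come from SysStringLen, which returns zero for a null BSTR.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: views a counted wide string without searching for a terminator.
// A null pointer is treated as the empty string, matching the BSTR convention.
std::wstring_view CountedView(wchar_t const* p, std::size_t length)
{
    return p ? std::wstring_view(p, length) : std::wstring_view();
}
```

Under those assumptions, something like std::wcout << CountedView(bstr, SysStringLen(bstr)) would write every character, embedded nulls included, and would not dereference a null pointer.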

Okay, now that we’ve laid out the problem, we’ll look at solutions next time.

Bonus chatter:

// This also prints "hello"
std::string u = "hello\0world";
std::cout << u << std::endl;
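The tail is lost here for the same reason: without the s suffix, the literal decays to a char const*, and the std::string constructor finds the length by scanning for the first null. A minimal sketch contrasting that with the constructor that takes an explicit length (the function names are made up for illustration):

```cpp
#include <string>

// Pointer-only constructor: the length is determined by the first '\0'.
std::string FromPointerOnly()
{
    return std::string("hello\0world");
}

// Pointer-plus-length constructor: all 11 characters are kept.
std::string FromPointerAndLength()
{
    return std::string("hello\0world", 11);
}
```

The first function produces a string of length 5, and the second produces a string of length 11.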

Bonus bonus chatter: For the specific case of converting a winrt::hstring to a std::wstring, you can just pass the winrt::hstring and use the std::wstring_view conversion constructor!

winrt::hstring h;
std::wstring w(h); // just construct it from the hstring

In our example, we would just pass the winrt::hstring to IsAllowedName and let the compiler do the conversion.

void ProcessWidget(Widget const& widget)
{
    if (!IsAllowedName(widget.Name())) {
        throw winrt::hresult_access_denied();
    }

    ⟦ continue processing ⟧
}

Bonus bonus bonus chatter: What if you are forced to produce a pointer to a null-terminated string due to some interop requirement?

In that case, you should fail the operation if the counted string contains an embedded null. For HSTRING, you can use the WindowsStringHasEmbeddedNull function to check for an embedded null. The WindowsStringHasEmbeddedNull function caches the result, so asking a second time uses the result calculated from the first time. Mind you, scanning a string for an embedded null is probably not that expensive, so the cache doesn’t buy you much, especially since you’re probably about to pass the string to another function that will consume it, so the contents of the string are going to be scanned by the consumer anyway.

Arguably, the c_str() function should throw an exception if the counted string is not representable as a C-style null-terminated string. But what’s done is done. At best, we can make up a new method name, like safe_c_str()?
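Here is one possible shape for such a safe_c_str(), written as a free function for std::string. This is only a sketch: the function is hypothetical, not part of any library, and the choice of std::invalid_argument is an assumption.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical free function; not part of the standard library.
// Returns the null-terminated pointer only if the counted string can be
// represented faithfully as a C-style string.
// The pointer is valid only as long as s is alive and unmodified.
char const* safe_c_str(std::string const& s)
{
    if (s.find('\0') != std::string::npos) {
        throw std::invalid_argument("string contains an embedded null");
    }
    return s.c_str();
}
```

A caller that needs a null-terminated string for interop would call safe_c_str(name) instead of name.c_str(), and the operation fails instead of silently truncating.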

Author

Raymond has been involved in the evolution of Windows for more than 30 years. In 2003, he began a Web site known as The Old New Thing which has grown in popularity far beyond his wildest imagination, a development which still gives him the heebie-jeebies. The Web site spawned a book, coincidentally also titled The Old New Thing (Addison Wesley 2007). He occasionally appears on the Windows Dev Docs Twitter account to tell stories which convey no useful information.

12 comments

  • alan robinson

    All potentially ugly, but somewhat more tolerable if the compiler raises a warning. Do they? I’ve never dealt with counted strings. Of course warnings can go wrong; I am Soooooo tired of being warned about float to int conversions, for instance.

  • Jacob Manaker · Edited

    I think it's irresponsible to present such truncation as a fixable problem, rather than a sign that something much deeper has gone wrong. Two of the data types you mention (std::string, std::wstring) and most custom string classes are specifically designed to hold strings of textual data that do not contain embedded nulls, precisely because they use the null character as an invalid sentinel value. Storing in them a sequence of chars that may...

    • Danielix Klimax

      Wrong. From Notes on c_str at cppreference:
      “The pointer obtained from c_str() may only be treated as a pointer to a null-terminated character string if the string object does not contain other null characters. ”

      Therefore we can assert that basic_string CAN contain NUL character, but you may be limited in interfacing with C-style APIs. There is no limitation on content of basic_string. (And it would be dumb to have it)

    • Simon Farnsworth

      I've just looked at my copy of ISO/IEC 14882:2003(E), an older version of the C++ standard, and it describes both std::string and std::wstring as containers for character type sequences, rather than the previously defined Null Terminated Character Type Sequence. It's also explicit about which constructors take a length, and which ones take an NTCTS and determine the length from the position of the null character type.

      I cannot find any language in section 21 (Strings Library)...

      • Raymond Chen · Microsoft employee · Author

        I think it’s clear that std::string and std::wstring are intended to support embedded null characters. If they were intended to be null-terminated strings, then resize() shouldn’t change the length(), since it just adds nulls to the end.

    • Dmitry · Edited

      You seem to treat the problem with C/C++-coloured glasses. Those languages are actually not the whole world, and in the rest of the world binary safety of strings and performance are and have been for ages by now something to take care of.

      That some terribly designed language in the past (let’s not use the names, but it was C) decided to use slow and inefficient representation of strings should not be a big deal by...

      • 紅樓鍮 · Edited

        Bonus chatter:
        I doubt it’s really worth a separate data type in any language other than a few with C background in their genealogical tree.
        Modern languages (including those that don't descend from C) actually do have separate string and bytes types, but for a different reason: the string type is almost always restricted to containing only valid Unicode scalar values (and they're usually implemented as UTF-8/16/32 encoded strings that are known to be valid...

  • Valts Sondors

    I have only a surface level familiarity with C++, so I have a potentially silly question:

    std::string u = "hello\0world";

    Doesn’t the double-quoted string syntax also produce a null-terminated string which is then passed as a char* to some implicit conversion function on the std::string class? That would mean that the “world” doesn’t even make it into “u”, much less “cout”.

    • deadcream

      This is exactly what Raymond says in the comment one line higher: <code>

      If you meant the original <code> in the beginning of the post, it's different because here std::string is created using "s" literal (at the end of the line). User-defined (in this case standard library defined) string literals know the length of the whole string.

    • Simon Farnsworth

      You misquoted the critical bit from the article:
      <code>

      Turning this into a full program has it look like:
      <code>

      That "s" on the end invokes operator ""s, which is passed both a pointer to the string literal, and its length. This feature was introduced in C++11, under the name "User-defined literals" - your favourite C++ reference (I use cppreference.com) should explain how it works, and have documentation for std::literals::string_literals::operator""s

  • Antonio Rodríguez

    The proliferation of string types in C++ reminds me of this XKCD comic: https://xkcd.com/927/