There are a number of data types that represent a counted string. Some of them are in the C++ standard library, like std::string and std::wstring. Some of them are Windows-specific, like BSTR or HSTRING. Be careful when treating these counted strings as null-terminated strings.
Treating a counted string as a null-terminated string is a lossy operation, because any embedded nulls in the counted string are mistakenly interpreted as the end of the string.
    std::string s = "hello\0world"s;
    // This prints "hello<nul>world"
    std::cout << s << std::endl;
    // Copy it through c_str
    std::string t = s.c_str();
    // This prints "hello"
    std::cout << t << std::endl;
The embedded null in the string s is treated as the string terminator when we interpret the c_str() as a null-terminated string, and the last part of the string is lost.
Now, you wouldn’t be so silly as to copy a std::string that way, seeing as there is a copy constructor right there.
std::string t = s; // use the copy constructor
But when you’re converting between different counted string types, you may be tempted to use the null-terminated string as the intermediary.
    // widget.GetName() returns a winrt::hstring,
    // but we want to manipulate it as a std::wstring
    std::wstring name(widget.GetName().c_str());
Not only is there a performance penalty here, because the std::wstring constructor has to go look for the terminating null character, but there is also a security vulnerability: If an attacker puts an embedded null in a string, they might be able to sneak past a security check or validation.
    bool IsAllowedName(std::wstring const& name)
    {
        return name == L"alice" || name == L"bob";
    }

    void ProcessWidget(Widget const& widget)
    {
        if (!IsAllowedName(widget.Name().c_str())) {
            throw winrt::hresult_access_denied();
        }
        ⟦ continue processing ⟧
    }
An attacker could bypass the access check by using a widget whose name is "alice\0haha", and it will be considered to have an allowed name, since the embedded null causes the std::wstring passed to IsAllowedName() to consist only of the characters leading up to the null terminator.
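To see the bypass without any Windows Runtime dependencies, here is a minimal sketch in standard C++. It assumes a std::wstring can stand in for the counted string that the widget returns. IsAllowedViaCStr and IsAllowedCounted are hypothetical helper names that I made up for illustration.

```cpp
#include <string>

bool IsAllowedName(std::wstring const& name)
{
    return name == L"alice" || name == L"bob";
}

// Hypothetical helper: mimics the vulnerable call. Passing c_str() builds
// a temporary std::wstring that stops at the first null character.
bool IsAllowedViaCStr(std::wstring const& counted)
{
    return IsAllowedName(counted.c_str());
}

// Hypothetical helper: passes the counted string through unchanged,
// so every character takes part in the comparison.
bool IsAllowedCounted(std::wstring const& counted)
{
    return IsAllowedName(counted);
}
```

If you build the name as std::wstring(L"alice\0haha", 10), the c_str() version accepts it, but the counted version rejects it because the string really has ten characters.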
As another example, you might want to print a BSTR, which is also a counted string type, although the representation is that of a pointer to the first wchar_t. This means that you can often pretend that a BSTR is a null-terminated string, but the danger is that any embedded null will cause you to stop processing the string before you get to the end.
    void PrintBstr(BSTR bstr)
    {
        // A BSTR points to wchar_t, so it goes to the wide stream.
        std::wcout << bstr;
    }
Sidebar: There’s another danger, namely that the BSTR might be nullptr, which represents a zero-length string. However, trying to << a (wchar_t*)nullptr will crash because the << operator will dereference the null pointer while searching for the null terminator.
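Both dangers go away if you print by count instead of by terminator. The following is only a sketch in portable C++, not a Windows implementation: the pointer and length parameters stand in for a BSTR and its length (with a real BSTR you would get the length from SysStringLen, which reports zero for a null BSTR). PrintCounted and PrintToString are hypothetical names.

```cpp
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper: writes exactly `length` characters, embedded nulls
// included. A null pointer is treated as a zero-length string.
void PrintCounted(std::wostream& out, wchar_t const* text, std::size_t length)
{
    if (text != nullptr && length != 0) {
        out.write(text, static_cast<std::streamsize>(length));
    }
}

// Hypothetical helper: captures the output in a string so that it can be
// inspected instead of going to the console.
std::wstring PrintToString(wchar_t const* text, std::size_t length)
{
    std::wostringstream out;
    PrintCounted(out, text, length);
    return out.str();
}
```

The write() member does an unformatted write of the requested number of characters, so it never goes looking for a terminator.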
Okay, now that we’ve laid out the problem, we’ll look at solutions next time.
Bonus chatter:
    // This also prints "hello"
    std::string u = "hello\0world";
    std::cout << u << std::endl;
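The difference from the first example is the missing s suffix. Here is a small sketch that compares the lengths you get from each way of building the string. The function names are made up for illustration.

```cpp
#include <cstddef>
#include <string>

using namespace std::string_literals;

// The const char* constructor scans for the first null character,
// so it sees only the 5 characters of "hello".
std::size_t LengthFromPointer()
{
    std::string u = "hello\0world";
    return u.size();
}

// The ""s literal operator is given the length of the literal,
// so it keeps all 11 characters.
std::size_t LengthFromLiteralSuffix()
{
    std::string s = "hello\0world"s;
    return s.size();
}

// Passing an explicit count to the constructor also keeps all 11.
std::size_t LengthFromExplicitCount()
{
    std::string v("hello\0world", 11);
    return v.size();
}
```

Without the suffix, the string has already been truncated before it ever reaches std::cout.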
Bonus bonus chatter: For the specific case of converting a winrt::hstring to a std::wstring, you can just pass the winrt::hstring and use the std::wstring conversion constructor!
    winrt::hstring h;
    std::wstring w(h); // just construct it from the hstring
In our example, we would just pass the winrt::hstring to IsAllowedName and let the compiler do the conversion.
    void ProcessWidget(Widget const& widget)
    {
        if (!IsAllowedName(widget.Name())) {
            throw winrt::hresult_access_denied();
        }
        ⟦ continue processing ⟧
    }
Bonus bonus bonus chatter: What if you are forced to produce a pointer to a null-terminated string due to some interop requirement?
In that case, you should fail the operation if the counted string contains an embedded null. For HSTRING, you can use the WindowsStringHasEmbeddedNull function to check for an embedded null. The WindowsStringHasEmbeddedNull function caches the result, so asking a second time uses the result calculated from the first time. Mind you, scanning a string for an embedded null is probably not that expensive, so the cache doesn’t buy you much, especially since you’re probably about to pass the string to another function that will consume it, so the contents of the string are going to be scanned by the consumer anyway.
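For a standard C++ counted string, the same "fail if there is an embedded null" policy might look something like this. It is only a sketch: CheckedCStr is a hypothetical helper, and throwing std::invalid_argument is my assumption about how you would want to report the failure.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: returns the null-terminated pointer only if the
// counted string can be represented faithfully as a null-terminated string.
wchar_t const* CheckedCStr(std::wstring const& s)
{
    if (s.find(L'\0') != std::wstring::npos) {
        throw std::invalid_argument("string contains an embedded null");
    }
    return s.c_str();
}
```

The returned pointer is the string's own c_str(), so it is valid only for as long as the original string is alive and unmodified.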
Arguably, the c_str() function should throw an exception if the counted string is not representable as a C-style null-terminated string. But what’s done is done. At best, we can make up a new method name, like safe_c_str()?
All potentially ugly, but somewhat more tolerable if the compiler raises a warning. Do they? I’ve never dealt with counted strings. Of course warnings can go wrong; I am Soooooo tired of being warned about float to int conversions, for instance.
I think it's irresponsible to present such truncation as a fixable problem, rather than a sign that something much deeper has gone wrong. Two of the data types you mention (std::string, std::wstring) and most custom string classes are specifically designed to hold strings of textual data that do not contain embedded nulls, precisely because they use the null character as an invalid sentinel value. Storing in them a sequence of chars that may...
Wrong. From Notes on c_str at cppreference:
“The pointer obtained from c_str() may only be treated as a pointer to a null-terminated character string if the string object does not contain other null characters. ”
Therefore we can assert that basic_string CAN contain NUL character, but you may be limited in interfacing with C-style APIs. There is no limitation on content of basic_string. (And it would be dumb to have it)
I've just looked at my copy of ISO/IEC 14882:2003(E), an older version of the C++ standard, and it describes both std::string and std::wstring as containers for character type sequences, rather than the previously defined Null Terminated Character Type Sequence. It's also explicit about which constructors take a length, and which ones take an NTCTS and determine the length from the position of the null character type.
I cannot find any language in section 21 (Strings Library)...
I think it’s clear that std::string and std::wstring are intended to support embedded null characters. If they were intended to be null-terminated strings, then resize() shouldn’t change the length(), since it just adds nulls to the end.
You seem to treat the problem with C/C++-coloured glasses. Those languages are actually not the whole world, and in the rest of the world binary safety of strings and performance are and have been for ages by now something to take care of.
That some terribly designed language in the past (let’s not use the names, but it was C) decided to use slow and inefficient representation of strings should not be a big deal by...
Bonus chatter:
I doubt it’s really worth a separate data type in any language other than a few with C background in their genealogical tree.
Modern languages (including those that don't descend from C) actually do have separate string and bytes types, but for a different reason: the string type is almost always restricted to containing only valid Unicode scalar values (and they're usually implemented as UTF-8/16/32 encoded strings that are known to be valid...
I have only a surface level familiarity with C++, so I have a potentially silly question:
Doesn’t the double-quoted string syntax also produce a null-terminated string which is then passed as a char* to some implicit conversion function on the std::string class? That would mean that the “world” doesn’t even make it into “u”, much less “cout”.
This is exactly what Raymond says in the comment one line higher: <code>
If you meant the original <code> in the beginning of the post, it's different because here std::string is created using "s" literal (at the end of the line). User-defined (in this case standard library defined) string literals know the length of the whole string.
You misquoted the critical bit from the article:
<code>
Turning this into a full program has it look like:
<code>
That "s" on the end invokes operator ""s, which is passed both a pointer to the string literal, and its length. This feature was introduced in C++11, under the name "User-defined literals" - your favourite C++ reference (I use cppreference.com) should explain how it works, and have documentation for std::literals::string_literals::operator""s
Raymond uses a subtle “”s (see https://en.cppreference.com/w/cpp/string/basic_string/operator%22%22s)
The proliferation of string types in C++ reminds me of this XKCD comic: https://xkcd.com/927/