TR1 Fixes In VC9 SP1
STL enjoys speaking in the third person and also enjoys bringing you this exclusive news:
Visual Studio 2008 Service Pack 1 (VC9 SP1) contains the TR1 and MFC library extensions that were originally released in the Visual C++ 2008 Feature Pack Refresh. But wait! VC9 SP1 also contains many delicious fixes for TR1 and MFC. (For TR1, “many” is 16; for MFC, “many” is 60 or more!)
As the current maintainer of the C++ Standard Library and TR1 implementations within Microsoft, STL has compiled an exhaustive list of the TR1 fixes in VC9 SP1. But first! Shout-outs must be sent to P.J. Plauger and Christopher J. Walker of Dinkumware, who tirelessly worked to implement most of these fixes, and Microsoft’s compiler front-end ninja Jonathan Caves, who fixed nasty compiler bugs that were exposed by TR1.
The three most significant fixes are:
* A massive performance improvement in regex matching.
* An across-the-board performance improvement in STL containers of TR1 objects (e.g. vector<shared_ptr<T> >).
* A fix allowing tr1::function to store function objects with non-const function call operators.
This is the exhaustive list:
1. The STL algorithm search_n() no longer spuriously triggers _HAS_ITERATOR_DEBUGGING assertions. (search_n() isn’t part of TR1, but this is why it’s being mentioned here: A severe search_n() bug, a regression from VC8 SP1 to VC9 RTM where the predicate version attempted to use operator==(), was fixed in the VC9 Feature Pack Refresh. However, the fix contained this less severe flaw, which was noticed in time to be truly fixed in VC9 SP1.)
2. The random distribution uniform_int<unsigned long long> no longer infinitely loops.
3. The <random> header has been overhauled, incorporating many C++0x improvements.
4. The copy constructors of the pseudorandom number generators now behave correctly. (This was a subtle mistake where a template constructor provided a better match than a copy constructor during overload resolution.)
5. enable_shared_from_this’s copy constructor and copy assignment operator have been corrected. (This affected classes deriving from enable_shared_from_this, and did not affect other uses of shared_ptr.)
6. shared_ptr<const T> can now be constructed from const T *. (This is relatively unusual.)
7. tr1::function can now store function objects with non-const function call operators. (This was a severe problem.)
8. The performance of regex matching has been massively improved. (In general, TR1 Regex is as fast or faster than Boost.Regex 1.35.0. TR1 Regex is still slower than Boost for some regexes, such as those dominated by alternations like “cute|fluffy|kittens”, but their performance has also improved significantly compared to the Feature Pack Refresh. Further performance improvements are being investigated for VC10.)
9. The performance of unordered_set (etc.) and hash_set (etc.) has been significantly improved. (tr1::unordered_set and stdext::hash_set share the same implementation. erase() still presents performance problems, which are being investigated for VC10.)
10. result_of now accepts references to functions.
11. is_function and is_member_function_pointer now work correctly with variadic arguments.
12. is_polymorphic now works correctly. (It previously gave bogus answers for classes like std::iostream. This was a compiler bug fixed by Jonathan Caves.)
13. The <memory> header’s declarations of the _InterlockedIncrement (etc.) intrinsics can now be suppressed by defining _DO_NOT_DECLARE_INTERLOCKED_INTRINSICS_IN_MEMORY . This is for compatibility with <intrin.h> and <winbase.h>, which contain conflicting declarations for certain configurations of certain platforms. (This workaround has been eliminated in VC10, which contains a comprehensive fix.)
14. mem_fn() now works with abstract base classes. (This surprising bug was caused by library and compiler bugs, both fixed. Other code may benefit from this compiler fix.)
15. is_pod and has_trivial_constructor now work correctly. (They previously gave bogus answers for certain uncommon classes.)
16. The Swaptimization (previously mentioned in this VCBlog post) is now actually used for TR1 objects.
#16 deserves some explanation.
STL containers (vector, deque, list, set, multiset, map, multimap) are “fast-swappable”; x.swap(y) and swap(x, y) are constant-time, nofail, non-iterator-invalidating. (You might remember that VC8’s Swap Bug broke that last, very important part.) std::string swap() is also constant-time and nofail, although it can invalidate iterators due to the small string optimization.
Constant-time means that swapping STL containers is implemented by swapping their guts (their pointers, whether to a vector’s block of memory, a list’s doubly-linked nodes, or a map’s binary tree nodes). That’s far superior to all of the element-copying that “Container temp = a; a = b; b = temp;” would involve.
Now, when a vector undergoes reallocation, it has to copy its current elements from the old memory block to the new memory block. When you have a vector of STL containers (vector<vector<int> >, vector<set<int> >, etc.), that would be slow, copying the sub-containers and doing lots of dynamic memory allocation and deallocation.
The Swaptimization, which was present in VC8 (possibly earlier, but STL hasn’t checked), is a bit of template magic whereby STL containers are marked as having fast swaps. When a vector<T> undergoes reallocation, it detects whether the T has a fast swap, and if so, will swap elements from the old memory block to the new memory block. This is super fast! (Why doesn’t it always swap? For vector<int>, simply copying ints from the old memory block to the new memory block is faster – nothing needs to be swapped back into the old memory block.)
Many TR1 objects, such as unordered_set and match_results and shared_ptr, also have fast swaps. This is because they’re either containers (like unordered_set) or they wrap containers (like match_results), or they’re not containers but they have the same pointers-to-stuff structure (like shared_ptr). So, Dinkumware and STL worked really hard to give all TR1 objects fast swaps, and mark them appropriately to be picked up by the Swaptimization.
Unfortunately, this was simply broken in the Feature Pack Refresh! The problem was subtle. The STL provides the swap() free function in namespace std, which performs the three-step dance that works for anything copyable and assignable (but might be slow). The idea is that users with fast-swappable classes, in addition to providing member swap(), can fully specialize swap() in namespace std for their classes. (Generally, users *aren’t* supposed to add anything to namespace std. However, Standard templates may be fully or partially specialized for user classes; this is paragraph 22.214.171.124/1 of the Standard.) This works well en ough.
However, users with fast-swappable class *templates* have to do something different. A not-commonly-understood fact about C++ is that there are no such things as partial specializations of function templates. Anything that looks like a partial specialization is actually an overload. (Only classes can be partially specialized.) In practice, overloads behave similarly enough to how users think partial specializations of function templates would work, and everyone gets along happily.
But this means that you can’t add overloads of swap() to namespace std, even if you think they look like those mythical partial specializations. So, if you have a fast-swappable class template, you need to provide a swap() free function in the same namespace as your class template, to be found through Argument-Dependent Lookup (ADL, formerly and less descriptively called Koenig Lookup).
(No, this isn’t really ideal. Swapping is such a fundamental operation that it really ought to be recognized in the Core Language like copying – but it’s far too late to change that.)
So, TR1, which is almost but not quite part of the Standard, came along and provided a bunch of fast-swappable stuff in namespace std::tr1, and defined overloads of swap(), again within std::tr1. The TR1 classes were fast-swappable, annotated as such, and detected as such by the existing Swaptimization machinery within vector and a few other places. So why didn’t it work?
It turned out that the Swaptimization was being performed with a call to qualified std::swap(). Most people don’t do C++ name lookup in their heads for fun, but the important thing to know is this: Qualified Name Lookup disables ADL. Uh oh! So, the general implementation of std::swap() (doing the slow three-step dance of copying) was chosen, instead of the specific implementations of std::tr1::swap() for shared_ptr<T>, unordered_set<T>, and so forth. Disaster. STL was deeply mortified.
(Why was a qualified call being used? Another name lookup subtlety – unqualified calls like swap() activate both Unqualified Name Lookup (the usual, which everyone is familiar with) and ADL (which everyone really should be familiar with). Ordinarily, the union of the sets of functions they find is then used for overload resolution (picking what actually gets called). But if Unqualified Name Lookup finds a member function, ADL is bypassed; this is paragraph 3.4.2/2a of the Standard. So, within a class that defines a member swap(), calling unqualified swap() doesn’t activate ADL, nor does it even find std::swap() – it finds the member swap(). Thus, qualified calls became conventional within VC’s Standard Library implementation.)
The fix was to define a wrapper, std::_Swap_adl(), that calls unqualified swap(), activating ADL properly. The STL now calls std::_Swap_adl() whenever ADL is desired. (It continues to call std::swap() whenever ADL is unnecessary, such as when swapping builtins). As with all _Leading_underscore_capital names, users shouldn’t call std::_Swap_adl() themselves (it may change or disappear in future versions). You can perform the exact same trick by defining your own wrapper in your own namespace of choice, with the wrapper containing a “using namespace std;” and an unqualified call to swap(). (If you look at std::_Swap_adl()’s implementation, it lacks a using-directive – it already lives in namespace std, unlike anything you can define.)
The end result is that when a vector<shared_ptr<T> > undergoes reallocation, absolutely no reference counts are incremented or decremented – which is a significant performance win. Woot!
Note that none of these fixes were in the SP1 Beta, which branched for release as the Feature Pack was being finished.
Stephan T. Lavavej
Visual C++ Libraries Developer