A Race Condition in .NET Finalization and its Mitigation for C++/CLI

Tanveer

Abstract

There is a dormant race condition in .NET which affects even single threaded code when finalizers are executed. The cause is primarily the fact that finalizers are called on a separate thread by .NET and may access objects which have already been garbage collected due to aggressive lifetime determination by the .NET JIT compiler in newer versions of the .NET runtime. A solution for this problem, in the form of automatic generation of calls to System::GC::KeepAlive, has been implemented in the Microsoft C++ compiler and is available in version 16.10 and later.

Introduction

C++/CLI is primarily meant to be an interop language bridging the native and .NET worlds efficiently. Consequently, a frequently occuring code pattern is wrapping of native pointers in managed classes. E.g.

class NativeClass { ... };
ref class ManagedClass {
    ...
private:
    NativeClass* ptr;
};

Often, the managed wrapper class will new an instance of NativeClass, which controls and accesses a system resource (e.g. a file), uses the resources and to make sure that the resource is properly released back, delegates this task to the finalizer. Elaborating the above example, we could have code like:

 1  using Byte = System::Byte;
 2  using String = System::String^;
 3  using Char = System::Char;
 4
 5  class File {
 6      FILE*   fp;
 7  public:
 8      explicit File(const Char* path, const Char* mode)
 9      {
10          fp = _wfopen(path, mode);
11      }
12      void Read() { ... }
13      void Write(const void*, size_t) { ... }
14      void Seek() { ... }
15      void Close()
16      {
17          if (fp) {
18              fclose(fp); fp = nullptr;
19          }
20      }
21      ~File() { Close(); }
22  };

26   ref class DataOnDisk
27   {
28   public:
29       DataOnDisk(String path, String mode)
30       {
31           cli::pin_ptr<const Char> path_ptr = PtrToStringChars(path);
32           cli::pin_ptr<const Char> mode_ptr = PtrToStringChars(mode);
33           ptr = new File(path_ptr, mode_ptr);
34       }
35       ~DataOnDisk() { this->!DataOnDisk(); }
36       !DataOnDisk()
37       {
38           if (ptr) {
39               delete ptr; ptr = nullptr;
40           }
41       }
42       void Close() { this->!DataOnDisk(); }
43       void WriteData(array<Byte>^ data) { ... }
44   private:
45       File*           ptr;  // Pointer to native implementation class.
46   };

In the above code, class File controls the actual file via the native C++ interface, while DataOnDisk uses the native class to read/write structured data to file (details have been omitted for clarity). While Close can be called explicitly when there is no more use for the file, the finalizer is meant to do this when the DataOnDisk object is collected.

As we shall see in the following section, while the above code appears correct, there is a hidden race condition that can cause program errors.

Race Condition

Let us define the member WriteData from the above code

49  void DataOnDisk::WriteData(array<Byte>^ buffer)
50  {
51      pin_ptr<Byte> buffer_ptr = &buffer[0];
52      this->ptr->Write(buffer_ptr, buffer->Length);
53  } 

This function itself might be called in this context:

55  void test_write()
56  {
57      DataOnDisk^ dd = gcnew DataOnDisk(...);
58      array<Byte>^ buf = make_test_data();
59      dd->WriteData(buf);
60  } 

So far, nothing catches the eye or looks remotely dangerous. Starting from test_write, let us examine what happens in detail.

  1. A DataOnDisk object is created (line 57), some test data is created and WriteData is called to write this data to file (line 59).
  2. The WriteData carefully pins the buffer array object (line 51) before taking the address of an element and calling the Write member function of the underlying native File object. The pinning is important because we don’t want .NET to move the buffer bytes while the write is happening.
  3. However, since the .NET garbage collector knows nothing about native types, the ptr field of DataOnDisk is just a bit pattern with no other meaning attached. The .NET JIT compiler has analyzed the code and determined that the last use of the dd object is to access ptr (line 52), before its value is passed as the implicit object parameter of File::Write. Following this reasoning by the JIT compiler, once the value of ptr is fetched from the object, the object dd is no longer needed and becomes eligible for garbage collection.The fact that ptr points to a live native object is opaque to .NET because it does not track native pointers.
  4. From here onward, things can go wrong. The object dd is scheduled for collection and as part of the process, the finalizer is run, typically on a second thread. Now, we have potentially two things happening at the same time without any ordering between them, a classic race condition: the Write member function is executing and the finalizer !DataOnDisk is executing as well, the latter will delete the file object referenced by ptr while File::Write is possibly still running, which can then result in a crash or other incorrect behavior.

Wait — Wha…?

Several questions immediately come to mind:

  • Is this a new bug? Yes — and no. The issue has potentially been around since .NET 2.0.
  • What changed? The .NET JIT compiler started being aggressive with lifetime determination in .NET 4.8. From the perspective of managed code, it is doing the right thing.
  • But, this affects a core C++/CLI native interop scenario. What can be done? Read on.

Solutions

It is easy to see that when the call to Write happens (line 52), if this is kept alive, the race condition disappears since dd will no longer be collected before the call to Write returns. This could be done in several different ways:

  • Treat the change in the behavior of the JIT compiler as a bug and revert back to old behavior. Doing this requires a system update for .NET and potentially disables optimizations. Freezing the .NET framework at version 4.7 is also an option but not one that will work in the longer term, especially since the same JIT behavior can happen in .NET Core as well.
  • Manually insert System::GC::KeepAlive(this) calls where needed. This works but is error prone and requires examining the user source and changing it, so this is not a viable solution for large source bases.
  • Have the compiler inject System::GC::KeepAlive(this) calls, when needed. This is the solution we have implemented in the Microsoft C++ compiler.

Details

We could brute-force a solution by issuing a call to KeepAlive every time we see a call to native function, but for performance reasons we want to be more clever. We want to issue such calls where there is a possibility of a race condition but nowhere else. The following is the algorithm that the Microsoft C++ compiler follows to determine if an implicit KeepAlive call is to be issued at a point in the code where:

  • We are at a return statement or implicit return from a member function of a managed class;
  • The managed class has a member of type ‘reference or pointer to unmanaged type’, including members in its direct or indirect base classes, or embedded in members of class-types occuring anywhere in the class hierarchy;
  • A call to a function FUNC is found in the current (managed member) function, which satisfies one or more of these conditions:

    1. FUNC doesn’t have a __clrcall calling convention, or
    2. FUNC doesn’t take this either as an implicit or explicit argument, or
    3. A reference to this doesn’t follow the call to FUNC

In essence, we are looking for indicators that show this is in no danger of getting garbage collected during the call to FUNC. Hence, if the above conditions are satisfied, we insert a System::GC::KeepAlive(this) call immediately following the call to FUNC. Even though a call to KeepAlive looks very much like a function call in the generated MSIL, the JIT compiler treats it as a directive to consider the current object alive at that point.

How to get the fix

The above Microsoft C++ compiler behavior is on by default in Visual Studio version 16.10 and up but in in cases where unforeseen problems occur due to the new implicit emission of KeepAlive calls, the Microsoft C++ compiler provides two escape hatches:

  • the driver switch /clr:implicitKeepAlive-, which turns off all such calls in the translation unit. This switch is not available in project system settings but must be added explicitly to the command-line option list (Property Pages > Command Line > Additional Options).
  • #pragma implicit_keepalive, which provides fine-grained control over the emission of such calls at the function level.

A Final Nit

The astute reader will have noted that there is still a possible race condition at line 39. To see why, imagine that both the finalizer thread and user code call the finalizer at the same time. The possibility of a double-delete in this case is obvious. Fixing this requires a critical section but is beyond the scope of this article and left to the reader as an exercise.

Posted in C++

2 comments

Leave a comment

  • Kalle Niemitalo

    If you inserted System::GC::KeepAlive(this) at the end of the finalizer, would that fix the possible race condition? I think it would partially serialize the finalization:

    – If user code calls the finalizer first, then the GC would not be able to call the finalizer before the call from user code reaches System::GC::KeepAlive(this). At that point, ptr would already be null, so the finalizer called by GC would do nothing.

    – If the GC calls the finalizer first, then user code cannot have a reference to the object and cannot call the finalizer.

    It would still not be safe against parallel Finalize or Dispose calls from two threads in user code, but I don’t think that kind of safety is generally expected, except in partial-trust code which is deprecated anyway.

  • Vincent Romano

    DataOnDisk::Close() should not call the finalizer directly, but rather ‘delete this’ to result in a call to IDisposable::Dispose(true) and GC::SuppressFinalize(this). This ensures deterministic cleanup of base classes and members of ref class type (as in R, not R^).

    With regard to your final nit, the only direct call to the finalizer should be in the destructor. The destructor is called in the context of IDisposable::Dispose, which holds a reference to ‘this’ and suppresses finalization. Therefore, I don’t see how the finalizer thread could ever call the finalizer concurrently.