You can’t copy code with memcpy; code is more complicated than that

Raymond

Back in the day, a customer reported that their program crashed on Itanium.

Wait, come back!

Itanium is where the customer recognized the problem, but it applies to all other architectures, so stick with me.

Their code went roughly like this:

struct REMOTE_THREAD_INFO
{
    int data1;
    int data2;
    int data3;
};

static DWORD CALLBACK RemoteThreadProc(REMOTE_THREAD_INFO* info)
{
    try {
        ... use the info to do something ...
    } catch (...) {
        ... ignore all exceptions ...
    }
    return 0;
}
static void EndOfRemoteThreadProc()
{
}

// Error checking elided for expository purposes
void DoSomethingCrazy()
{
    // Calculate the number of code bytes.
    SIZE_T functionSize = (BYTE*)EndOfRemoteThreadProc - (BYTE*)RemoteThreadProc;

    // Allocate memory in the remote process
    SIZE_T allocSize = sizeof(REMOTE_THREAD_INFO) + functionSize;
    REMOTE_THREAD_INFO* buffer = (REMOTE_THREAD_INFO*)
      VirtualAllocEx(targetProcess, NULL, allocSize, MEM_COMMIT,
        PAGE_EXECUTE_READWRITE);

    // Write data to the remote process
    REMOTE_THREAD_INFO localInfo = { ... };
    WriteProcessMemory(targetProcess, buffer,
                       &localInfo, sizeof(localInfo));

    // Write code to the remote process
    WriteProcessMemory(targetProcess, buffer + 1,
                       (void*)RemoteThreadProc, functionSize);

    // Execute it!
    CreateRemoteThread(targetProcess, NULL, 0,
                       (LPTHREAD_START_ROUTINE)(buffer + 1),
                       buffer);
}

This code is such a bad idea, I’ve intentionally introduced errors so it won’t even compile.

The idea is that they want to inject some code into a target process, so they use Virtual­Alloc to allocate some memory in that process. The first part of the memory block contains some data that they want to pass. The second part of the memory block contains the code bytes that they want to execute, and they tell Create­Remote­Thread execution at those code bytes.

I’m just going to say it right now: The entire idea that went into this code is fundamentally flawed.

The customer reported that this code “worked just fine on 32-bit x86 and 64-bit x86”, but it doesn’t work on Itanium.

Actually, I’m surprised that it worked even on x86!

The design assumes that all of the code in RemoteThreadProc is position-independent. There is no requirement that generated code be position-independent. For example, one code generation option for switch statements is to use a jump table, and that jump table consists of absolute addresses on x86.

In fact, it’s clear that the code isn’t position-independent, because it’s using C++ exception handling, and the Microsoft compiler’s implementation of exception handling involves a table that maps points of execution to catch statements, so that it knows which catch statement to use. And if they had used a filtered catch, then there would be additional tables for deciding whether the catch filter applies to the exception that was thrown.

The design also assumes that the code contains no references to anything outside the function body itself. All of the jump tables and lookup tables used by the function need to be copied to the target process, and the code assumes that those tables are also between the labels EndOfRemoteThreadProc RemoteThreadProc.

Indeed, we know that there will be references to content outside the function body itself, because the C++ try/catch block will call into functions in the C runtime support library.

Both x86-64 and Itanium use unwind codes for exception handling, and there was no attempt to register those unwind codes in the target process.

My guess is that they were lucky and no exceptions were thrown, or at least they were thrown infrequently enough that it eluded their testing.

There is also no guarantee that EndOfRemoteThreadProc will be placed directly after RemoteThreadProc in memory. Indeed, there’s not even a guarantee that EndOfRemoteThreadProc will have an independent existent. The linker may perform COMDAT folding, which causes multiple identical functions to be combined into one. Even if you disable COMDAT folding, Profile-Guided Optimization will move the functions independently, and they are unlikely to end up in the same place.

Indeed, there’s no requirement that the code bytes for the RemoteThreadProc function be contiguous at all! Profile-Guided Optimization will rearrange basic blocks, and the code for a single function may end up scattered across different parts of the program, depending on their usage patterns.

Even without Profile-Guided Optimization, compile-time optimization may inline some or all of a function, so a single function might have multiple copies in memory, each of which has been optimized for its specific call site.

There are also some Itanium-specific rules that ensure abject failure on Itanium.

On Itanium, all instructions must be aligned on 16-byte boundaries, but the above code does not respect that. Also, on Itanium, function pointers point not to the first code byte, but to a descriptor structure that contains a pair of pointers, one to the functions gp, and the other to the first byte of code. (This is the same pattern used by PowerPC.)

I pointed out to the customer liaison that what the customer is trying to do is very suspicious and looks like a virus. The customer liaison explained that it’s quite the opposite: The customer is a major anti-virus software vendor! The customer has important functionality in their product that that they have built based on this technique of remote code injection, and they cannot afford to give it up at this point.

Okay, now I’m scared.

A safer¹ way to inject code into a process is to load the code as a library, via Load­Library. This invokes the loader, which will do the work of applying fixups as necessary, allocating all the memory in the appropriate way, with the correct alignment, registering control flow guard and exception unwind tables, loading dependent libraries, and generally getting the execution environment set up properly to run the desired code.

We never heard back from the customer.

¹ I didn’t say it was a safe way to inject code. Just that it was safer.

12 comments

Comments are closed. Login to edit/delete your existing comments

  • Sunil Joshi

    And stories like this are why I stopped using 3rd party anti-virus programmes. I spotted only half the problems Raymond did when I first saw the code snippet but I could have told you this idea was a non-starter and I only spent about 20 seconds glancing at the code. The idea that a major anti-virus vendor had this concept out in production is terrifying.

  • Joshua Hudson

    1) Every time I see somebody doing this in a language other than assembly I get the heebie-jeebies. Raymond is absolutely right just how sensitive this is to code generation.

    2) If you’re not doing this in assembly _please_ take Raymond’s advise and inject a thread with its start address as LoadLibrary. This works. On a side note, you should pass a full path as the argument (via VirtualAllocEx and WriteProcessMemory; note that you can free this memory after the remote thread exits); which means you should be calling LoadLibraryW. But that’s a minor point.

    3) While the result of doing LoadLibraryW into a target process is reasonably safe provided you don’t violate the target process’s memory model*, most likely the first thing you will be doing in the target process is not safe at all.

    4) The documentation for RUNTIME_FUNCTION needs to be a lot better. As it stands I can’t register my functions because I can’t fill the structure because the documentation is too bad. https://docs.microsoft.com/en-us/windows/win32/api/winnt/ns-winnt-runtime_function

    5) We are in the state that everybody knows you don’t need to call RtlAddFunctionTable unless the function or any function it can call can throw structured exceptions. Changing that rule now is a breaking change all over the place.

    * Known memory model constraints:
    1) If the process doesn’t start COM, don’t start COM or even load the COM dlls. It may be incompatible.
    2) If the process doesn’t link against user32 or gdi32, there’s probably a perfectly good reason for that, so don’t do that either.
    3) If you inject a Cygwin process, be prepared for your dll to be loaded into other processes. Also, your dll must have a unique base address.
    4) Check target subsystem. It might not be WIn32 at all.

    The general result of these basically means that injecting all processes is a bad idea. Only inject processes you know.

  • Kasper Brandt

    The ASLR implementation on Windows tries to use the same base address for a given library in all processes (per boot) to maximize page sharing (also making ASLR less effective…), so that may also be why they didn’t run into issues on x86 and x64. That is of course also somewhat luck based because the base address for the module may already be in use in the target process, forcing the loader to use another base address.

  • 紅樓鍮

    So I searched the Internet for “DLL injection” and learned today that you can pass LoadLibrary as the start address to CreateRemoteThread, and I’m now wondering whether the signature of LoadLibrary has been deliberately given this way…

    On a closer look, however, it turns out that the signature of LoadLibrary isn’t an exact match for LPTHREAD_START_ROUTINE: LoadLibrary returns a pointer, which is different from LPTHREAD_START_ROUTINE‘s DWORD. I wonder whether that prevents passing it to CreateRemoteThread on any architecture Windows supports or used to support; I think Raymond talked some time ago about how posting a void-returning function to the thread pool can result in crashing on Itanium.

    • Me Gusta

      There is a difference between returning something and returning nothing in this case though.
      Anyway, first, threading was first added to Windows as part of Windows NT.
      For anything 32 bit, the DWORD would match the HMODULE return. Sure, HMODULE, or any handle type, is treated as a pointer in languages like C, it is just a scalar value so it would be treated as an integral type on the ABI. This means that the handling of the return value would be identical. So for 32 bit systems, there would never be any problems.
      64 bit systems do have a problems in that the HMODULE is 64 bit where the DWORD is 32 bit. But if the ABI is designed carefully, then this isn’t a problem. For one example, x64 will return scalars, pods 64 bits or less and _m64 in RAX. Since both pointers and integral types are scalars, then both LoadLibrary and CreateRemoteThread will expect the return in RAX. The only issue is that the upper 32 bits may or may not be valid and you would have to expect that it gets corrupted along the way.
      ARM64 uses the x0 register in a similar way as x64 uses RAX. So while the signatures no longer match, the ABI being designed in such a way that it will use a register that is the pointer size in length will mean that it will stay compatible. This careful design of the ABI should mean that all 64 bit platforms would have no issues.

      • Raymond ChenMicrosoft employee

        Almost. Alpha AXP requires that 32-bit return values by placed in v0 in canonical form. A non-canonical value (like a pointer) could cause problems but in practice is unlikely to since the thread exit code is a passthrough value.

  • Dave Gzorple

    I did this years ago under Windows NT 4 as a proof-of-concept of how dangerous the VirtualXXX functions were, and I remember the pain I had to go through just to pop up a simple dialog in the target process to prove I’d got there. I was hugely impressed when the l0pht guys later used the concept in a full-blown app, I think it was BO2K.

  • Gerrit Albrecht

    Thanks for the contribution. Indeed, code copying used to be a fine thing. When I was at school, we were supposed to write a curve sketching application. My solution was to create and compile a small helping program in the background and then copy the new function f(x) into the main application (Turbo Pascal using an Amiga-based DOS emulator):

      GotoXY( 8,10); Write('Funktion eingeben !   f(x) = ');
      ReadLn(term);
      Assign(datei,HilfP_verz + '\HILFPRG.PAS');
      ReWrite(datei);
      WriteLn(datei, 'Program Hilfsprogramm;         ');
      WriteLn(datei, 'Var anzahl : Word;             ');
      WriteLn(datei, 'Function f(x : Real) : Real;   ');
      WriteLn(datei, 'Begin                          ');
      WriteLn(datei, '  f := ',term,';               ');
      WriteLn(datei, 'End;                           ');
      WriteLn(datei, 'Function Dummy : Real;         ');
      WriteLn(datei, 'Begin End;                     ');
      WriteLn(datei, 'Begin                          ');
      WriteLn(datei, '  anzahl := Ofs(Dummy)-Ofs(f); ');
      WriteLn(datei, '  MemW[$0:$4fe] := CSeg;       ');
      WriteLn(datei, '  MemW[$0:$4fc] := Ofs(f);     ');
      WriteLn(datei, '  MemW[$0:$4fa] := anzahl;     ');
      WriteLn(datei, 'End.                           ');
      Close(datei);
    
      GotoXY( 8,WhereY+1); WriteLn('Compiling...'); WriteLn;
      SwapVectors;
      Exec(turbo_verz + '\TPC.EXE', HilfP_verz + '\HILFPRG.PAS');
      SwapVectors;
      GotoXY( 8,WhereY+1); Write('Running...');
      SwapVectors;
      Exec(HilfP_verz + '\HILFPRG.EXE','');
      SwapVectors;
      GotoXY( 8,WhereY+1); Write('Exchange function...');
      ziel   := Ptr(CSeg, Ofs(f));
      quell  := Ptr(MemW[$0:$4fe], MemW[$0:$4fc]);
      anzahl := MemW[$0:$4fa];
      Move(quell^, ziel^, anzahl);
      GotoXY( 8,WhereY+1); Write('Ready.');

    The good old days 😉

  • Shy Shalom

    Also, conveniently, the signature of LoadLibrary is the same as the signature required by CreateRemoteThread so all you need to copy to the remote process is just the path of the dll to load.

  • Alexis Ryan

    Injecting a dll with loadlibrary, fun when dealing with ASLR. Thankfully Windows lets you query the location a Dll was loaded at in another process so you can adjust your addresses as needed. I have tried directly copying instructions to another process but only inline assembly instructions and it worked fine because the asm was intentionally written to be position independent and I could use labels to get the exact size of the code