Diagnosing Hidden ODR Violations in Visual C++ (and fixing LNK2022)

Visual CPP Team

Hi, my name is Andy Rich, and I’m a tester on the front-end compiler.  Last time I posted here, I discussed several aspects of testing conformance in the C++ compiler, which is a large part of my job.  Another large part of what I do is diagnosing bugs in the compiler.  These come from a variety of sources, such as customer reports through Connect, regression testing, and new tests. But some of the most useful and critical bugs come from one of our largest customers: other developers in Microsoft.  These developers are the most likely to be using our latest product, and are often doing some very complicated things with it (such as building Windows). 

What I want to discuss today is a class of bug report that we get all the time, especially from internal customers.  We see these reports because the user is convinced they’ve caught a compiler bug, and have contacted us for a fix.  But I’m not going to discuss a compiler bug – because this class of bug report is always user error.

It is important that I warn you up front: some of the tricks that I am going to use – one compiler switch in particular – are UNDOCUMENTED and UNSUPPORTED.  If you choose to make use of this option in the compiler, you do so at your own risk.  We use this option all the time internally, but we have basically no testing for it, so there’s no telling what might happen if you use it.  Still here?  Great – let’s get on with it.

Spot the defect is a fun game we play in the compiler team.  Someone sends out a snippet of code, and says “Spot the defect!” and then we all take potshots at the code, trying to determine where the bug is.  (I used to play the same game with my DDJ subscription – the first thing I sought out every month was the PC-Lint Bug Of The Month advertisements.) Let’s play the same game – here’s your code, in three source files:

// a.h – share class definition (code has been corrected from original post)

#pragma once

 

class Test_A {

public:

       Test_A(){ c = ‘X’; data = 0; }

       ~Test_A(){ if(data) delete [] data; }

       char c;

       char* data;

};

 

void Test_A_Loader(Test_A&);

// loader.cpp – loader defn.

#include <cstring> //for strcpy

#include “a.h”

 

void Test_A_Loader(Test_A& a) {

       a.c = ‘p’;

       a.data = new char[10];

       strcpy(a.data, “3.14159”);

}

// main.cpp – main program

#include <stdio.h> // for printf

#include “a.h”

 

int main(){

       Test_A a;

       Test_A_Loader(a);

       printf(“%c : %c %c %c %cn”, a.c,

              a.data[0], a.data[1], a.data[2], a.data[3]);

}  

                                                                              

If I told you this code was crashing, would you believe me?  This is simplified, but accurately represents the same class of bugs I was whining about earlier.  The astute reader will look to the title of the post for assistance, which is helpful, though it won’t be a big enough hint: knowing the dangers of ODR violations are key to understanding this problem. (I did say they’d be hidden ODR violations, didn’t I?)

The One-Definition Rule is defined in section 3.2 of the C++ Standard (ISO/IEC 14882:2003).  Paraphrasing that rule: no single translation unit can have two definitions for the same variable, function, class type, etc.  However, between two translation units, some things can have the same definition (as in the code sample above), provided they are the same in both translation units.  Based on this reading, it doesn’t appear I’m breaking this rule in my code – in fact, based on what I’ve shown you, I haven’t.  Because the spot-the-defect was a trick question of sorts: there appears to be no crash in the code I showed you.

This lie-through-omission is the same one that internal customers tell all the time: the class appears to be the same in both translation units, and yet one translation unit is treating it differently than the other.  They’re left to assume that there’s a bug in the compiler that’s causing this.  After a while in the compiler group, you learn to ask the right questions; and the right question, in this case, is: how is this program being built?  Because in this case, that’s where the defect lies:

cl main.cpp /Zp2 /c
cl loader.cpp /Zp1 /c
link main.obj loader.obj

As soon as I see the /Zp switch (or I notice code that has a lot of #pragma pack’s), I start getting suspicious that I’m about to see an ODR violation.  The issue is actually fairly simple, once you understand it – the problem is that it is rarely as obvious as this case – classes have lots of base classes, are embedded in other classes, and often only show issues when you move to another architecture (especially when moving from 32 to 64-bit, because pointer sizes change).

The defect is that the definition of Test_A is different between main.obj and loader.obj.  So main creates the object with one view, and loader goes and stomps the data passed to it with its own understanding of Test_A, then that object is returned by-value to main.obj, which then tries to access an invalid pointer.

Main.obj’s understanding of Test_A:

class Test_A    size(6):
        +—
 0      | c
        | <alignment member> (size=1)
 2      | data
        +—

Loader.obj’s understanding of Test_A:

class Test_A    size(5):
        +—
 0      | c
 1      | data
        +—

You can clearly see that the data pointer is not in the same place in both of these objects.  This looks like a bug of the compiler, but the compiler is explicitly complying with the user’s request, and because the two compilands are compiled as separate entities, there’s no way the compiler could diagnose this.  The linker can diagnose this in limited scenarios, especially under IJW.  If you’ve gotten the boggling “inconsistent metadata” error (LNK2022), you’ve probably fallen victim to this issue.

Diagnosing and fixing these issues, once you’ve run into them, is the real challenge.  The LNK2022 help page is especially informative, stating: “In this case, make sure that the type has an identical definition in all compilands.”  (Which is harder to do – especially when packing throws a wrench in there.) The compiler has a switch that can help you do this, but (as I said before) that switch is undocumented and unsupported.  It’s also fantastically helpful.  That switch is /d1reportSingleClassLayoutXXX, where XXX performs substring matches against the class name.  In fact, those helpful class layout tables above were created with that switch:

cl main.cpp /Zp2 /c /d1reportSingleClassLayoutTest_A

Using this switch in multiple compilands and then comparing the output is an unbelievably useful way to get to the bottom of these issues.  It shows you exactly what the compiler believes to be the layout of that class – if these layouts differ at all, you’ve got potential bugs waiting to be found.  The switch is more than just useful – it can be fun to inspect and see how things are laid out by the compiler.

One other helpful tip for getting to the bottom of these problems is to make use of #pragma pack(show), which you can put right before your object definition to see what the packing level is when the compiler hits that line.

Thanks for reading,

Andy

Posted in C++

0 comments

Discussion is closed.

Feedback usabilla icon