Anybody who writes #pragma pack(1) may as well just wear a sign on their forehead that says “I hate RISC”

Raymond Chen

Using #pragma pack(1) changes the default structure packing to byte packing, removing all of the padding bytes that are normally inserted to preserve alignment.

Consider these two structures:

#include <cstdint>

// no #pragma pack in effect.
struct S
{
    int32_t total;
    int32_t a, b;
};

#pragma pack(1)

struct P
{
    int32_t total;
    int32_t a, b;
};

#pragma pack()  // restore default packing so the structures below are unaffected

Both structures have identical layouts because the members are already at their natural alignment. Therefore, you would expect these two structures to be equivalent.

But they’re not.

Changing the default structure packing has another consequence: It changes the alignment of the structure itself. In this case, the #pragma pack(1) declares that the structure P can itself be placed at any byte boundary, instead of requiring it to be placed on a 4-byte boundary.
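A quick way to see the difference is with a couple of static_asserts. This is a minimal sketch: the exact values assume a typical compiler where int32_t is four bytes with natural 4-byte alignment and where alignof reports the packing that was in effect.

// Same members, same layout...
static_assert(sizeof(S) == sizeof(P), "S and P are laid out identically");

// ...but different alignment requirements.
static_assert(alignof(S) == 4, "S must be placed on a 4-byte boundary");
static_assert(alignof(P) == 1, "P may be placed on any byte boundary");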

struct ExtraS
{
    char c;
    S s;
    char d;
};

struct ExtraP
{
    char c;
    P p;
    char d;
};

Even though the structures S and P have the same layout, the difference in alignment means that the structures ExtraS and ExtraP end up quite different.

The ExtraS structure starts with a char, then adds three bytes of padding, followed by the S structure, then another char, and three more bytes of padding to bring the entire structure back up to 4-byte alignment. This ensures that an array of ExtraS structures will properly align all of the embedded S objects.

By comparison, the ExtraP structure starts out the same way, with a single char, but this time there is no padding before the P because the P is byte-aligned. Similarly, there is no tail padding at the end of the structure, because the P is byte-aligned and therefore does not need to be kept at any particular alignment in the case of an array of ExtraP objects.

ExtraS (20 bytes):

  00        c
  01-03     (padding)
  04-07     s.total
  08-0B     s.a
  0C-0F     s.b
  10        d
  11-13     (padding)

ExtraP (14 bytes):

  00        c
  01-04     p.total
  05-08     p.a
  09-0C     p.b
  0D        d
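These layouts can be verified mechanically with offsetof and sizeof. Again, a minimal sketch under the same assumptions as above:

#include <cstddef>   // offsetof

// ExtraS: padding before and after the 4-byte-aligned S.
static_assert(offsetof(ExtraS, s) == 0x04, "three bytes of padding after c");
static_assert(offsetof(ExtraS, d) == 0x10, "d follows the 12-byte S");
static_assert(sizeof(ExtraS) == 20, "tail padding restores 4-byte alignment");

// ExtraP: no padding anywhere, because P is byte-aligned.
static_assert(offsetof(ExtraP, p) == 0x01, "P may start at any offset");
static_assert(offsetof(ExtraP, d) == 0x0D, "d immediately follows P");
static_assert(sizeof(ExtraP) == 14, "no tail padding");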

The effect is more noticeable if you have an array of these objects.

Array of ExtraS (each element 20 bytes, 4-byte aligned):

  element 0: c at 00, padding 01-03, s.total 04-07, s.a 08-0B, s.b 0C-0F, d at 10, padding 11-13
  element 1: c at 14, padding 15-17, s.total 18-1B, s.a 1C-1F, s.b 20-23, d at 24, padding 25-27

Array of ExtraP (each element 14 bytes, byte-aligned):

  element 0: c at 00, p.total 01-04, p.a 05-08, p.b 09-0C, d at 0D
  element 1: c at 0E, p.total 0F-12, p.a 13-16, p.b 17-1A, d at 1B

Observe that in the array of ExtraS objects, the s.total, s.a, and s.b are always four-byte aligned. But in the array of ExtraP objects, there is no consistent alignment for the members of p.

The possibility that any P structure could be misaligned has significant consequences for code generation, because all accesses to members must handle the case that the address is not properly aligned.

void UpdateS(S* s)
{
    s->total = s->a + s->b;
}

void UpdateP(P* p)
{
    p->total = p->a + p->b;
}

Despite the structures S and P having exactly the same layout, the code generation is different because of the alignment.

Intel Itanium

UpdateS:

adds  r31 = r32, 4
adds  r30 = r32, 8 ;;
ld4   r31 = [r31]
ld4   r30 = [r30] ;;
add   r31 = r30, r31 ;;
st4   [r32] = r31
br.ret.sptk.many rp

UpdateP:

adds  r31 = r32, 4
adds  r30 = r32, 8 ;;
ld1   r29 = [r31], 1
ld1   r28 = [r30], 1 ;;
ld1   r27 = [r31], 1
ld1   r26 = [r30], 1 ;;
dep   r29 = r27, r29, 8, 8
dep   r28 = r26, r28, 8, 8
ld1   r25 = [r31], 1
ld1   r24 = [r30], 1 ;;
dep   r29 = r25, r29, 16, 8
dep   r28 = r24, r28, 16, 8
ld1   r27 = [r31]
ld1   r26 = [r30] ;;
dep   r29 = r27, r29, 24, 8
dep   r28 = r26, r28, 24, 8 ;;
add   r31 = r28, r29 ;;
st1   [r32] = r31
adds  r30 = r32, 1
adds  r29 = r32, 2
extr  r28 = r31, 8, 8
extr  r27 = r31, 16, 8 ;;
st1   [r30] = r28
st1   [r29] = r27, 1
extr  r26 = r31, 24, 8 ;;
st1   [r29] = r26
br.ret.sptk.many rp

Alpha AXP

UpdateS:

ldl   t1, 4(a0)
ldl   t2, 8(a0)
addl  t1, t1, t2
stl   t1, (a0)
ret   zero, (ra), 1

UpdateP:

ldq_u t1, 4(a0)
ldq_u t3, 7(a0)
extll t1, a0, t1
extlh t3, a0, t3
bis   t1, t3, t1
ldq_u t2, 8(a0)
ldq_u t3, 11(a0)
extll t2, a0, t2
extlh t3, a0, t3
bis   t2, t3, t2
addl  t1, t1, t2
ldq_u t2, 3(a0)
ldq_u t5, (a0)
inslh t1, a0, t4
insll t1, a0, t3
msklh t2, a0, t2
mskll t5, a0, t5
bis   t2, t4, t2
bis   t5, t3, t5
stq_u t2, 3(a0)
stq_u t5, (a0)
ret   zero, (ra), 1

MIPS R4000

UpdateS:

lw    t0, 4(a0)
lw    t1, 8(a0)
addu  t0, t0, t1
jr    ra
sw    t0, (a0)

UpdateP:

lwl   t0, 7(a0)
lwr   t0, 4(a0)
lwl   t1, 11(a0)
lwr   t1, 8(a0)
addu  t0, t0, t1
swl   t0, 3(a0)
jr    ra
swr   t0, (a0)

PowerPC 600

UpdateS:

lwz    r4, 4(r3)
lwz    r5, 8(r3)
addu   r4, r4, r5
stw    r4, (r3)
blr

UpdateP:

lbz    r4, 4(r3)
lbz    r9, 5(r3)
rlwimi r4, r9, 8, 16, 23
lbz    r9, 6(r3)
rlwimi r4, r9, 16, 8, 15
lbz    r9, 7(r3)
rlwimi r4, r9, 24, 0, 7
lbz    r5, 8(r3)
lbz    r9, 9(r3)
rlwimi r5, r9, 8, 16, 23
lbz    r9, 10(r3)
rlwimi r5, r9, 16, 8, 15
lbz    r9, 11(r3)
rlwimi r5, r9, 24, 0, 7
addu   r4, r4, r5
stb    r4, (r3)
rlwimi r9, r4, 24, 0, 31
stb    r9, 1(r3)
rlwimi r9, r4, 16, 0, 31
stb    r9, 2(r3)
rlwimi r9, r4, 8, 0, 31
stb    r9, 3(r3)
blr

SuperH-3

UpdateS:

mov.l @(4, r4), r2
mov.l @(8, r4), r3
add   r3, r2
rts
mov.l r2, @r4

UpdateP:

mov.b  @(7, r4), r1
shll8  r1
mov.b  @(6, r4), r2
extu.b r2, r2
or     r2, r1
shll8  r1
mov.b  @(5, r4), r2
extu.b r2, r2
or     r2, r1
shll8  r1
mov.b  @r4, r2
extu.b r2, r2
or     r1, r2
mov.b  @(7, r4), r1
shll8  r1
mov.b  @(6, r4), r3
extu.b r3, r3
or     r3, r1
shll8  r1
mov.b  @(5, r4), r3
extu.b r3, r3
or     r3, r1
shll8  r1
mov.b  @r4, r3
extu.b r3, r3
or     r1, r3
add    r3, r2
mov.b  r2, @r4
shlr8  r2
mov.b  r2, @(1, r4)
shlr8  r2
mov.b  r2, @(2, r4)
shlr8  r2
rts
mov.b  r2, @(3, r4)

Observe that for some RISC processors, the code size explosion is quite significant. This may in turn affect inlining decisions.

Moral of the story: Don’t apply #pragma pack(1) to structures unless absolutely necessary. It bloats your code and inhibits optimizations.
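If you genuinely need a byte-for-byte layout, say for an on-disk or on-wire format, one way to contain the damage is to scope the packing with #pragma pack(push, 1) / #pragma pack(pop) around only the serialization struct, and copy the data into a naturally aligned struct before doing any real work. A minimal sketch; the WireRecord and Record names are just for illustration:

#include <cstdint>
#include <cstring>

#pragma pack(push, 1)
struct WireRecord          // matches the external byte layout exactly
{
    int32_t total;
    int32_t a, b;
};
#pragma pack(pop)          // the packing change does not leak past this point

struct Record              // naturally aligned; use this for computation
{
    int32_t total;
    int32_t a, b;
};

Record Decode(const WireRecord& wire)
{
    // The two structs happen to have identical member layout here, so a
    // single memcpy suffices; memcpy does not care about the alignment
    // of its source.
    Record r;
    std::memcpy(&r, &wire, sizeof(r));
    return r;
}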

Bonus chatter: Once you make this mistake, you can’t go back. You allowed the structure to be byte-aligned, and if you remove the spurious #pragma pack(1), you are making the structure more strictly aligned, which will be a breaking change for any clients which used the byte-packed version.

19 comments


  • Jeremy Richards

    Out of curiosity, could you combine #pragma pack(1) with __declspec(align(4)) to get the structure packing effect, without impacting the alignment of the structure when embedded in other structures?

    • Raymond Chen (Microsoft employee)

      My quick experiments suggest that yes you can add __declspec(align(4)) or alignas(4) to restore 4-byte alignment.

      • Alex Martin

        But note that this is not guaranteed to be sufficient; I believe some 64-bit architectures require 8-byte alignment?

        • Joshua Hudson

          __declspec(align(sizeof(void*))) should work but I have not tried it.

          • Charles Milette

            Prefer using “struct alignas(void*) Foo”

  • Joshua Hudson

    #pragma pack(1) is for representing on-disk or on-wire structures correctly. Why ever use it elsewhere?

    • Raymond Chen (Microsoft employee)

      You’d be surprised how often it’s applied to things that are never used for serialization.

    • cricwood@nerdshack.com

      *cough* games *cough* performance *cough*

      When bandwidth is your bottleneck, trading ALU for more payload per cache line can be a good deal. This is especially true on x86, where pack(1) gives you more payload per cache line for … pretty much no additional ALU load.

      • Dave Bacher

        #pragma pack(1) is going to stall on x86 and x64. A lot.

        An unaligned read is five microcode operations, versus one for an aligned read. An unaligned write is twelve microcode operations, versus one for an aligned write. On older AMD hardware, you’re going to stall three of the four execution units a good percentage of the time.

        A better strategy – if you have the ability – is array per field. That will get you the same bandwidth benefits, but the compiler will optimize to SSE or MMX more frequently and will unroll more loops. You’ll generally get the same bandwidth benefits as #pragma pack(1), since arrays of primitive types pack.

        • cricwood@nerdshack.com

          This is in direct contradiction to any Intel guideline I’ve ever read. Since Sandy Bridge (2011), misaligned on-SSE data is handled exactly like aligned data, apart from a possible penalty crossing cache lines (https://www.agner.org/optimize/blog/read.php?i=142&v=t). I can’t imagine AMD being a lot different.

          I fully agree with your SOA comment, though.

    • Dave Gzorple

      It’s also used for APIs where you need compatibility across different compilers. For example PKCS #11 uses #pragma pack(1) to get around problems with different compiler structure padding.

  • Ajay Saxena (Microsoft employee)

    Asking for a friend: why is this approach better than not using padding and letting more of the objects fit in cache, leading to better cache utilization and faster processing?

      • smf

        I’m sure you have seen code that does that, but that doesn’t make “Anybody who writes #pragma pack(1)” true.

        I used it only for stored structures that were shared across Z80 CP/M, 80286 (DOS), 80386 (DOS/Unix/Windows), ARM (Windows CE), SH4 (Windows CE).

        Admittedly, when doing the ARM port it came back to bite me: one part of the code took a pointer to one of the members, and because it was a long that had ended up on an odd boundary, the CPU threw an exception. However, it was still easier to stop taking a pointer than it was to change from doing a #pragma pack(1) or to build a time machine.

        • Raymond Chen (Microsoft employee)

          “Anybody who writes #pragma pack(1) for reasons other than data serialization…”

    • Erik Fjeldstrom

      If your data structures don’t cross any alignment boundaries you (probably/hopefully) won’t notice, but the downside to not having to worry about alignment overly much is that the processor has to take care of misaligned data, which is not free. As well, most SSE and certain other vector instructions *do* care about alignment, so a compiler may be forced to reduce its optimizations if it encounters misaligned data.

      • Alex Martin

        > the processor has to take care of misaligned data

        As covered in the article, on most RISC architectures the processor does not do this for you. The compiler has to explicitly generate different code, which can be expensive.

  • Ivan Kljajic

    I may be mis-remembering, but didn’t VS emit warnings about structs being sensitive to alignment if padding bytes were inserted “in the middle”? I think the goal of the warning was to help efficiently rearrange a newly created struct’s members and reduce memcpy surprises.
    And did #pragma pack(1) cause that warning to go away? So maybe it was inadvertently used as a magic bullet to get the compiler to shut up by people who never heard of “Bus Error (core dumped)”, and who, if asked, would probably say “yes, I try to avoid and reduce risk whenever possible”.

  • Nikolai Vorontsov

    I’m sorry guys, but who cares these days about Intel Itanium, Alpha AXP, MIPS R4000, PowerPC 600 and SuperH-3 architectures?
    Let them rest in peace and move on.

    AMD64/ARM handles unaligned reads efficiently? OK, then 99.9% of the time, forget about alignment (and pragma pack as well :-).

    Only in rare cases, when you really need a specific alignment (memory pressure, cache utilization, serialization, cache congestion, etc.), do you need to care. And most of the issues there are not related to what Raymond Chen presented to us.

    Nikolai
