Interpreting bits in union fields as different datatypes in C/C++

I am trying to access the bits of a union through members of different datatypes, for example:

    typedef union {
        uint64_t x;
        uint32_t y[2];
    } test;

    test testdata;
    testdata.x = 0xa;
    printf("uint64_t: %016lx\nuint32_t: %08x %08x\n", testdata.x, testdata.y[0], testdata.y[1]);
    printf("Addresses:\nuint64_t: %016lx\nuint32_t: %p %p\n", &testdata.x, &testdata.y[0], &testdata.y[1]);

and the output is

    uint64_t: 000000000000000a
    uint32_t: 0000000a 00000000
    Addresses:
    uint64_t: 00007ffe09d594e0
    uint32_t: 0x7ffe09d594e0 0x7ffe09d594e4

The starting address of 'y' is the same as the starting address of 'x'. Since both fields use the same location, shouldn't the values read through 'y' be 00000000 0000000a?

Why is this not happening? How does the internal conversion work in a union whose fields have different datatypes?

If we want to retrieve the exact raw bits as uint32_t values, in the same order as in the uint64_t, using a union, what needs to be done?

Thank you in advance.

Edit:
As mentioned in the comments, this is undefined behaviour in C++.
How does it work in C? Can we actually do it?

1 Answer
I will first explain what happens in your implementation.

You are doing type punning between a uint64_t value and an array of 2 uint32_t values. According to the result, your system is little endian and gladly accepts that type punning by simply re-interpreting the byte representations. The byte representation of 0x0a as a little endian uint64_t is:

    Byte number   0     1     2     3     4     5     6     7
    Value         0x0a  0x00  0x00  0x00  0x00  0x00  0x00  0x00

The least significant byte in little endian has the lowest address. It is now evident why the uint32_t[2] representation is { 0x0a, 0x00 }.
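The byte layout described above can be checked directly. The following is a minimal sketch, not part of the original answer; the function names bytes_of_u64, is_little_endian and dump_u64 are hypothetical helpers chosen for illustration. It copies the object representation of a uint64_t into an unsigned char array, which is permitted in both C and C++, and lists the bytes in increasing address order.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Copy the object representation of v into out[0..7], lowest address first. */
void bytes_of_u64(uint64_t v, unsigned char out[8])
{
    memcpy(out, &v, sizeof v);
}

/* Returns 1 if the least significant byte is stored at the lowest address
 * (little endian), 0 otherwise. */
int is_little_endian(void)
{
    unsigned char b[8];
    bytes_of_u64(0x0a, b);
    return b[0] == 0x0a;
}

/* Print every byte of v together with its offset from the start of the object. */
void dump_u64(uint64_t v)
{
    unsigned char b[8];
    bytes_of_u64(v, b);
    for (size_t i = 0; i < sizeof b; i++)
        printf("byte %zu: 0x%02x\n", i, (unsigned)b[i]);
}
```

Calling dump_u64(0xa) on a little endian machine should list 0x0a at offset 0 followed by seven zero bytes, which is why y[0] reads as 0x0000000a and y[1] as 0.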

But what you are doing is only legal in the C language.

C11 says in 6.5.2.3 Structure and union members:

3 A postfix expression followed by the . operator and an identifier designates a member of
a structure or union object. The value is that of the named member,⁹⁵⁾ and is an lvalue if
the first expression is an lvalue.

The ⁹⁵⁾ note says explicitly:

If the member used to read the contents of a union object is not the same as the member last used to
store a value in the object, the appropriate part of the object representation of the value is reinterpreted
as an object representation in the new type as described in 6.2.6 (a process sometimes called ‘‘type
punning’’). This might be a trap representation.

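So even though notes are not normative, their intent is to make clear how the standard should be interpreted: your code is valid C and has defined behaviour on any implementation that provides the uint64_t and uint32_t types. Which half ends up in y[0] depends on the byte order, and on a little endian system you get the result you observed.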
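To retrieve the two halves in a fixed order regardless of byte order, one option is to avoid the union for that purpose and use shifts instead. The sketch below is an illustration under that assumption; the names pun64, pun_word, high_word and low_word are hypothetical (pun64 mirrors the union from the question). It shows the union read, which is valid in C but byte-order dependent, next to the shift-based split, which is not.

```c
#include <stdint.h>

/* Same layout as the union in the question. */
typedef union {
    uint64_t x;
    uint32_t y[2];
} pun64;

/* Valid in C: store through x, then read y[index] (index 0 or 1).
 * Which half of v is returned depends on the byte order of the machine. */
uint32_t pun_word(uint64_t v, int index)
{
    pun64 t;
    t.x = v;
    return t.y[index];
}

/* Most significant 32 bits of v, independent of byte order. */
uint32_t high_word(uint64_t v)
{
    return (uint32_t)(v >> 32);
}

/* Least significant 32 bits of v, independent of byte order. */
uint32_t low_word(uint64_t v)
{
    return (uint32_t)(v & 0xffffffffu);
}
```

With v = 0xa, high_word(v) is 0 and low_word(v) is 0xa on every platform, so printing the high word first gives 00000000 0000000a, the order the question expected.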

C++ is more strict in that part. Draft n4659 for C++17 says in [basic.lval]:

8 If a program attempts to access the stored value of an object through a glvalue of other than one of the
following types the behavior is undefined:⁵⁶
(8.1) — the dynamic type of the object,
(8.2) — a cv-qualified version of the dynamic type of the object,
(8.3) — a type similar (as defined in 7.5) to the dynamic type of the object,
(8.4) — a type that is the signed or unsigned type corresponding to the dynamic type of the object,
(8.5) — a type that is the signed or unsigned type corresponding to a cv-qualified version of the dynamic type
of the object,
(8.6) — an aggregate or union type that includes one of the aforementioned types among its elements or nonstatic
data members (including, recursively, an element or non-static data member of a subaggregate or
contained union),
(8.7) — a type that is a (possibly cv-qualified) base class type of the dynamic type of the object,
(8.8) — a char, unsigned char, or std::byte type.

And the note ⁵⁶ says explicitly:

The intent of this list is to specify those circumstances in which an object may or may not be aliased.

As type punning is never mentioned in the C++ standard, and as the struct/union part contains no equivalent of C's re-interpretation rule, reading in C++ the value of a member that is not the one last written invokes undefined behaviour.

Of course, common compiler implementations compile both C and C++, and most of them accept the C idiom even in C++ source, for the very same reason that the gcc C++ compiler gladly accepts VLAs in C++ source files. After all, undefined behaviour includes expected results... But you should not rely on that for portable code.
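For code that has to compile as both C and C++, a commonly used alternative to union punning is to copy the bytes with memcpy, which has defined behaviour in both languages. The following is a sketch rather than a definitive implementation; split_u64 and join_u64 are hypothetical names. The order of the two words in the array still follows the byte order of the machine, exactly as with the union.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret the 8 bytes of v as two uint32_t values without a union.
 * out[0] receives the bytes at the lower address. */
void split_u64(uint64_t v, uint32_t out[2])
{
    memcpy(out, &v, sizeof v);
}

/* Inverse operation: rebuild the uint64_t from the two words. */
uint64_t join_u64(const uint32_t in[2])
{
    uint64_t v;
    memcpy(&v, in, sizeof v);
    return v;
}
```

Optimizing compilers typically turn such a fixed-size memcpy into plain register moves, so this usually costs nothing compared with the union, although that is a property of the compiler and not something either standard promises.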

Notably, this is one reason why C++ is unsuitable for hardware-related programming, where you often have to type pun through union either between different integer types (example: 32 bit register but CPU is 16 bit) or between some type and the byte type uint8_t (when doing any form of serialization/de-serialization). Not only does this make C++ incredibly cumbersome for hardware-related programming; this also causes C++ to invoke nasty UB bugs when you grab a hardware register map written from C.
– Lundin
Jul 2 at 8:13

I strongly disagree. In practice, this works on all sane compilers. C++ is the only language except C that is suitable for kernel work and hard real-time systems. And you get better, 0-cost abstractions and RAII built into the language.
– Erik Alapää
Jul 2 at 8:25

Since we are reading it as uint32_t, each read takes 4 bytes, not 2, from the 'uint64_t', going from LSB to MSB according to the endianness, right?
– Rakesh
Jul 2 at 13:36
