Interpreting bits in union fields as different datatypes in C/C++

I am trying to access the bits of a union through members of different data types. For example:
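#include <inttypes.h>
#include <stdio.h>

typedef union {
    uint64_t x;
    uint32_t y[2];
} test;

int main(void) {
    test testdata;
    testdata.x = 0xa;
    /* PRIx64/PRIx32 are the portable format macros for fixed-width types */
    printf("uint64_t: %016" PRIx64 "\nuint32_t: %08" PRIx32 " %08" PRIx32 "\n", testdata.x, testdata.y[0], testdata.y[1]);
    printf("Addresses:\nuint64_t: %016" PRIxPTR "\nuint32_t: %p %p\n", (uintptr_t)&testdata.x, (void *)&testdata.y[0], (void *)&testdata.y[1]);
    return 0;
}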
and the output is
uint64_t: 000000000000000a
uint32_t: 0000000a 00000000
Addresses:
uint64_t: 00007ffe09d594e0
uint32_t: 0x7ffe09d594e0 0x7ffe09d594e4
The starting address of 'y' is the same as the starting address of 'x'. Since both fields use the same location, shouldn't the values of 'y' be 00000000 0000000a?
Why is this not happening? How does the internal conversion happen in a union whose fields have different data types?
If we want to retrieve the exact raw bits as uint32_t in the same order as in the uint64_t using a union, what needs to be done?
Thank you in advance.
Edit:
As mentioned in the comments, this is undefined behaviour in C++.
How does it work in C? Can we actually do it?
1 Answer
I will first explain what happens in your implementation.
You are doing type punning between an uint64_t value and an array of 2 uint32_t values. According to the result, your system is little endian and gladly accepts that type punning by simply re-interpreting the byte representations. And the byte representation of 0x0a as a little endian uint64_t is:
Byte number  0    1    2    3    4    5    6    7
Value        0x0a 0x00 0x00 0x00 0x00 0x00 0x00 0x00
The least significant byte in little endian has the lowest address. It is now evident why the uint32_t[2] representation is { 0x0a, 0x00 }.
But what you are doing is only legal in C language.
C11 says in 6.5.2.3 Structure and union members:
3 A postfix expression followed by the . operator and an identifier designates a member of
a structure or union object. The value is that of the named member,95) and is an lvalue if
the first expression is an lvalue.
The 95) note says explicitly:
If the member used to read the contents of a union object is not the same as the member last used to
store a value in the object, the appropriate part of the object representation of the value is reinterpreted
as an object representation in the new type as described in 6.2.6 (a process sometimes called "type
punning"). This might be a trap representation.
So even if notes are not normative, their intent is to make clear how the standard should be interpreted. Your code is therefore valid and has defined behaviour on a little endian system that defines the uint64_t and uint32_t types.
C++ is more strict in that part. Draft n4659 for C++17 says in [basic.lval]:
8 If a program attempts to access the stored value of an object through a glvalue of other than one of the
following types the behavior is undefined:56
(8.1) — the dynamic type of the object,
(8.2) — a cv-qualified version of the dynamic type of the object,
(8.3) — a type similar (as defined in 7.5) to the dynamic type of the object,
(8.4) — a type that is the signed or unsigned type corresponding to the dynamic type of the object,
(8.5) — a type that is the signed or unsigned type corresponding to a cv-qualified version of the dynamic type
of the object,
(8.6) — an aggregate or union type that includes one of the aforementioned types among its elements or nonstatic
data members (including, recursively, an element or non-static data member of a subaggregate or
contained union),
(8.7) — a type that is a (possibly cv-qualified) base class type of the dynamic type of the object,
(8.8) — a char, unsigned char, or std::byte type.
And note 56 says explicitly:
The intent of this list is to specify those circumstances in which an object may or may not be aliased.
As punning is never mentioned in the C++ standard, and as the struct/union part does not contain an equivalent of C's re-interpretation rule, reading in C++ the value of a member that is not the one last written invokes undefined behaviour.
Of course, common compiler implementations compile both C and C++, and most of them accept the C idiom even in C++ source, for the very same reason that the gcc C++ compiler gladly accepts VLAs in C++ source files. After all, undefined behaviour includes expected results... But you should not rely on that for portable code.
Notably, this is one reason why C++ is unsuitable for hardware-related programming, where you often have to type pun through union either between different integer types (example: 32 bit register but CPU is 16 bit) or between some type and the byte type uint8_t (when doing any form of serialization/de-serialization). Not only does this make C++ incredibly cumbersome for hardware-related programming; this also causes C++ to invoke nasty UB bugs when you grab a hardware register map written from C.
– Lundin
Jul 2 at 8:13
I strongly disagree. In practice, this works on all sane compilers. C++ is the only language except C that is suitable for kernel work and hard real-time systems. And you get better, 0-cost abstractions and RAII built into the language.
– Erik Alapää
Jul 2 at 8:25
Since we are reading it as uint32_t, according to the endianness it will read 4 bytes not 2 bytes per read from LSB to MSB from 'uint64_t' right?
– Rakesh
Jul 2 at 13:36
Comments are not for extended discussion; this conversation has been moved to chat.
– Samuel Liew♦
Jul 3 at 1:45