C++ std vs Windows “MultiByte”

[update]
The key question here is one of context, not of "how" and not of "you have this API".
Thus: why would anybody use Windows multi-byte strings in 2018, on Win10 RS4, and specifically in a UWP application? cppWINRT does it, for example. What is the use case where, inside a UWP app, one would transform a wide string into a narrow multi-byte string using WideCharToMultiByte? Where is the result used, and how?
[original question]
Results of my research & development
"10 Years After" we have C++17, we have <codecvt> deprecated, and we also have the std:: way of converting between the four (4) standard C++ string types:
    // Type                      Definition
    std::string                  std::basic_string<char>
    std::wstring                 std::basic_string<wchar_t>
    std::u16string (C++11)       std::basic_string<char16_t>
    std::u32string (C++11)       std::basic_string<char32_t>
This excludes the additional four (4) 'namespace pmr' string types, which we can "abstract away" in this context.
Three point recap of the situation: C++ std vs Windows "MultiByte".
std::u16string
std::u32string
Considering the above and knowing the standard C++ std::, one can code very functional and ridiculously simple but comprehensive conversions. Here is the one converting from a wide to a narrow string:
    // the "standard" version
    // dbj.org created 2018-07-01
    //
    inline std::string to_string(std::wstring_view value)
    {
        if (value.empty()) return {};
        return { value.begin(), value.end() };
    }
I am sure the honorable audience is more than capable of implementing the rest necessary to convert all the standard types to std::string:
    inline std::string to_string(std::u16string_view value);
    inline std::string to_string(std::u32string_view value);
And also the rest necessary to cover all the other standard conversions.
I repeat: this is using MSVC STD namespace, shipped with CL version 19.14.26431.0 (as of today 2018-07-01).
The point is: I fail to see why I would use "Multibyte" strings on Windows in my C++ code, given that std:: does not use them.
Please help me understand where and when one might need WideCharToMultiByte(), and its twin counterpart, today?
For the sake of completeness, here is one official (cppWINRT) version that does not rely on the MSVC std:: lib for the conversion itself.
    namespace winrt {
        // the cppWINRT base.h
        inline std::string to_string(std::wstring_view value)
        {
            int const size =
                WideCharToMultiByte(CP_UTF8, 0, value.data(),
                    static_cast<int32_t>(value.size()), nullptr, 0,
                    nullptr, nullptr);
            if (size == 0) { return {}; }
            std::string result(size, '?');
            WINRT_VERIFY_(size,
                WideCharToMultiByte(CP_UTF8, 0, value.data(),
                    static_cast<int32_t>(value.size()), result.data(), size,
                    nullptr, nullptr)
            );
            return result;
        }
    }
The above (of course) produces a different std::string than the standard version if the Unicode input contains chars from the extended char set.
    // кошка 日本国
    constexpr wchar_t wide_specimen[] =
        { L"\x043a\x043e\x0448\x043a\x0430 \x65e5\x672c\x56fd" };
    bool test =
        winrt::to_string(wide_specimen)
        ==
        to_string(wide_specimen)
        ;
    // test is false
    test =
        winrt::to_string(L"Hello")
        ==
        to_string(L"Hello")
        ;
    // test is true
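To see why the two results differ without needing the Windows API, here is a minimal portable sketch. `narrow_copy` mirrors the element-wise copy of the "standard" version above, and `utf16_to_utf8` is a hand-written, hypothetical stand-in for what a UTF-8 conversion such as `WideCharToMultiByte(CP_UTF8, ...)` yields for well-formed input. Neither name comes from the original post, and the sketch uses `std::u16string_view` instead of `std::wstring_view` so that it behaves the same where `wchar_t` is 32 bits wide.

```cpp
#include <cassert>      // only used by the assert-based usage shown below
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>

// Element-wise copy: every 16-bit code unit is truncated to a single char.
inline std::string narrow_copy(std::u16string_view value)
{
    std::string result;
    result.reserve(value.size());
    for (char16_t unit : value)
        result.push_back(static_cast<char>(unit)); // keeps only the low 8 bits
    return result;
}

// Hypothetical helper: encodes UTF-16 input as UTF-8.
// Assumption: unpaired surrogates are replaced with U+FFFD.
inline std::string utf16_to_utf8(std::u16string_view value)
{
    std::string result;
    for (std::size_t i = 0; i < value.size(); ++i) {
        std::uint32_t cp = value[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {            // high surrogate
            if (i + 1 < value.size() &&
                value[i + 1] >= 0xDC00 && value[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (value[i + 1] - 0xDC00);
                ++i;                                   // consume the low surrogate
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {     // stray low surrogate
            cp = 0xFFFD;
        }
        if (cp < 0x80) {
            result.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            result.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            result.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            result.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            result.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            result.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            result.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            result.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            result.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            result.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return result;
}
```

For pure ASCII input both functions agree, so `assert(narrow_copy(u"Hello") == utf16_to_utf8(u"Hello"));` holds. For `u"\u043a"` (к) the element-wise copy keeps only the low byte 0x3A, while the UTF-8 encoding is the two bytes 0xD0 0xBA, which is why the first comparison above is false.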
Which way should one take? The standard way or the Windows way ...
PS: This is actually one very good text on multi-byte encoding. It was part of my research.
@Dusan - The Windows API goes back to the time before the first Unicode spec was published. Many of the odd parts come from a time when Windows 3 and Windows 95 used multibyte characters but Windows NT had started to use Unicode. There was some utility in having a common code base and being able to convert strings at runtime. If you write new programs today, this is not something to be concerned about, even if some of the APIs are still available.
– Bo Persson
Jul 1 at 21:50
Every C++ programmer eventually writes his own string class. The people that create operating systems and attend ISO meetings just did it earlier than SO users.
– Hans Passant
Jul 1 at 22:48
@BoPersson thanks for the reply. But cppWINRT, the very latest, very modern C++ lib, uses winrt::to_string exactly as the one I copy-pasted above. That is the core of the confusion. What keeps them from adopting the same philosophy as you, or as me in my to_string?
– Dusan Jovanovic
Jul 2 at 22:54
2 Answers
char on Windows is not UTF-8; it is a (single- or multi-byte) codepage-encoded string. These encodings come from DOS/16-bit Windows and were also the native encoding used on Windows 95/98/ME. Use WideCharToMultiByte(CP_ACP, ...) to create CHAR strings. wchar_t is usually UTF-16 LE on Windows and often a 32-bit type on POSIX, possibly UCS-4.
Technically, Windows uses the CHAR and WCHAR types, but the standard library and compilers give their char and wchar_t types the same meaning.
I don't know if std::string has changed in the newer versions, but this is how it used to work.
Only the char8_t, char16_t and char32_t types are required to use Unicode.
Even if everything is Unicode encoded, a simple binary compare might not match because Unicode codepoints can be stored in different forms. You need to normalize to precomposed if you want to match with Windows native strings.
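As a small illustration of that last point, the sketch below stores "é" in two canonically equivalent forms and compares them byte for byte. The function names are hypothetical and only serve the example. On Windows the `NormalizeString` API can bring both strings to the same form before comparing; it is not called here so that the sketch stays portable.

```cpp
#include <string>

// "é" as one precomposed code point: U+00E9.
inline std::u16string e_acute_precomposed() { return u"\u00E9"; }

// "é" as a base letter plus a combining acute accent: U+0065 U+0301.
inline std::u16string e_acute_decomposed() { return u"e\u0301"; }

// Plain binary comparison, with no Unicode normalization applied.
inline bool binary_equal(const std::u16string& a, const std::u16string& b)
{
    return a == b;
}
```

Both strings render as the same character, yet `binary_equal(e_acute_precomposed(), e_acute_decomposed())` is false, because one holds a single code unit and the other holds two.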
Windows has a multi-byte codepage for UTF-8 (CP_UTF8), and char is suitable for holding UTF-8 code units, so std::string can hold UTF-8 encoded strings (in fact, this is enforced in the C++11 and later standards via the u8 literal prefix, which encodes character data to UTF-8 using char elements).
– Remy Lebeau
Jul 1 at 23:27
@RemyLebeau Yes, of course a char can hold a UTF-8 code unit. That does not mean that std::string understands UTF-8 in terms of characters vs bytes, and I'm sure all bets are off when you fill the buffer with bytes from an outside source. Also, I'm sure a lot of code relies on fopen(mystr.c_str(), ...) to call CreateFileA on Windows, and only the very latest Windows 10 versions have basic support for CP_ACP == CP_UTF8.
– Anders
Jul 2 at 2:06
std::string doesn't understand UTF-8 any more than std::wstring and std::u16string understand UTF-16, or std::u32string understands UTF-32. They are just containers of elements; it is up to the app to interpret their meaning. And fopen() does call CreateFileA() on Windows (_wfopen() calls CreateFileW()).
– Remy Lebeau
Jul 2 at 4:21
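A minimal sketch of the "just a container" point: `std::string::size()` reports bytes, and counting characters is left to the application. The helper `utf8_code_points` is hypothetical (it is not a standard library function) and assumes well-formed UTF-8 input.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: counts UTF-8 code points by skipping
// continuation bytes (those of the form 10xxxxxx).
// Assumption: the input is well-formed UTF-8.
inline std::size_t utf8_code_points(std::string_view bytes)
{
    std::size_t count = 0;
    for (unsigned char byte : bytes)
        if ((byte & 0xC0) != 0x80)
            ++count;
    return count;
}

// "кошка" spelled out as UTF-8 bytes, two bytes per Cyrillic letter.
inline std::string koshka_utf8()
{
    return "\xD0\xBA\xD0\xBE\xD1\x88\xD0\xBA\xD0\xB0";
}
```

Here `koshka_utf8().size()` is 10, while `utf8_code_points(koshka_utf8())` is 5: the container only knows about its `char` elements.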
@RemyLebeau And since fopen calls CreateFileA, you can't just put a UTF-8 string inside std::string and expect it to work, since on 99.99% of systems CreateFileA expects a string encoded with an SBCS/DBCS Windows codepage, not UTF-8.
– Anders
Jul 2 at 8:38
Actually, last time I looked, fopen() calls MultiByteToWideChar() and then passes the resulting wide string to CreateFileW. So the ongoing lack of support for UTF-8 in the Windows CRT for this family of functions is just an unbelievable blind spot. MS should have provided an API to set the code page used by these functions ages ago, and not just continue to force us to use CP_ACP. Yuk.
– Paul Sanders
Jul 2 at 15:07
Assuming the applications you are developing in C++ are compiled as native Unicode applications, you would want to use the MultiByte APIs only when reading and writing files/streams where the multibyte codepage of the file has been stipulated (or assumed) in some way.
That is, not every application on Windows is written in C++, so these APIs provide an interoperation layer for applications to pass character data around correctly.
I would not expect their existence to impose a burden on a C++ application, or a suite of C++ applications, that prefers to use the std:: string abstractions.
If the standard fulfills your needs, I'd prefer it for the sake of portability.
– πάντα ῥεῖ
Jul 1 at 21:21