I18N in C RunTime on Windows Platform

    In a multilingual environment, representing and processing text is a challenging problem. The fundamental question is: how do we map between the characters of a language and the byte stream of a computer?

    There are essentially three concepts/components in this mapping: Character Collection -> Character Table -> Encoding.
  • Character Collection means a collection of characters from a specific language.
  • Character Table means a mapping from characters to numbers (numeric codes).
  • Encoding means converting the numbers that stand for characters into the byte (bit) stream used internally by the computer.
    In practice, the three components combined are called a Charset, an Encoding or a Codepage (I personally prefer 'codepage' because it is the least ambiguous). All three terms refer to the same thing and are often used interchangeably. Several other popular but confusing terms also relate to this mapping, for example Unicode/UCS/UTF-8/UTF-16/UCS-2. In terms of the definitions above, Unicode/UCS (Universal Character Set) names a character table, while UTF-8/UTF-16/UCS-2 specify how the numbers from that character table are converted into byte streams.

    Many problems in the field of I18N (internationalization) are worth discussing, but this article focuses on just one: "How does the CRT from Microsoft deal with chars beyond ASCII?"

Let's look at the code below:
 1 #include <windows.h>
 2 #include <stdio.h>
 3 #include <iostream>
 4 #include <locale.h>
 5 #include <conio.h>
 7 int __cdecl main(int argc, const char* argv[])
 8 {
 9     // Use default CRT locale
10     char * lt1 = setlocale(LC_CTYPE, NULL);
11     printf("current locale before change:%s\n", lt1);
13     const char * lpsza = "Hello,世界!";
14     const wchar_t * lpszw = L"Hello,世界!";
15     printf("NO1 - %s\n", lpsza);
16     printf("NO2 - %S\n", reinterpret_cast<const char*>(lpszw));
17     wprintf(L"NO3 - %s\n", lpszw);
18     _cwprintf(L"NO4 - %s\n", lpszw);
20     // Use 936 CodePage in CRT
21     char * lpszLC = setlocale(LC_CTYPE, "chinese_China"); 
22     char * lt2 = setlocale(LC_CTYPE, NULL);
23     printf("current locale after change:%s\n", lt2);
25     printf("NO1 - %s\n", lpsza);
26     printf("NO2 - %S\n", reinterpret_cast<const char*>(lpszw));
27     wprintf(L"NO3 - %s\n", lpszw);
28     _cwprintf(L"NO4 - %s\n", lpszw);
30     return 0;
31 }

The Console Output of this small program is:
current locale before change:C
NO1 - Hello,世界!
NO2 - Hello,NO3 - Hello,??!
NO4 - Hello,世界!
current locale after change:Chinese_People's Republic of China.936
NO1 - Hello,世界!
NO2 - Hello,世界!
NO3 - Hello,世界!
NO4 - Hello,世界!

    Let's see what happens behind this strange output.

     From the console output, we can see that the results of lines 16 and 17 are not what we expected. In line 16 we specified (with %S) that the string parameter is a wide char string, and in line 17 both the function and the parameter are the wide char versions. So what is wrong with lines 16 and 17?

     Many posts on the Internet offer explanations of similar problems, but none of them satisfied me, and MSDN has no information about the CRT implementation internals. Since MS Visual Studio ships with the CRT source code, I turned to debugging into the CRT source[1] to see what happens in detail.

Debug 'printf' in CRT using VS2008

    We can see that in the case of 'wprintf', the call flow is: wprintf -> _output_l -> _wctomb_s_l. At this point, the wide char string parameter is converted into a multibyte string using the current CRT locale. To output this multibyte string, the CRT calls write_char -> _fputc_ -> _write. The _write function, located in write.c, checks the destination file handle. If it is a console, _write converts the input buffer into a wide char string using the CRT locale, then converts that wide char string into a multibyte string using the Windows console's codepage. Finally, _write calls the Win32 API WriteFile() to do the real work.
    From the first line of the console output of this small program, we can see that the default CRT locale is 'C'. Line 80 of wctomb.c shows that the 'C' locale treats any wchar_t value greater than 0xff as an illegal char. In the wprintf context, the caller of this function replaces the original wchar_t value with '?' if an error occurs in the conversion. This is the reason behind the output "NO3 - Hello,??!".

    The case of 'printf("%S")' is slightly different. When output_l fails to convert a wide char to a multibyte string using _WCTOMB_S() (line 2235 of output.c, which eventually calls _wctomb_s_l()), it returns an error to its caller and the whole output process is terminated. This is the reason behind the output "NO2 - Hello,", without the lf/cr at the end.

    Then, how about _cwprintf()? Debugging through the source code shows that when this function is called, the macro CPRFLAG is defined in the CRT source code context. The call flow is: _cwprintf -> _vcwprintf_l -> write_char -> _putwch_nolock, and _putwch_nolock calls the Win32 API WriteConsoleW directly, with no wide-to-multibyte or multibyte-to-wide conversion at all. So all _cwprintf() calls work correctly in the example code.

From the debugging and analysis above, we have found that:
1. All wide-to-multibyte and multibyte-to-wide conversions inside the CRT are implemented using the Win32 APIs WideCharToMultiByte/MultiByteToWideChar.

2. Output to console is implemented using Win32 API - WriteConsoleW or WriteFile.

3. If you are writing to a file (wprintf/printf is actually writing to a file in this context) rather than to the console, and the input string parameter is a wide char string, the CRT first tries to convert it into a multibyte string using the CRT locale. If the conversion fails, the w* functions (wprintf) replace the offending char with '?' and continue the output process, while the non-w functions (printf) terminate the output process and return an error value.

4. wprintf/printf is treated as file output rather than console output because STDIO may be redirected to real files; STDIO is not guaranteed to be the Windows console and keyboard.

5. If you use wprintf to output a wide char string to the Windows console, there are three wide char/multibyte char conversions: w->m in output_l()@output.c, then m->w and w->m in _write()@write.c. Even if you output a multibyte string (most likely encoded in utf-8) to the console using printf(), two conversions may be needed in _write(). So in this situation you had better use _cwprintf()/_cprintf: there is no w/m conversion at all, and the CRT calls WriteConsoleW() directly.

6. All the coding, debugging and source code investigation was done on the Windows platform using Visual Studio 2008. The conclusions may not hold for other CRT implementations on Windows or for other operating systems.

NOTE [1]:
    In order to debug into the CRT source code, you need to download the pdb files from the Microsoft symbol servers, even though the CRT source code is already part of your local Visual Studio installation (by default: $MSVS\VC\CRT\SRC).

    To do so, set a system-wide environment variable:
_NT_SYMBOL_PATH = srv*c:\symbols*http://msdl.microsoft.com/download/symbols;symsrv*symsrv.dll*c:\symbols*http://msdl.microsoft.com/download/symbols

    This applies to both the VS debugger and the WinDbg debugger. More information can be found on the Microsoft website.

NOTE [2]:
    If no CRT locale is set when the program starts, the 'C' locale is used by default. When converting wide chars to multibyte chars, the 'C' locale treats chars greater than 0x00ff as illegal input and returns EILSEQ to the caller.

NOTE [3]:
    Since wprintf/printf outputs to the console MOST of the time, I think the CRT should optimize this path. It should check the destination file handle early; if it is a console, it should just convert the parameter string into a wide char string (if needed) and then call WriteConsoleW(). This would eliminate many unnecessary encoding conversions.

NOTE [4]:
    To test whether you fully understand the internals, change the locale in line 21 from "chinese_China" to "chinese_Taiwan" and run the program again. See the strange output? Only the output from line 25 changes! Try to explain the reason behind it.
