7/15/2008

Reconstruct VIM file organization on Windows

  I am not fond of Vim's default directory layout on Windows, so I decided to reorganize it into three subdirectories: Vim\Runtime, Vim\Bin, and Vim\VimFiles.

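  What needs to be done:
  1. Create the VIM root dir.
  2. Install Vim the official way: unzip vim72rt.zip, vim72w32.zip, and gvim72.zip, then run install.exe, select all tasks, and execute them.
  3. Create two subdirectories, Runtime and Bin, under the VIM root dir.
  4. Copy everything under Vim\Vim72 in the vim72rt.zip downloaded from vim.org into Vim\Runtime.
  5. Copy all files under Vim\Vim72 in the gvim72.zip and vim72w32.zip downloaded from vim.org into Vim\Bin.
  6. Add $YourDir\Vim\Bin to the system Path variable.
  7. Delete the vim72 directory under vim.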

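  At this point a basic vim and gvim can run, but Vim still needs to know where to find the user's customizations such as the startup script .vimrc ($Vim) and its own runtime ($VimRuntime):
  1. Set the environment variable $Vim=$YourDir\Vim.
  2. Set the environment variable $VimRuntime=$YourDir\Vim\Runtime. This step can be skipped, since this is the default location Vim looks in.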

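  The system is now basically ready, but the right-click context menu entry set up during the original installation, "open with vim", does not work yet. It reports an error: "Error creating process: Check if gvim is in your path!". In fact, gvim.exe is already on $Path. Judging from the dialog, the problem lies in how gvimext.dll looks for gvim.exe.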

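  Searching the vim source code for that error string quickly leads to the CShellExt::InvokeGvim() function in $src/GvimExt/gvimext.cpp, which uses getGvimName() to locate gvim.exe when launching it. getGvimName(), in the same file, first looks up the registry key "HKEY_LOCAL_MACHINE\Software\Vim\Gvim", and only falls back to searching the system $Path for gvim.exe if nothing is found there. From this we can conclude that the original run of install.exe stored a value under that key that is no longer valid. I simply deleted it and, to be safe, searched the whole registry and replaced the other vim-related entries as well.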
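
  The lookup order described above explains why a stale registry value breaks the menu even though gvim.exe is on the Path. The following is only a hypothetical model of that order, not the code in gvimext.cpp: the function name resolve_gvim and the idea of passing the registry value in as a string are assumptions made for illustration (the real code reads the registry through the Win32 API).

```cpp
#include <string>

// Hypothetical model of the lookup order described in the text.
// registry_value: whatever is stored under the Gvim registry key,
//                 or an empty string if the key/value is missing.
std::string resolve_gvim(const std::string& registry_value) {
    // A registry value wins, even if the file it names no longer exists.
    if (!registry_value.empty())
        return registry_value;
    // Otherwise only the bare name is returned, to be found through the Path.
    return "gvim.exe";
}
```

  With a leftover value such as C:\Vim\vim72\gvim.exe (a made-up path), the model returns that dead path and the Path is never consulted. Once the value is deleted, the bare name gvim.exe is used and the Path search can succeed.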

  Retry the right-click menu: everything works.

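  The result is a clean and clear Vim layout: $vim\Bin holds all the executables; $vim\runtime holds the predefined configuration, help, and other supporting files that ship with Vim; $vim\vimfiles holds the user's own customizations, such as plugins; and the user's startup script lives at $vim\_vimrc.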

  Apart from the right-click menu, nothing depends on the registry. The only manual steps are editing the path variable and setting the vim variable.

  BTW, while reading the code I noticed that the author of gvimext.dll appears to be Chinese: Tianmiao Hu.

7/08/2008

I18N in C RunTime on Windows Platform

    In a multilingual environment, representing and processing text is a challenging problem. The fundamental question is: how do we map between the characters of a language and the byte stream inside a computer?

    There are essentially three concepts/components in this mapping: Character Collection -> Character Table -> Encoding.
  • Character Collection: the set of characters of a specific language.
  • Character Table: a mapping from characters to numbers (numeric codes).
  • Encoding: the conversion of those numbers, which stand for characters, into the byte (bit) stream used internally by the computer.
    In practice, the three components combined are called a Charset, an Encoding, or a Codepage (I personally prefer 'codepage' because it is unambiguous). These terms all refer to the same thing and are often used interchangeably. There are also other popular but confusing terms around this mapping, for example UNICODE/UCS/utf-8/utf-16/UCS-2. In terms of the definitions above, UNICODE/UCS (Universal Character Set) names a character table, while utf-8/utf-16/UCS-2 specify how to convert the numbers from that character table into byte streams.
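
    To make the difference between a character table and an encoding concrete, here is a minimal sketch that takes one number from the Unicode table and turns it into bytes in two different ways. The function names encode_utf8 and encode_utf16le are made up for this illustration and do not come from the CRT or any library; surrogate values are not rejected, to keep the sketch short.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Encode one code point (a number from the character table) as UTF-8 bytes.
std::string encode_utf8(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);  // largest number in the Unicode table
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Encode the same code point as UTF-16, little-endian byte order.
std::string encode_utf16le(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);
    std::string out;
    auto put16 = [&out](std::uint32_t unit) {
        out += static_cast<char>(unit & 0xFF);         // low byte first
        out += static_cast<char>((unit >> 8) & 0xFF);  // then high byte
    };
    if (cp < 0x10000) {
        put16(cp);                         // one 16-bit unit
    } else {
        cp -= 0x10000;                     // split into a surrogate pair
        put16(0xD800 | (cp >> 10));
        put16(0xDC00 | (cp & 0x3FF));
    }
    return out;
}
```

    For the character '世', the table entry is the number U+4E16. encode_utf8(0x4E16) gives the three bytes E4 B8 96, while encode_utf16le(0x4E16) gives the two bytes 16 4E: same character, same number, different byte streams.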

    There are many problems worth discussing in the field of I18N (internationalization), but in this article I focus on just one: "How does the CRT from Microsoft deal with chars beyond ASCII?"

Let's look at the code below:
 1 #include <windows.h>
 2 #include <stdio.h>
 3 #include <iostream>
 4 #include <locale.h>
 5 #include <conio.h>
 6
 7 int __cdecl main(int argc, const char* argv[])
 8 {
 9     // Use default CRT locale
10     const char* lt1 = setlocale(LC_CTYPE, NULL);
11     printf("current locale before change:%s\n", lt1);
12
13     const char* lpsza = "Hello,世界!";
14     const wchar_t* lpszw = L"Hello,世界!";
15     printf("NO1 - %s\n", lpsza);
16     printf("NO2 - %S\n", lpszw);        // %S in printf takes a wide char string
17     wprintf(L"NO3 - %s\n", lpszw);      // %s in wprintf takes a wide char string
18     _cwprintf(L"NO4 - %s\n", lpszw);    // writes to the console directly
19
20     // Use 936 CodePage in CRT
21     const char* lpszLC = setlocale(LC_CTYPE, "chinese_China");
22     const char* lt2 = setlocale(LC_CTYPE, NULL);
23     printf("current locale after change:%s\n", lt2);
24
25     printf("NO1 - %s\n", lpsza);
26     printf("NO2 - %S\n", lpszw);
27     wprintf(L"NO3 - %s\n", lpszw);
28     _cwprintf(L"NO4 - %s\n", lpszw);
29
30     return 0;
31 }


The Console Output of this small program is:
current locale before change:C
NO1 - Hello,世界!
NO2 - Hello,NO3 - Hello,??!
NO4 - Hello,世界!
current locale after change:Chinese_People's Republic of China.936
NO1 - Hello,世界!
NO2 - Hello,世界!
NO3 - Hello,世界!
NO4 - Hello,世界!


    Let's see what happened behind the strange output.


     From the console output, we can see that the results of lines 16 and 17 are not what we expected. In line 16 we told printf that the string argument is a wide char string, and in line 17 both the function and its argument are wide char versions. So what went wrong with lines 16 and 17?

     Many posts on the Internet offer explanations for similar problems, but none of them satisfied me, and MSDN has no information about CRT implementation internals. Since MS Visual Studio comes with the CRT source code, I turned to debugging into the CRT source[1] to see what happens in detail.


Debug 'printf' in CRT using VS2008

    We can see that in the case of 'wprintf', the call flow is: wprintf -> _output_l -> _wctomb_s_l. At this point the wide char string parameter is converted into a multibyte string using the current CRT locale. To output this mb string, the CRT calls write_char -> _fputc_ -> _write. The _write function, located in write.c, checks the destination file handle: if it is a console, _write converts the input buffer into a wide char string using the CRT locale, then converts that wide char string into a multibyte string using the Windows Console's CodePage, and finally calls the Win32 API WriteFile() to do the real work.
    From the first line of the console output of this small program, we can see that the CRT default locale is 'C'. From line 80 of wctomb.c we can see that the 'C' locale treats any wchar_t value greater than 0xff as an illegal char. The caller of this function in the wprintf context uses '?' to replace the original wchar_t value if an error occurs during the conversion. This is the reason behind the output "NO3 - Hello,??!".

    The case of 'printf("%S")' is slightly different. When _WCTOMB_S() (line 2235 of output.c, which eventually calls _wctomb_s_l()) fails to convert a wide char into a multibyte string, the error is returned to its caller, _output_l(), and the whole output process is terminated. This is the reason behind the output "NO2 - Hello,", without lf/cr at the end.
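
    The two failure strategies can be reproduced outside the CRT with the standard function std::wcrtomb. This is only a sketch of the behaviour described above, not the CRT's code: the helper names to_multibyte, convert_replacing and convert_aborting are invented here, and the sketch assumes the program is still in the default 'C' locale.

```cpp
#include <climits>
#include <cstddef>
#include <cwchar>
#include <string>

// Convert one wide char using the current locale and append it to 'out'.
// Returns false if the char has no multibyte form in that locale.
bool to_multibyte(wchar_t wc, std::string& out) {
    char buf[MB_LEN_MAX];
    std::mbstate_t state = std::mbstate_t();   // fresh conversion state
    std::size_t n = std::wcrtomb(buf, wc, &state);
    if (n == static_cast<std::size_t>(-1))     // conversion error (EILSEQ)
        return false;
    out.append(buf, n);
    return true;
}

// wprintf-like strategy: replace the bad char with '?' and keep going.
std::string convert_replacing(const std::wstring& text) {
    std::string out;
    for (wchar_t wc : text) {
        if (!to_multibyte(wc, out))
            out += '?';
    }
    return out;
}

// printf("%S")-like strategy: stop at the first bad char, drop the rest.
std::string convert_aborting(const std::wstring& text) {
    std::string out;
    for (wchar_t wc : text) {
        if (!to_multibyte(wc, out))
            break;
    }
    return out;
}
```

    Under the 'C' locale, passing L"Hello,\u4E16\u754C!" (the same "Hello,世界!" string) to convert_replacing should give "Hello,??!", and passing it to convert_aborting should give "Hello,", which matches the NO3 and NO2 lines of the console output.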

    Then, how about _cwprintf()? Debugging through the source shows that for this function the macro CPRFLAG is defined in the CRT source code. The call flow is: _cwprintf -> _vcwprintf_l -> write_char -> _putwch_nolock, and _putwch_nolock calls the Win32 API WriteConsoleW directly, with no wide-to-multibyte or multibyte-to-wide conversion at all. So all _cwprintf() calls work correctly in the example code.


From the debugging and analysis above, we found that:
1. All wide-to-multibyte and multibyte-to-wide conversions inside the CRT are implemented with the Win32 APIs WideCharToMultiByte/MultiByteToWideChar.

2. Output to the console is implemented with the Win32 APIs WriteConsoleW or WriteFile.

3. If you are writing to a file (wprintf/printf actually write to a file in this context) rather than to the console and the input string parameter is a wide char string, the CRT first tries to convert it into a multibyte string using the CRT locale. If the conversion fails, the w* functions (wprintf) replace the original char with '?' and continue the output process, while the non-w functions (printf) terminate the output process and return an error value.

4. wprintf/printf are treated as file output rather than console output because STDIO may be redirected to real files. STDIO is not guaranteed to be the Windows Console and keyboard.

5. If you use wprintf to output a wide char string to the Windows Console, there are three wide char/multibyte char conversions: w->m in output_l()@output.c, then m->w and w->m in _write()@write.c. Even if you output a multibyte string (most likely encoded in utf-8) to the console using printf(), 2 conversions may be needed in _write(). So in this situation you had better use _cwprintf()/_cprintf: there is no w/m conversion at all, and the CRT calls WriteConsoleW() directly.

6. All the coding, debugging and source code investigation was done on the Windows platform using Visual Studio 2008. The conclusions may not hold for other CRT implementations on Windows or for other operating systems.

NOTE [1]:
    In order to debug into the CRT source code, you need to download the pdb files from the Microsoft symbol servers, even though the CRT source code is already part of your local Visual Studio installation (by default: $MSVS\VC\CRT\SRC).

    To do this, just set a global system environment variable:
_NT_SYMBOL_PATH = srv*c:\symbols*http://msdl.microsoft.com/download/symbols;symsrv*symsrv.dll*c:\symbols*http://msdl.microsoft.com/download/symbols

    This applies to both the VS debugger and the WinDbg debugger. More information can be found on the Microsoft website.

NOTE [2]:
    If no CRT locale is set when the program starts, the 'C' locale is used by default. When converting wide chars to multibyte chars, the 'C' locale treats chars greater than 0x00ff as illegal input and returns EILSEQ to the caller.

NOTE [3]:
    Since wprintf/printf output to the console MOST of the time, I think the CRT should optimize this path. It should check the destination file handle early: if it is a console, just convert the parameter string into a wide char string (if needed) and then call WriteConsoleW(). This would avoid many unnecessary encoding conversions.
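
    The same idea can be tried at application level. The sketch below is an assumption-laden illustration, not how the CRT works: the function name write_wide is made up, it uses the common trick of treating a handle as a console when GetConsoleMode() succeeds on it, and on anything that is not a Windows console it simply falls back to fputws().

```cpp
#ifdef _WIN32
#include <windows.h>
#include <io.h>
#endif
#include <cstdio>
#include <cwchar>

// Write a wide char string to 'stream'. On Windows, if the stream is
// attached to a console, bypass the CRT conversions and call
// WriteConsoleW() directly; otherwise let the CRT handle it.
bool write_wide(std::FILE* stream, const wchar_t* text) {
#ifdef _WIN32
    HANDLE h = reinterpret_cast<HANDLE>(_get_osfhandle(_fileno(stream)));
    DWORD mode = 0;
    // GetConsoleMode() only succeeds for console handles, so it serves
    // as the "is this a console?" check here.
    if (h != INVALID_HANDLE_VALUE && GetConsoleMode(h, &mode)) {
        std::fflush(stream);  // keep ordering with earlier buffered output
        DWORD written = 0;
        return WriteConsoleW(h, text,
                             static_cast<DWORD>(std::wcslen(text)),
                             &written, nullptr) != 0;
    }
#endif
    // Redirected to a file or pipe (or not on Windows): normal CRT path.
    return std::fputws(text, stream) >= 0;
}
```

    A call such as write_wide(stdout, L"Hello,世界!\n") would take the WriteConsoleW() path in a console window and the fputws() path when the output is redirected to a file, where the locale-dependent conversion discussed above still applies.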

NOTE [4]:
    To test whether you fully understand the internals, change the locale in line 21 from "chinese_China" to "chinese_Taiwan" and run the program again. See the strange output? Only the output from line 25 changes! Try to explain the reason behind it.