12/11/2008

Scalability in Online Games and Virtual Worlds

The latest issue of ACM Queue magazine has an article titled "Scaling in Games & Virtual Worlds", which describes Sun's work on the scalability problem in online games and virtual worlds (a.k.a. MMORPGs).

The author is a Distinguished Engineer at Sun who has worked on Darkstar, a scalable online game platform, for about two years. He makes many interesting points in the article.

I - Unique Characteristics of Online Games:
1. Being Fun is the Prime Directive
2. Should be Easy to Learn, but Hard to Master
3. Client is Fat and Powerful
4. Network Latency is a Critical Factor

II - Current State of the Art in Games & Virtual Worlds:
1. The network protocol is simple, and each message should fit in a single network packet
2. The server is architected to do small things fast while handling a large number of concurrent requests
3. Predict-and-adjust tricks are used to hide network latency
4. Servers act as the hub for player interaction and as the arbiter that prevents client cheating

III - Current Scaling Strategies:
1. Geographic Decomposition - the world is partitioned by geography. For example, all activity on one island or in one country is processed by one physical server
2. Sharding - create non-interacting copies of parts of the world; players can only interact with players in the same shard
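
A minimal sketch of the two strategies, in Python. The region table, the server names, and the hash-based shard assignment are my own assumptions for illustration; the article does not say how any particular game assigns players.

```python
import zlib

# Hypothetical region table: every region is owned by exactly one server.
REGION_TO_SERVER = {"island-a": "server-1", "island-b": "server-2"}

def server_for_region(region: str) -> str:
    """Geographic decomposition: all activity in a region goes to one server."""
    return REGION_TO_SERVER[region]

def shard_for(player_id: str, shard_count: int) -> int:
    """Sharding: place a player in one of several copies of the world.

    A stable CRC32 hash is used here only as an example policy.
    """
    return zlib.crc32(player_id.encode("utf-8")) % shard_count

def can_interact(player_a: str, player_b: str, shard_count: int) -> bool:
    """Players can interact only when they live in the same shard."""
    return shard_for(player_a, shard_count) == shard_for(player_b, shard_count)
```

Either way the partitioning is static, which is why a crowded island or shard cannot borrow capacity from an idle one.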

The problems that Sun found are:
1. Chip architecture is moving to multi-core designs, which suit parallel tasks
2. Server-side game tasks are inherently parallel: one thread per player, and interactions among players are a relatively small part of all activity
3. Most current game implementations can't exploit the parallelism of multi-core CPUs

Sun's proposal for next-generation game development:
1. Hide concurrency and distribution from the game logic developer
2. The server is architected as event-based (maybe SEDA-like, I guess), with each client request served by one server task
3. Communication is abstracted as a Channel; the physical address is hidden from the developer
4. Data/state is moved from server memory to a global data service, which is based on simplified DBMS technologies
5. Since little state lives in task logic or server memory, task migration is possible, which enables hot load balancing

The last valuable point in this article: Darkstar is open source.

I think the observations and the description of current techniques are the most interesting parts. The solution part is less exciting, since the idea is not especially innovative.

One personal observation is that a general platform improves the productivity of application developers, but at the price of performance. So what is the future of the Darkstar project? It may be useful for small-scale games, but when your player base grows to the 1M or 10M level, productivity is no longer the primary problem for the dev team; performance and stability are. That is why large-scale applications always use ad-hoc solutions. But if Darkstar is never used in a REALLY LARGE scale environment, how can we evaluate the value of this system, and how can we say whether the project succeeded or failed?

12/04/2008

Using Regular Expressions in Vim

    When I write blog posts, I very often use photos from Flickr. To post photos to Blogspot, I usually upload them to Flickr first, then use the "add image" button in the Blogger edit box. At the same time, I usually change the image's target link to the URL of the original-size photo. When there are many photos, say more than 10, doing all this manually is annoying.

     So I decided to do it in two steps using Vim's substitute command (suppose we now have all the image URLs in Vim, one per line):

1. Compose the Blogspot-style image anchor tags. Here str1, str2 and str3 stand for the strings listed below; a literal / inside them must be written as \/ (or use a different delimiter).
Vim command: :%s/^\(\S\+\)$/str1\1str2\1str3/

str1 -- <a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="
str2 -- "><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 240px; height: 135px;" src="
str3 -- " border="0" alt="" /></a>


2. Modify the image anchor target to be the original-size photo:
The medium-size photo's URL looks like: http://farm3.static.flickr.com/2357/2096529455_4bc506c8ff_m.jpg
The original-size photo's URL looks like: http://farm3.static.flickr.com/2357/2096529455_4bc506c8ff_b.jpg

Only the href should change, so the pattern is anchored on href=" and swaps the _m suffix for _b:
Vim command: :%s/\(href="\S\+\)_m\(\.\w\+"\)/\1_b\2/
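
For checking the two substitutions outside of Vim, here is a rough Python equivalent. It is only a sketch: STR1, STR2 and STR3 are shortened stand-ins for the real Blogger strings, and the function names are made up for this example.

```python
import re

# Shortened, hypothetical stand-ins for str1/str2/str3 from the post.
STR1 = '<a href="'
STR2 = '"><img src="'
STR3 = '" border="0" alt="" /></a>'

def make_anchor(url: str) -> str:
    """Step 1: wrap a bare image URL in an anchor + img tag."""
    return re.sub(
        r"^(\S+)$",
        lambda m: STR1 + m.group(1) + STR2 + m.group(1) + STR3,
        url,
    )

def link_to_large(tag: str) -> str:
    """Step 2: change only the href target from the _m file to the _b file."""
    return re.sub(r'(href="\S+)_m(\.\w+")', r"\1_b\2", tag, count=1)

medium = "http://farm3.static.flickr.com/2357/2096529455_4bc506c8ff_m.jpg"
tag = link_to_large(make_anchor(medium))
```

After both steps the href points at the _b file while the img src still shows the _m thumbnail.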


This makes blogging on Blogspot with a lot of image URLs much easier.

For Vim regular expression and substitute syntax, see the reference documents (sorry, all in Chinese).

Reference(In Chinese):
[1] http://ubuntu-fans.blogspot.com/2007/02/vi.html
[2] http://bbs.et8.net/bbs/showthread.php?t=746650
[3] http://vimcdoc.sourceforge.net/doc/pattern.html#pattern.txt

12/03/2008

Visual Studio Product Comparison

  When the first version of VSTS (VSTS 2005) came out, I was very confused about the various names, the products' functionality and the relationships among them. So I spent some time and came up with the following brief summary. I hope it helps if you run into the same problem.


Visual Studio
|
+- Express Edition (VB/VC++/VC#/Web, free)
|
+- Standard Edition (basics for individual developers)
|
+- Professional Edition (Standard + advanced debugging/Office/mobile)
|
+- Team System
   |
   +- Team Suite
   |  +- VSDB - Database
   |  +- VSTA - Architecture
   |  +- VSTD - Development
   |  +- VSTT - Test
   |
   +- Team Foundation Server

Visual Studio Product Family


VSTS - Visual Studio Team System

TFS (Team Foundation Server) - It combines project management, work item tracking, version control, reporting & business intelligence, build management, and process guidance into a unified team server.

VSTA/VSTD/VSTT/VSDB
- These editions target architects, developers, testers, and database professionals, respectively.

The Official Visual Studio Product Comparison Web version and PDF version

11/28/2008

Argument Parsing for Console Applications

  Parsing console application arguments is a basic task for a systems programmer.

  First, let's clarify some confusing terms:

  Suppose the syntax of a tool is: "tool_name [-a] [-b] [-c option_parameter] [-d|-e] [-f[option_parameter]] [operand...]"

  The tool in the example is named "tool_name". It is followed by options, option-parameters, and operands. The arguments that consist of a hyphen and a single letter or digit, such as '-a', are known as "options" (or, historically, "flags" or "switches"). Certain options are followed by an "option-parameter", as shown with [-c option_parameter]. The arguments following the last options and option-parameters are named "operands".

  In general:
  • An argument is a parameter that gives the application specific information it needs to accomplish a specific task. Arguments fall into 3 categories: Option, Option-Parameter and Operand.
  • An Option (also called a flag or switch) customizes some default behaviour of the program.
  • An Option-Parameter provides extra information for an option.
  • An Operand is usually a source or destination file name or directory name. It does not change the program's behaviour; it only identifies the data to be processed.

  Next, let's look at the command argument syntax and conventions in the Windows and *nix worlds.

POSIX Command Argument Syntax/Convention
  • Arguments are options if they begin with a hyphen delimiter (`-').
  • Multiple options may follow a hyphen delimiter in a single token if the options do not take arguments. Thus, `-abc' is equivalent to `-a -b -c'.
  • Option names are single alphanumeric characters.
  • Certain options require a parameter.
  • An option and its parameter may or may not appear as separate tokens. (In other words, the whitespace separating them is optional.) Thus, `-o foo' and `-ofoo' are equivalent.
  • Options typically precede other non-option arguments.
  • The argument `--' terminates all options; any following arguments are treated as non-option arguments, even if they begin with a hyphen.
  • A token consisting of a single hyphen character is interpreted as an ordinary non-option argument. By convention, it is used to specify input from or output to the standard input and output streams.
  • GNU introduced the so-called long option style. A long option starts with '--', and its name consists of letters and '-'. If it takes a parameter, the parameter is separated from the name by whitespace or '='.
Reference:
POSIX Option
GNU Long Option
Argument Parsing in TAOUP
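
To see these rules in action, here is a small sketch using Python's standard getopt module, which follows the POSIX conventions listed above. The option letters (-a, -b, -c, -o) and the long option --output are hypothetical, chosen only for the demonstration.

```python
import getopt

def parse_short(argv):
    # "abco:" declares the flags -a, -b, -c and the option -o,
    # where the trailing colon means -o takes a parameter.
    return getopt.getopt(argv, "abco:")

# Grouped flags, an attached parameter, and "--" ending the options.
grouped_opts, grouped_operands = parse_short(["-abc", "-ofoo", "--", "-x", "file.txt"])

# The same options written as separate tokens.
separate_opts, _ = parse_short(["-a", "-b", "-c", "-o", "foo"])

# A lone "-" is an ordinary operand, not an option.
_, dash_operands = parse_short(["-a", "-", "file.txt"])

# GNU long option with "=": declared as "output=" in the long-option list.
long_opts, _ = getopt.getopt(["--output=foo"], "", ["output="])
```

Both spellings produce the same option list, and everything after `--` comes back as an operand even though `-x` starts with a hyphen.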

Windows Command Argument Syntax/Convention
  • An option consists of an option specifier, either a dash (-) or a forward slash (/), followed by the name of the option.
  • Option names cannot be abbreviated.
  • Some options take an argument, specified after a colon (:) or whitespace.
  • No spaces or tabs are allowed within an option specification, except within a quoted string.
  • Option names and their keyword or filename arguments are not case sensitive, but identifiers used as arguments are case sensitive.
  • A "+" or "-" suffix/prefix represents an ordering attribute, or the addition/removal of an option.
  • An @fileName.ext argument means that the file name itself is not the argument; instead, the content of the file is read as command arguments, in addition to the other normal arguments.
  • To see more concrete examples, run "dir /?", "robocopy /?", "runas /?" and read the screen output.
Reference:
MSVS Link.exe Option Syntax
Sqlcmd.exe Option Syntax
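
As a rough illustration of the Windows style, the sketch below parses only the "/name:value" and "-name" forms and lower-cases option names to make them case-insensitive. It is a simplified example of my own, not how link.exe or sqlcmd.exe actually parse their command lines, and it ignores whitespace-separated parameters and @file arguments.

```python
def parse_windows_args(argv):
    """Split tokens into {option: value} and a list of operands."""
    options, operands = {}, []
    for token in argv:
        if len(token) > 1 and token[0] in "/-":
            # Split at the first colon only, so values such as C:\out stay whole.
            name, sep, value = token[1:].partition(":")
            options[name.lower()] = value if sep else True
        else:
            operands.append(token)
    return options, operands

options, operands = parse_windows_args(["/OUT:C:\\bin\\app.exe", "-debug", "main.obj"])
```

Options without a colon are recorded as True, which is enough to check whether a switch was given.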


Yet Another C++ Command Line Argument Parser - ApCxx
- It supports both the Windows convention and the POSIX/GNU convention
- It supports both ANSI and Unicode builds
- It is implemented in portable C++ code
- It can get all or specific options/operands
- It can get all or specific arguments for a given option
- It can check whether a given option was supplied
etc.

Source Code download

[Reference]

Option Parsing for C++:
CLAP
TCLAP
ArgTable
ArgParse
AnyOpt

Option Parsing for Other Languages:
C# OptParse
ASM OptParse

11/26/2008

Some Tips on Exception Handling

1. Exception vs. Error Code

- The mechanism you choose depends on how often you expect the event to occur
- Use an error code if it is a routine case
- Use an exception only if it is a truly exceptional case
- Never use exceptions for flow control

2. User-Defined Exceptions

- End user-defined exception class names with the word “Exception”
- Ensure that the metadata of user-defined exceptions is available to the code that handles them (possibly remotely)

3. Exception Safety

- A piece of code is said to be exception-safe if a run-time exception does not produce ill effects
- Exception-safe code must keep the invariants placed on the code even if an exception occurs
- For example, a vector's add() function should keep its internal state (such as the element count) valid and consistent even if an exception is thrown
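
Here is a minimal sketch of that idea in Python. CountedList and its add_all method are hypothetical names invented for this example; the point is the "stage first, commit last" pattern, where everything that can fail happens before the object is modified.

```python
class CountedList:
    """A list that tracks its own element count as an invariant."""

    def __init__(self):
        self._items = []
        self._count = 0

    @property
    def count(self):
        return self._count

    def add_all(self, values, validate):
        # Stage: validate everything first. If validate() raises,
        # neither _items nor _count has been touched yet.
        staged = [validate(v) for v in values]
        # Commit: these two steps cannot fail halfway for a plain list.
        self._items.extend(staged)
        self._count += len(staged)

numbers = CountedList()
numbers.add_all(["1", "2"], int)
try:
    numbers.add_all(["3", "oops"], int)  # int("oops") raises ValueError
except ValueError:
    pass
```

After the failed call the object still holds exactly the two elements from the first call, and the count still matches.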

4. Miscellaneous

- Always order the exceptions in catch blocks from the most specific to the least specific, so that a specific exception is handled before a general one
- The stack trace begins at the statement where the exception is thrown and ends at the catch statement that catches the exception
- Throw InvalidOperationException if a property set or method call is not appropriate given the object's state
- Throw ArgumentException if invalid parameters are passed to a function
- Do resource cleanup work in the finally block
- Log the exception information just once in your whole code stack
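
The tips above use .NET names, but the same ordering and cleanup rules apply elsewhere. Below is a small Python sketch with a hypothetical read_first_line helper: the specific FileNotFoundError is handled before its more general base class OSError, and the finally block always releases the file handle.

```python
def read_first_line(path):
    """Return the first line of a file, or None if the file does not exist."""
    handle = None
    try:
        handle = open(path, encoding="utf-8")
        return handle.readline().rstrip("\n")
    except FileNotFoundError:
        # Most specific handler first: a missing file is an expected case here.
        return None
    except OSError as exc:
        # More general handler second: wrap and re-raise, keeping the cause.
        raise RuntimeError("cannot read " + path) from exc
    finally:
        # Runs on every path out of the function, success or failure.
        if handle is not None:
            handle.close()

missing = read_first_line("/nonexistent-dir/definitely-missing.txt")
```

If the two except clauses were swapped, the OSError handler would catch the missing-file case too and the specific handler would never run.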

Reference

1. .Net Best Practices for Exception Handling
2. .Net Exception Handling Internals
3. Exception Handling Practices for .Net
4. Exception Handling Practices for Java
5. https://en.wikipedia.org/wiki/Exception_safety

11/19/2008

CAPTCHA and Modern Attacking Ways

CAPTCHA stands for "Completely Automated Public Turing test to tell Computers and Humans Apart". It is widely used by popular (high-traffic) online services, for example webmail and e-business order confirmation, to prevent automated and abusive client access.

Recently, however, there have been many reports that the CAPTCHA systems of such Internet service providers were cracked and abused by spammers to send huge amounts of junk advertising. So what is the problem with today's CAPTCHA systems?

Traditional CAPTCHA cracking uses OCR techniques to automatically recognize the distorted characters in the images. But the most successful cases use alternative approaches that are not so technology-centric. Security experts call this social engineering: people are recruited to do the things that are hard for computer algorithms and software.

Under the concept of social engineering, there are two concrete methods:
1. Leverage the efforts of unaware users of a third-party service
2. Delegate to people who know what they are doing, the "Mechanical Turk" way

The basic idea of the first method is to redirect CAPTCHA challenges to users of another web service, and use their responses to answer the original web site's CAPTCHA system. The process is described in detail in this article.

The article summarized this method as:
Although it is possible to identify the difference between a computer and a human, there may yet be a challenge in verifying that a given human response is from the intended human.



(from McAfee)

Other case studies of this kind of cracking can be found here and here.

The other social engineering cracking method is reported in this article. It says that:
Spammers are using a variety of techniques to accomplish this. Some of their success is due to their use of "mechanical turks", people who either directly or indirectly create accounts traded online.

The Mechanical Turk is an interesting concept: human labor drives what looks like a mechanical or automatic service behind the scenes. Amazon brought this concept to the Internet as one of its web services, Amazon MTurk.

The concept was broadened and promoted as crowdsourcing by Jeff Howe in the June 2006 issue of Wired magazine. Its advocates describe it as a great way to divide tasks and dispatch them to a large number of individual workers, achieving great results from the end user's perspective. This is very similar to what we do in the distributed computing domain.

Back to our topic. Because of these fatal defects in today's CAPTCHA systems, some people have begun to think that CAPTCHA may not be so useful or trustworthy. CAPTCHA researchers continue their efforts in this field, and here is a recent report on their work. In that article, however, I did not see how they are going to solve the social engineering problems described above.

Since manpower is involved in social engineering, the cost (in money) is much higher than that of computer software. So the state of today's CAPTCHA systems is this: cracking is technically possible, but it may not be feasible in practice because of the huge cost.

11/15/2008

Notes: "The Art of Software Testing" - 3

The Art of Software Testing


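PART III - How to Design Test Cases


  The guiding principle of test-case design: find as many errors as possible with as few test cases as possible. In other words, look for the most effective set of test cases.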

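I. Test-case design based on white-box testing

  • 1. The main focus (goal) of testing: code coverage.
  • 2. Statement coverage measures how many statements are executed while the code is under test. This criterion is too weak; it fails to reflect many logical branches.
  • 3. Decision coverage counts both the true and the false outcome of every branch statement toward coverage, and it subsumes statement coverage.
  • 4. Condition coverage requires every condition inside every decision to take each of its possible outcomes. Note, however, that condition coverage does not guarantee decision coverage, because it places no requirement on how the outcomes of the individual conditions combine within a decision.
  • 5. Decision/condition coverage satisfies both 3 and 4. However, because of short-circuit evaluation of logical expressions, some combinations of paths are still not exercised.
  • 6. Multiple-condition coverage requires every combination of condition outcomes within each decision to be covered. However, because outcomes are not combined across different decisions, it still cannot cover every path through the program.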
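
A tiny worked example makes the difference between these criteria concrete. The discount function and the test inputs are hypothetical; the helper simply records which outcomes a set of test cases produces for the one decision `is_member and total > 100` and for its two conditions.

```python
def discount(is_member, total):
    rate = 0
    if is_member and total > 100:   # one decision made of two conditions
        rate = 10
    return rate

def outcomes(cases):
    """Return the sets of outcomes seen for the decision and each condition."""
    decisions = {bool(is_member and total > 100) for is_member, total in cases}
    condition_a = {is_member for is_member, _ in cases}
    condition_b = {total > 100 for _, total in cases}
    return decisions, condition_a, condition_b

# Statement coverage: one case executes every statement.
statement_only = [(True, 150)]
# Decision coverage: the decision is both True and False.
decision_cases = [(True, 150), (False, 50)]
# Condition coverage: each condition is both True and False,
# yet the decision as a whole is never True.
condition_cases = [(True, 50), (False, 150)]
```

The last set is the interesting one: both conditions take both outcomes, but the body of the if statement never runs, so condition coverage alone misses it.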


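II. Test-case design based on black-box testing

  • 1. Equivalence partitioning: many test cases have the same ability to expose a latent error, in which case they belong to one equivalence class. Using any member of that class as the test has exactly the same effect.
  • 2. Boundary-value analysis: when the input parameters are described with range limits, it is a good fit to design several test cases around the boundary values.
  • 3. Cause-effect graphing: mainly used to deal with combinations of input parameters.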
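
Boundary-value analysis is easy to mechanize. The sketch below assumes a hypothetical rule "age must be between 18 and 65 inclusive" and generates the values just outside, on, and just inside each boundary.

```python
def boundary_values(low, high):
    """Values around both ends of an inclusive integer range."""
    return [low - 1, low, low + 1, high - 1, high, high + 1]

def accepts(age):
    # Hypothetical function under test: valid ages are 18..65 inclusive.
    return 18 <= age <= 65

cases = boundary_values(18, 65)
results = [accepts(age) for age in cases]
```

Six cases are enough here because every value strictly inside the range belongs to the same equivalence class.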


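III. Other ways to design test cases

  • 1. Error guessing: writing test cases based on intuition. You estimate where the program is likely to go wrong, design test cases for those spots, and test the program with them.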


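PART IV - Debugging: Handling Test Results


I. Debugging strategies

  • 1. Brute force: typically inspecting memory, printing messages, and using integrated debugging tools. The advantage is that it is direct and gives many chances to look at internal state. But it skips the thinking step and produces too much data, which makes locating the problem inefficient.
  • 2. Induction: reasoning from the particular to the general. Start from the details and clues, look for relationships among the clues, form a hypothesis, verify it, and finally find and fix the problem.
  • 3. Deduction: reasoning from general principles to one specific problem. List every plausible hypothesis based on the observed result, use the data to eliminate hypotheses, verify the refined hypothesis that remains, and then make the fix.
  • 4. Backtracking: start where the error shows up and reason backward to where it was caused.
  • 5. Testing: modify the test case that exposed the error to help pin the problem down further.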


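II. Debugging principles

  • 1. Think first, then debug.
  • 2. When you are stuck, take a break.
  • 3. Talking the problem over with someone else is a good way to find it.
  • 4. Avoid leaning on debugging tools; use them only as a second resort.
  • 5. Avoid trial-and-error experimentation.
  • 6. Where there is one defect, there are often others.
  • 7. Be aware that fixing one error may introduce a new one. Automated regression testing is very important when fixing errors.
  • 8. The probability that a fix is correct goes down as the size of the program goes up.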


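PART V - Emerging Testing Techniques


I. Extreme Testing


  Extreme Programming (XP) is a software development process within the agile methodology. It arose to cope with today's rapidly changing business conditions and requirements. Testing plays a very important role in Extreme Programming and is called "Extreme Testing".

1. Basic concepts of Extreme Programming

  • 1). Suited to small and medium-sized teams with rapidly changing requirements.
  • 2). Requirements analysis and planning. Core concepts: user scenarios describe the requirements, the customer decides feature priorities, and the developers estimate the time each feature needs.
  • 3). Small, continually iterated, incremental releases of software and features.
  • 4). System metaphor. (I don't fully understand this one.)
  • 5). Simple design: assume that change will always happen; a simple design adapts to change more easily.
  • 6). Continuous testing: write unit tests before development and keep using tests to verify that the development is correct.
  • 7). Emphasis on refactoring: keep optimizing and refactoring the code during the iterations. After refactoring, the unit tests must pass.
  • 8). Pair programming: two programmers work at one computer.
  • 9). Collective code ownership: the code belongs to the whole team, not to an individual programmer.
  • 10). Continuous integration: do a daily build so that a working system is always available.
  • 11). 40-hour work week: avoid overtime.
  • 12). On-site customer: the customer continually provides feedback.
  • 13). Coding standards: the code submitted by everyone on the team must be as consistent as possible.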


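2. Testing under Extreme Programming

  • 1). Unit testing: test cases are written before development. Benefits: it helps developers understand the system specification before coding; it makes refactoring easier; developers get feedback more readily. Because unit tests run continuously, a good test framework with running and reporting tools is very important.
  • 2). Acceptance testing: tests carried out by the customer based on the user scenarios. The feedback also needs to be prioritized.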


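II. Web Testing


1. Challenges of Web testing

  • Client browsers and operating system platforms vary enormously
  • There are many types of connection between client and server
  • Where finance and accounting are involved, transactions matter a great deal, and testing transactions is a difficult task
  • Customers may come from different countries and regions, so language, time zone and currency symbol all need to be considered
  • Web sites have low user stickiness and need a good user experience to retain customers; how to test user experience is a hard problem
  • A site's performance differs greatly under different loads, so various loads need to be simulated
  • Because a web site is public, possible attacks (DoS) must be considered


2. Testing strategy

  • Define the functional and non-functional specifications in advance as the basis for testing
  • Divide and conquer: test the presentation layer, the logic layer and the data layer separately


3. Testing the presentation layer

  • Content testing: fonts, grammar, colors and so on
  • Structure testing: reachability of links, images and files
  • User environment: different combinations of browser and operating system, plus testing of JavaScript, ActiveX, Java applets and so on


4. Testing the business layer

  • Performance testing: response time and throughput; the tests need to simulate the access patterns of typical users
  • Data validation: test the data collection module and check that it handles all kinds of input correctly
  • Transaction testing


5. Testing the data layer

  • Response time, mainly query performance once the data reaches a certain size
  • Data integrity: whether the data satisfies the business rules
  • Fault tolerance: how the database side behaves when faults occur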



11/09/2008

Notes: "The Art of Software Testing" - 2

The Art of Software Testing


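Part II - How to Test


I. Manual testing: reading the code

1. Code inspection

  1) Participants: the moderator (drives the "Preparing -> Recording -> Ensure Fixing" process), the code owner, and other dev team members.
  2) Activity: the code owner explains the code statement by statement while the others judge it and raise comments; the program is also analyzed against a checklist of common errors.
  3) The meeting concentrates on finding errors, not on fixing them.
  4) Keep the time under control: a mentally demanding meeting loses effectiveness the longer it runs.
  5) After the meeting, the moderator makes sure the errors get fixed and summarizes the lessons learned so that team members avoid similar mistakes.
  6) Participants should criticize the code, not the person. The code owner should accept comments calmly, and the other participants should join the discussion constructively.
  7) Code inspection is a good opportunity for team members to learn from each other (programming style, choice of algorithms, programming techniques and so on).
  8) The biggest advantage of code inspection is finding errors early, which lowers the cost of fixing them.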
  
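2. Code walkthrough

  1) Unlike code inspection, here people dynamically simulate the program's execution by hand to check that the code is correct.
  2) All participants step through the program logic with preselected data, recording state and changes on a whiteboard.
  3) To keep the meeting running smoothly, hold to one principle: errors in the software should not be seen as weaknesses of the person who wrote the program, but as a reflection of how hard software development is.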

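3. Checklist of common errors (Code Error Check List)

  Data errors
  1) Is every variable initialized before it is referenced?
  2) Are array subscripts guaranteed to stay within bounds?
  3) Is the value of each pointer consistent with the lifetime of the object it points to (dangling reference)?
  4) Is lifetime management of dynamically allocated memory and objects clear and appropriate?
  5) Has memory address alignment been considered?
  6) Are dynamic type conversions safe (reinterpreting a memory object as another object type)?
  7) In string handling, has the terminating character been considered?
  8) Does each implementation class fulfill everything its inheritance requires?
  9) Are variable names reasonable and clear?

  Computation errors
  1) Do the two operands of each computation have the expected types?
  2) Are computations on mixed data types clearly documented?
  3) Has overflow been considered (in intermediate or final results)?
  4) When a result is assigned, is the target variable large enough to hold it?
  5) Is the order of evaluation explicit and clear?
  6) In division, has the case of a zero divisor been considered?

  Comparison errors
  1) Are the data types being compared consistent? Pay special attention to comparisons between signed and unsigned numbers.
  2) When a "not" operator is introduced, is the meaning of the statement still correct?
  3) Is the result used in a logical test actually a Boolean value?
  4) Do complex logical expressions have a clear and correct order of evaluation?
  5) Have side effects and short-circuit (bypass) evaluation in logical expressions been taken into account?

  Control-flow errors
  1) Is the number of iterations of each loop guaranteed to be correct?
  2) Are there assumptions about the input data, and are they reflected in the checking and handling of input parameters?
  3) Do jump statements (goto, break, continue) carry out the correct and clear semantics?
  4) Does the body of each for statement contain all the intended statements (forgetting {} causes unexpected errors)?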

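  Function errors
  1) Have the parameter push order and calling convention (cdecl, stdcall) been considered?
  2) Has the cost of creating and managing temporary objects during parameter passing been considered?
  3) When a memory block is passed, has lifetime management of the memory object been considered?
  4) Is it guaranteed that no local variable (or pointer or reference to a local variable) is returned to the caller?
  5) Does every return path of the function correctly release resources and manage the lifetime of temporary objects?
  6) When a function is called, are all return values and exceptional cases handled?

  I/O errors
  1) Are the mode and permission settings correct when a file is opened or created?
  2) When a new file is created, are all parent directories guaranteed to be created correctly?
  3) Are file buffers flushed in time?
  4) Have I/O system errors been considered?
  5) Is the file guaranteed to be closed properly?
  6) Is the end-of-file check correct?

  Other issues
  1) Has thread safety been considered for multithreaded code?
  2) Has the initialization order of global variables in different modules been considered?
  3) When global variables of other modules are cross-referenced, do the types at the declaration and at the definition agree?
  4) When a variable from another module is referenced, is it guaranteed to be initialized already?
  5) Have all compiler warnings been eliminated?
  6) Has the possibility of deadlock been considered in inter-process synchronization?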

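II. Machine testing: running the program

1. The software development process and the role of testing

  1) A typical software development process: requirements analysis -> [external specification] -> system design -> [system architecture specification] -> software system design -> [module interface specification] -> coding
  2) Testing the code against the module interface specification is called "module testing" (unit testing).
  3) Testing the whole final system against the external specification is called "function testing".
  4) Testing during system development can be roughly divided into acceptance testing, system testing, function testing, module testing and so on.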

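2. Kinds of testing

  1) Acceptance testing: carried out by the end user according to the contract signed by both parties, to check whether the delivered system meets the contract.

  2) Function testing: the process of trying to find discrepancies between the system and its external specification. The external specification describes the program's behavior from the end user's point of view.

  3) System testing: tests the non-functional aspects of a system. The system specification cannot serve as the basis for system testing; the user manual and the system objectives document usually do. System testing mainly includes:

  • a) Volume testing: subjecting the system to large amounts of data. Note that only the volume of data matters here, not time or other factors.
  • b) Stress testing: subjecting the system to high load or high intensity. Note that load and intensity here are both related to time: they mean the amount of data or the number of operations arriving per unit of time.
  • c) Performance testing: measuring the system's response time and throughput (bandwidth, operations per second [IOPS]) under a given user load and environment configuration.
  • d) Usability testing: usually tests whether the final system is convenient and easy to use. This kind of test is relatively subjective. It typically covers the number of steps needed to complete a task, how many of the system's features users actually use, whether all text output is grammatically correct and consistent in style, whether the system gives users appropriate prompts, and so on.
  • e) Security testing: mainly covers whether the user's secret information (bank account, password) is transmitted securely, and whether the user's private information (preferences, personal details) is exposed or collected inappropriately.
  • f) Compatibility testing: whether the system works well with existing related components and programs, whether data files produced by the previous version of the system and software are recognized and converted correctly, and so on.
  • g) Recovery testing: many systems (such as databases and middleware) have fault tolerance and recovery features that let them recover from program errors and hardware failures and carry on working. Tests are needed to verify that these features really work.
  • h) Reliability testing: some software systems (such as spaceflight control) require very high reliability and must be tested accordingly. Such tests are usually hard, especially for high-reliability systems.
  • i) Configuration testing: the system may run in many software and hardware environments, and its behavior should be tested in each of them.
  • j) Installation testing: installing modern software is often complicated, so the vendor needs to test the installation process under all kinds of conditions to make sure it is correct.
  • k) Internationalization testing: mainly tests whether the software works correctly under different language settings and environments, whether it displays the local language, and whether it follows the local time zone, wording conventions, currency unit and so on.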


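  4) Unit testing

  • a) Characteristics: small in scale, so errors are easy to locate; different modules can be developed and tested in parallel; the unit of concern is kept small, so each person has a limited amount to focus on.
  • b) Unit testing is white-box testing. In modern software development, developers usually do the unit testing.
  • c) Because modules call each other, there are two ways to test the whole system: incremental and non-incremental. The former sorts the modules topologically and tests a module only after everything it depends on is ready. The latter requires the developer or tester of each module to write mock/stub modules so that the module can be tested on its own.
  • d) Apart from reduced dependencies and easier parallel testing, which favor the non-incremental approach, the incremental approach is better for system development in every other respect.
  • e) Incremental testing can proceed top-down or bottom-up. The biggest advantage of top-down is an early demo, which boosts motivation; its drawback is that the test environment is hard to set up. Bottom-up is exactly the opposite.
  • f) Because unit tests are usually run over and over, automation tools are very important and can greatly improve efficiency.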
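
As an illustration of testing a module in isolation with a stub, here is a short sketch using Python's standard unittest.mock. The total_price function and the tax service with its tax_for method are hypothetical; the stub stands in for a dependency that may not be written yet.

```python
from unittest.mock import Mock

def total_price(cart, tax_service):
    """Module under test: depends on a tax service owned by another module."""
    subtotal = sum(cart)
    return subtotal + tax_service.tax_for(subtotal)

# Stub for the dependency: it always answers 5, whatever it is asked.
tax_stub = Mock()
tax_stub.tax_for.return_value = 5

result = total_price([10, 20], tax_stub)

# The mock also records how it was called, so the interaction can be checked.
tax_stub.tax_for.assert_called_once_with(30)
```

The price of this independence is that the stub encodes an assumption about the real service, which only integration testing can confirm.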


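3. The testing process

  1) Test planning. A test plan should usually include:

  • Test objectives: clear objectives help the team reach its goal more efficiently.
  • Completion criteria: they make the goal of the work clearer and make management more efficient.
  • Schedule: each fine-grained task needs a clear timetable.
  • Responsibilities: team members' tasks must be clearly assigned, to avoid disputes caused by unclear responsibility.
  • Test tools: testing is highly automated, specialized work. Preparing and using the corresponding tools needs planning.
  • Hardware configuration: an important part of the resources needed to carry out the real testing work.
  • Integration approach: the strategy used for system integration and testing in a large multi-module system.
  • Test cases: the basis and method for constructing the test cases.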


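  2) Criteria for ending testing

  • By code coverage: stop when the test cases' coverage of the code under test reaches a given standard.
  • By test-case coverage: stop once all the test cases derived by systematic methods have been run.
  • By the number of test cases and errors found: stop when the number of errors found, or the number of test cases, reaches a given figure.
  • By the trend in errors found and fixed: if the number of errors found per unit of time keeps falling, or stays at a low level, there is little reason to keep testing.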

11/06/2008

Notes: "The Art of Software Testing" - 1


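  Although I have been developing software for a while, I have focused mainly on development, and my understanding of testing has stayed at an intuitive level. Recently I read this classic book on testing and gained a clearer and more systematic understanding of many concepts. Here I record some of the theory and concepts I learned from the book, mixed with experience and impressions from my own daily work, as a summary of what I know and for easy review later.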


The Art of Software Testing


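Part I - Basic Theory of Testing



I. The psychology of testing

1. Testing is the process of executing a program in order to find errors, not to prove that there are none. The reasons behind this are:

  1). Human behavior tends to be highly goal-oriented, and setting the right goal has an important psychological effect. If the goal is to find errors, the chance of finding problems goes up.
  2). People are more efficient and more motivated when facing a task that is more likely to be completed (more realistic). Defining software testing as the activity of finding errors, rather than proving that there are none, makes its goal more achievable.
  3). A program doing something it should not do is also an error. The testing process needs to cover the things the program should not do as well.

2. Software development is a constructive process, while software testing is a destructive one.

3. We want the final program to reach this goal: "it does what it should, and does not do what it should not." Continually hunting for errors and problems in the program is the best way to reach it.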


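II. Software testing strategies

1. Black-box testing - data-driven or input/output-driven testing. The goal of the test is independent of the program's internal mechanisms and structure. Starting from the software specification, test both the positive and the negative side.

2. White-box testing - logic-driven testing. Test inputs are usually constructed from the program's internal structure and logic so as to cover as many statement paths as possible. However, even if the tests cover every statement path, the program may still contain errors that depend on the input data.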


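III. The economics of software testing

1. Finding every possible problem in a program would in theory require enumerating every possible input, which is usually impossible in practice. Path coverage in white-box testing is a good simplification, but it cannot find every possible error, and the space to enumerate is still very large.

2. The economic question in testing is how to find more errors and problems with the least time and the fewest resources (people, software and hardware).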


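IV. Basic principles of software testing

1. Every test case needs a definition of its input and its expected output. Only then can we clearly judge whether the case passed or failed.

2. Developers should avoid testing their own code as far as possible. Their minds are in a "constructive" mode of thinking, which makes it harder to spot existing problems.

3. Testing should include both positive and negative cases. That is, it should cover what the program "should do" as well as what it "should not do".

4. Keep existing tests wherever possible, and make them automated and repeatable so that regression testing is easy.

5. Errors tend to cluster. If many errors have been found in a module, it is more likely to contain further errors.

6. The earlier an error is found, the cheaper it is to fix. Do not wait until the end of the project to start testing.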

10/24/2008

Moving from Win XP to Ubuntu - 2

8. Shortcuts on Ubuntu

Desktop Environment
Ctrl + Alt + F1(F2)(F3)(F4)(F5)(F6) = Select the different virtual terminals; if you are already in one of these terminals, Ctrl is not needed to switch.
Ctrl + Alt + F7 = Switch back to the terminal running the current X session.
Ctrl + Alt + F8(F9)(...) = If there is more than one X session, these key combinations switch to the others.

Ctrl + Alt + +/- = Switch to next/previous X resolution (depends on your X configuration)
Ctrl + Alt + <-/-> = Switch to next/previous workspace (virtual desktop)
Ctrl + Alt + Shift + <-/-> = Move the current application to the next/previous workspace (virtual desktop)

Ctrl + Alt + Backspace = Kill & Restart X server
Alt+Tab = Switch between open programs
Alt + F1 = Bring up the start menu
Alt + F2 = Bring up the "run command" dialogue
Ctrl + Alt + D = Display desktop/Restore applications
Ctrl + Alt + L = Lock the computer
Ctrl + Alt + Del = Log out
Ctrl + L = Open location
Ctrl + F = File browser/Search

Window Management
Alt + F4 = Close current window
Alt + F5 = Restore current window
Alt + F7 = Move current window
Alt + F8 = Resize current window
Alt + F9 = Maximize current window
Alt + F10 = Minimize current window
Alt + Space = Open control menu of current window

Screen Shot
PrtScn = Print screen
Alt + PrtScn = Print current window
Shell command : "gnome-screenshot --delay 3" = Print screen after 3 seconds

Command line/Terminal shortcuts
Ctrl+C = Interrupt (kill) the current process in the terminal; elsewhere it is the copy shortcut
Ctrl+Z = Suspend the current process (use "bg" to let it continue in the background)
Ctrl+D = Log out of the current terminal. In X, this may close the terminal emulator once the shell exits.

Ctrl+A/Home = Move cursor to beginning of line
Ctrl+E/End = Move cursor to end of line
Tab = List available commands from typed letters (Ex: type iw and press Tab, output = iwconfig iwevent iwgetid iwlist iwpriv iwspy)

Ctrl+U = Delete from the cursor back to the beginning of the line
Ctrl+K = Delete from the cursor to the end of the line
Ctrl+W = Delete word before cursor in terminal (Terminal only, also used to close the current document elsewhere)

Arrows up and down = Browse command history
Ctrl+R = History search (Finds the last command matching the letters you type)

Shift+PageUp / PageDown = Scroll terminal output
Ctrl+L = Clears terminal output
Shift+insert = Paste

Mouse
Middle Click Window Title = The window will lose focus
Shift + Drag Window = Window border will stick to the desktop border
Drag file into terminal = Show full path of that file/dir
Middle Click Scroll Bar = Move scroll bar to that place
Middle Drag a Picture to Desktop = Make that picture the wallpaper
Middle Click in Web Browser = Paste the selected text


Customize keyboard shortcut
1. Install xbindkeys/xbindkeys-config. It works on every desktop (KDE, GNOME, XFCE, ...)

2. Use metacity (the default GNOME window manager). Run gconf-editor, go to: Apps -> MetaCity -> KeyBinding_commands ...

3. Use System>Preferences>Keyboard Shortcuts ...

10/23/2008

Moving from Win XP to Ubuntu - 1

  Due to recent policies from the vendor of Win XP, I decided to move my home computer's OS from Win XP to Ubuntu.

  When you replace the OS, you must also replace the applications you use every day. Here are some of the substitutes I collected:

  1. Web Browser
  Firefox is built into Ubuntu, but the Firefox version is tied to the Ubuntu version, so it may not be as recent as the one on Mozilla's website. If you want to use your own version of Firefox, you can follow the Automatic Install guide, or the Manual Install 1 and Manual Install 2 guides, to install it on Ubuntu.
  Opera is also available on Ubuntu.

  2. Office suite
  OOo (OpenOffice.org) is definitely a good candidate. Here is a simple component mapping:
  Microsoft Office vs. OpenOffice.org
  Word vs. Writer
  Excel vs. Calc
  Access vs. Base
  PowerPoint vs. Impress

  3. Multimedia
  DVD player - MPlayer. The coolest thing is that everything can be controlled with keystrokes. To get Chinese subtitles to show up, you need to edit some settings in /etc/mplayer/mplayer.conf by hand, especially the sub-fuzziness variable. Use "man mplayer" to get an explanation of it.

  Music player - RhythmBox
  
  4. BBS tool
  Fterm -> Qterm

  5. P2P tool
  eMule -> aMule.

  For Chinese Internet users, the recommended server lists are: for the ed2k network, http://www.emule.org.cn/server.met; for the Kad network, http://www.emule-inside.net/nodes.dat.

  In order to log into popular servers, you'd better change your client name to the [CHN][VeryCD]** style. If you are behind a wireless router and want to get a HighID when you log into a server, you should configure TCP/UDP port mapping on the router. If outside clients can connect to your client directly, you will get a HighID in the eMule system.

  To integrate aMule better with Firefox, you should add a web protocol handler to Firefox and install some aMule plugins. First, use "sudo apt-get install amule-utils" to install the amule-utils package. In the about:config page of Firefox, create a new boolean variable named "network.protocol-handler.external.ed2k" and set it to true; then create a string variable named "network.protocol-handler.app.ed2k" and set its value to "/usr/bin/ed2k". From now on, when you find a useful resource on VeryCD's website, you can just click the ed2k://*** link, and the resource will be added automatically to your aMule client if it is already running.

  6. IM tool
  Google Talk, Windows Live Messenger and Tencent QQ -> Pidgin. It organizes all your buddies from these 3 protocols into one unified UI.

  7. Fonts from Microsoft
  To beautify your system fonts, run "$ sudo apt-get install msttcorefonts". It downloads some popular fonts from Microsoft, then extracts and installs the font files. You then have more options in font selection dialogues.

  So life seems much the same as on the Win XP platform. The only exception is that I still haven't got my TV card running on Ubuntu yet ...

10/19/2008

The Great PsExec and How It Works

  Recently, one of my project tasks has been to remotely manage several dozen computers: format the disks, deploy some software, run some executables, copy some data files to a central place ...

  Lacking a cluster management tool, I turned to a simple but very powerful tool: PsExec, one of the PsTools commands from Sysinternals.

  You can use this tool to execute any command on remote nodes and bring the console output back to the node where PsExec runs. You can also copy a local file to a remote node and execute it there. In addition, you can specify the user account under which the commands run. Try the command yourself to get more information about it.

  One thing worth mentioning: if you want to use a command that is built into the Windows console shell (cmd.exe), for example dir c:\*, you should tell PsExec to run "cmd /c dir c:\*" on the remote server. This is because there is no executable called "dir.exe"; dir is implemented inside "cmd.exe".
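
  When driving many nodes from a script, the "cmd /c" wrapping is easy to forget, so it helps to build the argument list in one place. The helper below is a sketch of my own: the function name and the host name "node01" are made up, and it only builds the argument list in the "psexec \\computer cmd /c ..." form described above without running anything.

```python
def build_psexec_command(host, builtin_command):
    """Build the argument list for running a cmd.exe built-in on a remote host.

    The built-in (for example "dir c:\\*") is wrapped in "cmd /c" because
    there is no separate executable for it on the remote machine.
    """
    return ["psexec", "\\\\" + host, "cmd", "/c", builtin_command]

command = build_psexec_command("node01", "dir c:\\*")
```

  The resulting list could then be handed to a process launcher once per node.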

  The most magical property of PsExec is that it needs almost nothing on the server side: no need to copy anything to the remote server by hand, no need to enable the telnet service ... So, as a developer, I think many people would ask how PsExec does all this magic. Of course, I am one of them.

  After some discussion with colleagues and searching on the web, here is the secret:
"  First of all I've found that psexec.exe contains embedded binary resource PSEXESVC, that is actually a PE-executable, more exact it's a Win32 service. Some initial reversing of PSEXESVC discovered that this is a server part of utility responsible for starting processes and redirecting I/O to/from client system. However, lets start from very beginning and describe what psexec.exe do in sequence.

  As it expected first things utility do is checking host operating system and parameters validity, an example if the application to copy and execute exists on the host system. I think here is no need to describe this part in details, any programmer working over console application do the same things (again and again in endless loop...).

  After parameters validated psexec obtains pointer and size of PSEXESVC resource:

HRSRC hSvc = FindResource (NULL, "PSEXESVC", "BINRES" );
if ( !hSvc ) return 0;
HGLOBAL hGbDesc = LoadResource (NULL, hSvc);
DWORD dwSvcSize = SizeofResource (NULL, hSvc);
PBYTE pPsExeSvc = LockResource (hGbDesc);

  Then creates file in "\\RemoteSystemName\ADMIN$\System32" named PSEXESVC.EXE and saves resource into it. If there is no existing session with permissions to access RemoteAdmin(ADMIN$) share it tries to establish new session using username and password specified in command line via call to WNetAddConnection2 as following:

DWORD PsExecRemoteLogon (
LPCSTR lpComputerName,
LPCSTR lpUserName,
LPCSTR lpPassword
)
{
char szFullPath [_MAX_PATH];
NETRESOURCE NetResource;
sprintf (szFullPath, "\\\\%s\\IPC$", lpComputerName);
// Initialize NetResource structure, omitted here
...
return (NO_ERROR ==
WNetAddConnection2 (
&NetResource,
lpPassword,
lpUserName,
0)
);
}
  If no error happen we have PSEXESVC.EXE in \\SystemRoot\System32 folder on remote system. Note, if the executable to start remotely must be copied to the remote system, it will be also placed into that folder. After this psexec.exe install and start PSEXESVC service using SCM API (OpenSCManager, CreateService, StartService). Full description of these calls in source would take pretty much place, and I don't see much need to do this.

  After start PSEXESVC creates named pipe "psexecsvc", and start reading messages from it. For this moment we have server part installed and started on remote system, ready to accept command messages. All other work is typical for client/server applications (for better understanding of writing server applications I strongly recommend Jeffrey Richter, Jason D.Clark "Programming Server-Side Applications for MS Windows 2000"). So psexec.exe copies executable to start to remote system if necessary, opens psexecsvc pipe on remote host (CreateFile), fill in message structure with necessary parameters (command line arguments, username&password if specified and etc...) and sends it into \\RemoteSytem\pipe\psexecsvc (TransactNamedPipe API call).

  On receiving this message PSEXESVC creates three named pipes instances "psexecsvc-app_name-app_instance-stdin", "psexecsvc-app_name-app_instance-stdout", "psexecsvc-app_name-app_instance-stderr". As you may suspect psexec.exe connects to each of these pipes and creates separate threads to work with each one. Using console functions (GetStdHandle, ReadConsole, WriteConsole and etc..) standard I/O streams (input, output, error) redirected to/from remote system through previously mentioned named pipes.

  On exiting application psexec.exe stops and uninstall PSEXESVC service, removes it's binary from remote host and removes console executable if it was copied.

  As a result you have telnet like application with extensive use of Windows NT/2000 features, it can be effectively used by system administrators for common administration tasks. The only hole (mentioned on utility homepage) is security: "If you omit a username the remote process runs in the same account from which you execute PsExec, but because the remote process is impersonating it will not have access to network resources on the remote system. When you specify a username the remote process executes in the account specified, and will have access to any network resources the account has access to. Note that the password is transmitted in clear text to the remote system". As you can see this tool is dangerous to use for remote administration via Internet or sometimes even in corporate network (as dangerous as telnet an example). One of the possible extension for this tool would be securing communication of psexec and psexesvc with encryption. However, in combination with IPSEC (if IP used as transport for CIFS) it can be successfully used even today. "


  From the description above, we can see that two mechanisms are the cornerstones of the PsExec magic. One is convenient access to the "\\RemoteNode\ADMIN$" share; without it, PsExec can't upload the extracted PSEXESVC to the remote server. The other is the remote service management facility provided by Windows, through which PsExec starts and manages its server part. In a word, the existing built-in "remote copy, remote execute" facilities of Windows make PsExec behave like a magic tool.

  As for the task mentioned at the beginning of this article, I wrote a batch script, stored it as a *.bat file, and told PsExec to execute it remotely. I prefixed each built-in command of the Windows console shell with "cmd /c". For the powerful commands that come with the Windows console shell and other useful executables, please refer to my previous articles. Windows PowerShell and PowerShell scripts are another good alternative to batch scripts.

  One suggestion to make this tool even better is to give it a "persistent copy" feature.

  There are many situations in which I need to copy data files onto remote servers, but the user accounts on the local and remote machines can't access each other's resources, so a shared folder doesn't solve the problem.

  Yes, PsExec has a "-c" parameter, which uploads a file to the remote machine. But that file must be an executable, and it is the command that PSEXESVC will execute. What's more, the file is deleted when PsExec exits, so you can't trick the current PsExec into doing what I describe here.

  So, to support "persistent copy", PsExec should either provide a switch telling it NOT to delete the copied remote files on exit, or provide a new parameter used only to copy files persistently.

10/09/2008

Books on Scalable Web Application Architecture

"Building Scalable Web Sites"
- by Cal Henderson, 2006.
[It focuses on back-end architecture, design considerations, technology tips and practices; it covers not only system scaling technologies but also some application-level techniques]

"High Performance Web Sites"
- by Steve Souders, 2007.
[It focuses on front-end technologies and some web server tuning tips that can improve web application performance]

"Information Architecture for the World Wide Web"
- by Louis Rosenfeld and Peter Morville, 2006.
[It focuses on the overall design of web-based information systems and involves both technical and non-technical aspects]

"Scalable Internet Architectures"
- by Theo Schlossnagle, 2006.
[This book is all about the core scaling problems and technologies that a large-scale web site has to face]

"Squid: The Definitive Guide"
- by Duane Wessels, 2004.
[Squid is caching software for web servers; this book covers various aspects of this great software]

9/25/2008

Write Debugger Friendly Applications

One of the challenges of debugging a distributed system is that you may need to debug processes that are created by other processes (for example, a daemon on a data/compute node).

How do you get the debugger hooked in right at the start when the process is spawned by one of those daemon processes?
- The solution is to write "Debugger Friendly" applications.

To let a debugger hook into the process at its startup point, you need:
1. A way to stop and wait at the process entry point until the debugger starts and attaches.
2. A flag/switch mechanism to control whether your process should do step 1.

Here is one way to implement this:
1. Use an environment variable to control whether the debuggee should stop and wait for the debugger.
2. Use IsDebuggerPresent() to determine whether a debugger is attached to your process.
3. Use DebugBreak() to break into the debugger from the debuggee.

{
    // if MY_DEBUG_BREAK is defined, break into the debugger
    SomeStrUtil strDebugBreak;
    if (strDebugBreak.GetEnvVar("MY_DEBUG_BREAK") == S_OK && strDebugBreak.GetLength() != 0) {
        printf("Process-[%lu] is waiting for debugger ...\n", GetCurrentProcessId());
        fflush(stdout);
        while (!IsDebuggerPresent()) {
            Sleep(someTime);
        }
        DebugBreak();
    }
}


If the code above is placed at the entry point (main/wmain) of your program and the environment variable (here MY_DEBUG_BREAK) is set, the process waits for a debugger to attach. Once one is attached, DebugBreak() breaks into the debugger inside the process's own code.

You can now set breakpoints in the debuggee and start your debugging journey.

9/11/2008

Get Process Owner

When investigating access control and security problems, you may want to get the owner of a process programmatically.

Since the security mechanism on Windows is far more complex and intricate than those in the *nix world, I will list the steps to accomplish this task here:
1. Get the access token associated with the target process
2. Get the user info (the user's SID) from the access token
3. Get the domain\user info from AD using the SID

Code for querying the process owner:
1 // open the access token associated with the process
2 HANDLE tokenHandle = NULL;
3 if (!OpenProcessToken(GetCurrentProcess(), TOKEN_QUERY, &tokenHandle))
4 {
5     Log(LogLevel_Error, "Failed to get process token, LastError=%u", GetLastError());
6     return;
7 }
8
9 // retrieve user security info(SID) about this access token
10 DWORD dwRequireSize = 0;
11 char userInfo[128];
12 if (!GetTokenInformation(
13     tokenHandle,
14     TokenUser,
15     &userInfo,
16     sizeof(userInfo),
17     &dwRequireSize))
18 {
19     Log(LogLevel_Error, "Failed to get process owner SID, LastError=%u", GetLastError());
20     return;
21 }
22
23 // retrieve the domain\name of the account for this SID
24 TOKEN_USER* userToken = (TOKEN_USER*)userInfo;
25 char name[MAX_PATH];
26 char domain[MAX_PATH];
27 DWORD nameSize = MAX_PATH;
28 DWORD domainSize = MAX_PATH;
29 SID_NAME_USE nu;
30 if (!LookupAccountSidA(NULL,
31     userToken->User.Sid,
32     name,
33     &nameSize,
34     domain,
35     &domainSize,
36     &nu))
37 {
38     Log(LogLevel_Error, "Failed to lookup user info from SID. LastError=%u", GetLastError());
39 }
40 else
41 {
42     Log(LogLevel_Info, "Current Process is using domain=%s, user=%s", domain, name);
43 }


Note:
- I pass a 128-byte char array, rather than a pointer to a TOKEN_USER struct, as the third parameter of GetTokenInformation() at line 15. This is because the API needs more space than sizeof(TOKEN_USER): the SID that the structure points to is stored in the same buffer. On my dev machine it requires 44 bytes, so I enlarged the buffer to 128 bytes to leave some headroom. The beginning of this buffer is always a TOKEN_USER structure, so I cast the start address of the buffer to a TOKEN_USER pointer at line 24.

8/12/2008

On the Failure of our System/Project

How can we say that our system/project succeeded or failed?

It is not easy to define the failure/success criteria of a project. Some junior people think that if our software "can run", then our project is successful.

First, how do we define "software can run"? What is the feature list? What is the configuration? What is the test workload? What is the performance expectation? What is the code/doc quality?

Second, even if all the questions above can be answered elegantly, if in the end no customer uses your software, can you claim your project is successful?

You may object: if I have good answers to all these hard questions, why won't those stupid customers use our software? But no one on this planet is really stupid, especially not your customers.

The customer may choose not to use your runnable software for many reasons:
1. Your software may be good for others, but not suitable for his specific requirements, such as data diversity or importing policies.
2. There are better alternatives. Customers have the freedom to make choices and they will try their best to investigate various candidates. If there is a better and proven solution to their problem, why should they choose yours?
3. The trust and relationship between you and your customer. Will you provide enough support if they choose your solution? Will you work on new features if the customer requests them?

Some people may also say that if the project members learned a lot and can do things that they couldn't do before, then the project is successful. I think this can only be used to judge the value of a project, not its failure or success, because we can always learn from what we have done, whether the project is successful or not.

In my opinion, failure happens when the results don't meet the goals set at the initial stage of the project. What goals to set is another big problem. You may set the goals according to your budget, competitors' status, market requirements, technical challenges and available resources.

According to the definition of "failure" above, I think our project failed due to the big gap between the initial goals and the final outcomes. Admitting failure is good and important, but the more important thing is to ask: why did our system/project fail?

Part I - From the non-technical point of view, the reasons are:
1. The lack of clear success criteria is one of the main root causes. Without a clear and proper goal/vision, the productivity and morale of the whole team decrease.
2. Component owners don't have a shared vision and thus don't cooperate smoothly, which in turn makes the overall system look very strange from the user's point of view.
3. No architect position in the team to make technical decisions.
4. Application (internal customer) owners have too much influence and decision power over the design/implementation of the platform.
5. Target applications are not clear, and we have the misguided ambition to support all kinds of applications.
6. The leadership team delivered the wrong message, and team members thus built the wrong mindset. Many team members don't think they are doing engineering work and won't do work such as dev testing or writing docs.
7. Team members don't have the proper expertise, yet are not willing to learn or to cooperate.
8. Lack of experienced team members to guide the junior members.
9. Lack of team culture to make the team a good environment to work in.
10. Lack of a realistic understanding of our system's weaknesses and strengths, and thus an inability to win potential customers' trust.

Part II - From the technical point of view:
1. Lack of proper engineering expertise to guide the development process. We have build engineers, testers, coding standards and triage processes, but we don't use them in a proper and professional way. Most of the time, we just use the proper terms/concepts but practice in a non-engineering way. For example, no one knows what unit tests and code coverage are. They have no functional spec and just ask testers to do ad hoc tests according to the developers' requirements.
2. Lack of domain expertise. Most team members come from a Machine Learning or Data Mining background and don't know much about the systems area, so the resulting solutions are always ad hoc.
3. Lack of clear target scenarios. Large-scale distributed system design is all about trade-offs; without a clear target, it is not feasible to make proper technical decisions.
4. Lack of analytical thinking to make the right decisions. For example, some designs use a reliable distributed file system to temporarily create and delete a huge number of small files. We should have a clear understanding of what each component does well and what it does badly.
5. Unrealistic technical goals. If we want to do everything well, we will do nothing well in the end.

8/01/2008

On the Failure of our Team

About half a year ago, I joined a big cloud infrastructure team (15+ people). Since then, I have learned a lot from the things that we did not handle well. In this article, I will try to summarize what caused the team management problems and what we could do better if we had the chance to do it again.

What makes us fail?
1. Team Culture
No culture is a kind of culture.
No trust, no cooperation.
Respect individual team members and domain experts.
Software development is an art but also a scientific activity.

2. Team Vision and Goal
Define the goal of our work and project. How do we define the success criteria? Many times, people say we should make this component runnable ("get the system running"). But how do we define "RUN"? What test cases should pass? What code coverage should be reached? What user scenarios should pass? What is the performance goal? What is the scalability goal? Are we doing research or engineering? If engineering, take care of engineering excellence. If research, what is your uniqueness/innovation?

3. Team Morale
Salary is not the only way to fire up team morale, and sometimes it is not the most effective one. Other ways include team building activities (lunch together, real-life CS games, hiking together), an exciting project vision, a comfortable team environment with simple relationships among colleagues, and training courses.

4. Team Member Diversity
Roles: Architect, Developer, Test
Background: Junior, Senior, Experienced Leader

5. Team Leadership
Shield team members from outside interference
Give ownership to team members; they are not just coders

6. Cross Team Cooperation
Platform and application teams should have a partner relationship: don't challenge, don't be aggressive, just work together.

7. The root cause of all the problems lies with some leaders. Some people never do any risk analysis or estimate the challenges. Such unrealistic attitudes have a very bad influence on the project and the whole team.

Things Learned from the Failure:

1. Cooperation: be open to partners and to owners of other domain expertise. Learning from others and leveraging existing knowledge will reduce our risk and cost.

2. Personal development: encourage team members to think, to dream, to be open, to be passionate and to grow. Only in this way can the team be stable and grow. When team members grow, the reputation, confidence and capability of your team grow.

3. Control your ambition: "Dreaming high is admirable, but one must act with one's feet on the ground." Daring to dream is always a good thing, but we should be realistic and approach our goals step by step. We must be self-aware and know our strengths and limitations.

4. Good mindset: Being clever is important, but not that important. Most projects fail not because of the IQ of the team members, but because of chaos in team/project management. For a software project to succeed, I think all team members should keep the following attitudes in mind:
a. Be open and respectful, among team members and with our partners.
b. Pursue engineering excellence and personal excellence.
c. Be passionate about technologies and learning.
d. Value teamwork and work with others.

How to improve team morale?
1. Be open and respectful among team members, and between staff and leadership
2. Trust, passion
3. Exciting project/team vision/goal.
4. Be willing to dream and to make dreams come true
5. Team building
6. Do important things and make a big impact
7. Make team members feel their work is of great importance
8. Grant ownership of small components to individuals; don't treat any member as yet another coder.

How to build a great team?
1. Build team uniqueness
2. Build domain experts
3. Help team members to think and grow
4. Build a learning organization
5. Set a proper team membership bar
6. Control team ambition; build team reputation/confidence gradually
7. Keep a proper number of new hires to keep the team fresh
8. Communication effectiveness (don't treat people as machines; they need to talk with people, not just those cold, boring, harmful, annoying machines)

How to be a great team member?
1. Identify your uniqueness and improve it
2. Accumulate domain expertise
3. Improve time/project management skills
4. Learn communication skills
5. Build your personal insight
6. Build great social relationships among colleagues

7/15/2008

Reconstruct VIM file organization on Windows

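  I don't much like Vim's default directory layout on Windows, so I decided to reorganize Vim into three subdirectories: Vim\Runtime, Vim\Bin and Vim\VimFiles.

  What needs to be done:
  1. Create the Vim root dir.
  2. Install Vim normally following the official method: unzip vim72rt.zip, vim72w32.zip and gvim72.zip; run install.exe, select all tasks and execute them.
  3. Create two subdirectories, Runtime and Bin, under the Vim root dir.
  4. Copy everything under Vim\Vim72 in vim72rt.zip (downloaded from vim.org) to Vim\Runtime.
  5. Copy all files under Vim\Vim72 in gvim72.zip and vim72w32.zip (downloaded from vim.org) to Vim\Bin.
  6. Add $YourDir\Vim\Bin to the system Path variable.
  7. Delete the vim72 directory under Vim.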

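  Now a basic vim/gvim can run, but Vim still needs to know where the user's customizations such as the startup script .vimrc ($Vim) and its own runtime ($VimRuntime) are:
  1. Set the environment variable $Vim=$YourDir\Vim.
  2. Set the environment variable $VimRuntime=$YourDir\Vim\Runtime. This step can be skipped, since it is the default location.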

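  At this point the system is basically ready, but the right-click context menu entry "open with vim" that was set up during the initial installation does not work correctly yet. The system reports an error: "Error creating process: Check if gvim is in your path!". In fact, we have already put gvim.exe in $Path. Judging from the dialog, the problem must be in how gvimext.dll looks for gvim.exe.

  Searching the Vim source code for that error string quickly leads to the function CShellExt::InvokeGvim() in $src/GvimExt/gvimext.cpp. When launching gvim.exe, it locates gvim.exe via getGvimName(). getGvimName() is in the same file; it first looks at the registry key "HKEY_LOCAL_MACHINE\Software\Vim\Gvim", and only if that is not found does it search for gvim.exe using the system $Path. From this we can conclude that the registry value was set by the initial run of install.exe to a location that is no longer valid. I simply deleted it and, to be safe, searched the whole registry and replaced the other Vim-related entries.

  Retry the context menu: everything is OK.

  With that, a clean and clear Vim setup is done: $vim\Bin holds all the executables; $vim\runtime holds the predefined configuration, help and other auxiliary files; $vim\vimfiles holds the user's own customizations such as plugins; and the user's startup script is $vim\_vimrc.

  Apart from the context menu, nothing depends on the registry. In addition, you need to modify the Path variable and set the Vim variable manually.

  BTW, while reading the code I noticed that the author of gvimext.dll appears to be Chinese: Tianmiao Hu.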

7/08/2008

I18N in C RunTime on Windows Platform

    In a multilingual environment, representing and processing text is a challenging problem. The fundamental question is: how do we map between the characters of a language and the byte streams of a computer?

    There are essentially three concepts/components in this mapping: Character Collection -> Character Table -> Encoding.
  • Character Collection means the collection of characters of a specific language.
  • Character Table means a mapping from characters to numbers (numeric codes).
  • Encoding means converting the numbers that stand for characters into the byte (bit) stream used internally by the computer.
    In practice, the three components combined may be called a Charset, an Encoding or a Codepage (I personally prefer 'codepage' because it is unambiguous). They all refer to the same thing and are often used interchangeably. There are also some other popular but confusing terms regarding this mapping, for example UNICODE/UCS/utf-8/utf-16/UCS-2. According to the definitions above, UNICODE/UCS (Universal Character Set) denotes a character table, while utf-8/utf-16/UCS-2 specify how to convert the numbers from the character table into byte streams.

    There are many problems worth discussing in the field of I18N (internationalization), but in this article I focus on only one: "How does the CRT from Microsoft deal with chars beyond ASCII?"
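
    To make the split between "character table" and "encoding" concrete, here is a minimal sketch that takes one number from the Unicode table and turns it into bytes in two different ways. The function names are my own for illustration; the bit layouts are the standard UTF-8 and UTF-16 rules.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// True for numbers that are valid Unicode scalar values (no surrogates).
bool isScalarValue(std::uint32_t cp) {
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Encodes one code point (a number from the character table) as UTF-8 bytes.
std::vector<std::uint8_t> encodeUtf8(std::uint32_t cp) {
    assert(isScalarValue(cp));
    auto byte = [](std::uint32_t v) { return static_cast<std::uint8_t>(v); };
    if (cp < 0x80) return {byte(cp)};
    if (cp < 0x800) return {byte(0xC0 | (cp >> 6)), byte(0x80 | (cp & 0x3F))};
    if (cp < 0x10000)
        return {byte(0xE0 | (cp >> 12)), byte(0x80 | ((cp >> 6) & 0x3F)),
                byte(0x80 | (cp & 0x3F))};
    return {byte(0xF0 | (cp >> 18)), byte(0x80 | ((cp >> 12) & 0x3F)),
            byte(0x80 | ((cp >> 6) & 0x3F)), byte(0x80 | (cp & 0x3F))};
}

// Encodes the same code point as UTF-16 code units:
// one unit inside the BMP, a surrogate pair above it.
std::vector<std::uint16_t> encodeUtf16(std::uint32_t cp) {
    assert(isScalarValue(cp));
    auto unit = [](std::uint32_t v) { return static_cast<std::uint16_t>(v); };
    if (cp < 0x10000) return {unit(cp)};
    const std::uint32_t v = cp - 0x10000;
    return {unit(0xD800 | (v >> 10)), unit(0xDC00 | (v & 0x3FF))};
}
```

    For example, the character 世 has the number U+4E16 in the Unicode table. encodeUtf8(0x4E16) gives the three bytes E4 B8 96, while encodeUtf16(0x4E16) gives the single 16-bit unit 4E16: same table entry, two different encodings.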

Let's look at the code below:
 1 #include <windows.h>
 2 #include <stdio.h>
 3 #include <iostream>
 4 #include <locale.h>
 5 #include <conio.h>
 6
 7 int __cdecl main(int argc, const char* argv[])
 8 {
 9     // Use default CRT locale
10     char * lt1 = setlocale(LC_CTYPE, NULL);
11     printf("current locale before change:%s\n", lt1);
12
13     const char * lpsza = "Hello,世界!";
14     const wchar_t * lpszw = L"Hello,世界!";
15     printf("NO1 - %s\n", lpsza);
16     printf("NO2 - %S\n", reinterpret_cast<const char*>(lpszw));
17     wprintf(L"NO3 - %s\n", lpszw);
18     _cwprintf(L"NO4 - %s\n", lpszw);
19
20     // Use 936 CodePage in CRT
21     char * lpszLC = setlocale(LC_CTYPE, "chinese_China"); 
22     char * lt2 = setlocale(LC_CTYPE, NULL);
23     printf("current locale after change:%s\n", lt2);
24
25     printf("NO1 - %s\n", lpsza);
26     printf("NO2 - %S\n", reinterpret_cast<const char*>(lpszw));
27     wprintf(L"NO3 - %s\n", lpszw);
28     _cwprintf(L"NO4 - %s\n", lpszw);
29
30     return 0;
31 }


The Console Output of this small program is:
current locale before change:C
NO1 - Hello,世界!
NO2 - Hello,NO3 - Hello,??!
NO4 - Hello,世界!
current locale after change:Chinese_People's Republic of China.936
NO1 - Hello,世界!
NO2 - Hello,世界!
NO3 - Hello,世界!
NO4 - Hello,世界!


    Let's see what happened behind the strange output.


     From the console output, we can see that the results from lines 16 and 17 are not what we expected. In line 16 we specified that the string parameter was a wide char string, and in line 17 both the function and the parameter are wide char versions. So what's wrong with lines 16 and 17?

     Many posts on the Internet give explanations of similar problems, but none of them satisfied me, and MSDN has no information about CRT implementation internals. Since MS Visual Studio comes with the CRT source code, I turned to debugging into the CRT source[1] to see what happens in detail.


Debug 'printf' in CRT using VS2008

    We can see that in the case of 'wprintf', the call flow is: wprintf -> _output_l -> _wctomb_s_l. At this point, the wide char string parameter is converted into a multibyte string using the current CRT locale. To output this multibyte string, the CRT calls write_char -> _fputc_ -> _write. The _write function is located in write.c. It checks the destination file handle; if it is a console, _write converts the input buffer into a wide char string using the CRT locale, then converts the resulting wide char string into a multibyte string using the Windows console's codepage. Finally, _write calls the Win32 API WriteFile() to do the real work.
    From the first line of the console output, we can see that the default CRT locale is 'C'. At line 80 of wctomb.c we can see that the 'C' locale treats any wchar_t value greater than 0xff as an illegal char. In the wprintf context, the caller of this function replaces the original wchar_t value with '?' if the conversion fails. This is the reason behind the output "NO3 - Hello,??!".

    The case of 'printf("%S")' is slightly different. When output_l fails to convert a wide char into a multibyte string using _WCTOMB_S() (line 2235 of output.c, which eventually calls _wctomb_s_l()), the failure is returned as an error to the caller, output_l(), and the whole output process is terminated. This is the reason behind the output "NO2 - Hello,", with no line break at the end.

    Then, what about _cwprintf()? Debugging through the source shows that when this function is called, the macro CPRFLAG is defined in the CRT source code context. The call flow is: _cwprintf -> _vcwprintf_l -> write_char -> _putwch_nolock, and _putwch_nolock calls the Win32 API WriteConsoleW directly, with no wctomb or mbtowc conversion at all. So all _cwprintf() calls are OK in the example code.
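
    The two failure behaviours described above can be imitated with a few lines of portable code. This is not the CRT source, only a simulation under my own assumptions: narrowInCLocale() is a hypothetical stand-in for the 'C' locale rule (anything above 0xff is illegal), and the OnError policy models the difference between the wprintf path and the printf("%S") path.

```cpp
#include <cassert>
#include <string>

// How the caller reacts when one character cannot be converted.
enum class OnError { Replace, Abort };

// Hypothetical stand-in for the 'C' locale rule: values up to 0xff
// map to a single byte, anything larger is reported as illegal.
bool narrowInCLocale(wchar_t wc, char& out) {
    if (static_cast<unsigned long>(wc) > 0xFF) return false;
    out = static_cast<char>(wc);
    return true;
}

// Converts a wide string to bytes using the rule above.
std::string convert(const std::wstring& text, OnError policy) {
    std::string out;
    for (wchar_t wc : text) {
        char c = 0;
        if (narrowInCLocale(wc, c)) {
            out.push_back(c);
        } else if (policy == OnError::Replace) {
            out.push_back('?');  // wprintf-like: substitute and continue
        } else {
            break;               // printf("%S")-like: stop the whole output
        }
    }
    assert(out.size() <= text.size());  // at most one byte per wide char
    return out;
}
```

    With the input L"Hello,\u4e16\u754c!\n", the Replace policy yields "Hello,??!" followed by the newline, and the Abort policy yields "Hello," with nothing after it, which matches the NO3 and NO2 lines of the console output.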


From the debugging/analysis, we have found that:
1. All wctomb/mbtowc conversions inside the CRT are implemented using the Win32 APIs WideCharToMultiByte/MultiByteToWideChar.

2. Output to the console is implemented using the Win32 API WriteConsoleW or WriteFile.

3. If you are writing to a file (wprintf/printf is actually writing to a file in this context) rather than the console and the input string parameter is a wide char string, the CRT first tries to convert it into a multibyte string using the CRT locale. If the conversion fails, the w* functions (wprintf) replace the original char with '?' and continue the output process, while the non-w functions (printf) terminate the output process and return an error value.

4. wprintf/printf output is treated as file output rather than console output because STDIO may be redirected to real files; STDIO is not guaranteed to be the Windows console and keyboard.

5. If you use wprintf to output a wide char string to the Windows console, there are three wide char/multibyte conversions: w->m in output_l()@output.c, then m->w and w->m in _write()@write.c. Even if you output a multibyte string (most likely encoded in utf-8) to the console using printf(), two conversions may be needed in _write(). So you had better use _cwprintf()/_cprintf in this situation: there is no w/m conversion at all, and the CRT calls WriteConsoleW() directly.

6. All the coding, debugging and source code investigation was done on Windows using Visual Studio 2008. The conclusions may not hold for other CRT implementations on Windows or for other operating systems.

NOTE [1]:
    In order to debug into the CRT source code, you need to download the pdb files from the Microsoft symbol servers, even though the CRT source code is already present in your local Visual Studio installation (by default: $MSVS\VC\CRT\SRC).

    To this end, just set a global system environment variable:
_NT_SYMBOL_PATH = srv*c:\symbols*http://msdl.microsoft.com/download/symbols;symsrv*symsrv.dll*c:\symbols*http://msdl.microsoft.com/download/symbols

    It applies to both the VS debugger and WinDbg. More information can be found on the Microsoft website:[1],[2]

NOTE [2]:
    If no CRT locale is set when the program starts, it uses the 'C' locale by default. When converting wide chars to multibyte chars, the 'C' locale treats chars greater than 0x00ff as illegal input and returns EILSEQ to the caller.

NOTE [3]:
    Since wprintf/printf outputs to the console MOST of the time, I think the CRT should optimize this path. It should check the destination file handle early; if it is a console, it should just convert the parameter string into a wide char string (if needed) and then call WriteConsoleW(). This would avoid many unnecessary encoding conversions.

NOTE [4]:
    To test whether you fully understand the internals, change the locale in line 21 from "chinese_China" to "chinese_Taiwan" and run the program again. See the strange output? Only the output from line 25 has changed! Try to explain the reason behind it.

6/24/2008

the RPC Way to Interop With Remote Process

When we say "interop with a remote process", people may think "oh, it's just remote IPC, which is actually network communication". Yes, it's network communication, but it is not as easy as it may seem at first sight.

Besides the raw socket mechanism, there are many well-defined styles in which a process can communicate with remote processes; Named Pipes and Mailslots are among those available on the Windows platform.

In this article, I will focus on another specific remote IPC mechanism: Remote Procedure Call, i.e. communicating with a remote process through method calls.

1. Why did people invent RPC?

When physical network communication became a reality in the computer world, software designers started to think about how to provide a software interface to this great facility.

The raw socket abstraction was the first step. It is powerful but too low-level: you see connect/listen/accept and byte streams. Networking is for remote IPC, but the semantics of remote IPC are far richer than just connections and raw byte streams.

So here comes the invention of RPC. According to the earliest paper, the main motivation behind the design of RPC is that the procedure call is a well-known and well-understood mechanism/semantic; making it work between remote computers makes distributed systems easier to build and to get right.

2. What problems should RPC solve?

Key problems:

-1]. Binding Remote Methods (A.K.A. identifying/naming/addressing/locating). It concerns two fundamental problems: how does the caller specify which callee to communicate with (Naming)? How does the caller find the physical address (network address/port/process ID etc.) of the callee (Locating)?

-2]. Data Marshalling/Unmarshalling. It concerns how the caller encodes its data types (in some programming language) into a byte stream and how the callee decodes the byte stream into its desired data types (possibly in another programming language on another platform). It is typically called Message Encoding. More recently, there is also a related problem, Message Formatting, which defines how the messages between client and server are laid out.

-3]. Message Transporting. It concerns how to send/receive the data stream, for example how to manage connections, congestion, failures and timeouts.

Every RPC mechanism has its own specific solution to these problems.

Basic Components for RPC system:
- Client Code
- Client Stub
- RPC runtime (at both client and server side)
- Server Stub
- Server Code
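
The five components listed above can be sketched in a single process. This is only an illustration under loud assumptions: the "RPC runtime" here is a plain function call with a name-to-stub table instead of a network transport, the wire format (big-endian 32-bit integers) is my own choice, and all function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Marshalling: write a 32-bit integer in a fixed (big-endian) byte order.
void putInt32(Bytes& buf, std::int32_t v) {
    const std::uint32_t u = static_cast<std::uint32_t>(v);
    for (int shift = 24; shift >= 0; shift -= 8)
        buf.push_back(static_cast<std::uint8_t>(u >> shift));
}

// Unmarshalling: read the integer back and advance the read position.
std::int32_t getInt32(const Bytes& buf, std::size_t& pos) {
    assert(pos + 4 <= buf.size());
    std::uint32_t u = 0;
    for (int i = 0; i < 4; ++i) u = (u << 8) | buf[pos++];
    return static_cast<std::int32_t>(u);
}

// Server code: the real procedure.
std::int32_t add(std::int32_t a, std::int32_t b) { return a + b; }

// Server stub: decodes the request, calls the server code, encodes the reply.
Bytes addServerStub(const Bytes& request) {
    std::size_t pos = 0;
    const std::int32_t a = getInt32(request, pos);
    const std::int32_t b = getInt32(request, pos);
    Bytes reply;
    putInt32(reply, add(a, b));
    return reply;
}

// RPC runtime (stand-in): binding is a name lookup, transport is a function call.
Bytes rpcSend(const std::string& method, const Bytes& request) {
    static const std::map<std::string, std::function<Bytes(const Bytes&)>> registry = {
        {"add", addServerStub}};
    const auto it = registry.find(method);
    assert(it != registry.end());
    return it->second(request);
}

// Client stub: looks like a local call to the client code.
std::int32_t remoteAdd(std::int32_t a, std::int32_t b) {
    Bytes request;
    putInt32(request, a);
    putInt32(request, b);
    const Bytes reply = rpcSend("add", request);
    std::size_t pos = 0;
    return getInt32(reply, pos);
}
```

Client code simply calls remoteAdd(2, 3) and gets 5; binding, marshalling and transport stay hidden behind the stub. In a real RPC system, rpcSend() would resolve the server address and move the bytes over the network.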

3. What are the popular RPC systems today?

Function Oriented - classic RPC
- Distributed Computing Environment (DCE/RPC) by HP, IBM, DEC and others
- Open Network Computing (ONC RPC) by SUN
- MSRPC by Microsoft. It is a modification of DCE/RPC and is widely used in DCOM, which could be regarded as MSRPC + COM.

Object Oriented - OO RPC
- .Net Remoting by Microsoft
- Java Remote Method Invocation by Sun
- DCOM: Distributed COM by Microsoft
- CORBA: Common Object Request Broker Architecture by Object Management Group
- Jini - Advanced Java RMI by Sun

HTTP/XML Oriented - Web Service
- XML-RPC, xml as data representation and http as transport channel
- JSON-RPC, JSON as data representation, http/socket as transport channel
- SOAP WS (A.K.A. big web services), xml with a strong schema as data/message representation, http/smtp/tcp etc. as transport layer; security/transaction related protocols (the so-called WS-*) are added to support enterprise applications
- RESTful WS, xml/json/... (flexible) as data encoding, simple & flexible message format, http as transport channel, http methods & URIs to represent actions, and stateless communication

4. Other stuff that matters

The Extensive Adoption of XML

Advantages of using XML
- software that marshals to and unmarshals from XML is easy to get
- text-based data representation, easy to read (but why should we read those data?)
- it's flexible and easy to extend

Disadvantages of using XML
- high computing/memory cost
- high cost of communication
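
A rough feel for the communication cost: the sketch below builds an XML-RPC style request for a hypothetical add(a, b) procedure and compares its size with a made-up compact binary layout (4-byte name length, the 3-byte name, two 4-byte integers). Both the procedure and the binary layout are assumptions for illustration only.

```cpp
#include <cstddef>
#include <string>

// XML-RPC style request body for a hypothetical add(a, b) procedure.
std::string XmlRpcAddRequest(int a, int b) {
    return "<?xml version=\"1.0\"?>"
           "<methodCall><methodName>add</methodName><params>"
           "<param><value><int>" + std::to_string(a) + "</int></value></param>"
           "<param><value><int>" + std::to_string(b) + "</int></value></param>"
           "</params></methodCall>";
}

// Size of the same call in the assumed binary layout:
// 4-byte name length + "add" + two 4-byte integers.
std::size_t BinaryAddRequestSize() {
    return 4 + 3 + 4 + 4;
}
```

For small calls like this the XML body is more than ten times larger than the binary one, before any HTTP headers are counted, and the receiver still has to parse text back into integers.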

The Extensive Adoption of HTTP

- In fact, HTTP is not a general-purpose communication facility by design. The initial motivation for adopting HTTP as the transport channel in web service protocols was to communicate through Internet firewalls.

Many critics say that HTTP/XML is abused in today's web-scale distributed systems. Yes, these technologies are used in ways other than their original purpose, but very few innovations turn out as planned; many great inventions are just happy accidents. HTTP/XML may not be the best technologies, but they are the most successful ones in the web world.

Windows Communication Foundation

WCF does not invent any new communication protocol. It is just a programming framework that implements existing protocols on the .Net platform. The main contribution of WCF is that it unifies various communication protocols (SOAP, WSE, .Net Remoting, Message Queue, JSON-RPC, REST, etc.) into one programming model in the .Net framework.

[Reference]

Remote Procedure Call
Wiki about RPC - http://en.wikipedia.org/wiki/Remote_procedure_call
RFC about RPC - http://tools.ietf.org/html/rfc707
RPC Tutorial - http://www.cs.cf.ac.uk/Dave/C/node33.html
RPC Implementation - http://pages.cs.wisc.edu/~cs736-1/papers/rpc.pdf
DCE RPC - http://www.opengroup.org/dce/
SUN RPC - http://www.onc-rpc-xdr.com/
MS RPC - http://msdn.microsoft.com/en-us/library/aa378651(VS.85).aspx

Web Service
The SOAP/XML-RPC/REST Saga, http://www.tbray.org/ongoing/When/200x/2003/05/12/SoapAgain
History of SOAP, http://webservices.xml.com/pub/a/ws/2001/04/04/soap.html
XML-RPC, http://www.xmlrpc.com
SOAP Tutorial, http://www.w3schools.com/soap/default.asp
Build WS the REST way, http://www.xfront.com/REST-Web-Services.html

6/10/2008

Windows Service - How To

Part I - Why invent the so-called "Windows Service"?
1. A service can be started automatically when the computer boots, and can be paused and restarted
2. A service can run in its own Windows session without showing any user interface (a background process that needs no keyboard, mouse or monitor)
3. It is similar to a "daemon" in the *nix world

Part II - Windows Service Components
1. Service Database
- registry-based data store
- stores the configuration/attributes of each service

2. Service Control Manager
- an RPC server started at system boot
- manages the service database
- controls individual services directly

3. Individual Service
- Service Program, a program that provides executable code for one or more services
- Service Configuration Program, a program that queries or modifies the service database (install/delete/modify service programs)
- Service Control Program, a program that starts and controls services and driver services
- all three programs above rely on functionality provided by the SCM

Part III - Steps to develop a Windows Service
1. composing the service program, which includes:
- main function: the entry point of the process that the SCM starts. It should call StartServiceCtrlDispatcher, which runs the control dispatch loop on the main thread and executes the ServiceMain function of each service the program contains on its own thread.
- ServiceMain function: it contains the service-specific logic.
- control handler: defines how the service reacts to the control messages from the SCM.
- the thread model is "1 + N": one thread for control dispatching and one for each of the N ServiceMain functions.

2. composing the service configuration/control program
- the two can share one executable or live in separate ones
- the configuration program interacts with a Windows Service in a static way (its registration), while the control program does so in a dynamic way (its running state)
- both talk to the Windows Service indirectly, using the SCM as the mediator
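
The job of the control handler from step 1 can be viewed as a small state machine: each control message from the SCM moves the service to a new state, which is then reported back. The sketch below is a portable approximation only; the State and Control enums are hypothetical stand-ins for the Windows SERVICE_* constants, chosen so the example compiles without windows.h.

```cpp
// Hypothetical stand-ins for the Windows service states and control codes.
enum class State { StartPending, Running, Paused, StopPending, Stopped };
enum class Control { Pause, Continue, Stop, Shutdown, Interrogate };

// Decide which state the service should report after receiving a control.
// Stop and Shutdown only move to StopPending: the service thread itself
// reports Stopped once its loop has really finished.
State NextState(State current, Control control) {
    switch (control) {
    case Control::Pause:
        return current == State::Running ? State::Paused : current;
    case Control::Continue:
        return current == State::Paused ? State::Running : current;
    case Control::Stop:
    case Control::Shutdown:
        return current == State::Stopped ? State::Stopped : State::StopPending;
    case Control::Interrogate:
        return current;  // just report the current state again
    }
    return current;
}
```

A real handler would store the result in its SERVICE_STATUS structure and pass it to SetServiceStatus on every call, including the cases where the state did not change.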

Part IV - How to debug Windows Service:

1. Use your debugger to debug the service while it is running. First, obtain the process identifier (PID) of the service process. After you have obtained the PID, attach to the running process. For syntax information, see the documentation included with your debugger.

2. Call the DebugBreak function to invoke the debugger for just-in-time debugging.
Alternatively, specify a debugger to use when the program starts. To do so, create a subkey named after the service executable under the Image File Execution Options key in the following registry location, and give it a REG_SZ value named Debugger that holds the full path of the debugger:
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion

3. Use Event Tracing to log information.

4. Write the service as a console application first. After debugging and verification, convert it to Windows Service form.
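
One way to follow tip 4 is to keep the service logic in a function that knows nothing about the SCM, so the same code can be driven from a console main during debugging and from ServiceMain later. The sketch below is an assumption about how that split could look; RunTimerLoop and its three callbacks are hypothetical names, not part of the Windows API.

```cpp
#include <functional>
#include <string>

// Platform-neutral service logic: report a message on every tick until
// asked to stop. The callbacks decide where the stop request comes from,
// where the messages go, and how to wait between ticks.
unsigned RunTimerLoop(const std::function<bool()>& shouldStop,
                      const std::function<void(const std::string&)>& report,
                      const std::function<void()>& wait) {
    unsigned ticks = 0;
    while (!shouldStop()) {
        report("tick " + std::to_string(ticks));
        ++ticks;
        wait();
    }
    return ticks;  // number of messages reported
}
```

In the console build the callbacks can print to stdout and stop after a few ticks; in the service build they would check the shutdown flag set by the control handler, call ReportEvent, and Sleep between ticks.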

Let's look at a code example of a service that periodically writes an event record to the Windows event log.
Code Snippet I - Service Program
  1 #include <windows.h>
  2 #include <stdio.h>
  3 #include "msg.h"
  4
  5 volatile BOOL g_isShutdown = FALSE; // written by the control handler thread, read by ServiceMain
  6 volatile BOOL g_isPaused = FALSE;
  7
  8 SERVICE_STATUS g_servStat = {0};
  9 SERVICE_STATUS_HANDLE g_hServStat = NULL;
 10
 11 DWORD WINAPI TimerServiceHandler(DWORD  dwControl, DWORD  dwEventType, LPVOID lpEventData, LPVOID lpContext)
 12 {
 13     switch (dwControl)
 14     {
 15     case SERVICE_CONTROL_CONTINUE:
 16         g_isPaused = FALSE;
 17         g_servStat.dwCurrentState = SERVICE_RUNNING;
 18         break;
 19     case SERVICE_CONTROL_PAUSE:
 20         g_isPaused = TRUE;
 21         g_servStat.dwCurrentState = SERVICE_PAUSED;
 22         break;
 23     case SERVICE_CONTROL_SHUTDOWN:
 24         g_isShutdown = TRUE;
 25         g_servStat.dwCurrentState = SERVICE_STOP_PENDING; // ServiceMain reports SERVICE_STOPPED when its loop exits
 26         break;
 27
 28     case SERVICE_CONTROL_STOP:
 29         g_isShutdown = TRUE;
 30         g_servStat.dwCurrentState = SERVICE_STOP_PENDING;
 31         break;
 32
 33     default:
 34         break;
 35     }
 36
 37     SetServiceStatus(g_hServStat, &g_servStat);
 38
 39     return 0;
 40 }
 41
 42 // ServiceMain of Service Program
 43 void WINAPI TimerServiceMain(DWORD argc, LPWSTR *argv)
 44 {
 45     // Register service control handler
 46     g_hServStat = RegisterServiceCtrlHandlerEx(L"MyTimer", TimerServiceHandler, NULL); // assign the global; a local declaration would shadow it
 47
 48     // Set service initial status
 49     g_servStat.dwCheckPoint = 0;
 50     g_servStat.dwControlsAccepted = SERVICE_ACCEPT_STOP | SERVICE_ACCEPT_PAUSE_CONTINUE | SERVICE_ACCEPT_SHUTDOWN;
 51     g_servStat.dwCurrentState = SERVICE_START_PENDING;
 52     g_servStat.dwServiceSpecificExitCode = 0;
 53     g_servStat.dwServiceType = SERVICE_WIN32_OWN_PROCESS;
 54     g_servStat.dwWaitHint = 2 * 1024;
 55     g_servStat.dwWin32ExitCode = NO_ERROR;
 56     if (!SetServiceStatus(g_hServStat, &g_servStat))
 57     {
 58         // report error in the way you like: disk file, windows event
 59         printf("failed to set service init status\n");
 60     }
 61
 62     // Your service specific logic here
 63     HANDLE hEvtSrc = RegisterEventSource(NULL, L"MyTimerService");
 64     if (hEvtSrc == NULL)
 65     {
 66         g_servStat.dwCurrentState = SERVICE_STOPPED; // init failed: tell the SCM we are stopped
 67         SetServiceStatus(g_hServStat, &g_servStat);
 68         return;
 69     }
 70
 71     // Init done, report new service status to SCM
 72     g_servStat.dwCurrentState = SERVICE_RUNNING;
 73     SetServiceStatus(g_hServStat, &g_servStat);
 74
 75     wchar_t buf[MAX_PATH];
 76     LPCWSTR lpBuf = buf;
 77     UINT32 uiCounter = 0;
 78     while (!g_isShutdown)
 79     {
 80         if (!g_isPaused)
 81         {
 82             // see windows event documentation for how to use it in your own program
 83             swprintf_s(buf, MAX_PATH, L"Info from Gate Keeper Service: %u\n", uiCounter++);
 84             ReportEvent(hEvtSrc, EVENTLOG_SUCCESS, 0, MSG_INFO_COMMAND, NULL, 1, 0, &lpBuf, NULL);
 85         }
 86         Sleep(5 * 1000);
 87     }
 88    
 89     DeregisterEventSource(hEvtSrc);
 90
 91     g_servStat.dwCurrentState = SERVICE_STOPPED;
 92     SetServiceStatus(g_hServStat, &g_servStat);
 93
 94     return;
 95 }
 96
 97 // The Main Function of Service Program
 98 int main(int argc, char** argv)
 99 {
100     SERVICE_TABLE_ENTRY dispatchTable[] =
101     {
102         {L"MyTimer", TimerServiceMain},
103         {NULL, NULL}
104     };
105
106     // Register ServiceMain to Service Control Manager
107     if (!StartServiceCtrlDispatcher(dispatchTable))
108     {
109         // you can choose to write the info msg to disk file or windows event
110         printf("Failed to register service main to SCM.Please check system logs\n");
111         return -1;
112     }
113
114     return 0;
115 }
116

Code Snippet II - Service Configuration/Control Program

#include <windows.h>
#include <stdio.h>

int wmain(int argc, wchar_t** argv)
{
    SC_HANDLE hSCM = OpenSCManager(NULL, NULL, SC_MANAGER_ALL_ACCESS);
    if (hSCM == NULL)
    {
        printf("failed to open sc manager, due to error:%lu\n", GetLastError());
        return -1;
    }

    // Service Configuration Program - Install a new Windows Service
    SC_HANDLE hServ = CreateService(hSCM, L"MyTimer",
            L"Time Service",
            SERVICE_ALL_ACCESS,             // access rights for the service handle (not SC_MANAGER_*)
            SERVICE_WIN32_OWN_PROCESS,
            SERVICE_DEMAND_START,
            SERVICE_ERROR_NORMAL,
            L"D:\\Dev\\Debug\\MyTimer.exe", // every backslash in the path must be escaped
            NULL,
            NULL,
            NULL,
            NULL,
            NULL);
    if (hServ == NULL)
    {
        printf("create windows service failed, due to error:%lu\n", GetLastError());
        CloseServiceHandle(hSCM);
        return -1;
    }

    // Service Configuration Program - Delete an existing Windows Service
    //SC_HANDLE hServ = OpenService(hSCM, L"MyTimer", SERVICE_ALL_ACCESS);
    //if (!DeleteService(hServ))
    //{
    //  printf("delete windows service failed, due to error:%lu\n", GetLastError());
    //}

    // Service Control Program - Start/Pause/Resume/Stop the Windows Service
    if (!StartService(hServ, argc, const_cast<LPCWSTR*>(argv)))
    {
        printf("failed to start service due to error:%lu\n", GetLastError());
    }

    SERVICE_STATUS servStat;
    QueryServiceStatus(hServ, &servStat);

    // Note: a control request fails with ERROR_SERVICE_CANNOT_ACCEPT_CTRL
    // while the service is still in the SERVICE_START_PENDING state.
    if (!ControlService(hServ, SERVICE_CONTROL_PAUSE, &servStat))
    {
        printf("failed to control service due to:%lu\n", GetLastError());
    }

    if (!ControlService(hServ, SERVICE_CONTROL_CONTINUE, &servStat))
    {
        printf("failed to control service due to:%lu\n", GetLastError());
    }

    if (!ControlService(hServ, SERVICE_CONTROL_STOP, &servStat))
    {
        printf("failed to control service due to:%lu\n", GetLastError());
    }

    // Release the handles obtained from the SCM
    CloseServiceHandle(hServ);
    CloseServiceHandle(hSCM);
    return 0;
}


Note:
1. The code comments in the first snippet mark the main and ServiceMain functions of the Service Program; TimerServiceHandler is its control handler.
2. The second code snippet shows the Service Configuration and Service Control Programs, merged into one executable.
3. You should update the service status each time a control command is processed in your control handler function, even if the status has not changed at all, as illustrated at line 37 of Code Snippet I. This lets the SCM know that your service is still making progress and has not hung.
4. Windows Service applications run in a different window station than the interactive station of the logged-on user. The Windows service station is not interactive, so dialog boxes raised from within a Windows service application will not be seen and may cause your program to stop responding. Error/Debug messages should be logged in the Windows event log.

full source code package:
http://code4cs.googlecode.com/files/NotifyService4W.zip