2/28/2009

Tips for Improving Daily Productivity

1. Prioritize your Work Items.

2. Do Less but More Important Things.

3. Delegate Things to the Right People.

4. Set a Schedule and Time Limit for Your Tasks.

5. Batch Small but Frequent Routines (for example, checking email or reading RSS feeds).

6. Focus on One Thing at a Time and Do It Well.

7. Avoid Things that Won't Produce Meaningful Results (like chatting).

8. Dare to Say "NO" to Others, even to Your Manager.

2/27/2009

The Confusing URI, URL and URN

  Naming and addressing are two key issues that every large-scale distributed system must face (they are also known as persistence/availability).

  A URI covers both concerns, while a URL emphasizes location and a URN emphasizes naming. So we can say that a URN defines an item's identity, while a URL provides a method for finding it.

  But we can't say that URIs consist of exactly two classes of identifiers, URLs and URNs. According to the W3C's documentation, URL is a useful but informal concept: a URL is a type of URI that identifies a resource via a representation of its primary access mechanism (e.g., its network "location"), rather than by some other attributes it may have. URN, meanwhile, is now just a URI scheme, which itself contains subspaces, for example: urn:issn:1535-3613
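  The point that "urn" is just another URI scheme can be illustrated with a small sketch. It is written in Python purely for illustration and uses the standard urllib.parse module; the helper name describe is hypothetical and not part of any specification.

```python
from urllib.parse import urlparse


def describe(uri):
    """Split a URI into its scheme, host (if any) and remaining path."""
    parts = urlparse(uri)
    return {"scheme": parts.scheme, "host": parts.netloc, "path": parts.path}


# A URN: the scheme is "urn" and the rest names the item (here in the "issn" subspace).
urn = describe("urn:issn:1535-3613")

# A URL: the scheme and host describe how and where to reach the resource.
url = describe("http://www.w3.org/TR/uri-clarification/")

print(urn)
print(url)
```

  Both strings parse as URIs in the same way; only the http one carries a host that tells a client where to go.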

[Reference]
http://www.w3.org/TR/uri-clarification/
http://en.wikipedia.org/wiki/Uniform_Resource_Identifier

2/20/2009

Computer Program Analysis

Computer Program Analysis is the process of automatically analyzing the behavior of computer programs.

1. Analyzing Types

Static Analysis:
- code defect detecting
- format/convention checking
- code understanding/comprehension
- reverse engineering

Dynamic Analysis:
- code coverage producing
- memory leak checking
- threading error checking (data races & deadlocks)
- performance analysis (a.k.a. profiling)

2. Analyzing Purposes

Optimization:
- performance
- robustness

Correctness:
- memory leak
- code coverage
- model check
- threading error

3. Popular Goals

- code coverage, to see how much of the code the tests exercise
- profiling, to identify performance bottlenecks
- code analysis, to find code defects and improve code quality
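To make the static/dynamic distinction concrete, here is a minimal Python sketch (an illustration only, not how any of the tools referenced below are implemented). The sample function absolute and both helper names are hypothetical. The static helper inspects the syntax tree without running anything; the dynamic helper runs the code under sys.settrace and records which lines execute, which is the basic idea behind line coverage.

```python
import ast
import sys

# Line 1 of SOURCE is empty, so "def absolute" is on line 2.
SOURCE = '''
def absolute(x):
    if x < 0:
        return -x
    return x
'''


def static_function_names(source):
    """Static analysis: list function names by walking the syntax tree."""
    tree = ast.parse(source)
    return [n.name for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]


def dynamic_executed_lines(source, func_name, *args):
    """Dynamic analysis: run func_name(*args) and record executed line numbers."""
    namespace = {}
    exec(compile(source, "<sample>", "exec"), namespace)
    executed = set()

    def tracer(frame, event, arg):
        # Only record line events that belong to the sample source.
        if event == "line" and frame.f_code.co_filename == "<sample>":
            executed.add(frame.f_lineno)
        return tracer

    sys.settrace(tracer)
    try:
        namespace[func_name](*args)
    finally:
        sys.settrace(None)
    return executed


print(static_function_names(SOURCE))
print(sorted(dynamic_executed_lines(SOURCE, "absolute", 5)))
print(sorted(dynamic_executed_lines(SOURCE, "absolute", -5)))
```

A single positive input never reaches the "return -x" line; only after adding a negative input are all three body lines covered, which is the kind of gap a coverage report exposes.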

[Reference]
http://en.wikipedia.org/wiki/Program_analysis_(computer_science)
http://en.wikipedia.org/wiki/Dynamic_program_analysis
http://en.wikipedia.org/wiki/Static_code_analysis
http://en.wikipedia.org/wiki/List_of_tools_for_static_code_analysis
http://en.wikipedia.org/wiki/Code_coverage
http://www.bullseye.com/coverage.html

2/16/2009

PsExec and Windows Security Problem

A few days ago, I encountered a strange problem: a process could access local disk files (such as d:\dir\file.ext) but couldn't access remote NTFS files (such as \\file_server_name\dir\file.ext), even though the process's owner was an administrator of the target machine.

After some investigation I found that this process (P1) was spawned by another process (P2), which was started on machine M1 remotely by PsExec running on machine M2. The process P2 runs in session 0 with a non-interactive window station. If I start P2 manually (through a Terminal Services session), everything works.

You can reproduce this problem easily (assuming your current account is an admin on all related machines):
1. On a remote machine M1, set up a shared folder which contains a text file (\\M1\Dir\test.txt) and write some simple text into this file (for example: hello world!).
2. On your local machine M2, run the following commands against a third machine M3:
PsExec \\M3 CMD "/c type \\M1\Dir\test.txt"
PsExec \\M3 -u yourName -p yourPassWD CMD "/c type \\M1\Dir\test.txt"

3. You will see the second command succeed while the first fails with "access is denied".

In fact, this behavior is a "by design" feature of PsExec. The home page of this tool says that "If you omit a username the remote process runs in the same account from which you execute PsExec, but because the remote process is impersonating it will not have access to network resources on the remote system. When you specify a username the remote process executes in the account specified, and will have access to any network resources the account has access to."

But what happens behind the scenes?

First, let's run the following commands:
PsExec \\M3 CMD 1
PsExec \\M3 -u yourName -p yourPassWD CMD 2


Then log on to machine M3, open Process Explorer, and bring up the properties dialogs for the two CMD processes. You will see that the only difference is that "CMD 2" belongs to a security group called "Logon SID".

So, we can guess that:
1. If a user name is given to the PsExec command, the PsExeSvc Windows service first calls LogonUser to log on to the target machine with the given credentials and obtain an access token. It then uses this token to create the target command process, so the created process functions normally.
2. But if no user name is given, PsExeSvc runs the target command process under impersonation, using the access token from PsExec.exe itself.

If a process is impersonating an account that is not logged on locally, it can't access network resources. This causes the problem described at the beginning.

[Reference]
1. PsExec, User Account Control and Security Boundaries
2. interactive windows service
3. Windows Privilege

== Session, Window Station and Desktop ==
4. http://blogs.technet.com/askperf/archive/2007/07/24/sessions-desktops-and-windows-stations.aspx
5. http://www.alex-ionescu.com/?p=59
6. http://www.alex-ionescu.com/?p=60

== Vista Winsta0 Isolation ==
7. http://bartdesmet.net/blogs/bart/archive/2007/03/05/windows-vista-winsta0-isolation-explained.aspx
8. http://blogs.technet.com/voy/archive/2007/02/23/services-isolation-in-session-0-of-windows-vista-and-longhorn-server.aspx

2/14/2009

Readings in Database Systems 3E

from http://pages.cs.wisc.edu/~nil/764/


The Roots

1. E. F. Codd: "A Relational Model of Data for Large Shared Data Banks." CACM 13(6): 377-387 (1970)

2. Morton Astrahan, et al: "System R : Relational Approach to Database Management." TODS 1(2): 97-137 (1976)

3. Stonebraker, et al: "The Design and Implementation of INGRES." TODS 1(3): 189-222 (1976)

4. Chamberlin, et al: "A History and Evaluation of System R"

5. Stonebraker: "Retrospection on a Database System." TODS 5(2): 225-240 (1980)

Relational DBMS Implementation

6. Stonebraker, "Operating System Support for Database Management." CACM 24(7): 412-418 (1981)

7. Guttman, "R-Trees: A Dynamic Index Structure for Spatial Searching." SIGMOD Conference 1984: 47-57

8. Hellerstein, Naughton, & Pfeffer, "Generalized Search Trees for Database Systems". VLDB 1995: 562-573

Buffer Management:

9. Chou & DeWitt, "An Evaluation of Buffer Management Strategies for Relational Database Systems". VLDB 1985: 127-141

10. Shapiro, "Join Processing in Database Systems with Large Main Memories". TODS 11(3): 239-264 (1986)

11. Selinger, Astrahan, Chamberlin, Lorie & Price: "Access Path Selection in a Relational Database Management System." SIGMOD Conference 1979: 23-34

12. Leung, Pirahesh, Seshadri and Hellerstein: Query Rewrite Optimization Rules in IBM DB/2 Universal Database. To appear as an IBM Research Report, contact cleung@almaden.ibm.com

Transaction Management

13. Gray, et al. "Granularity of Locks and Degrees of Consistency in a Shared Database." IFIP Working Conference on Modelling of Database Management Systems, 1-29, AFIPS Press.
(This is an IBM research version, thanx pedro)

14. Kung & Robinson: "On Optimistic Methods for Concurrency Control." TODS 6(2): 213-226 (1981)

15. Agrawal, et al.: "Concurrency Control Performance Modeling: Alternatives and Implications". TODS 12(4): 609-654 (1987)

16. Lehman & Yao: "Efficient Locking for Concurrent Operations in B-trees." TODS 6(4): 650-670 (1981)

17. Haerder & Reuter: "Principles of Transaction-Oriented Database Recovery." Computing Surveys 15(4): 287-317 (1983)

18. Mohan, et al.: "ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging." TODS 17(1): 94-162 (1992)

19. Stonebraker, "The Design of the POSTGRES Storage System." VLDB 1987: 289-300

20. Wachter & Reuter: "The ConTract Model." In Database Transaction Models for Advanced Applications, Elmagarmid A. (Ed), Morgan Kaufmann, 1992: 219-263

Distributed Databases

21. Williams, et al., "R*: An Overview of the Architecture." IBM Research Report RJ3325.

22. Lohman & Mackert, "R* Optimizer Validation and Performance Evaluation for Distributed Queries"

23. Mohan, Lindsay & Obermarck, "Transaction Management in the R* Distributed Database Management System" TODS 11(4): 378-396 (1986)

24. Gray, et al., "The Dangers of Replication and a Solution." SIGMOD Conf. 1996: 173-182

25. Stonebraker, et al. "Mariposa: A Wide-Area Distributed Database System." VLDB Journal 5(1): 48-63 (1996)

Parallel Databases

26. DeWitt and Gray, "Parallel Database Systems: The Future of High Performance Database Systems." CACM 35(6): 85-98 (1992)

27. DeWitt, et al. "The Gamma Database Machine Project." TKDE 2(1): 44-62 (1990)

28. Nyberg, et al. "AlphaSort: A Cache-Sensitive Parallel External Sort." VLDB Journal 4(4): 603-627 (1995)

29. Hasan and Motwani: "Coloring Away Communication in Parallel Query Optimization." VLDB 1995: 239-250

Objects and Databases

30. Lamb, et al. "The ObjectStore System." CACM 34(10): 50-63 (1991)

31. Seth J. White, David J. DeWitt: QuickStore: A High Performance Mapped Object Store. VLDB Journal 4(4): 629-673 (1995)

32. Franklin & Carey: "Client-Server Caching Revisited." IWDOM 1992: 57-78

Object-Relational DBs

33. Zaniolo: "The Database Language GEM." SIGMOD Conference 1983: 207-218

34. Stonebraker. "Inclusion of New Types in Relational Data Base Systems." ICDE 1986: 262-269 (Not the exact version)

35. Stonebraker and Kemnitz. "The POSTGRES Next-Generation Database Management System." CACM 34(10): 78-92 (1991)

Data Analysis and Decision Support

36. O'Neil & Quass: "Improved Query Performance with Variant Indexes." SIGMOD Conference 1997: 38-49

37. Gray, et al.: "Data Cube: A Relational Aggregation Operator Generalizing Group-by, Cross-Tab, and Sub Totals." Data Mining and Knowledge Discovery 1(1): 29-53 (1997)

38. Zhao/Deshpande/Naughton. "An Array-Based Algorithm for Simultaneous Multidimensional Aggregates." SIGMOD Conference 1997: 159-170

39. Agrawal and Srikant. "Fast Algorithms for Mining Association Rules." VLDB 1994: 487-499.

40. Hellerstein, Haas & Wang: "Online Aggregation". SIGMOD Conference 1997: 171-182

Benchmarking

41. Anon, et al: "A Measure of Transaction Processing Power." Datamation, 31(7): 112-118 [or some chapter from Gray's benchmarking book]

42. Michael J. Carey, David J. DeWitt, Jeffrey F. Naughton: "The OO7 Benchmark." SIGMOD Conference 1993: 12-21

43. Michael Stonebraker, et al.: "The Sequoia 2000 Benchmark." SIGMOD Conference 1993: 2-11

Vision and Politics

44. Papadimitriou, "Database Metatheory: Asking the Big Queries." PODS 1995: 1-10

45. Silberschatz, et al.: "Database Research: Achievements and Opportunities Into the 21st Century." CACM 34(10): 110-120 (1991)

46. Silberschatz/Zdonik: "Strategic Directions in Database Systems - Breaking Out of the Box." Computing Surveys 28(4): 764-778 (1996)

2/09/2009

Remote Debugging Using Visual Studio

When developing distributed systems or GUI applications, remote debugging is a basic requirement. In this article, I will show how to do remote debugging with VS2005.

1. The need for remote debugging
- for GUI applications, remote debugging helps avoid disturbing the GUI elements while developers are debugging
- for distributed systems, "one box" can't expose all the potential problems and bugs; you HAVE TO deploy them to a REAL distributed environment for testing. When investigating bugs/problems found in this kind of environment, remote debugging is a great facility: it gives you the real distributed context.

2. Software Requirement
To enable remote debugging, you should install the Visual Studio Remote Debugging Monitor, which should be located at: $YourMSVSRootDir\Common7\IDE\Remote Debugger\X64 (for the x64 version).

Some additional components are needed for .NET applications, Web applications and HPC applications. See Remote Debugging Components for detailed information.

3. Security Consideration
When debugging native applications, you should either:
- be the user account that owns the debuggee process
- be a member of the admin group on the remote machine where the debuggee process runs

As for user account issues between the Visual Studio Debugger and the Remote Debugging Monitor, see Remote Debugging Across Domains and Remote Debugging Permission for more information.

4. Attach to remote process
Open Visual Studio -> Tools -> Attach to Process

In the qualifier text box, enter the name of the machine where MSVSMON.exe is running

Click the Refresh button at the bottom-right corner, and all processes on the remote machine will appear. You can now choose the target process and debug it like a local one.

5. Some tips
- the local debugging machine, where the debugger runs, should be the one where the debuggee binaries were built; otherwise, you should configure a source server
- some remote processes may be created/deployed dynamically and run only for a very short time (very common in modern data-intensive computing infrastructure); to get more opportunity to attach to this kind of process, you can add some Sleep(n)/system("pause") statements at the beginning of the entry (main) function.
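
The last tip can be sketched as follows. This is a Python analogue of the Sleep(n) idea, not Visual Studio specific code; the environment variable name WAIT_FOR_DEBUGGER_SECONDS and the function names are assumptions made up for this example. Making the pause opt-in means the delay never slows down normal runs.

```python
import os
import time


def wait_for_debugger(env_var="WAIT_FOR_DEBUGGER_SECONDS", sleep=time.sleep):
    """Pause at startup so a debugger can attach; returns the seconds waited.

    The pause only happens when the (hypothetical) environment variable is
    set to a positive number, so production runs are unaffected.
    """
    seconds = float(os.environ.get(env_var, "0"))
    if seconds > 0:
        print(f"pid {os.getpid()}: waiting {seconds}s for a debugger to attach")
        sleep(seconds)
    return seconds


def main():
    wait_for_debugger()  # first statement of the entry function
    print("real work starts here")


if __name__ == "__main__":
    main()
```

Setting the variable on the remote machine before the process is launched gives you a window to find the process in the Attach to Process dialog.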

Update on 02/11/2009:
- The Microsoft NtDebug Team Blog has a great article on Remote Debugging using WinDbg.

2/07/2009

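An Interesting Cryptography Story

According to Solidot: a young man fell for a girl who studies psychology and enjoys classical cryptography. After he confessed his feelings, he received a piece of ciphertext which she said was the result of five layers of encryption, and whose plaintext was her answer to him. At a loss, he asked for help on Baidu's cryptography forum and got enthusiastic replies from many users......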

Here is the whole decryption process

the original ciphertext:

****-/*----/----*/****-/****-/*----/---**/*----/****-/*----/-****/***--/****-/*----/----*/**---/-****/**---/**---/***--/--***/****-/

Layer 1 - Morse Coding

Rule: standard Morse Decoding

Output: 41 94 41 81 41 63 41 92 62 23 74

Layer 2 - Telephone Keypad Coding
Rule: mn stands for the n-th letter on the m-th key of a telephone keypad (see the ITU E.161 standard and the ISO/IEC 9995 standard)

Output: G Z G T G O G X N C S

Layer 3 - Computer Keyboard Coding

Rule: QWERT -> ABCDE and so on, following the key order of a QWERTY keyboard (see Keyboard Layout and the ISO/IEC 9995 standard)

Output: O T O E O I O U Y V L

Layer 4 - Route Cipher

Rule: a transposition cipher, one of the classical cipher methods, in which the plaintext is written down in row order but read off in column order

first, break it into two lines:
O T O E O I
O U Y V L

then remerge them in another order (top to bottom first, left to right second)

Output: OOTUOYEVOLI

Layer 5 - Inverse Transposition

Rule: read the ciphertext from the end to the beginning

Output:
I LOVE YOU TOO
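
The five layers above can be chained in a short program. The following is a minimal Python sketch of the steps exactly as described in this post, with "*" as dot and "-" as dash; the table and function names are my own choices for illustration.

```python
# Layer 1: Morse code for digits ("*" is a dot, "-" is a dash).
MORSE_DIGITS = {
    "-----": "0", "*----": "1", "**---": "2", "***--": "3", "****-": "4",
    "*****": "5", "-****": "6", "--***": "7", "---**": "8", "----*": "9",
}
# Layer 2: letters printed on each telephone keypad key.
KEYPAD = {"2": "ABC", "3": "DEF", "4": "GHI", "5": "JKL",
          "6": "MNO", "7": "PQRS", "8": "TUV", "9": "WXYZ"}
# Layer 3: the i-th key in QWERTY order stands for the i-th alphabet letter.
QWERTY = "QWERTYUIOPASDFGHJKLZXCVBNM"
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

CIPHERTEXT = ("****-/*----/----*/****-/****-/*----/---**/*----/****-/*----/"
              "-****/***--/****-/*----/----*/**---/-****/**---/**---/***--/"
              "--***/****-/")


def decrypt(ciphertext):
    # Layer 1: Morse groups -> digit string.
    digits = "".join(MORSE_DIGITS[g] for g in ciphertext.strip("/").split("/"))
    # Layer 2: each digit pair "mn" -> n-th letter on key m.
    pairs = [digits[i:i + 2] for i in range(0, len(digits), 2)]
    keypad_letters = [KEYPAD[key][int(n) - 1] for key, n in pairs]
    # Layer 3: keyboard position -> alphabet letter.
    letters = [ALPHABET[QWERTY.index(c)] for c in keypad_letters]
    # Layer 4: split into two rows, then read column by column.
    half = (len(letters) + 1) // 2
    top, bottom = letters[:half], letters[half:]
    merged = []
    for i in range(half):
        merged.append(top[i])
        if i < len(bottom):
            merged.append(bottom[i])
    # Layer 5: read from the end to the beginning.
    return "".join(reversed(merged))


print(decrypt(CIPHERTEXT))  # ILOVEYOUTOO
```

The program prints the plaintext without spaces; the word breaks in "I LOVE YOU TOO" are added by the reader.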

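Judging by the encryption methods, these are all basic substitution and transposition techniques from classical cryptography. But with so many nested layers combined, recovering the final plaintext still takes some effort and a fair amount of trial and guesswork.

Related reports
Original Baidu Tieba thread
Original Solidot article
Report by 布丁通讯
Detailed process described on Baidu Space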

2/06/2009

Deploying and Monitoring in Windows Cluster

When solving data/compute-intensive problems on a large-scale cluster, system deployment and monitoring are important tasks besides software development. In this article, I will show how to accomplish these two tasks in a Windows cluster with no (or very little) software development effort.

First let's clarify the two terms:
Deploying - copying binaries and data (usually from a file server) to remote machines and setting up configuration files
Monitoring - starting/stopping processes on remote machines and getting their status periodically

If performance is critical when deploying, or if your deployment policies are very complicated, you may want to develop your own management software or adopt a mature cluster management tool from an ISV or the open source community.

But here, I focus only on showing you how to accomplish this kind of work using ready-to-use tools and scripts (available from the OS or a well-known community).

[In the following sections, I assume you just have a bare Windows OS installed on the nodes in the cluster. I have a previous article on how to install Windows OS on a large cluster automatically.]

1. How to deploy your data and executable

The simplest and most intuitive solution is to create a shared folder on each machine and copy data/binaries to that folder.
But how do you share a remote folder? You have two choices:
- Use the Windows default shares. You can run the "net share" command to see which folders are shared in a default installation.
- Run the "net share" command on remote machines to share whatever folders you want. PsExec is an ideal tool for running remote commands; this blog has a previous article on how PsExec works.
When the shared folders are ready, you can use copy/xcopy/robocopy to copy your application data and executables to them.

2. How to start/stop/monitor remote executable

Let's look at starting up first. Windows has a great feature that lets you remotely create/start/stop Windows services; try "sc /?" to learn more. You can leverage this mechanism to control remote applications:
- Make all your executables Windows services. This is convenient for management, but needs a lot of coding effort.
- Write one "God" Windows service that starts/stops ordinary local Windows applications according to a configuration file. You then just use the SC command to control this service. Most production solutions adopt this method. You can also add some cool features to this "God" service, for example restarting an application when its termination is detected.
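
The restart-on-termination feature of such a "God" service boils down to a small supervision loop. Here is a minimal Python sketch of that loop only. It is not a Windows service, and the function name supervise and its parameters are assumptions for illustration; a real service would read the command from its configuration file and run until it is told to stop.

```python
import subprocess
import sys
import time


def supervise(command, max_restarts, poll_interval=0.1):
    """Start command and restart it whenever it exits.

    Gives up after max_restarts restarts and returns the total number of
    launches (the first start plus the restarts).
    """
    process = subprocess.Popen(command)
    launches = 1
    while True:
        if process.poll() is None:
            # Still running: check again later.
            time.sleep(poll_interval)
            continue
        if launches > max_restarts:
            return launches
        # Termination detected: start the application again.
        process = subprocess.Popen(command)
        launches += 1
```

For example, supervise([sys.executable, "-c", "pass"], max_restarts=2) launches a child that exits immediately, restarts it twice, and then returns 3.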

But you have an alternative: the great PsExec. It can start remote applications in a graceful manner. But:
- It can only start executables
- It is itself based on the remote service control feature of Windows
- To stop a remote executable, you can use a system tool called taskkill, which comes with Windows

To monitor the status of a remote executable, you can:
- use the tasklist tool provided by Windows
- use the SC command, if the executable is a Windows service
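
When driving these tools from a script, it helps to build the command lines in one place. The sketch below, in Python, only constructs the argument lists for tasklist, taskkill and sc against a remote node; the node, image and service names are placeholders, and actually running the commands (for example with subprocess.run) requires a Windows machine with the right permissions.

```python
def list_command(node, image_name):
    """tasklist on a remote node, filtered to one executable image."""
    return ["tasklist", "/S", node, "/FI", f"IMAGENAME eq {image_name}"]


def kill_command(node, image_name):
    """taskkill on a remote node, selecting processes by image name."""
    return ["taskkill", "/S", node, "/IM", image_name]


def service_query_command(node, service_name):
    """sc query against a remote node (sc expects the \\\\node form)."""
    return ["sc", "\\\\" + node, "query", service_name]


# Placeholder names, for illustration only.
for command in (
    list_command("node01", "worker.exe"),
    kill_command("node01", "worker.exe"),
    service_query_command("node01", "WorkerService"),
):
    print(" ".join(command))
```

Keeping the commands as lists avoids quoting problems when a filter such as "IMAGENAME eq worker.exe" contains spaces.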

3. Security, AD Account and Access Control

Many Windows clusters use a dedicated/independent domain; consequently, you have to use some kind of boundary server to access those clusters.

In my personal experience, do NOT run the above deploy/monitor commands on boundary servers! The account and access control problems are really annoying!
It's better to do these tasks on a machine that is part of the dedicated cluster, so that no user name/password is needed for any of your commands. The only problem is that you may need some extra effort to make your data/executable repository accessible from this machine.

The final tip:
- You may want decent, professional-looking output from your scripts, so colored console output is highly desirable. cecho may help you; it is a colorful version of the echo command.

2/02/2009

How to Read Source Code

Part I - General Steps and Principles

1. Define a Clear Goal
- what's your reading purpose? to know how, to own components, to modify and extend?
- results driven, what are the desired final outcomes?
- just focus on what you want to get

2. Know It as a User
- read the user manual
- get an overall big picture
- know what it can do and what it can't
- know what it is suitable for and what it is not
- try the software or write some application on top of the library

3. Think Before Reading
- what if you were designing the whole system?
- what are the core challenges that are unclear to you?
- write down your questions and concerns
- read with questions in mind

4. Know the Architecture/Components
- know the overall architecture first
- divide the whole system into small components
- identify what to focus on and what to ignore
- use build file to identify component dependencies
- try building it

5. Read Specific Part in Detail
- make a SMART(Specific, Measurable, Achievable, Results-based, Time-specific) plan
- focus on core parts and ignore trivial ones
- identify entry point: main/wmain function
- identify the main loop (server application)
- identify thread creation/termination
- identify core data structure
- identify operations on core data structure
- use typical scenario to figure out how the system really works
- take notes, document and draw charts while reading

6. Producing Results
- big picture from user's perspective
- big picture from dev's perspective
- arch/logic for individual component
- summarize core data structure
- practice: build/deploy/use/debug/modify/hack the system
- comments on the implementation(what's good, what's bad, what learned)

7. Misc Tips
- read doc(user manual, design doc) before code
- get core data structure doc first if possible
- read the code in both static and dynamic(debug) way
- debug/step into code using specific execution scenario
- read code iteratively, don't dive into details at the beginning
- use interface/contract to separate concerns
- overall -> detail, but go into detail only in small areas
- leverage code comprehension tools to get static information
- print out the core code and read it on real paper
- try to write unit test/use case for the software
- consider refactoring the code(kind of active reading) if unit tests are given
- if it's really hard to read, consider rewriting it!

Part II - Tools

One of the most frequent activities when reading code is navigating among various source code files. So tools that help with navigation are very important for improving reading efficiency. Here are some popular tools for this purpose:

1. Source Code Index Generator
cscope http://cscope.sourceforge.net/
ctags http://ctags.sourceforge.net/

2. GUI Frontend for Index Generator
kscope http://kscope.sourceforge.net/
cbrowser http://cbrowser.sourceforge.net

3. Code Index Generating and Navigating
Source Navigator http://sourcenav.sourceforge.net/
Source Insight http://www.sourceinsight.com/
LXR http://lxr.linux.no/

4. For C Language ONLY
CXRef http://www.gedanken.demon.co.uk/cxref/
cflow http://www.gnu.org/software/cflow/
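
To show what an index generator does at its core, here is a toy Python sketch that maps each function or class name to the files and lines where it is defined. It illustrates the idea only and is not how cscope or ctags are implemented; it indexes Python source through the standard ast module, and the file names in the usage example are made up.

```python
import ast


def build_index(sources):
    """Map each function/class name to a list of (filename, line) definitions.

    sources is a dict of filename -> source text.
    """
    index = {}
    for filename, text in sources.items():
        for node in ast.walk(ast.parse(text)):
            if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
                index.setdefault(node.name, []).append((filename, node.lineno))
    return index


# Hypothetical two-file project.
sources = {
    "a.py": "def main():\n    pass\n",
    "b.py": "class Server:\n    def run(self):\n        pass\n",
}
index = build_index(sources)
print(index["main"])  # [('a.py', 1)]
print(index["run"])   # [('b.py', 2)]
```

Jumping to a definition is then a dictionary lookup, which is why navigation stays fast in these tools even on large code bases.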

[Reference]
1. Code Comprehension Tool List - http://www.cs.ubc.ca/~murphy/cs319/index.html
2. Code Doc Generating Tool List - http://www.stack.nl/~dimitri/doxygen/links.html
3. A Survey on Code Comprehension Tools - http://www.grok2.com/code_comprehension.html
4. Tips for Code Reading - http://c2.com/cgi/wiki?TipsForReadingCode
5. Reading vs. Rewriting - http://www.joelonsoftware.com/articles/fog0000000069.html