8/12/2008

On the Failure of our System/Project

How can we say that our system/project succeeded or failed?

It is not easy to define the failure/success criteria of a project. Some junior people always think that if our software can run ("get it up and running"), then our project is successful.

First, how do we define "the software can run"? What is the feature list? What is the configuration? What is the test workload? What is the performance expectation? What is the code/doc quality?
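One way to make "the software can run" concrete is to write each of these questions down as an explicit, checkable threshold. The following is only a minimal sketch: the criterion names, targets and measured values are all hypothetical, chosen for illustration and not taken from our project.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One measurable success criterion (all names and numbers are hypothetical)."""
    name: str
    target: float
    measured: float
    higher_is_better: bool = True

    def met(self) -> bool:
        # For metrics such as latency, a lower measured value is better.
        if self.higher_is_better:
            return self.measured >= self.target
        return self.measured <= self.target


def unmet(criteria):
    """Return the names of the criteria that are not met."""
    return [c.name for c in criteria if not c.met()]


# Illustrative values only.
criteria = [
    Criterion("features_delivered", target=20, measured=20),
    Criterion("test_workload_pass_rate", target=0.99, measured=0.95),
    Criterion("p95_latency_ms", target=200, measured=180, higher_is_better=False),
    Criterion("line_coverage", target=0.80, measured=0.85),
]

failed = unmet(criteria)
print("runnable" if not failed else "not yet: " + ", ".join(failed))
```

With a list like this agreed at the start, "it runs" stops being a matter of opinion: either every criterion is met or the unmet ones are named.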

Second, even if all the questions above can be answered well, if in the end no customer uses your software, can you claim that your project is successful?

You may object: if I have good answers to all these hard questions, why won't those stupid customers use our software? But nobody on this planet is really stupid, least of all your customers.

A customer may choose not to use your runnable software for many reasons:
1. Your software may be good for others, but not suitable for his specific requirements, such as data diversity or importing policies.
2. There are better alternatives. Customers are free to choose, and they will do their best to investigate the various candidates. If there is a better, proven solution to their problem, why should they choose yours?
3. The trust and relationship between you and your customer. Will you provide enough support if they choose your solution? Will you work on new feature requests if the customer proposes them?

Some people may also say that if the project members learned a lot and can now do things that they couldn't do before, then the project is successful. I think this can only be used to judge the value of a project, not its failure/success, because we can always learn from what we have done, whether the project is successful or not.

In my opinion, failure happens when the results can't meet the goals set at the initial stage of the project. What goals to set is another big problem. You may set the goals according to your budget, your competitors' status, market requirements, technical challenges and available resources.

According to the above definition of "failure", I think our project failed because of the big gap between the initial goals and the final outcomes. Admitting failure is good and important, but the more important thing is to ask: why did our system/project fail?

Part I - From the non-technical point of view, the causes are:
1. Having no clear success criteria is one of the main root causes. The lack of a clear and proper goal/vision decreased the productivity and morale of the whole team.
2. Component owners didn't have a shared vision and thus didn't cooperate smoothly, which in turn made the overall system look very strange from the user's point of view.
3. There was no architect position in the team to make technical decisions.
4. Application (internal customer) owners had too much influence and too many decision rights over the design/implementation of the platform.
5. The target applications were not clear, and we had the misguided ambition to support all kinds of applications.
6. The leadership team delivered the wrong message, and team members thus built the wrong mindset. Many team members didn't think they were doing engineering work and wouldn't do work such as dev testing or writing docs.
7. Team members didn't have the proper expertise, but were not willing to learn or to cooperate.
8. Lack of experienced team members to guide the junior members.
9. Lack of a team culture that makes the team a good environment to work in.
10. Lack of a realistic understanding of our system's weaknesses and strengths. Thus we couldn't win the trust of potential customers.

Part II - From the technical point of view:
1. Lack of proper engineering expertise to guide the development process. We have build engineers, testers, coding standards and triage processes, but we don't use them in a proper and professional way. Most of the time, we just use the proper terms/concepts but practice in a non-engineering way. For example, no one knows what unit tests and code coverage are. They have no functional spec and just ask testers to do ad hoc tests according to the developers' requirements.
2. Lack of domain expertise. Most team members came from a machine learning or data mining background and don't know much about the systems area. So the resulting solutions are always ad hoc.
3. Lack of clear target scenarios. Large-scale distributed system design is all about trade-offs; without a clear target, it is not feasible to make proper technical decisions.
4. Lack of the analytical thinking needed to make the right decisions. For example, some designs use a reliable distributed file system to temporarily create and delete a huge number of small files. We should have a clear understanding of what each component does well and what it does badly.
5. Unrealistic technical goals. If we want to do all things well, we will do nothing well in the end.
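
For readers who have not met the terms in item 1: a unit test checks one small piece of code in isolation against its expected behavior, and code coverage measures how much of the code the tests actually execute. Below is a minimal sketch using Python's standard unittest module. The function split_into_batches is a hypothetical example, not code from our system.

```python
import unittest


def split_into_batches(items, batch_size):
    """Hypothetical helper: split a list into consecutive batches."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


class SplitIntoBatchesTest(unittest.TestCase):
    def test_even_split(self):
        self.assertEqual(split_into_batches([1, 2, 3, 4], 2), [[1, 2], [3, 4]])

    def test_last_batch_may_be_smaller(self):
        self.assertEqual(split_into_batches([1, 2, 3], 2), [[1, 2], [3]])

    def test_empty_input(self):
        self.assertEqual(split_into_batches([], 3), [])

    def test_rejects_non_positive_batch_size(self):
        with self.assertRaises(ValueError):
            split_into_batches([1], 0)


def run_tests():
    """Run the test case above and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(SplitIntoBatchesTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    run_tests()
```

Here the four tests exercise both the normal path and the error path of the function. Coverage itself would be measured by a separate tool such as coverage.py.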

8/01/2008

On the Failure of our Team

About half a year ago, I joined a big cloud infrastructure team (15+ people). During that time, I learned a lot from the things that we did not handle well. In this article, I will try to summarize what caused the team management problems and what we could do better if we had the chance to do it again.

What made us fail?
1. Team Culture
Having no culture is itself a kind of culture.
No trust, no cooperation.
Respect individual team members and domain experts.
Software development is an art, but also a scientific activity.

2. Team Vision and Goal
Define the goal of our work and project. How do we define the success criteria? Many times, people say we should make this component runnable ("get the system up and running"). But how do we define "run"? What test cases must pass? What code coverage must be reached? What user scenarios must pass? What is the performance goal? What is the scalability goal? Are we doing research or engineering? If engineering, take engineering excellence seriously. If research, what is your uniqueness/innovation?

3. Team Morale
Salary is not the only way to boost team morale, and sometimes it is not the most effective way. Other ways: team building activities (lunch together, live-action CS games, hiking together), an exciting project vision, a comfortable team environment with simple relationships among colleagues, and training courses.

4. Team Member Diversity
Roles: Architect, Developer, Test
Background: Junior, Senior, Experienced Leader

5. Team Leadership
Shield team members from outside interference.
Give ownership to team members; they are not just coders.

6. Cross Team Cooperation
Platform and application teams should be partners: don't challenge each other, don't be aggressive, just work together.

7. The root cause of all the problems lies with some leaders. Some people never do any risk analysis or estimate the challenges. Such unrealistic characters have a very bad influence on the project and the whole team.

Things Learned from the Failure:

1. Cooperation: be open to partners and to owners of other domain expertise. Learning from others and leveraging existing knowledge will reduce our risk and cost.

2. Personal development: encourage team members to think, to dream, to be open, to be passionate and to grow. Only in this way can the team be stable and grow. When team members grow, the reputation, confidence and capability of your team grow.

3. Control your ambition: "Dreaming with lofty aspirations is certainly admirable, but one must act with one's feet on the ground." Daring to dream is always a good thing, but we should be realistic and approach our goals step by step. We must be self-aware and know our strengths and limitations.

4. Good mindset: Cleverness is important, but not that important. Most projects fail not because of the IQ of the team members, but because of chaos in team/project management. For a software project to succeed, I think all team members should keep the following attitudes in mind:
a. Be open and respectful, among team members and with our partners.
b. Pursue engineering excellence and personal excellence.
c. Be passionate about technologies and learning.
d. Value teamwork and work well with others.

How to improve team morale?
1. Be open and respectful among team members, and between staff and leadership
2. Trust, passion
3. Exciting project/team vision/goal
4. Be willing to dream and to make dreams come true
5. Team building
6. Do important things and make a big impact
7. Make team members feel that their work is of great importance
8. Grant ownership of small components to individuals; don't treat any member as yet another coder

How to build a great team?
1. Build team uniqueness
2. Build domain experts
3. Help team members to think and grow
4. Build a learning organization
5. Set a proper bar for team membership
6. Control the team's ambition; build team reputation/confidence gradually
7. Keep a proper number of new hires to keep the team fresh
8. Communication effectiveness (don't treat people as machines; they need to talk with people, not just with those cold, boring, harmful, annoying machines)

How to be a great team member?
1. Identify your uniqueness and improve it
2. Accumulate domain expertise
3. Improve time/project management skills
4. Learn communication skills
5. Build your personal insight
6. Build great social relationships among colleagues