8/12/2008

On the Failure of our System/Project

How can we say that our system/project succeeded or failed?

It is not easy to define the failure/success criteria of a project. Some junior people think that if our software can run (that is, we can get it up and running), then our project is successful.

First, how do we define "software can run"? What is the feature list? What is the configuration? What is the test workload? What is the performance expectation? What is the code/doc quality?

Second, even if all the questions above can be answered elegantly, if in the end no customer uses your software, can you claim your project is successful?

You may object: if I have good answers to all these hard questions, why won't those stupid customers use our software? But no one on this planet is really stupid, least of all your customers.

The customer may choose not to use your runnable software for many reasons:
1. Your software may be good for others, but not suitable for his specific requirements, such as data diversity or importing policies.
2. There are better alternatives. Customers are free to choose, and they will try their best to investigate the various candidates. If there is a better, proven solution to their problem, why should they choose yours?
3. The trust and relationship between you and your customer. Will you provide enough support if they choose your solution? Will you work on new features if the customer requests them?

Some people may also say that if the project members learned a lot and can now do things that they couldn't do before, then the project is successful. I think this can only be used to judge the value of a project, not its failure/success, because we can always learn from what we have done, whether the project is successful or not.

In my opinion, failure happens when the results can't meet the goals set at the initial stage of the project. What goals to set is another big problem. You may set the goals according to your budget, your competitors' status, market requirements, technical challenges and available resources.

According to the above definition of "failure", I think our project failed because of the big gap between the initial goals and the final outcomes. Admitting failure is good and important, but the more important thing is to ask: why did our system/project fail?

Part I - From the non-technical point of view, the causes are:
1. The lack of clear success criteria is one of the main root causes. Without a clear and proper goal/vision, the productivity and morale of the whole team decrease.
2. Component owners didn't have a shared vision and thus didn't cooperate smoothly, which in turn made the overall system look very strange from the user's point of view.
3. There was no architect position in the team to make technical decisions.
4. Application (internal customer) owners had too much influence and too many decision rights over the design/implementation of the platform.
5. The target applications were not clear, and we had the misplaced ambition to support all kinds of applications.
6. The leadership team delivered the wrong message, and team members thus built the wrong mindset. Many team members didn't think they were doing engineering work and wouldn't do work such as dev testing, writing docs, etc.
7. Team members didn't have the proper expertise, but were not willing to learn or to cooperate.
8. Lack of experienced team members to guide the junior members.
9. Lack of a team culture that makes the team a good environment to work in.
10. Lack of a realistic understanding of our system's weaknesses and strengths. Thus we couldn't win the potential customers' trust.

Part II - From the technical point of view:
1. Lack of proper engineering expertise to guide the development process. We have build engineers, testers, coding standards and triage processes, but we don't use them in a proper and professional way. Most of the time, we just use the proper terms/concepts but practice in a non-engineering way. For example, no one knows what unit tests and code coverage are. They have no functional spec and just ask testers to do ad hoc tests according to the developers' requirements.
2. Lack of domain expertise. Most team members came from a Machine Learning or Data Mining background and don't know much about the systems area. So the resulting solutions are always ad hoc.
3. Lack of clear target scenarios. Large-scale distributed system design is all about trade-offs; without a clear target, it is not feasible to make proper technical decisions.
4. Lack of the analytical thinking capability needed to make the right decisions. For example, some designs use a reliable distributed file system to temporarily create/delete a huge number of small files. We should have a clear understanding of what each component does well and what it does badly.
5. Unrealistic technical goals. If we want to do all things well, we will do nothing well in the end.
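
To make the unit test point in item 1 of Part II concrete, here is a minimal sketch in Python using the standard unittest module. The function split_into_batches is a hypothetical helper invented only for this illustration; it is not code from our system. The idea is simply that each small piece of behavior gets its own automated check instead of an ad hoc manual test.

```python
import unittest


def split_into_batches(items, batch_size):
    """Hypothetical helper: split a list into consecutive batches
    of at most batch_size elements each."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


class SplitIntoBatchesTest(unittest.TestCase):
    """Each test method checks one specific, documented behavior."""

    def test_even_split(self):
        self.assertEqual(split_into_batches([1, 2, 3, 4], 2), [[1, 2], [3, 4]])

    def test_last_batch_may_be_short(self):
        self.assertEqual(split_into_batches([1, 2, 3], 2), [[1, 2], [3]])

    def test_empty_input(self):
        self.assertEqual(split_into_batches([], 3), [])

    def test_rejects_non_positive_batch_size(self):
        # The error branch is tested too; otherwise it stays uncovered.
        with self.assertRaises(ValueError):
            split_into_batches([1], 0)
```

Saved as a module, these tests can be run with "python -m unittest". Code coverage then measures which lines the tests actually executed; for example, assuming the third-party coverage.py tool is installed, "coverage run -m unittest" followed by "coverage report" prints a per-file summary.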
