Autopilot: Automatic Data Center Management

Managing a large data center automatically, with little human involvement, has always been a challenging task. Industry giants such as Google and Microsoft are pioneers in this area, and very little information has been released about how they handle the problem. But in 2007, Michael Isard of Microsoft Research wrote a paper entitled Autopilot: Automatic Data Center Management, which describes the technology that the Windows Live and Live Search services use to manage their server farms. This is a great opportunity to look at how an industry giant manages tens of thousands of machines with software.
Design Principles
- Fault tolerance: any component can fail at any time, so the system must be reliable enough to continue automatically with some proportion of its computers powered down or misbehaving.
- Simplicity: simplicity is as important as fault tolerance when building a large-scale, reliable, maintainable system. Avoid unnecessary optimization and unnecessary generality.
Datacenter layout
A typical application rack might contain 20 identical multi-core computers, each with 4 direct-attached hard drives. Also in the rack is a simple switch that lets the computers communicate locally with the other computers in the rack and, via a switch hierarchy, with the rest of the data center.
Finally, each computer has a management interface, either built into the server design or accessed via a rack-mounted serial concentrator.
The set of computers managed by a single instance of Autopilot is called a cluster.
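The layout above can be sketched as a small data model. This is purely illustrative: the class and field names are mine, not Autopilot's, and the core count is an assumption (the paper only says "multi-core").

```python
from dataclasses import dataclass, field

@dataclass
class Machine:
    name: str
    cores: int = 8          # "multi-core"; the exact count is an assumption
    disks: int = 4          # 4 direct-attached hard drives per machine

@dataclass
class Rack:
    switch: str             # in-rack switch connecting the local machines
    machines: list = field(default_factory=list)

@dataclass
class Cluster:
    """The set of computers managed by a single Autopilot instance."""
    racks: list = field(default_factory=list)

    def machine_count(self):
        return sum(len(r.machines) for r in self.racks)

# Example: 3 racks of 20 identical machines each.
cluster = Cluster(racks=[
    Rack(switch=f"sw-{i}", machines=[Machine(f"m-{i}-{j}") for j in range(20)])
    for i in range(3)
])
print(cluster.machine_count())  # 60
```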
Autopilot architecture
Autopilot consists of three subsystems:
- Hardware management: maintains machine/switch/router state, repairs errors automatically, provisions the OS, etc.
- Deployment: automatically deploys applications and data to specified machines in the data center.
- Monitoring: monitors the state of devices and services inside the data center, collects performance counters, and displays them in a user-friendly UI.
Hardware Management
- The Device Manager's main responsibility: it maintains replicated state for every device in the data center.
- It decides when to reboot, re-image, or retire a physical machine, switch, or router.
- It periodically discovers new machines through the special management interface, either built into the server design or accessed via a rack-mounted serial concentrator.
- It automates the OS installation process through the Provisioning Service.
- It automates the error repair process through the Repair Service.
- It collects device state from the various Watchdog Services.
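The repair decisions above can be sketched as a simple escalation loop: on repeated failures of the same machine, try progressively stronger actions (reboot, then re-image, then retire). The escalation order comes from the text; the bookkeeping around it is my assumption, and the real Device Manager keeps this state replicated rather than in a local dict.

```python
# Progressively stronger repair actions, as described in the text.
REPAIR_ACTIONS = ["reboot", "re-image", "retire"]

class DeviceManager:
    def __init__(self):
        # Replicated state in real Autopilot; a plain dict here for illustration.
        self.failure_counts = {}

    def report_failure(self, machine):
        """Record a watchdog-reported failure and pick the next repair action."""
        n = self.failure_counts.get(machine, 0)
        self.failure_counts[machine] = n + 1
        # Escalate one step per repeated failure, capped at "retire".
        return REPAIR_ACTIONS[min(n, len(REPAIR_ACTIONS) - 1)]

    def mark_healthy(self, machine):
        """Reset the failure history once the machine passes its probes again."""
        self.failure_counts.pop(machine, None)

dm = DeviceManager()
print(dm.report_failure("m-0-3"))  # reboot
print(dm.report_failure("m-0-3"))  # re-image
print(dm.report_failure("m-0-3"))  # retire
```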
- Each machine is assigned a machine function, which indicates what role it plays and what kind of services will run on it.
- Each machine is also assigned to a scale unit, a collection of machines that serves as the unit of application/OS updates.
- Each machine is responsible for running a list of application/Autopilot services, and this list is stored as a service manifest file. Multiple versions of the manifest file can be stored on a machine, but only one is active; the others are kept so the machine can switch versions or roll back when an upgrade fails.
- The Device Manager maintains the manifest file list of each machine in the cluster, along with its corresponding active version.
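The keep-several-versions, one-active scheme can be sketched like this. The version labels, file names, and field names are all illustrative assumptions; the paper does not specify the manifest format.

```python
# Several manifest versions kept on disk, exactly one active; older
# versions are retained so the machine can roll back a failed upgrade.
manifests = {
    "v12": {"files": ["search.exe", "index.dat"], "active": False},
    "v13": {"files": ["search.exe", "index2.dat"], "active": True},
}

def rollback(manifests, to_version):
    """Deactivate the current manifest and reactivate an older version."""
    for m in manifests.values():
        m["active"] = False
    manifests[to_version]["active"] = True

# An upgrade to v13 failed, so roll back to v12.
rollback(manifests, "v12")
active = [v for v, m in manifests.items() if m["active"]]
print(active)  # ['v12']
```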
- The Deployment Service is a multi-node service that stores all of the application/data files listed in the service manifests. These files are synced from the external build system.
- An Autopilot operator triggers a new code deployment with a single command to the Device Manager. The DM then updates the service manifests of the specified machines and prompts each one to start syncing bits from the Deployment Service. Each machine syncs the manifest file, downloads the specified applications/data to local disk, and starts them.
- In the normal case, each machine periodically queries the DM about which manifests should be on its local disk, and fetches any missing files from the Deployment Service.
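That periodic pull loop can be sketched as follows: the machine compares the Device Manager's view of what it should hold against local state, and fetches anything missing from the Deployment Service. The function and variable names are mine, and dicts stand in for the real file stores.

```python
def sync_machine(local_manifests, device_manager_view, deployment_service):
    """Fetch whatever the DM says should be local but isn't.

    Returns the list of manifests fetched on this round, so a quiet
    (already in-sync) machine returns an empty list.
    """
    fetched = []
    for manifest in device_manager_view:
        if manifest not in local_manifests:
            local_manifests[manifest] = deployment_service[manifest]
            fetched.append(manifest)
    return fetched

deployment_service = {"v13": ["search.exe", "index2.dat"]}
local = {}
dm_view = ["v13"]
print(sync_machine(local, dm_view, deployment_service))  # ['v13']
print(sync_machine(local, dm_view, deployment_service))  # []
```

Because each round is idempotent, a machine that missed a deployment (for example, one that was down) converges on the correct state the next time it polls.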
- Watchdogs constantly probe the status of services/machines and report back to the Device Manager. Autopilot provides some system-wide watchdogs, but application developers can build their own, as long as those services know how to talk to the DM about device status.
- Performance counters are used to record the instantaneous state of components, for example a time-weighted average of the number of requests per second being processed by a particular server.
- The Collection Service forms a distributed collection and aggregation tree for performance counters. It can generate a centralized view of the current state of the cluster's performance counters with a latency of a few seconds.
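A minimal sketch of such an aggregation tree: leaves hold per-machine counter values, inner nodes aggregate their children, and the root yields the cluster-wide total. The nested-list topology and the choice of sum as the aggregation are my assumptions for illustration.

```python
def aggregate(node):
    """Recursively roll counter values up the collection tree."""
    if isinstance(node, (int, float)):   # leaf: one machine's counter value
        return node
    return sum(aggregate(child) for child in node)

# requests/sec per machine, grouped by rack
tree = [
    [120, 95, 110],   # rack 0
    [130, 105],       # rack 1
]
print(aggregate(tree))  # 560
```

Aggregating at each level keeps the fan-in at the root small, which is what lets a centralized view stay only a few seconds behind the live cluster.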
- All collected information is stored in a central SQL Server for fast, complex querying by end users. The data is exposed to application developers and operators through an HTTP-based service called the Cockpit service.
- Besides a global view of the data center's status, Cockpit is also responsible for access to some resources (for example, application/data/log files).
- Predefined status queries and abnormal results are combined to form an Alert Service, which can send email or even place phone calls when critical situations happen.
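The alerting idea can be sketched as a table of predefined queries evaluated against a counter snapshot, each mapped to a notification channel by severity. The query names, thresholds, and channels here are all illustrative assumptions, not Autopilot's actual rules.

```python
ALERT_QUERIES = [
    # (name, predicate over the counter snapshot, channel)
    ("high_error_rate", lambda c: c.get("errors_per_sec", 0) > 100, "phone"),
    ("low_free_disk",   lambda c: c.get("free_disk_pct", 100) < 10, "email"),
]

def evaluate_alerts(counters):
    """Return (name, channel) for every predefined query whose condition holds."""
    return [(name, chan) for name, pred, chan in ALERT_QUERIES if pred(counters)]

snapshot = {"errors_per_sec": 250, "free_disk_pct": 40}
print(evaluate_alerts(snapshot))  # [('high_error_rate', 'phone')]
```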