5/26/2012

The evolution of QZone architecture

Problem Scale Today
- 550M active users
- Tens of millions of peak online users
- Billion-scale daily page views
- Petabyte-scale UGC data
- 100B daily requests
Qzone 1.0 – 3.0 (0 ~ 1M online users, 2004 – 2006)
  • Architecture
    • Dedicated Windows client (embedded HTML)
    • Apache + Cache + MySQL
      * The app/CGI calls different data services to compose a result page for each user request (sketch after this list)
    • One service cluster per ISP
      * Users from Telecom/Netcom are served by separate dedicated servers
      * The app calls data services within the same ISP
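A minimal sketch of the App/CGI composition above, assuming a simple fan-out to per-feature data services; the service names and the fetch layer are hypothetical stand-ins, not Qzone's actual interfaces:

```python
# Hypothetical sketch of the v1-era request flow: the CGI fans out to
# several data services and "cooks" one HTML page from the pieces.
def fetch(service: str, uid: int) -> dict:
    # Stand-in for an RPC to a per-ISP data service (profile, blog, photo, ...).
    return {"service": service, "data": f"<{service} of user {uid}>"}

def render_main_page(uid: int) -> str:
    # One user request -> several data-service calls -> one composed page.
    parts = [fetch(s, uid) for s in ("profile", "blog", "photo", "music")]
    body = "\n".join(f"<div>{p['data']}</div>" for p in parts)
    return f"<html><body>\n{body}\n</body></html>"

print(render_main_page(12345))
```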
  • Problems (v1/v2)
    • Special client -> hard to debug
    • Web server is not scalable
    • 30~40 nodes, topping out around 500K online users
  • Solution (v3)
    • Rich client
      • Move some logic from the server to the client
      • The client is AJAX-based, so server logic is simplified
    • Dynamic/Static separation
      • Static data is hosted by qHttpd, a lightweight web server
      • 100x performance improvement
    • Web server optimization
      • Replace Apache with qzHttp for the dynamic logic
      • 3x performance improvement
    • Main page caching
      • Staticize and cache the elements of the main page
      • Elements are updated periodically or on demand (see the sketch below)
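A rough sketch of the main-page element cache described above, assuming a TTL for the periodic refresh and an explicit invalidation hook for the on-demand path; the cache shape and key scheme are my assumptions:

```python
import time

class ElementCache:
    """TTL cache for staticized page elements (shape is an assumption)."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (rendered_html, expires_at)

    def get(self, key: str, render):
        entry = self.store.get(key)
        if entry and entry[1] > time.monotonic():
            return entry[0]                   # fresh: serve the cached element
        html = render()                       # expired: re-staticize ("periodic")
        self.store[key] = (html, time.monotonic() + self.ttl)
        return html

    def invalidate(self, key: str):
        self.store.pop(key, None)             # "on-demand" update path

cache = ElementCache(ttl_seconds=60)
html = cache.get("uid:42:bloglist", lambda: "<ul><li>post 1</li></ul>")
cache.invalidate("uid:42:bloglist")           # e.g. user just posted a new blog
```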
Qzone 4.0-5.0 (1M ~ 10M online users)
  • ISP separation problem: dynamic data
    • All dynamic services are hosted within one ISP
    • Other ISPs work as proxies that call into these services
      • Dedicated network connection between the proxies and the services
    • Users in other ISPs never call the services directly; everything goes through the proxy (sketch below)
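A sketch of the dynamic-data routing rule above, with the ISP names and transport layer as illustrative assumptions:

```python
HOME_ISP = "telecom"  # the single ISP hosting all dynamic services

def call_dynamic_service(request_isp: str, service: str, payload: dict) -> dict:
    if request_isp == HOME_ISP:
        return invoke_local(service, payload)
    # Other ISPs act as proxies over the dedicated inter-ISP connection.
    return forward_over_dedicated_link(HOME_ISP, service, payload)

def invoke_local(service: str, payload: dict) -> dict:
    return {"service": service, "via": "local", **payload}

def forward_over_dedicated_link(target_isp: str, service: str, payload: dict) -> dict:
    return {"service": service, "via": f"proxy->{target_isp}", **payload}

print(call_dynamic_service("netcom", "feed", {"uid": 7}))
```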
  • ISP separation problem: static data
    • Static:dynamic traffic is roughly 10:1, so adopt a CDN solution
    • Redirect static requests to ISP-specific static data servers based on client IP
      • Done by Qzone app logic using the client's IP info (sketch below)
      • Previously DNS-based redirection was used, which caused lots of problems
      • Root cause: users' local DNS settings often didn't match their actual ISP
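A sketch of the app-level redirection above, mapping the client IP to an ISP and rewriting the static host; the IP ranges and hostnames are invented for illustration:

```python
import ipaddress

# Invented IP ranges and hostnames; real routing tables are far larger.
ISP_RANGES = {
    "telecom": ipaddress.ip_network("202.96.0.0/12"),
    "netcom": ipaddress.ip_network("218.104.0.0/13"),
}
STATIC_HOSTS = {
    "telecom": "img.telecom.qzone.example",
    "netcom": "img.netcom.qzone.example",
}

def static_host_for(client_ip: str, default: str = "img.qzone.example") -> str:
    # App logic picks the static host from the client's real IP, so a
    # misconfigured local DNS can no longer send users to the wrong ISP.
    ip = ipaddress.ip_address(client_ip)
    for isp, net in ISP_RANGES.items():
        if ip in net:
            return STATIC_HOSTS[isp]
    return default

print(static_host_for("202.101.3.4"))  # -> img.telecom.qzone.example
```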
  • Improve user experience
    • Improve critical services' availability
      • Replicate core services
    • Provide lossy service for non-critical services
      • Skip a service if it times out
      • Or fall back to a default value on failure/timeout
    • Fault-tolerant design from backend services down to the client script
      • Default values at the client
      • LVS in front of the Qzone web servers
      • L5 (F5?) for internal critical services
    • Control the timeout budget for whole-request processing (sketch after this list)
      • A kind of real-time scheduling algorithm
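A sketch combining the lossy-service and timeout-budget ideas above: one deadline for the whole request, each backend call gets only the remaining time, and non-critical services degrade to defaults. The budgets, names, and retry policy are assumptions, not Qzone's actual scheduler:

```python
import time

def handle_request(services, total_budget_s=0.5):
    # One deadline for the whole request (the "real-time scheduling" idea).
    deadline = time.monotonic() + total_budget_s
    results = {}
    for name, call, critical, default in services:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            if critical:
                raise TimeoutError(f"no time left for critical service {name}")
            results[name] = default          # lossy: skip, use default value
            continue
        try:
            # Each backend call only gets the time that is left.
            results[name] = call(timeout=remaining)
        except TimeoutError:
            if critical:
                raise                        # or retry against a replica
            results[name] = default          # lossy: degrade on timeout
    return results

def profile(timeout):
    return {"name": "alice"}                 # critical, fast

def ads(timeout):
    raise TimeoutError                       # simulate a slow non-critical call

print(handle_request([
    ("profile", profile, True, None),
    ("ads", ads, False, {}),                 # degrades to {} instead of failing
]))
```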
  • Incremental Release
    • Release new features to end users, expanding from a small scope to a larger one (gating sketch after this list)
    • Team internal dogfood
    • Whitelist (invited) user test
    • Company wide dogfood
    • VIP external users
    • Roll out globally
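A toy feature-gate walking through the rollout stages listed above; the user attributes and hash-based ramp are assumptions for illustration:

```python
import hashlib

STAGES = ["team", "whitelist", "company", "vip", "global"]

def enabled(feature: str, user: dict, stage: str) -> bool:
    idx = STAGES.index(stage)
    if user.get("is_team"):                       # team-internal dogfood
        return True
    if user.get("whitelisted") and idx >= 1:      # invited user test
        return True
    if user.get("is_employee") and idx >= 2:      # company-wide dogfood
        return True
    if user.get("is_vip") and idx >= 3:           # VIP external users
        return True
    if stage == "global":
        # Optional hash-based percentage ramp within the global stage.
        h = int(hashlib.md5(f"{feature}:{user['uid']}".encode()).hexdigest(), 16)
        return h % 100 < 100                      # 100 = fully rolled out
    return False

print(enabled("new-feed", {"uid": 9, "is_vip": True}, stage="vip"))  # True
```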
Qzone 6.0+ (~100M online users)
  • Open platform
    • App/Platform separation
    • iframe-based app model (sketch below)
    • Apps' dev/test/deploy is fully separated from the Qzone platform
    • Separation of concerns and parallel evolution paths
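One common way an iframe app model can pass identity to a third-party app is a signed canvas URL; this is a generic sketch of that pattern, not necessarily Qzone's actual OpenAPI protocol:

```python
import hashlib
import hmac
import time

def iframe_src(app_url: str, uid: int, app_secret: bytes) -> str:
    # The platform signs the user's identity; the app verifies the HMAC
    # with its shared secret instead of trusting the query string blindly.
    payload = f"uid={uid}&ts={int(time.time())}"
    sig = hmac.new(app_secret, payload.encode(), hashlib.sha256).hexdigest()
    return f"{app_url}?{payload}&sig={sig}"

# The platform page would then embed: <iframe src="..."></iframe>
print(iframe_src("https://app.example.com/canvas", 42, b"shared-secret"))
```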
  • GEO replication – handle IDC failure (sketch below)
    • One IDC for write
    • Multiple IDCs for read
    • Dedicated synchronization protocol
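A toy model of the one-writer GEO scheme above; the synchronous loop stands in for the dedicated (in reality asynchronous) replication protocol, and the IDC names are made up:

```python
WRITE_IDC = "idc-a"
ALL_IDCS = ["idc-a", "idc-b", "idc-c"]
stores = {idc: {} for idc in ALL_IDCS}

def write(key, value):
    stores[WRITE_IDC][key] = value           # all writes hit the one write IDC
    for idc in ALL_IDCS:                     # stand-in for the sync protocol
        if idc != WRITE_IDC:
            stores[idc][key] = value         # asynchronous in reality

def read(key, local_idc):
    return stores[local_idc].get(key)        # reads are served locally

write("uid:42:mood", "happy")
print(read("uid:42:mood", "idc-b"))          # -> "happy"
```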
  • Monitoring
    • Bandwidth/latency/error monitoring
    • Problem localization: pinpointing the failing hop (sketch below)
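A minimal sketch of per-edge monitoring that supports problem localization: record latency and errors per (caller, callee) hop and flag outliers. The metric names and thresholds are illustrative assumptions:

```python
from collections import defaultdict

stats = defaultdict(lambda: {"calls": 0, "errors": 0, "lat_ms": 0.0})

def record(edge: str, latency_ms: float, ok: bool) -> None:
    s = stats[edge]
    s["calls"] += 1
    s["lat_ms"] += latency_ms
    s["errors"] += 0 if ok else 1

record("web->feed", 12.0, ok=True)
record("feed->mysql", 250.0, ok=False)

for edge, s in stats.items():
    avg = s["lat_ms"] / s["calls"]
    err = s["errors"] / s["calls"]
    flag = "  <-- investigate" if err > 0.01 or avg > 100 else ""
    print(f"{edge}: avg={avg:.0f}ms err={err:.0%}{flag}")
```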
Comments
  1. All content is very general; not many details
  2. The core problem is not touched: how to scale, i.e. how to partition such large-scale data
  3. A single write IDC will cause availability problems in a disaster, unless reconfiguring the write role to another IDC is supported