8/05/2010

Lecture Notes - Evolution of Google Search Engine

Jeff Dean gave a keynote Building Large Scale Information Retrieval Systems at WSDM 2009. It's actually a presentation on how Google search engine evolves during the past 10 years. Here are my notes for this lecture.

Part I - Overview of Search Engine Evolution: 1999 VS 2009

Factors to Consider when Designing a Information Retrieval System:
1. Corpus Size(# docs to be indexed)
2. QPS(Query Per Second)
3. Freshness/Update Rate
4. Query Latency
5. Complexity/Cost of Scoring/Retrieval Algorithm

Parameter Change 1999 -> 2009:
1. Corpus Size: 70M -> *B ~100X
2. QPS: ~1000X
3. Refresh: Months -> Minutes ~10000X
4. Latency: <1s> <02 .s="" style="font-style: italic; font-weight: bold;">~5X
5. Machine Scale: ~1000X

Consider 10x Growth when designing, Rewrite for 100x Growth!

Part II - Evolution of Google Search Engine

~1997 - Circa, Research Prototype

- Simple Architecture and Focus on System Distributing/Partitioning
- Term vs Doc based Partition: Doc based Win
- Disk Based Index, DocID+Posting List with Position Attributes, Byte Aligned Encoding


~1999 - Circa, Production

- Introduced Cache
-- hit rate is low 30~60% due to index refresh and long tail query
-- very beneficial, reduce large disk i/o
-- hot term first priority to cache, hot and costy request

- Replica Index Data
-- better performance
-- better availability

Some Summary in late 1990's:

Crawler is simple batch system
- start with very few urls
- queue it when found new urls
- stop when have enough docs

Index Serving using cheap machine
- no failure handling
- added record/chunk checksum

Index Update
- once/month
- wait traffic to low -> take replica offline -> do update -> start serving


~2000 - Dealing with Growth

Situation:

- doc size:50m -> 1000m
- ~20% query traffic increase/month
- Yahoo! deal

Solution:

add machines constantly
- more index shards for larger index size
- more index replica for bigger query capacity

And improve software constantly
- better disk scheduling
- better index encoding


~2001 - Adding In-Memory Index

- enough machine memory: holding all index in mem
- machine function: replica -> micro shard holding
- balancer: cordinator
- availability: replicate important docs



~2004 - Adding Infrastructure

- Generalize tree structured query flow
- Generalize balancer concept
- New index encoding: group varint encoding
- GFS appear in production


~2007 - Universal Search
- Universal search: combine results from multiple vertical corpus
- Realtime search: fast url finding -> crawling -> indexing -> serving cycle
- Experiment supporting: have idea -> try it on real data offline and tune -> live experiment on small piece -> roll out and launch

Part III - Future Trends

- Cross Language Information Retrieval
- ACL in large IR system with huge amount of user and dynamic requirement
- Automatic Construction of Efficient IR system (one bin for realtime and regular web index with different parameter configuration)
- Info extraction from semi-structured data

[Reference]
1. Challenges in Building Large Scale Information Retrieval Systems[PDF, Video]
2. Notes by Another Blogger - http://www.searchenginecaffe.com/2009/02/jeffrey-dean-wsdm-keynote-building.html