9/11/2011

VLDB 2011 Trip Report

I attended VLDB 2011 during 29/08 ~ 01/09 in Seattle. Here are some brief observations and notes from the conference. Due to VLDB's broad area coverage, I focus only on system, search, and transaction related material.
(Disclaimer: given the long paper list and my lack of a strong DB background, this report may contain misunderstandings and personal biases; feel free to follow up and comment.)

=Basic Info=

VLDB is one of the top conferences in the database community (the others being SIGMOD and ICDE), focusing on data management and systems for managing data. For VLDB 2011:
- 30 research sessions with 104 papers and 4 industrial sessions with 12 papers, presented in 5 parallel tracks
- 8 out of 104 research papers were contributed by Microsoft people
- 31 out of 104 research papers were first-authored by Chinese people (domestic + overseas)
o Mainland: 4
o Hong Kong: 4
o Singapore: 6
o Overseas: 17
Industrial Participation:
- Microsoft hosted the reception dinner on 30/08
- Google/Facebook/EMC had their recruiting/advertising booths at the conference site
Best Paper:
RemusDB: Transparent High Availability for Database Systems

=Topics and Trends=

Hot Topics:
- Graph and Social Data Management: 5 sessions (17 papers), 2 tutorials
- Big Data Analysis and Infrastructure: 3+ sessions (12+ papers), 1 tutorial
- Streaming and Real-time Data Analysis
Traditional system topics for DBMS:
The Query Processing session covered:
o Partitioning the storage and query processing of a native XML database
o Using a new GroupJoin operator to speed up GroupBy and Join queries
o Optimizing similarity joins using locality-sensitive hashing
The Transaction Processing session covered:
o Scaling an OLTP system on a shared-everything architecture using logical + physical partitioning
o Recovery algorithm implementation and optimization in a DBMS where data management and transactional functionality are separated
o New transaction semantics and isolation-level definitions for cooperating transactions
o Hyder’s optimistic concurrency control algorithm in fast network and storage settings
Interestingly, there were several (distributed) systems related sessions, and several of the papers are highly relevant to ongoing projects in our group:
New Hardware Architectures, covering
o A main-memory, column/row hybrid storage engine driven by application traces
o Compiling query statements directly into native machine code rather than iterator-based execution plans
o A parallel B+ tree algorithm for many-core processors
Cloud Computing and High Availability, covering
o Live migration of database storage
o Highly available databases built on reliable virtual machines
Distributed Systems (2 sessions), covering
o Selective partial replication of large-scale web databases
o A Paxos-based highly available datastore
o DBMS-like indexing on overlay networks
MapReduce and Hadoop, covering
o Adding a data co-location optimization to Hadoop for column-oriented storage applications
o Automatically optimizing Hadoop programs using static code analysis
- GPU-based Architectures and Column-Store Indexing, covering
o Sparse matrix-vector multiplication on GPUs
o Transaction execution on GPUs
o List intersection and index compression on GPUs
So 13 out of 104 research papers at VLDB are systems-style work. Notably, all the systems-related sessions were crowded and had active Q&A after the presentations, while the other sessions I happened to attend had relatively few attendees and were pretty quiet. Institutions publishing systems-related work include CMU, IBM Research, Intel Research and Yahoo!.

=Notable Papers=

Here I only focus on system, search and transaction related papers.
- RemusDB: Transparent High Availability for Database Systems
Umar Farooq Minhas, Shriram Rajagopalan, Brendan Cully, Ashraf Aboulnaga, Kenneth Salem, Andrew Warfield
This work received the best paper award at VLDB 2011 and came from the University of Waterloo.
The paper proposes making a DBMS highly available by leveraging a VM high-availability technology called Remus and adding some DBMS-specific performance optimizations on top of it. It first explains why Remus can be used for DBMS HA without breaking ACID properties, and then discusses 4 DBMS-specific optimizations (3 memory related, 1 network related).
To reduce the size of the checkpoint synced from the active node to the standby node, they put page content diffs, rather than whole pages, into the checkpoint, since most modifications between consecutive checkpoints touch only part of a page.
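To make the diff idea concrete, here is a minimal sketch (my own illustration, not the RemusDB code; the 4 KB page size and 64-byte diff granularity are arbitrary assumptions):
```python
# Ship only the byte ranges of a dirty page that changed since the last
# checkpoint, instead of the whole page.

PAGE_SIZE = 4096
CHUNK = 64  # diff granularity in bytes (assumed for illustration)

def page_diff(old: bytes, new: bytes):
    """Return a list of (offset, changed_bytes) chunks that differ."""
    diffs = []
    for off in range(0, PAGE_SIZE, CHUNK):
        if old[off:off + CHUNK] != new[off:off + CHUNK]:
            diffs.append((off, new[off:off + CHUNK]))
    return diffs

def apply_diff(old: bytes, diffs):
    """Reconstruct the new page on the standby from the old page + diffs."""
    page = bytearray(old)
    for off, data in diffs:
        page[off:off + len(data)] = data
    return bytes(page)

# If a transaction touched only 100 bytes of a 4 KB page, the checkpoint
# carries a few small chunks instead of the whole page.
old = bytes(PAGE_SIZE)
new = bytearray(old); new[100:200] = b'\x01' * 100
d = page_diff(old, bytes(new))
assert apply_diff(old, d) == bytes(new) and len(d) < PAGE_SIZE // CHUNK
```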
To avoid checkpointing pages that can simply be re-read from disk, they also track disk read operations and put only small metadata into the checkpoint; the standby server uses this metadata to reconstruct those memory pages from its own disk.
They also implemented an interface that lets application developers explicitly mark pages as not to be replicated, but did not use it in this paper since it hurts performance for DBMS software.
The optimizations above are not really DBMS specific and are quite general; other applications could benefit from them too, so I would characterize them as optimization work on Remus itself.
The last optimization is DBMS specific: it leverages transaction semantics to avoid Remus's TCP packet level protection. With this optimization, the underlying VM only needs to protect transaction control messages, such as acknowledgements of ABORT and COMMIT messages from the client. This greatly reduces latency for the other messages, such as those comprising the transaction itself.
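A toy illustration of this "commit protection" idea as I understand it (not the paper's implementation; the message tags are made up):
```python
# Only packets that acknowledge a COMMIT or ABORT are held until the next
# Remus checkpoint; everything else is released immediately, avoiding the
# per-packet buffering latency.

PROTECTED_TAGS = {"COMMIT_ACK", "ABORT_ACK"}  # hypothetical tag names

def handle_outgoing(packet_tag, payload, buffer_until_checkpoint, send_now):
    """Route an outgoing DBMS message based on its transaction semantics."""
    if packet_tag in PROTECTED_TAGS:
        # The client must not see the ack before the state is replicated.
        buffer_until_checkpoint(payload)
    else:
        # E.g. query results inside an open transaction: release right away.
        send_now(payload)

# Usage with stub callbacks:
held, sent = [], []
handle_outgoing("ROW_DATA", b"r1", held.append, sent.append)
handle_outgoing("COMMIT_ACK", b"ok", held.append, sent.append)
assert sent == [b"r1"] and held == [b"ok"]
```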
The ideas are simple and easy to understand, the results look very good, and the work was done in real-world code bases: Xen, MySQL and PostgreSQL. These are probably the reasons it was voted best paper, although the innovation and technical challenge are not that big from a systems person's point of view.
There are some obvious drawbacks to this work:
o First, it only works with VMs, which add overhead, especially for I/O-sensitive applications like a DBMS. The paper did not discuss the overhead of running a DBMS inside a VM.
o It only compares performance with raw Remus, not with other HA technologies such as MySQL HA clusters. Building an HA DBMS on VMs may not be the right approach compared with the alternatives.
o Remus's checkpointing has no knowledge of the transactions running inside the VM, so the standby server is consistent with the active server at the system level, but not at the transaction level. That is, the latest state of the standby server may not be consistent in ACID terms, so it cannot be used to serve read requests under a specific isolation level.
- PLP: Page Latch-free Shared-everything OLTP
Ippokratis Pandis, Pınar Tozun, Ryan Johnson, Anastasia Ailamaki
This paper aims to improve the scalability of OLTP systems on many-core machines by combining existing logical (shared-everything) and physical (shared-nothing) partitioning techniques. The idea is quite elegant and the work looks solid from both a systems and a DBMS perspective.
The meanings of “shared everything” and “shared nothing” in this paper are not the same as in distributed/parallel DBMS settings. Here they are techniques for eliminating contention bottlenecks in an OLTP system on a many-core platform: the former refers to assigning different key ranges of the same shared table to different threads to avoid high-level locking among OLTP threads, while the latter refers to partitioning the underlying data pages of a table and assigning each partition to a different database instance.
PLP combines these two techniques with a new design called the Multi-Rooted B+ Tree (see the sketch after this list):
o Each logical partition has its own sub B+ tree as its index, similar to the shared-nothing design
o The underlying data pages are shared among all logical partitions, similar to the shared-everything design
o The transaction manager divides each transaction into a DAG of tasks; each task stays within a partition boundary and is assigned to the dedicated thread for that partition
Thus, the new design retains the benefits of contention-free execution among transaction threads and low-cost repartitioning/rebalancing, and eliminates the need for distributed transactions when a transaction crosses partitions.
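A simplified sketch of the Multi-Rooted B+ Tree routing idea (my own illustration, not the PLP code; a sorted list stands in for each partition's sub B+ tree):
```python
# Each logical partition owns a key range and its own index root, so every
# lookup/insert is routed to exactly one partition and would be handled by
# that partition's dedicated worker thread, with no shared root to latch.
import bisect

class Partition:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi      # key range [lo, hi)
        self.keys, self.vals = [], []  # stand-in for this partition's sub B+ tree

    def insert(self, key, val):
        i = bisect.bisect_left(self.keys, key)
        self.keys.insert(i, key); self.vals.insert(i, val)

    def lookup(self, key):
        i = bisect.bisect_left(self.keys, key)
        return self.vals[i] if i < len(self.keys) and self.keys[i] == key else None

class MultiRootedIndex:
    def __init__(self, boundaries):
        # boundaries like [0, 100, 200, 300] -> 3 partitions
        self.parts = [Partition(lo, hi) for lo, hi in zip(boundaries, boundaries[1:])]

    def route(self, key):
        for p in self.parts:
            if p.lo <= key < p.hi:
                return p
        raise KeyError(key)

idx = MultiRootedIndex([0, 100, 200, 300])
idx.route(42).insert(42, "a")   # in PLP this would be a task queued to partition 0's thread
assert idx.route(42).lookup(42) == "a" and idx.route(142).lookup(42) is None
```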
However, the work is built on a research prototype called Shore-MT from Wisconsin/EPFL; if it were done on top of a popular open-source DBMS such as MySQL or PostgreSQL, the work would be more convincing and would have a bigger real-world impact.
- Using Paxos to Build a Scalable, Consistent, and Highly Available Datastore
Jun Rao (LinkedIn), Eugene Shekita (IBM Research – Almaden), Sandeep Tata (IBM Research – Almaden)
There was a paper at NSDI 2011 about using Paxos to build a reliable data store, and here is a similar story from the DB side. It is not transactional storage, though, just key/value-style structured storage. The system architecture and protocol are also very similar to those of PacificA.
The system, called Spinnaker, is a replicated, range-partitioned, reliable structured store providing put/get-style operations. The master is based on Apache ZooKeeper.
The replication protocol is essentially a combination of two-phase commit, majority consensus and group commit. It differs from PacificA in that it only requires acks from a majority of members before the partition leader performs the real commit. Follower recovery is simple and straightforward: the follower learns from the leader to catch up with its state. For leader recovery, ZooKeeper is used to choose the follower with the highest prepare number as the new leader.
Another minor difference from PacificA is that it allows reads at follower nodes by providing so-called timeline consistency semantics.
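A highly simplified sketch of the leader's majority-ack commit rule described above (my own illustration, not Spinnaker's code; networking, durable logging and follower catch-up are all omitted):
```python
# The leader appends a write to its log, ships it to the followers, and
# commits as soon as a majority of the replica group (leader included)
# has acknowledged it.

class Leader:
    def __init__(self, follower_count):
        self.cluster_size = follower_count + 1   # leader + followers
        self.log = []                            # entries: lsn, key, value, acks
        self.committed_lsn = 0

    def propose(self, key, value):
        lsn = len(self.log) + 1
        self.log.append({"lsn": lsn, "key": key, "value": value, "acks": 1})  # leader's own ack
        return lsn                               # entry would now be shipped to followers

    def on_follower_ack(self, lsn):
        entry = self.log[lsn - 1]
        entry["acks"] += 1
        if entry["acks"] * 2 > self.cluster_size and lsn > self.committed_lsn:
            self.committed_lsn = lsn             # majority reached: safe to apply and ack the client

leader = Leader(follower_count=2)                # 3-way replication
lsn = leader.propose("k", "v")
leader.on_follower_ack(lsn)                      # one follower ack -> 2 of 3 -> committed
assert leader.committed_lsn == lsn
```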
Replication and consistency are always hot topics at DB conferences; both VLDB and SIGMOD have dedicated sessions.
- Column Oriented Storage Techniques for MapReduce
Avrilia Floratou (University of Wisconsin-Madison), Jignesh Patel (University of Wisconsin-Madison), Eugene Shekita (IBM Research – Almaden), Sandeep Tata (IBM Research – Almaden)
This paper presents several techniques for building column-oriented structured storage and an analysis engine on top of Hadoop.
The first is a storage enhancement:
o Each column is stored as one file in HDFS
o The files of multiple related columns are co-located by adding a new data placement policy to HDFS
The second technique is lazy record deserialization. The authors argue that modern analytical applications process more and more complex data types, and deserializing them from a byte stream is expensive, while most applications only need to process part of each record. So they propose to:
o Deserialize only a small part of a complex object to determine whether the object needs to be processed at all
o Fully deserialize only the objects that need to be processed
Another optimization is using type-specific techniques to compress records, for example a dictionary-based scheme for text strings.
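Here is a small sketch of lazy record deserialization as I understand it (not the paper's implementation; JSON records stand in for whatever binary serialization format the real system uses):
```python
# Peek only at the field needed by the predicate, and fully deserialize a
# record only if it qualifies.
import json

def peek_field(raw: bytes, field: str):
    # Assumption for illustration: records are JSON. A real system would use
    # a binary format with an offset table so peeking avoids a full parse.
    return json.loads(raw).get(field)

def scan(raw_records, field, predicate, process):
    for raw in raw_records:
        if predicate(peek_field(raw, field)):   # cheap check on one field
            process(json.loads(raw))            # full deserialization only when needed

rows = [b'{"country": "US", "payload": "..."}',
        b'{"country": "DE", "payload": "..."}']
hits = []
scan(rows, "country", lambda c: c == "US", hits.append)
assert len(hits) == 1
```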
This work seems like a fairly trivial, incremental improvement to the Hadoop system.
- CoHadoop: Flexible Data Placement and Its Exploitation in Hadoop
Mohamed Eltabakh (IBM Research – Almaden), Yuanyuan Tian (IBM Research – Almaden), Fatma Ozcan (IBM Research – Almaden), Rainer Gemulla (Max-Planck-Institut für Informatik), Aljoscha Krettek (IBM Germany), John McPherson (IBM Research – Almaden)
This paper adds a new data placement policy to HDFS and uses it to speed up log-processing tasks such as joins and sessionization queries.
They observe that many log-processing jobs need to process data partitions from different HDFS files, so co-locating correlated partitions from different files speeds up processing by eliminating much of the data shuffling and remote I/O; see the sketch below.
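A conceptual sketch of locator-based co-location (my own illustration of the idea, not CoHadoop's actual HDFS placement policy; the node list and hashing scheme are made up):
```python
# Files tagged with the same locator are placed on the same set of nodes,
# so a join between them needs no remote reads.
import hashlib

NODES = ["node1", "node2", "node3", "node4"]   # hypothetical cluster
REPLICAS = 3

def place(locator: str):
    """Deterministically pick the same replica set for every file sharing a locator."""
    start = int(hashlib.md5(locator.encode()).hexdigest(), 16) % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(REPLICAS)]

# Two files from the same day's logs are tagged with the same locator,
# so their blocks end up on the same nodes.
log_nodes = place("day=2011-08-30")
ref_nodes = place("day=2011-08-30")   # a second file sharing the locator
assert log_nodes == ref_nodes          # co-located -> the join is node-local
```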
Also, although it starts from a different angle, this paper covers only part of the work in the previous paper.
- Automatic Optimization for MapReduce Programs 
Eaman Jahani (University of Michigan), Michael Cafarella (University of Michigan), Christopher Ré (University of Wisconsin-Madison)
This paper proposes using static code analysis to improve the performance of unmodified Hadoop jobs.
However, this work only uses the analysis results for storage-related optimizations (sketched after this list):
o Using an index to pre-prune records that the map function would not select, based on selection analysis
o Pruning unused fields for map/reduce, based on projection analysis
o Application-level compression optimizations
They did not do any query-plan-wide optimization using the code analysis results; there seems to be a lot of promising future work here.
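A sketch of the selection and projection optimizations above (my own illustration, not the paper's system; the index and field names are made up):
```python
# Selection analysis: an index pre-filters records the map function would
# discard anyway. Projection analysis: only the fields the job actually
# references are handed to it.

def run_map(records, index, needed_value, used_fields, map_fn):
    # Selection: the index tells us which records can possibly match,
    # so the rest are never even read.
    candidates = index.get(needed_value, [])
    out = []
    for rid in candidates:
        rec = records[rid]
        # Projection: pass only the fields the analyzed map function uses.
        out.extend(map_fn({f: rec[f] for f in used_fields}))
    return out

records = [{"url": "/a", "status": 200, "bytes": 10},
           {"url": "/b", "status": 404, "bytes": 20}]
status_index = {200: [0], 404: [1]}               # assumed pre-built index
result = run_map(records, status_index, 404, ["url"], lambda r: [(r["url"], 1)])
assert result == [("/b", 1)]
```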
- Where in the World is My Data?
Sudarshan Kadambi (Bloomberg), Jianjun Chen (Yahoo!), Brian Cooper (Google), David Lomax (Yahoo!), Raghu Ramakrishnan (Yahoo!), Adam Silberstein (Yahoo!), Erwin Tam (Yahoo!), Hector Garcia-Molina (Stanford University)
This paper proposes replicating structured data tables at the record level rather than the traditional table/partition level. The basic reasoning is that most popular websites hold global data, but accesses from different geographic sites focus on different parts of that data. They call this idea selective replication.
In their design, record replicas are divided into 3 types:
- Master, where write/update operations are executed; it asynchronously notifies full replicas about each update/write
- Full, where read operations can be executed
- Stub, where only the primary key is stored and read/write operations are forwarded to an appropriate other replica
Given this setting, the paper focuses on placing the 3 types of replicas to optimize bandwidth (forwarding/replication) savings. They introduce static and dynamic placement policies and define a language for specifying replica placement constraints (for example, total replica count, forced full-replica sites, etc.).
The major difference between the static and dynamic placement policies is that the dynamic method can leverage historical access patterns to adjust placement and achieve lower bandwidth cost. To reduce the bookkeeping cost of storing and analyzing historical access data, the dynamic policy is simplified to: promote a stub to a full replica when a read is seen at it; demote a full replica to a stub when it keeps being notified of updates but is not read for a period of time.
Their experiments show that with a 10% remote-friend setting, bandwidth usage can be improved by about 2x.
The value of the stub is that it can save one message round trip and avoid a master hot spot when the placement is not optimal. A sketch of the promote/demote policy is below.
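A sketch of the simplified dynamic policy (my own illustration, not the paper's system; the idle threshold of 3 unread updates is an arbitrary assumption):
```python
# A stub is promoted to a full replica on a local read; a full replica is
# demoted back to a stub if it keeps receiving update notifications without
# being read.

IDLE_LIMIT = 3   # assumed: demote after 3 consecutive unread update notifications

class RecordReplica:
    def __init__(self):
        self.kind = "stub"          # "stub" or "full"
        self.unread_updates = 0

    def on_local_read(self):
        if self.kind == "stub":
            self.kind = "full"      # promote: future reads are served locally
        self.unread_updates = 0

    def on_update_notification(self):
        if self.kind == "full":
            self.unread_updates += 1
            if self.unread_updates >= IDLE_LIMIT:
                self.kind = "stub"  # demote: stop shipping full record data here

r = RecordReplica()
r.on_local_read();  assert r.kind == "full"
for _ in range(IDLE_LIMIT): r.on_update_notification()
assert r.kind == "stub"
```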
Drawbacks:
- Cross-row transactions are not supported
- A primary key is required for each record
- Fast Set Intersection in Memory
Bolin Ding (University of Illinois at Urbana-Champaign), Arnd Christian König (Microsoft Research)
This paper describes a fast intersection algorithm for in-memory sets. The basic idea is to encode set elements into machine words and use bitwise AND to compute the intersection. The authors report about a 3x performance gain compared with inverted-index-based set intersection.
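A toy version of the word-packing idea (the paper's actual scheme is more sophisticated and uses hashing to pack elements of a large universe into words; here the universe is assumed small enough for a plain bitmap):
```python
# Each set becomes a bitmap, and intersection is one bitwise AND per 64-bit
# word instead of one comparison per element.

WORD = 64

def to_bitmap(elements, universe_size):
    words = [0] * ((universe_size + WORD - 1) // WORD)
    for e in elements:
        words[e // WORD] |= 1 << (e % WORD)
    return words

def intersect(a_words, b_words):
    result = []
    for i, (wa, wb) in enumerate(zip(a_words, b_words)):
        w = wa & wb
        while w:
            low = w & -w                              # lowest set bit
            result.append(i * WORD + low.bit_length() - 1)
            w ^= low
    return result

a = to_bitmap({3, 70, 500}, 1024)
b = to_bitmap({70, 500, 900}, 1024)
assert intersect(a, b) == [70, 500]
```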
The main problems of this work are:
o It requires complicated and costly preprocessing, and dynamically updating the sets is not easy
o It is only applicable to in-memory intersection of a large set with a small set, and the result size needs to be small
- Efficiently Compiling Efficient Query Plans for Modern Hardware
Thomas Neumann (Technische Universität München)
This paper describes a new query processing architecture that compiles query statements directly into machine code using LLVM. Current DB query processing is based on the iterator model, whose advantages are flexibility and pipelining. Its disadvantages are: 1) next() is called for every record at every iterator, which results in a huge number of function calls; 2) these next() calls are usually virtual functions, which makes them even more expensive; 3) code locality across one query execution is poor. So the author compiles the query plan directly into machine code, which eliminates all three drawbacks.
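A toy contrast between the two execution styles, written in Python rather than generated machine code (my own illustration, not the paper's system):
```python
# The iterator model makes one next() call per tuple per operator, while the
# "compiled" style fuses scan + filter + aggregate into one tight loop, as
# the generated code would.

class Scan:
    def __init__(self, rows): self.it = iter(rows)
    def next(self):
        return next(self.it, None)

class Filter:
    def __init__(self, child, pred): self.child, self.pred = child, pred
    def next(self):
        row = self.child.next()
        while row is not None and not self.pred(row):
            row = self.child.next()
        return row

def run_iterator_model(rows):
    op = Filter(Scan(rows), lambda r: r % 2 == 0)
    total, row = 0, op.next()
    while row is not None:                 # one next() chain per tuple
        total += row
        row = op.next()
    return total

def run_compiled_style(rows):
    total = 0
    for r in rows:                         # fused loop: no per-tuple operator calls
        if r % 2 == 0:
            total += r
    return total

data = list(range(10))
assert run_iterator_model(data) == run_compiled_style(data) == 20
```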
The open questions with this approach are the compilation cost, whether the LLVM optimizer is good enough for DB queries, and how to adopt existing query optimization techniques in this model.
- Efficient Parallel Lists Intersection and Index Compression Algorithms using Graphics Processing Units
Naiyong Ao, Fan Zhang, Di Wu, Douglas S. Stones, Gang Wang, Xiaoguang Liu, Jing Liu, Sheng Lin
This work comes from the Baidu-Nankai joint lab and aims at speeding up index encoding/decoding and serving by leveraging GPUs. I am not familiar with GPU programming, so I will skip the details here.
Industrial Sessions:
Inspector Gadget: A Framework for Custom Monitoring and Debugging of Distributed Dataflows
o Yahoo! reported a tool used to monitor and debug query processing dataflows.
Jaql: A Scripting Language for Large Scale Semistructured Data Analysis
o IBM reported a scripting language over Hadoop called Jaql for large-scale semi-structured data analysis, which is used in its InfoSphere product.
Tenzing – A SQL Implementation on the MapReduce Framework
o Google reported their Hive-like system called Tenzing, which extends standard SQL with support for advanced analysis. The paper also contains some MapReduce enhancements.
o This was probably the hottest paper at VLDB 2011, and many famous DBMS gurus crowded into the meeting room, probably because of the heated MapReduce vs. Parallel DBMS debate a few years ago.
o Google adopted many techniques learned from parallel DBMSs and Dryad to improve MapReduce in order to build a low-latency, SQL-compatible query engine, which “partially answered the previous debate” (the Google presenter’s words).
- Citrusleaf: A Real-Time NoSQL DB which Preserves ACID
o A company called Citrusleaf reported their real-time NoSQL DB of the same name, which supports ACID and is widely used in some of the world’s largest real-time bidding networks.