Database consistency and recoverability require guaranteeing write atomicity for one or more pages. However, contemporary database systems consider write operations nonatomic. Thus, many database storage engines have traditionally relied on either journaling or copy-on-write approaches for atomic propagation of updated pages to storage. This reliance achieves write atomicity at the cost of various forms of write amplification, such as redundant writes, tree wandering, and compaction. This write amplification reduces performance and, for flash storage, accelerates device wear-out.
We propose a flash storage interface, SHARE. Being able to explicitly remap the address mapping inside flash storage using the SHARE interface enables host-side database storage engines to achieve write atomicity without causing write amplification. We have implemented SHARE on a real SSD board, OpenSSD, and modified the MySQL/InnoDB and Couchbase NoSQL storage engines to make them compatible with the extended SHARE interface. Our experimental results show that these SHARE-based MySQL/InnoDB and Couchbase configurations can significantly boost database performance. In particular, the inevitable and costly Couchbase compaction process can complete without copying any data pages.
Redundant Writes in MySQL/InnoDB
The MySQL/InnoDB storage engine uses a variant of journaling, called double-write, to deal with the partial page write problem. As shown in the figure above, when an updated page is evicted from the buffer cache, prior to overwriting the old copy in its original database location, the InnoDB engine first appends its new copy (i.e., after-image) to a separate journal area, the double-write buffer (DWB). Once the write to the journal area has been forced to storage (using an fsync call), InnoDB writes the page to its original location. When the system recovers from a crash, InnoDB can always find a consistent page copy either in the database or in the DWB.
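The double-write sequence described above can be sketched as a small in-memory simulation. This is a minimal sketch under stated assumptions, not InnoDB's implementation: `ToyStore`, `flush_page`, `recover`, `seal`, and `is_intact` are hypothetical names, pages are short byte strings, a CRC32 checksum plus a fixed page length stands in for torn-page detection, and the `tear_home_write` flag simulates a crash in the middle of the in-place write.

```python
import zlib

PAGE_SIZE = 16  # toy page size in bytes, 4 of which hold the checksum (assumption)


def seal(payload: bytes) -> bytes:
    """Build a fixed-size page: a 4-byte CRC32 followed by the padded payload."""
    if len(payload) > PAGE_SIZE - 4:
        raise ValueError("payload does not fit in one page")
    body = payload.ljust(PAGE_SIZE - 4, b"\0")
    return zlib.crc32(body).to_bytes(4, "big") + body


def is_intact(page: bytes) -> bool:
    """A page is intact if it is complete and its checksum matches its body."""
    return len(page) == PAGE_SIZE and zlib.crc32(page[4:]).to_bytes(4, "big") == page[:4]


class ToyStore:
    """Hypothetical in-memory stand-in for the database file plus the DWB."""

    def __init__(self):
        self.database = {}      # page_id -> page bytes at the original location
        self.double_write = {}  # page_id -> page bytes in the journal area
        self.writes = 0         # number of page writes issued to storage

    def flush_page(self, page_id, payload, tear_home_write=False):
        page = seal(payload)
        # Step 1: write the after-image to the double-write buffer.
        self.double_write[page_id] = page
        self.writes += 1
        # (A real engine would force this write to storage before continuing.)
        # Step 2: overwrite the original location; optionally simulate a torn write
        # by persisting only the first half of the page.
        self.database[page_id] = page[: PAGE_SIZE // 2] if tear_home_write else page
        self.writes += 1

    def recover(self):
        """Restore any torn database page from its intact copy in the DWB."""
        for page_id, copy in self.double_write.items():
            if not is_intact(self.database.get(page_id, b"")) and is_intact(copy):
                self.database[page_id] = copy


store = ToyStore()
store.flush_page(7, b"old contents")
store.flush_page(7, b"new contents", tear_home_write=True)  # crash mid-write
torn_before_recovery = not is_intact(store.database[7])
store.recover()
```

In this sketch every logical page flush costs two physical page writes, which is the redundancy the text refers to.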
The PostgreSQL server also takes a redundant journaling approach to guarantee write atomicity. Specifically, when the server runs with the full_page_writes option on, which is the default, the first time a page is updated after the last checkpoint, the before-image of the page is saved in the WAL. Likewise, SQLite, a popular embedded database system, provides two journaling modes, rollback and write-ahead log, to guarantee atomic page writes, and journaling either the before-image or the after-image of every updated page is very expensive.
This journaling-based redundant write paradigm has also been widely adopted in modern file systems such as ext4 and XFS so as to guarantee the consistency of data and metadata despite the torn page problem. However, because journaling both data and metadata pages (i.e., full journaling mode) is too expensive, those file systems are by default configured to journal only metadata (i.e., ordered journaling mode) so that the consistency of file system structure is at least preserved.
Copy-on-Write in Couchbase
Couchbase is a document-oriented NoSQL database system. Its storage engine is based on an append-only B+-tree that takes a copy-on-write strategy for updating documents atomically: rather than overwriting the existing old document copy in place, it appends the new copy at the end of the database file. As illustrated in the figure above, when a document is updated, its new copy is written to the end of the database file while its old copy is left intact but marked as stale. One undesirable consequence of this append-only update policy is that all tree nodes on the path from the leaf pointing to the document up to the root must be updated and written in a cascaded way, where each node update also takes a copy-on-write strategy. The Couchbase storage engine thus relies on the so-called wandering-tree scheme. A tree is called a wandering tree if a tree node update requires updating its parent nodes up to the root because in-place updates are not possible. Consequently, the wandering-tree scheme amplifies the data to be written by N index node pages, where N is the height of the B+-tree.
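The cascade described above can be illustrated with a toy append-only file that models a single root-to-leaf path. This is a minimal sketch under stated assumptions and not Couchbase's storage format: `AppendOnlyFile`, `build_path`, `update_document`, and `read_document` are hypothetical names, and each index node holds just one child offset.

```python
class AppendOnlyFile:
    """Hypothetical append-only database file: a list of immutable pages."""

    def __init__(self):
        self.pages = []

    def append(self, page):
        self.pages.append(page)
        return len(self.pages) - 1  # offset of the page just written


def build_path(file, height, doc):
    """Write one document plus a chain of `height` index nodes above it."""
    child = file.append({"doc": doc})
    for level in range(height):
        child = file.append({"level": level, "child": child})
    return child  # offset of the root


def read_document(file, root):
    """Follow child offsets from the root down to the document page."""
    offset = root
    while "child" in file.pages[offset]:
        offset = file.pages[offset]["child"]
    return file.pages[offset]["doc"]


def update_document(file, root, new_doc):
    """Copy-on-write update: rewrite the document and every node above it."""
    # Collect the index nodes on the path from the root down to the leaf.
    path = []
    offset = root
    while "child" in file.pages[offset]:
        path.append(file.pages[offset])
        offset = file.pages[offset]["child"]
    # Append the new document copy, then new node copies bottom-up, because
    # each parent must record the new offset of its child.
    child = file.append({"doc": new_doc})
    for node in reversed(path):
        child = file.append({"level": node["level"], "child": child})
    return child  # offset of the new root


HEIGHT = 3
db = AppendOnlyFile()
root = build_path(db, HEIGHT, "D2")
pages_before = len(db.pages)
new_root = update_document(db, root, "D2'")
pages_written = len(db.pages) - pages_before
```

With a tree of height 3, one document update appends four pages in this sketch: the document itself plus one copy of each index node on its path. The old root still leads to the old document copy, which is what makes the update atomic.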
Despite the high write amplification, Couchbase deliberately adopted this copy-on-write and wandering-tree combination mainly for two purposes:
- As modern storage devices do not support an atomic write feature, the copy-on-write strategy has been adopted as an implementation method to support atomic writes.
- Couchbase has opted for the sequential write pattern of the copy-on-write strategy over the random write pattern of the update-in-place strategy. In particular, because write latency on spinning disks is dominated by mechanical arm movement (e.g., disk seek time), the write strategy Couchbase adopted can provide better write throughput than the traditional update-in-place policy.
As more documents are updated or newly inserted under the copy-on-write and wandering-tree scheme, the percentage of stale pages in the database file increases. When the ratio of stale data reaches a configured threshold for the database file, the costly compaction operation must be invoked to reclaim the space the stale data occupies. For this, the compaction task typically reads all non-stale documents and index nodes from the current database file and copies them to a new file. This incurs significant I/O overhead and write amplification. The current database file is later deleted when it is no longer accessed by any reader. Other NoSQL database systems, such as BigTable, Cassandra, and MongoDB, that adopt a Log-Structured Merge (LSM) tree as their underlying storage engine have a similar issue.
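The threshold-triggered compaction described above can be sketched as follows. This is a minimal sketch under stated assumptions, not Couchbase's compactor: the file is a Python list of page records, index nodes are omitted, and `STALE_THRESHOLD`, `update`, `stale_ratio`, and `compact` are hypothetical names and values.

```python
STALE_THRESHOLD = 0.5  # assumed threshold; the real value is a configuration setting


def update(file, key, value):
    """Append-only update: mark the current copy stale, then append the new one."""
    for page in file:
        if page["key"] == key and not page["stale"]:
            page["stale"] = True
    file.append({"key": key, "value": value, "stale": False})


def stale_ratio(file):
    """Fraction of pages in the file that hold stale data."""
    return sum(page["stale"] for page in file) / len(file) if file else 0.0


def compact(file):
    """Copy every live page into a new file; return it and the pages copied."""
    new_file = [dict(page) for page in file if not page["stale"]]
    return new_file, len(new_file)


db_file = []
for key in ("D1", "D2", "D3"):
    update(db_file, key, "v1")
for version in ("v2", "v3", "v4"):
    update(db_file, "D2", version)

ratio_before = stale_ratio(db_file)
pages_copied = 0
if ratio_before >= STALE_THRESHOLD:
    db_file, pages_copied = compact(db_file)
```

Here three of six pages are stale when compaction starts, and the three live pages are all rewritten even though their contents did not change. That rewriting is the I/O overhead the text describes.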
The solution is based on a holistic understanding of the problems across the application, OS, and storage device layers. The FTL’s page-granularity indirection of logical-to-physical address mapping provides an excellent opportunity to solve the write amplification problem that occurs in both the journaling and copy-on-write approaches. Let us first illustrate how, by exploiting the FTL’s page-level address mapping, we can avoid the Couchbase write amplification without compromising atomic write semantics. As shown in the figure above, when a new copy of document D2, denoted as D2’, is written to the flash storage media, each of the two document copies, D2 and D2’, has its own logical and physical address. Now, what if the logical address of D2 in the FTL could be remapped to point to the physical address of D2’? The new D2’ copy could then be reached through the logical page number (LPN) of D2. Hence, unlike in the original Couchbase, the index nodes along the path from the corresponding leaf node to the root node need not be copied on write. That is, the write amplification due to Couchbase’s wandering-tree scheme can be avoided entirely while still preserving the atomic write semantics required to update document D2. Similarly, with the MySQL/InnoDB double-write buffer, the redundant write of an updated page to its original location in the primary database can be avoided simply by having the FTL remap the logical address of the original location to point to the physical address of the new copy written in the double-write buffer, without losing the atomic write property.
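The remapping idea can be made concrete with a toy page-mapped FTL. This is a minimal sketch under stated assumptions and not the OpenSSD firmware: `ToyFTL`, its `share` method, and the LPN values 100 and 200 are hypothetical, and garbage collection and reference counting of shared physical pages are left out.

```python
class ToyFTL:
    """Hypothetical page-mapped FTL with out-of-place writes."""

    def __init__(self):
        self.flash = []       # physical pages, indexed by physical page number (PPN)
        self.l2p = {}         # logical page number (LPN) -> PPN
        self.page_writes = 0  # physical page writes issued to flash

    def write(self, lpn, data):
        """Write data to a fresh physical page and map the LPN to it."""
        self.flash.append(data)
        self.l2p[lpn] = len(self.flash) - 1
        self.page_writes += 1

    def read(self, lpn):
        return self.flash[self.l2p[lpn]]

    def share(self, dst_lpn, src_lpn):
        """Remap dst_lpn to the physical page behind src_lpn; no data is copied."""
        self.l2p[dst_lpn] = self.l2p[src_lpn]


ftl = ToyFTL()
ftl.write(100, "D2")    # original document at LPN 100
ftl.write(200, "D2'")   # new copy appended at LPN 200
writes_before_share = ftl.page_writes
ftl.share(100, 200)     # LPN 100 now resolves to the page holding D2'
d2_via_old_lpn = ftl.read(100)
```

After the remap, a reader that follows the existing leaf pointer to LPN 100 sees the new copy, so no index node has to be rewritten, and the remap itself changes only the mapping table.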
The key contributions of this work are summarized as follows:
- SHARE provides an abstraction that allows host applications to change the address mapping table, which has traditionally been managed internally by the FTL alone. Many database engines and file systems can easily exploit SHARE to achieve write atomicity without causing write amplification. In addition to atomic writes, SHARE can also make numerous other write-heavy operations almost cost-free. These include the NoSQL compaction process and file copy operations, which can complete almost without copying data.
- We have implemented SHARE on an open SSD development hardware platform called OpenSSD by enhancing its FTL code with the SHARE features. We have demonstrated that both relational and NoSQL storage engines can easily exploit SHARE with only minimal code changes.
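As an illustration of the copy-free compaction mentioned in the first contribution, the sketch below builds a compacted file purely by editing a logical-to-physical mapping table. It is a minimal sketch under stated assumptions, not the interface implemented on OpenSSD: `compact_by_remap`, the LPN layout, and `new_file_base` are hypothetical, and the flash contents are a plain Python list.

```python
def compact_by_remap(l2p, old_file_lpns, live, new_file_base):
    """Build the new file by pointing fresh LPNs at the live pages' PPNs."""
    new_lpns = []
    for lpn in old_file_lpns:
        if lpn in live:
            new_lpn = new_file_base + len(new_lpns)
            l2p[new_lpn] = l2p[lpn]  # mapping update only; no flash page is written
            new_lpns.append(new_lpn)
    for lpn in old_file_lpns:        # deleting the old file drops its mappings
        del l2p[lpn]
    return new_lpns


flash = ["D1", "D2 (stale)", "D3", "D2'"]  # physical pages; never modified below
l2p = {0: 0, 1: 1, 2: 2, 3: 3}             # old file occupies LPNs 0..3
new_lpns = compact_by_remap(l2p, [0, 1, 2, 3], live={0, 2, 3}, new_file_base=1000)
compacted = [flash[l2p[lpn]] for lpn in new_lpns]
```

The compacted file lists only the live documents, yet the list of physical pages is untouched, in contrast to a conventional compaction that rewrites every live page.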