NoSQL solutions are emerging for large-scale, high-performance, schema-flexible applications. WiredTiger is a cost-effective, non-locking, no-overwrite storage engine and is the default storage engine in MongoDB. Understanding the I/O characteristics of a storage engine is important not only for choosing a suitable solution for an application but also for opening opportunities for researchers to optimize the existing system, especially to build a more flash-aware NoSQL DBMS. This paper explores the background of MongoDB internals and then analyzes the I/O characteristics of the WiredTiger storage engine in detail. We also exploit the space management mechanism in WiredTiger by using the TRIM command.
Space Management in WiredTiger
As shown in Figure 1(a), WiredTiger uses an extent data structure; each extent holds a logical disk offset and a size. Each file has three extent lists that keep track of allocated space, available space, and discarded space. Before a data buffer is actually written out, the latest updates are applied to the original on-disk image; space management then allocates a logical disk address for the upcoming write using one of three approaches: (1) first-fit, which selects the first extent in the available extent list that fits the data buffer; (2) best-fit, which selects the first smallest extent that fits the data buffer; and (3) append at the end of the file.
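The three allocation approaches can be sketched as follows. This is a minimal illustration, not WiredTiger's actual code: the `Extent` class, the function names, and the choice to carve the request from the front of the chosen extent are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Extent:
    offset: int  # logical file offset in bytes
    size: int    # length in bytes


def first_fit(avail: List[Extent], size: int) -> Optional[int]:
    """Index of the first extent, in list order, that is large enough."""
    for i, ext in enumerate(avail):
        if ext.size >= size:
            return i
    return None


def best_fit(avail: List[Extent], size: int) -> Optional[int]:
    """Index of the first smallest extent that is large enough."""
    best = None
    for i, ext in enumerate(avail):
        if ext.size >= size and (best is None or ext.size < avail[best].size):
            best = i
    return best


def allocate(avail: List[Extent], file_size: int, size: int,
             policy: str = "best") -> Tuple[int, int]:
    """Return (offset, new_file_size) for a write of `size` bytes."""
    pick = first_fit(avail, size) if policy == "first" else best_fit(avail, size)
    if pick is None:
        # (3) nothing fits: append at the end of the file
        return file_size, file_size + size
    ext = avail[pick]
    offset = ext.offset
    if ext.size == size:
        avail.pop(pick)       # extent fully consumed
    else:
        ext.offset += size    # keep the unused tail on the available list
        ext.size -= size
    return offset, file_size
```

With an available list holding an 8 KB extent followed by two 4 KB extents, a 4 KB request takes the 8 KB extent under first-fit and the first 4 KB extent under best-fit; a request larger than every available extent falls through to the append case.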
Fig. 1(b) illustrates how WiredTiger manages its extent lists and reuses allocated space through checkpoints. Extent list information is kept in the checkpoint structure and written to persistent storage at checkpoint time. WiredTiger uses a special, unique checkpoint called the live checkpoint, which exists only while the system is running and resides in DRAM. During normal operation, whenever update requests arrive from the client, the live checkpoint keeps track of both the data changes and the growth and shrinkage of the extent lists. As illustrated in Fig. 1(c), when the checkpoint server is signaled, the previous checkpoint is fetched from persistent storage into DRAM and merged with the live checkpoint before being written out to the storage system. After the merging process finishes, the disk space previously occupied by the same data page becomes available and can be reused for subsequent writes.
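The merge step can be pictured with a deliberately simplified model. The sketch below assumes that extents are plain `(offset, size)` tuples, that each list is a set, and that a discarded extent becomes available as soon as the merge completes; the `Checkpoint` class and both helper functions are hypothetical, and WiredTiger's real block manager is considerably more involved.

```python
from dataclasses import dataclass, field
from typing import Set, Tuple

Extent = Tuple[int, int]  # (offset, size), a simplification for this sketch


@dataclass
class Checkpoint:
    alloc: Set[Extent] = field(default_factory=set)    # space in use
    avail: Set[Extent] = field(default_factory=set)    # space free for reuse
    discard: Set[Extent] = field(default_factory=set)  # space freed since the last checkpoint


def rewrite_page(live: Checkpoint, old: Extent, new: Extent) -> None:
    """Record in the live checkpoint that a page moved from `old` to `new`."""
    live.avail.discard(new)
    live.alloc.add(new)
    live.discard.add(old)


def merge_at_checkpoint(prev: Checkpoint, live: Checkpoint) -> Checkpoint:
    """Fold the live checkpoint into the previous one (assumed semantics)."""
    merged = Checkpoint()
    merged.alloc = (prev.alloc | live.alloc) - live.discard
    # Space discarded while the system was running becomes reusable.
    merged.avail = (prev.avail | live.avail | live.discard) - merged.alloc
    return merged
```

For example, if the previous checkpoint holds page P1 in Ext A and lists Ext C as available, rewriting P1 into Ext C and then merging leaves Ext C allocated and Ext A available for the next write.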
Optimizing WiredTiger using TRIM Command
The TRIM command was introduced to allow an OS to inform an SSD which data blocks are no longer in use, so that those blocks can be skipped by the garbage collection (GC) that runs in the SSD's FTL. This reduces GC overhead, lowers write amplification, and extends the lifespan of the SSD.
We describe how the TRIM command is used to optimize WiredTiger with the example in Figure 2. Suppose each extent in WiredTiger, e.g. Ext A, maps to two logical blocks in the SSD, e.g. A1 and A2. As shown in Figure 2(a), in the original state (1), WiredTiger views file offsets as extents, while the SSD views the file as logical block addresses (LBAs). The Flash Translation Layer (FTL) inside the SSD translates each LBA to a physical block address (PBA) before writing to the flash memory chip. There is also an unused area called over-provisioning, which is invisible to higher layers and reserved for GC processing. At state (2), when a client (e.g. a YCSB workload) issues an update request for a record in page P1, WiredTiger's space management uses either the first-fit or the best-fit approach to provide an available address, e.g. Ext C, for P1. Replacing the address Ext A with Ext C makes Ext A logically invalid in WiredTiger. From the SSD's point of view, however, the associated LBAs A1 and A2 remain valid until state (3), when space management reuses Ext A for another write request. If GC occurs between state (2) and state (3), A1 and A2 are copied unnecessarily.
In WiredTiger, a data page is never updated in place, so multiple versions of the page are scattered on disk. Things become worse when the client workload (e.g. YCSB) includes a huge number of small random updates, which not only increases the already high overhead of the GC process but also shortens the lifespan of the SSD. Figure 2(b) illustrates the same use case, except that we adopt the TRIM command at the application layer, i.e. in WiredTiger: at state (2), whenever an address replacement takes place, we actively call TRIM to notify the FTL of the invalid pages. The SSD's logical view then treats A1 and A2 as free space, and its physical view treats them as invalid. If GC is invoked between state (2) and state (3), it does not copy the PBAs of A1 and A2 to a new block, which reduces the GC overhead significantly.
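The difference between the two cases can be made concrete with a toy FTL model. Everything here is an assumption for illustration: the `TinyFTL` class, the one-page-per-LBA mapping, and the choice of victim pages do not describe any real SSD firmware. The model only counts how many pages GC copies when it reclaims the pages written in state (1).

```python
from typing import Dict, Iterable, Optional


class TinyFTL:
    """Toy page-mapped FTL: one logical block maps to one physical page."""

    def __init__(self) -> None:
        self.l2p: Dict[str, int] = {}              # LBA -> PBA
        self.flash: Dict[int, Optional[str]] = {}  # PBA -> LBA held there, None if invalid
        self.next_pba = 0

    def write(self, lba: str) -> None:
        old = self.l2p.get(lba)
        if old is not None:
            self.flash[old] = None                 # out-of-place update invalidates the old page
        self.flash[self.next_pba] = lba
        self.l2p[lba] = self.next_pba
        self.next_pba += 1

    def trim(self, lbas: Iterable[str]) -> None:
        for lba in lbas:
            pba = self.l2p.pop(lba, None)
            if pba is not None:
                self.flash[pba] = None             # the page no longer needs to be preserved

    def gc(self, victim_pbas: Iterable[int]) -> int:
        """Reclaim the given pages and return how many valid pages were copied."""
        copied = 0
        for pba in victim_pbas:
            lba = self.flash.get(pba)
            if lba is not None:
                self.write(lba)                    # copy the still-valid page elsewhere
                copied += 1
            self.flash.pop(pba, None)              # erase
        return copied


def copies_during_gc(use_trim: bool) -> int:
    ftl = TinyFTL()
    for lba in ("A1", "A2", "B1", "B2"):           # state (1): Ext A and Ext B on PBAs 0..3
        ftl.write(lba)
    for lba in ("C1", "C2"):                       # state (2): P1 rewritten into Ext C
        ftl.write(lba)
    if use_trim:
        ftl.trim(["A1", "A2"])                     # WiredTiger reports Ext A as invalid
    return ftl.gc([0, 1, 2, 3])                    # GC runs before state (3)
```

In this model `copies_during_gc(False)` copies four pages, including the stale A1 and A2, while `copies_during_gc(True)` copies only the two pages of Ext B.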
EVALUATION AND ANALYSIS
We set up the experiment as follows:
- MongoDB server 3.2 (WiredTiger as storage engine).
- CPU: 48 cores Intel Xeon 2.2 GHz
- DRAM: 32 GB
- Storage device: Samsung SSD 840 Pro
- YCSB: 40 threads, 100% update workload, 30 million records, 30 million operations
Figure 3 shows the I/O patterns of the collection file in WiredTiger, traced with blktrace. The x axis is the runtime in seconds and the y axis is the logical block address (LBA) at which the I/O occurs. Some observations are:
- Figures 3(a) and 3(b) show that the WiredTiger collection file has random I/Os.
- For the collection file, the density of writes differs between areas of the file, i.e. the bottom and the top (Figure 3(b)). The reason is that some areas are accessed more frequently than others because of the Zipfian distribution of the YCSB workload.
- The checkpoint process is expensive in WiredTiger: we can see a wide blank gap between writes at the far right of Figure 3(b).
Because the data model in the YCSB benchmark is simple, with only one collection and one primary index, almost all writes occur in the collection file.
TRIM command optimization evaluation
We modified the original source code to use the TRIM command in WiredTiger. Whenever an address replacement occurs, the logically invalid offset is saved in an in-memory structure. Because issuing a TRIM command has an overhead, we do not call TRIM for every address replacement; instead, we accumulate the invalid offsets and issue them in a batch when their number reaches a threshold called the trim frequency.
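A minimal sketch of this batching scheme is shown below. The source does not say how the TRIM is issued or how ranges are merged, so the `TrimBatcher` class, the `merge_ranges` helper, and the `issue_trim` callback are hypothetical; on Linux the callback might wrap a hole-punching `fallocate` call or a block-device discard, but that is an assumption and not a description of the modified WiredTiger.

```python
from typing import Callable, List, Tuple

Range = Tuple[int, int]  # (offset, length) in bytes


def merge_ranges(ranges: List[Range]) -> List[Range]:
    """Sort ranges by offset and coalesce those that touch or overlap."""
    merged: List[Range] = []
    for off, length in sorted(ranges):
        if merged and off <= merged[-1][0] + merged[-1][1]:
            last_off, last_len = merged[-1]
            merged[-1] = (last_off, max(last_len, off + length - last_off))
        else:
            merged.append((off, length))
    return merged


class TrimBatcher:
    """Accumulate invalidated ranges and issue them once a threshold is reached."""

    def __init__(self, trim_frequency: int,
                 issue_trim: Callable[[int, int], None]) -> None:
        self.trim_frequency = trim_frequency  # number of replacements per batch
        self.issue_trim = issue_trim          # hypothetical hook that sends one TRIM
        self.pending: List[Range] = []

    def on_address_replaced(self, offset: int, length: int) -> None:
        self.pending.append((offset, length))
        if len(self.pending) >= self.trim_frequency:
            self.flush()

    def flush(self) -> None:
        for off, length in merge_ranges(self.pending):
            self.issue_trim(off, length)
        self.pending.clear()
```

Merging adjacent ranges before issuing them reduces the number of commands per batch. A real implementation would also have to drop a pending range if the allocator reuses that extent before the batch is flushed, since trimming it afterwards would discard live data; the sketch omits this check.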
Table 1 shows the performance of the TRIM optimization in WiredTiger. The last column shows the percentage improvement in OPS/s of the optimized method at different trim frequencies, compared with the original WiredTiger as the baseline. When the trim frequency is small, i.e. 10,000, the improvement in OPS/s is modest, just over five percent, because the overhead of TRIM is still high. The improvement rises to more than ten percent when the frequency is high enough, e.g. 15,000 and 20,000, as shown in the table. However, when we increased the frequency further, performance dropped: at 30,000 it fell below the baseline. The reason is that in this situation the overhead of merging the address space becomes significant, and sending a huge number of TRIM commands at once stresses the whole system.
In this paper, we presented the background and internals of the WiredTiger storage engine in detail and examined the I/O characteristics of its typical files. The collection file is the most heavily accessed file and the main bottleneck; it shows random read and mixed random-sequential write patterns when direct I/O is disabled. We exploited WiredTiger's space management by deferring and batching TRIM commands for address replacements on a flash-based SSD. Performance improves by more than ten percent when the trim frequency is carefully tuned.