Data fragmentation in ﬂash SSDs which is a common problem that leads to performance degradation, especially when the underlying storage devices become aged by heavily updating workloads. This paper addresses that problem in MongoDB, a popular document storage in the current market, by introducing a novel stream mapping scheme that based on unique characteristics of MongoDB. Empirical results showed that in YCSB and Linkbench, our methods improved the throughput by more than 44% and 43.73% respectively; reduced 99th-percentile latency by up to 29% and 24.67% in YCSB and Linkbench respectively. In addition, by tuning the leaf page size in B+Tree of MongoDB, we can signiﬁcantly improve the throughput by 3.37x and 2.14x in YCSB and Linkbench respectively. MongoDB_Multistreamed.pdf (25 downloads)
Multi-streamed technique first proposed by Kang et. al.  The main idea is mapping different streams with different files depended on their lifetime, that leads to reducing the overhead of copying back valid pages in garbage collection (GC) processing.
We explain how multi-streamed technique work in the Figure 1. Suppose that the device had eight logical block addresses (LBAs) and divided into two groups: hot data (LBA2, LBA4, LBA6, LBA8), and cold data are remains. There are two write sequences for both regular SSD and MSSD. The ﬁrst sequence is written continuously from LBA1 to LBA8, and then the second write sequence only includes hot LBAs i.e. LBA6, LBA2, LBA4, and LBA8. In regular SSD, after the ﬁrst write sequence, LBAs are mapped to block 0 and block 1 according to write order regardless of hot or cold data. When the second write sequence occurs, new coming writes are done in empty block 2, corresponding old LBAs become invalid in block 0 and block 1. If the GC process reclaims block 1, here is an overhead for copying back LBA5 and LBA7 to another free block before erasing block 1.
The write sequences are similar in MSSD; however, in the ﬁrst write sequence, LBAs assigned to a corresponding stream according to their hotness values. Consequently, all hot data grouped into block 1. After the second write sequence ﬁnished, all LBAs in block 1 become invalid and erasing block 1 in such case is quite fast, due to the copying back overhead is eliminated.
File-based Stream Mapping
First, we setup the experiment to study the proportions of data written to files in MongoDB with different benchmark i.e. YCSB and LinkbenchX (a modified version of the original Linkbench for MongoDB).
Table 1 shows the proportions of data written to collection ﬁles, index ﬁles and journal ﬁles under various workloads with diﬀerent operations i.e. create, read, update, delete (CRUD). In YCSB benchmark, since the data model is simple key-value with only one collection ﬁle and one primary index ﬁle, almost writes are on collection ﬁle, and there is no update on primary index ﬁle. In Linkbench, however, collection ﬁles and secondary index ﬁles are hot data which have frequency accessed, primary index ﬁles and journal ﬁles are cold data that receive the low proportion of writes i.e. less than 5% in total. This observation implicates that diﬀerence written ratios in ﬁle types lead to hot data and cold data located in the same physical data block in SSD that result in high overhead in GC process.
To solve this problem, we use a simple ﬁle-based stream mapping that assigns diﬀerent streams for diﬀerent ﬁle types. To minimize the overhead of the system, we assign a ﬁle to a corresponding stream only when open that ﬁle. This approach is good with simple data model in YCSB, however, has some drawbacks in the more complicated data model in Linkbench as discussed in the next section.
Boundary-based Stream Mapping
We futher study the I/O patterns of WiredTiger with complicate data model in LinkBench benchmark.
Figure 2 illustrates the written patterns of diﬀerent ﬁle types in the system under Linkbench benchmark with LB-Update-Only workload in two hours using blktrace. The x-axis is the elapsed time in seconds, the y-axis is the ﬁle oﬀset. DirectIO mode is used to eliminate the eﬀect of Operating System cache. Collection ﬁle and secondary index ﬁle have heavily random write pattern on two regions i.e. top and bottom that separated by a boundary in dashed line as illustrated in Figure 2(a), and (c). The primary index ﬁle and journal ﬁle follow sequence write patterns as illustrated in Figure 2(b), and (d) respectively.
At a time point, the amount of data written to two regions i.e. top and bottom is asymmetric and switches after each checkpoint. In this paper, we call that phenomenon is asymmetric regions writing. Due to the asymmetric regions writing phenomenon, for a given ﬁle, there is an internal fragmentation that dramatically aﬀects to the overhead of GC in SSDs.
File-based optimization is inadequate to solve the problem. In this approach, writes on one ﬁle are mapped with one stream, thus inside that stream, internal fragmentation still occurs.
Therefore, we propose a novel stream assignment named boundary-based stream mapping. The key idea is using the ﬁle boundary that separates the logical address of a given ﬁle to the top region and the bottom region. As illustrated in Algorithm 1, ﬁrstly, the boundary of each collection and index ﬁle is computed as the last ﬁle oﬀset after the load phase ﬁnished. Then in query phase, before writing a block data on a given ﬁle, the boundary is retrieved again as in line 4, based on the boundary values and the ﬁle types, the stream mapping is carried out as in line 7, 9, 12, 14, and 16. After stream id is mapped, the write command to the underlying ﬁle is given as
posix_fadvise(fid, offset, sid, advice), where
fid is file identify,
offset is the oﬀset to write on,
sid is stream id mapped and
advice is passed as a predeﬁned constant.
Evaluation and Analysis
We setup the experiment as below:
- MongoDB server 3.2 (WiredTiger as storage engine).
- CPU: 48 cores Intel Xeon 2.2 GHz
- DRAM: 32 GB
- Storage divice: Samsung SSD 840 Pro
- YCSB: 40 threads, 23 million of 1-KB records
- Linkbench: 40 threads, maxid1 = 80,000,000
- Workloads for YCSB and Linkbench are given in Table 1
The experiment methods are showned in Tabel 2. In the original WiredTiger, there is no stream mapping; thus all ﬁle types use stream 0 that reserve for ﬁles in Linux Kernel as default. In ﬁle-based stream mapping, we used total four streams that map each stream to a ﬁle type. In boundary-based stream mapping, metadata ﬁles and journal ﬁles are mapped in the same way with the ﬁle-based approach. The diﬀerent is that there is two streams map with collection ﬁles, one for all top regions and another for the bottom regions. We map streams for index ﬁles in the same manner without considering primary index ﬁles or secondary index ﬁles.
Figure 3 illustrates the throughput results for various benchmarks and workloads. In YCSB benchmark, boundary-based shows the throughput improve up to 23% at 5GB cache size and 44% at the cache size is 30GB in in Y-Update-Heavy and Y-Update-Only workload respectively. For Linkbench benchmark, the boundary-based method has throughput improve up to 23.26%, 28.12%, and 43.73% for LB-Original, LB-Mixed, and LB-Update-Only respectively.
In this paper, we discussed data fragmentation in MongoDB in detail. The ﬁle-based method is the simplest one that solves the data fragmentation due to the diﬀerent lifetime of writes on ﬁle types but remains internal fragmentation caused by asymmetric regions writing. For simple data model in YCSB, the boundary-based approach is adequate to solve the internal fragmentation that shows good performance improvement but lost its beneﬁts with the complex data model in Linkbench. In general, our proposed approaches can adopt to any storage engine that has similar characteristics with WiredTiger i.e. asymmetric ﬁles writing and asymmetric region writing. Moreover, we expect to further optimize the WiredTiger storage engine by solving the problem of boundary-based with complex data model i.e. Linkbench in the next research.
- Kang, J. U., Hyun, J., Maeng, H., & Cho, S.: The multi-streamed solid-state drive. In: 6th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 14)(2014)
- Lee, S. W., Moon, B., & Park, C..: Advances in ﬂash memory SSD technology for enterprise database applications. In: Proceedings of the 2009 ACM SIGMOD International Conference on Management of data, pp. 863-870 (2009)