RocksDB

Posted on October 14, 2021 6 minute read ∼ Filed in : A paper note

Motivation: Mainly focus on storing small to medium size key-values on fast storage – flash devices or in-memory.

Storage: RocksDB is a storage engine with key/value interface. It uses a Log-Structured Database Engine for storage and supports various compression algorithms.

Performance: Support efficient point lookups, range scans, and different types of ACID guarantees.

High Level Architecture

RocksDB organizes all data in sorted order and the common operations are Get(key), NewIterator(), Put(key, val), Delete(key), and SingleDelete(key).

The three basic constructs of RocksDB are memtable, sstfile and logfile. The memtable is an in-memory data structure - new writes are inserted into the memtable and are optionally written to the logfile (aka. Write Ahead Log(WAL)). The logfile is a sequentially-written file on storage. When the memtable fills up, it is flushed to a sstfile on storage and the corresponding logfile can be safely deleted. The data in an sstfile is sorted to facilitate easy lookup of keys.

Features overview

All data in the database is logically arranged in sorted order

Column Families

partitioning a database instance into multiple column families

RocksDB guarantees users a consistent view across column families, including after crash recovery when WAL is enabled or atomic flush is enabled.

It also supports atomic cross-column family operations via the WriteBatch API.

Update

Put: update a single

Write: update batch of data

Gets, Iterators and Snapshots

get: single key

multiGet: many kv pairs

Iterator: range scan on the database. Seek to a specified key and then the application can start scanning one key at a time from that point

Snapshots: create a point-in-time view of a database

Short-lived/foreground scans are best done via an iterator while long-running/background scans are better done via a snapshot.

An Iterator keeps a reference count on all underlying files that correspond to that point-in-time-view of the database - these files are not deleted until the Iterator is released. A Snapshot, on the other hand, does not prevent file deletions; instead the compaction process understands the existence of Snapshots and promises never to delete a key that is visible in any existing Snapshot.

Snapshots are not persisted across database restarts: a reload of the RocksDB library (via a server restart) releases all pre-existing Snapshots.

Transactions

multi-operational transactions. It supports both of optimistic and pessimistic mode

Prefix Iterators

Options.prefix_extractor is set, a hash of the prefix is also added to the Bloom. An Iterator that specifies a key-prefix (in ReadOptions`) will use the Bloom Filter to avoid looking into data files that do not contain keys with the specified key-prefix

Persistence

RocksDB has a Write Ahead Log (WAL). All write operations (Put, Delete and Merge) are stored in an in-memory buffer called the memtable as well as optionally inserted into WAL. On restart, it re-processes all the transactions that were recorded in the log.

WAL can be configured to be stored in a directory different from the directory where the SST files are stored

Data CheckSuming

RocksDB uses a checksum to detect corruptions in storage. These checksums are for each SST file block (typically between 4K to 128K in size). A block, once written to storage, is never modified.

Multi-Threaded Compactions

It is observed that sustained write rates may increase by as much as a factor of 10 with multi-threaded compaction when the database is on SSDs, as compared to single-threaded compactions

Compaction sytles

Support:

Level Style Compaction.
Universal Style Compaction
FIFO Style Compaction
Customer compaction

metadata storage

A manifest log file is used to record all the database state changes. Add/delete files etc.

Avoiding Stalls

keep a small set of threads explicitly reserved for the sole purpose of flushing memtable to storage.

Compaction Filter

Can drop a key or modify the value of key as part of compaction process

ReadOnly Mode

much higher read performance because oft-traversed code paths avoid locks completely.

Data Compression

supports lz4, zstd, snappy, zlib, and lz4_hc compression

Full Backups and Replication

RocksDB itself is not a replicated, but it provides some helper functions to enable users to implement their replication system on top of RocksDB

Block Cache – Compressed and Uncompressed Data

RocksDB uses a LRU cache for blocks to serve reads. The block cache is partitioned into two individual caches: the first caches uncompressed blocks and the second caches compressed blocks in RAM. If a compressed block cache is configured, users may wish to enable direct I/O to prevent redundant caching of the same data in OS page cache.

Table Cache

The Table Cache is a construct that caches open file descriptors. These file descriptors are for sstfiles. An application can specify the maximum size of the Table Cache, or configure RocksDB to always keep all files open, to achieve better performance.

I/O Control

Users can enable direct I/O so that RocksDB takes full control to the I/O and caching

Memtables:

In-memory data structure serving both read and write.

Pluggable

The default implementation of the memtable for RocksDB is a skiplist. The skiplist is a sorted set, which is a necessary construct when the workload interleaves writes with range-scans. but it’s useless if there are no range-scan.

Three memtables are part of the library: a skiplist memtable, a vector memtable and a prefix-hash memtable

A skiplist memtable:

A vector memtable is appropriate for bulk-loading data into the database

A prefix-hash memtable allows efficient processing of gets, puts and scans-within-a-key-prefix.

Memtable pipelining

RocksDB supports configuring an arbitrary number of memtables for a database.

When a memtable is full => Add to flush pipeline => Background thread flush all the pipelined immutable memtables to storage.

Garbage Collection during Memtable Flush:

Flush triggered:

Memtable size, total

Inline compactionprocess is executed when memtable is flushed to the storage, make sure after flushing, there are no duplicated key in the sstable.

This feature reduces the size of data on storage and write amplification greatly, for some workloads.

Merge Operator

RocksDB natively supports three types of records, a Put record, a Delete record and a Merge record.

When a compaction process encounters a Merge record, it invokes an application-specified method called the Merge Operator. The Merge can combine multiple Put and Merge records into a single one.

It make read-modify-write operations to avoid read .

Write ahead log

In the event of a failure, write ahead logs can be used to completely recover the data in the memtable, which is necessary to restore the database to the original state

mangodb, cardraeesnal, hbase.

why rocksdb?

Embedded application architecture. where the database is a part of application server

berkeleyDB, SQLite, Kyoto TreeDB, levelDB.

no transaciton log, fixed size keys, h

levelDB, low write rates, only one cpu

RocksDB

High Level Architecture

Features overview

Column Families

Update

Gets, Iterators and Snapshots

Transactions

Prefix Iterators

Persistence

Data CheckSuming

Multi-Threaded Compactions

Compaction sytles

metadata storage

Avoiding Stalls

Compaction Filter

ReadOnly Mode

Data Compression

Full Backups and Replication

Block Cache – Compressed and Uncompressed Data

Table Cache

I/O Control

Memtables:

Pluggable

Memtable pipelining

Garbage Collection during Memtable Flush:

Merge Operator

Write ahead log

Tags Cloud

Categories Cloud

It's the niceties that make the difference fate gives us the hand, and we play the cards.

High Level Architecture

Features overview

Column Families

Update

Gets, Iterators and Snapshots

Transactions

Prefix Iterators

Persistence

Data CheckSuming

Multi-Threaded Compactions

Compaction sytles

metadata storage

Avoiding Stalls

Compaction Filter

ReadOnly Mode

Data Compression

Full Backups and Replication

Block Cache – Compressed and Uncompressed Data

Table Cache

I/O Control

Memtables:

Pluggable

Memtable pipelining

Garbage Collection during Memtable Flush:

Merge Operator

Write ahead log

END OF POST

Tags Cloud

Categories Cloud

It's the niceties that make the difference fate gives us the hand, and we play the cards.