C-Store A Column-oriented DBMS

Posted on May 25, 2023   3 minute read ∼ Filed in  : 

Summary

This paper defines a data model, storage abstractions and corresponding RS/WS design for read performance, compressed techniques and transactions with snapshot isolation (based on 2PL) and 2PC.

Data Model

Tables

=> projections

  • contains any number of columns in the table, or from another table if there are relationships
  • multiple projections for one table could have overlaps
  • tuples in projections are stored column-wise. K columns == K data structures. All are sorted on the same sort key.

=> projection is horizontally partitioned into 1 or more segments

  • each segment has an ID.
  • each segment has multiple rows/records/tuples.
  • each value of each column has a storage key. values with the same storage key belong to one logical row.
  • two projections can join to form the orginal table, and join using the index (seg_id, storage Key)’

Joins: the problem is to determine the projections, segments, sort keys, and join indices to create the logical tables.

ReadStore

It uses 4 compressed schema - 4 encoidngs

  • (v, f, n) : value v first appear at f position and have continue n times. It also have clusterd B-Tree indexs in this column for efficient searching.
  • (v, b): values v appear in positions indicted in bitmap b. It also use B-Trees t map positions in a column to the values contained in that column.
  • represent every value in the column as a delta from the previous value in the column. 1,4,7,7,8,12 would be represented by the sequence: 1,3,3,0,1,4. B-tree tree at the block level can be used to index these coded objects.
  • values unencoded + A dense pack B-tree.

WriteStore

Write store is also a column store and is also partitioned in the same way as RS, thus there is 1:1 mapping.

Write store is not compressed, each value is (v, SK). value and storage key.

Each projection uses two B-tree

  • The first maintains (v, sk) pair.
  • The second maintains (s, SK) pair.

During the query, it first locates sk using s, and then local v using sk.

Storage

It divides projecttions into segments, and assign segments to physical nodes and set some constraints for performance

  • All columns in a single segment should be co-located
  • WS segments and RS segments with the same key range should co-locate.

Big columns are stored in individual files.

Transaction

Write Store is built on top of BerkeleyDB. C-store is designed for large reads with smaller OLTP txs on few seconds. To ensure the performnace, it uses read-only transactions using snapshot isolation (don’t need to set any locks), while the update tx use two-phase locking and set read/write locks.

Snapshot isolation

Snapshot isolation works by allowing read-only transactions to access the database as of some time in the recent past, before which we can guarantee that there are no uncommitted transactions.

Read only transaction can effectivly run between period (Low water mark - High water mark)

The key problem in snapshot isolation is determining which of the records in WS and RS should be visible to a read-only transaction running at effective time ET. Any records inserted bebefore ET and delete after ET is visiable to read-only query.

  • Update cannot happens in-place => convert ot insert and delete, each with insert and delete timestamp per-projection.
  • Reduce the timestamp assignment overhead, it uses epoch to assign timestamps.
  • Reduce the timesatmps management overhead, it ensure no records in RS were instered after
  • It uses a timestamp authority to periodically assign timestamp to each site in 2pc manner.
  • read-only query reads from the most recent time related epoch.

Concurrency Control

It uses 2pl and 2pc. To prevent the deadlock, it uses timeout machinism.

Tuple Mover

The read-store have a LWM and HWM, duirng the moving data from WS to RS. It finds all records inserted timestamp is at or before the LWM.

  • If their delete timestamp is at or before the LWM also, them don’t move them to RS.
  • If their delete timestamp is after LWM, then move them to RS, such that read query could see it.




END OF POST




Tags Cloud


Categories Cloud




It's the niceties that make the difference fate gives us the hand, and we play the cards.