Dremel Interactive Analysis of WebScale Datasets

Posted on June 17, 2022 2 minute read ∼ Filed in : A paper note

Introduction

Motivation

Data is often non-relational, and Nested. Normalizing and recombing those is hard. A better solution is to store all values of a given field consecutively to improve retrieval efficiency.

The main challenge is how to preserve all structural information and can reconstruct records from an arbitrary subset of fields.

Contribution

The paper proposes a scalable (trillion-record), interactive, ad-hoc query system for the analysis of read-only nested data.

The paper proposes a novel columnar storage format for nested data. it basically dissects nested records into columns and reassembles them.
Query processing is efficient and does not require restructuring of the records.
Conducts experiments with trillion-record, multi-terabyte datasets. 1k-4k nodes. it shows The system can combine multi-level execution trees and columnar data layout to run aggregation over trillion-row tables in seconds.

System

Data Model

The data model is based on strongly-typed nested records.

Nested Columnar Storage

Mainly address the following challenges:

lossless representation of a record in columnar format => with r and d
fast encoding => tree writer, build a tree from schema.
efficient record assembly.

Repetition and Definition Levels

Use r and d to record the meta information of each value, eg, which record the value belongs to. etc

Definition levels are not stored for values that are always defined. Similarly, repetition levels are stored only if required.

Splitting Records into Columns

Most datasets are sparse and have thousands of fields, the paper tries to process missing fields as cheaply as possible.

The paper uses a tree of field writers, whose structure matches the field hierarchy in the schema. It then updates field writers only when they have their own data.

Record Assembly

Efficiently reconstruct the original records from columnar data.

Query Language & Execution

Mainly for a read-only system. Many queries are one-pass aggregations.

Experiments

All query is about the sum, count, groupBy,

And then measures

Executing time of the same query on row-based and column-based storage.
influences the number of servers in the server tree on execution time.
Scalability: Executing time vs increasing of leaf servers.

Conclusion