Dremel made simple with Parquet
1 minute read ∼ Filed in : A paper noteIntroduction
Advantages to columnar formats:
- better compression
- reduced I/O to only a few columns
Parquet file
https://parquet.apache.org/docs/file-format/
Format:
N Column is divided into M groups.
4-byte magic number "PAR1"
<Column 1 Chunk 1 + Column Metadata>
<Column 2 Chunk 1 + Column Metadata>
...
<Column N Chunk 1 + Column Metadata>
<Column 1 Chunk 2 + Column Metadata>
<Column 2 Chunk 2 + Column Metadata>
...
<Column N Chunk 2 + Column Metadata>
...
<Column 1 Chunk M + Column Metadata>
<Column 2 Chunk M + Column Metadata>
...
<Column N Chunk M + Column Metadata>
File Metadata
4-byte length in bytes of file metadata
4-byte magic number "PAR1"