Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics
Introduction
History of the Data Warehouse
Schema
Schema-on-write: schema is well-defined when writing data into storage.
Schema-on-read: schema is defined when reading data for analysis.
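The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration, not something taken from the paper: the `SCHEMA` dictionary, the helper names `write_with_schema` and `read_with_schema`, and the sample records are all hypothetical. The point is where the schema check happens, either before data is stored or only when it is read back.

```python
import json

# Hypothetical schema for illustration: field name -> expected Python type.
SCHEMA = {"user_id": int, "event": str}


def write_with_schema(table, record):
    """Schema-on-write: reject records that do not match SCHEMA before storing."""
    if set(record) != set(SCHEMA):
        raise ValueError(f"unexpected fields: {sorted(record)}")
    for field, expected in SCHEMA.items():
        if not isinstance(record[field], expected):
            raise ValueError(f"{field} must be {expected.__name__}")
    table.append(record)


def read_with_schema(raw_lines):
    """Schema-on-read: raw lines were stored as-is; SCHEMA is applied only now."""
    rows = []
    for line in raw_lines:
        try:
            data = json.loads(line)
            # Keep only the schema fields and coerce each to its expected type.
            rows.append({f: t(data[f]) for f, t in SCHEMA.items()})
        except (KeyError, TypeError, ValueError):
            continue  # malformed data only surfaces at read time
    return rows


# Schema-on-write: the store only ever contains validated records.
warehouse = []
write_with_schema(warehouse, {"user_id": 1, "event": "click"})

# Schema-on-read: anything can be stored; structure is imposed when querying.
lake = ['{"user_id": "2", "event": "view", "extra": true}', '{"event": "orphan"}']
rows = read_with_schema(lake)
```

With schema-on-write, bad records fail at ingestion time; with schema-on-read, ingestion is cheap and flexible, but data-quality problems are pushed to whoever reads the data later.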
First-generation
First-generation data warehouses coupled compute and storage in an on-premises appliance, so enterprises had to provision and pay for peak load.
More and more datasets are unstructured (video, audio, and text documents), and data warehouses could not store or query them.
Second-generation
Lakehouse
The paper argues that a Lakehouse has the following properties:
- Based on open direct-access data formats, such as Apache Parquet and ORC
- First-class support for ML
- Offers state-of-the-art performance
The Lakehouse design can address several major challenges with data warehouses: data staleness, reliability, total cost of ownership, data lock-in, and limited use-case support.
The paper reports that a Lakehouse system using Parquet is competitive with popular cloud data warehouses on TPC-DS.