Lqo

Posted on August 8, 2024 30 minute read

Some Ideas

Non-parametric model for model selection for query opt?

Dirichlet Process Mixture: It solves the problem of determining the number of mixture components by allowing the data to drive the complexity of the model.

Settings

config	size
shared_buffers >= N * work_mem	16GB
work_mem (sort/hash operation)	4GB
effective_cache_size (affects whether to use index scans or sequential scans.)	16GB
temp_buffers	4GB

geqo	off
AUTOVACUUM	off

enable_bitmapscan	on
enable_tidscan	on

max_parallel_workers	4
max_parallel_workers_per_gather	4
max_worker_processes	4

Templates

TakeawaysProblems

Assumptions

Techniques

Query Type

Query Encoding

Model

Workloads

Datasets

Dynamic Workload

1. DSB datasets

Existing benchmarks lack of

more distinct query instances
dynamic workloads
guidance of comprehensively comparing the performance trade-offs for workload-driven and traditional database systems.

Shift

Skew

Both exponential / Zipfian distributions are for those where the common value is more frequent
The exponential distribution is often used for modeling time-related data or “waiting times,” while Zipfian is used to model scenarios where there is a ranking or ordering of items, such as popularity.Zipfian

Workloads

skews and correlations, skews on individual columns, correlations between columns in the same table, and correlations of columns across multiple tables.

=> more skews and correlations to individual columns and multiple columns within a table and across tables.
more join patterns: including non-equi joins and cyclic join graphs.

=> new query templates to enrich the join patterns.
slice data at a fine granularity with multiple predicate filters on table columns.

=> fine-grained data slicing with additional predicate filters.
=> configurable parameterization to generate queries.
=> comprehensive user guidance.

Experiments

2. Modeling Shifting Workloads for Learned Database Systems

Takeaways

Problems

The key question is when and how best to re-train the model.

Retraining within near queries => catastrophic forgetting, large queries => time-consuming.
Workload imbalance leads to poor performance.

Assumptions

Assumes the underlying data to be static.

Techniques

Query Type, Query Encoding

As in other papers

Model

distributed shift detector (per l query) => train a new model/retrain the existing model with the importance of past/current queries.

KS test for measuring the distribution of each feature of two samples.
Bonferroni correction to adjust the significance level.

online clustering algorithm (Bayesian non-parametric models) => managing the replay buffer => providing data to re-train.

cluster based on the query feature encoding, other than learned embedding or gradients.
propose DP-Medoides based K-Medoids, since it is more robust to outliers/supports customerize distnace computes.
manage the reply buffer such that its balance, diversified.

re-training: combined loss of pass and current queries.

Workloads

Datasets

IMDB with JOB

DSB (50GB)

the query join graph and/or the distribution of query predicates may shift with time.

Dynamic Workload

Generate a query pool with each query classes having 200 queries.
shifting workloads:
1. randomly pick three classes (fixed)
2. for each class, randomly pick 100 queries
3. concatenate those queries.
4. repeate from 2-4 to form a workload with queries A(100), B(100), C(100), A(100)…
measure performance after retraining and evaluate in other classes (except the current one).

3. Balsa

Takeaways

Neo-impl is still not robust enough to generalize to unseen test queries and suffers from high variance.

Balsa’s better generalization is due to a broader state coverage offered by simulation, on-policy learning, and safe exploration

the main point of this paper is to learn without expert demonstrations.

Problems

Existing RL-based methods assume the availability of a mature query optimizer to learn from.

Assumptions

the database content is kept static.

Updates to the schema, appends, or in-place updates can be handled by retraining.

This assumption implies that the agent need not solve a learning problem with a shifting distribution. Another assumption is that Balsa currently optimizes select-project-join (SPJ) blocks.

Techniques

Formulation

query optimization as an MDP:

state: partial query plan.
action step in building a query plan in a bottom-up fashion， either join two tables, or assign a scan method.
a reward is given only at the final (terminal) state, each action only gets zero rewards

Query Type

Query Encoding

same as NEO

Process

learn to optimize queries without learning from an existing expert optimizer

Simulation and reality phases:

simulation phase trains a value network based on the simple CE cost, although this CE cost is not accurate, it enables the agent to avoid the worst plans.
reality phase finetunes the value network.
during exploration, Balsa generates several best-predicted plans (instead of the best) and then picks the best unseen one out of them.

Generalization:

train with high diversity, merge experiences with several independently agent with different seeds.

Some techniques

timeouts,
exploration only on the best few plans.
on-policy learning better than retraining.

Workloads

TPCH

4. Is Your Learned Query Optimizer Behaving As You Expect? A Machine Learning Perspective

Takeaways

A number of join is an irrelevant proxy for execution time.

GEQO is used for large join number queries, which can be disabled to fairly compare.

Problems

The absence of a unified reproducible framework make it currently impossible to fairly compare the results of LQOs.

Many filter combinations result in the same selectivity. Hence, the model would potentially suffer from the mismatch between features and target variables and will only perform well if this inconsistency is mitigated.

Assumptions

Techniques

Query Type & Query Encoding

All existing LQO are query-driven methods other than data-driven.

Encoding schema for a query should be both expressive and robust.

roubust: unique 1-to-1 mapping between the feature variables and the latency or cost of a query and its given plan
expressive: the encoding should clearly reflect both the global and local context.

Bao does not use talbes but only cardinalities and costs.

=> easier to retrain and schema-agnosticism.

=> not one-to-one mapping: different filters in a query has same cardinalities for the same table. ??

Bao deos not use query encoding => increases the probability of converging to a local optimum.

Model

An architecture that chains the ML models like in HybridQO, where different target variables are served to different models in the ML pipeline

Workloads

Dataset

JOB and STACK

We do not use the STATS-CEB benchmark, as it was originally developed for challenges in cardinality estimation as opposed to end-to-end query optimization, which is the focus of this paper.

Dynamic Workload

four worklods

Leave One Out Sampling.
Random Sampling.
Base Query Sampling.
Covariate Shift on JOB

5. NEO

Takeaways

Problems

However, none demonstrate that their improved cardinality estimations actually lead to better query plans.

Furthermore, unlike join order selection, selecting join operators (e.g., hash join, merge join) and choosing indexes cannot be entirely reduced to cardinality estimation.

Assumptions

existence of a Sample Workload consisting of queries representative of the user’s total workload

existence of the underlying engine’s capabilities (i.e., exercising a representative set of operators).

Techniques

end-to-end learning approach, including join order, index, and physical operator selection.

Formulation

query optimization as an MDP:

state: partial query plan.
action step in building a query plan in a bottom-up fashion
a reward is given only at the final (terminal) state, each action only gets zero rewards

Query Type

Query Encoding

Query level information (join graph) + plan level information (join order).

query-level:
- treats the join graph as an adjacency matrix and only encodes the upper triangular portion into one dim vec.
- column predicates vector: 1-hot, hist, row-vector. row vector is to represent each row as a sentence.
plan-level
- plan as a tree, each node is a vector of size J+2R, J is a number of join types, and R is a number of relations. Each relation has two values: table use or not? scan type

Model

Value Model (tree-convolution) to predict the final execution time for a partial or complete plan.

value network does not predict the cost of a subplan, but rather the best possible latency achievable from an execution plan that includes a given subplan.

best-first search + value network => valuate iteration techniques to generate the final plan.

init-state: every scan is unspecified and there are no joins.
action: either (1) fusing together two roots with a join operator or (2) turning an unspecified scan into a table or index scan
not greedily follow the suggestions of the value network.
with bootstrapping (warm-ups), it can avoid sample inefficiency problem

retraining from scratch other than fine-tuning.

Workloads.

Datasets

JOB, TPCH, Corp

Dynamic Workload

Train and test on various templates.

6.Lero

Takeaways

The hint set-based strategy has some intrinsic limitations

Problems

existing methods are

unstable performance.
high learning costs.
slow model updating.

It is unncessary to estimate cost with model.

Assumptions

We assume constant resource budget per plan execution in this paper

Techniques

Plan explorer + model two-class classification.

Formulation

pair-wise classification problem

Query Type

Query Encoding

each node in the tree: join method, scane method, cardinality, row wideth, table used. (No estimated costs)

Model

Pre-train on the synthetic workloads based on the estimated cost in the postgresql.

randomly generate a number of plans with different join orders and predicates on different tables,
randomly set the cardinality for each sub-plan of each plan and feed them into the native cost model (without executing the plans) to derive the estimated plan costs as labels.

Online training

get the real execution latency and re-train the model.
retrian every 100 queries on IMDB and STATS, every 30 queries on TPC-H and every 50 queries on TPC-DS.

Workloads.

Datasets

tpch, tpc-ds, stats, imdb.

Dynamic Workload

“time series split” strategy, always evaluated on the new template.

7.LEON

Takeaways

Theoretically, as long as the cardinality estimations and the cost model are accurate, this architecture obtains the optimal query plan

The db query optimizer can do transformation for logical expression, such as

unnest an in/exists subquery to a semi-join query.
subquery pushing involves evaluating a subquery earlier in the execution process and “pushing” its results into the main query. This changes the order of operations to reduce computational cost or improve query performance.

Those rules cannot be used in ML-based method, and thus ML-basd method cannot conduct the plan enumeration completely.

DB can be extendend with new operators, while the ML-based model can only learn in trail-and-error manner.

cost model to initialize the model => cold start problem.

Problems

ML-based methods learn domain-specifc knowledge in a data-driven manner. Thus, it suffers from the following two fundamental limitations:

only for spj query and cannot handle complicated query.
cold-start problem

Therefore, ML should aid the traiditonal query optimizer otherthan replacing it. And ML model should start from the traditional query optimizer.

Assumptions

Techniques

start from current performance and self-adjust to a certain dataset or workload.

cost estimation as a contextual parise-ranking problem.
robust plan exploration strategy
- a plan with a higher ranking position should be explored with a higher probability
- learned optimizer should correct its errors that lead to sub-optimal query plans

Formulation

formalizes query optimization as a contextual pair-wise plan ranking problem

Query Type

Query Encoding

logical properties: output cardinality and join graph of the sub query.

physical properties: plan tree and the physical property of the whole plan. Each node is one-hot encoding representing the physical operators and sort order.

Model

Intra-equivalent set Model: estimate the cost for each plan and uncertainty.

Inter-equivalent set Model: it is used to prune the redundant search space.

Workloads.

Datasets

Dynamic Workload

Performance

training efficiency and latency performance.

8. BASE

Takeaways

transfer the model pre-trained on the cost and then to the latency sounds promising.

learn a mapping from cost to latency sounds promising.

Problems

The DBMS cost model can estimate the cost for an execution plan, which is expected to reflect the latency of the plan, but they are not positively correlated in practice

Assumptions

Techniques

Two stage RL-based framework.

Stage 1: fit RL to chose the operation with leatest cost.

Stage 2: transfer a pre-trained policy based on the cost to a better policy based on latency

leverage cost as a trade-off to increase training efficiency and then transfer the pre-trained model based on the cost to a new model that can adapt to latency signals

Formulation

BASE formulates the procedure of locating and filling the gap between cost and latency as a variant of IRL.

Query Type

SPJ

Query Encoding

Same as NEO

Model

In the 2nd stage, we use a calibrated reward function by applying a parameterized calibration function to the cost signals.

calibration function only makes the calibrated cost signals and the latency signals more correlated

Workloads.

Datasets

STATS, IMDB.

Dynamic Workload

None

Performance

Fast training

9.Reinforcement Learning with Tree-LSTM for Join Order Selection

Takeaways

This handle the schema change by using dynamic pooling to map anylengthh input to a fixed sized vector.

Problems

Existing work

fixed-length feature vectors cannot capture the structural information of a join tree, which may produce poor join plans.
Moreover, it may also cause retraining the neural network when handling schema changes (e.g., adding tables/columns) or multialias table names that are common in SQL queries.

Assumptions

Techniques

Two stage RL-based framework.

Stage 1: pre-train the Deep Reinforcement Learning with cost,

Stage 2: finetune the model with the latency.

Formulation

BASE formulates the procedure of locating and filling the gap between 푟휙and 푟(푙 )as a variant of IRL.

Query Type

SPJ

Query Encoding

query + data

query: contains join information in a query. i.e., matrix to represent all join relation, then flatten into one vector. Then it applies FC to get the final representation
columns:
- numertical value: vector of 4 value, if column is in join, is =? is>? is<?
- string value: only use the selectivity information and get the column representions.
table: concatenate all columns representions.
join tree: use tree-LSTM to learn a representation.
- combine both Child-Sum Tree-LSTM and N-ar Tree LSTM.

Model

Representation: use a Tree-LSTM

RL: use DQN.

cost training and latency tunning.
not move target from cost to latecy directly, but instead treate them as multi task learning, where train two Q network but use the same input representation.

Handle the dynamic worklaod:

use dynamic pooling to encoding any dynamic length query
Adding columns also use dynamic pooling

Workloads.

Datasets

tpch, imdb

Dynamic Workload

insert/delete columns, add new tables etc.

Performance

10.Theoretical Analysis of Learned Database Operations under Distribution Shift through Distribution Learnability

Takeaways

Problems

no theoretical work shows any advantage in using the learned methods in dynamic datasets and under distribution shift.

The goal of this paper is to theoretically understand the capabilities of learned models for database operations, show why and when they outperform non-learned alternatives and provide theoretical guarantees on their performance

Assumptions

Techniques

indexing, cardinality estimation and sorting when presence of insertions from a possibly changing data distribution

Formulation

data after inserting n times is D_n

its distribution is learnable with parameter T, S, B.

T: the model can be evaluted in T operatios.
take S to store.
n*B to learn the model.

11.Eraser: Eliminating Performance Regression on Learned Query Optimizer

Takeaways

This is mainly focus on the dataset side other than model side, for any trained model, it filter the search spcae based on model performance to increate the effectiveness of the model.

Problems

learned model has regression due to reasons like

inherent difficulty of the learning problem,
the low generalization ability of the prediction model on new data,
the under-fitting on training data due to insufficient training data,
loss of features, noisy labels and inappropriate model training methods.

Assumptions

Learned model has low generalizatin ability, thus has big regression problem.

Techniques

Two stage RL-based framework.

Stage 1: coarse-grained filter to remove highly risky plans

Insight: ML prediction model is likely to underperform due to encountering unseen feature values, therefore unexpected plan space into subspaces based on unseen feature values. this is like data labelling process.
train a model to predict precise or imprecise of each subspace.

Stage 2: cluster plans in fine-grained manner, evalutes each cluster based on the prediction result.

Cluster the remaining precise plans into groups based on their feature values and the model’s reliability.
Assign each cluster a reliability interval

Formulation

A performance elimination method perf_elim such that benefits are maximum and loss is minimized.

Query Type

SPJ

Query Encoding

plan-level features: join relations, filtering predicates and operator types

join: join type, scan type,
filter: low and high value
relation: 0/1 to indicate existance.
structure: use a categorical variable to indicates it, bushy tree, left-deep and right-deep, on plans joining 4 tables

data-level features: estimated cardinality and data distribution

Model

unexpected plan explorer filter bad plans based on the relability threshold.

cluster the remining plans and use the segment model, and also assign relability value to reflect the quality.

Then filter again, and use plan selection method to balance the benefit of improvements and the risk of regression.

Workloads.

Datasets

tpch, imdb

Dynamic Workload

trian on 25% etc and evaluate on the fixed test dataset.

Performance

12.Access Path Selection in a Relational Database Management System

Takeaways

This handle the schema change by using dynamic pooling to map anylengthh input to a fixed sized vector.

Problems

This paper will address the issues of access path selection for queries

single table queries
join
nested queries

Techniques

query processing are 4 phases:

parsing:
optimization:
code generation:
execution:

13.CardBench: A Benchmark for Learned Cardinality Estimation in Relational Databases

Takeaways

learn from the transferable features, and not from the workload specific features such as table and attribute names.This approach enables learning of zero-shot CE models.

Problems

Traditional CE techniques used in modern database systems have well-known limitations, as they make simplistic data modeling assumptions, such as data uniformity and independence of columns in the tables.

Learned CE have not been adapted in practice, due to their high training overheads.

Techniques

zero-shot models are pre-trained on a broad spectrum of different datasets and use transferable features (dataset-agnostic), such as table sizes, allowing them to generalize to unseen datasets.

Query Type

SPJ

Model

GNN model to encode the query structure.

MLP-based cardinality model to get the final prediction.

Workloads.

Datasets

tpch, imdb

Dynamic Workload

Performance

14.LOGER: A Learned Optimizer towards Generating Efficient and Robust Query Execution Plans

Takeaways

This work aims to change the search space compared with bao, but still use hint, it input action(disable a join algorithm) and input join order into the network to predict the latency.

Problems

Assumptions

Techniques

reward weighting to reduce the impact of previous poor operators and make LOGER pay less attention to disastrous plans by log transformation.

Formulation

RL to build a tree

state: subplan’s join order and operator

action: join table and with restricted operator (disable a join algorithm, like disable hash join)

Query Type

spj without sub-queries

Query Encoding

Graph Transformer: GT not only efficiently captures relationships between table representations that include both table and predicate information, but also integrates global structural information of join graph into table representations.

table: table-level learned embedding vector.
column: column-level statistic vector with column type, proportion of unique values, unique value count, null values, index.
predicts: join exist or not, selectivity (via approximation of DBMS cardinality estimator.)

Model

The value model evaluates state-action pair candidates of subplans by predicting the reachable lowest latency among all plans that can be reached from the pair.

state network:

action network: take all join as input,

Workloads.

Datasets

Dynamic Workload

change schema.

Performance

2 hours of training, 2x speedup

15.Presto’s History-based Query Optimizer

Takeaways

share-everything => latency over scalability.

Problems

learning based CE/cost method requires large trainng efforts, less robustness, and hard to debug.

Techniques

HBO tracks query execution statistics at the operator node, and uses those to predict future performance for similar queries.

Similary query fetching: only record the template,
HBO is more powerful than CBO as it can store various runtime statistics related to scheduling as well

join reordering: HBO record the stats for some join order, and thus accurate. If not meet before, we use the CBO.

16.PARQO: Penalty-Aware Robust Plan Selection in Query Optimization

Takeaways

This paper mentioned that some CE is sensitive while other are not.

The whole paper based on the probability of f(s∣s′):

It starts as an error distribution estimated from historical data.
It drives sensitivity analysis by identifying critical dimensions.
It enables robust plan selection by sampling likely true selectivities and guiding the optimizer to focus on realistic scenarios.

Problems

Accurate estimation of the cardinality and cost comes at the cost of runtime monitoring and ongoing maintenance.

Therefore, how much influence of such uncertainty bring to the selectivity estimates still underexplored.

Robustness is tied to the uncertainty, and use the uncertainty in practise has challenges:

errors distributed on data and query worklaod is not well utiized in exsting work.
complex query has large cardinality estimate spaces, and thus has overhead to evaluate all other competing plans.
support learned optimzier at scale and in efficient way.

Assumptions

Techniques

getting the distribution of f(s

s`), where s` is estimated cardinalities, and s is the true selectivities.

based on the distribution, find the sensitive dimensions for a given query plan got from estimates s`.

identify the dimensions with biggest impact on the user-defined penalty function.
using sobol method

build a pool of candidate plan by sampling from the distribution of true selectivities conditioned on their estimates, then select one with lowest penalty.

Error profling:

build a model to estimate f(s s`) given Q and s`

break the query into querylet, which is a subquery pattern. like join with local selection.

for each querlet in (single-table querylet, two table querley, three-table querylet with fixed pattern), measure the true cardinalities and estimated cardinalities.

one model for low-selectivity, one model for hight selectivity.
Sensitive analysis

it uses sobol sensitiity to identify the sensitive selectivity dimensions for a given plan.
Plan selection use sampling to sample a pool of plans based on the probability f(s, s`), estimate the cost

Formulation

Query Type

Query Encoding

Model

KDE

Workloads.

Datasets

IMDB, DSB, STATS

Dynamic Workload

split database into multiple instance in time series manner.

Performance

Implementation details

ok, so from the beginning of the execution, i think this is the overall logic

it pre-defines some query templates, and then run lots of queries and collect (target dimension (which is only base table + number of join between any two pairs), estimate seleivity, and real seleivity.)
given a query, it will first fit a kde model for the error of each target dimension based on this datasets, and get a list of kde for each target dimension, then it sample 1000 points and get mean selectivity error value as the so called real error.
it then perform sensitive analysis. it first get the all predefined possible sensitive dims, named as pre_defined_querylet_dims,which may not exactly the same the target dimensions. (from the paper the thoery, if the dimensions in the target dimention has no selection, then we don;t care it, it’s CE in pg is accurate)
It get optimizaion plan fisrt (this is by feeding the optimzier with the real cardinalities and selectivility), then it get the hint to generate this plan in later. in the adjust_selectivity process, it first update selective of the the target_dimensions and update by provided error_samples, which can be inaccurate, since it is to measure the influence, is it? And then it updates all others dimensiton which is not in the target_dimensions but in the sampled_real_sel_err_mean_list, each key in the sampled_real_sel_err_mean_list is actually the predefined possible sensitives, we make all those who are not in the target as accurate as possible, so it will not casuing any influence in later. therefore, those dimens are update based on the sampled_real_sel_err_mean_list, while the target is based on the error_samples. finally all changes are updated to the updated_join_sel, which is a litst of original q_card_info.est_join_sel and note, this incldue all the join imforamtions, not one-to-one join, but one-to-multiple joins. here is a questions, the code after “what the fk is this” feels like has not effetc, is it?
it also get the default execution plan, which is the postgresql plan
then it generate sampels from joint error dist , name is joint error dist, but actually it sampled indenpendently. gfinally it get A and B sampels.
then it generates a problem, which is basiclaly for the sobol.
then i think i’m not familiart with the implementatin of the sobol, what is tht following do? in the theory, i think i just need to estimate ethe improtance of reach dimension by computing s = var()/Var(Y), is it? and we use MC sampling to do that, so we need A, B , and mc smapling also change the computing method of s. i only know htose

17.PLAQUE: Automated Predicate Learning at Query Time

Takeaways

Discovering the proper predicates is also important to improve the efficiency.

Problems

Predicates pushdown is useful in speeding up the query processing, but query optimzier only pushdown the predictes exist in the query. However, tables with no predicated also has a chance to prune.

Therefore, this paper try to discover the predicates and add that to the query in query execution.

18. FASTgres: Making Learned Query Optimizer Hinting Effective

Takeaways

This paper learn multiple GB models for each context, reduing the model complexity and training inefficiency. But not quite handle the change of data or query ? data is not changed in this setting.

Problems

bao cannot achieve the potetial to the full extent.

Assumptions

Techniques

Divide and conquer approach.

workload -> partitioned into query groups, each with different join tables and join predicates -> context.
learn a separate and context-sensitive ML model for each context. here since the partition is based on join pattern, join is not required to learn, we can only learn to map the filter predicates to the hint set.

Formulation

Divide and conquer approach.

Query Type

Query Encoding

predicate-based encoding

Model

GB models

Workloads.

Datasets

Stack [18], JOB [13], and TPC-H

Dynamic Workload

randonmly sample training query

Performance

19.OS Pre-trained Transformer: Predicting Query Latencies across Changing System Contexts

Takeaways

This paper propose to train a generatl embedding to ecndoe the informatin of the system envs, such as cpu, memory etc, and finally combine the result of this and instance-specifc embedding together to predict the final result of the task.

Problems

Model leared in one context fail when tested on a new system context. Thus, there is a generalization gap.

Therefore, existing work cannot utilize the interaction of different resources to accurately predict query performance.

Assumptions

system state does not change drastically just before and during query execution.

Techniques

Formulation

query plan, OS logs -> query latency.

Query Type

Query Encoding

GCN for query plans, the model should produce an embedding vector with all query specifc properties to predict the latency.

Model

Database specifialized query plan model: encoding the plan into a vecotr. like using GCN.
A universal (i.e., applicable to any workload) transformer model which takes as input recent OS logs, and produces an embedding vector that captures the system state.
Embedding vectors are used as input to a universal Multi Layer Perceptron (MLP) module to produce the latency prediction.
When in new datasets, it fix the Osprey model and prediction head, and only update the query embedding model, the GCN.

Workloads.

Datasets

150k PostgreSQL query execution plans and corresponding linux system logs for over 2500 unique queries from 14 database workloads executed on 10 AWS instance types, and under various system loads

Dynamic Workload

Performance

20. Zero-Shot Cost Models for Out-of-the-box Learned Cost Prediction

Takeaways

Problems

training data collection is high cost.

simply model often under/over-estimates the true costs of queries, since they cannot capture complex interactions in the query plan and data.

Assumptions

we only focus on the transfer of learned cost models across databases for a single database system on a fixed hardware.

Techniques

New query and data representation that allows zero-shot cost models to be pre-trained across databases.

Formulation

The goal of zero-shot cost estimation is to predict query latencies (i.e., runtimes) on an unseen database without having observed any query on this unseen database.

Query Type

Query Encoding

A new representation of queries that can generalize across databases.

Encoded information:

physical plan operators as nodes.
input columns, tables, predicate information.

Encoded in transferable way:

predicate structure: data types of columns/operators, intermediate cardinalities

Model

a database-agnostic model to estimate the runtime cost and a databasedependent model (e.g., histograms) to capture data characteristics.

database-agnostic cost model: feed with both estimated cardinalities and output of database-dependent model, it generates a cost.

Workloads

Datasets

Dynamic Workload

Performance

21. Rethinking Learned Cost Models: Why Start from Scratch?

Takeaways

Problems

end to end learning based cost model is hard to interpret and use, and retraining is costly.

learning-based models are trained on a specific database with specific hardware and software configurations. retraining is costly.
data collection is time-consuming and expensive process.
no insight of how the cost estimate was generated compared with the rules-based cost model.

learned cost model should be easy to train, hight transferability, and interpretabe.

Assumptions

we assume that the formula-based cost model can provide a good estimation, if all R-params in Table 1 are set correctly.

Techniques

Two stage offlien training stage and online refinement stage.

Static and Dynamic C parameters:

hardware, OS, database can be static configuration parameters.
data type and column correlation are dynamic configuration parameters.

Data collection:

Execute queries on different CPU, memory, SSDs, HDDs, operating systems, etc

Offline phase: capature relations between hyperparamers and static configuration parameters.

pre-train a decision tree for each of 8 operator, there are totally 8 trees.

Online phase: adaptively tune each tree for the dynamic query/data factors.

hundreds of dynamic configuration parameters, such as query related, data related and 97 configuration parameters.
online phase invoked when the cost model cannot provide a precise estimation for more than K queries.
few-show recommendatin model to rank all candidate parameters based on the effects on cost estimations.

Formulation

based on the system and workload configutaion, it learn a mapping from the those configutaion to the hyperparameter of the cost function, i.e., determine optimal choices of R-params in the built-in cost formulas of major database.

G: C -> R

The total operators considered are 8.

Query Type

Query Encoding

Model

decision tree model.

Workloads

Datasets

IMDB, TPC-H and TPC-DS

Dynamic Workload

Performance

###