Lqo
25 minute readSome Ideas
Non-parametric model for model selection for query opt?
Dirichlet Process Mixture: It solves the problem of determining the number of mixture components by allowing the data to drive the complexity of the model.
Settings
config | size |
---|---|
shared_buffers >= N * work_mem | 16GB |
work_mem (sort/hash operation) | 4GB |
effective_cache_size (affects whether to use index scans or sequential scans.) | 16GB |
temp_buffers | 4GB |
geqo | off |
AUTOVACUUM | off |
enable_bitmapscan | on |
enable_tidscan | on |
max_parallel_workers | 4 |
max_parallel_workers_per_gather | 4 |
max_worker_processes | 4 |
Templates
TakeawaysProblems
Assumptions
Techniques
Query Type
Query Encoding
Model
Workloads
Datasets
Dynamic Workload
1. DSB datasets
Existing benchmarks lack of
- more distinct query instances
- dynamic workloads
- guidance of comprehensively comparing the performance trade-offs for workload-driven and traditional database systems.
Shift
Skew
- Both exponential / Zipfian distributions are for those where the common value is more frequent
- The exponential distribution is often used for modeling time-related data or “waiting times,” while Zipfian is used to model scenarios where there is a ranking or ordering of items, such as popularity.Zipfian
Workloads
-
skews and correlations, skews on individual columns, correlations between columns in the same table, and correlations of columns across multiple tables.
=> more skews and correlations to individual columns and multiple columns within a table and across tables.
-
more join patterns: including non-equi joins and cyclic join graphs.
=> new query templates to enrich the join patterns.
-
slice data at a fine granularity with multiple predicate filters on table columns.
=> fine-grained data slicing with additional predicate filters.
-
=> configurable parameterization to generate queries.
-
=> comprehensive user guidance.
Experiments
2. Modeling Shifting Workloads for Learned Database Systems
Takeaways
Problems
The key question is when and how best to re-train the model.
-
Retraining within near queries => catastrophic forgetting, large queries => time-consuming.
-
Workload imbalance leads to poor performance.
Assumptions
Assumes the underlying data to be static.
Techniques
Query Type, Query Encoding
As in other papers
Model
distributed shift detector (per l query) => train a new model/retrain the existing model with the importance of past/current queries.
- KS test for measuring the distribution of each feature of two samples.
- Bonferroni correction to adjust the significance level.
online clustering algorithm (Bayesian non-parametric models) => managing the replay buffer => providing data to re-train.
- cluster based on the query feature encoding, other than learned embedding or gradients.
- propose DP-Medoides based K-Medoids, since it is more robust to outliers/supports customerize distnace computes.
- manage the reply buffer such that its balance, diversified.
re-training: combined loss of pass and current queries.
Workloads
Datasets
IMDB with JOB
DSB (50GB)
the query join graph and/or the distribution of query predicates may shift with time.
Dynamic Workload
- Generate a query pool with each query classes having 200 queries.
- shifting workloads:
- randomly pick three classes (fixed)
- for each class, randomly pick 100 queries
- concatenate those queries.
- repeate from 2-4 to form a workload with queries A(100), B(100), C(100), A(100)…
- measure performance after retraining and evaluate in other classes (except the current one).
3. Balsa
Takeaways
Neo-impl is still not robust enough to generalize to unseen test queries and suffers from high variance.
Balsa’s better generalization is due to a broader state coverage offered by simulation, on-policy learning, and safe exploration
the main point of this paper is to learn without expert demonstrations.
Problems
Existing RL-based methods assume the availability of a mature query optimizer to learn from.
Assumptions
the database content is kept static.
Updates to the schema, appends, or in-place updates can be handled by retraining.
This assumption implies that the agent need not solve a learning problem with a shifting distribution. Another assumption is that Balsa currently optimizes select-project-join (SPJ) blocks.
Techniques
Formulation
query optimization as an MDP:
- state: partial query plan.
- action step in building a query plan in a bottom-up fashion, either join two tables, or assign a scan method.
- a reward is given only at the final (terminal) state, each action only gets zero rewards
Query Type
Query Encoding
same as NEO
Process
learn to optimize queries without learning from an existing expert optimizer
Simulation and reality phases:
- simulation phase trains a value network based on the simple CE cost, although this CE cost is not accurate, it enables the agent to avoid the worst plans.
- reality phase finetunes the value network.
- during exploration, Balsa generates several best-predicted plans (instead of the best) and then picks the best unseen one out of them.
Generalization:
- train with high diversity, merge experiences with several independently agent with different seeds.
Some techniques
- timeouts,
- exploration only on the best few plans.
-
on-policy learning better than retraining.
Workloads
TPCH
4. Is Your Learned Query Optimizer Behaving As You Expect? A Machine Learning Perspective
Takeaways
A number of join is an irrelevant proxy for execution time.
GEQO is used for large join number queries, which can be disabled to fairly compare.
Problems
The absence of a unified reproducible framework make it currently impossible to fairly compare the results of LQOs.
Many filter combinations result in the same selectivity. Hence, the model would potentially suffer from the mismatch between features and target variables and will only perform well if this inconsistency is mitigated.
Assumptions
Techniques
Query Type & Query Encoding
All existing LQO are query-driven methods other than data-driven.
Encoding schema for a query should be both expressive and robust.
- roubust: unique 1-to-1 mapping between the feature variables and the latency or cost of a query and its given plan
- expressive: the encoding should clearly reflect both the global and local context.
Bao does not use talbes but only cardinalities and costs.
=> easier to retrain and schema-agnosticism.
=> not one-to-one mapping: different filters in a query has same cardinalities for the same table. ??
Bao deos not use query encoding => increases the probability of converging to a local optimum.
Model
An architecture that chains the ML models like in HybridQO, where different target variables are served to different models in the ML pipeline
Workloads
Dataset
JOB and STACK
We do not use the STATS-CEB benchmark, as it was originally developed for challenges in cardinality estimation as opposed to end-to-end query optimization, which is the focus of this paper.
Dynamic Workload
four worklods
- Leave One Out Sampling.
- Random Sampling.
- Base Query Sampling.
- Covariate Shift on JOB
5. NEO
Takeaways
Problems
However, none demonstrate that their improved cardinality estimations actually lead to better query plans.
Furthermore, unlike join order selection, selecting join operators (e.g., hash join, merge join) and choosing indexes cannot be entirely reduced to cardinality estimation.
Assumptions
existence of a Sample Workload consisting of queries representative of the user’s total workload
existence of the underlying engine’s capabilities (i.e., exercising a representative set of operators).
Techniques
end-to-end learning approach, including join order, index, and physical operator selection.
Formulation
query optimization as an MDP:
- state: partial query plan.
- action step in building a query plan in a bottom-up fashion
- a reward is given only at the final (terminal) state, each action only gets zero rewards
Query Type
Query Encoding
Query level information (join graph) + plan level information (join order).
- query-level:
- treats the join graph as an adjacency matrix and only encodes the upper triangular portion into one dim vec.
- column predicates vector: 1-hot, hist, row-vector. row vector is to represent each row as a sentence.
- plan-level
- plan as a tree, each node is a vector of size J+2R, J is a number of join types, and R is a number of relations. Each relation has two values: table use or not? scan type
Model
Value Model (tree-convolution) to predict the final execution time for a partial or complete plan.
- value network does not predict the cost of a subplan, but rather the best possible latency achievable from an execution plan that includes a given subplan.
best-first search + value network => valuate iteration techniques to generate the final plan.
- init-state: every scan is unspecified and there are no joins.
-
action: either (1) fusing together two roots with a join operator or (2) turning an unspecified scan into a table or index scan
- not greedily follow the suggestions of the value network.
- with bootstrapping (warm-ups), it can avoid sample inefficiency problem
retraining from scratch other than fine-tuning.
Workloads.
Datasets
JOB, TPCH, Corp
Dynamic Workload
Train and test on various templates.
6.Lero
Takeaways
The hint set-based strategy has some intrinsic limitations
Problems
existing methods are
- unstable performance.
- high learning costs.
- slow model updating.
It is unncessary to estimate cost with model.
Assumptions
We assume constant resource budget per plan execution in this paper
Techniques
Plan explorer + model two-class classification.
Formulation
pair-wise classification problem
Query Type
Query Encoding
each node in the tree: join method, scane method, cardinality, row wideth, table used. (No estimated costs)
Model
Pre-train on the synthetic workloads based on the estimated cost in the postgresql.
- randomly generate a number of plans with different join orders and predicates on different tables,
- randomly set the cardinality for each sub-plan of each plan and feed them into the native cost model (without executing the plans) to derive the estimated plan costs as labels.
Online training
- get the real execution latency and re-train the model.
- retrian every 100 queries on IMDB and STATS, every 30 queries on TPC-H and every 50 queries on TPC-DS.
Workloads.
Datasets
tpch, tpc-ds, stats, imdb.
Dynamic Workload
“time series split” strategy, always evaluated on the new template.
7.LEON
Takeaways
Theoretically, as long as the cardinality estimations and the cost model are accurate, this architecture obtains the optimal query plan
The db query optimizer can do transformation for logical expression, such as
- unnest an in/exists subquery to a semi-join query.
- subquery pushing involves evaluating a subquery earlier in the execution process and “pushing” its results into the main query. This changes the order of operations to reduce computational cost or improve query performance.
Those rules cannot be used in ML-based method, and thus ML-basd method cannot conduct the plan enumeration completely.
DB can be extendend with new operators, while the ML-based model can only learn in trail-and-error manner.
cost model to initialize the model => cold start problem.
Problems
ML-based methods learn domain-specifc knowledge in a data-driven manner. Thus, it suffers from the following two fundamental limitations:
- only for spj query and cannot handle complicated query.
- cold-start problem
Therefore, ML should aid the traiditonal query optimizer otherthan replacing it. And ML model should start from the traditional query optimizer.
Assumptions
Techniques
start from current performance and self-adjust to a certain dataset or workload.
-
cost estimation as a contextual parise-ranking problem.
-
robust plan exploration strategy
- a plan with a higher ranking position should be explored with a higher probability
- learned optimizer should correct its errors that lead to sub-optimal query plans
Formulation
formalizes query optimization as a contextual pair-wise plan ranking problem
Query Type
Query Encoding
logical properties: output cardinality and join graph of the sub query.
physical properties: plan tree and the physical property of the whole plan. Each node is one-hot encoding representing the physical operators and sort order.
Model
Intra-equivalent set Model: estimate the cost for each plan and uncertainty.
Inter-equivalent set Model: it is used to prune the redundant search space.
Workloads.
Datasets
Dynamic Workload
Performance
training efficiency and latency performance.
8. BASE
Takeaways
transfer the model pre-trained on the cost and then to the latency sounds promising.
learn a mapping from cost to latency sounds promising.
Problems
The DBMS cost model can estimate the cost for an execution plan, which is expected to reflect the latency of the plan, but they are not positively correlated in practice
Assumptions
Techniques
Two stage RL-based framework.
Stage 1: fit RL to chose the operation with leatest cost.
Stage 2: transfer a pre-trained policy based on the cost to a better policy based on latency
leverage cost as a trade-off to increase training efficiency and then transfer the pre-trained model based on the cost to a new model that can adapt to latency signals
Formulation
BASE formulates the procedure of locating and filling the gap between cost and latency as a variant of IRL.
Query Type
SPJ
Query Encoding
Same as NEO
Model
In the 2nd stage, we use a calibrated reward function by applying a parameterized calibration function to the cost signals.
calibration function only makes the calibrated cost signals and the latency signals more correlated
Workloads.
Datasets
STATS, IMDB.
Dynamic Workload
None
Performance
Fast training
9.Reinforcement Learning with Tree-LSTM for Join Order Selection
Takeaways
This handle the schema change by using dynamic pooling to map anylengthh input to a fixed sized vector.
Problems
Existing work
- fixed-length feature vectors cannot capture the structural information of a join tree, which may produce poor join plans.
- Moreover, it may also cause retraining the neural network when handling schema changes (e.g., adding tables/columns) or multialias table names that are common in SQL queries.
Assumptions
Techniques
Two stage RL-based framework.
Stage 1: pre-train the Deep Reinforcement Learning with cost,
Stage 2: finetune the model with the latency.
Formulation
BASE formulates the procedure of locating and filling the gap between 푟휙and 푟(푙 )as a variant of IRL.
Query Type
SPJ
Query Encoding
query + data
- query: contains join information in a query. i.e., matrix to represent all join relation, then flatten into one vector. Then it applies FC to get the final representation
- columns:
- numertical value: vector of 4 value, if column is in join, is =? is>? is<?
- string value: only use the selectivity information and get the column representions.
- table: concatenate all columns representions.
- join tree: use tree-LSTM to learn a representation.
- combine both Child-Sum Tree-LSTM and N-ar Tree LSTM.
Model
Representation: use a Tree-LSTM
RL: use DQN.
- cost training and latency tunning.
- not move target from cost to latecy directly, but instead treate them as multi task learning, where train two Q network but use the same input representation.
Handle the dynamic worklaod:
- use dynamic pooling to encoding any dynamic length query
- Adding columns also use dynamic pooling
Workloads.
Datasets
tpch, imdb
Dynamic Workload
insert/delete columns, add new tables etc.
Performance
10.Theoretical Analysis of Learned Database Operations under Distribution Shift through Distribution Learnability
Takeaways
Problems
no theoretical work shows any advantage in using the learned methods in dynamic datasets and under distribution shift.
The goal of this paper is to theoretically understand the capabilities of learned models for database operations, show why and when they outperform non-learned alternatives and provide theoretical guarantees on their performance
Assumptions
Techniques
indexing, cardinality estimation and sorting when presence of insertions from a possibly changing data distribution
Formulation
data after inserting n times is D_n
its distribution is learnable with parameter T, S, B.
- T: the model can be evaluted in T operatios.
- take S to store.
- n*B to learn the model.
11.Eraser: Eliminating Performance Regression on Learned Query Optimizer
Takeaways
This is mainly focus on the dataset side other than model side, for any trained model, it filter the search spcae based on model performance to increate the effectiveness of the model.
Problems
learned model has regression due to reasons like
- inherent difficulty of the learning problem,
- the low generalization ability of the prediction model on new data,
- the under-fitting on training data due to insufficient training data,
- loss of features, noisy labels and inappropriate model training methods.
Assumptions
Learned model has low generalizatin ability, thus has big regression problem.
Techniques
Two stage RL-based framework.
Stage 1: coarse-grained filter to remove highly risky plans
- Insight: ML prediction model is likely to underperform due to encountering unseen feature values, therefore unexpected plan space into subspaces based on unseen feature values. this is like data labelling process.
- train a model to predict precise or imprecise of each subspace.
Stage 2: cluster plans in fine-grained manner, evalutes each cluster based on the prediction result.
- Cluster the remaining precise plans into groups based on their feature values and the model’s reliability.
- Assign each cluster a reliability interval
Formulation
A performance elimination method perf_elim such that benefits are maximum and loss is minimized.
Query Type
SPJ
Query Encoding
plan-level features: join relations, filtering predicates and operator types
- join: join type, scan type,
- filter: low and high value
- relation: 0/1 to indicate existance.
- structure: use a categorical variable to indicates it, bushy tree, left-deep and right-deep, on plans joining 4 tables
data-level features: estimated cardinality and data distribution
Model
unexpected plan explorer filter bad plans based on the relability threshold.
cluster the remining plans and use the segment model, and also assign relability value to reflect the quality.
Then filter again, and use plan selection method to balance the benefit of improvements and the risk of regression.
Workloads.
Datasets
tpch, imdb
Dynamic Workload
trian on 25% etc and evaluate on the fixed test dataset.
Performance
12.Access Path Selection in a Relational Database Management System
Takeaways
This handle the schema change by using dynamic pooling to map anylengthh input to a fixed sized vector.
Problems
This paper will address the issues of access path selection for queries
- single table queries
- join
- nested queries
Techniques
query processing are 4 phases:
- parsing:
- optimization:
- code generation:
- execution:
13.CardBench: A Benchmark for Learned Cardinality Estimation in Relational Databases
Takeaways
learn from the transferable features, and not from the workload specific features such as table and attribute names.This approach enables learning of zero-shot CE models.
Problems
Traditional CE techniques used in modern database systems have well-known limitations, as they make simplistic data modeling assumptions, such as data uniformity and independence of columns in the tables.
Learned CE have not been adapted in practice, due to their high training overheads.
Techniques
zero-shot models are pre-trained on a broad spectrum of different datasets and use transferable features (dataset-agnostic), such as table sizes, allowing them to generalize to unseen datasets.
Query Type
SPJ
Model
GNN model to encode the query structure.
MLP-based cardinality model to get the final prediction.
Workloads.
Datasets
tpch, imdb
Dynamic Workload
Performance
14.LOGER: A Learned Optimizer towards Generating Efficient and Robust Query Execution Plans
Takeaways
This work aims to change the search space compared with bao, but still use hint, it input action(disable a join algorithm) and input join order into the network to predict the latency.
Problems
Assumptions
Techniques
reward weighting to reduce the impact of previous poor operators and make LOGER pay less attention to disastrous plans by log transformation.
Formulation
RL to build a tree
state: subplan’s join order and operator
action: join table and with restricted operator (disable a join algorithm, like disable hash join)
Query Type
spj without sub-queries
Query Encoding
Graph Transformer: GT not only efficiently captures relationships between table representations that include both table and predicate information, but also integrates global structural information of join graph into table representations.
- table: table-level learned embedding vector.
- column: column-level statistic vector with column type, proportion of unique values, unique value count, null values, index.
- predicts: join exist or not, selectivity (via approximation of DBMS cardinality estimator.)
Model
The value model evaluates state-action pair candidates of subplans by predicting the reachable lowest latency among all plans that can be reached from the pair.
state network:
action network: take all join as input,
Workloads.
Datasets
Dynamic Workload
change schema.
Performance
2 hours of training, 2x speedup
15.Presto’s History-based Query Optimizer
Takeaways
share-everything => latency over scalability.
Problems
learning based CE/cost method requires large trainng efforts, less robustness, and hard to debug.
Techniques
HBO tracks query execution statistics at the operator node, and uses those to predict future performance for similar queries.
- Similary query fetching: only record the template,
- HBO is more powerful than CBO as it can store various runtime statistics related to scheduling as well
- join reordering: HBO record the stats for some join order, and thus accurate. If not meet before, we use the CBO.
16.PARQO: Penalty-Aware Robust Plan Selection in Query Optimization
Takeaways
This paper mentioned that some CE is sensitive while other are not.
The whole paper based on the probability of f(s∣s′):
- It starts as an error distribution estimated from historical data.
- It drives sensitivity analysis by identifying critical dimensions.
- It enables robust plan selection by sampling likely true selectivities and guiding the optimizer to focus on realistic scenarios.
Problems
Accurate estimation of the cardinality and cost comes at the cost of runtime monitoring and ongoing maintenance.
Therefore, how much influence of such uncertainty bring to the selectivity estimates still underexplored.
Robustness is tied to the uncertainty, and use the uncertainty in practise has challenges:
- errors distributed on data and query worklaod is not well utiized in exsting work.
- complex query has large cardinality estimate spaces, and thus has overhead to evaluate all other competing plans.
- support learned optimzier at scale and in efficient way.
Assumptions
Techniques
getting the distribution of f(s | s`), where s` is estimated cardinalities, and s is the true selectivities. |
based on the distribution, find the sensitive dimensions for a given query plan got from estimates s`.
- identify the dimensions with biggest impact on the user-defined penalty function.
- using sobol method
build a pool of candidate plan by sampling from the distribution of true selectivities conditioned on their estimates, then select one with lowest penalty.
-
Error profling:
build a model to estimate f(s s`) given Q and s` break the query into querylet, which is a subquery pattern. like join with local selection.
for each querlet in (single-table querylet, two table querley, three-table querylet with fixed pattern), measure the true cardinalities and estimated cardinalities.
one model for low-selectivity, one model for hight selectivity.
-
Sensitive analysis
it uses sobol sensitiity to identify the sensitive selectivity dimensions for a given plan.
-
Plan selection use sampling to sample a pool of plans based on the probability f(s, s`), estimate the cost
Formulation
No
Query Type
No
Query Encoding
No
Model
KDE
Workloads.
Datasets
IMDB, DSB, STATS
Dynamic Workload
split database into multiple instance in time series manner.
Performance
17.PLAQUE: Automated Predicate Learning at Query Time
Takeaways
Discovering the proper predicates is also important to improve the efficiency.
Problems
Predicates pushdown is useful in speeding up the query processing, but query optimzier only pushdown the predictes exist in the query. However, tables with no predicated also has a chance to prune.
Therefore, this paper try to discover the predicates and add that to the query in query execution.
18. FASTgres: Making Learned Query Optimizer Hinting Effective
Takeaways
This paper learn multiple GB models for each context, reduing the model complexity and training inefficiency. But not quite handle the change of data or query ? data is not changed in this setting.
Problems
bao cannot achieve the potetial to the full extent.
Assumptions
Techniques
Divide and conquer approach.
- workload -> partitioned into query groups, each with different join tables and join predicates -> context.
- learn a separate and context-sensitive ML model for each context. here since the partition is based on join pattern, join is not required to learn, we can only learn to map the filter predicates to the hint set.
Formulation
Divide and conquer approach.
Query Type
Query Encoding
predicate-based encoding
Model
GB models
Workloads.
Datasets
Stack [18], JOB [13], and TPC-H
Dynamic Workload
randonmly sample training query
Performance
19.OS Pre-trained Transformer: Predicting Query Latencies across Changing System Contexts
Takeaways
This paper propose to train a generatl embedding to ecndoe the informatin of the system envs, such as cpu, memory etc, and finally combine the result of this and instance-specifc embedding together to predict the final result of the task.
Problems
Model leared in one context fail when tested on a new system context. Thus, there is a generalization gap.
Therefore, existing work cannot utilize the interaction of different resources to accurately predict query performance.
Assumptions
system state does not change drastically just before and during query execution.
Techniques
Formulation
query plan, OS logs -> query latency.
Query Type
Query Encoding
GCN for query plans, the model should produce an embedding vector with all query specifc properties to predict the latency.
Model
-
Database specifialized query plan model: encoding the plan into a vecotr. like using GCN.
-
A universal (i.e., applicable to any workload) transformer model which takes as input recent OS logs, and produces an embedding vector that captures the system state.
- Embedding vectors are used as input to a universal Multi Layer Perceptron (MLP) module to produce the latency prediction.
- When in new datasets, it fix the Osprey model and prediction head, and only update the query embedding model, the GCN.
Workloads.
Datasets
150k PostgreSQL query execution plans and corresponding linux system logs for over 2500 unique queries from 14 database workloads executed on 10 AWS instance types, and under various system loads
Dynamic Workload
Performance
20. Zero-Shot Cost Models for Out-of-the-box Learned Cost Prediction
Takeaways
Problems
training data collection is high cost.
simply model often under/over-estimates the true costs of queries, since they cannot capture complex interactions in the query plan and data.
Assumptions
we only focus on the transfer of learned cost models across databases for a single database system on a fixed hardware.
Techniques
New query and data representation that allows zero-shot cost models to be pre-trained across databases.
Formulation
The goal of zero-shot cost estimation is to predict query latencies (i.e., runtimes) on an unseen database without having observed any query on this unseen database.
Query Type
Query Encoding
A new representation of queries that can generalize across databases.
Encoded information:
- physical plan operators as nodes.
- input columns, tables, predicate information.
Encoded in transferable way:
- predicate structure: data types of columns/operators, intermediate cardinalities
Model
a database-agnostic model to estimate the runtime cost and a databasedependent model (e.g., histograms) to capture data characteristics.
database-agnostic cost model: feed with both estimated cardinalities and output of database-dependent model, it generates a cost.
Workloads
Datasets
Dynamic Workload
Performance