PostgreSQL 14 Internals

Posted on April 17, 2024 · 22 minute read

Notes

Buffer states retrieval UDF on page 171 could be used as states for RL.

pg_statio_all_tables on page 177 shows how many pages of a table have been read into the buffer cache.

ANALYZE big; updates the statistics of the table for the query planner. It is an essential step for PostgreSQL to make informed decisions about query execution plans.

Page 184 has an SQL query to check what fraction of each relation is cached, and whether this data is hot (a page is considered hot here if its usage count is greater than one).

By enabling the debug_print_parse parameter in PostgreSQL, we can view the full parse tree in the server log.

If you want to explore full plan trees, dump them into the server log by enabling the debug_print_plan parameter. But in practice, it is usually enough to view the text representation of the plan displayed by the EXPLAIN command.

Visibility of the table as on page 308 may be used.

The SQL on page 315 can update the default cardinalities.

The correlation field on page 323 introduces some methods to get the correlation between a column's values and their physical order on disk.

The SQL on page 326 shows all column statistics collected for an expression.

Dependency statistics can be created (page 329) and are useful.

The setting on page 349 shows how to enable or disable forced parallel query execution (force_parallel_mode).

Introduction

Data organization

A single PostgreSQL instance can serve several databases at a time; together they form a database cluster.

Logical Storage

A catalog in PostgreSQL is a set of system tables that store metadata about all database objects in a cluster, such as tables, indexes, functions, and schemas. This metadata includes details like object names, data types, and access permissions.

  • Catalog table names all begin with pg_

A schema is a namespace containing database objects. Predefined schemas:

  • public (for user objects), pg_catalog (system catalog tables), information_schema (views over the system catalog), pg_toast, pg_temp

Relations: all tables, indexes, and views are called relations.

  • All data associated with a relation is stored in several forks; each fork consists of one or more files of up to 1GB (configurable), and each such file is called a segment. The sequence number of the segment is appended to the end of its filename.
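A minimal way to see where the main fork of a relation lives on disk (the table name big is hypothetical):

-- Path of the first segment of the main fork, relative to PGDATA.
SELECT pg_relation_filepath('big');
-- Subsequent segments of the same fork get the suffixes .1, .2, and so on.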

Physical Storage

PGDATA is the directory that contains all the files related to the database cluster. Right after initialization it contains:

  • template0 database: is used for restoring data or creating a database with a different encoding.
  • template1 database: is a template for all other databases that a user can create (by copying template1) in the cluster.

A tablespace defines the physical data layout: it is a directory in the file system, and a database can store its data in several tablespaces. During cluster initialization, two tablespaces are created:

  • pg_default: default tablespace unless another tablespace is selected for this purpose.
  • pg_global: system catalog objects common to the whole cluster.
  • (In the book's example, the postgres database uses the tablespace xyzzy as its default one.)

Pages. To facilitate I/O, all files are logically split into pages (or blocks), which represent the minimum amount of data that can be read or written. Consequently, many internal PostgreSQL algorithms are tuned for page processing.

TOAST: Each row must fit a single page: there is no way to continue a row on the next page. To store long rows, PostgreSQL uses a special mechanism called TOAST (The Oversized Attributes Storage Technique). Long attribute values are separated into smaller chunks, which will be stored in the TOAST table. The chunk size is decided such that one page of the TOAST table can contain four rows.

Each TOAST table has three columns: chunk_id, chunk_seq, and chunk_data.

For indexes, the TOAST mechanism can offer only compression.

PostgreSQL supports a few strategies:

  • plain means that TOAST is not used (this strategy is applied to data types that are known to be “short,” such as the integer type).
  • extended allows both compressing attributes and storing them in a separate TOAST table.
  • external implies that long attributes are stored in the TOAST table in an uncompressed state.
  • main requires long attributes to be compressed first; they will be moved to the TOAST table only if compression does not help.

Due to TOAST, a table with potentially long attributes is backed by at least three files: the main data file, the TOAST table file, and the TOAST index file.
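A quick sketch of inspecting and changing the TOAST strategy of a column (the table items and its column payload are hypothetical):

-- Which TOAST table backs the relation, if any:
SELECT reltoastrelid::regclass FROM pg_class WHERE relname = 'items';

-- Current strategy per column: p = plain, x = extended, e = external, m = main.
SELECT attname, attstorage FROM pg_attribute
WHERE attrelid = 'items'::regclass AND attnum > 0;

-- Switch the column to uncompressed out-of-line storage:
ALTER TABLE items ALTER COLUMN payload SET STORAGE EXTERNAL;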

Process

A PostgreSQL server instance consists of several interacting processes.

postmaster: spawns all other processes and supervises them (restarting failed ones when needed).

startup restores the system after a failure. autovacuum removes stale data from tables and indexes.

checkpointer executes checkpoints. background writer flushes dirty pages to disk. stats collector collects usage statistics for the instance.

wal writer writes WAL entries to disk. wal sender sends WAL entries to a replica. wal receiver receives WAL entries on a replica.

Drawbacks:

  • static shared memory allocation does not allow resizing structures like buffer cache on the fly;
  • parallel algorithms are hard to implement and less efficient than they could be; (due to inefficient sharing state via IPC.)
  • sessions are tightly bound to processes.

To enable process interaction, the postmaster allocates shared memory, which is available to all the processes.

PostgreSQL uses a large portion of the shared memory for caching frequently accessed data.

Each client connection triggers spawning a new backend process, which lives for as long as the session lasts. This is inefficient and may cause problems when too many clients try to connect.

One solution is to use a connection pooler (PgBouncer or Odyssey), which serves many clients through a limited number of server connections.

Since PostgreSQL uses MVCC to implement the various isolation levels, it generates many old row versions, which are later removed by vacuum.

Memory

A large part of shared memory is used for the buffer cache.

In the buffer search and eviction, it uses a clock sweep algorithm, which goes around the buffer cache and reduces the usage count for each cached page by one as it passes.

The first unpinned buffer with the zero counts found by the clock hand will be cleared.

Thus, the usage count is incremented each time the buffer is accessed (that is, pinned), and reduced when the buffer manager is searching for pages to evict. As a result, the least recently used pages are evicted first, while those that have been accessed more often will remain in the cache longer.
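The usage counts can be observed with the pg_buffercache extension (a small sketch; viewing the whole cache typically requires superuser or pg_monitor privileges):

CREATE EXTENSION IF NOT EXISTS pg_buffercache;

-- Distribution of usage counts across the buffer cache; NULL means a free buffer.
SELECT usagecount, count(*)
FROM pg_buffercache
GROUP BY usagecount
ORDER BY usagecount;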

Each backend process also has its own local cache, which stores temporary tables, e.g., CREATE TEMPORARY TABLE tmp AS SELECT 1;

Query Execution

parsed, transformed, planned, executed.

Parser: Text => Parse Tree

It uses a lexer (previously used in compilers, interpreters, and many types of text processing systems ) to split the query text into a set of lexemes (such as keywords, string literals, and numeric literals).

It then uses a parser to validate this set against the SQL language grammar, and then build a Parse Tree.

Then it performs semantic analysis to determine whether the database contains any tables or other objects that this query refers to by name, and whether the user has permission to access these objects, by checking the catalog.

Transformation: Parse Tree => Rewritten Parse Tree

This rewrites the query:

  • replace references to views in the parse tree with subtrees of their defining queries (this also serves security purposes, e.g. row-level security).

Transformation is based on the query rewrite rule system. https://www.postgresql.org/docs/14/rules.html

Planning/Optimizer: Rewritten Parse Tree => Plan Tree

More tables to join => the number of candidate plans grows rapidly, and the optimal and non-optimal plans can differ in cost by orders of magnitude.

Optimizer uses a dynamic programming algorithm combined with some heuristics to reduce the search space.

A genetic algorithm (GEQO) is used to optimize the query when the number of relations to join reaches geqo_threshold (source code in optimizer/geqo/).
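For instance, the threshold can be inspected and adjusted per session:

SHOW geqo_threshold;       -- 12 by default: GEQO kicks in at 12 or more FROM items
SET geqo_threshold = 15;   -- use exhaustive search for larger join problems
SET geqo = off;            -- or disable the genetic optimizer entirely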

The cost-based optimizer estimates the resources required for execution (I/O operations and CPU cycles).

  • Startup cost: the cost of preparing for node execution.
    • This covers setting up the data structures and any other preparatory work needed before the node can return its first row. For example, a Sort node must read and sort its entire input before producing the first row, so its startup cost is high, whereas a sequential scan can start returning rows almost immediately.
  • Total cost: comprises all the expenses incurred by fetching the whole result.

Whether the query runs through a cursor matters: with a cursor, rows are fetched gradually; without one, the whole result is returned at once. The two cases lead to different plans (see the cursor sketch after this list):

  • Without a cursor, the planner minimizes the total cost.
  • With a cursor, it minimizes the cost of fetching the first cursor_tuple_fraction of the rows.
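A small sketch with a cursor over a hypothetical table big:

SHOW cursor_tuple_fraction;   -- 0.1 by default

BEGIN;
DECLARE c CURSOR FOR SELECT * FROM big ORDER BY id;
FETCH 10 FROM c;              -- the plan was chosen to return the first rows quickly
CLOSE c;
COMMIT;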

The cardinality estimation calculations rely on the collected statistics, such as table sizes and data distribution in table columns.

Cardinality Estimation

Steps:

  • Estimate the cardinality of the node's child nodes, i.e., the number of input rows.
  • Estimate the selectivity of the node, i.e., the fraction of input rows that remains in the output.
  • The cardinality of the node is the product of these two values.

Methods to compute CE:

  • CE of a filter condition: each column is considered independently, so the formulas below can be inaccurate for correlated columns (see the illustration after this list):

    sel(x AND y) = sel(x) · sel(y)
    sel(x OR y) = 1 − (1 − sel(x)) · (1 − sel(y))

  • CE of a join: the Cartesian product. If the first data set has N rows and the second has M rows, their Cartesian product has N × M rows (the maximum possible number of rows a join could return).
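A hedged illustration: suppose a hypothetical flights table whose departure_city and departure_airport columns are strongly correlated. Multiplying the individual selectivities then underestimates the matching row count:

EXPLAIN
SELECT * FROM flights
WHERE departure_city = 'Boston' AND departure_airport = 'BOS';
-- Compare the estimated "rows=" with the actual count from EXPLAIN (ANALYZE).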

Cost Model:

Disk I/O and CPU resources are modeled.

The cost of a node is computed as a function of its cardinality estimate, scaled by the cost constants (seq_page_cost, cpu_tuple_cost, and so on).

Some operations have no prerequisites, so their execution starts immediately, while others do: for example, a sort has to consume all of its input data first. The startup cost of such nodes is higher than zero.
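The two cost figures are visible directly in EXPLAIN output (a sketch over a hypothetical table big):

EXPLAIN SELECT * FROM big ORDER BY id;
-- Each node shows "cost=<startup>..<total>": the Sort node has a large startup
-- cost because it must consume its whole input before returning the first row,
-- while the Seq Scan below it starts at (almost) zero.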

Execution

executor opens a portal in backend memory, which keeps the state of the query currently being executed.

The execution process starts from the root:

  • nested loop join does not need to wait until all rows are received to start producing output. It processes and passes on rows as soon as they meet the join condition.

work_mem sets the maximum amount of memory that each operation in a query (such as a sort or a hash join) can use before spilling to disk.

PostgreSQL has no global cache for parsed queries or plans; prepared statements are cached only within their own backend.

Statistics in database

For analysis purposes, 300 × default_statistics_target random rows are sampled (30,000 with the default setting of 100). The sample size required to build statistics of a particular accuracy depends only weakly on the volume of analyzed data, so the size of the table is not taken into account.

default_statistics_target also determines how much detail (such as the number of most common values and histogram buckets) is stored for each column.
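The target can be raised globally or per column (a sketch on a hypothetical table big, column x):

SHOW default_statistics_target;            -- 100 by default, i.e. 30,000 sampled rows
ALTER TABLE big ALTER COLUMN x SET STATISTICS 1000;
ANALYZE big;                               -- rebuild the statistics with the new target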

Basic Statistics

  • Number of tuples in a relation (reltuples)
  • Relation size, in pages (relpages)
  • Number of pages tagged in the visibility map (relallvisible)

Null values

Distinct values

  • There is an n_distinct value for each column

Most common values

Each column has two associated arrays:

  • most_common_vals: the most common values themselves
  • most_common_freqs: their frequencies

Histogram

If there are too many distinct values to store them all in an array, PostgreSQL employs a histogram.

The column's value range is divided into several buckets, each containing approximately the same number of values.

The histogram is used for operations like estimating the selectivity of greater-than and less-than conditions.

Correlation

correlation between the physical order of data and the logical order defined by comparison operations.

This is used for the cost estimation of index scans.
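All of the per-column statistics described above are exposed in the pg_stats view (hypothetical table big, column x):

SELECT null_frac, n_distinct,
       most_common_vals, most_common_freqs,
       histogram_bounds, correlation
FROM pg_stats
WHERE tablename = 'big' AND attname = 'x';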

Expression Statistics

For custom functions or transformations applied to a column in a SQL query, PostgreSQL's query planner may struggle to accurately estimate the distribution of the data after the transformation. It falls back to a default selectivity estimate of 0.5%.

The 0.5% figure is a cautious estimate, assuming that any given transformed value (like extracting a month, converting strings, or applying mathematical operations) might only match a small fraction of total rows.

Therefore, we need to collect expression statistics ourselves.

  • Create extended-expression statistics by using CREATE STATISTICS.
  • Statistics for the expressions of expression indexes, which are gathered automatically on ANALYZE.
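A minimal sketch of both options, assuming a hypothetical table users with an email column:

-- Extended expression statistics (PostgreSQL 14+):
CREATE STATISTICS users_lower_email ON (lower(email)) FROM users;
ANALYZE users;

-- Alternatively, an expression index collects statistics for its expression
-- automatically on every ANALYZE:
CREATE INDEX ON users (lower(email));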

multivariate statistics

span several table columns.

There is a well-known problem of correlated predicates: the planner assumes predicates are independent, so the selectivity of filter conditions combined by logical AND is estimated as the product of their individual selectivities. As a result, the planner underestimates the row count.

  • A plain index scan wins for fetching a small number of tuples,
  • a bitmap scan can win for a somewhat larger number of tuples,
  • a sequential scan wins if you're fetching a large percentage of the whole table.

Dependency statistics can be created between columns.
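A sketch of declaring such a dependency between two hypothetical columns city and zip of a table addresses:

CREATE STATISTICS addr_dep (dependencies) ON city, zip FROM addresses;
ANALYZE addresses;   -- the planner can now correct the combined selectivity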

Summary

The statistics gathered by PostgreSQL’s query optimizer impact various parts of a query, particularly in the following areas:

  1. Where Conditions:
    • Basic statistics such as reltuples and relpages help in estimating the overall size of the data and the amount of data to process, influencing whether a sequential scan or an index scan is more efficient.
    • Distinct Values (n_distinct) and Most Common Values (most_common_vals and most_common_freqs) are crucial for estimating the selectivity of filters in the WHERE clause. These statistics determine how many rows are likely to match a given condition, especially when filtering on specific column values.
    • Histograms are used to estimate the selectivity for range conditions (e.g., column > value or column < value). They help the optimizer understand the distribution of data within a column when exact matches are not feasible to calculate due to a high number of distinct values.
  2. Join Conditions:
    • Multivariate Statistics play a significant role in join conditions by helping to estimate the selectivity and distribution of values that result from joining tables. These are particularly useful when columns from different tables are correlated, which might not be apparent through individual column statistics.
    • Expression Statistics and Correlation data influence the optimizer’s decision on which type of join (nested loop, hash join, or merge join) is most efficient based on the expected size of join outputs and the physical order of data.
  3. Index Scans:
    • The correlation between physical and logical order of data greatly affects whether an index scan is beneficial. High correlation means that the data is physically stored in a way that aligns well with the index, making index scans faster and more predictable.

In essence, these statistics help PostgreSQL’s optimizer make informed decisions about every aspect of query execution—from choosing scan types to optimizing joins and applying filters—thereby enhancing performance by selecting the most efficient execution paths based on data distribution and query structure.

Table Access Methods

PostgreSQL allows you to plug in various table engines (pluggable storage engines, i.e. table access methods).

An engine defines:

  • tuple format and data structure
  • table scan implementation and cost estimation
  • implementation of insert, delete, update, and lock operations
  • visibility rules.
  • vacuum and analysis procedures.

The default storage engine is heap; alternative (experimental) ones include zheap and zedstore (columnar storage).

Sequential scans.

A large sequential scan uses a separate, small buffer ring, so it does not evict pages from the main buffer cache.

Multiple processes can share the same buffer ring and thus read the same data only once (synchronized sequential scans).

Cost estimate

Disk I/O and CPU resources.

There are default cost constants for each kind of operation, chosen with typical hardware in mind; they can be overridden, including at the tablespace level (see the sketch after the constants):

#define DEFAULT_SEQ_PAGE_COST  1.0
#define DEFAULT_RANDOM_PAGE_COST  4.0
#define DEFAULT_CPU_TUPLE_COST 0.01
#define DEFAULT_CPU_INDEX_TUPLE_COST 0.005
#define DEFAULT_CPU_OPERATOR_COST  0.0025
#define DEFAULT_PARALLEL_TUPLE_COST 0.1
#define DEFAULT_PARALLEL_SETUP_COST  1000.0

#define DEFAULT_EFFECTIVE_CACHE_SIZE  524288   /* measured in pages */
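As noted above, some of these constants can be overridden per tablespace, for example (the tablespace name fast_ssd is hypothetical):

ALTER TABLESPACE fast_ssd SET (random_page_cost = 1.1, seq_page_cost = 1.0);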

I/O cost = cost of reading a single page × number of pages to read

The planner will not consider parallel execution at all if the estimated volume of heap data to be read does not exceed the 8MB min_parallel_table_scan_size value.
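A quick way to see whether a parallel plan is chosen (hypothetical table big):

SHOW min_parallel_table_scan_size;         -- 8MB by default
SET max_parallel_workers_per_gather = 4;
EXPLAIN SELECT count(*) FROM big;          -- look for Gather / Parallel Seq Scan nodes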

Not all queries can be parallelized.

  • Queries that modify or lock data (UPDATE, DELETE, SELECT FOR UPDATE, and the like).
  • Queries that can be paused. It applies to queries run within cursors, including FOR loops in PL/pgSQL
  • Queries that call PARALLEL UNSAFE functions. By default, these are all user-defined functions and a few standard ones.
  • Queries within functions if these functions are called from a parallelized query (to avoid recursive growth of the number of workers).

Index Access Methods

PostgreSQL supports six built-in index access methods: btree, hash, gist, spgist, gin, and brin.

Tuples in PostgreSQL are referred to by six-byte tuple IDs (TIDs).

Data amount vs scan methods:

  • Select a large amount of data => sequential scan
  • Select a small amount of data => index scan
  • Select a moderate amount of data => bitmap scan

Index scans

index-only scan

The cost is the sum of the estimated costs of the index access operations and of the heap page reads.

The I/O estimate depends on both the selectivity of the index scan and the correlation between the physical order of tuples on disk and the order in which the access method returns their IDs.

Bitmap scan

A Bitmap Index Scan is used to efficiently determine “where to look” in the table, and a Bitmap Heap Scan is used to actually “look” at the data and retrieve it.

Bitmap index scan is less dependent on the correlation.
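For experiments, individual scan methods can be discouraged per session to compare plans (hypothetical table big with an index on x):

EXPLAIN SELECT * FROM big WHERE x BETWEEN 100 AND 200;   -- likely a bitmap scan
SET enable_bitmapscan = off;
EXPLAIN SELECT * FROM big WHERE x BETWEEN 100 AND 200;   -- now an index or sequential scan
RESET enable_bitmapscan;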

Join methods

Postgresql provides several join methods:

  • a nested loop join
  • a hash join
  • a merge join

Nested Loop Joins

Supports inner joins and left outer joins; it does not support right and full joins.

It traverses the outer set and, for each of its rows, scans the inner set looking for matches, so the inner set may be scanned many times.

Therefore, the efficiency of nested loop joins depends on several factors:

  • The cardinality of the outer set of rows.
  • Availability of an access method that can efficiently fetch the needed rows of the inner set.
  • Recurrent access to the same rows of the inner set.

A query can also be executed this way:

  • Scan the inner set once and save its rows into a materialized store (the Materialize node), then reuse it many times.

CE: the Cartesian product is estimated as the product of the cardinalities of the joined data sets.

Cost:

  • startup cost: the sum of the child nodes' startup costs,
  • the cost of fetching all rows of the outer set,
  • the cost of one retrieval of all the rows of the inner set, multiplied by the number of outer rows,
  • the cost of processing each row to be returned.

If the inner set is repeatedly scanned with the same parameter values, its rows can be cached (page 407). However, since the cached data may grow too large, entries can be evicted from the cache.

Hash Joins

Supports any type of join.

It works in a single pass if the hash table fits in RAM; if not, two passes are applied (the data is split into batches).

The smaller set is usually used as the inner one, as it results in a smaller hash table.

In the first stage, it builds a hash table by pulling the whole inner set of rows from its child node.

  • The columns referenced in the join condition form the hash key; the value is all the queried fields of the inner-set row.

In the second stage, it traverses the outer set, computes the hash key for each row, and probes the hash table.

Costs:

  • building the hash table: the total cost of fetching the inner set, computing the hash function, and inserting the rows into the hash table;
  • fetching the outer set of rows, hashing each row, and probing the hash table.

If the data is too large to fit into memory, the inner set is divided into several batches; one batch is kept in memory while the others are written to temporary files on disk. For the outer set, the hash of each row key is computed first, which directly determines which batch the row belongs to: if it belongs to the current (in-memory) batch, the join is performed right away; otherwise the row is written out and waits for the second pass.

If some batch is still too large to fit into memory, the number of batches is increased (doubled) on the fly: each batch is virtually split into two new ones, and about half of the rows (assuming a uniform distribution) stay in the hash table, while the other half is saved into a new temporary file.

In the case of non-uniform distribution, increasing the number of batches may not help. For example, if the key column contains the same value in all its rows, it will be placed into the same batch since the hash function will return the same value over and over again. Unfortunately, the hash table will continue growing in this case, regardless of the imposed restrictions.

In theory, this issue could be addressed by a multi-pass join, which would perform partial scans of the batch, but it is not supported.

This problem is not solved yet?

Merge Join

The merge join algorithm can be used with all types of joins.

It consumes data sorted by the join key and returns a result that is sorted the same way.

If the query contains an ORDER BY on the join key, a merge join is often preferred because its output is already sorted.

Merge join requires its inputs to be sorted on the join keys. If they are not already sorted, PostgreSQL sorts them first; this is represented in the plan by a Sort node. The join itself takes only one pass over both data sets, needs no additional memory, and keeps two pointers to the current rows.

PostgreSQL can choose among several sorting methods: quicksort, external merge, and top-N heapsort.

  • If the data set to be sorted fits into the work_mem chunk (4MB by default), the classic quicksort method is applied.
  • If a data set needs to be sorted only partially (as defined by the LIMIT clause), the top-N heapsort method can be applied.
  • If the data set turns out to be too big to be sorted in memory, the sorting node switches over to external merge sorting: rows are written into several files, each of which is sorted internally, but there is no ordering across files; the files are then merged using the merge sort algorithm.
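The chosen sort method is visible in EXPLAIN (ANALYZE) output and depends on work_mem (hypothetical table big):

SET work_mem = '4MB';
EXPLAIN (ANALYZE) SELECT * FROM big ORDER BY id;
-- "Sort Method: external merge  Disk: ..." indicates spilling to temporary files;
-- with a larger work_mem the same query may show "Sort Method: quicksort  Memory: ...".
SET work_mem = '256MB';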

Comparison

The nested loop join

  • does not have any prerequisites and can start returning the first rows of the result set right away
  • It is the only join method that does not have to fully scan the inner set (as long as index access is available for it).
  • These properties make the nested loop algorithm (combined with indexes) an ideal choice for short OLTP queries, which deal with rather small sets of rows.
  • supports all join conditions

However:

  • The complexity of the nested loop join grows with the product of the cardinalities of the joined sets; with index access to the inner set it often shows close-to-linear rather than quadratic growth, although with a high linear coefficient.

A hash join

  • Takes only one pass over each of the two data sets; works best on large data sets.
  • Combined with sequential table scans, this algorithm is typically used for OLAP queries, which compute the result based on a large volume of data.
  • However, if the response time is more important than throughput, a hash join is not the best choice: it will not start returning the resulting rows until the whole hash table is built.
  • Response time here means the time between sending a query and receiving the first row of the result set.

Merge Join

  • A merge join can perfectly handle both short OLTP queries and long OLAP ones.
  • An added bonus of a merge join is the equivalence of the inner and outer sets. The efficiency of both nested loop and hash joins is highly dependent on whether the planner can assign inner and outer sets correctly.



