Introduction
This is the documentatation for Bao for PostgreSQL.
- GitHub: https://learned.systems/bao
Bao is designed for PostgreSQL 12. We've tested that the Bao code compiles for version 12.17, and suggest starting with that version.1 You'll also need Python 3.8 or later.
Bao is a learned query optimizer for PostgreSQL. Bao works by providing automatic coarse-grained query hints (e.g., SET enable_nestloop TO off
) on a per-query basis. Bao uses reinforcement learning, so Bao learns from it mistakes.
Bao has two components: the Bao server, which is a standalone Python application, and the PostgreSQL extension, which integrates directly with PostgreSQL and communicates with the Bao server. The best way to try out Bao is to follow the tutorial.
This implementation has a number of features:
- In the default configuration, Bao works as a learned query optimizer, providing coarse-grained hints to the PostgreSQL query planner and incorporating feedback from query execution to improve its recommendations.
- Bao provides a continually-updated query performance prediction model that is custom-tailored to your DB and workload. Even if you do not use Bao for query optimization, you can still use it to predict the runtime of your queries. Runtime predictions are made available via
EXPLAIN
. - Bao can be used as an advisor, simply providing the coarse-grained hints that Bao would use if Bao were running as a full optimizer. This allows you to manually apply Bao's recommendations to only a few queries.
- Since Bao uses reinforcement learning, Bao must balance exploration and exploitation, and will occasionally try out a query plan that may be slower than the one chosen by PostgreSQL; you have to make mistakes in order to learn! However, when regressions on certain queries are unacceptable, these special queries can be pre-explored using Bao's exploratory mode. Queries added to Bao's exploratory mode are tested at user-defined times, and future Bao models are checked to properly handle these queries. Bao will never pick a regressed query plan for a query processed in exploratory mode.
- Separate server process that can run on a different machine from your database. You can offload model training, potentially to a machine with a GPU. You can also co-locate multiple Bao servers together, if you have multiple DBs, so they share training resources.
Bao is provided under a GPLv3 license. Specifically, please note:
THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
Thanks to Yao Tian for determining applicable PG versions!
Bao Tutorial
In this tutorial, we'll set up Bao with PostgreSQL, load the IMDB dataset, and then execute a small sample workload with and without Bao to see the difference. Then, we'll look at how Bao's exploratory mode can be used to prevent query regressions. Finally, we will look at how Bao can be used as a query advisor.
- Setup PostgreSQL with an example DB and the Bao extension
- Start the Bao server
- Run an example workload, and compare performance with PostgreSQL's optimizer
- Use exploration mode to eliminate query regressions
- Use Bao as an advisor
- Notes about this tutorial
PostgreSQL Setup
In this tutorial, we'll use the IMDB dataset from the "How Good are Query Optimizers, Really?"1 paper. The fastest way to get started is to use the pre-built Docker image which contains a PostgreSQL database and the compiled Bao extension. Yu can launch a container like this:
$ docker run -p 127.0.0.1:5432:5432/tcp --add-host host.docker.internal:host-gateway --shm-size=8g ryanmarcus/imdb_bao:v1
If you don't want to use the Docker container, follow the rest of the instructions on this page. Otherwise, continue to the next step.
To load the IMDB data into Postgres, there are two options. You can setup a PostgreSQL database with this data via a virtual machine, or you can download a a PostgreSQL dump from the Harvard dataverse and manually load the data yourself.
Don't forget to configure PostgreSQL with sane defaults. The VM setup will do this automatically, but at a minimum you should set the shared_buffers
variable to something larger than the default (around 25% to 40% of your total RAM is recommended).
Assuming you've used the virtual machine, we can test that the DB was setup correctly:
# add -h localhost if you are connecting to the VM from your host machine
$ psql -U imdb
psql (12.3)
Type "help" for help.
imdb=# select count(*) from title;
count
---------
2528312
(1 row)
Install the Bao extension
With our PostgreSQL database setup, it is time to install Bao. If you are using a virtual machine, the following steps should be performed from within the VM (i.e., type vagrant ssh
to login).
$ git clone https://github.com/learnedsystems/BaoForPostgreSQL
$ cd BaoForPostgreSQL
$ ls
bao_server branding COPYING LICENSE pg_extension
README.md run_queries.py sample_queries
The directory pg_extension
contains the code for the PostgreSQL extension. We'll install that next, using the PGXS system. Make sure your machine has the these installed.
For Ubuntu, you need these packages:
sudo apt-get install postgresql-server-dev-all postgresql-common
For Arch Linux:
sudo pacman -S postgresql-libs
With the correct packages installed, we can proceed to install the Bao extension. First, turn off PostgreSQL:
systemctl stop postgresql
Next, build and install the Bao PostgreSQL extension.
# depending on your setup, this may require sudo
cd pg_extension
make USE_PGXS=1 install
If everything goes correctly, you should see output like below:
$ make USE_PGXS=1 install
/usr/bin/clang -Wno-ignored-attributes -fno-strict-aliasing -fwrapv -O2 -I. -I./ -I/usr/include/postgresql/server -I/usr/include/postgresql/internal -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -I/usr/include/libxml2 -flto=thin -emit-llvm -c -o main.bc main.c
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -march=x86-64 -mtune=generic -O2 -pipe -fno-plt -fPIC -I. -I./ -I/usr/include/postgresql/server -I/usr/include/postgresql/internal -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -I/usr/include/libxml2 -c -o main.o main.c
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -march=x86-64 -mtune=generic -O2 -pipe -fno-plt -fPIC -shared -o pg_bao.so main.o -L/usr/lib -Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now -L/usr/lib -Wl,--as-needed
/usr/bin/mkdir -p '/usr/lib/postgresql'
/usr/bin/mkdir -p '/usr/share/postgresql/extension'
/usr/bin/mkdir -p '/usr/share/postgresql/extension'
/usr/bin/install -c -m 755 pg_bao.so '/usr/lib/postgresql/pg_bao.so'
/usr/bin/install -c -m 644 .//pg_bao.control '/usr/share/postgresql/extension/'
/usr/bin/install -c -m 644 .//pg_bao--0.0.1.sql '/usr/share/postgresql/extension/'
/usr/bin/mkdir -p '/usr/lib/postgresql/bitcode/pg_bao'
/usr/bin/mkdir -p '/usr/lib/postgresql/bitcode'/pg_bao/
/usr/bin/install -c -m 644 main.bc '/usr/lib/postgresql/bitcode'/pg_bao/./
cd '/usr/lib/postgresql/bitcode' && /usr/bin/llvm-lto -thinlto -thinlto-action=thinlink -o pg_bao.index.bc pg_bao/main.bc
Now that the Bao extension is installed, we have to tell PostgreSQL to load it. To do this, modify your postgresql.conf
file to load the pg_bao
shared library. If you are using the VM, you can do this like so (as a superuser):
echo "shared_preload_libraries = 'pg_bao'" >> /media/data/pg_data/data/postgresql.conf
Next, we restart PostgreSQL:
systemctl restart postgresql
We can now reconnect to the database and test to make sure Bao is installed:
# add -h localhost if you are connecting to the VM from your host machine
$ psql -U imdb
psql (12.3)
Type "help" for help.
imdb=# SHOW enable_bao;
enable_bao
------------
off
(1 row)
If the enable_bao
is present (it defaults to off
), then the Bao extension is installed. If instead you see an error like below, then the Bao extension has not been installed:
imdb=# show enable_bao;
ERROR: unrecognized configuration parameter "enable_bao"
Notes
Leis, Viktor, Andrey Gubichev, Atanas Mirchev, Peter Boncz, Alfons Kemper, and Thomas Neumann. “How Good Are Query Optimizers, Really?” PVLDB, VLDB ’15, 9, no. 3 (2015): 204–215. https://doi.org/10.14778/2850583.2850594.
Bao Server Setup
After configuring the PostgreSQL extension, we'll next setup the Bao server. The Bao server is responsible for loading, maintaining, and serving the learned query optimization model. For now, we'll assume that the Bao server will run on the same machine as PostgreSQL, although this is not required.
The Bao server is configured with the BaoForPostgreSQL/bao_server/bao.cfg
configuration file, but the defaults should work fine for this tutorial. Before starting the server, several Python dependencies are required. If you are using the VM, these are already installed. On Arch Linux, these can be installed by:
pacman -S python-scikit-learn python-numpy python-joblib python-pytorch-opt
If you are using a different Linux distribution, the package names may be slightly different. If you'd prefer, you can also install these dependencies with pip
:
pip3 install scikit-learn numpy joblib
pip3 install torch==1.5.0+cpu -f https://download.pytorch.org/whl/torch_stable.html
Once these dependencies are installed, we can launch the Bao server.
$ cd bao_server
$ python3 main.py
Listening on localhost port 9381
Spawning server process...
With the Bao server running, we can now test to see if PostgreSQL can connect to it. If you are not running the Bao server on the same node as PostgreSQL, you'll need to change ListenAddress
in bao.cfg
and set the PostgreSQL variables bao_host
and bao_port
.
# add -h localhost if you are connecting to the VM from your host machine
$ psql -U imdb
psql (12.3)
Type "help" for help.
imdb=# SET enable_bao TO on;
SET
imdb=# EXPLAIN SELECT count(*) FROM title;
QUERY PLAN
------------------------------------------------------------------------------------------
Bao prediction: NaN
Bao recommended hint: (no hint)
Finalize Aggregate (cost=50166.51..50166.52 rows=1 width=8)
-> Gather (cost=50166.29..50166.50 rows=2 width=8)
Workers Planned: 2
-> Partial Aggregate (cost=49166.29..49166.30 rows=1 width=8)
-> Parallel Seq Scan on title (cost=0.00..46532.63 rows=1053463 width=0)
(7 rows)
If everything is setup correctly, you should see two lines in the EXPLAIN
output related to Bao -- the prediction and the hint. Since Bao currently has no experience, it has no model, so there's no prediction or hint provided.
Next, we'll test to make sure Bao can correctly record feedback from PostgreSQL query executions. In the same session (with enable_bao
set to ON
), execute a query:
imdb=# SELECT count(*) FROM title;
count
---------
2528312
(1 row)
If you look at the stdout
of the Bao server, you should see a line like:
Logged reward of 2103.556027
This indicates that Bao has recorded a runtime of 2 seconds for this simple query plan. Note that Bao does not record your SQL queries, only a scrubbed version of your query plans. Bao keeps the type of each node, the estimated cost and cardinality, and the names of the involved relations. Bao never stores, for example, the predicate values from an executed query. Here's what Bao stores for the above query:
{
"Plan": {
"Node Type": "Other",
"Node Type ID": "42",
"Total Cost": 50166.515833,
"Plan Rows": 1,
"Plans": [
{
"Node Type": "Other",
"Node Type ID": "45",
"Total Cost": 50166.500833,
"Plan Rows": 2,
"Plans": [
{
"Node Type": "Other",
"Node Type ID": "42",
"Total Cost": 49166.300833,
"Plan Rows": 1,
"Plans": [
{
"Node Type": "Seq Scan",
"Node Type ID": "19",
"Relation Name": "title",
"Total Cost": 46532.633333,
"Plan Rows": 1053463
}
]
}
]
}
]
},
"Buffers": {
"title_pkey": 1,
"kind_id_title": 1
}
}
Run an example workload
Next, we'll use Bao to execute a small sample workload. Then, we'll execute that same workload without Bao, and analyze the results.
The sample_queries
folder in the repository root contains 40 sample queries drawn from the original join order benchmark (JOB)1 and from extended JOB2:
$ cd sample_queries
$ ls
q10_2a265.sql q17_7a164.sql q23_19d.sql q29_6e.sql q36_7a136.sql q5_8a423.sql
q11_17e.sql q18_7a103.sql q24_32a.sql q30_18c.sql q37_2a1291.sql q6_16b.sql
q12_17a.sql q1_8a463.sql q25_13d.sql q31_2a39.sql q3_7a99.sql q7_7a48.sql
q13_7a121.sql q19_2a471.sql q26_2a274.sql q32_2a493.sql q38_2a1870.sql q8_6a505.sql
q14_6a349.sql q20_24b.sql q27_3c.sql q33_2a156.sql q39_2a2781.sql q9_5a48.sql
q15_18a.sql q21_2a396.sql q28_13a.sql q34_1a275.sql q40_2a8120.sql
q16_26c.sql q22_8a27.sql q2_8a82.sql q35_1a1508.sql q4_8a122.sql
The run_queries.py
script will execute a random workload with 500 queries drawn from these samples. First, 25 queries will be executed to provide some basic training data for Bao. Then, and for every 25 queries processed afterwards, the script will pause query execution to retrain Bao's model.
The run_queries.py
script assumes your DB is reachable on localhost
, with the username imdb
in the database imdb
. If this is not the case, modify the PG_CONNECTION_STR
variable at the top of the file.
Start the run:
$ python3 run_queries.py sample_queries/*.sql | tee ~/bao_run.txt
We use the tee
command to both show us the output and redirect the output to a file, which we analyze later. Grab a coffee, this run will take a while to finish (around 75 minutes on my hardware).
Next, once this run is finished, change the line in run_queries.py
:
USE_BAO = True
To:
USE_BAO = False
This will cause the run_queries.py
script to execute the exact same workload, but without using Bao to select query plans and without retraining a model every 25 queries. Start the run:
$ python3 run_queries.py sample_queries/*.sql | tee ~/pg_run.txt
... and grab another coffee. This took just under 3 hours on my hardware. The fact that the workload finishes faster with Bao enabled is already telling, but next we will analyze these two runs in detail.
Notes
Leis, Viktor, Andrey Gubichev, Atanas Mirchev, Peter Boncz, Alfons Kemper, and Thomas Neumann. “How Good Are Query Optimizers, Really?” PVLDB, VLDB ’15, 9, no. 3 (2015): 204–215. https://doi.org/10.14778/2850583.2850594.
Marcus, Ryan, Parimarjan Negi, Hongzi Mao, Chi Zhang, Mohammad Alizadeh, Tim Kraska, Olga Papaemmanouil, and Nesime Tatbul. “Neo: A Learned Query Optimizer.” PVLDB, VLDB ’19, 12, no. 11 (2019): 1705–18.
Analyzing the executions
With logs of our saved runs in bao_run.txt
and pg_run.txt
, we can next analyze them. If you are using the VM, you can move these files to /vagrant/
to access them from outside the VM.
To analyze the logs, run the analyze_bao.ipynb
Python notebook. Make sure bao_run.txt
and pg_run.txt
are in the same folder. You'll need Jupyter to run the notebook if you don't already have it.
The notebook will first generate a queries complete vs. time graph. Depending on your hardware, the results will vary. Below is what I saw on my hardware:
The graph shows the number of queries that have been completed since the start of the experiment, including the training time of Bao. Each red circle indicates a time when query execution was paused and the Bao model was trained. Obviously, when deployed, one does not have to pause query execution to retrain the Bao model, and this retraining can either be offloaded to another machine or done at a fixed time during the day, concurrent with query processing. On my hardware, Bao executed the example workload about twice as fast as the default PostgreSQL optimizer.
Where do Bao's gains come from? Finishing the workload faster could come from making every query a little faster, or making a few queries much faster. To examine this, we look at the CDF of query times, which is the next pair of graphs generated by the notebook.
The left figure shows the CDF with a linear y-axis, whereas the right figure shows the same CDF with a log scale. We can see that the majority of Bao's gains come from decreasing latency "at the tail" of the distribution: the 90%, 95%, etc. Bao incurs a slightly longer longest-running query (best visible on the left plot), and increases the query time of the fastest queries slightly (best visible on the right plot).
Bao increases the latency of the fastest query by a small margin because of increased optimization time. This is because query planning with Bao requires executing the PostgreSQL planner several times, and running inference through a neural network. If you care about a particular query taking 0.01 seconds instead of 0.05 seconds, you can always disable Bao on a per-query basis.
What about query regressions? The next table shows the latency of each query when using the PostgreSQL optimizer, and the worst / best time achieved by Bao.
PG | Bao worst | Bao best | |
---|---|---|---|
Q | |||
q1 | 275.415884 | 12.206382 | 6.005776 |
q2 | 71.049927 | 198.310226 | 9.242487 |
q3 | 10.982070 | 290.048801 | 10.805816 |
q4 | 26.890862 | 26.966064 | 1.527303 |
q5 | 9.692364 | 9.354480 | 1.350012 |
q6 | 21.741243 | 19.851484 | 7.341236 |
q7 | 51.935738 | 51.321676 | 7.288905 |
q8 | 28.725613 | 15.981973 | 5.995388 |
q9 | 15.645138 | 16.394102 | 7.327004 |
q10 | 11.720967 | 9.688347 | 7.373339 |
q11 | 15.163100 | 7.686548 | 5.853226 |
q12 | 12.934380 | 9.379889 | 4.565600 |
q13 | 18.687008 | 11.803825 | 3.417922 |
q14 | 11.100027 | 14.864732 | 7.060695 |
q15 | 9.641760 | 8.258874 | 4.153027 |
q16 | 5.312640 | 7.992982 | 1.221813 |
q17 | 6.404161 | 17.702658 | 5.868285 |
q18 | 11.912653 | 20.336241 | 6.772051 |
q19 | 9.943220 | 33.939818 | 10.330661 |
q20 | 0.143906 | 0.679753 | 0.344254 |
q21 | 1.022706 | 1.292618 | 0.921263 |
q22 | 16.113360 | 51.231996 | 8.196555 |
q23 | 12.050350 | 13.501636 | 7.194857 |
q24 | 0.025990 | 0.153196 | 0.100763 |
q25 | 3.906976 | 5.511800 | 2.178743 |
q26 | 10.439918 | 16.880665 | 7.772304 |
q27 | 0.759958 | 1.491062 | 0.461650 |
q28 | 1.784515 | 2.679448 | 1.671500 |
q29 | 0.279165 | 0.263327 | 0.113964 |
q30 | 6.967469 | 7.600260 | 5.197018 |
q31 | 1.877799 | 3.540210 | 1.459715 |
q32 | 0.981562 | 7.478595 | 0.731711 |
q33 | 2.215288 | 4.805330 | 1.749660 |
q34 | 5.175736 | 8.535833 | 2.692968 |
q35 | 6.402113 | 14.327323 | 5.302665 |
q36 | 11.992452 | 15.665057 | 8.864666 |
q37 | 12.208148 | 22.470210 | 9.716367 |
q38 | 13.334725 | 34.440121 | 10.314824 |
q39 | 8.051381 | 16.096334 | 8.621814 |
q40 | 14.709921 | 19.853354 | 11.598819 |
We can see immediately that Bao has huge gains on Q1, which takes almost 5 minutes for PostgreSQL to process. However, on Q2 and Q3, while Bao normally found a pretty good plan (the Bao best column), sometimes Bao picked a regressing plan that took significantly longer to execute than the PostgreSQL plan.
For some applications, the raw workload speedup achieved by Bao may be more important than any of these query regressions. For other applications, these query regressions may be a huge problem. Next, we discuss how to use Bao's exploration mode to avoid these regressions.
Use Exploration Mode
Bao is powered by reinforcement learning, and must therefore strike a balance between the exploration of new plans and the exploitation of plans that are known to be good. If you do no exploration, you'll never do any better than PostgreSQL -- if you do too much exploration, you'll suffer many query regressions.
Bao uses an algorithm called Thompson sampling, which has a theoretical guarantee about regret. Regret is the difference in performance between the plan chosen by Bao and the (unknown) best possible choice. Thompson sampling ensures that, in the long run, regret approaches zero.1 This is considered to be an optimal balance of exploration and exploitation.
However, sometimes "in the long run" isn't good enough for an application. For example, you may need to ensure that a particular query never regresses. In order to do this, Bao supports exploration mode, a special facility that lets Bao explore plans for specific queries offline, and then ensure the best plan is always chosen at runtime.
At a high level, exploration mode works as follows. You tell Bao about a particular SQL query that you never want to regress. Then, you give Bao a fixed period of time -- like 20 minutes -- to run experiments against your database. Bao uses this time (and no more) to run as many experiments as it can, saving the results. When a new model is trained, that model is checked to make sure it would make the right decision for each executed experiment. If the model would not, the model is retrained, with increased emphasis on those experiments.
We'll start from scratch, removing our previous Bao model and the experience we observed. Stop the Bao server (i.e., Control + C), then delete the model, Bao DB, and restart PostgreSQL:
$ rm -rf bao_server/bao_default_model
$ rm bao_server/bao.db
$ systemctl restart postgresql
Configure Bao to talk to PostgreSQL
Until now, the PostgreSQL extension has communicated with the Bao server, but the Bao server has never directly connected to the database. For exploration mode, we'll need such a connection. Edit the bao_server/bao.cfg
file to tell Bao how to connect to your PostgreSQL instance:
# ==============================================================
# EXPLORATION MODE SETTINGS
# ==============================================================
# maximum time a query should reasonably take (used in
# exploration mode).
MaxQueryTimeSeconds = 120
# psycopg2 / JDBC connection string to access PostgreSQL
# (used by the experiment runner to prevent regressions)
PostgreSQLConnectString = user=imdb
MaxQueryTimeSeconds
is an upper-bound on how long any non-regressed query plan you add to exploration mode should take. For this sample workload, 120 seconds was a reasonable value for this workload. Bao uses this value as a cutoff for its experiments, assuming any plan that takes longer than this amount of time must be a regression. Don't set this value too tightly, however, because Bao can still gain knowledge from observing how much a query plan regressed.PostgreSQLConnectString
is the JDBC-like string used by the Bao server to connect to the postgreSQL database. You can find documentation for it from the psycopg docs.
Testing the connection
To test the connection, we can use the baoctl.py
script. You can find this script in the bao_server
directory. It should be executed on the same machine as the Bao server is running.
Run baoctl.py --test-connection
to see if Bao can connect to your PostgreSQL instance:
$ python3 baoctl.py --test-connection
Connection successful!
Adding exploration queries
Once Bao can connect to our PostgreSQL instance, we can add exploration queries. Let's add the first three workload queries:
$ python3 baoctl.py --add-test-query ../sample_queries/q1_8a463.sql
Added new test query.
$ python3 baoctl.py --add-test-query ../sample_queries/q2_8a82.sql
Added new test query.
$ python3 baoctl.py --add-test-query ../sample_queries/q3_7a99.sql
Added new test query.
Start exploration
You can give Bao as much or as little time to execute experiments as you'd like. It is a good idea to provide at least a little longer than MaxQueryTimeSeconds
, to ensure that at least one experiment is fully executed.
Each query added creates 5 experiments, so we currently have 15 un-ran experiments. We can see this by running baoctl.py --status
:
$ python3 baoctl.py --status
Unexecuted experiments : 15
Completed experiments : 0
Exploration queries : 3
We can start executing these 15 experiments by running baoctl.py --experiment
. We'll give Bao 30 minutes to run these experiments so that they can all be finished:
$ python3 baoctl.py --experiment 1800
We have 15 unexecuted experiment(s).
Running on backend PID 57718
...
Finished all experiments
Note: sometimes, a particular configuration will cause the PostgreSQL session to crash. When this is the case, PostgreSQL automatically restarts / recovers. However, since this can be an expensive operation, Bao ends any experimentation when this occurs. Bao notes when a configuration causes such a crash, and will not execute it again. If this occurs during the tutorial, restart the exploration process until every experiment finishes. When this occurs, the output looks like this:
Time remaining: 1539696 ms
Server down after experiment with arm 1
Treating this as a timeout and ceasing further experiments.
Logged reward of 240000
Finished all experiments
Once these experiments finish, we can re-run the entire workload and analyze the performance:
$ python3 run_queries.py sample_queries/*.sql | tee ~/bao_with_regblock.txt
Grab yet another coffee. This one will be faster, but will still take some time.
Notes
Precisely, the gaurantee is that the limit as time tends to infinity is that regret will tend towards zero. This gaurantee makes several assumptions about the underlying reward function, which do not strictly apply to query optimization. Nevertheless, Thompson sampling appears to be an effective algorithm in this domain.
Analyze the workload with exploration mode
We'll use the same notebook as before to analyze our new bao_with_regblock.txt
results. You'll need to change the SHOW_RG = False
line in the 2nd cell to SHOW_RG = True
to plot both our previous run (without exploration mode) and our new run.
First, we'll look at queries completed vs. time.
The new green line shows the performance of Bao with our three test queries entered into experimental mode. In terms of overall workload performance, exploration mode doesn't help all tha tmuch: the workload finishes only a little bit faster.
Next, we'll look at the query latency CDFs.
The green line shows that tail latency has been significantly reduced, which is accounted for almost entirely by avoiding a few regressing query plans. We can verify this with the same table we looked at before:
PG | Bao worst | Bao best | Bao + E worst | Bao + E best | |
---|---|---|---|---|---|
Q | |||||
q1 | 275.415884 | 12.206382 | 6.005776 | 12.455495 | 6.148398 |
q2 | 71.049927 | 198.310226 | 9.242487 | 72.339166 | 10.068161 |
q3 | 10.982070 | 290.048801 | 10.805816 | 14.475681 | 5.646478 |
q4 | 26.890862 | 26.966064 | 1.527303 | 26.190447 | 1.468367 |
q5 | 9.692364 | 9.354480 | 1.350012 | 6.007022 | 1.319892 |
q6 | 21.741243 | 19.851484 | 7.341236 | 24.368830 | 8.481136 |
q7 | 51.935738 | 51.321676 | 7.288905 | 53.320758 | 8.071399 |
q8 | 28.725613 | 15.981973 | 5.995388 | 25.003634 | 5.759860 |
q9 | 15.645138 | 16.394102 | 7.327004 | 15.451688 | 7.699232 |
q10 | 11.720967 | 9.688347 | 7.373339 | 23.513964 | 7.451328 |
q11 | 15.163100 | 7.686548 | 5.853226 | 14.888963 | 10.835752 |
q12 | 12.934380 | 9.379889 | 4.565600 | 16.264112 | 4.621375 |
q13 | 18.687008 | 11.803825 | 3.417922 | 6.843567 | 3.754824 |
q14 | 11.100027 | 14.864732 | 7.060695 | 15.598768 | 7.015126 |
q15 | 9.641760 | 8.258874 | 4.153027 | 14.213644 | 3.983182 |
q16 | 5.312640 | 7.992982 | 1.221813 | 6.572023 | 1.271791 |
q17 | 6.404161 | 17.702658 | 5.868285 | 18.423400 | 5.759903 |
q18 | 11.912653 | 20.336241 | 6.772051 | 21.853793 | 6.889149 |
q19 | 9.943220 | 33.939818 | 10.330661 | 21.690789 | 10.453242 |
q20 | 0.143906 | 0.679753 | 0.344254 | 0.460239 | 0.357460 |
q21 | 1.022706 | 1.292618 | 0.921263 | 1.653735 | 0.884557 |
q22 | 16.113360 | 51.231996 | 8.196555 | 51.013479 | 7.209244 |
q23 | 12.050350 | 13.501636 | 7.194857 | 12.140767 | 7.034203 |
q24 | 0.025990 | 0.153196 | 0.100763 | 0.181307 | 0.107129 |
q25 | 3.906976 | 5.511800 | 2.178743 | 3.643400 | 2.291787 |
q26 | 10.439918 | 16.880665 | 7.772304 | 12.206994 | 6.855079 |
q27 | 0.759958 | 1.491062 | 0.461650 | 1.744088 | 0.467688 |
q28 | 1.784515 | 2.679448 | 1.671500 | 2.584871 | 1.753601 |
q29 | 0.279165 | 0.263327 | 0.113964 | 0.330460 | 0.128979 |
q30 | 6.967469 | 7.600260 | 5.197018 | 9.345165 | 5.021910 |
q31 | 1.877799 | 3.540210 | 1.459715 | 3.384164 | 1.439300 |
q32 | 0.981562 | 7.478595 | 0.731711 | 1.171264 | 0.626129 |
q33 | 2.215288 | 4.805330 | 1.749660 | 4.287353 | 1.832341 |
q34 | 5.175736 | 8.535833 | 2.692968 | 8.329981 | 2.556046 |
q35 | 6.402113 | 14.327323 | 5.302665 | 10.737619 | 5.267236 |
q36 | 11.992452 | 15.665057 | 8.864666 | 30.359994 | 10.390017 |
q37 | 12.208148 | 22.470210 | 9.716367 | 20.191838 | 10.248038 |
q38 | 13.334725 | 34.440121 | 10.314824 | 17.423221 | 10.154855 |
q39 | 8.051381 | 16.096334 | 8.621814 | 19.279665 | 8.618636 |
q40 | 14.709921 | 19.853354 | 11.598819 | 73.711118 | 10.908722 |
The first column shows the latency from the query plans produced by the PostgreSQL optimizer. The next two columns show the latency from the query plans produced Bao optimizer. The final two columns show our new results, the latency from the query plans produced by the Bao optimizer with exploration mode.
The large regressions on query 2 and 3 are eliminated, with both having a much more reasonble worst case time.
Bao as an advisor
For some applications, any amount of exploration -- and thus regression -- on any query is unacceptable. In these scenarios, it is possible to use Bao as an advisor instead of a full-blown optimizer.
To demonstrate this, let's use a simple psql
session.
$ psql -U imdb -h localhost
psql (12.2, server 12.3)
Type "help" for help.
imdb=#
Bao can be controlled through three main session-level PostgreSQL configs.
enable_bao
is the top level config. When set tooff
, the default, Bao does not observe or interfere with the query optimizer at all. When set toon
, Bao will behave according to the values ofenable_bao_rewards
andenable_bao_selection
.enable_bao_rewards
determines whether or not Bao collects additional experience from queries from this session.enable_bao_selection
determines whether or not Bao will use its value model to select query plans.
To use Bao as a pure advisor, we can set enable_bao
but disable enable_bao_rewards
and enable_bao_selection
.
imdb=# SET enable_bao TO on;
SET
imdb=# SET enable_bao_selection TO off;
SET
imdb=# SET enable_bao_rewards TO off;
SET
Next, we'll execute a simple EXPLAIN
statement:
imdb=# EXPLAIN SELECT count(*) FROM title;
QUERY PLAN
------------------------------------------------------------------------------------------
Bao prediction: 894.349 ms
Bao recommended hint: (no hint)
Finalize Aggregate (cost=50165.40..50165.41 rows=1 width=8)
-> Gather (cost=50165.19..50165.40 rows=2 width=8)
Workers Planned: 2
-> Partial Aggregate (cost=49165.19..49165.20 rows=1 width=8)
-> Parallel Seq Scan on title (cost=0.00..46531.75 rows=1053375 width=0)
(7 rows)
imdb=#
Since enable_bao_selection
is off, this plan is generated using the PostgreSQL optimizer, exactly as it would be if you were not using Bao. Two additional lines are added to the outputof the EXPLAIN
plan:
- Bao prediction shows the time that Bao thinks this query plan will take to execute. In this case, about a second.
- Bao recommended hint shows the query hint that Bao would use if
enable_bao_selection
was on. In this case, Bao would not use any query hints.
Let's execute the query and run the same EXPLAIN
statement again:
imdb=# SELECT count(*) FROM title;
count
---------
2528312
(1 row)
imdb=# EXPLAIN SELECT count(*) FROM title;
QUERY PLAN
------------------------------------------------------------------------------------------
Bao prediction: 661.193 ms
Bao recommended hint: (no hint)
Finalize Aggregate (cost=50165.40..50165.41 rows=1 width=8)
-> Gather (cost=50165.19..50165.40 rows=2 width=8)
Workers Planned: 2
-> Partial Aggregate (cost=49165.19..49165.20 rows=1 width=8)
-> Parallel Seq Scan on title (cost=0.00..46531.75 rows=1053375 width=0)
(7 rows)
Bao's prediction for the query changed, even though the query plan, cost estimates, and cardinality estimates are all the same as before! This is because the buffer pool has changed states: after the execution of the SELECT
query, more data relevant to this query has been cached, so Bao predicts that it will execute faster.
Of course, predictions are exactly that -- predictions. While Bao's value model should get better over time, these predictions should be used as advice, and have no bounds whatsoever (although Bao will never predict a negative query runtime).
Let's look at q1
from our sample workload. You can copy and paste the below statement to see the EXPLAIN
output.
EXPLAIN SELECT COUNT(*) FROM title as t, kind_type as kt, info_type as it1, movie_info as mi1, cast_info as ci, role_type as rt, name as n, movie_keyword as mk, keyword as k, movie_companies as mc, company_type as ct, company_name as cn WHERE t.id = ci.movie_id AND t.id = mc.movie_id AND t.id = mi1.movie_id AND t.id = mk.movie_id AND mc.company_type_id = ct.id AND mc.company_id = cn.id AND k.id = mk.keyword_id AND mi1.info_type_id = it1.id AND t.kind_id = kt.id AND ci.person_id = n.id AND ci.role_id = rt.id AND (it1.id IN ('7')) AND (mi1.info in ('MET:','OFM:35 mm','PCS:Digital Intermediate','PFM:35 mm','PFM:Video','RAT:1.33 : 1','RAT:1.37 : 1')) AND (kt.kind in ('episode','movie','tv movie')) AND (rt.role in ('actor','actress')) AND (n.gender in ('f','m') OR n.gender IS NULL) AND (n.name_pcode_cf in ('A5362','J5252','R1632','R2632','W4525')) AND (t.production_year <= 2015) AND (t.production_year >= 1925) AND (cn.name in ('Fox Network','Independent Television (ITV)','Metro-Goldwyn-Mayer (MGM)','National Broadcasting Company (NBC)','Paramount Pictures','Shout! Factory','Sony Pictures Home Entertainment','Universal Pictures','Universal TV')) AND (ct.kind in ('distributors','production companies'));
This should result in something like this:
imdb=# EXPLAIN SELECT COUNT(*) FROM title as t, kind_type as kt, info_type as it1, movie_info as mi1, cast_info as ci, role_type as rt, name as n, movie_keyword as mk, keyword as k, movie_companies as mc, company_type as ct, company_name as cn WHERE t.id = ci.movie_id AND t.id = mc.movie_id AND t.id = mi1.movie_id AND t.id = mk.movie_id AND mc.company_type_id = ct.id AND mc.company_id = cn.id AND k.id = mk.keyword_id AND mi1.info_type_id = it1.id AND t.kind_id = kt.id AND ci.person_id = n.id AND ci.role_id = rt.id AND (it1.id IN ('7')) AND (mi1.info in ('MET:','OFM:35 mm','PCS:Digital Intermediate','PFM:35 mm','PFM:Video','RAT:1.33 : 1','RAT:1.37 : 1')) AND (kt.kind in ('episode','movie','tv movie')) AND (rt.role in ('actor','actress')) AND (n.gender in ('f','m') OR n.gender IS NULL) AND (n.name_pcode_cf in ('A5362','J5252','R1632','R2632','W4525')) AND (t.production_year <= 2015) AND (t.production_year >= 1925) AND (cn.name in ('Fox Network','Independent Television (ITV)','Metro-Goldwyn-Mayer (MGM)','National Broadcasting Company (NBC)','Paramount Pictures','Shout! Factory','Sony Pictures Home Entertainment','Universal Pictures','Universal TV')) AND (ct.kind in ('distributors','production companies'));
QUERY PLAN
----------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------
Bao prediction: 74053.945 ms
Bao recommended hint: SET enable_nestloop TO off;
Aggregate (cost=9644.44..9644.45 rows=1 width=8)
-> Gather (cost=1006.53..9644.44 rows=1 width=0)
Workers Planned: 1
-> Nested Loop (cost=6.53..8644.34 rows=1 width=0)
-> Hash Join (cost=6.10..8643.30 rows=1 width=4)
...
Here, Bao thinks the query plan generated by PostgreSQL will take a little over a minute to execute. Bao recommends a hint to disable nested loop joins for this query. Let's take the hint, and re-run the EXPLAIN:
imdb=# SET enable_nestloop TO off;
SET
imdb=# EXPLAIN SELECT COUNT(*) FROM title as t, kind_type as kt, info_type as it1, movie_info as mi1, cast_info as ci, role_type as rt, name as n, movie_keyword as mk, keyword as k, movie_companies as mc, company_type as ct, company_name as cn WHERE t.id = ci.movie_id AND t.id = mc.movie_id AND t.id = mi1.movie_id AND t.id = mk.movie_id AND mc.company_type_id = ct.id AND mc.company_id = cn.id AND k.id = mk.keyword_id AND mi1.info_type_id = it1.id AND t.kind_id = kt.id AND ci.person_id = n.id AND ci.role_id = rt.id AND (it1.id IN ('7')) AND (mi1.info in ('MET:','OFM:35 mm','PCS:Digital Intermediate','PFM:35 mm','PFM:Video','RAT:1.33 : 1','RAT:1.37 : 1')) AND (kt.kind in ('episode','movie','tv movie')) AND (rt.role in ('actor','actress')) AND (n.gender in ('f','m') OR n.gender IS NULL) AND (n.name_pcode_cf in ('A5362','J5252','R1632','R2632','W4525')) AND (t.production_year <= 2015) AND (t.production_year >= 1925) AND (cn.name in ('Fox Network','Independent Television (ITV)','Metro-Goldwyn-Mayer (MGM)','National Broadcasting Company (NBC)','Paramount Pictures','Shout! Factory','Sony Pictures Home Entertainment','Universal Pictures','Universal TV')) AND (ct.kind in ('distributors','production companies'));
QUERY PLAN
----------------------------------------------------------------------------------
Bao prediction: 15032.299 ms
Bao recommended hint: SET enable_nestloop TO off;
Aggregate (cost=10000977592.82..10000977592.83 rows=1 width=8)
-> Nested Loop (cost=10000683246.19..10000977592.82 rows=1 width=0)
-> Seq Scan on info_type it1 (cost=0.00..2.41 rows=1 width=4)
Filter: (id = 7)
...
For the new query plan, Bao predicts a time of only 15 seconds. If you'd like, you can execute both query plans to measure the quality of Bao's predictions.
Keep in mind that using Bao as we've configured it here won't let the value model get any better. In order to improve its model, Bao needs experience. For any given session, you can allow Bao to collect experience and thus improve its model, but still prevent Bao from modifying any query plans. To do this, just set enable_bao
and enable_bao_rewards
to ON
, but set enable_bao_selection
to OFF
.
Final Notes
This tutorial was intended to give you an idea for what Bao could do and how the various components of Bao work. However, please note that the gains we saw in this tutorial might not be reflected in your real workload. Here's a few things to keep in mind:
- This workload was artifically constructed to represent a mixture of queries where the PostgreSQL optimizer found the best plan, and where PostgreSQL found a terrible plan. Most queries fall somewhere in between these two extremes. Check out the Bao paper for a more detailed analysis on more realistic workloads.
- Bao's query optimizer has overhead. For queries that run very quickly (under 500ms), Bao is unlikely to be helpful, and may just slow down these queries with additional optimization time.
- Bao's value model needs to be retrained and verified, beyond what exploration mode could do. Most of the time, the verification done by Bao currently is sufficient. If training goes awry, Bao always saves the previous model. Training could easily be configured with a
cron
job. We're currently working on more advanced mechanisms for this, but currenlty it has to be done manually.
PostgreSQL Configuration Variables
Bao can be configured through a number of session level PostgreSQL configuration variables. These variables are set to their default values every time you open a session (e.g., psql
session or a connection from an application).
Variable | Default | Description |
---|---|---|
pg_bao.enable_bao |
OFF | Top-level switch for all Bao functions. When disabled, Bao operates purely in advisor mode. |
pg_bao.enable_bao_rewards |
ON | If this and enable_bao are on, Bao will report and record rewards from queries executed in this session. |
pg_bao.enable_bao_selection |
ON | If this and enable_bao are on, Bao will use its value model to select query plans at optimization time. |
pg_bao.bao_host |
localhost | Host where the Bao server is running. Can be changed to put the Bao server on a different machine than PostgreSQL. |
pg_bao.bao_port |
9381 | Port to access the Bao server. If you change this, you may also need to change the port the Bao server listens on. |
pg_bao.bao_num_arms |
5 | The number of arms Bao should examine during query optimization. Lower values decrease optimization time, higher values may lead to better query plans. |
pg_bao.bao_include_json_in_explain |
OFF | Includes the JSON of the Bao plan in the output of EXPLAIN. Used internally. |
More information about each variable can be found in the pg_settings
internal table:
SELECT name, extra_desc FROM pg_settings WHERE name LIKE '%bao%';
Bao Server Configuration Variables
The Bao sever is configured through the bao.cfg
file in the bao_server
directory. The default configuration file, reproduced below, contains a description of each variable.
[bao]
# ==============================================================
# BAO SERVER SETTINGS
# ==============================================================
# port to listen on. Note that the corresponding PostgreSQL
# variable, bao_port, must be set to match.
Port = 9381
# network address to listen on. If not localhost, don't forget
# to set the PostgreSQL bao_host variable.
ListenOn = localhost
# ==============================================================
# EXPLORATION MODE SETTINGS
# ==============================================================
# maximum time a query should reasonably take (used in
# exploration mode).
MaxQueryTimeSeconds = 120
# psycopg2 / JDBC connection string to access PostgreSQL
# (used by the experiment runner to prevent regressions)
PostgreSQLConnectString = user=imdb